Summary of Research
The use of quantitative modelling techniques in Computer Science has grown
rapidly in the last ten years, largely thanks to stochastic extensions
of many traditional formalisms, such as process algebra and Petri nets.
In particular, the Performance Evaluation Process Algebra (PEPA) [1]
has been highly successful, due to its compositionality, straightforward
semantics and large tool-base. However, the current approaches to formal
analyses of large-scale distributed systems all involve direct use of
such abstract modelling languages. This leads to the question of how
to relate such models to the implementations of the systems they concern.
In particular, we wish to investigate techniques for analysing the program
code directly; in other words, the ability to automatically derive a
stochastic model from the code.
In the
world of qualitative model checking, there have been great advances
to this end. The work on SLAM [2] at Microsoft Research is a prime example
of this, leading to the release of the Static Driver Verifier as a practical
tool. Using predicate abstraction, they produce a highly abstract model,
which is then iteratively refined with the aid of counter-examples from
the concrete program. This works because the properties concern just
possibilities of certain behaviours, rather than probabilities,
or durations. The challenge we face is to develop corresponding techniques
that can be applied in a quantitative setting.
Whilst
our immediate goal is to derive a PEPA model that represents the performance
characteristics of a program, we must stress that this is not the ultimate
aim of the research (although it is an essential part of it). A model
is only as good as the analysis it can undergo, and its structure will
vary depending on the properties we are interested in. We must therefore
be careful to guide the model extraction with respect to its intended
analysis, and also relate the results of this analysis back to the original
program. Bearing this in mind, we can divide the process into four distinct
phases; namely instrumentation, abstraction, refinement and analysis.
To begin with, there is too little information in a program to decide the purpose of the model. Therefore we must start by instrumenting the code, annotating the important elements of its state so that we can refer to them in the model. We can then compute relevant performance properties by assigning rewards to these states. Furthermore, we need an abstract interface for our code, to avoid analysing unnecessary details such as library and system calls. It is therefore vital for us to develop an appropriate framework for these annotations.
The biggest
challenge when dealing with real code is how to manage the size of its
state space. To do this, we need to develop abstractions with
which we can calculate the probabilities of control-flow decisions (branches).
In particular, it is vital that we can bound the error in our approximations.
Predicate abstraction is useful for handling discrete state elements
(such as binary flags, or enumeration types), but not in general for
integer variables. On the other hand, certain classes of variable, such
as loop counters, are readily modelled using intervals. Determining
the most appropriate abstraction for each variable, and reasoning about
the errors involved, is one of the most important challenges we face.
Control
flow abstraction is similarly important, for which static analysis techniques
are required. In particular, great care must be taken when assuming
independence between variables. At present, path-based abstractions
look to be promising, but we must investigate this more thoroughly.
There is some (though surprisingly little) relevant work from the branch
prediction community, such as value-range propagation [3], which
provides a good starting point for deriving intervals for local variables.
Unlike
SLAM, we have no obvious way in which to add more detail to an over-simplified
model, since altering any one probability (or rate) affects the entire
model’s behaviour. Instead, we need to take the opposite view of
refinement, by increasingly simplifying a detailed initial model.
The only way to handle such a large model to begin with is by exploiting
its structure. Remaining in the process algebra representation (and
so avoiding expansion of cooperation points), we have a number of potential
techniques for refinement. Structurally, we can split the model at interaction
points, using response time approximation [4] to analyse one
side at a time. Behaviourally, we can look for quasi-lumpable
states, or perform time scale decomposition [4] to separately
treat fast and slow activity rates. We need to investigate the utility
of these methods, and if necessary develop more appropriate refinement
techniques.
One important
advantage of using PEPA as our modelling language is its alternative
semantics. Traditionally, a PEPA process maps onto a Continuous Time
Markov Chain (CTMC), but the size of this grows exponentially in the
number of concurrent processes. More recent work, in [5], describes
how to map a process onto a set of ordinary differential equations (ODEs).
This avoids the state space explosion problem by recording just the
number of processes in each local state, rather than the detailed
state of the entire system. This is particularly useful for analysing
the large-scale behaviour when many components interact.
Since
we are interested in application domains such as communications protocols,
it is important to deal with real-world programming languages such as
C. Thankfully, there exist tools such as CCured [6] and the C Intermediate
Language (CIL) [7], which provide a type-safe framework for analysis.
The success
of this research would lead to a great many applications. A practical
tool would considerably widen the application of performance analysis
techniques, and could significantly improve upon current methods for
non-functional testing. Indeed, the ability to track performance criteria
from the design stage through the entire development, without having
to maintain a separate model, would be invaluable. Furthermore, the
use of performance contracts is of great interest to many application
domains, such as web services, which would welcome a tool for deriving
the resource-usage distributions of code in different contexts.
There
is certainly a significant gap in the current research, which we hope
this work can fill. We believe that now is the time to bring existing
work together, to allow quantitative, and not just qualitative, model
checking of real code.
[1] J.
Hillston. A Compositional Approach to Performance Modelling.
Cambridge University Press, 1996.
[2] T.
Ball and S. K. Rajamani. The SLAM toolkit. In Proceedings of CAV
2001.
[3] J.
R. C. Patterson. Accurate static branch prediction by value range propagation.
In SIGPLAN Conference on Programming Language Design and Implementation,
1995.
[4] V.
Mertsiotakis. Approximate Analysis
Methods for Stochastic Process Algebras. PhD thesis, Institut für
Mathematische Maschinen und Datenverarbeitung, 1998.
[5] J.
Hillston. Fluid flow approximation of PEPA models. In Proceedings
of the Second International Conference on the Quantitative Evaluation
of Systems, 2005.
[6] G.
C. Necula, S. McPeak and W. Weimer. CCured: Type-safe retrofitting of
legacy code. In Proceedings of POPL 2002.
[7] G. C. Necula, S. McPeak, S. P. Rahul and W. Weimer. CIL: Intermediate language and tools for analysis and transformation of C programs. In Proceedings of Conference on Compiler Construction, 2002.