Outline:
- Reliability specifications
- Reliability growth models
- Role of the software development process in reliability
- Factors influencing reliability
- Brief introduction to tools for measuring reliability
Definition: Software reliability is the probability of failure-free operation of a computer program in a specified environment, for a specified time, for a given purpose. It differs from hardware reliability in that it reflects design perfection rather than manufacturing perfection, and it is not a direct function of time. Informally, reliability is a measure of how well system users think the system provides the services they require; it speaks to a product's trustworthiness and dependability.
Why measure the reliability of a product?
- To measure the reliability of the product exactly
- To determine whether the software can be released
- To predict the resources required to bring the software to the required reliability
- To determine the impact of insufficient resources on the operational profile
- To prioritize the testing/inspection of the modules having the highest estimated fault content
- To develop fault-avoidance techniques that minimize the number of faults inserted and prevent the insertion of specific types of faults
POFOD (Probability of Failure on Demand): a measure of the likelihood that the system will fail when a service request is made. POFOD = 0.001 means 1 out of 1000 service requests results in failure. Relevant for safety-critical or non-stop systems.
ROCOF (Rate of Occurrence of Failure): the frequency of occurrence of unexpected behaviour. ROCOF = 0.02 means 2 failures are likely in each 100 operational time units. Relevant for operating systems and transaction-processing systems.
MTTF (Mean Time to Failure): a measure of the time between observed failures. MTTF = 500 means that the average time between failures is 500 time units. Relevant for systems with long transactions, e.g. CAD systems.
AVAIL (Availability): a measure of how likely the system is to be available for use. Availability of 0.998 means the software is available for 998 out of every 1000 time units. Relevant for continuously running systems, e.g. telephone switching systems.
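As a quick illustration, the four metrics can be computed from operational data; the helper functions below are a sketch, and the numbers passed to them mirror the examples in the text rather than any real system.

```python
# Sketch: computing the four reliability metrics from operational data.

def pofod(failed_requests, total_requests):
    """Probability of failure on demand."""
    return failed_requests / total_requests

def rocof(num_failures, operational_time_units):
    """Rate of occurrence of failures."""
    return num_failures / operational_time_units

def mttf(interfailure_times):
    """Mean time to failure: average of the observed inter-failure times."""
    return sum(interfailure_times) / len(interfailure_times)

def availability(uptime, total_time):
    """Fraction of time the system is available for use."""
    return uptime / total_time

print(pofod(1, 1000))           # 0.001: 1 failure per 1000 service requests
print(rocof(2, 100))            # 0.02: 2 failures per 100 operational time units
print(mttf([450, 500, 550]))    # 500.0 time units between failures
print(availability(998, 1000))  # 0.998: available 998 of every 1000 time units
```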
Failure classification:
- Transient: occurs only with certain inputs
- Permanent: occurs with all inputs
- Recoverable: the system can recover without operator intervention
- Unrecoverable: operator intervention is needed to recover from the failure
- Non-corrupting: the failure does not corrupt system state or data
- Corrupting: the failure corrupts system state or data
(Some failures produce no erratic results, but usage may be hobbled.)
Steps for reliability specification:
- Identify the different types of system failure and analyse the consequences of each.
- Partition failures into appropriate classes.
- Define the reliability requirement for each class using an appropriate reliability metric.
Choosing a metric for reliability specifications:
- POFOD: for systems where critical services are demanded in an unpredictable way, or where there is a long time interval between consecutive requests
- ROCOF: for systems where (critical) services are demanded in a more regular way
- MTTF: for systems involving long transactions, during which a guarantee of service continuity and delivery should be expected
- AVAIL: for systems where continuous service delivery is a major concern
Example reliability specification: for one failure class, a POFOD of 1 in 1000 transactions; for the most severe class the requirement is effectively unquantifiable: the failure should never happen in the lifetime of the system.
Statistical testing: the objective is to determine reliability rather than to discover errors. Users may find the system more reliable than it really is! Statistical testing combined with reliability growth models helps in assessing and predicting reliability.
Steps in statistical testing:
- Determine the operational profile of the software: this can be determined by analysing the usage pattern.
- Manually select or automatically generate a set of test data corresponding to the operational profile.
- Apply the test cases to the program, recording the execution time between each failure (it may not be appropriate to use raw execution time).
- After a statistically significant number of failures have been observed, compute the reliability.
Statistical testing is not easy to apply, for the following reasons:
- Operational profile uncertainty: this is a particular problem for new systems with no operational history.
- High cost of operational profile generation: costs depend heavily on what usage information is collected by the organisation that requires the profile.
- Statistical uncertainty when high reliability is specified: it is difficult to estimate the level of confidence in the operational profile, and the usage pattern of the software may change with time.
Basics of reliability theory: assume an unbounded probability density function (pdf) that reflects the idea that failures happen purely at random. With a constant failure rate λ, f(t) = λ·e^(-λt).
The distribution function F(t) is the probability of failure in the time interval [0, t], stated as: F(t) = 1 - e^(-λt).
From this we define the reliability function R(t) = 1 - F(t) = e^(-λt). The mean time to failure (MTTF) is the mean of the probability density function; for the pdf f(t) above, MTTF = integral from 0 to infinity of t·f(t) dt = 1/λ. This can also be used to predict the time of the i-th failure.
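A short numeric sketch of these formulas; the failure rate λ = 0.01 below is an assumed illustrative value.

```python
import math

lam = 0.01  # assumed constant failure rate (failures per time unit)

def f(t):    # pdf: f(t) = lam * e^(-lam*t)
    return lam * math.exp(-lam * t)

def F(t):    # probability of failure in [0, t]
    return 1 - math.exp(-lam * t)

def R(t):    # reliability function: R(t) = 1 - F(t)
    return math.exp(-lam * t)

mttf = 1 / lam  # mean of the pdf: MTTF = 1/lam

print(R(100))  # probability of surviving 100 time units
print(mttf)    # MTTF = 1/lam = 100 time units
```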
Once a component fails it must be repaired, so it is also useful to know the mean time to repair (MTTR) for the component that has failed. Availability is the probability that a component is operating at a given point in time. Pressman defines availability as: Availability = MTTF / (MTTF + MTTR) * 100.
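Pressman's formula in code; the MTTF and MTTR values passed in are made-up illustrative numbers.

```python
# Availability as a percentage, per the formula above.
def availability(mttf, mttr):
    return mttf / (mttf + mttr) * 100

# Illustrative values: a failure every 500 hours on average,
# 2 hours to repair each one.
print(availability(500, 2))  # roughly 99.6 percent
```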
Software reliability models (SRMs) predict the number of failures that may be encountered after the software has shipped, and thus indicate whether the software is ready to ship. SRMs thereby aid the planning and decision-making process.
Two types of software reliability models:
- Defect density models: use design parameters such as LOC, loop nesting, external references, etc.
- Software reliability growth models: attempt to statistically correlate defect-detection data with known functions, such as an exponential function.
Reliability growth models assume that reliability improves as errors are detected and repaired. Different software reliability growth models differ in how they predict the error rate, and thereby the reliability.
These models attempt to statistically correlate defect-detection data with known functions, such as an exponential function. If the correlation is good, the known function can be used to predict future behaviour. Two types of data are relevant for software reliability growth models:
- the time at which each defect was discovered
- the number of defects discovered
Execution (CPU) time is the best measure of the amount of testing; using calendar time or the number of test cases to measure the amount of testing has not given credible results.
Fitting a model to the failure data means estimating its parameters from the data. Parameter estimation can be done using maximum likelihood estimation.
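For instance, if inter-failure times are modelled as exponentially distributed, the maximum-likelihood estimate of the failure rate has a closed form; the data below are invented for illustration.

```python
# MLE sketch for an exponential inter-failure-time model:
# lambda_hat = n / (sum of observed inter-failure times).
times = [120.0, 80.0, 200.0, 150.0, 95.0]  # invented observations

lam_hat = len(times) / sum(times)   # estimated failure rate
mttf_hat = sum(times) / len(times)  # sample mean, equal to 1 / lam_hat

print(lam_hat, mttf_hat)  # estimated rate per time unit, and MTTF of 129.0
```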
During the test, the model should predict the additional test effort required to achieve a quality level (as measured by number of remaining defects) that we deem suitable for customer use. The model must become stable during the test period, i.e., the predicted number of total defects should not vary significantly from week to week. The model must provide a reasonably accurate prediction of the number of defects that will be discovered in field use.
Jelinski-Moranda model: assumes that at time 0 the software contains a fixed (and finite) number N of bugs, and that failures follow a Poisson process with a rate λ(t) that may vary with time. For some constant c, λ(t) decreases by c whenever an error occurs and the bug that caused it is corrected, and is constant between errors; after i bugs have been corrected, λ(t) = c·(N - i).
The reliability at time τ, i.e. the probability of an error-free interval: given that i errors have been corrected by time τ, the conditional probability that the following interval of length t, namely [τ, τ + t], will be error-free is R(t) = e^(-c·(N - i)·t).
As the software runs for longer and longer, more bugs are caught and purged from the system, and so the error rate declines and the future reliability increases
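A minimal sketch of the model; the values of N and c below are assumed for illustration.

```python
import math

N = 100   # assumed number of bugs present at time 0
c = 0.01  # assumed failure-rate contribution of each remaining bug

def failure_rate(i):
    """Error rate after i bugs have been found and corrected."""
    return c * (N - i)

def reliability(i, t):
    """Probability that the interval of length t after the i-th fix is error-free."""
    return math.exp(-failure_rate(i) * t)

print(failure_rate(0))        # error rate before any fixes
print(reliability(90, 10.0))  # reliability over 10 time units after 90 fixes
```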
Musa-Okumoto model: assumes an infinite (or at least very large) number of initial bugs in the software. Define M(t) as the number of bugs discovered and corrected during [0, t], λ0 as the error rate at time 0, C as a constant failure-rate decay parameter, and μ(t) as the expected number of errors experienced during [0, t] (μ(t) = E[M(t)]). Under this model, the error rate after testing for a length of time t is λ(t) = λ0·e^(-C·μ(t)) = λ0 / (λ0·C·t + 1), where μ(t) = (1/C)·ln(λ0·C·t + 1).
In the Musa-Okumoto model the error rate decays only slowly with time, so getting the error rate of the software down to a sufficiently low point clearly requires a significant amount of testing.
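A sketch of the Musa-Okumoto rate and mean-value functions; λ0 and C below are assumed illustrative values.

```python
import math

lam0 = 2.0  # assumed error rate at time 0
C = 0.05    # assumed failure-rate decay parameter

def mu(t):
    """Expected number of errors experienced during [0, t]."""
    return math.log(lam0 * C * t + 1) / C

def lam(t):
    """Error rate after testing for a length of time t."""
    return lam0 / (lam0 * C * t + 1)

# The rate decays like 1/t, so driving it down takes a lot of testing time.
print(lam(0), lam(100), lam(10000))
```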
Structural models: the program is exercised with a typical set of input data from the user environment, and the reliability of the program is expressed as a function of the reliabilities of its components.
Working:
- The program is modelled as a graph.
- There is one entry node and one exit node.
- Every transition from node Ni to node Nj has a probability Pij.
[Figure: control-flow graph with input node N1, internal nodes N2 and N3, output node N4, and transition probabilities Pij, e.g. P12, P13, P14, P21, P24, P31, P34.]
Two absorbing states are added: C, in which the program returns the correct output, and F, reached if any module has a fault, in which the program does not return the correct output.
[Figure: the same graph augmented with per-node reliabilities Ri: each transition Ni to Nj is taken with probability Ri·Pij, each node Ni fails to F with probability 1 - Ri, and N4 reaches C with probability R4.]
Assume the system enters one of the absorbing states j in {C, F} at or before the n-th step, and let Pn(i, j) be the probability that the system goes from starting state i to state j within n steps. The reliability of the system is the probability that it reaches state C when starting at node N1: R = Pn(N1, C) as n grows large.
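A numeric sketch of this computation. The graph shape follows the figure, but every transition probability and per-node reliability below is a made-up illustrative value.

```python
# Absorption-probability sketch for the structural model: states C and F
# are absorbing; each internal node either fails (to F) or hands control on.
def reliability(P, start="N1", steps=200):
    """Probability of having been absorbed in C within `steps` transitions."""
    dist = {s: 0.0 for s in P}
    dist[start] = 1.0
    for _ in range(steps):
        nxt = {s: 0.0 for s in P}
        for s, p in dist.items():
            for t, q in P[s].items():
                nxt[t] += p * q
        dist = nxt
    return dist["C"]

# Illustrative chain: R1=0.9, R2=R3=0.95, R4=0.99; P12=P13=0.5, P24=P34=1.
P = {
    "N1": {"N2": 0.9 * 0.5, "N3": 0.9 * 0.5, "F": 0.1},
    "N2": {"N4": 0.95, "F": 0.05},
    "N3": {"N4": 0.95, "F": 0.05},
    "N4": {"C": 0.99, "F": 0.01},
    "C":  {"C": 1.0},  # absorbing: correct output
    "F":  {"F": 1.0},  # absorbing: failure
}
print(reliability(P))  # equals 0.9 * 0.95 * 0.99, about 0.846
```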
Assumptions (and limitations) of these reliability models:
- Definition and criticality of failures
- Fixed number of faults
- All faults have the same failure rate
- All software faults are always exposed
- Testing is homogeneous
- Failure rate is proportional (only) to error content
- Numbers of failures in disjoint intervals are independent
Experiments revealed that about 70% of software faults result from problems introduced during the requirements-definition and design phases. The software development process consists of five phases: analysis, design, coding, testing, and operation.
Analysis:
- Better training, software tools, and management of programmer time, resources, and staff.
- Cost-effectiveness.
All of the above have led to very significant improvements in reliability for most software. Reliability in a software system can be achieved using two strategies: fault avoidance and fault tolerance.
Fault avoidance relies on many factors:
- Availability of a precise (preferably formal) system specification.
- Adoption of an approach to software design and implementation based on information hiding and encapsulation.
- Use of a strongly typed programming language.
- Restrictions on the use of programming constructs such as pointers, GOTO statements, dynamic memory allocation, recursion, and interrupts.
Fault tolerance: faults may remain in the system, so facilities are provided in the software to allow operation to continue when these faults cause system failures. Software fault-tolerance techniques help mitigate the impact of software defects (bugs). Techniques include:
- Single-version fault tolerance
- Software rejuvenation
- N-version programming
- Recovery-block approach
- Exception handling
In this section, we consider ways by which individual pieces of software can be made more robust. A wrapper is a piece of software that encapsulates a given program while it is being executed (the wrapper software surrounds the wrapped entity). Examples of their use:
- Dealing with buffer overflow
- Checking the correctness of the scheduler
- Using a wrapper to check for correct output
Software rejuvenation. Definition: proactively halting the process, cleaning up its internal state, and then restarting it. Rejuvenation levels: one can rejuvenate at either the application level or the processor level.
To decide when rejuvenation should be carried out, define:
- N(t): expected number of errors over an interval of length t (without rejuvenation)
- Ce: cost of each error
- Cr: cost of each rejuvenation
- P: inter-rejuvenation period
The cost of rejuvenation over a period P is Crejuv(P) = N(P)·Ce + Cr, and the cost per unit time is Crate(P) = Crejuv(P) / P.
To minimize this quantity, we find P such that dCrate(P)/dP = 0 (and d^2Crate(P)/dP^2 > 0). Differentiating, the optimal value of the rejuvenation period, denoted by P*, satisfies
P*·N'(P*) - N(P*) = Cr/Ce.
To set the period P appropriately, we need to know the values of the ratio Cr/Ce and the function N(t).
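As a worked numeric sketch, assume an illustrative superlinear error-accumulation function N(t) = a·t^2 (all parameter values below are invented); the optimality condition then gives P* in closed form.

```python
import math

a, Ce, Cr = 0.001, 10.0, 5.0  # invented parameters; N(t) = a*t**2 assumed

def crate(P):
    """Cost per unit time: (N(P)*Ce + Cr) / P."""
    return (a * P**2 * Ce + Cr) / P

# The optimality condition P*N'(P) - N(P) = Cr/Ce with N(t) = a*t^2
# reduces to a*P^2 = Cr/Ce, i.e. P* = sqrt(Cr / (a*Ce)).
p_star = math.sqrt(Cr / (a * Ce))

print(p_star)  # optimal rejuvenation period
print(crate(p_star / 2), crate(p_star), crate(2 * p_star))  # minimum at P*
```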
N-version programming: N independent teams develop software to the same specifications. The N versions are then run in parallel, and their outputs are voted on.
[Figure: Version 1, Version 2, and Version 3 run in parallel and feed an output comparator.]
Example: the Airbus A320 uses N-version programming in its onboard software.
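A toy sketch of N-version programming with a majority voter; the three versions and the seeded bug in version 3 are purely illustrative.

```python
# Three independently written "versions" of the same function.
def version1(x): return x * x
def version2(x): return x ** 2
def version3(x): return x * x + (1 if x == 7 else 0)  # seeded bug

def vote(results):
    """Return the majority output, or raise if there is no majority."""
    for r in results:
        if results.count(r) > len(results) // 2:
            return r
    raise RuntimeError("no majority among versions")

x = 7
print(vote([version1(x), version2(x), version3(x)]))  # 49: the bug is outvoted
```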
Like N-version programming, the recovery-block approach also uses multiple versions of software, but only one version runs at any one time. If this version fails, execution is switched to a backup.
The success of the recovery-block approach depends on: the extent to which the primary and the various secondaries fail on the same inputs (correlated bugs), and the quality of the acceptance test. These clearly vary from one application to the next.
Define f = P{E} and s = P{T|E} (E: the version's output is erroneous; T: the acceptance test passes the output); the quantity P{E|T} also appears in the analysis. For the scheme to succeed, it must succeed at some stage i (1 <= i <= n). This happens if the test rejects the outputs at stages 1, ..., i - 1 and, at stage i, the version's output is correct and passes the test; the overall success probability is the sum over i of the probabilities of these events.
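A toy sketch of a recovery block: the primary, the secondary, and the acceptance test are all invented for illustration, and the primary carries a seeded fault so the fallback path is exercised.

```python
# Recovery-block sketch: try the primary, fall back to a secondary,
# guarded by an acceptance test.

def acceptance_test(x, result):
    """Accept a result r as the square root of x if r >= 0 and r*r is close to x."""
    return result >= 0 and abs(result * result - x) < 1e-6

def primary(x):
    return -1.0  # seeded fault: always produces a wrong answer

def secondary(x):
    # Backup implementation: bisection square root.
    lo, hi = 0.0, max(1.0, x)
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * mid < x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def recovery_block(x, versions, test):
    for run in versions:          # checkpoint/restore omitted for brevity
        result = run(x)
        if test(x, result):       # acceptance test after each attempt
            return result
    raise RuntimeError("all versions failed the acceptance test")

print(recovery_block(2.0, [primary, secondary], acceptance_test))
```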
Effective exception handling can make a significant contribution to system fault tolerance. Exceptions can be used to deal with: a domain or range error; an out-of-the-ordinary event (not a failure) that needs special attention; or a timing failure.
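A brief sketch of these exception uses in code; the function and class names are illustrative, not from any real system.

```python
import math

def safe_sqrt(x):
    """Surface a domain error as a descriptive exception."""
    try:
        return math.sqrt(x)
    except ValueError:  # domain error: negative input
        raise ValueError(f"sqrt undefined for negative input {x}") from None

class TimeoutFailure(Exception):
    """Raised when a service does not respond within its deadline."""

def call_with_deadline(elapsed, deadline):
    """Surface a timing failure as an exception."""
    if elapsed > deadline:
        raise TimeoutFailure(f"exceeded {deadline}s deadline")
    return "ok"

print(safe_sqrt(2.0))
print(call_with_deadline(0.5, 1.0))  # prints "ok"
```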
Factors influencing reliability, with the metric used for each:
- Program complexity: KLOC
- Amount of programming effort: man-years
- Difficulty of programming: man-years/year^2
- Programmer skills: PSKL = I / n
- Documentation
- Testing effort: man-years
- Testing coverage: TCVG = Sc / ST
- Testing tools
- Frequency of program specification change
- Development team size: number of people
Tools for reliability measurement:
- Defect tracking: (Intersolv), DDTS (Qualtrack), etc.
- Test coverage evaluation: GCT (Testing Foundation), PureCoverage (Rational), ATAC (Bellcore)
- Reliability growth modeling: SMERFS (NSWC), CASRE (NASA), ROBUST (CSU), etc.
- Defect density estimation: ROBUST (CSU)
- Coverage-based reliability modeling: ROBUST (CSU)
- Markov reliability evaluation: HARP (NASA), HiRel (NASA), PC Availability (Management Sciences), etc.
Conclusion: reliability matters to both users and developers of software. As reliability is defined in terms of failures, it cannot be measured before development is complete. Nevertheless, if data on inter-failure times is carefully collected, we can make accurate forecasts of software reliability.
References:
- Michael R. Lyu, "Software Reliability Engineering: A Roadmap", Future of Software Engineering (FOSE'07).
- Israel Koren and C. Mani Krishna, Fault-Tolerant Systems, Chapter 5.
- Xuemei Zhang and Hoang Pham, "An analysis of factors affecting software reliability", The Journal of Systems and Software.
- "On the Software Reliability Models of Jelinski-Moranda and Littlewood", IEEE Transactions on Reliability, Vol. R-34, No. 3.
- A. Iannino, J. D. Musa, K. Okumoto, "Criteria for Software Reliability Model Comparisons", Bell Laboratories.
- J. D. Musa, "A Theory of Software Reliability and Its Application", IEEE Transactions on Software Engineering.
Thank You.