
- What is reliability?
- Importance of reliability
- How do we measure it?
- Reliability specifications
- Reliability growth models
- Role of the software development process in reliability
- Factors influencing reliability
- Brief introduction of tools to measure reliability

Definition: the probability of failure-free operation of a computer program in a specified environment, for a specified time, for a given purpose. It differs from hardware reliability in that it reflects design perfection rather than manufacturing perfection, and it is not a direct function of time. Informally, reliability is a measure of how well system users think the system provides the services they require; it speaks about a product's trustworthiness and dependability.

It estimates the lifetime of the product concerned. Example:


A company X has developed a program for a Web server with the failure intensity objective of 1 failure/100,000 transactions. During testing, the program runs for 50 hours, handling 10,000 transactions per hour on average with no failures occurring. How confident are we that the program has met its objective? Can we release the software now?
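One common way to reason about this question, assuming failures follow a Poisson process: if the true intensity were exactly the objective, the probability of observing zero failures over the whole test period would be e^(−λ·T). A minimal sketch:

```python
import math

# Figures from the example above.
objective = 1 / 100_000          # failure intensity objective (failures per transaction)
transactions = 50 * 10_000       # 50 hours x 10,000 transactions/hour, zero failures observed

# If the true intensity were exactly the objective, the chance of seeing
# zero failures in this many transactions (Poisson assumption) is:
p_zero_failures = math.exp(-objective * transactions)   # e^(-5), about 0.0067

# Observing no failures therefore gives roughly this much confidence that
# the true intensity is at or below the objective.
confidence = 1 - p_zero_failures
print(f"P(0 failures | intensity = objective) = {p_zero_failures:.4f}")
print(f"Rough confidence objective is met     = {confidence:.4f}")    # about 0.993
```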

It helps to determine the unit cost of the product.


Example: The unit manufacturing cost of a software product is $50. The company decides to offer a one-year free update service to its customers. Suppose that the failure intensity of the product at release time is λ = 0.01 failures/month. What should be the unit cost of the product including warranty services?
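A rough way to cost the warranty, as a sketch only: the per-failure service cost and the assumption that the failure intensity stays constant over the year are hypothetical, not given in the example:

```python
manufacturing_cost = 50.0      # $ per unit (given)
failure_intensity = 0.01       # failures/month at release (given)
warranty_months = 12           # one year of free updates (given)
cost_per_failure = 200.0       # hypothetical average cost to service one failure

# Expected failures per unit over the warranty period, assuming the
# intensity stays roughly constant (in practice it should decrease
# as updates fix faults, so this is a conservative estimate).
expected_failures = failure_intensity * warranty_months   # 0.12

unit_cost = manufacturing_cost + expected_failures * cost_per_failure
print(f"Expected failures per unit: {expected_failures:.2f}")
print(f"Unit cost incl. warranty  : ${unit_cost:.2f}")    # $74 with these assumptions
```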

The higher the reliability, the greater the growth of the product.

Goals of measuring reliability:
- To measure the reliability of the product exactly.
- To determine whether the software can be released.
- To predict the resources required to bring the software to the required reliability.
- To determine the impact of insufficient resources on the operational profile.
- To prioritize the testing/inspection of the modules having the highest estimated fault content.
- To develop fault avoidance techniques that minimize the number of faults inserted and prevent the insertion of specific types of faults.

Probability of failure on demand (POFOD)
- A measure of the likelihood that the system will fail when a service request is made.
- POFOD = 0.001 means 1 out of 1000 service requests results in failure.
- Relevant for safety-critical or non-stop systems.

Rate of occurrence of failures (ROCOF)
- The frequency of occurrence of unexpected behaviour.
- ROCOF of 0.02 means 2 failures are likely in each 100 operational time units.
- Relevant for operating systems and transaction processing systems.

Mean time to failure (MTTF)
- A measure of the time between observed failures.
- MTTF of 500 means that the time between failures is 500 time units.
- Relevant for systems with long transactions, e.g. CAD systems.

Availability (AVAIL)
- A measure of how likely the system is to be available for use.
- Availability of 0.998 means the software is available for 998 out of every 1000 time units.
- Relevant for continuously running systems, e.g. telephone switching systems.
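These metrics can be estimated directly from simple test or field logs. A minimal sketch; all of the observation figures below are made up for illustration:

```python
# Hypothetical observations, purely for illustration.
service_requests = 250_000
failed_requests = 230
failure_times = [120.0, 410.0, 695.0, 1180.0]   # operational time units at which failures occurred
observation_period = 1500.0                      # total operational time units observed
total_downtime = 3.0                             # time units spent repairing/restarting

pofod = failed_requests / service_requests                   # probability of failure on demand
rocof = len(failure_times) / observation_period              # failures per operational time unit

# MTTF from the gaps between successive failures.
gaps = [t2 - t1 for t1, t2 in zip(failure_times, failure_times[1:])]
mttf = sum(gaps) / len(gaps)

# Availability from the observed uptime ratio.
availability = (observation_period - total_downtime) / observation_period

print(f"POFOD ~ {pofod:.5f}, ROCOF ~ {rocof:.5f}/unit, "
      f"MTTF ~ {mttf:.1f} units, Availability ~ {availability:.4f}")
```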

Failure classes:
- Transient: occurs only with certain inputs.
- Permanent: occurs with all inputs.
- Recoverable: the system can recover without operator intervention.
- Unrecoverable: operator intervention is needed to recover from the failure.
- Non-corrupting: the failure does not corrupt system state or data.
- Corrupting: the failure corrupts system state or data.
- Cosmetic: no erratic results; usage may be hobbled.

Steps for reliability specification:
- Identify the different types of system failure and analyse the consequences of those failures.
- Partition failures into appropriate classes.
- Define the reliability requirement using an appropriate reliability metric.
Choosing a metric for the reliability specification:
- POFOD: for systems where (critical) services happen in an unpredictable way, or where there is a long time interval between consecutive requests.
- ROCOF: for systems where (critical) services are demanded in a more regular way.
- MTTF: for systems involving long transactions, during which a guarantee of service continuity and delivery should be expected.
- AVAIL: for systems where continuous service delivery is a major concern.

Example: Reliability specification for bank ATMs


- Failure class: Permanent, non-corrupting. Example: the system fails to operate with any card which is input; the software must be restarted to correct the failure. Reliability metric: ROCOF, 1 occurrence/1000 days.
- Failure class: Transient, non-corrupting. Example: the magnetic stripe data cannot be read on an undamaged card which is input. Reliability metric: POFOD, 1 in 1000 transactions.
- Failure class: Transient, corrupting. Example: a pattern of transactions across the network causes database corruption. Reliability metric: unquantifiable! Should never happen in the lifetime of the system.

The objective of statistical testing is to determine reliability rather than to discover errors. Users may find the system more reliable than it really is! Statistical testing combined with reliability growth models helps in predicting when the final system reliability can be achieved.

The steps involved in statistical testing are:


- Determine the operational profile of the software: this can be determined by analysing the usage pattern.
- Manually select or automatically generate a set of test data corresponding to the operational profile (see the sketch after this list).
- Apply the test cases to the program, recording the execution time between each failure (it may not be appropriate to use raw execution time).
- After a statistically significant number of failures have been observed, the reliability can be computed.
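Generating test data according to the operational profile, referenced in the second step, can be as simple as weighted random sampling over usage classes; the profile below is made up for illustration:

```python
import random

# Hypothetical operational profile: usage class -> probability of occurrence.
operational_profile = {
    "withdraw_cash": 0.55,
    "check_balance": 0.30,
    "print_statement": 0.10,
    "change_pin": 0.05,
}

def generate_test_cases(n, profile, seed=42):
    """Draw n test-case classes with frequencies matching the profile."""
    rng = random.Random(seed)
    classes = list(profile)
    weights = [profile[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n)

tests = generate_test_cases(1000, operational_profile)
print({c: tests.count(c) for c in operational_profile})
```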

Statistical testing is not easy to apply, due to the following:
- Operational profile uncertainty: this is a particular problem for new systems with no operational history.
- High cost of operational profile generation: the cost depends heavily on what usage information is collected by the organisation that requires the profile.
- Statistical uncertainty when high reliability is specified: it is difficult to estimate the level of confidence in the operational profile, and the usage pattern of the software may change with time.

Basics of reliability theory: the exponential distribution is an unbounded probability density function (pdf) that reflects the idea that failure times occur purely at random:
f(t) = λ·e^(−λt)

The probability of failure between times t1 and t2 is given by:
P(t1 < T ≤ t2) = ∫_{t1}^{t2} f(t) dt = e^(−λ·t1) − e^(−λ·t2)

The distribution function F(t) is the probability of failure in the interval from 0 to t:
F(t) = ∫_{0}^{t} f(x) dx = 1 − e^(−λt)

We define the reliability function R(t) as:
R(t) = 1 − F(t) = e^(−λt)

The mean time to failure (MTTF) is the mean of the probability density function f(t):
MTTF = ∫_{0}^{∞} t·f(t) dt = 1/λ

With a constant failure rate λ, the expected time of the i-th failure is i·MTTF = i/λ.
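A quick numeric check of these formulas; the failure rate λ below is a made-up value:

```python
import math

lam = 0.002            # hypothetical failure rate per time unit

def reliability(t):
    # R(t) = e^(-lambda * t)
    return math.exp(-lam * t)

mttf = 1 / lam                                           # mean of the exponential pdf
p_fail_100_200 = reliability(100) - reliability(200)     # P(failure in (100, 200])

print(f"R(100) = {reliability(100):.4f}, MTTF = {mttf:.0f}, "
      f"P(fail between 100 and 200) = {p_fail_100_200:.4f}")
```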

Once a failure occurs, there is a need to repair the fault. It is therefore useful to know the mean time to repair (MTTR) for the component that has failed. Availability is the probability that a component is operating at a given point in time. Pressman defines availability as:
Availability = MTTF / (MTTF + MTTR) × 100%

Software reliability models (SRMs) can be used as an indication of the number of failures that may be encountered after the software has shipped, and thus as an indication of whether the software is ready to ship. SRMs therefore aid the planning and decision-making process.

Two types of software reliability models:
- Defect density models: use design parameters such as LOC, loop nesting, external references, etc.
- Software reliability growth models: attempt to statistically correlate defect detection data with known functions such as an exponential function.

A software reliability growth (SRG) model is a mathematical model of how software reliability improves as errors are detected and repaired. Different software reliability growth models differ in how they predict the error rate, and thereby the reliability.

These models attempt to statistically correlate defect detection data with known functions such as an exponential function. If the correlation is good, the known function can be used to predict future behaviour. Two types of data are relevant for software reliability growth models:
- the time at which each defect was discovered
- the number of defects discovered

Execution (CPU) time is the best measure of the amount of testing; using calendar time or the number of test cases to measure the amount of testing did not provide credible results.

Fitting the software reliability growth model function to the data means estimating its parameters from the data. Parameter estimation can be done using:
- Maximum likelihood estimation
- Classical least squares (see the sketch below)
- Alternate least squares
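As an illustration of the least-squares route, the sketch below fits an exponential (Goel-Okumoto style) mean-value function μ(t) = a·(1 − e^(−bt)) to cumulative defect counts; the weekly counts and the use of scipy's curve_fit are assumptions for illustration, not part of the slides:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical weekly cumulative defect counts from system test.
weeks = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
cum_defects = np.array([12, 23, 31, 38, 44, 48, 51, 53, 55, 56], dtype=float)

def mean_value(t, a, b):
    # Exponential SRGM mean-value function:
    # a = total expected defects, b = per-defect detection rate.
    return a * (1.0 - np.exp(-b * t))

(a_hat, b_hat), _ = curve_fit(mean_value, weeks, cum_defects, p0=(60.0, 0.2))

remaining = a_hat - cum_defects[-1]
print(f"Estimated total defects a = {a_hat:.1f}, detection rate b = {b_hat:.3f}")
print(f"Estimated defects still latent after week 10: {remaining:.1f}")
```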

During the test, the model should predict the additional test effort required to achieve a quality level (as measured by number of remaining defects) that we deem suitable for customer use. The model must become stable during the test period, i.e., the predicted number of total defects should not vary significantly from week to week. The model must provide a reasonably accurate prediction of the number of defects that will be discovered in field use.

Jelinski-Moranda model:
- Assumes that at time 0 the software has a fixed (and finite) number N(0) of bugs, of which N(t) bugs remain at time t.
- Assumes that all bugs contribute equally to the error rate.
- The error process is a nonhomogeneous Poisson process, i.e., a Poisson process with a rate λ(t) that may vary with time as λ(t) = c·N(t) for some constant c. λ(t) decreases by c whenever an error occurs and the bug that caused it is corrected, and is constant between errors.

The reliability at time t, i.e. the probability of error-free operation during [0, t], is therefore
R(t) = e^(−c·N(0)·t)

Given that an error occurred at time τ, the conditional future reliability, i.e. the conditional probability that the following interval of length t, namely [τ, τ + t], will be error-free, is
R(t | τ) = e^(−c·N(τ)·t)

As the software runs for longer and longer, more bugs are caught and purged from the system, and so the error rate declines and the future reliability increases
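A minimal sketch of these quantities in Python; the values of N(0) and c are assumed purely for illustration:

```python
import math

N0 = 30         # assumed initial number of bugs, N(0)
c = 0.0005      # assumed per-bug contribution to the error rate

def error_rate(bugs_remaining):
    # lambda(t) = c * N(t): every remaining bug contributes equally.
    return c * bugs_remaining

def reliability(t, bugs_remaining):
    # Probability of error-free operation over an interval of length t,
    # given the current number of remaining bugs.
    return math.exp(-error_rate(bugs_remaining) * t)

# The error rate drops by c each time a bug is found and corrected.
for fixed in (0, 10, 20, 29):
    remaining = N0 - fixed
    print(f"{fixed:2d} bugs fixed: lambda = {error_rate(remaining):.4f}, "
          f"R(100) = {reliability(100, remaining):.3f}")
```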

The Musa-Okumoto model assumes an initially infinite (or at least very large) number of bugs in the software. Define:
- M(t): the number of bugs discovered and corrected during [0, t]
- λ0: the error rate at time 0
- c: a constant / failure-rate decay parameter
- μ(t): the expected number of errors experienced during [0, t], μ(t) = E[M(t)]

Under this model, the error rate after testing for a length of time t is given by
λ(t) = λ0·e^(−c·μ(t))

From the definitions of λ(t) and μ(t), we have
dμ(t)/dt = λ(t) = λ0·e^(−c·μ(t))

The solution of this differential equation is
μ(t) = (1/c)·ln(λ0·c·t + 1)   and   λ(t) = λ0 / (λ0·c·t + 1)

The reliability R(t), the probability of error-free operation during [0, t], can now be calculated as
R(t) = e^(−μ(t)) = (λ0·c·t + 1)^(−1/c)

The error rate thus declines with time for the Musa-Okumoto model; getting the error rate of the software down to a sufficiently low point clearly requires a significant amount of testing.
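A sketch of the same quantities in Python, again with made-up values for λ0 and c; the conditional reliability over [τ, τ + t] is computed from the difference μ(τ + t) − μ(τ), the usual step for a nonhomogeneous Poisson process:

```python
import math

lam0 = 0.5     # assumed initial error rate (failures per CPU hour)
c = 0.05       # assumed failure-rate decay parameter

def mu(t):
    # Expected number of errors in [0, t]: (1/c) * ln(lambda0*c*t + 1)
    return math.log(lam0 * c * t + 1.0) / c

def error_rate(t):
    # lambda(t) = lambda0 / (lambda0*c*t + 1)
    return lam0 / (lam0 * c * t + 1.0)

def reliability(dt, tau):
    # Probability of no failure in [tau, tau + dt] under this NHPP model.
    return math.exp(-(mu(tau + dt) - mu(tau)))

for tau in (0, 100, 1000):
    print(f"after {tau:4d} CPU hours: lambda = {error_rate(tau):.4f}, "
          f"mu = {mu(tau):.1f}, R(10 | tau) = {reliability(10, tau):.3f}")
```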

The reliability of a program can also be viewed as the probability that the program will give the correct output with a typical set of input data from the user environment, and it can be expressed as a function of its components. Let qi be the probability that Pi will be generated in a user environment; then
R = Σi qi·ri
where ri is the reliability associated with Pi.

Working:
- The program is seen as a graph.
- Assume there is one entry node and one exit node.
- Every transition from node Ni to node Nj has a probability Pij.
- If there is no connection between Ni and Nj, then Pij = 0.

[Figure: control-flow graph with input node N1, intermediate nodes N2 and N3, and output node N4; the edges are labelled with transition probabilities P12, P13, P14, P21, P24, P31, P34.]

Two new exit states are added:
- C: the program returns the correct output.
- F: if any module has a fault, the program does not return the correct output.
[Figure: augmented graph in which each edge Ni → Nj is relabelled Ri·Pij, each node Ni has an edge to F with probability 1 − Ri, and N4 reaches C with probability R4.]

Let Pn(i, j) be the probability that the system goes from the starting state i and enters an absorbing state j ∈ {C, F} at or before the n-th step. The reliability of the system is the probability that it eventually reaches state C when starting at node N1:
R = lim (n → ∞) Pn(N1, C)

Let Q be the matrix of transition probabilities between the non-absorbing states, with Q(i, j) = Ri·Pij, and let S be the n-by-n matrix such that
S = I + Q + Q² + Q³ + … = Σ (k ≥ 0) Q^k

Let W = I − Q; it can be shown that S = W⁻¹ = (I − Q)⁻¹.

And finally the reliability of the software can be calculated as
R = S(1, n)·Rn
where Rn is the reliability of the exit node (N4 in the example above).
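A sketch of this matrix calculation for the four-node graph above; the module reliabilities Ri and the transition probabilities Pij are invented for illustration:

```python
import numpy as np

# Hypothetical module reliabilities R1..R4 and transition probabilities Pij.
R = np.array([0.999, 0.995, 0.990, 0.998])
P = np.array([
    [0.0, 0.5, 0.3, 0.2],   # N1 -> N2, N3, N4
    [0.2, 0.0, 0.0, 0.8],   # N2 -> N1, N4
    [0.4, 0.0, 0.0, 0.6],   # N3 -> N1, N4
    [0.0, 0.0, 0.0, 0.0],   # N4 is the exit node
])

# Q(i, j) = Ri * Pij: the transition is taken only if module i executes correctly.
Q = R[:, None] * P

# S = I + Q + Q^2 + ... = (I - Q)^(-1)
S = np.linalg.inv(np.eye(4) - Q)

# Reliability: reach the exit node from N1 without failure, then execute it correctly.
reliability = S[0, 3] * R[3]
print(f"System reliability ~ {reliability:.4f}")
```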

Assumptions made by these models:
- Definition and criticality of failures
- Fixed number of faults
- All faults have the same failure rate
- All software faults are always exposed
- Testing is homogeneous
- Failure rate is (only) proportional to error content
- Number of failures in disjoint intervals is independent
- Failure data collection is accurate

Experiments have revealed that about 70% of software faults result from problems introduced during the requirements definition and design phases. The software development process consists of five phases: analysis, design, coding, testing, and operation.

Analysis:
- Better training, software tools, and programmer, resource, and time management; adequate staffing.
- Cost effectiveness.

Programming for reliability: improved programming techniques, better programming languages, and better quality management have all led to very significant improvements in reliability for most software. Reliability in a software system can be achieved using two strategies:
- Fault avoidance
- Fault tolerance

Fault avoidance has the objective of producing fault-free systems. It relies on many factors:
- Availability of a precise (preferably formal) system specification.
- Adoption of an approach to software design and implementation that is based on information hiding and encapsulation.
- Use of a strongly typed programming language.
- Restrictions on the use of programming constructs such as pointers, GOTO statements, dynamic memory allocation, recursion, and interrupts.

Fault tolerance assumes that residual faults remain in the system. Facilities are provided in the software to allow operation to continue when these faults cause system failures. Software fault-tolerance techniques help to mitigate the impact of failures caused by software defects (bugs). Techniques include:
- Single-version fault tolerance
- Software rejuvenation
- N-version programming
- The recovery block approach
- Exception handling

In this section, we consider ways by which individual pieces of software can be made more robust.
Wrappers: a wrapper is a piece of software that encapsulates the given program while it is being executed (wrapper software around a wrapped entity). Examples of their use:
- Dealing with buffer overflow.
- Checking the correctness of the scheduler.
- Using a wrapper to check for correct output.

Software rejuvenation. Definition: proactively halting the process, cleaning up its internal state, and then restarting it.
Rejuvenation levels: one can rejuvenate at either the application or the processor level.
Timing of rejuvenation: software rejuvenation can be based on either time or prediction. Time-based rejuvenation consists of rejuvenating at constant intervals; determining an optimal inter-rejuvenation period is critical.

Notation:
- N(t): expected number of errors over an interval of length t (without rejuvenation)
- Ce: cost of each error
- Cr: cost of each rejuvenation
- P: inter-rejuvenation period

The cost of rejuvenation over a period P, denoted by Crejuv(P), is
Crejuv(P) = N(P)·Ce + Cr

The cost per unit time, Crate(P), is then given by
Crate(P) = Crejuv(P)/P = (N(P)·Ce + Cr)/P

To minimize this quantity, we find P such that dCrate(P)/dP = 0 (and d²Crate(P)/dP² > 0). Differentiating, the optimal rejuvenation period P* satisfies
P*·N′(P*) − N(P*) = Cr/Ce

To set the period P appropriately, we need to know the values of the parameters Cr/Ce and N(t).
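As a numeric illustration, the sketch below assumes an aging model N(t) = a·t² (all parameter values are invented) and locates P* with a simple grid search, cross-checking it against the closed form implied by the optimality condition above:

```python
import numpy as np

a = 0.0004          # assumed aging coefficient: N(t) = a * t^2 errors by time t
Ce = 1000.0         # assumed cost per error
Cr = 5.0            # assumed cost per rejuvenation

def cost_rate(P):
    # Crate(P) = (N(P)*Ce + Cr) / P
    N_P = a * P**2
    return (N_P * Ce + Cr) / P

P_grid = np.linspace(1.0, 500.0, 100_000)
P_star_numeric = P_grid[np.argmin(cost_rate(P_grid))]

# For N(t) = a*t^2 the condition P*N'(P) - N(P) = Cr/Ce reduces to
# a*P^2 = Cr/Ce, giving the closed form P* = sqrt(Cr / (a * Ce)).
P_star_closed = np.sqrt(Cr / (a * Ce))

print(f"P* (grid search) ~ {P_star_numeric:.2f}, P* (closed form) = {P_star_closed:.2f}")
```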

Prediction based Rejuvenation:


This involves monitoring the system characteristics (amount of memory allocated, number of file locks held, and so on) and predicting when the system will fail. For example, if a process is consuming memory at a certain rate, the system can estimate when it will run out of memory. Rejuvenation then takes place just before the predicted crash.

N-version programming: N independent teams of programmers develop software to the same specifications. The N versions of the software are then run in parallel, and their outputs are voted on.
[Figure: Version 1, Version 2, and Version 3 feeding an output comparator.]
Example: the Airbus A320 uses N-version programming in its on-board software.
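A minimal sketch of the voting step; the three "versions" below are toy stand-ins, not real avionics software:

```python
from collections import Counter

def vote(outputs):
    """Return the majority output, or None if there is no majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Three hypothetical, independently developed versions of the same function.
def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x * x + 1 if x == 7 else x * x   # seeded with a bug at x == 7

for x in (3, 7):
    outputs = [v(x) for v in (version_a, version_b, version_c)]
    print(f"x = {x}: outputs = {outputs}, voted result = {vote(outputs)}")
```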

The recovery block approach also uses multiple versions of software, but only one version runs at any one time. If this version fails, execution is switched to a backup.

The success of the recovery block approach depends on:
- the extent to which the primary and the various secondaries fail on the same inputs (correlated bugs)
- the quality of the acceptance test
These clearly vary from one application to the next.

Success probability calculation. Define:
- E: the event that the output of a version is erroneous
- T: the event that the test reports that the output is wrong
- f: the failure probability of a version, f = P{E}
- s: the test sensitivity, s = P{T | E}
- σ: the test specificity
- n: the number of available software versions (primary plus secondaries)

For the scheme to succeed, it must succeed at some stage i (0 < i ≤ n). This will happen if the test rejects the output at stages 1, ..., i-1, and at stage i the version's output is correct and it passes the test.
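Putting this together gives the overall success probability. A sketch of the calculation, taking the specificity σ as the probability that the test accepts a correct output (an assumption made here for concreteness):

```python
def recovery_block_success(f, s, sigma, n):
    """P(the recovery block scheme delivers a correct, accepted output).

    f     : failure probability of each version, P{E}
    s     : test sensitivity, P{T | E}
    sigma : test specificity, taken here as P{not T | not E}
    n     : number of versions (primary plus secondaries)
    """
    p_reject = s * f + (1 - sigma) * (1 - f)   # a stage's output is rejected, so the next version runs
    p_correct_accepted = (1 - f) * sigma       # a stage's output is correct and the test accepts it
    # Success at stage i requires i-1 rejections followed by a correct, accepted output.
    return sum(p_reject ** (i - 1) * p_correct_accepted for i in range(1, n + 1))

# Illustrative values only.
print(recovery_block_success(f=0.05, s=0.95, sigma=0.99, n=3))   # ~0.997
```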

Effective exception handling can make a significant contribution to system fault tolerance. Exceptions can be used to deal with:
- a domain or range error
- an out-of-the-ordinary event (not a failure) that needs special attention
- a timing failure

Factors influencing reliability, and the metric used for each:
- Program complexity: KLOC
- Amount of programming effort: man-years
- Difficulty of programming: man-years/years²
- Programmer skills: PSKL = I / n
- Documentation: (not specified)
- Testing effort: man-years
- Testing coverage: TCVG = Sc / ST
- Testing tools: (not specified)
- Frequency of program specification change: (not specified)
- Development team size: number of people

Tools for reliability measurement:
- Defect tracking: BugBase (Archimides), DVCS Tracker (Intersolv), DDTS (Qualtrack), etc.
- Test coverage evaluation: GCT (Testing Foundation), PureCoverage (Rational), ATAC (Bellcore)
- Reliability growth modeling: SMERFS (NSWC), CASRE (NASA), ROBUST (CSU), etc.
- Defect density estimation: ROBUST (CSU)
- Coverage-based reliability modeling: ROBUST (CSU)
- Markov reliability evaluation: HARP (NASA), HiRel (NASA), PC Availability (Management Sciences), etc.

Software reliability is a key problem for many users and developers of software. As reliability is defined in terms of failures, it is impossible to measure it before development is complete. Nevertheless, if data on interfailure times is carefully collected, we can make accurate forecasts of software reliability.

References:
- Michael R. Lyu, "Software Reliability Engineering: A Roadmap", Future of Software Engineering (FOSE'07).
- Israel Koren and C. Mani Krishna, Fault-Tolerant Systems, Chapter 5.
- Xuemei Zhang and Hoang Pham, "An analysis of factors affecting software reliability", The Journal of Systems and Software.
- "On the Software Reliability Models of Jelinski-Moranda and Littlewood", IEEE Transactions on Reliability, Vol. R-34, No. 3.
- A. Iannino, J. D. Musa, K. Okumoto (Bell Laboratories), "Criteria for Software Reliability Model Comparisons".
- J. D. Musa, "A theory of software reliability and its application", IEEE Transactions on Software Engineering.

Thank You.
