Probability and Statistics in Microsoft Excel
Excel provides more than 100 functions relating to probability and statistics. It also has a facility for constructing a wide range of charts and graphs for displaying data. This leaflet provides a quick reference guide to assist you in harnessing Excel's statistical capability. Except where indicated, the features included here are available in Excel Versions 4.0 and above. Almost all the instructions here also apply to the spreadsheet facility in OpenOffice (http://openoffice.orgsuite.com/); any slight variations in commands should be obvious to the user.

Excel is not designed for statistical computing. If you require statistical analysis beyond data validation and manipulation, tabulation, presentation and calculation of summary statistics, you are advised to use a bespoke statistical package such as Minitab or SPSS.

Excel has an Analysis ToolPak optional add-in facility that includes macros for carrying out many elementary statistical analyses. The instructions for installing this add-in vary with the version of Excel; use the Help facility in Excel for further information. This add-in facility is not used in this leaflet. There are two reasons why this add-in should be used with care:

- Unlike other spreadsheet functionality, which ensures that calculations automatically update in the light of changes elsewhere in the workbook, the output from the add-in is not dynamically linked to the source data. Hence if any of the data change, the add-in must be run again to obtain updated output.
- Output from the add-in can be misleading (see http://support.microsoft.com/kb/829252 for example).

There are other commercially available add-ins that make use of Excel's familiar user interface but supplement its statistical functionality. Examples include:

Analyse-it    http://www.analyseit.com/
RExcel        http://rcom.univie.ac.at/
Unistat       http://www.unistat.com/
XLSTAT        http://www.xlstat.com/en/home/
StatTools     http://www.palisade.com/stattools/
Using this leaflet
Suppose you have a sample of three data values, 10.4, 11.2 and 16.4, that you have entered into cells A2:A4 on a worksheet. In Excel a function, e.g. SUM, can be applied to these data in one of four ways:

=SUM(10.4,11.2,16.4)
=SUM(A2,A3,A4)
=SUM(A2:A4)
=SUM(x)    where x is the name attached to range A2:A4.

In this leaflet, for simplicity, we have chosen to refer to named ranges. To name a range, simply highlight the range of cells, click in the Name Box on the far left of the Formula Bar, type in the required name, e.g. x, then press Enter. In Excel 2007 names can be managed via Formulas > Name Manager.

If you prefer not to use names then in what follows simply replace the name of the range, e.g. x, by the range address, e.g. A2:A4.
Descriptive Statistics
Assuming a sample of data in range x:

Sample total, Σx                 =SUM(x)
Sample size, n                   =COUNT(x)
Sample mean, Σx/n                =AVERAGE(x)
Sample variance, s²              =VAR(x)
Sample standard deviation, s     =STDEV(x)
Mean squared deviation           =VARP(x)
Root mean squared deviation      =STDEVP(x)
Corrected sum of squares, Sxx    =DEVSQ(x)
Raw sum of squares, Σx²          =SUMSQ(x)
Minimum value                    =MIN(x)
Maximum value                    =MAX(x)
Range                            =MAX(x)-MIN(x)
Lower Quartile, Q1*              =QUARTILE(x,1)
Median, Q2                       =MEDIAN(x)
Upper Quartile, Q3*              =QUARTILE(x,3)
Interquartile range, IQR         =QUARTILE(x,3)-QUARTILE(x,1)
Kth Percentile                   =PERCENTILE(x,K%)    where K is a number between 0 and 100
Mode                             =MODE(x)

*Note: There are several different definitions for the upper and lower quartiles, so the values calculated by Excel may not agree with your textbook or other statistical calculation tools.

Boxplot: see http://www.coventry.ac.uk/ec/~nhunt/boxplot.htm
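The note on quartile definitions can be illustrated outside Excel. The sketch below uses Python's standard statistics module on a hypothetical sample (not data from this leaflet) and compares two common conventions. The "inclusive" method is assumed here to be the one that corresponds to Excel's QUARTILE; treat that correspondence as an assumption to check against your own version of Excel.

```python
import statistics

# Hypothetical sample, chosen only to show the two conventions disagreeing.
sample = [2, 4, 4, 5, 7, 9, 10, 12]

# "inclusive" interpolates between the smallest and largest observations;
# assumed here to match the convention behind Excel's QUARTILE.
inclusive = statistics.quantiles(sample, n=4, method="inclusive")

# "exclusive" is one of the alternative textbook definitions.
exclusive = statistics.quantiles(sample, n=4, method="exclusive")

print("inclusive Q1, Q2, Q3:", inclusive)
print("exclusive Q1, Q2, Q3:", exclusive)
```

For this sample the two methods agree on Q1 and the median but give different upper quartiles, which is the kind of discrepancy the note warns about.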
Grouped Frequency Data
Assuming a frequency distribution with class midpoints stored in range x and frequencies in range f:

Sample size, n                   =SUM(f)
Sample total, Σfx                =SUMPRODUCT(f,x)
Sample mean, Σfx/n               =SUMPRODUCT(f,x)/SUM(f)
Corrected sum of squares, Sxx    =SUMPRODUCT(f,x,x)-SUMPRODUCT(f,x)^2/SUM(f)
Sample variance, s²              =(SUMPRODUCT(f,x,x)-SUMPRODUCT(f,x)^2/SUM(f))/(SUM(f)-1)
Sample standard deviation, s     =SQRT(Sample variance)
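As a cross-check on the grouped-data formulas, the following minimal Python sketch repeats the same arithmetic on a hypothetical frequency table. The midpoints and frequencies are invented for illustration; the variable names are not part of Excel.

```python
import math

# Hypothetical grouped data: class midpoints and their frequencies.
x = [5, 15, 25, 35]
f = [2, 5, 8, 5]

n = sum(f)                                          # =SUM(f)
total = sum(fi * xi for fi, xi in zip(f, x))        # =SUMPRODUCT(f,x)
mean = total / n                                    # sample mean
raw_ss = sum(fi * xi * xi for fi, xi in zip(f, x))  # =SUMPRODUCT(f,x,x)
sxx = raw_ss - total ** 2 / n                       # corrected sum of squares
variance = sxx / (n - 1)                            # sample variance
sd = math.sqrt(variance)                            # sample standard deviation

print(n, mean, sxx, round(variance, 3), round(sd, 3))
```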
Graphical Representations
Excel offers a wide range of chart types for displaying data. Many of these are over-elaborate. In particular, 3-D effects can be misleading and should be avoided.

In Excel 2007, to construct a chart for your data:
1. Select the range containing your data, including any row or column labels.
2. On the main ribbon, click on the Insert tab.
3. Under the Charts group of icons, select the chart type required, then the preferred chart subtype.
4. Under Chart Tools on the main ribbon, use the Design, Layout and Format tabs to customise the chart.

In earlier versions of Excel, select the data range and then Insert > Chart to invoke the Chart Wizard.
Permutations and Combinations
Number of different combinations of m objects selected from n objects, nCm    =COMBIN(n,m)
Number of different permutations of m objects selected from n objects, nPm    =PERMUT(n,m)
Standard Probability Distributions
Assuming a random variable X and constants a and b:

Binomial, Bin(n,p)
P(X=a)        =BINOMDIST(a,n,p,FALSE)
P(X≤a)        =BINOMDIST(a,n,p,TRUE)

Normal, N(μ,σ²)
f(a)          =NORMDIST(a,mu,sigma,FALSE)
P(X≤a)        =NORMDIST(a,mu,sigma,TRUE)
P(a≤X≤b)      =NORMDIST(b,mu,sigma,TRUE)-NORMDIST(a,mu,sigma,TRUE)
P(X≥b)        =1-NORMDIST(b,mu,sigma,TRUE)

Exponential, Expon(θ)
f(a)          =EXPONDIST(a,theta,FALSE)
P(X≤a)        =EXPONDIST(a,theta,TRUE)
P(a≤X≤b)      =EXP(-a*theta)-EXP(-b*theta)
P(X≥b)        =EXP(-b*theta)

Gamma, Ga(α,β)
f(a)          =GAMMADIST(a,alpha,beta,FALSE)
P(X≤a)        =GAMMADIST(a,alpha,beta,TRUE)
P(a≤X≤b)      =GAMMADIST(b,alpha,beta,TRUE)-GAMMADIST(a,alpha,beta,TRUE)
P(X≥b)        =1-GAMMADIST(b,alpha,beta,TRUE)
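The interval and upper-tail probabilities for the Normal and Exponential distributions can be checked independently of Excel. The sketch below is a rough illustration in Python using only the standard library; the parameter values are hypothetical, and theta is taken to be the rate parameter, as the EXP formulas in this section imply.

```python
import math
from statistics import NormalDist

# Hypothetical Normal distribution and interval end points.
mu, sigma = 100, 15
a, b = 85, 115
normal = NormalDist(mu, sigma)

p_normal_between = normal.cdf(b) - normal.cdf(a)  # P(a <= X <= b)
p_normal_upper = 1 - normal.cdf(b)                # P(X >= b)

# Hypothetical Exponential distribution with rate theta.
theta = 0.5
lo, hi = 1, 3
p_exp_between = math.exp(-lo * theta) - math.exp(-hi * theta)  # P(lo <= X <= hi)
p_exp_upper = math.exp(-hi * theta)                            # P(X >= hi)

print(round(p_normal_between, 4), round(p_normal_upper, 4))
print(round(p_exp_between, 4), round(p_exp_upper, 4))
```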
Test Statistics for Popular Significance Tests

One-sample test of a mean
Assuming a sample of data in range x, drawn from a population with mean μ and standard deviation σ:
H0: μ = μ0    H1: μ ≠ μ0
Test statistic, z    =(AVERAGE(x)-mu0)/(sigma/SQRT(COUNT(x)))       assuming σ known
Test statistic, t    =(AVERAGE(x)-mu0)/(STDEV(x)/SQRT(COUNT(x)))    assuming σ unknown

One-sample test of a variance
Assuming a sample of data in range x, drawn from a population with mean μ and standard deviation σ:
H0: σ² = σ0²    H1: σ² > σ0²
Test statistic, χ²   =DEVSQ(x)/sigma0^2

Two-sample test of difference between means
Assuming two samples of data in ranges x and y, drawn from populations with means μ1 and μ2 and equal variances:
H0: μ1 - μ2 = c    H1: μ1 - μ2 ≠ c
Estimate the unknown common standard deviation by the pooled estimate:
s                    =SQRT((DEVSQ(x)+DEVSQ(y))/(COUNT(x)+COUNT(y)-2))
Test statistic, t    =(AVERAGE(x)-AVERAGE(y)-c)/(s*SQRT(1/COUNT(x)+1/COUNT(y)))

Two-sample test of ratio of variances
Assuming two samples of data in ranges x and y, drawn from populations with variances σ1² and σ2²:
H0: σ1² = σ2²    H1: σ1² > σ2²
Test statistic, F    =VAR(x)/VAR(y)

Chi-squared test of association
Assuming a two-way contingency table of observed frequencies.
H0: row factor independent of column factor
H1: some association between row and column factors
The suggested layout below for a 4x2 table can easily be modified for tables of other sizes.

A1:     =SUM(C3:D6)
A3:     =SUM(C3:D3)                            copy down to A6
C1:     =SUM(C3:C6)                            copy across to D1
G3:     =$A3*C$1/$A$1                          copy into G3:H6
C8:     =CHITEST(C3:D6,G3:H6)
C9:     =(COUNT(A3:A6)-1)*(COUNT(C1:D1)-1)
C10:    =CHIINV(C8,C9)
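The chi-squared layout can be mirrored in a few lines of Python as a sanity check. This sketch computes the expected frequencies (row total × column total / grand total, as in cell G3) and then the test statistic directly from observed and expected counts, rather than via CHITEST and CHIINV as the worksheet does. The 4x2 table of counts is hypothetical.

```python
# Hypothetical 4x2 contingency table of observed frequencies.
observed = [
    [12, 8],
    [15, 15],
    [9, 21],
    [4, 16],
]

row_totals = [sum(row) for row in observed]        # like A3:A6
col_totals = [sum(col) for col in zip(*observed)]  # like C1:D1
grand_total = sum(row_totals)                      # like A1

# Expected frequency for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1), like C9.
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(expected)
print(round(chi_squared, 4), df)
```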
Critical Values and P-values for Statistical Tests
There are two approaches to conducting significance tests. Some analysts like to compare the test statistic with the critical value for a given significance level; others prefer to calculate the P-value corresponding to the test statistic. Excel can be used for either method.

Assuming significance level α (typically α = 5% or 0.05):

Two-tailed z test
Upper-tail critical value       =NORMSINV(1-alpha/2)
P-value for given z             =2*(1-NORMSDIST(ABS(z)))

Two-tailed t test with v degrees of freedom
Upper-tail critical value       =TINV(alpha,v)
P-value for given t             =TDIST(ABS(t),v,2)

One-tailed χ² test with v degrees of freedom
Upper-tail critical value       =CHIINV(alpha,v)
P-value for given chisquared    =CHIDIST(chisquared,v)

One-tailed F test with v1 degrees of freedom in the numerator and v2 in the denominator
Upper-tail critical value       =FINV(alpha,v1,v2)
P-value for given F             =FDIST(F,v1,v2)
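The two approaches lead to the same decision, which a short sketch for the two-tailed z test can make concrete. The Python below uses the standard library's NormalDist in place of NORMSINV and NORMSDIST; the observed z value is hypothetical.

```python
from statistics import NormalDist

alpha = 0.05  # significance level
z = 2.1       # hypothetical observed test statistic

std_normal = NormalDist()  # mean 0, standard deviation 1

# Critical-value approach: counterpart of =NORMSINV(1-alpha/2).
critical_value = std_normal.inv_cdf(1 - alpha / 2)
reject_by_critical_value = abs(z) > critical_value

# P-value approach: counterpart of =2*(1-NORMSDIST(ABS(z))).
p_value = 2 * (1 - std_normal.cdf(abs(z)))
reject_by_p_value = p_value < alpha

print(round(critical_value, 4), round(p_value, 4))
print(reject_by_critical_value, reject_by_p_value)
```

Whichever route is taken, the null hypothesis is rejected exactly when |z| exceeds the critical value, which is the same condition as the P-value falling below α.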
Confidence Limits
Assuming degree of confidence 100(1-α)% (e.g. for 95% confidence α = 0.05):

One-sample statistics, with data in range x

For μ (σ known)
Lower limit    =AVERAGE(x)-NORMSINV(1-alpha/2)*sigma/SQRT(COUNT(x))
         or    =AVERAGE(x)-CONFIDENCE(alpha,sigma,COUNT(x))
Upper limit    =AVERAGE(x)+NORMSINV(1-alpha/2)*sigma/SQRT(COUNT(x))
         or    =AVERAGE(x)+CONFIDENCE(alpha,sigma,COUNT(x))

For μ (σ unknown)
Lower limit    =AVERAGE(x)-TINV(alpha,COUNT(x)-1)*STDEV(x)/SQRT(COUNT(x))
Upper limit    =AVERAGE(x)+TINV(alpha,COUNT(x)-1)*STDEV(x)/SQRT(COUNT(x))

For σ²
Lower limit    =DEVSQ(x)/CHIINV(alpha/2,COUNT(x)-1)
Upper limit    =DEVSQ(x)/CHIINV(1-alpha/2,COUNT(x)-1)
Two-sample statistics, with data for the first sample in range x, and the second sample in range y

For μx - μy (σx known, σy known)
Lower limit    =AVERAGE(x)-AVERAGE(y)-NORMSINV(1-alpha/2)*SQRT(sigmax^2/COUNT(x)+sigmay^2/COUNT(y))
Upper limit    =AVERAGE(x)-AVERAGE(y)+NORMSINV(1-alpha/2)*SQRT(sigmax^2/COUNT(x)+sigmay^2/COUNT(y))

For μx - μy (σx and σy unknown but assumed equal)
Estimate the unknown common standard deviation by the pooled estimate:
s              =SQRT((DEVSQ(x)+DEVSQ(y))/(COUNT(x)+COUNT(y)-2))
Lower limit    =AVERAGE(x)-AVERAGE(y)-TINV(alpha,COUNT(x)+COUNT(y)-2)*s*SQRT(1/COUNT(x)+1/COUNT(y))
Upper limit    =AVERAGE(x)-AVERAGE(y)+TINV(alpha,COUNT(x)+COUNT(y)-2)*s*SQRT(1/COUNT(x)+1/COUNT(y))

For σx²/σy²
Lower limit    =VAR(x)/VAR(y)/FINV(alpha/2,COUNT(x)-1,COUNT(y)-1)
Upper limit    =VAR(x)/VAR(y)/FINV(1-alpha/2,COUNT(x)-1,COUNT(y)-1)
Simple Linear Regression
In Excel Versions 5 and above, a regression line (or trendline) can be added to a scatterplot by right-clicking on one of the plotted points and selecting Add Trendline from the shortcut menu. Both linear and a variety of non-linear models may be fitted to the data. The equation of the fitted model may be displayed, together with the value of the coefficient of determination, R². There are also options to extrapolate the trendline in either direction, or to force the trendline to have a specific intercept.

The trendline approach is purely graphical. To calculate predictions, regression functions must be used.

Assuming a sample of values of the independent variable in range x, and corresponding values of the dependent variable in range y:

Least squares estimate of intercept, a    =INTERCEPT(y,x)
Least squares estimate of slope, b        =SLOPE(y,x)
Sxy                                       =SUMPRODUCT(x,y)-COUNT(x)*AVERAGE(x)*AVERAGE(y)
Sxx                                       =DEVSQ(x)
Syy                                       =DEVSQ(y)
Sample covariance, Cov(x,y)               =COVAR(x,y)*COUNT(x)/(COUNT(x)-1)
Estimate of σ, s                          =STEYX(y,x)
Prediction of y at x=x0, ŷ=a+bx0          =FORECAST(x0,y,x)
Estimated standard error of individual predicted y at x=x0
    =STEYX(y,x)*SQRT(1+1/COUNT(x)+(x0-AVERAGE(x))^2/DEVSQ(x))
Estimated standard error of mean predicted y at x=x0
    =STEYX(y,x)*SQRT(1/COUNT(x)+(x0-AVERAGE(x))^2/DEVSQ(x))
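To show how the quantities in this table fit together, here is a small Python sketch that builds the slope, intercept, residual standard deviation and a prediction from Sxy, Sxx and Syy. The data and the value of x0 are hypothetical. The relationships used (b = Sxy/Sxx, a = mean of y minus b times mean of x, s² = (Syy - b·Sxy)/(n-2)) are the standard least squares results, not a description of how Excel computes them internally.

```python
import math

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * mean_x * mean_y  # Sxy
sxx = sum((xi - mean_x) ** 2 for xi in x)                         # like DEVSQ(x)
syy = sum((yi - mean_y) ** 2 for yi in y)                         # like DEVSQ(y)

b = sxy / sxx            # slope, counterpart of SLOPE(y,x)
a = mean_y - b * mean_x  # intercept, counterpart of INTERCEPT(y,x)

# Residual standard deviation, counterpart of STEYX(y,x).
s = math.sqrt((syy - b * sxy) / (n - 2))

x0 = 6                   # hypothetical new x value
prediction = a + b * x0  # counterpart of FORECAST(x0,y,x)

print(round(a, 4), round(b, 4), round(s, 4), round(prediction, 4))
```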
Correlation
Assuming two samples of paired data in ranges x and y:

Pearson product-moment correlation coefficient, r    =CORREL(x,y)
Rank Correlation
Assuming two samples of paired data in ranges x and y with no ties:
Rank of ith value in range x    =RANK(INDEX(x,i),x,1)

Assuming two samples of paired data in ranges x and y with some tied values:
Rank of ith value in range x    =(RANK(INDEX(x,i),x,1)-RANK(INDEX(x,i),x,0)+COUNT(x)+1)/2

Assuming that the ranges rx and ry contain the ranks of the data in x and y respectively:
Spearman rank correlation coefficient, rS    =CORREL(rx,ry)

In the example above:

D2:    =RANK(B2,$B$2:$B$7,1)                copy down to D7
E2:    =RANK(B2,$B$2:$B$7,0)                copy down to E7
F2:    =(D2-E2+COUNT($B$2:$B$7)+1)/2        copy down to F7, adjusted for ties
F9:    =CORREL(C2:C7,F2:F7)
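The tie adjustment works because the ascending and descending ranks of a tied value sit at opposite ends of its block of ties, so combining them yields the average rank. The Python sketch below illustrates this on hypothetical data; the helper names average_ranks and pearson are invented for this example.

```python
import math

def average_ranks(values):
    """Ranks with ties averaged, following (ascending - descending + n + 1) / 2."""
    n = len(values)
    ranks = []
    for v in values:
        ascending = 1 + sum(1 for w in values if w < v)   # like RANK(v, x, 1)
        descending = 1 + sum(1 for w in values if w > v)  # like RANK(v, x, 0)
        ranks.append((ascending - descending + n + 1) / 2)
    return ranks

def pearson(u, v):
    """Pearson product-moment correlation, the counterpart of CORREL."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical paired data; x contains one tie (the two 20s).
x = [10, 20, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4, 6]

rx = average_ranks(x)
ry = average_ranks(y)
r_s = pearson(rx, ry)  # Spearman rank correlation coefficient

print(rx)
print(round(r_s, 4))
```

The two tied values share rank 2.5, the average of the ranks 2 and 3 they would otherwise occupy.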
Time Series
The examples below refer to three years of observed quarterly data. Forecasts are made for a further four quarters (one extra year).

Level only
Exponentially weighted moving average

E2:     =B2                          initial level estimate
E3:     =$G2*B3+(1-$G2)*E2           copy down to E13
E14:    =E$13                        copy down to E17

The chart was drawn by highlighting B1:B17 and E1:E17 then using Insert > Charts > Line > 2-D Line.

Level and constant trend

Level and changing trend

C2:     =B2                          initial level estimate
C3:     =$F2*B3+(1-$F2)*(C2+D2)      copy down to C13
D2:     =B3-B2                       initial trend estimate
D3:     =$G2*(C3-C2)+(1-$G2)*D2      copy down to D13
E3:     =C2+D2                       copy down to E13
E14:    =C$13+(A14-A$13)*D$13        copy down to E17
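The level-and-changing-trend recursion can be written as a short loop. The Python sketch below follows the cell formulas for columns C and D and the forecast formula in E14, under two assumptions that are not stated in the leaflet: a single smoothing constant is used for the level and another for the trend, and the time index in column A increases by 1 each period. The function name, the smoothing constants and the quarterly data are hypothetical.

```python
def holt_forecast(series, level_weight, trend_weight, horizon):
    """Forecast `horizon` periods ahead using level and changing trend."""
    level = series[0]              # C2: initial level estimate
    trend = series[1] - series[0]  # D2: initial trend estimate
    for observation in series[1:]:
        # C3: blend the new observation with the previous one-step forecast.
        new_level = level_weight * observation + (1 - level_weight) * (level + trend)
        # D3: blend the latest change in level with the previous trend.
        trend = trend_weight * (new_level - level) + (1 - trend_weight) * trend
        level = new_level
    # E14: final level plus (periods ahead) * final trend.
    return [level + h * trend for h in range(1, horizon + 1)]

# Hypothetical three years of quarterly observations.
quarterly = [52, 55, 61, 58, 63, 66, 71, 69, 74, 78, 83, 80]
forecasts = holt_forecast(quarterly, level_weight=0.3, trend_weight=0.2, horizon=4)
print([round(value, 2) for value in forecasts])
```

A quick check of the logic: for a perfectly linear series the level tracks the data and the trend stays constant, so the forecasts simply continue the line.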
Level, changing trend and seasonality

C5:     =AVERAGE(B2:B5)                      initial level estimate
C6:     =G$2*B6/E2+(1-G$2)*(C5+D5)           copy down to C13
D5:     =(AVERAGE(B6:B9)-C5)/4               initial trend estimate
D6:     =H$2*(C6-C5)+(1-H$2)*D5              copy down to D13
E2:     =B2/C$5                              copy down to E5, initial seasonal estimates
E6:     =I$2*B6/C6+(1-I$2)*E2                copy down to E13
F6:     =(C5+D5)*E2                          copy down to F13
F14:    =(C$13+(A14-A$13)*D$13)*E10          copy down to F17

Version 9, 9 June 2009