
Boxplots and Outliers

Earl F Glynn

Beginners Workshop
4 Oct 2014

Gist: https://gist.github.com/EarlGlynn/a13b651289eff61a2201
Outline
fivenum quartile summary
Interquartile range (IQR)
Boxplot: visual display of fivenum summary
Outliers
Notched boxplots
Examples of using boxplots to identify outliers
fivenum Quartile Summary

IQR = Interquartile Range = Middle 50%
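
The five-number summary and the IQR can be sketched in a few lines. This is a minimal Python sketch, assuming the Tukey "hinge" definition that R's fivenum uses; the function names and the sample data are illustrative and do not come from the slides. Hinges can differ slightly from quantile-based quartiles for some sample sizes.

```python
import math


def fivenum(values):
    """Tukey five-number summary: min, lower hinge, median, upper hinge, max."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("fivenum needs at least one value")
    n4 = math.floor((n + 3) / 2) / 2
    # 1-based positions of the five summary points; a position ending in .5
    # means "average the two neighbouring order statistics".
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1]) for p in positions]


def hinge_spread(values):
    """IQR measured between the hinges: the width of the middle 50%."""
    summary = fivenum(values)
    return summary[3] - summary[1]


# Illustrative data (an assumption, not the vector used in the talk).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
summary = fivenum(data)
spread = hinge_spread(data)
print("fivenum:", summary)
print("IQR:", spread)
```

The box of a boxplot is drawn between the second and fourth values of this summary, with the median line at the third.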
Boxplot: Visual Display of fivenum Summary
[Figure: boxplot with the IQR marked; y-axis from -1.0 to 1.0]

Tukey, John W., Exploratory Data Analysis, Addison-Wesley Publishing Company, 1977,
Section 2B, "Hinges and 5-number summaries."
Boxplot with Outlier
What happens if one value is bad? Let's replace x[3] with the value 20.
[Figure: boxplot with outlier; the replaced value is plotted as a single outlier point on a y-axis from 0 to 20]
Median is a more robust measure of central tendency than the mean
IQR is a more robust measure of dispersion than the standard deviation
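
The robustness claim is easy to check numerically. The sketch below uses a hypothetical nine-value sample (the vector from the talk is not reproduced here) and overwrites its third element with 20, mirroring the x[3] replacement. It relies only on Python's standard statistics module; quartile_spread is an illustrative helper name.

```python
import statistics

# Hypothetical sample; an assumption for illustration only.
clean = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 0.8, 1.0]

# Python lists are 0-based, so index 2 is the third element (x[3] in R).
bad = list(clean)
bad[2] = 20


def quartile_spread(values):
    """IQR from quantile-based quartiles (linear interpolation, inclusive)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


for label, sample in (("clean", clean), ("one bad value", bad)):
    print(
        label,
        "mean=%.3f" % statistics.mean(sample),
        "median=%.3f" % statistics.median(sample),
        "sd=%.3f" % statistics.stdev(sample),
        "IQR=%.3f" % quartile_spread(sample),
    )
```

With one bad value, the mean and standard deviation move a long way, while the median and IQR shift only slightly.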
Boxplot (median, IQR) vs. Normal Distribution (mean, standard deviation)
A boxplot makes no assumptions about the probability distribution.
The IQR contains 50% of the data.
If the data are normal, ±1 standard deviation around the mean contains ~68% of the data.
Median and IQR are more robust than mean and standard deviation.
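
For normal data, the two coverage figures can be checked against the standard normal distribution. This is a small sketch using NormalDist from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)

# Fraction of a normal population within one standard deviation of the mean.
within_one_sd = z.cdf(1.0) - z.cdf(-1.0)

# Quartiles of the standard normal, and the IQR in units of sigma.
q1 = z.inv_cdf(0.25)
q3 = z.inv_cdf(0.75)
iqr_in_sd_units = q3 - q1

print("within 1 sd: %.4f" % within_one_sd)
print("IQR in sd units: %.4f" % iqr_in_sd_units)
print("coverage of IQR: %.4f" % (z.cdf(q3) - z.cdf(q1)))
```

So for normal data the IQR is about 1.35 standard deviations wide, which is why the box (50%) is narrower than the ±1 standard deviation band (~68%).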
How are Outliers Defined?
Look at the code: boxplot.stats
How are Outliers Defined?
Simplified version
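
The rule can be sketched as follows. This is a Python approximation of the logic in R's boxplot.stats, not a copy of its source. It assumes the default coef of 1.5 and ignores missing values and the coef = 0 special case. A point is an outlier when it lies more than coef times the hinge spread beyond a hinge, and the whiskers are pulled in to the most extreme points that are not outliers. The sample data are illustrative.

```python
import math


def fivenum(values):
    """Tukey five-number summary: min, lower hinge, median, upper hinge, max."""
    x = sorted(values)
    n = len(x)
    n4 = math.floor((n + 3) / 2) / 2
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1]) for p in positions]


def boxplot_stats(values, coef=1.5):
    """Simplified outlier rule: beyond coef * IQR from the nearer hinge."""
    x = sorted(values)
    stats = fivenum(x)
    spread = stats[3] - stats[1]
    lower_fence = stats[1] - coef * spread
    upper_fence = stats[3] + coef * spread
    outliers = [v for v in x if v < lower_fence or v > upper_fence]
    inside = [v for v in x if lower_fence <= v <= upper_fence]
    # Whiskers stop at the most extreme data points that are not outliers.
    stats[0], stats[4] = inside[0], inside[-1]
    return {"stats": stats, "n": len(x), "out": outliers}


# Illustrative data with one suspicious value.
result = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(result)
```

Here the hinges are 3 and 8, so the fences sit at -4.5 and 15.5; the value 30 is flagged, and the upper whisker stops at 9.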
Notched Boxplots
[Figure: notched boxplots for groups a to f, in two panels: N = 50 and N = 100]

The notch=TRUE option allows a rough significance test of the difference in medians. Where the
notches do not overlap, the medians are significantly different at roughly the α = 0.05 significance level.
When the notches overlap, there is no evidence of a significant difference between the medians.
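
The notch comparison can be sketched numerically. This minimal Python sketch assumes the notch extends ±1.58 × IQR / sqrt(n) around the median, which is the formula documented for R's boxplot.stats. The helper names and the three groups are illustrative, not data from the talk.

```python
import math


def fivenum(values):
    """Tukey five-number summary: min, lower hinge, median, upper hinge, max."""
    x = sorted(values)
    n = len(x)
    n4 = math.floor((n + 3) / 2) / 2
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    return [0.5 * (x[math.floor(p) - 1] + x[math.ceil(p) - 1]) for p in positions]


def notch(values):
    """Notch limits: median +/- 1.58 * IQR / sqrt(n) (assumed formula)."""
    stats = fivenum(values)
    half_width = 1.58 * (stats[3] - stats[1]) / math.sqrt(len(values))
    return (stats[2] - half_width, stats[2] + half_width)


def notches_overlap(a, b):
    """True when the two notch intervals share at least one point."""
    lo_a, hi_a = notch(a)
    lo_b, hi_b = notch(b)
    return lo_a <= hi_b and lo_b <= hi_a


# Illustrative groups: b is shifted far from a, c only slightly.
a = list(range(1, 21))
b = [v + 10 for v in a]
c = [v + 1 for v in a]

print("notch a:", notch(a))
print("a vs b overlap:", notches_overlap(a, b))
print("a vs c overlap:", notches_overlap(a, c))
```

Because the half-width shrinks with sqrt(n), the notches in an N = 100 panel are narrower than those in an N = 50 panel for the same spread.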
Notched Boxplots

From InnoCentive.com submission
Examples of Using Boxplots to Identify Outliers
Identify problem images in Kaggle competition facial images
Congressional disbursements
PULSE diagrams to study political money
Shawnee County, KS public salaries
Kaggle Competition: Facial Keypoints Detection
http://www.kaggle.com/

Images:
7049 Train
1783 Test

96 x 96 pixels
How can problem images be identified?
[Figure: example images labeled Normal, Eyes Unusually Far Apart, and Eyes Unusually Close Together]
Eyes unusually close together
Eyes unusually far apart
Congressional Disbursements

What expenses are normal?
PULSE Diagrams to Study Political Money
No longer online at FollowTheMoney.org

PULSE = Political Contribution Logarithmic Scatterplot Profile

Kansas Senate Winners and Losers in 2008
Shawnee County, Kansas Public Salaries
Take Away
Great tools for exploratory data analysis:
fivenum summary (or use the quantile function)
boxplot summary
Median and IQR: robust statistics for any distribution
Outliers: bad data or possibly something interesting
