PCA 101 Tutorial, Spider Financial Corp, 2013

Tutorial: Principal Component 101
This is the first entry in what will become an ongoing series on principal component analysis (PCA). In this tutorial, we will start with the general definition, motivation, and applications of PCA, and then use NumXL to carry out such an analysis. Next, we will closely examine the different output elements in an attempt to develop a solid understanding of PCA, which will pave the way to a more advanced treatment in future issues.
In this tutorial, we will use the socioeconomic data provided by Harman (1976). The five variables represent total population (Population), median school years (School), total employment (Employment), miscellaneous professional services (Services), and median house value (House Value). Each observation represents one of twelve census tracts in the Los Angeles Standard Metropolitan Statistical Area.
Data Preparation
First, let's organize our input data. We place the values of each variable in a separate column, and each observation (i.e. census tract in LA) on a separate row.

Note that the scales (i.e. magnitudes) of the variables vary significantly, so any analysis of the raw data will be biased toward the variables with a larger scale and will downplay the effect of those with a smaller scale.
To better understand the problem, let's compute the correlation matrix for the 5 variables:
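As a rough illustration of this step outside Excel, the Python sketch below computes a Pearson correlation matrix. The numbers are made up for the example (they are not the Harman data), and the helper names `pearson` and `correlation_matrix` are ours, not part of NumXL. Unlike a raw covariance, the correlation coefficient does not depend on the scale of each variable.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlation matrix for a list of columns (one list per variable)."""
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Hypothetical values on very different scales (NOT the Harman data).
population = [1000.0, 3400.0, 5700.0, 8200.0, 9900.0]
school = [10.5, 9.0, 12.5, 11.0, 13.0]
employment = [500.0, 1100.0, 2400.0, 2700.0, 3500.0]

corr = correlation_matrix([population, school, employment])
for row in corr:
    print(["%.3f" % v for v in row])
```

Rescaling any column (say, expressing population in thousands) leaves every entry of this matrix unchanged, which is why the correlation matrix is a convenient starting point when the variables have different magnitudes.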

The five (5) variables are highly correlated, so one may wonder:
1. If we were to use those variables to predict another variable, do we need all 5 variables?
2. Are there hidden forces (drivers or other factors) that move those 5 variables?
In practice, we often encounter correlated data series: commodity prices in different locations, futures prices for different contracts, stock prices, interest rates, etc.
In plain English, what is principal component analysis (PCA)?
PCA is a technique that takes a set of correlated variables and linearly transforms those variables into a set of uncorrelated factors.
To explain it further, you can think about PCA as an axis-system transformation. Let's examine this plot of two correlated variables:

Simply put, in the (X, Y) Cartesian system, the data points are highly correlated. By transforming (rotating) the axes into (Z, W), the data points are no longer correlated.
In theory, PCA finds the transformation (of the axes) under which the data points look uncorrelated with respect to the new axes.
OK, now where are the principal components?
To transform the data points from the (X, Y) Cartesian system to (Z, W), we need to compute the z and w values of each data point:

$$
\begin{aligned}
z_i &= \alpha_1 x_i + \beta_1 y_i \\
w_i &= \alpha_2 x_i + \beta_2 y_i
\end{aligned}
$$

In effect, we are replacing the input variables $(x_i, y_i)$ with those of $(z_i, w_i)$. The $(z_i, w_i)$ values are the ones we refer to as the principal components.
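To make the rotation concrete, here is a minimal Python sketch for the two-variable case. It assumes the transformation is a pure rotation by an angle $\theta$, so that $\alpha_1 = \cos\theta$, $\beta_1 = \sin\theta$, $\alpha_2 = -\sin\theta$, $\beta_2 = \cos\theta$, and it uses the closed-form angle of the principal axes of a 2x2 covariance matrix. The data points and all variable names are hypothetical.

```python
import math

def covariance(u, v):
    """Population covariance of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n

# Hypothetical, strongly correlated (x, y) points.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]

# Rotation angle of the principal axes of the 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * covariance(x, y),
                         covariance(x, x) - covariance(y, y))

# Coefficients of the linear transformation (alpha and beta in the text),
# assumed here to form a pure rotation.
alpha1, beta1 = math.cos(theta), math.sin(theta)
alpha2, beta2 = -math.sin(theta), math.cos(theta)

z = [alpha1 * a + beta1 * b for a, b in zip(x, y)]
w = [alpha2 * a + beta2 * b for a, b in zip(x, y)]

print("cov(x, y) = %.4f" % covariance(x, y))
print("cov(z, w) = %.2e" % covariance(z, w))  # zero up to rounding error
```

The rotated coordinates carry the same total variance as the originals, but nearly all of it ends up on the z axis.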

Alright, how do we reduce the dimensions of the variables?
When we transform the values of the data points $(x_i, y_i)$ into the new axis system $(z_i, w_i)$, we may find that a few axes capture more of the values' variation than others. For instance, in our example above, we may claim that all $w_i$ values are plain zero and don't really matter.
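$$
\begin{aligned}
x_i &= \gamma_1 z_i + \delta_1 w_i \\
y_i &= \gamma_2 z_i + \delta_2 w_i
\end{aligned}
$$
With every $w_i$ set to zero, this reduces to:
$$
\begin{aligned}
x_i &= \gamma_1 z_i \\
y_i &= \gamma_2 z_i
\end{aligned}
$$
Here $\gamma$ and $\delta$ stand for the coefficients of the inverse transformation from $(z_i, w_i)$ back to $(x_i, y_i)$.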
In effect, the two-dimensional system $(z_i, w_i)$ is reduced to a one-dimensional system ($z_i$).
Of course, for this example, dropping the W factor distorts our data, but for higher dimensions it may not be so bad.
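The size of that distortion can be measured. The sketch below, on made-up centered data, rotates two variables onto their principal axes, rebuilds $(x_i, y_i)$ from $z_i$ alone, and compares the result with the original points. It assumes a pure rotation, so the back-transformation coefficients are simply $\cos\theta$ and $\sin\theta$; all names and values are hypothetical.

```python
import math

def mean(v):
    return sum(v) / len(v)

# Hypothetical, strongly correlated points, centered so that both
# variables vary around zero.
x_raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_raw = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
x = [a - mean(x_raw) for a in x_raw]
y = [b - mean(y_raw) for b in y_raw]

sxx = mean([a * a for a in x])
syy = mean([b * b for b in y])
sxy = mean([a * b for a, b in zip(x, y)])

# Rotate onto the principal axes of the 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)
z = [c * a + s * b for a, b in zip(x, y)]
w = [-s * a + c * b for a, b in zip(x, y)]

# Drop W: rebuild (x, y) from z alone, i.e. pretend every w_i is zero.
x_hat = [c * zi for zi in z]
y_hat = [s * zi for zi in z]

var_z = mean([v * v for v in z])
var_w = mean([v * v for v in w])
mse = mean([(a - ah) ** 2 + (b - bh) ** 2
            for a, ah, b, bh in zip(x, x_hat, y, y_hat)])

print("share of variance kept by z: %.4f" % (var_z / (var_z + var_w)))
print("mean squared reconstruction error: %.6f" % mse)
print("variance of the dropped component: %.6f" % var_w)
```

For centered data and a pure rotation, the mean squared reconstruction error equals the variance of the dropped component, so a low-variance component costs little when it is removed.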
Which component should we drop?
In practice, we order the components (aka factors) in terms of their variance (highest first) and examine the effect of removing the ones with the lowest variance (rightmost) in an effort to reduce the dimension of the data set with minimal loss of information.
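One simple way to act on this ordering is to keep the highest-variance components until they cover a chosen share of the total variance. The sketch below shows that rule; the function name `components_to_keep`, the thresholds, and the variance values are all illustrative assumptions, not NumXL output.

```python
def components_to_keep(variances, threshold=0.95):
    """Return how many of the highest-variance components are needed
    to reach `threshold` (a fraction) of the total variance."""
    ordered = sorted(variances, reverse=True)  # highest variance first
    total = sum(ordered)
    cumulative = 0.0
    for count, var in enumerate(ordered, start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return count
    return len(ordered)

# Hypothetical component variances (not the tutorial's output).
variances = [0.4, 2.5, 0.1, 1.5, 0.5]

print(components_to_keep(variances, 0.85))  # 3
print(components_to_keep(variances, 0.95))  # 4
```

The threshold is a modeling choice; a stricter value keeps more components and loses less information.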
Why should we care about principal components?
A risk manager can quantify their overall risk in terms of a portfolio's aggregate exposure to a handful of drivers, instead of tens or hundreds of correlated securities prices. Furthermore, designing an effective hedging strategy is vastly simplified.
For traders, quantifying trades in terms of their sensitivities (e.g. delta, gamma, etc.) to those drivers gives traders options to substitute (or trade) one security for another, construct a trading strategy, hedge, synthesize a security, etc.
A data modeler can reduce the number of input variables with minimal loss of information.
Process
Now we are ready to conduct our PCA. First, select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the PCA icon in the NumXL tab (or toolbar).

The PCA Wizard pops up.

Select the cells range for the five input variable values.
Notes:
1. The cells range may include the heading (label) cell, which is then used in the output tables wherever they reference those variables.
2. The input variables (i.e. X) are already grouped by columns (each column represents a variable), so we don't need to change that.
3. Leave the Variable Mask field blank for now. We will revisit this field in later entries.
4. By default, the output cells range is set to the currently selected cell in your worksheet.
Finally, once we select the input cells range, the Options and Missing Values tabs become available (enabled).
Next, select the Options tab.

Initially, the tab is set to the following values:
- Standardize Input is checked. This option in effect replaces the values of each variable with its standardized version (i.e. subtract the mean and divide by the standard deviation). This option overcomes the bias issue when the values of the input variables have different magnitude scales. Leave this option checked.
- Principal Component Output is checked. This option instructs the wizard to generate PCA-related tables. Leave it checked.
- Under Principal Component, check the Values option to display the values for each principal component.
- The significance level (aka $\alpha$) is set to 5%.
- The Input Variables option is unchecked. Leave it unchecked for now.
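For readers who want to see what standardization does to a column, here is a minimal Python sketch. It assumes the sample standard deviation (n - 1 divisor); the source does not say which divisor NumXL uses, and the column values and the helper name `standardize` are hypothetical.

```python
import statistics

def standardize(values):
    """Z-scores: subtract the mean, divide by the standard deviation.
    Assumption: the sample standard deviation (n - 1 divisor) is used."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - m) / sd for v in values]

# Hypothetical columns on very different scales (not the Harman data).
population = [1000.0, 3400.0, 5700.0, 8200.0, 9900.0]
school = [10.5, 9.0, 12.5, 11.0, 13.0]

pop_z = standardize(population)
school_z = standardize(school)

for name, col in (("population", pop_z), ("school", school_z)):
    print(name,
          "mean=%.3f" % abs(statistics.mean(col)),
          "stdev=%.3f" % statistics.stdev(col))
```

After this step every column has zero mean and unit standard deviation, so no variable dominates the analysis merely because it is measured in larger units.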
Now, click on the Missing Values tab.

In this tab, you can select an approach to handle missing values in the data set. By default, an observation with a missing value in any variable is excluded from the analysis.
This treatment is a good approach for our analysis, so let's leave it unchanged.
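The default treatment described above is often called listwise deletion. A minimal Python sketch of the idea follows; the helper name `drop_incomplete_rows` and the observations are hypothetical, and this is not how NumXL itself is implemented.

```python
import math

def drop_incomplete_rows(rows):
    """Keep only observations in which every value is present.
    A value counts as missing when it is None or NaN."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [row for row in rows if not any(is_missing(v) for v in row)]

# Hypothetical observations: (population, school years, employment).
rows = [
    (5200.0, 12.1, 2300.0),
    (1100.0, None, 650.0),          # missing school years -> dropped
    (3300.0, 9.2, float("nan")),    # missing employment -> dropped
    (4100.0, 13.0, 1800.0),
]

complete = drop_incomplete_rows(rows)
print(len(rows), "observations in,", len(complete), "kept")
```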
Now, click OK to generate the output tables.

Analysis
1. PCA Statistics

1. The principal components are ordered (and named) according to their variance in descending order, i.e. PC(1) has the highest variance.
2. In the second row, the proportion statistics give the percentage of variation in the original data set (5 variables combined) that each principal component captures or accounts for.
3. The cumulative proportion is a measure of the total variation explained by the principal components up to the current component.
Note: In our example, the first three PCs account for 94.3% of the variation of the 5 variables.
4. Note that the variances of the PCs should sum to the number of input variables, which in this case is five (5), because each standardized input variable has unit variance.
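The statistics in this table can be reproduced in a few lines of NumPy. The sketch below treats PCA on standardized inputs as an eigen-decomposition of the correlation matrix, which is the textbook formulation and not necessarily NumXL's internal algorithm. The data are randomly generated stand-ins for twelve observations of five correlated variables, not the Harman data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 12 observations of 5 correlated variables,
# built from two hidden drivers plus noise (not the Harman data).
drivers = rng.normal(size=(12, 2))
mixing = rng.normal(size=(2, 5))
data = drivers @ mixing + 0.3 * rng.normal(size=(12, 5))

# PCA on standardized inputs: eigenvalues of the correlation matrix.
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # descending order

proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

for k, (var, p, c) in enumerate(zip(eigenvalues, proportion, cumulative), 1):
    print(f"PC({k})  variance={var:.3f}  proportion={p:.1%}  cumulative={c:.1%}")
print("sum of variances:", round(float(eigenvalues.sum()), 6))
```

The sum of the variances equals the trace of the correlation matrix, which is the number of input variables.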
2. Loadings
In the loadings table, we outline the weights of the linear transformation from the input variable (standardized) coordinate system to the principal components.

For example, the linear transformation for $PC_1$ is expressed as follows:

$$PC_1 = 0.227 X_1 + 0.503 X_2 + 0.339 X_3 + 0.56 X_4 + 0.516 X_5$$
Note:
1. The squared loadings in each column add up to one:

$$\sum_{j=1}^{5} \beta_j^2 = 1$$

[Figure: Loadings of the five input variables (population, median school yrs, total employment, misc professional services, median house value) on PC(1), PC(2), and PC(3); vertical axis from -80% to 80%.]
2. In the graph above, we plotted the loadings of our input variables on the first three components.
3. The median school years, misc. professional services, and median house value variables have comparable loadings in PC(1); next comes the total employment loading and, finally, population. One may propose this component as a proxy for a wealth/income factor.
4. Interpreting the loadings of the input variables in the remaining components proves to be more difficult and requires a deeper level of domain expertise.
5. Finally, computing the input variables back from the PCs can be easily done by applying the weights in the row instead of the column. For example, the population factor is expressed as follows:
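$$X_1 = 0.227\,PC_1 \pm 0.657\,PC_2 \pm 0.64\,PC_3 \pm 0.308\,PC_4 \pm 0.109\,PC_5$$
(The signs of the last four terms are not recoverable from the source text.)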
6. We'll discuss the PC loadings later in this tutorial.
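The column and row readings of the loadings table can be checked numerically. In the NumPy sketch below, the loadings are taken to be the eigenvectors of the correlation matrix (an assumption consistent with the unit sum of squares noted above), the data are randomly generated rather than the Harman data, and standardization uses the sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 12 observations of 5 correlated variables.
data = (rng.normal(size=(12, 2)) @ rng.normal(size=(2, 5))
        + 0.3 * rng.normal(size=(12, 5)))

# Standardize (sample standard deviation assumed).
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Loadings: eigenvectors of the correlation matrix, one column per PC,
# ordered by descending variance.
eigenvalues, loadings = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
loadings = loadings[:, order]

# Each column's squared loadings add up to one.
print((loadings ** 2).sum(axis=0))

# Forward transform: each PC uses one COLUMN of the loadings table.
pcs = z @ loadings

# Back transform: each input variable uses one ROW of the loadings table.
z_back = pcs @ loadings.T
first_variable = pcs @ loadings[0, :]

print(np.allclose(z_back, z))
```

Because the loadings matrix is orthogonal, its transpose undoes the transformation, which is why reading the same table by rows recovers the standardized inputs.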
3. Principal Component Values

In the PC values table, we calculate the transformation output value for each dimension (i.e. component), so the 1st row corresponds to the 1st data point, and so on.
The variance of each column matches the value in the PCA statistics table. Using Excel, compute the biased version of the variance function (VARA).
By definition, the values in the PCs are uncorrelated. To verify, we can calculate the correlation matrix:
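Both checks can also be sketched in NumPy. The example below uses randomly generated data (not the Harman data) and assumes the sample (n - 1) divisor for both the standardization and the variance. The column variances match the eigenvalues only when the two steps use the same divisor; which divisor NumXL uses is not stated in the source.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 12 observations of 5 correlated variables.
data = (rng.normal(size=(12, 2)) @ rng.normal(size=(2, 5))
        + 0.3 * rng.normal(size=(12, 5)))

# Standardize with the sample standard deviation (assumption).
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

eigenvalues, vectors = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, vectors = eigenvalues[order], vectors[:, order]

# PC values: one row per observation, one column per component.
pcs = z @ vectors

# Column variances match the PC variances (same n - 1 divisor throughout).
pc_variances = pcs.var(axis=0, ddof=1)
print(np.allclose(pc_variances, eigenvalues))

# The correlation matrix of the PC values is the identity matrix.
pc_corr = np.corrcoef(pcs, rowvar=False)
print(np.allclose(pc_corr, np.eye(5), atol=1e-8))
```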

Conclusion
In this tutorial, we converted a set of five correlated variables into five uncorrelated variables without any loss of information.
Furthermore, we examined the proportion (and cumulative proportion) of each component as a measure of the variance captured by each component, and we found that the first three factors (components) account for 94.3% of the five variables' variation, and the first four components account for 98%.
What do we do now?
One of the applications of PCA is dimension reduction; as in, can we drop one or more components and yet retain the information in the original data set for modeling purposes?
In our second entry, we will look at the variation of each input variable captured by the principal components (micro-level) and compute the fitted values using a reduced set of PCs.
We will cover this particular issue in a separate entry of our series.