
PCA 102 Tutorial, Spider Financial Corp, 2013

Tutorial: Principal Component 102
This is the second entry in our principal components analysis (PCA) series. In this tutorial, we will resume our discussion on dimension reduction using a subset of the principal components with a minimal loss of information. We will use NumXL and Excel to carry out our analysis, closely examining the different output elements in an attempt to develop a solid understanding of PCA, which will pave the way to a more advanced treatment in future issues.
In this tutorial, we will continue to use the socioeconomic data provided by Harman (1976). The five variables represent total population (Population), median school years (School), total employment (Employment), miscellaneous professional services (Services), and median house value (House Value). Each observation represents one of twelve census tracts in the Los Angeles Standard Metropolitan Statistical Area.

Process
Now we are ready to conduct our principal component analysis. First, select an empty cell in your worksheet where you wish the output to be generated, then locate and click on the principal component (PCA) icon in the NumXL tab (or toolbar).

The Regression Wizard will appear.


Select the cell range for the five input variable values.
Notes:
1. The cell range may include the heading (label) cell, which is then used in the output tables wherever they reference those variables.
2. The input variables (i.e. X) are already grouped in columns (each column represents a variable), so we don't need to change that.
3. Leave the Variables Mask field blank for now. We will revisit this field in later entries.
Next, select the Options tab.


Initially, the tab is set to the following values:
- Standardize Inputs is checked. Leave this option checked.
- Principal Component Output is checked. Uncheck it.
- The significance level (a.k.a. α) is set to 5%.
- Input Variables is unchecked. Check this option.
- Set No. of PCs included to 3. This can be done now or altered later in the output tables, as our formulas are dynamic.
- Under Input Variables, check the Values option, so the generated output tables include a fitted value for the input variables using a reduced set of components.
Now, click the Missing Values tab.

In this tab, you can select an approach to handle missing values in the data set (X and Y). By default, any missing value found in an observation excludes that observation from the analysis.
This treatment is a good approach for our analysis, so let's leave it unchanged.
Now, click OK to generate the output tables.
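
For readers who want to follow the same steps outside Excel, the options chosen above (standardized inputs, observations with missing values excluded) can be sketched in NumPy. This is a minimal illustration under our own assumptions, not NumXL's implementation: the function name `pca_standardized` is hypothetical, and the data are synthetic stand-ins rather than the Harman (1976) values.

```python
import numpy as np

def pca_standardized(data):
    """PCA on the correlation matrix of `data` (rows = observations)."""
    data = np.asarray(data, dtype=float)
    # Listwise deletion: drop any observation that has a missing value.
    complete = data[~np.isnan(data).any(axis=1)]
    # Standardize each column (the "Standardize Inputs" option).
    mean = complete.mean(axis=0)
    std = complete.std(axis=0, ddof=1)
    z = (complete - mean) / std
    # Eigen-decomposition of the correlation matrix.
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]            # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs                         # principal component values
    return z, eigvals, eigvecs, scores, mean, std

# Synthetic stand-in data (assumption): 13 rows of 5 correlated variables,
# one of which contains a missing value and is therefore excluded.
rng = np.random.default_rng(0)
raw = rng.normal(size=(13, 5)) @ rng.normal(size=(5, 5))
raw[3, 2] = np.nan
z, eigvals, eigvecs, scores, mean, std = pca_standardized(raw)
print(z.shape)
print(np.round(eigvals / eigvals.sum(), 3))      # proportion of variance per PC
```

The columns of `eigvecs` hold the loadings of each principal component, and `eigvals` holds the component variances used throughout the analysis that follows.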


Analysis
1. Statistics


In this table, we show the percentage of variance of each input variable accounted for (a.k.a. final communality) using the first three (3) factors. Unlike the cumulative proportion, this statistic relates to one input variable at a time.
Using this table, we can detect which input variables are poorly represented (i.e. adversely affected) by our dimension reduction. In this example, median school years has the lowest value, yet its final communality is still around 92%.
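
As a rough illustration of the difference between the two statistics, the sketch below computes the final communality of each variable and the cumulative proportion for the first three components. It assumes synthetic data (not the Harman values) and a plain NumPy eigen-decomposition of the correlation matrix, which may differ in detail from what NumXL does internally.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a 12 x 5 data table (assumption, not the Harman data).
x = rng.normal(size=(12, 5)) @ rng.normal(size=(5, 5))

z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # descending order

k = 3
# Share of each variable's (unit) variance carried by the first k PCs.
communality = (eigvecs[:, :k] ** 2 * eigvals[:k]).sum(axis=1)
# Share of the total variance carried by the first k PCs.
cumulative_proportion = eigvals[:k].sum() / eigvals.sum()

for i, c in enumerate(communality):
    print(f"variable {i + 1}: final communality = {c:.1%}")
print(f"cumulative proportion = {cumulative_proportion:.1%}")
```

With standardized inputs, the average of the per-variable communalities equals the cumulative proportion, so a high cumulative proportion can still hide one poorly represented variable.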
2. Loadings
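In the loading table, we outline the weights of the principal components in each input variable.
To compute the values of an input variable from the PC values, we use the weights above to linearly transform them back. For example, the population variable ($X_1$) is expressed in terms of the first three components as follows:
$$X_1 = \beta_{1,1}\,PC_1 + \beta_{1,2}\,PC_2 + \beta_{1,3}\,PC_3$$
where the population loadings on $PC_1$, $PC_2$, and $PC_3$ are 0.23, 0.66, and 0.64, as read from the loading table.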
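Notes:
- This table is basically the transpose (rows turned into columns) of the table we saw in the variable loadings for PCs.
- The sum of squares of each row must be 1.
- $PC_1, PC_2, PC_3, \dots, PC_m$ are uncorrelated, so to compute the variance of the standardized variable $X_i$ using the first $k$ components:
$$\mathrm{Var}[\hat{X}_i] = \beta_{i,1}^2\sigma_1^2 + \beta_{i,2}^2\sigma_2^2 + \dots + \beta_{i,k}^2\sigma_k^2$$
$$X_i = \frac{x_i - \bar{x}_i}{\sigma_{x_i}}, \text{ thus } \mathrm{Var}[X_i] = 1$$
$$\beta_{i,1}^2\sigma_1^2 + \beta_{i,2}^2\sigma_2^2 + \dots + \beta_{i,k}^2\sigma_k^2 \le 1$$
$$\beta_{i,1}^2\sigma_1^2 + \beta_{i,2}^2\sigma_2^2 + \dots + \beta_{i,m}^2\sigma_m^2 = 1$$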


Where:
- $\beta_{i,k}$ is the loading of the k-th principal component for the input variable $X_i$.
- $\sigma_k^2$ is the variance of the k-th principal component.
- $\hat{X}_i$ is the estimate for the standardized input variable using the first $k$ components.
- By definition, $\mathrm{Var}[\hat{X}_i]$ is the final communality.
- The variance of the fitted input variable in the original (non-standardized) scale is expressed as follows:
$$\mathrm{Var}[\hat{x}_i] = (\beta_{i,1}^2\sigma_1^2 + \beta_{i,2}^2\sigma_2^2 + \dots + \beta_{i,k}^2\sigma_k^2)\,\sigma_{x_i}^2$$
- Reducing the dimension, in essence, reduces the variance of the input variables (a low-pass filter).
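
The variance relations above can be checked numerically. The sketch below is an illustration on synthetic data (an assumption; the Harman values are not reproduced here), using the eigenvectors of the correlation matrix as loadings and the eigenvalues as PC variances.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(12, 5)) @ rng.normal(size=(5, 5))    # synthetic data

mean, std = x.mean(axis=0), x.std(axis=0, ddof=1)
z = (x - mean) / std                                      # standardized inputs
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # descending order
scores = z @ eigvecs                                      # PC values

k = 3
z_hat = scores[:, :k] @ eigvecs[:, :k].T   # fitted standardized variables
x_hat = mean + std * z_hat                 # fitted variables, original scale

row_sum_sq = (eigvecs ** 2).sum(axis=1)                   # sum of squared loadings per variable
full_var = (eigvecs ** 2 * eigvals).sum(axis=1)           # all m components
communality = (eigvecs[:, :k] ** 2 * eigvals[:k]).sum(axis=1)  # first k components

print(np.allclose(row_sum_sq, 1.0), np.allclose(full_var, 1.0))
print(np.allclose(z_hat.var(axis=0, ddof=1), communality))
print(np.allclose(x_hat.var(axis=0, ddof=1), communality * std ** 2))
```

Each printed check corresponds to one of the relations: the unit row sum of squares, the unit variance recovered from all components, the final communality as the variance of the fitted standardized variable, and its rescaling to the original units.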
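- What about the correlation among the original variables? How is it affected by reducing the dimensions?
$$\hat{X}_i = \beta_{i,1}PC_1 + \beta_{i,2}PC_2 + \dots + \beta_{i,k}PC_k$$
$$\hat{X}_j = \beta_{j,1}PC_1 + \beta_{j,2}PC_2 + \dots + \beta_{j,k}PC_k$$
$$\mathrm{Cov}[\hat{X}_i,\hat{X}_j] = E[\hat{X}_i\hat{X}_j] = E[\beta_{i,1}\beta_{j,1}PC_1^2 + \beta_{i,2}\beta_{j,2}PC_2^2 + \dots + \beta_{i,k}\beta_{j,k}PC_k^2]$$
$$\rho_{\hat{x}_i\hat{x}_j} = \mathrm{Cov}[\hat{X}_i,\hat{X}_j] = \sum_{l=1}^{k}\beta_{i,l}\beta_{j,l}\sigma_l^2$$
$$\mathrm{Cov}[X_i,X_j] = \rho_{x_i x_j} = \sum_{l=1}^{m}\beta_{i,l}\beta_{j,l}\sigma_l^2$$
$$\rho_{x_i x_j} = \rho_{\hat{x}_i\hat{x}_j} + \sum_{l=k+1}^{m}\beta_{i,l}\beta_{j,l}\sigma_l^2$$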
Where:
- $\hat{X}_i$ is the standardized fitted i-th variable.
- $X_i$ is the original standardized i-th variable.
- $x_i$ is the original non-standardized i-th variable.
- $\hat{x}_i$ is the fitted non-standardized i-th variable.
The correlation between the fitted variables (i.e. reduced dimensions) may be slightly different, depending on the variance of the dropped factors and the loading of those factors in each variable.
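
To see how much the dropped components matter, the decomposition of the correlation matrix into a kept part and a dropped part can be computed directly. This is a sketch on synthetic data (an assumption), with eigenvectors of the correlation matrix standing in for the loadings.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(12, 5)) @ rng.normal(size=(5, 5))    # synthetic data

z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
corr = np.corrcoef(z, rowvar=False)                       # original correlations
eigvals, eigvecs = np.linalg.eigh(corr)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # descending order

k = 3
# Sum over the first k components of loading_i * loading_j * variance.
kept = (eigvecs[:, :k] * eigvals[:k]) @ eigvecs[:, :k].T
# Same sum over the dropped components k+1..m.
dropped = (eigvecs[:, k:] * eigvals[k:]) @ eigvecs[:, k:].T

z_hat = z @ eigvecs[:, :k] @ eigvecs[:, :k].T             # fitted standardized variables
cov_fitted = np.cov(z_hat, rowvar=False)
corr_fitted = np.corrcoef(z_hat, rowvar=False)

print(np.allclose(cov_fitted, kept))          # covariance of the fitted variables
print(np.allclose(corr, kept + dropped))      # original = kept + dropped
print(np.round(np.abs(corr_fitted - corr).max(), 3))  # largest shift in correlation
```

The last line reports the largest difference between the correlation of the fitted variables and the original correlation; its size depends on the variances of the dropped components and their loadings.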
- How about the covariance between the (non-standardized) variables?


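$$x_i = \bar{x}_i + \sigma_{x_i}X_i, \qquad \hat{x}_i = \bar{x}_i + \sigma_{x_i}\hat{X}_i$$
$$\mathrm{Cov}[x_i,x_j] = E[(x_i-\bar{x}_i)(x_j-\bar{x}_j)] = \sigma_{x_i}\sigma_{x_j}E[X_iX_j] = \sigma_{x_i}\sigma_{x_j}\rho_{x_i x_j}$$
$$\mathrm{Cov}[\hat{x}_i,\hat{x}_j] = \sigma_{x_i}\sigma_{x_j}\Big(\rho_{x_i x_j} - \sum_{l=k+1}^{m}\beta_{i,l}\beta_{j,l}\sigma_l^2\Big)$$
$$\mathrm{Cov}[\hat{x}_i,\hat{x}_j] = \mathrm{Cov}[x_i,x_j] - \sigma_{x_i}\sigma_{x_j}\sum_{l=k+1}^{m}\beta_{i,l}\beta_{j,l}\sigma_l^2$$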

In sum, the relative change in covariance is equal to the change in correlation between the two variables.
Note: Reducing the number of factors alters the statistical characteristics of the underlying data set, so extreme care must be taken.
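
The same relation can be verified in the original units. The sketch below, again on synthetic data (an assumption), compares the covariance matrix of the fitted variables with that of the originals and scales the difference by the product of the standard deviations.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=(12, 5)) @ rng.normal(size=(5, 5))    # synthetic data

mean, std = x.mean(axis=0), x.std(axis=0, ddof=1)
z = (x - mean) / std
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # descending order

k = 3
x_hat = mean + std * (z @ eigvecs[:, :k] @ eigvecs[:, :k].T)   # fitted, original scale
dropped = (eigvecs[:, k:] * eigvals[k:]) @ eigvecs[:, k:].T    # contribution of PCs k+1..m
scale = np.outer(std, std)                                     # sigma_i * sigma_j

cov_x = np.cov(x, rowvar=False)
cov_x_hat = np.cov(x_hat, rowvar=False)
relative_change = (cov_x - cov_x_hat) / scale

print(np.allclose(cov_x_hat, cov_x - scale * dropped))
print(np.allclose(relative_change, dropped))
```

Here `relative_change` is the covariance lost by the reduction, expressed per unit of the two standard deviations, which matches the change in the correlation terms contributed by the dropped components.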
3. Fitted Input Values
Using the first three (3) principal components, NumXL calculates the fitted value for each input variable.

Note: Although the PCA uses the standardized version of the input variables, it computes the fitted values in the original non-standardized format.
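
A comparable calculation can be sketched in NumPy: standardize, project onto the leading components, transform back, and then undo the standardization. The helper name `fitted_values` is hypothetical and the data are synthetic; this is an illustration of the idea rather than NumXL's own routine.

```python
import numpy as np

def fitted_values(x, k):
    """Rebuild x (rows = observations) from its first k principal components."""
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(axis=0), x.std(axis=0, ddof=1)
    z = (x - mean) / std                                  # standardize
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]       # loadings of the k largest PCs
    z_hat = (z @ top) @ top.T                             # fitted standardized values
    return mean + std * z_hat                             # back to the original scale

rng = np.random.default_rng(5)
x = rng.normal(size=(12, 5)) @ rng.normal(size=(5, 5))    # synthetic data
for k in (1, 3, 5):
    rmse = np.sqrt(((x - fitted_values(x, k)) ** 2).mean(axis=0))
    print(k, np.round(rmse, 3))
```

Using all five components reproduces the inputs, while smaller values of `k` leave a residual for each variable that shrinks as components are added.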


Let's plot population (highest final communality) and median school years (lowest final communality) for the original data and for the fitted one.


[Figure: Population, original vs. fitted values, observations 1 to 12]
[Figure: Median school years, original vs. fitted values, observations 1 to 12]


Conclusion
In this tutorial, we examined the dimension reduction proposition from 5 PCs to 3 PCs without significant loss of information.
What do we do now?
In the first two tutorials, we focused on delivering the key ideas behind principal component analysis and, to some extent, the rationale behind the dimension reduction proposition. The cross-sectional socioeconomic sample data, although not a time series, served to demonstrate the theory and to show NumXL's different output tables.
In the third entry of this series, we are ready to look into a set of correlated time series and apply the PCA technique to derive a reduced core set of uncorrelated drivers. Next, we forecast the values (mean and standard error) for the uncorrelated drivers and, using the PCA loadings, imply the corresponding forecast (mean and error) for each input variable.
