


Problem Statement | American Express Global Campus

Problem Statement
Download Files
Data_Dictionary.xlsx | Training_Dataset.csv | Leaderboard_Dataset.csv | Final_Dataset.csv | 2015_Submission_Deck.pptx
Welcome! American Express Campus Analyze This is the third edition of a first-of-its-kind Pan-IIT data analytics competition by American Express. Through this game, you will get first-hand experience of the various facets of the exciting field of Data Sciences.
By the end of this 10-day nerve-wracking, nail-biting, roller-coaster ride, we are sure you will agree that data analytics is as addictive as gaming.
Gear up and Game On!!!
The sections below have details on the following:

1. Background
2. Problem Statement
3. Data for Analysis
4. Clues and Milestones
5. Tips on Data Analysis
6. Popular Data Analysis Techniques

Background
The Island of Hoenn is gearing up for upcoming polls. Citizens are waiting with bated breath as news agencies reveal their predictions on which party is likely to emerge victorious.
Much to the disappointment of all the citizens, there is discrepancy in these poll predictions amongst news channels.
https://in.axpcampus.com/AnalyzeThis/campusactivity/problemstatement.php






So many inconsistent predictions didn't go down well with an inquisitive bunch of students. As they wondered how difficult it might be to crack the problem, they had their 'Eureka' moment: an idea to create their own start-up to analyze poll sentiments and predict the winner. A start-up called AnalyzeThis.
They gather data for a sample of citizens of the Island of Hoenn and get started.
Information on historical voting patterns, rally attendance, party agenda and demographics is what they have at hand to predict the winner amongst the 5 competing parties.
Can you help these students crack this puzzle? Do you have it in you to start your own AnalyzeThis?

Problem Statement
Based on the data the students above have collected, you have to:
1. Identify which citizen is likely to remain loyal and vote for the same party as they voted for in the last polls.
2. Identify which citizen is likely to switch votes and NOT vote for the same party as they voted for in the last polls.
3. Identify the overall winner of the polls.
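
As an illustration of how the three tasks relate, the sketch below assumes you already have a predicted party for each citizen. It flags loyal citizens and switchers by comparing the predicted vote with the last vote, then tallies the predicted votes to name a winner. The citizen identifiers, party names, and record layout are hypothetical and are not taken from the competition files.

```python
from collections import Counter

# Hypothetical records: (citizen_id, party_voted_last_time, predicted_party_now).
# Identifiers and party names are illustrative only, not from the real datasets.
predictions = [
    ("C001", "Party A", "Party A"),
    ("C002", "Party B", "Party D"),
    ("C003", "Party C", "Party C"),
    ("C004", "Party A", "Party B"),
    ("C005", "Party E", "Party A"),
]

def split_loyal_and_switchers(rows):
    """Return (loyal_ids, switcher_ids) by comparing past and predicted votes."""
    loyal, switchers = [], []
    for citizen_id, last_vote, predicted_vote in rows:
        if predicted_vote == last_vote:
            loyal.append(citizen_id)
        else:
            switchers.append(citizen_id)
    return loyal, switchers

def overall_winner(rows):
    """Return the party that receives the most predicted votes."""
    tally = Counter(predicted_vote for _, _, predicted_vote in rows)
    return tally.most_common(1)[0][0]

loyal_ids, switcher_ids = split_loyal_and_switchers(predictions)
winner = overall_winner(predictions)
print("Loyal:", loyal_ids)
print("Switchers:", switcher_ids)
print("Predicted winner:", winner)
```

Tasks 1 and 2 are two sides of the same comparison, so a single per-citizen prediction is enough to answer all three questions.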

Data for Analysis
The following files can be downloaded for your analysis:

1. Training_Dataset.csv: This data covers the historical votes and voting preferences of a sample of citizens. The data has information on:
a. Past history of voting for all citizens
b. Who they will vote for in the upcoming polls
c. Donation, rally attendance, and demographics of these citizens
2. Leaderboard_Dataset.csv: It has information on the historical vote for all citizens along with donation, rally attendance and demographics.
3. Final_Dataset.csv: It has information on the historical vote for all citizens along with donation, rally attendance and demographics.
4. Data_Dictionary.xlsx: This sheet gives you the descriptions of all the variables contained in the 3 datasets above.

Please note that the Leaderboard data submissions are restricted to only 10 submissions per day per team, and for the Final dataset you can submit only one solution. For further details, please refer to the submission guidelines document available at the link below:
https://in.axpcampus.com/AnalyzeThis/campusactivity/guidelinesandsubmission.php

Clues and Milestones
During the game, 1 clue will be released to help you solve the problem better.
This clue has the potential to give an extra boost to your solutions, provided you can use it to your advantage. Check your mail regularly for information about the clue and keep an eye on the website too.
We have defined 1 Milestone on the scores calculated for the Leaderboard submissions.
The first 2 teams to cross the milestone will be awarded. The milestone will appear on the leaderboard on the site.
Milestone Score: 600,000

Tips on Data Analysis
Following are some tips for the uninitiated on how you can approach this data analysis game.
Any exercise in the field of data analytics starts with understanding the data. So, start off by understanding the datasets and descriptions provided to you.
Once you are familiar with the data, try to answer these questions:
1. What data do I have?
2. Which data is useful and which is junk?
3. How can I organize this data to solve my problem?

Then, try to build the variables on the training dataset, define dependent and independent variables, and then start modeling on the Training Dataset. You need to match the citizen's choice of vote.
Once you are satisfied with your model, use it on the Leaderboard dataset and come up with your estimates of the poll results for each citizen. Follow the submission guidelines and upload your estimates. Your submission will be evaluated in real time and you can compare how well you have estimated against other participants.






Keep fine-tuning your estimates by trying to increase your leaderboard scores. Keep an eye on the clues to better your solution. Once satisfied, use the same logic to estimate the vote preference of citizens in the final dataset.
You can use any tool, write your own algorithms, and implement any predictive modeling/data analysis methods you may want to. For your final submission, you will have to provide details of the techniques you have used.
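
Since leaderboard submissions are limited, it can help to estimate accuracy locally before uploading. The sketch below is one possible way to do that: it holds back part of the training rows for validation and scores a simple baseline that assumes every citizen votes as they did last time. The field name `last_vote`, the row layout, and the party labels are assumptions made for illustration; the real column names are in the data dictionary, and the leaderboard uses its own scoring.

```python
import random

def holdout_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, validation)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, rows):
    """Fraction of rows where the model's prediction equals the actual vote."""
    hits = sum(1 for features, actual in rows if model(features) == actual)
    return hits / len(rows)

def same_as_last_time(features):
    """Baseline model: predict that the citizen repeats their last vote."""
    return features["last_vote"]

# Hypothetical training rows: (features, actual_vote_in_upcoming_polls).
rows = [
    ({"last_vote": "A"}, "A"),
    ({"last_vote": "B"}, "B"),
    ({"last_vote": "C"}, "A"),
    ({"last_vote": "A"}, "A"),
    ({"last_vote": "D"}, "D"),
    ({"last_vote": "E"}, "B"),
    ({"last_vote": "B"}, "B"),
    ({"last_vote": "C"}, "C"),
]

train_rows, validation_rows = holdout_split(rows)
print("Baseline accuracy on all rows:", accuracy(same_as_last_time, rows))
print("Baseline accuracy on validation rows:",
      accuracy(same_as_last_time, validation_rows))
```

Any model you build should beat this baseline on the validation rows before it is worth spending a leaderboard submission on it.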

Popular Data Analysis Techniques
1. Regression:
Regression is a mathematical process used to find a function that closely fits a series of data. The analysis involves defining the function that minimizes the difference between the data points and the values predicted by the function. There are several different techniques, the most common being the method of least squares.
For example, say you wanted to find an equation that describes a certain stock's performance. You could take the closing price of that stock for every day in the last year. You would then try to figure out which equation best fits all those points. The equation could be used to try to predict future performance.
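
To make the least-squares idea concrete, here is a minimal sketch that fits a straight line to a handful of closing prices and extrapolates one day ahead. The prices are made up and deliberately lie on a perfect line; real prices would not, and a straight line is only one of many functions you could fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

days = [1, 2, 3, 4, 5]
closing_prices = [10.0, 12.0, 14.0, 16.0, 18.0]  # made-up, perfectly linear data

slope, intercept = fit_line(days, closing_prices)
next_day_estimate = slope * 6 + intercept
print(slope, intercept, next_day_estimate)
```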
2. Logistic Regression:
Say you want to figure out whether the stock price on a certain day will go up or not. You would again have the closing price of that stock for every day in the last year. You can do this using Logistic Regression, which gives you the probability of the stock price rising.
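
A minimal sketch of this idea follows, assuming a single input feature and training by plain gradient descent. The feature (the previous day's percentage change) and the up/down labels are invented for illustration; the text above only says that closing prices are available, so how you turn them into features is your own modeling choice.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical feature: previous day's change in percent.
# Hypothetical label: 1 if the price rose the next day, else 0.
previous_change = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
price_rose = [0, 0, 1, 0, 1, 0, 1, 1]

w, b = fit_logistic(previous_change, price_rose)
prob_after_gain = sigmoid(w * 2.0 + b)
prob_after_loss = sigmoid(w * -2.0 + b)
print(prob_after_gain, prob_after_loss)
```

The output is a probability rather than a hard yes/no, so you still have to choose a threshold (0.5 is a common default) to turn it into a prediction.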
3. Support Vector Machine:
Imagine the previous scenario. In addition to the closing price, say we have some more indicators, such as volume traded, and we have reason to believe that the price (as is often the case) is a complex function of these indicators. Then, to predict the upward or downward trends, SVM could be a better technique for the solution.
4. Neural Networks:
Again, referring to the previous example, let's say that we have certain indicators which are themselves complex functions of several different variables, and suppose we want to use them for the final prediction. In such a scenario, neural networks may give a better solution.
A point to note: as we go down this hierarchy, we might end up overfitting the data.
5. Clustering algorithms:
Clustering algorithms, which are used in search engines, try to group similar objects into one cluster and keep dissimilar objects far from each other. A search engine can then return results for a query from the nearest similar objects, that is, those clustered around the data being searched.
As an illustration, Google uses clustering algorithms to classify different contents as News by parsing through the matter and examining the keywords.






6. Recommendation engines:
Amazon/Flipkart/Netflix use collaborative filtering for recommendation.
In essence, the algorithm represents each customer as a vector of all items on sale. Each entry in the vector is positive if the customer bought or rated the item, negative if the customer disliked the item, or empty if the customer has not made his or her opinion known. Most of the entries are empty for most of the customers. The algorithm then creates its recommendations by calculating a similarity value between the current customer and everyone else.
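
The customer-vector idea can be sketched in a few lines. The example below stores each customer's vector sparsely (empty entries are simply absent) and uses cosine similarity as the similarity value. Cosine similarity is an assumption here, since the description above does not name a specific measure, and the customers and items are invented.

```python
import math

# Sparse vectors: item -> +1 (bought/liked) or -1 (disliked).
# A missing item means the customer has not made an opinion known.
ratings = {
    "alice": {"book": 1, "lamp": 1, "pen": -1},
    "bob": {"book": 1, "lamp": 1, "mug": 1},
    "carol": {"book": -1, "pen": 1},
}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    shared_items = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in shared_items)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(target, all_ratings):
    """Suggest items liked by the most similar customer that target has not rated."""
    others = [name for name in all_ratings if name != target]
    nearest = max(
        others,
        key=lambda name: cosine_similarity(all_ratings[target], all_ratings[name]),
    )
    return sorted(
        item
        for item, score in all_ratings[nearest].items()
        if score > 0 and item not in all_ratings[target]
    )

print(cosine_similarity(ratings["alice"], ratings["bob"]))
print(recommend("alice", ratings))
```

A production recommender would typically combine several similar customers rather than only the nearest one; using a single neighbor keeps the sketch short.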
7. Naive Bayesian Text Classifier:
The best-known use of Naive Bayesian classification is spam filtering. It is a probabilistic classifier based on Bayes' theorem.
For example, email services use Bayes' formula to calculate the probability that an email should be classified as spam, given already existing spam messages. This can be done by calculating the probabilities associated with each word of the text to be classified as spam.
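
The word-probability approach can be illustrated with a tiny classifier. The sketch below counts words per class, applies add-one (Laplace) smoothing so unseen words do not zero out a class, and sums log-probabilities to avoid numeric underflow. The four training messages and the label "ham" for non-spam are made up for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled messages; "ham" means not spam.
training = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]

def train(examples):
    """Count words per label and how many messages carry each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior plus the sum of smoothed log likelihoods of each word.
        score = math.log(label_counts[label] / total_messages)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training)
print(classify("free money", model))
print(classify("meeting notes", model))
```

The "naive" part is the assumption that words occur independently given the class, which is what lets the per-word probabilities simply be multiplied (or, here, their logarithms added).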

2015 American Express Company.
