
CSE6242/CX4242: Data and Visual Analytics | Georgia Tech | Spring 2015

Homework 4: Decision Tree and Weka
Due: Friday, April 17, 2015, 11:55 PM EST

Prepared by Meera Kamath, Yichen Wang, Amir Afsharinejad, Chris Berlind, Polo Chau

In this homework, you will implement decision trees and learn to use Weka. We expect you will spend about 10 hours on Task 1 and 1 hour on Task 2 (however, please note: it takes ~7 hours to run 10-fold cross validation using SMO on the dataset). You will submit a single archive file; see detailed submission instructions in the last section. Points may be taken off if instructions are not followed.

While collaboration is allowed for homework assignments, each student must write up their own answers. All GT students must observe the honor code.

Task 1: Decision Trees (70 points)
In this task, you will implement a well-known decision tree classifier. The performance of the classifier will be evaluated by 10-fold cross validation on a provided dataset. Decision trees and cross validation were covered in class (slides).

You will implement a decision tree classifier from scratch, using Python with the skeleton code we provide or another programming language of your choice (e.g., Java, C++). You may not use any existing machine learning or decision tree library.

Download the dataset here. It is a common dataset for evaluating classification algorithms¹, where the classification task is to determine whether the client will subscribe to a term deposit (variable y) based on the result of a bank marketing campaign. The data is stored in a tab-separated values (TSV) file, where each line represents a person. Each line has 21 columns: the first 20 columns represent a customer's characteristics (details), and the last column is a ground-truth label of their decision (either yes or no). You must not use the last column as an input feature when you classify the data.

A. Implementing Decision Tree (25 pt)
While implementing your decision tree, you will address the following challenges:
- Choosing the attribute for splitting (pages 24-26 in the slides)
  - You can use the entropy and information gain covered in class.
- When to stop splitting (page 27)
  - Implementing advanced techniques, such as pruning or regularization (page 28), is completely optional.
  - You do not need to strictly follow the three conditions on page 27; you may use similar (or better) approaches.

¹ This dataset is provided by the UCI Machine Learning Repository. We preprocessed the dataset for the homework.
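For instance, the entropy and information-gain computations mentioned above can be sketched as follows. This is only an illustrative sketch for categorical attributes, not the required implementation, and the function names are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction achieved by splitting on one categorical attribute."""
    total = len(labels)
    # Group the labels by the attribute's value.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder
```

At each node you would compute the gain for every remaining attribute and split on the one with the highest value.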

In your decision tree implementation, you may choose any variant of decision tree (e.g., entropy, Gini index, or other measures; binary split or multi-way split). However, you must explain your approaches and their effects on the classification performance in a text file, description.txt.
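If you choose the Gini index as your impurity measure instead of entropy, the change is small; a minimal sketch:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())
```

A 50/50 split of yes/no labels gives an impurity of 0.5, and a pure node gives 0.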

We provide skeleton code here, written in Python. It helps you set up the environment (loading the data and evaluating your model). You may choose to use this skeleton, or write your own code from scratch in Python or other languages (e.g., Java, R, C++).

B. Evaluation using Cross Validation (15 pt)
You will evaluate your decision tree using 10-fold cross validation. Please see the lecture slides for details, but basically you will first split the provided data into 10 parts. Then hold out 1 part as the test set and use the remaining 9 parts for training. Train your decision tree using the training set and use the trained decision tree to classify entries in the test set. Repeat this process for all 10 parts, so that each entry is used in the test set exactly once. To get the final accuracy value, take the average of the 10 folds' accuracies.

With a correct implementation of both parts (decision tree and cross validation), your classification accuracy should be around 0.85 or higher.
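The procedure above can be outlined in code roughly like this. It is a schematic sketch only; `train` and `classify` are placeholders for your own functions:

```python
def cross_validate(rows, labels, train, classify, k=10):
    """k-fold cross validation: hold out each fold once, average the accuracies."""
    fold_size = len(rows) // k
    accuracies = []
    for i in range(k):
        # The last fold absorbs any leftover rows when len(rows) % k != 0.
        lo = i * fold_size
        hi = (i + 1) * fold_size if i < k - 1 else len(rows)
        test_x, test_y = rows[lo:hi], labels[lo:hi]
        train_x = rows[:lo] + rows[hi:]
        train_y = labels[:lo] + labels[hi:]
        model = train(train_x, train_y)
        correct = sum(classify(model, x) == y for x, y in zip(test_x, test_y))
        accuracies.append(correct / len(test_x))
    return sum(accuracies) / k
```

Depending on how the data is ordered, you may want to shuffle the rows once before splitting them into folds.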

C. Improvements (25 pt)
Try to improve your decision tree algorithm. Some example strategies are:
- Different splitting strategies
- Different stopping criteria
- Pruning the tree after splitting
- Using randomness and/or bagging

Your improvement can be simple; using as few as one or two simple heuristics or ideas is acceptable. The goal here is for you to try different ways to improve the classifier. You do not need to implement separate code for this improvement. It is okay to build on top of your initial implementation and only submit the best version of your decision trees. But you should discuss what you have tried and why you think it performs better in the description.txt file.
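As one illustration of the strategies above, bagging can be layered on top of an existing tree learner roughly as follows. This is a sketch; `train_tree` and `classify` are hypothetical stand-ins for your own functions:

```python
import random
from collections import Counter

def bagged_trees(rows, labels, train_tree, n_trees=15, seed=0):
    """Train n_trees trees, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(train_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx]))
    return forest

def bagged_classify(forest, classify, row):
    """Majority vote over the ensemble's predictions for one row."""
    votes = Counter(classify(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

Whether this helps on this dataset is something you would measure with your cross-validation code and report in description.txt.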

In the text file, you will also report the performance comparison between the implementations you have tried. For example:
Initial (in Section A): 0.85
After applying strategy 1: 0.89
After applying strategy 2: 0.91

Deliverables
1. source code: The source file(s) of your program. The source files should contain brief comments (what is this section for, such as calculating information gain). If you don't use the skeleton, please include a readme.txt explaining how to compile and run the code on the provided data. It should be runnable using a single command via terminal. Please do not compress the source files.

2. description.txt: This file should include:
a. How you implemented the initial tree (Section A) and why you chose your approaches (<50 words)
b. The accuracy result of your initial implementation (with cross validation)
c. An explanation of the improvements you made (Section C) and why you think they work better (or worse) (<100 words)
d. The accuracy results for your improvements (graded on a curve; full credit (10 pt) will be given to students with relatively high accuracy)

You should implement your own code. You may reference implementations on the Web, but you must not copy and paste that code, which constitutes plagiarism.

Task 2: Using Weka (30 points)
You will use Weka to train classifiers on the same data as in Task 1, and compare the performance of your implementation with Weka's.

Download and install Weka. Note that Weka requires the Java Runtime Environment (JRE) to run. We suggest you install the latest JRE to avoid Java- or runtime-related issues.

How to use Weka:
- Load data into the Weka Explorer: Weka supports file formats such as arff, csv, and xls.
- Preprocessing: you can view your data, select attributes, and apply filters.
- Classify: under Classifier you can select the different classifiers that Weka offers. You can adjust the input parameters of many of the models by clicking on the text to the right of the Choose button in the Classifier section.
These are just some fundamentals of Weka usage. You can find many tutorials on how to use Weka on the Internet.

A. Experiment (15 pt)
Run the following experiments. After each experiment, report your parameters, running time, confusion matrix, and prediction accuracy. An example is provided in the Deliverables part.
1. Decision Tree: C4.5 (J48) is commonly used and similar to what you implemented in Task 1. Under classifiers > trees, select J48. For the Test options, choose 10-fold cross validation, which should be the same for A2 and A3. (5 pt)
2. Support Vector Machine: Under classifiers > functions, select SMO. (5 pt)
3. Your choice: choose any classifier you like from the numerous classifiers Weka provides. You can use the package manager to install the ones you need. (5 pt)

B. Discussion (15 pt)
Discuss in your report and answer the following questions:
1. Compare the Decision Tree result from A1 to your implementation in Task 1 and discuss possible reasons for the difference in performance. (<50 words, 5 pt)
2. Describe the classifier you chose in A3: what it is, how it works, and what its strengths and weaknesses are. (<50 words, 5 pt)
3. Compare the 3 classification results in Section A (running time, accuracy, confusion matrices) and explain why they differ. If you change any of the parameters, briefly explain what you have changed and why it improves the prediction accuracy. (<100 words, 5 pt)

Deliverables
report.txt: a text file containing the Weka results and your discussion for all the questions above. For example:

Section A
1.
J48 -C 0.25 -M 2
Time taken to build model: 3.73 seconds
Overall accuracy: 86.0675%
Confusion Matrix:
     a     b   <-- classified as
 33273  2079 |  a = no
  4401  6757 |  b = yes
2.
...

Section B
1. The result of Weka is 86.1% compared to my result <accuracy> because ...
2. I chose <classifier>, which is <algorithm> ...
...

Submission Guidelines
Submit the deliverables as a single zip file named hw4-Lastname-Firstname.zip (it should start with lowercase hw4). Please specify the name(s) of any students you have collaborated with on this assignment, using the text box on the T-Square submission page.
The directory structure of the zip file should be exactly as below (the unzipped file should look like this):
Task1/
    description.txt
    tree.py OR tree.xxx
    additional_source_files (if you have any)
    readme.txt (unless you used tree.py)

Task2/
    report.txt

You must follow the naming convention specified above.
