Homework 4: Decision Tree and Weka
Due: Friday, April 17, 2015, 11:55 PM EST
Prepared by Meera Kamath, Yichen Wang, Amir Afsharinejad, Chris Berlind, Polo Chau
In this homework, you will implement decision trees and learn to use Weka.
We expect you will spend about 10 hours on Task 1 and 1 hour on Task 2 (however, please note: it takes ~7 hours to run 10-fold cross-validation using SMO on the dataset).
You will submit a single archive file; see detailed submission instructions in the last section.
Points may be taken off if instructions are not followed.
While collaboration is allowed for homework assignments, each student must write up their own answers. All GT students must observe the honor code.
Task 1: Decision Trees (70 points)
In this task, you will implement a well-known decision tree classifier. The performance of the classifier will be evaluated by 10-fold cross-validation on a provided dataset. Decision trees and cross-validation were covered in class (slides).
You will implement a decision tree classifier from scratch, either in Python using the skeleton code we provide or in another programming language of your choice (e.g., Java, C++). You may not use any existing machine learning or decision tree library.
Download the dataset here. It is a common dataset for evaluating classification algorithms¹, where the classification task is to determine whether the client will subscribe to a term deposit (variable y) based on the result of a bank marketing campaign. The data is stored in a tab-separated values (TSV) file, where each line represents a person. Each line has 21 columns: the first 20 columns represent a customer's characteristics (details), and the last column is a ground-truth label of their decision (either yes or no). You must not use the last column as an input feature when you classify the data.
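As a concrete illustration of this layout, the TSV file can be read with a few lines of Python; the function name and file path below are placeholders, not part of the provided skeleton:

```python
import csv

def load_data(path):
    """Load the tab-separated dataset: the first 20 columns are
    features, the last column is the yes/no ground-truth label."""
    features, labels = [], []
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            features.append(row[:-1])  # customer characteristics only
            labels.append(row[-1])     # "yes" / "no" -- never a feature
    return features, labels
```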
A. Implementing a Decision Tree (25 pt)
While implementing your decision tree, you will address the following challenges:
- Choosing the attribute for splitting (pages 24-26 in the slides)
  - You can use entropy and information gain, as covered in class.
¹ This dataset is provided by the UCI Machine Learning Repository. We preprocessed the dataset for the homework.
- When to stop splitting (page 27)
  - Implementing advanced techniques, such as pruning or regularization (page 28), is completely optional.
  - You do not need to strictly follow the three conditions on page 27; you may use similar (or better) approaches.
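A minimal sketch of the entropy-based splitting criterion mentioned above (function names are illustrative, not from the skeleton):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p * log2(p)) over the class proportions in S."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, groups):
    """Gain = H(parent) - weighted sum of child entropies, where
    `groups` holds the label lists produced by a candidate split."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder
```

At each node, your tree would compute this gain for every candidate attribute and split on the one with the highest value.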
In your decision tree implementation, you may choose any variant of decision tree (e.g., entropy, Gini index, or other measures; binary split or multiway split). However, you must explain your approaches and their effects on the classification performance in a text file, description.txt.
We provide skeleton code here, written in Python. It helps you set up the environment (loading the data and evaluating your model). You may choose to use this skeleton, or write your own code from scratch in Python or another language (e.g., Java, R, C++).
B. Evaluation using Cross-Validation (15 pt)
You will evaluate your decision tree using 10-fold cross-validation. Please see the lecture slides for details, but basically you will first split the provided data into 10 parts. Then hold out 1 part as the test set and use the remaining 9 parts for training. Train your decision tree on the training set and use the trained decision tree to classify entries in the test set. Repeat this process for all 10 parts, so that each part is used as the test set exactly once. To get the final accuracy value, take the average of the 10 folds' accuracies.

With correct implementations of both parts (decision tree and cross-validation), your classification accuracy should be around 0.85 or higher.
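The procedure above can be sketched as follows; `train_fn` and `predict_fn` stand in for whatever training and prediction routines your implementation exposes:

```python
import random

def cross_validate(examples, train_fn, predict_fn, k=10, seed=0):
    """Shuffle the (features, label) pairs, split them into k folds,
    hold each fold out once as the test set, and return the mean
    accuracy across the k folds."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k near-equal parts
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i
                 for ex in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```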
C. Improvements (25 pt)
Try to improve your decision tree algorithm. Some example strategies are:
- Different splitting strategies
- Different stopping criteria
- Pruning the tree after splitting
- Using randomness and/or bagging
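For instance, one splitting-strategy variation is to replace entropy with the Gini index; a minimal sketch, with illustrative names:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p^2). Zero for a pure node."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_gain(labels, groups):
    """Impurity reduction of a candidate split: analogous to
    information gain, but cheaper to compute (no logarithms)."""
    total = len(labels)
    child = sum(len(g) / total * gini(g) for g in groups)
    return gini(labels) - child
```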
Your improvement can be simple; using as few as one or two simple heuristics or ideas is acceptable. The goal here is for you to try different ways to improve the classifier. You do not need to implement separate code for this improvement. It is okay to build on top of your initial implementation and only submit the best version of your decision tree, but you should discuss what you have tried and why you think it performs better in the description.txt file.
In the text file, you will also report the performance comparison between the implementations you have tried. For example:

    Initial (in Section A): 0.85
    After applying strategy 1: 0.89
    After applying strategy 2: 0.91
Deliverables
1. source code: The source file(s) of your program. The source files should contain brief comments (what each section is for, such as calculating information gain). If you don't use the skeleton, please include a readme.txt explaining how to compile and run the code on the provided data. It should be runnable with a single command in the terminal. Please do not compress the source files.
2. description.txt: This file should include:
   a. How you implemented the initial tree (Section A) and why you chose your approaches (<50 words)
   b. The accuracy result of your initial implementation (with cross-validation)
   c. An explanation of the improvements you made (Section C) and why you think they work better (or worse) (<100 words)
   d. Accuracy results for your improvements (graded on a curve; full credit (10 pt) will be given to students with relatively high accuracy)
You should implement your own code. You may reference implementations on the Web, but you must not copy and paste that code, which constitutes plagiarism.
Task 2: Using Weka (30 points)
You will use Weka to train classifiers on the same data as in Task 1, and compare the performance of your implementation with Weka's.
Download and install Weka. Note that Weka requires the Java Runtime Environment (JRE) to run. We suggest you install the latest JRE to avoid Java- or runtime-related issues.
How to use Weka:
- Load data into the Weka Explorer: Weka supports file formats such as arff, csv, and xls.
- Preprocessing: you can view your data, select attributes, and apply filters.
- Classify: under Classifier, you can select the different classifiers that Weka offers. You can adjust the input parameters of many of the models by clicking on the text to the right of the Choose button in the Classifier section.

These are just some fundamentals of Weka usage. You can find many tutorials on how to use Weka on the Internet.
A. Experiment (15 pt)
Run the following experiments. After each experiment, report your parameters, running time, confusion matrix, and prediction accuracy. An example is provided in the Deliverables part.
1. Decision Tree: C4.5 (J48) is commonly used and similar to what you implemented in Task 1. Under classifiers > trees, select J48. For the Test options, choose 10-fold cross-validation, which should be the same for A2 and A3. (5 pt)
2. Support Vector Machine: Under classifiers > functions, select SMO. (5 pt)
3. Your choice: choose any classifier you like from the numerous classifiers Weka provides. You can use the package manager to install the ones you need. (5 pt)
B. Discussion (15 pt)
Answer the following questions in your report:
1. Compare the Decision Tree result from A1 to your implementation in Task 1 and discuss possible reasons for the difference in performance. (<50 words, 5 pt)
2. Describe the classifier you chose in A3: what it is, how it works, and what its strengths and weaknesses are. (<50 words, 5 pt)
3. Compare the 3 classification results in Section A: running time, accuracy, and confusion matrices, and explain the differences. If you change any of the parameters, briefly explain what you changed and why it improves the prediction accuracy. (<100 words, 5 pt)
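For question 3, note that overall accuracy can be read straight off each confusion matrix as the trace divided by the total count; a small helper, with hypothetical naming:

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy of a square confusion matrix whose rows are
    true classes and columns are predicted classes: trace / total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```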
Deliverables
report.txt: a text file containing the Weka results and your discussion of all the questions above. For example:

Section A
1.
J48 -C 0.25 -M 2
Time taken to build model: 3.73 seconds
Overall accuracy: 86.0675%
Confusion Matrix:
     a     b   <-- classified as
 33273  2079 |  a = no
  4401  6757 |  b = yes
2.
...

Section B
1. The result of Weka is 86.1% compared to my result <accuracy> because ...
2. I chose <classifier>, which is <algorithm> ...
...
Submission Guidelines
Submit the deliverables as a single zip file named hw4-Lastname-Firstname.zip (it should start with lowercase hw4). Please specify the name(s) of any students you have collaborated with on this assignment, using the text box on the T-Square submission page.
The directory structure of the zip file should be exactly as below (the unzipped file should
look like this):
Task1/
    description.txt
    tree.py OR tree.xxx
    additional_source_files (if you have any)
    readme.txt (unless you used tree.py)
Task2/
    report.txt