
Tanagra

1 Introduction
Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under
Linux.
The gain chart is an alternative to the confusion matrix for the evaluation of a classifier. Its name differs from one tool to another (e.g. lift curve, lift chart, cumulative gain chart, etc.). The main idea is to build a chart where the X coordinate is the percent of the population and the Y coordinate is the percent of the positive values of the class attribute. The gain chart is mainly used in the marketing domain, where we want to detect potential customers, but it can be used in other situations.
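The construction just described can be sketched in a few lines of Python (a didactic illustration of the principle, not the code of any of the tools reviewed here): rank the examples by decreasing score, then record, for each fraction of the population, the fraction of positives retrieved.

```python
def gain_curve(scores, labels, positive="normal"):
    """Return (population fraction, fraction of positives retrieved) pairs.

    scores: model score of each example (higher = more likely positive)
    labels: the class attribute of each example
    """
    n = len(scores)
    total_pos = sum(1 for y in labels if y == positive)
    # rank the examples by decreasing score
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, found = [(0.0, 0.0)], 0
    for i, (_, y) in enumerate(ranked, start=1):
        if y == positive:
            found += 1
        points.append((i / n, found / total_pos))
    return points

# Toy example: 4 positives among 10 examples, concentrated in the top scores
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = ["normal", "normal", "other", "normal", "other",
          "normal", "other", "other", "other", "other"]
curve = gain_curve(scores, labels)
```

Here, reading the curve at 20% of the population tells us which share of the positives the top-scored fifth of the examples contains — exactly the question asked in this tutorial.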
The construction of the gain chart is already outlined in a previous tutorial (see http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second originality of this tutorial is that we run the experiment under Linux (French version of Ubuntu 8.10; see http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html for the installation and use of Tanagra under Linux). The third originality is that we handle a large dataset with 2,000,000 examples and 41 variables. It is very interesting to study the behavior of these tools in this configuration, especially because our computer is not really powerful (Celeron, 2.53 GHz, 1 GB RAM).

We adopt the same procedure for each tool. First, we define the sequence of operations and the settings

21/01/09

on a sample of 2,000 examples. Then, in a second step, we modify the data source and handle the whole dataset. We measure the computation time and the memory occupation. We note that some tools fail on the complete dataset.

About the learning method, we use a linear discriminant analysis with a variable selection process in Tanagra. This approach is not available in the other tools, so we use the Naive Bayes method, which is also a linear classifier.

2 Dataset
We use a modified version of the KDD CUP 99 dataset. The aim of the analysis is the detection of network intrusions (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). We want to detect a normal connection (binary problem: "normal network connection or not"). More precisely, we ask the following question: "if we assign a score to the examples and rank them according to this score, what proportion of the positive (normal) connections is detected if we select only 20 percent of the whole set of connections?"

We have two data files. The first contains 2,000,000 examples (full_gain_chart.arff); it is used to compare the ability of the tools to handle a large dataset. The second is a sample of 2,000 examples (sample_gain_chart.arff); it is used for specifying the whole sequence of operations. These files are bundled in an archive: http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/dataset_gain_chart.zip

3 Tanagra
3.1 Defining the diagram

Creating a diagram and importing the dataset. After launching Tanagra, we click on the FILE / NEW menu in order to create a diagram and import the dataset. In the dialog settings, we select the data file SAMPLE_GAIN_CHART.ARFF (Weka file format). We set the name of the diagram to GAIN_CHART.TDM.


41 variables and 2,000 cases are loaded.

Partitioning the dataset. We want to use 10 percent of the dataset for the learning phase, and the remainder for the test phase. 10 percent corresponds to 200 examples on our sample, which seems very limited. But we must remember that this first analysis is used only as a preparation phase; in the next step, we apply the analysis to the whole dataset, so the true learning set size is 10 percent of 2 million, i.e. 200,000 examples. That seems enough to create a reliable classifier.

In order to define the train and test sets, we use the SAMPLING component (INSTANCE SELECTION tab). We click on the PARAMETERS menu and set PROPORTION SIZE to 10%.

When we click on the contextual VIEW menu, Tanagra indicates that 200 examples are now selected.
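The effect of the SAMPLING component can be reproduced with a short sketch (a hypothetical illustration, not Tanagra's actual implementation): draw 10% of the row indices at random for the learning set and keep the rest for the test set.

```python
import random

def proportion_split(n_rows, proportion=0.10, seed=1):
    """Select `proportion` of the rows for learning, the rest for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train = set(rng.sample(range(n_rows), int(n_rows * proportion)))
    test = [i for i in range(n_rows) if i not in train]
    return sorted(train), test

# On the 2,000-example sample, 200 rows go to the learning set
train_idx, test_idx = proportion_split(2000)
```

On the sample this selects 200 examples, the figure reported by Tanagra's VIEW menu; on the complete file the same 10% rule yields the 200,000-example learning set.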
Defining the variable types. We insert the DEFINE STATUS component (using the shortcut in the toolbar) in order to define the TARGET attribute (CLASSE) and the INPUT attributes (all the other ones).


Automatic feature selection. Some of the INPUT attributes are irrelevant or redundant. We insert the STEPDISC component (FEATURE SELECTION tab) in order to detect a relevant subset of predictive variables. Setting the size of the subset is very hard. Statistical cut values are not really useful here: because we apply the process to a large dataset, all the variables seem significant.

So, the simplest way is the trial and error approach. We observe the decrease of the Wilks' lambda statistic when we add a new predictive variable. A good choice seems to be SUBSET SIZE = 6: when we add a new variable after this step, the decrease of the lambda is comparatively weak. This corresponds approximately to the elbow of the curve describing the relationship between the number of predictive variables and the Wilks' lambda. We set the parameters according to this point of view.
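The stopping rule used above ("keep adding variables while the lambda still decreases noticeably") can be sketched as follows. The lambda values below are invented for illustration; the real ones are read from the STEPDISC report.

```python
def elbow_subset_size(lambdas, min_gain=0.01):
    """Return the number of variables to keep.

    lambdas[k] is the Wilks' lambda after adding the (k+1)-th variable.
    We stop as soon as one more variable improves it by less than `min_gain`,
    i.e. at the elbow of the lambda-vs-subset-size curve.
    """
    size = 1
    for k in range(1, len(lambdas)):
        if lambdas[k - 1] - lambdas[k] < min_gain:
            break
        size = k + 1
    return size

# Illustrative sequence: strong decreases up to 6 variables, then a plateau
lambdas = [0.60, 0.45, 0.35, 0.28, 0.23, 0.20, 0.198, 0.197]
best = elbow_subset_size(lambdas)
```

With these fictitious figures the rule stops at 6 variables, mirroring the SUBSET SIZE = 6 chosen by eye in the tutorial; the `min_gain` threshold plays the role of the "comparatively weak decrease" judgment.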


Learning phase. We add the LINEAR DISCRIMINANT ANALYSIS component (SPV LEARNING tab). We obtain the following results.

Computing the score column. We now compute the score of each example, i.e. its propensity to be a positive one. The score value is not really a probability, but it allows us to rank the examples in the same way as the conditional probability P[CLASS = positive / INPUT attributes]. It is computed on the whole dataset, i.e. both the train and the test sets.

We add the SCORING component (SCORING tab) and set the following parameter: the "positive" value of the class attribute is "normal". A new column, SCORE_1, is added to the dataset.
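For a linear classifier, the SCORE column boils down to a linear combination of the selected input variables. The sketch below uses made-up weights and rows; the real coefficients are those printed in the discriminant analysis report.

```python
def linear_scores(rows, weights, intercept):
    """Compute score = w . x + b for every example, train and test alike.

    The value is not a probability, but it ranks the examples in the
    same order as P[CLASS = positive | inputs] for a linear classifier.
    """
    return [sum(w * x for w, x in zip(weights, row)) + intercept
            for row in rows]

# Hypothetical coefficients for the 6 selected variables
weights = [1.5, -0.8, 0.3, 2.0, -1.1, 0.6]
intercept = -0.25
rows = [[1, 0, 2, 0, 1, 3],   # two fictitious connections
        [0, 1, 0, 1, 0, 0]]
score_1 = linear_scores(rows, weights, intercept)  # the new SCORE_1 column
```

Only the ordering of these values matters for the gain chart, which is why any monotonic transform of the discriminant score would produce the same curve.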

Creating the gain chart. In order to create the gain chart, we must define the TARGET attribute (CLASSE) and the SCORE attribute as INPUT. We use the DEFINE STATUS component from the toolbar. We note that we can set several SCORE columns; this option is useful when we want to compare various score columns (e.g. when we want to compare the efficiency of various learning algorithms which compute a score).


Then, we add the LIFT CURVE component (SCORING tab). The settings correspond to the computation of the curve for the "normal" value of the class attribute, on the unselected examples, i.e. the test set.

We obtain the following chart.

In the HTML tab, we have the details of the results. We see that among the first 20 percent of the population, the true positive rate is 97.80%, i.e. we retrieve 97.80% of the positive examples of the dataset.


3.2 Running the diagram on the complete dataset

This first step allowed us to define the sequence of data analysis. Now, we want to apply the same framework to the complete dataset: we create the classifier on a learning set with 200,000 instances, and evaluate its performance, using a gain chart, on a test set with 1,800,000 examples.

With Tanagra, the simplest way is to save the diagram (FILE / SAVE) and close the application. Then, we open the diagram file GAIN_CHART.TDM in a text editor and replace the data source reference with the name of the complete dataset, i.e. full_gain_chart.arff.

We launch Tanagra again and open the modified diagram gain_chart.tdm (FILE / OPEN menu). The new dataset is automatically loaded, and Tanagra is ready to execute each node of the diagram.


The processing time is 180 seconds (3 minutes).

We execute all the operations with the DIAGRAM / EXECUTE menu. We get the table associated with the gain chart. Among the first 20% of individuals (ranked according to the score value), we find 98.68% of the positives (the true positive rate is 98.68%).

Here is the gain chart.

We consider the computation time below. We note mainly that the memory occupation of Tanagra increased only slightly during the calculations: it is the data loaded into memory which mainly determines the memory occupation.

4 Knime
4.1 Defining the workflow on the sample
The installation and the use of Knime (http://www.knime.org/, version 2.0.0) are very easy under Ubuntu. We create a new workflow (FILE / NEW) and ask for a Knime Project.

As we said above, the linear discriminant method is not available, so we use the Naive Bayes classifier. It also corresponds to a linear classifier, and we expect to obtain similar results. Unfortunately, Knime does not propose a feature selection process adapted to this approach. Thus, we evaluate the classifier computed on all the predictive variables.
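As a reminder of what the Naive Bayes Learner/Predictor pair computes, here is a minimal Gaussian naive Bayes scorer for numeric inputs (a didactic sketch with a fictitious one-variable dataset, not Knime's implementation, which also handles nominal attributes):

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate a prior and a per-class mean/variance for each variable."""
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        vars_ = [max(sum((v - m) ** 2 for v in col) / len(rows), 1e-9)
                 for col, m in zip(zip(*rows), means)]
        model[cls] = (len(rows) / len(X), means, vars_)
    return model

def log_score(model, x, cls):
    """Log of prior * product of Gaussian densities; used to rank examples."""
    prior, means, vars_ = model[cls]
    s = math.log(prior)
    for v, m, var in zip(x, means, vars_):
        s += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return s

# Tiny fictitious dataset: one numeric variable, two classes
X = [[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]]
y = ["normal", "normal", "normal", "attack", "attack", "attack"]
nb = fit_gaussian_nb(X, y)
```

Because the log-posterior is a sum of per-variable terms, this model behaves like a linear classifier on the derived features, which is why we can hope for results comparable to Tanagra's discriminant analysis.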
We create the following workflow:


We describe below the settings of some components.

ARFF READER: the dialog box appears with the CONFIGURE menu. We set the file name, i.e. sample_gain_chart.arff. Another very important parameter is in the GENERAL NODE SETTINGS tab: if the dataset is too large (more than 100,000 examples according to the documentation), it can be swapped to disk. Knime is the only tool in this tutorial which does not load the entire dataset into memory during processing. From this point of view, it is similar to the commercial tools that can handle very large databases. Certainly, it is an advantage. We will see the consequences of this technical choice when we handle the complete database.


For the PARTITIONING node, we select 10% of the examples.

For the NAIVE BAYES PREDICTOR node, we ask the component to produce the score column for the values of the class attribute. As with Tanagra, we need this column for ranking the examples during the construction of the gain chart.

Last, for the LIFT CHART, we specify the class attribute (CLASSE), the positive value of the class

attribute (NORMAL). The width of the interval (INTERVAL WIDTH) for the construction of the table is 10% of the examples.
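With INTERVAL WIDTH = 10%, the lift table simply splits the score-ranked test examples into ten bins and counts the positives in each bin. A sketch on fictitious data (illustration only):

```python
def lift_table(scores, labels, positive="normal", n_bins=10):
    """Count the positives in each bin of the score-ranked examples."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    size = len(ranked) // n_bins  # assumes len(ranked) divisible by n_bins
    return [sum(1 for y in ranked[i * size:(i + 1) * size] if y == positive)
            for i in range(n_bins)]

# 20 fictitious test examples: the positives concentrate in the top scores
scores = [s / 20 for s in range(20, 0, -1)]
labels = ["normal"] * 5 + ["other"] * 15
bins = lift_table(scores, labels)
```

Cumulating these per-bin counts and dividing by the total number of positives gives back the cumulative gain chart of the previous sections.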

When we click on the EXECUTE AND OPEN VIEW contextual menu, we obtain the following gain chart (CUMULATIVE GAIN CHART tab).


4.2 Calculations on the complete database

We are ready to work on the complete base. We activate again the dialog settings of the ARFF READER node (CONFIGURE menu) and select the full_gain_chart.arff data file. We can then execute the workflow. We obtain the following gain chart.

The results are very similar to those obtained on the sample (2,000 examples).

The computation time, however, is not really the same: it is very high compared to that of Tanagra. We study in detail below the computation time of each step.

About the memory occupation, the system monitor shows that Knime uses "only" 193 MB (Figure 1). It is clearly lower than Tanagra, and it will be lower than all the other tools we analyze in this tutorial.

We can clearly see the advantages and disadvantages of swapping data to disk: the memory is under control, but the computation time increases. If we are in the final production phase of the predictive model, it is not a problem. If we are in the exploratory phase, where we use trial and error in order to define the best strategy, a prohibitive computation time is discouraging.

In this configuration, the only solution is to work on a sample. In our study, we note that the results obtained
on 2,000 cases are very close to the results obtained on 2,000,000 examples. Of course, this is an unusual situation: the underlying relation between the class attribute and the predictive attributes is very simple. But it is obvious that, if we can determine a sufficient sample size for the problem, this strategy allows us to explore the various solutions in depth without a prohibitive computation time.

Figure 1: Memory occupation of Knime during the processing

5 RapidMiner
The installation and the use of RapidMiner (http://rapid-i.com/content/blogcategory/38/69/, Community Edition, version 4.3) are very simple under Ubuntu. We add a shortcut on the desktop which automatically runs the Java Virtual Machine with the rapidminer.jar file.

5.1 Working on the sample

As with the other tools, we set up the sequence of operations on the sample (2,000 examples). There is little documentation on the software, but it comes with many predefined examples, and we often quickly find an example of analysis similar to ours.

For the gain chart, we use the example file sample/06_Visualization/16_LiftChart.xml. We adapt the
sequence of operations to our dataset.

I do not really understand everything here; the IORetriever and IOStorer operators seem quite mysterious. Anyway, the important thing is that RapidMiner provides the desired results. And it does.


We see in this chart that there are 1,800 examples in the test set, among which 324 are positive (CLASSE = NORMAL). Among the first 20% of the dataset (360 examples) ranked according to the score value, there are 298 positive examples, i.e. the true positive rate is 298 / 324 ≈ 92%.
We use the following settings for this analysis:

SIMPLE VALIDATION: SPLIT_RATIO = 0.1;

LIFT PARETO CHART: NUMBER OF BINS = 10.

We use the Naive Bayes classifier for RapidMiner because linear discriminant analysis is not available.


5.2 Calculations on the complete database

To run the calculations on the complete database, we modify the settings of the ARFF EXAMPLE SOURCE component. Then, we click on the PLAY button in the toolbar.

The calculation failed. The base was loaded in 17 minutes; the memory had grown to 700 MB when the crash occurred. The message sent by the system is the following.


6 Weka
Weka is a well-known machine learning package (http://www.cs.waikato.ac.nz/ml/weka/, version 3.6.0). It runs under a JRE (Java Runtime Environment). The installation and the use are easy.

There are various ways to use Weka. We choose the KNOWLEDGEFLOW mode in this tutorial; it is very similar to Knime.

6.1 Working on the sample

As with the other tools, we first work on the sample in order to define the entire sequence of operations.

We see the diagram below. We voluntarily spread out the components; thus we see, in blue, the type of information transmitted from one icon to another. This is very important: it determines the behavior of the components.


We describe here some nodes and settings:

ARFF LOADER loads the data file sample_gain_chart.arff;

CLASS ASSIGNER specifies the class attribute classe; all the other variables are the descriptors;

CLASS VALUE PICKER defines the positive value of the class attribute, i.e. CLASSE = NORMAL;

TRAIN TEST SPLIT MAKER splits the dataset; we select 10% for the train set;

NAIVE BAYES is the learning method; we connect both the TRAINING SET and the TEST SET connections to this component;

CLASSIFIER PERFORMANCE EVALUATOR evaluates the classifier performance;

MODEL PERFORMANCE CHART creates various charts for the evaluation of the classifier on the test set.

To start the calculations, we click on the START LOADING menu of the ARFF LOADER component. The results are available by clicking on the SHOW CHART menu of the MODEL PERFORMANCE CHART component.
Weka offers a clever tool for visualizing charts. We can interactively select the X coordinates and the Y
coordinates. Unfortunately, while we can choose the right values for the Y coordinates, we cannot select the percent of the population for the X coordinates. Below, we use the false positive rate instead, which defines the ROC curve.

6.2 Calculations on the complete database

To process the complete database, we configure the ARFF LOADER: we select the full_gain_chart.arff data file, then we click on the START LOADING menu.

After 20 minutes, the calculation is still not completed and Weka disappears suddenly. We tried to modify the settings of the JRE (e.g. setting -Xmx1024m for the heap size), but this was not fruitful. Weka under Windows has the same problem in the same circumstances: we cannot handle a database with 41 variables and 2,000,000 examples.


7 Conclusion
Of course, maybe I have not used the optimal settings for the various tools. But they are free and the dataset is available online; anyone can repeat the same experiments and try other settings in order to improve the efficiency of the processing, both for the computation time and the memory occupation (e.g. the settings of the JRE, Java Runtime Environment).
Computation time and memory occupation are reported in the following table.

Tool       | Load the dataset | Learning phase + creating the gain chart | Memory occupation
-----------|------------------|------------------------------------------|-------------------
Tanagra    | 3 min            | 25 sec                                   | 342 MB
Knime      | 6 min 40 sec     | 30 min                                   | 193 MB
Weka       | > 20 min (error) | -                                        | > 700 MB (error)
RapidMiner | 17 min           | error                                    | 700 MB (error)

The first result which draws our attention is the quickness of Tanagra. Although it runs through Wine under Linux, Tanagra is still very fast. The comparison is especially valid for loading the data; for the learning phase, because we do not use the same approach everywhere, the computation times are not really comparable.

This result seems curious, because we made the same comparison (loading a large dataset) under Windows (http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html) and the computation times were similar on that occasion. Perhaps the JRE under Linux is not efficient?

The second important result of this comparison is the excellent memory management of Knime. Of course, we use a very simple method (Naive Bayes classifier). However, we note that the memory occupation remains very stable during the whole process. Obviously, Knime can handle a very large database without any problem.

In compensation, the computation time of Knime is clearly higher. In some circumstances, during the step where we try to define the optimal settings using a trial and error approach, it can be a drawback.
Finally, to be quite comprehensive, an important tool is missing from this comparison: Orange, which I like a lot because it is very powerful and user friendly (http://www.ailab.si/orange/). Unfortunately, even though I followed the description of the installation process attentively, I could not install (compile) Orange under Ubuntu 8.10 (http://www.ailab.si/orange/downloads-linux.asp#orange-1.0-ubuntu).

The Windows version works without any problem. We show below the result of the process under the Windows version. We do not give the computation time and the memory occupation here because the systems are not comparable.