
Data Mining using Mahout

Team No. 8: Pratibha Rani, Prashant Sethia, Manisha Verma

What is Mahout?
Subproject of Apache Lucene
Goal: deliver scalable machine learning algorithm implementations
http://lucene.apache.org/mahout/

Version 0.1, released on 7 April 2009, includes 10 algorithm libraries


Details in published paper: http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

Objective
Implement two Data Mining/Machine Learning algorithms
Convert each algorithm to the MapReduce paradigm
Implement using Hadoop
Optimize the computation to take advantage of the MapReduce paradigm

Integrate them into the Mahout library

Make them available online

Implemented Algorithms
Classification of multi-class data using a Linear Discriminant Function (LDF)
A machine learning method for classification
Computational cost increases as the number of classes increases

SPRINT
A decision-tree-based parallel classifier for Data Mining
Requires parallelization of computations

Decision Tree Example

Attribute Lists

Algorithm

Algorithm (contd.)

SPRINT: Introduction
Carry out the decision tree building process in parallel
Frequent lookup of the central class list produces a lot of network communication in the parallel case
Solution: eliminate the class list
Class labels are distributed to each attribute list => redundant data, but the memory-residence and network-communication bottlenecks are removed
Each node keeps its own set of attribute lists => no need to look up node information
Each node is assigned a partition of each attribute list; the nodes are ordered so that the combined lists of non-categorical attributes remain sorted
Each node produces its local histograms in parallel; the combined histograms are used to find the best splits

Conversion of SPRINT to the MapReduce paradigm
For each attribute, create an attribute list from the given dataset using MapReduce (see the Mapper sketch below)
Use MapReduce to sort all attribute lists
Convert the tree construction algorithm into MapReduce format
Write a MapReduce job that reads test samples from a user-given input file and traverses the constructed tree to find their class labels
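A minimal sketch (our own illustration, not the project's actual code) of the attribute-list creation step, assuming comma-separated records whose last field is the class label; the line's byte offset serves as the record id (rid):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (attribute-index, "value,label,rid") pair per attribute of each
// record, so the shuffle gathers each attribute's full list in one place.
public class AttributeListMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split(",");
        String label = fields[fields.length - 1].trim(); // assumed label position
        long rid = offset.get();                         // byte offset as record id
        for (int i = 0; i < fields.length - 1; i++) {
            ctx.write(new Text("attr" + i),
                      new Text(fields[i].trim() + "," + label + "," + rid));
        }
    }
}
```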

SPRINT Contd
Tree construction:
1. At each node, use MapReduce to find the Gini index of each attribute in parallel (a helper sketch follows below).
2. Find the attribute with the lowest Gini index.
   a. Split the attribute lists using MapReduce.
   b. Make new nodes for each split.
3. Repeat the above steps until no unused attributes are left or the attribute list contains records from only one class.
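For reference, a small self-contained helper (ours, using the standard SPRINT definitions): the Gini index of a set is 1 - sum_j p_j^2, where p_j is the fraction of records in class j, and a candidate split is scored by the size-weighted average of the two partitions' indices:

```java
public final class Gini {

    static double gini(long[] classCounts) {
        long total = 0;
        for (long c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sum = 0.0;
        for (long c : classCounts) {
            double p = (double) c / total;
            sum += p * p;                        // accumulate p_j^2
        }
        return 1.0 - sum;
    }

    // gini_split = (nLeft/n) * gini(left) + (nRight/n) * gini(right)
    static double giniSplit(long[] left, long[] right) {
        long nLeft = 0, nRight = 0;
        for (long c : left) nLeft += c;
        for (long c : right) nRight += c;
        long n = nLeft + nRight;
        if (n == 0) return 0.0;
        return (nLeft * gini(left) + nRight * gini(right)) / n;
    }
}
```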

Advantages of using MapReduce


Automatic parallelization of the decision tree building process
Does not require a parallel sorting algorithm, which is computationally expensive and requires shared-memory parallel processors

LDF: Introduction
Represent pattern classifiers in terms of a set of discriminant functions g_i(x), i = 1, ..., n
The classifier assigns a feature vector x to class w_i if g_i(x) > g_j(x) for all j != i
Transform g_i(x) into the linear form g(x) = a^t y, where y is the augmented feature vector and a the weight vector
A sample y_i is classified correctly if a^t y_i > 0 and is labeled c1;
if a^t y_i < 0, it is labeled c2

LDF: Introduction contd


Replace all the samples labeled c2 by their negatives, then find a weight vector a such that a^t y_i > 0 for all samples
Such a weight vector a is called a separating vector or solution vector
Use the Fixed-Increment Single Sample Perceptron algorithm to find this vector (a sketch follows below)
For the multiclass case with n classes, n(n-1)/2 such separating vectors are needed, one for each pair of classes
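A plain-Java sketch (ours) of the Fixed-Increment Single Sample Perceptron, assuming each row of y is an augmented sample with the class-c2 samples already replaced by their negatives, so a solution satisfies a^t y > 0 for every row:

```java
public final class Perceptron {

    static double[] fixedIncrementPerceptron(double[][] y, int maxEpochs) {
        int d = y[0].length;
        double[] a = new double[d];             // weight vector, starts at zero
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            boolean anyError = false;
            for (double[] sample : y) {
                double dot = 0.0;
                for (int k = 0; k < d; k++) dot += a[k] * sample[k];
                if (dot <= 0) {                 // misclassified sample
                    for (int k = 0; k < d; k++) a[k] += sample[k]; // fixed increment
                    anyError = true;
                }
            }
            if (!anyError) break;               // converged: all a^t y > 0
        }
        return a;
    }
}
```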

Conversion of LDF to the MapReduce paradigm


Requires the following tasks:
1. Analyze the given dataset so that it can be divided into pairs of two classes, for all possible pairs of classes.
2. Transform the computation algorithm into MapReduce format.
3. Store the final separating vectors obtained from the computation algorithm for classifying new samples.
4. Write a MapReduce job which uses the obtained separating vectors to classify samples given by the user.

LDF contd
Task 1: divide the data into n(n-1)/2 class pairs
The first Mapper sorts the input file according to the class labels and outputs the class labels and records; its Reducer collects the sorted records into an output file
The second Mapper takes the sorted file as input and outputs the class label and byte offset of each record; its Reducer takes a class label and a vector of byte offsets and emits the class label with the minimum and maximum byte offset for that class (see the Reducer sketch below)
n(n-1)/2 files are created, one for each pair of classes, to store the start and end byte offsets of the records of those classes
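A hedged sketch of the second job's Reducer (the class name is ours): it sees every byte offset emitted for one class label and keeps only the minimum and maximum, which suffices because each class's records are contiguous in the sorted file:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class OffsetRangeReducer
        extends Reducer<Text, LongWritable, Text, Text> {
    @Override
    protected void reduce(Text label, Iterable<LongWritable> offsets, Context ctx)
            throws IOException, InterruptedException {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (LongWritable o : offsets) {       // scan all offsets of this class
            min = Math.min(min, o.get());
            max = Math.max(max, o.get());
        }
        ctx.write(label, new Text(min + "," + max));
    }
}
```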

LDF contd
Task 2: transform the Fixed-Increment Single Sample Perceptron algorithm into MapReduce format
The n(n-1)/2 input files create the same number of Map tasks in the new job, so all n(n-1)/2 LDFs are created in parallel (a rough sketch follows below)
The Mapper takes a class-label pair along with its offsets as input and produces the two class labels and the corresponding separating vector as output
The Reducer simply collects these into an output file
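A rough sketch of the per-pair training Mapper, with several assumptions of ours: each input line names one class pair plus its byte ranges, loadSamples() is a hypothetical stub standing in for the HDFS read-and-negate step, and fixedIncrementPerceptron() is the helper sketched earlier:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairTrainerMapper
        extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text pairSpec, Context ctx)
            throws IOException, InterruptedException {
        // Assumed line format: "c1,c2,start1,end1,start2,end2"
        String[] f = pairSpec.toString().split(",");
        double[][] y = loadSamples(f, ctx.getConfiguration());
        double[] a = fixedIncrementPerceptron(y, 1000);
        StringBuilder sb = new StringBuilder();
        for (double w : a) sb.append(w).append(' ');
        ctx.write(new Text(f[0] + "," + f[1]), new Text(sb.toString().trim()));
    }

    // Hypothetical stub: real code would seek to the two byte ranges in the
    // sorted data file, parse the samples, and negate those of the second class.
    private double[][] loadSamples(String[] spec, Configuration conf) {
        throw new UnsupportedOperationException("illustrative stub");
    }

    // See the standalone Perceptron sketch on the earlier slide.
    private double[] fixedIncrementPerceptron(double[][] y, int maxEpochs) {
        throw new UnsupportedOperationException("illustrative stub");
    }
}
```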

LDF contd
Task 3: store the output file created by Task 2
It contains the parameters of the trained classifier in the form of class-label pairs and the corresponding separating vectors

Testing: Classify test samples


The Mapper reads each sample from the user-given input file and finds the class label of the sample (one possible combination rule is sketched below)
The Reducer outputs the samples with their class label names
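The slides do not spell out how the n(n-1)/2 pairwise decisions are combined into one class label; a common choice, which we assume here purely for illustration, is pairwise majority voting:

```java
import java.util.HashMap;
import java.util.Map;

public final class PairwiseVote {
    // vectors maps "ci,cj" -> separating vector a trained so that a^t y > 0
    // votes for ci and a^t y < 0 votes for cj; y is the augmented test sample.
    static String classify(Map<String, double[]> vectors, double[] y) {
        Map<String, Integer> votes = new HashMap<>();
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            String[] pair = e.getKey().split(",");
            double[] a = e.getValue();
            double dot = 0.0;
            for (int k = 0; k < y.length; k++) dot += a[k] * y[k];
            String winner = dot > 0 ? pair[0] : pair[1];
            votes.merge(winner, 1, Integer::sum);  // tally one pairwise vote
        }
        // The predicted class is the one with the most pairwise wins.
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```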

Flow chart
1st phase of MapReduce

2nd phase of MapReduce

Final phase of MapReduce

Testing of Data
Read Classifier/part-00000 and store the weight vectors for each separating plane
The input test file is divided into chunks; each chunk is given to the Mapper class in Classifier.java
Depending on whether a sample is classified correctly or incorrectly, the corresponding output is collected
In the Reducer class the correct and incorrect classifications of each class are summed up, and the result is written to Result/part-00000
ResultAnalyzer.java reads this file and shows the output to the user
LDFDriver.java runs the Mapper, Reducer, Classifier, and ResultAnalyzer classes

Classes Implemented
LDFMapper.java
LDFReducer.java
LDFDriver.java
LDFClassifier.java
LDFResultAnalyzer.java

Advantages from MapReduce


The computation of the n(n-1)/2 LDFs is automatically parallelized, which reduces the training time of the classifier
Replication of the input data is avoided by sorting it using MapReduce and storing only the line-offset information, which helps reduce the storage requirement during training

Installation of Mahout
Download the tar files of both the apache-mahout and apache-maven projects
Unzip the tar files into a directory
Set the path variables for Maven
Configure settings.xml in Maven's conf directory to add the HTTP proxy and port
Set the present working directory to Mahout's core folder
Compile the project with 'mvn compile'
Build the project with 'mvn install'

Troubleshooting while installation


By default the proxy address and port are not set, so the build fails due to unavailability of packages
Solution: set the proxy address and port before building the packages (a sample settings.xml stanza is shown below)
Once the build is complete, running the examples is the same as running any Java program
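For reference, a minimal proxy stanza for Maven's conf/settings.xml; the host and port below are placeholders for your own proxy:

```xml
<settings>
  <proxies>
    <proxy>
      <id>default-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host> <!-- placeholder: your proxy host -->
      <port>8080</port>              <!-- placeholder: your proxy port -->
    </proxy>
  </proxies>
</settings>
```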

Integration with Mahout library


Make wrapper classes which encapsulate the working of the Map and Reduce functions
Make classes to set variables for the dataset
Make classes to model exceptions that can possibly occur
Put all files related to an algorithm in one folder and add it as a package in the library

Thanks
