What is Mahout?
Subproject of Apache Lucene
Goal: delivering scalable machine learning algorithm implementations
http://lucene.apache.org/mahout/
Objective
Implement two Data Mining/Machine Learning algorithms
Convert the algorithms to the MapReduce paradigm
Implement them using Hadoop
Optimize computation by taking advantage of the MapReduce paradigm
Implemented Algorithms
Classification of Multi Class data using Linear Discriminant Function (LDF)
Machine Learning method for classification
Computational cost increases as the number of classes increases
SPRINT
Decision tree based parallel classifier for Data Mining
Requires parallelization of computations
Attribute Lists
Algorithm
Algorithm (contd.)
SPRINT: Introduction
Carry out the decision tree building process in parallel
Frequent lookups of the central class list produce a lot of network communication in the parallel case
Solution: eliminate the class list
Class labels are distributed to each attribute list => redundant data, but the memory-resident and network communication bottlenecks are removed
Each node keeps its own set of attribute lists => no need to look up the node information
Each node is assigned a partition of each attribute list. The nodes are ordered so that the combined lists of noncategorical attributes remain sorted
Each node produces its local histograms in parallel; the combined histograms are used to find the best splits
Conversion of SPRINT to the MapReduce paradigm
For each attribute, create an attribute list from the given dataset using MapReduce
Use MapReduce to sort all attribute lists
Convert the tree construction algorithm into MapReduce format
Write a MapReduce job that reads test samples from a user-supplied input file and traverses the constructed tree to find class labels
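The attribute-list construction described above can be sketched in plain Python, with no Hadoop involved. This is only an illustration under stated assumptions: the function names, the record layout (attribute values followed by the class label) and the sample data are hypothetical and do not come from the slides. Each entry carries its class label and record id, which is what lets SPRINT drop the central class list.

```python
from collections import defaultdict

def map_record(rid, record, attribute_names):
    """Map step: emit one (attribute, (value, class_label, rid)) pair per attribute."""
    *values, label = record  # assumed layout: attribute values, then the class label
    for name, value in zip(attribute_names, values):
        yield name, (value, label, rid)

def reduce_attribute(entries, is_numeric):
    """Reduce step: gather one attribute's entries; numeric lists are sorted by value."""
    return sorted(entries, key=lambda e: e[0]) if is_numeric else list(entries)

def build_attribute_lists(records, attribute_names, numeric):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for rid, record in enumerate(records):
        for name, entry in map_record(rid, record, attribute_names):
            grouped[name].append(entry)
    return {name: reduce_attribute(grouped[name], name in numeric)
            for name in attribute_names}

# Hypothetical toy dataset: (age, car_type, risk_class)
records = [(23, "family", "high"), (17, "sports", "high"),
           (43, "sports", "high"), (68, "family", "low")]
lists = build_attribute_lists(records, ["age", "car_type"], numeric={"age"})
```

Here `lists["age"]` is ordered by age, while `lists["car_type"]` keeps the original record order because categorical lists need no sorting.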
SPRINT Contd
Tree construction:
1. At each node, use MapReduce to find the Gini index of each attribute in parallel
2. Find the attribute with the lowest Gini index
3. Split the attribute list using MapReduce
4. Make new nodes for each split
5. Repeat the above steps until no unused attributes are left or the attribute list contains records from only one class
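To make the Gini step concrete, here is a minimal sketch of how the best binary split of one sorted numeric attribute list could be found with two class histograms. It assumes entries of the form (value, class_label, record_id) and takes the midpoint between adjacent distinct values as the threshold. Both choices are assumptions for illustration, not details taken from the slides.

```python
from collections import Counter

def gini(hist):
    """Gini index 1 - sum(p_c^2) of a class histogram."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in hist.values())

def best_numeric_split(attr_list):
    """Scan a value-sorted attribute list once and return (weighted_gini, threshold)
    for the best split of the form value <= threshold."""
    below = Counter()                                   # classes seen so far
    above = Counter(label for _, label, _ in attr_list)  # classes not yet seen
    n = len(attr_list)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _ = attr_list[i]
        below[label] += 1
        above[label] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue  # cannot split between two equal values
        n_below = i + 1
        score = (n_below * gini(below) + (n - n_below) * gini(above)) / n
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

# Hypothetical sorted attribute list: (age, risk_class, record_id)
age_list = [(17, "high", 1), (23, "high", 0), (43, "high", 2), (68, "low", 3)]
score, threshold = best_numeric_split(age_list)
```

In a parallel setting each worker would build such histograms for its own partition, and the combined histograms would be scored the same way.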
LDF: Introduction
Represent pattern classifiers in terms of a set of discriminant functions $g_i(x)$, $i = 1, \dots, n$
The classifier assigns a feature vector $x$ to class $w_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$
Transform $g_i(x)$ into the form $g(x) = a^t y$
A sample $y_i$ is classified correctly if $a^t y_i > 0$ and it is labeled $c_1$, or if $a^t y_i < 0$ and it is labeled $c_2$
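The two decision rules can be illustrated with a short sketch. It assumes linear discriminants of the form g_i(x) = w_i . x + b_i and, for the two-class case, the common convention that y is the feature vector x with a leading 1 prepended. The slides do not define y, so that convention and all names and numbers below are assumptions.

```python
def discriminant(weights, bias, x):
    """Linear discriminant g(x) = w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(discriminants, x):
    """Assign x to the class whose discriminant value is largest.
    discriminants maps class_label -> (weights, bias)."""
    return max(discriminants, key=lambda c: discriminant(*discriminants[c], x))

def two_class_label(a, x):
    """Two-class rule on the augmented vector y = (1, x): sign of a . y."""
    y = (1.0, *x)
    score = sum(ai * yi for ai, yi in zip(a, y))
    if score > 0:
        return "c1"
    if score < 0:
        return "c2"
    return None  # exactly on the separating plane: left undecided here

# Hypothetical parameters for three classes in two dimensions
discriminants = {
    "w1": ((1.0, 0.0), 0.0),
    "w2": ((0.0, 1.0), 0.0),
    "w3": ((-1.0, -1.0), 0.5),
}
label = classify(discriminants, (2.0, 1.0))          # "w1"
pair_label = two_class_label((-1.0, 1.0, 0.0), (3.0, 5.0))  # "c1"
```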
LDF contd
Task 1: divide the data into n(n-1)/2 class pairs
The first Mapper sorts the input file by class label and outputs class labels and records
The Reducer collects the sorted records in an output file
The second Mapper takes the sorted file as input and outputs the class label and byte offset of each record
The Reducer takes a class label and a vector of byte offsets as input and collects the class label with the minimum and maximum byte offset for that class
n(n-1)/2 files are created, one per pair of classes, to store the start and end byte offsets of the records of those classes
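A rough sketch of the bookkeeping in Task 1 follows. It assumes a hypothetical record format in which each line is comma-separated with the class label first, and that the lines are already sorted by label. The real input format is not given in the slides.

```python
from itertools import combinations

def class_byte_ranges(sorted_lines):
    """Return {label: (min_offset, max_offset)} of record start offsets,
    for text lines that are already sorted by class label."""
    ranges = {}
    offset = 0
    for line in sorted_lines:
        label = line.split(",", 1)[0]  # assumed: label is the first field
        lo, hi = ranges.get(label, (offset, offset))
        ranges[label] = (min(lo, offset), max(hi, offset))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    return ranges

def class_pairs(ranges):
    """One entry per unordered pair of classes: n(n-1)/2 entries for n classes."""
    return [((a, ranges[a]), (b, ranges[b]))
            for a, b in combinations(sorted(ranges), 2)]

lines = ["a,1.0,2.0", "a,1.5,2.5", "b,3.0,1.0", "c,0.0,0.0"]
ranges = class_byte_ranges(lines)
pairs = class_pairs(ranges)  # 3 classes give 3 pairs
```

Each element of `pairs` corresponds to the content of one of the n(n-1)/2 files: two class labels with the start and end byte offsets of their records.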
LDF contd
Task 2: transform the Fixed-Increment Single-Sample Perceptron algorithm into MapReduce format
The n(n-1)/2 input files create the same number of Map tasks in a new job
All n(n-1)/2 LDFs are created in parallel
The Mapper takes class label pairs along with their offsets as input and produces the two class labels and the corresponding separating vector as output
The Reducer simply collects these in an output file
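The work done inside one Map task, training a separating vector for a single class pair, might look like the sketch below. It uses the usual textbook formulation of the fixed-increment single-sample perceptron: samples are augmented with a leading 1, samples of the second class are negated so that a correct classification always means a . y > 0, and a is incremented by each misclassified y. The epoch cap and the +1/-1 label encoding are assumptions added for this sketch.

```python
def train_perceptron(samples, max_epochs=100):
    """samples: list of (x, label) with label +1 (first class) or -1 (second class).
    Returns (a, converged), where a is the separating vector for y = (1, x)."""
    dim = len(samples[0][0]) + 1
    a = [0.0] * dim
    # Augment each sample and negate those of the second class.
    ys = [[label * v for v in (1.0, *x)] for x, label in samples]
    for _ in range(max_epochs):
        errors = 0
        for y in ys:
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:   # misclassified
                a = [ai + yi for ai, yi in zip(a, y)]       # fixed increment
                errors += 1
        if errors == 0:
            return a, True   # every sample is on the correct side
    return a, False          # cap reached; data may not be linearly separable

# Hypothetical linearly separable class pair
samples = [((2.0, 2.0), 1), ((3.0, 1.0), 1),
           ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
a, converged = train_perceptron(samples)
```

Running one such call per class pair, independently of the others, is what allows all n(n-1)/2 LDFs to be trained in parallel.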
LDF contd
Task 3: store the output file created by Task 2
It contains the parameters of the trained classifier in the form of class label pairs and the corresponding separating vectors
Flow chart
1st phase of MapReduce
Testing of Data
Read Classifier/part00000 and store the weight vectors for each plane
The input test file is divided into chunks; each chunk is given to the Mapper class in Classifier.java
Depending on the output, a correct classification or an incorrect classification is collected
In the Reducer class, the correct and incorrect classifications of each class are summed up and the result is written to Result/part00000
Result Analyzer.java reads this file and shows the output to the user
LDF Driver.java runs the Mapper, Reducer, Classifier and Result Analyzer classes
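The counting logic of the testing phase can be sketched as a map step and a reduce step in plain Python. The `predict` function is a hypothetical stand-in for whatever Classifier.java does with the stored weight vectors. How the pairwise results are combined into one label is not described in the slides, so it is left abstract here.

```python
from collections import Counter

def map_chunk(chunk, predict):
    """Map step: emit ((true_label, outcome), 1) for every test record in a chunk."""
    for x, true_label in chunk:
        outcome = "correct" if predict(x) == true_label else "incorrect"
        yield (true_label, outcome), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each (class, outcome) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def evaluate(chunks, predict):
    emitted = [pair for chunk in chunks for pair in map_chunk(chunk, predict)]
    return reduce_counts(emitted)

# Hypothetical one-dimensional classifier and two chunks of test data
predict = lambda x: "pos" if x[0] > 0 else "neg"
chunks = [[((1.0,), "pos"), ((-1.0,), "pos")], [((-2.0,), "neg")]]
result = evaluate(chunks, predict)
```

The resulting dictionary plays the role of the per-class correct and incorrect totals that the result analyzer would read and display.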
Classes Implemented
LDF Mapper.java
LDF Reducer.java
LDF Driver.java
LDF Classifier.java
LDF Result Analyzer.java
Installation of Mahout
Download the tar files of both the apache-mahout and apache-maven projects
Extract the tar files into a directory
Set the path variables for Maven
Configure settings.xml in the conf directory of the Maven folder to add the HTTP proxy and port
Set the present working directory to Mahout's core folder
Compile the project with 'mvn compile'
Build the project with 'mvn install'
Thanks