Carlo Scarioni
The problems
We manage quite a bit of customer data, starting from the beginning of a customer's search for a
new insurance policy, all the way until they buy (or don't buy) the policy. We keep all of this data,
but for the most part, we don't do anything to improve our customer offering.
Our site looks exactly the same for every customer -- we don't try to engage with them on a more personal level. No customisation exists, which means that each customer's experience doesn't adapt to his or her personality, specific trade, or home or business location. Nothing at all.
Filling out a long form is always boring. But filling it out while being unsure of what information to put where, and being forced to make a phone call to confirm details, is even worse. In many cases, it could mean the customer just gets bored and leaves our form.
The idea
We wanted to create and test a solution that allowed us to group together similar customers using
different sets of dimensions depending on the information we wanted to provide or obtain. We
thought about introducing clustering technology and algorithms to group our customers.
This would be a very rough implementation that would allow us to prove certain techniques and solutions for this type of problem -- it certainly would NOT cover all the nuances that machine learning algorithms and analysis carry with them. Many liberties were taken to get to a proof of concept. The code presented here is not 100% the same code used in the spike, but it is a very close approximation.
This post covers the implementation of the solution.
Solution
Setting up the Clustering backend algorithms to allow multidimensional clustering.
I had already decided that I would put into practice my knowledge of Mahout and Hadoop to run the clustering processing. I installed Hadoop using my own recipes from hadoop-vagrant, first on a local Vagrant cluster and then on an AWS cluster.
Hadoop is a framework that allows the processing of certain types of tasks in a distributed environment using commodity machines, which allows it to scale massively and horizontally. Its main components are the map-reduce execution framework and the HDFS distributed filesystem. For more details, check out my blog post.
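To make the map-reduce model concrete, here is the canonical word-count job written against Hadoop's Java API. This is a generic illustration rather than code from the spike: mappers emit (word, 1) pairs in parallel across the cluster, and reducers then sum the counts for each word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Runs in parallel over input splits: emits one (word, 1) pair per word.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), ONE);
                }
            }
        }
    }

    // Receives all counts for a given word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}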
Getting the data:
After Hadoop was installed, the first task was to find and extract the data. The data was stored in a SQL Server database, so we needed to fetch it and put it into HDFS. There is a fantastic tool called Sqoop that's built just for this. Sqoop not only allows you to get data ready for HDFS, it actually uses Hadoop itself to parallelize the extraction of the data. The standard way to run Sqoop is as follows:
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --connect "jdbc:sqlserver://xxx:1433;username=xxx;password=xxx;databaseName=xxx" --query "select xxx from xxxx" --split-by customer._id --fetch-size 20 --delete-target-dir --target-dir aggregated_customers --package-name "clustercustomers.sqoop" --null-string '' --fields-terminated-by ','
The previous command generates the required Hadoop-compatible files that will be used in the subsequent analysis. The most important part is the query used to extract the data. In our case, in the first iteration, we extracted information like trade, vertical, claims, years_insured, and turnover. These values are the dimensions that we will use to group our "similar" customers.
K-Means Clustering.
I have read quite a bit about different machine learning techniques and algorithms, and I have developed a bit with them in the past, particularly in the recommendation area. The first thing to decide with a machine learning problem is what exactly I want to achieve. First, let's look at the three main problems that machine learning solves, and then follow the reasoning behind my choices.
Machine Learning algorithms in Mahout can be broadly categorized in three main areas:
Recommendation Algorithms: Try to make an informed guess about what things you might
like out of a large domain of things. In the simplest and most common form, the inference is
done based on similarity. This similarity could be based on items that you've already said you
like, or similarity with other users that happen to like the same items as you.
Assume we have a database of movies, and say you like Lethal Weapon.
Item-based similarity: recommendations for movies similar to Lethal Weapon.
User-based similarity: recommendations for movies that other people who liked Lethal Weapon also liked.
Classification Algorithms: In the family of Supervised Learning algorithms (supervised because the set of resolutions and categories is known beforehand). Classification algorithms allow you to assign an item to a particular category given a set of known characteristics (where the category belongs to a limited set of options).
This is the technique used in spam detection systems.
Let's say you decide that any email with at least two of the following characteristics: 4 or more images, 4 words written in all-capital letters, and the text 'congratulations' with an exclamation mark at the end, should be marked as Spam, and anything with fewer than two of them should be let through.
Clustering Algorithms: In the family of Unsupervised Learning algorithms, because there are no predefined categories. Clustering algorithms group similar items together into clusters that emerge from the data itself. This is the technique used in this post to group similar customers.
To feed the clustering, each customer first has to be converted into a vector of numeric dimensions. For the case of employees, I converted the values to consecutive numbers like (values are made up for this example; a small sketch of the encoding follows the list):
"less than 15 employees" -> 1
"between 15 and 50 employees" -> 2
"between 50 and 200 employees" -> 3
For the two discrete properties, product and trade, I have to create individual dimensions for each of the discrete values they can take. As in my example I was only going to use Baker and Accountant for trades, and Shop and Business for products, the final dimensions vector ended up looking something like:

| shop | business | accountant | baker | turnover | employees | claims |
So let's say we wanted to model an accountant with a turnover of 50000, 20 employees, and 2 claims. His vector would look like:

| 0 | 1 | 1 | 0 | 50000 | 2 | 2 |
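Built with Mahout's vector classes, that example would look something like the following sketch (the customer id used as the vector's name is made up):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;

public class VectorExample {
    public static void main(String[] args) {
        // | shop | business | accountant | baker | turnover | employees | claims |
        NamedVector accountant = new NamedVector(
                new DenseVector(new double[]{0, 1, 1, 0, 50000, 2, 2}),
                "customer-42"); // hypothetical customer id
        System.out.println(accountant);
    }
}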
We can already see a problem with this vector: the value of turnover is much larger than the rest of the dimensions. This means that the distance calculations will be extremely influenced by this value; we say it has a much bigger weight than the rest. For the example, we assume that there is a maximum turnover of 100000.
In our case, we want to give extra weight to the product and trade dimensions and make turnover much less significant.
Mahout offers some functionality for doing just that, normally through an implementation of the class WeightedDistanceMeasure. It works by building a vector with multipliers for each of the dimensions of the original vector. This weight vector needs to be the same size as the dimensions vector. In our case, we could have a vector like this:

| 10 | 10 | 5 | 5 | 1/100000 | 1/2 | 1/10 |

The effect of that vector will be to alter the values of the original by multiplying the product dimensions by 10 and the trade dimensions by 5, making sure that turnover is always less than 1, halving the influence of the number of employees, and making claims less influential.
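The K-Means command further down references a class called clustercustomers.mahout.CustomWeightedEuclideanDistanceMeasure. Its actual contents aren't shown in this post, but a minimal version could simply extend Mahout's WeightedEuclideanDistanceMeasure and set the weight vector above:

package clustercustomers.mahout;

import org.apache.mahout.common.distance.WeightedEuclideanDistanceMeasure;
import org.apache.mahout.math.DenseVector;

// A sketch only: the real class from the spike is not shown in the post.
public class CustomWeightedEuclideanDistanceMeasure extends WeightedEuclideanDistanceMeasure {

    public CustomWeightedEuclideanDistanceMeasure() {
        // | shop | business | accountant | baker | turnover | employees | claims |
        setWeights(new DenseVector(new double[]{10, 10, 5, 5, 1.0 / 100000, 0.5, 0.1}));
    }
}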
NOTE: Finding the correct dimensions and weights for a clustering algorithm is a really hard exercise which normally requires multiple iterations to find the "best" solution. Our example, in keeping with the spike approach of this hackathon, uses completely arbitrary values chosen just to prove the technique, not carefully crafted normalizations of the data. If these values are good enough for our examples, then they are good enough.
Following are the main parts of the raw code written to convert the initial data to a list of vectors:
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.Tool;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class VectorCreationMapReduce extends Configured implements Tool {

    public static class VectorizerMapper extends Mapper<LongWritable, Text, Text, VectorWritable> {

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            VectorWritable writer = new VectorWritable();
            String[] values = value.toString().split("\\|");
            double[] verticals = vectorForVertical(values[1]);
            double[] trade = vectorForTrade(values[2]);
            double[] turnover = vectorForDouble(values[3]);
            double[] claimCount = vectorForDouble(values[4]);
            double[] xCoordinate = vectorForDouble(values[7]);
            double[] yCoordinate = vectorForDouble(values[8]);
            // The original listing is cut off mid-statement here; the rest of
            // the method is a plausible completion: name the vector after the
            // customer id and write it out for the clustering jobs.
            NamedVector vector = new NamedVector(
                    new DenseVector(concatArrays(verticals, trade, turnover, claimCount, xCoordinate, yCoordinate)),
                    values[0]);
            writer.set(vector);
            context.write(new Text(vector.getName()), writer);
        }

        // Helper methods vectorForVertical, vectorForTrade, vectorForDouble
        // and concatArrays, and the Tool run() method, are not shown here.
    }
}
After the vectors are generated (and an initial set of centroids is produced with Mahout's Canopy clustering), the K-Means job is run like this:

hadoop org.apache.mahout.clustering.kmeans.KMeansDriver -i vector_seq_file/part-m-00000 -c customer-centroids/clusters-0-final -o customer-kmeans -dm clustercustomers.mahout.CustomWeightedEuclideanDistanceMeasure -x 10 -ow --clustering
In this command, we are specifying that we want to run the KMeansDriver Hadoop job. We pass in the input vector file again (-i), and we specify the centroids file generated by Canopy with the -c option. We then specify where the clustering output should go (-o), the use of the weighting mechanism again (-dm), and how many iterations we want to run over the data (-x).
Here's a quick overview of how K-Means actually works:
The K-Means clustering algorithm starts with the given set of K centroids and iterates, adjusting the centroids until the iteration limit X is reached or until the centroids converge to a point from which they don't move. Each iteration has two steps and works the following way:
- For each point in the input, it finds the nearest centroid and assigns the point to the cluster represented by that centroid.
- At the end of the iteration, the points of each cluster are averaged to recalculate the new centroid position.
- If the maximum number of iterations is reached, or the centroid points don't move any more, the clustering concludes.
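To make those two steps concrete, here is a single-machine sketch of the same loop in plain Java (illustrative only; Mahout's distributed implementation differs in the details):

import java.util.ArrayList;
import java.util.List;

public class KMeansSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Iteratively refines the given centroids against the input points.
    static void kMeans(List<double[]> points, double[][] centroids, int maxIterations) {
        for (int iter = 0; iter < maxIterations; iter++) {
            // Step 1: assign each point to the cluster of its nearest centroid.
            List<List<double[]>> clusters = new ArrayList<>();
            for (int c = 0; c < centroids.length; c++) {
                clusters.add(new ArrayList<>());
            }
            for (double[] p : points) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (distance(p, centroids[c]) < distance(p, centroids[nearest])) {
                        nearest = c;
                    }
                }
                clusters.get(nearest).add(p);
            }
            // Step 2: recalculate each centroid as the average of its points.
            boolean converged = true;
            for (int c = 0; c < centroids.length; c++) {
                if (clusters.get(c).isEmpty()) {
                    continue; // keep an empty cluster's centroid where it is
                }
                double[] mean = new double[centroids[c].length];
                for (double[] p : clusters.get(c)) {
                    for (int i = 0; i < mean.length; i++) {
                        mean[i] += p[i];
                    }
                }
                for (int i = 0; i < mean.length; i++) {
                    mean[i] /= clusters.get(c).size();
                }
                if (distance(mean, centroids[c]) > 1e-9) {
                    converged = false;
                }
                centroids[c] = mean;
            }
            // Stop early once no centroid moves any more.
            if (converged) {
                break;
            }
        }
    }
}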
K-Means (and Canopy as well) is a parallelizable algorithm, meaning that you can have many jobs working on subsets of the problem and aggregating results. This is where Hadoop comes in for the clustering execution. Internally, Mahout -- and in particular KMeansDriver -- is built to work on the Hadoop map-reduce infrastructure. By leveraging Hadoop's proven map-reduce implementation, Mahout algorithms are able to scale to very big data sets and process them in parallel.
After generating the clusters, the next step is to create an individual file per cluster, as well as a single cluster file with the simple syntax (cluster_id, customer_id). This is done with the following map and reduce methods:
public static class ClusterPassThroughMapper extends Mapper<IntWritable, WeightedVectorWritable, IntWritable, Text> {

    public void map(IntWritable key, WeightedVectorWritable value, Context context) throws IOException, InterruptedException {
        NamedVector vector = (NamedVector) value.getVector();
        context.write(key, new Text(vector.getName()));
    }
}

public static class ClusterPointsToIndividualFile extends Reducer<IntWritable, Text, IntWritable, Text> {

    private MultipleOutputs mos;

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }

    public void reduce(IntWritable key, Iterable<Text> value, Context context) throws IOException, InterruptedException {
        // Write each customer id both to a per-cluster file ("cluster<N>")
        // and to the job's main (cluster_id, customer_id) output.
        for (Text text : value) {
            mos.write("seq", key, text, "cluster" + key.toString());
            context.write(key, text);
        }
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
This has allowed us to obtain clusters of customers - next week's post will explore what can be
done with these clusters!
PART 2
Read the first part of this hackathon implementation at Clustering our customers to get a full background on what's being presented here!
So, for example, when a customer is filling out the form, we can invoke an endpoint on this new service like this: GET /?trade=accountant&product=business&claims=2&turnover=25000&employees=2
This call will vectorise that information, find the correct cluster, and return the information for the given cluster. The important parts of the code follow:
def similarBusinesses(business: Business): Seq[Business] = {
  loadCentroids()
  val cluster = clusterForBusiness(business)
  val maxBusinesses = 100
  var currentBusinesses = 0
  CustomHadoopTextFileReader.readFile(s"hdfs://localhost:9000/individual-clusters/cluster${cluster}-r-00000") {
    line =>
      val splitted = line.split("\t")
      userIds += splitted(1)
      currentBusinesses += 1
  }(currentBusinesses < maxBusinesses)
  businessesForUserIds(userIds)
}
That code finds the cluster for the business arriving from the main application. Then it reads the HDFS file representing that individual cluster and gets the business information for the returned user ids.
To find the cluster to which the business belongs, we compare against the stored centroids:
private def clusterForBusiness(business: Business): String = {
  val businessVector = business.vector
  var currentDistance = Double.MaxValue
  var selectedCentroid: (String, Cluster) = null
  for (centroid <- SimilarBusinessesRetriever.centroids) {
    if (distance(centroid._2.getCenter, businessVector) < currentDistance) {
      currentDistance = distance(centroid._2.getCenter, businessVector)
      selectedCentroid = centroid
    }
  }
  clusterId = Integer.valueOf(selectedCentroid._1)
  selectedCentroid._1
}
The code that actually reads from the Hadoop filesystem follows; it looks much like reading a simple file:
object CustomHadoopTextFileReader {
  def readFile(filePath: String)(f: String => Unit)(g: => Boolean = true) {
    try {
      val pt = new Path(filePath)
      val br = new BufferedReader(new InputStreamReader(SimilarBusinessesRetriever.fs.open(pt)))
      var line = br.readLine()
      while (line != null && g) {
        f(line)
        line = br.readLine()
      }
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
  }
}
Then to return the premium for a particular cluster:
01. def averagePremium(cluster: Int): Int = {
02. CustomHadoopTextFileReader.readFile("hdfs://localhost:9000
/premiumsAverage/partr00000") {
03.
line =>
04.
val splitted = line.split("\\|")
05.
if (cluster.toString == splitted(0)) {
06.
return Math.ceil(java.lang.Double.valueOf(splitted(1))).toInt
07.
}
08. }(true)
09. 0
10. }
Any of these values can then be returned to the calling application - this can provide a list of "similar" businesses or retrieve the average premium paid by those similar businesses.
This was it for the first iteration of the hack! The second iteration was to use the same infrastructure to cluster customers based on the location of their businesses instead of the dimensions used here. Given that the basic procedure for clustering remains the same whatever the dimensions utilised, the code for that looks similar to the code presented here.
Published at DZone with permission of Carlo Scarioni, author and DZone MVB. (source)