1. INTRODUCTION
The social networking site Facebook has gained enormous popularity worldwide. Every day, millions of users share information in the form of text, images, and videos across billions of pages. Facebook engineers and analysts manipulate this large data set using Hadoop; the data set occupies 1 petabyte of disk space and is processed on 2,500 CPU cores. [1]
Hadoop is an open-source distributed framework used for distributed storage and processing. It was founded by the Apache Software Foundation and is typically used to process large data sets. Hadoop provides a distributed file system: large files are split into small blocks and distributed over a group of machines referred to as a cluster, and data packets are transferred to the cluster nodes in parallel. MapReduce is the programming model for big data in Hadoop; initially, Hadoop offered this model on the Java platform, as sketched below. [2]
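To make the MapReduce model concrete, the following is a minimal word-count job in Java, modelled on the canonical example from the Hadoop documentation; the input and output paths are hypothetical command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}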
Fig. 1 Architecture of Hadoop

Hadoop has a two-level architecture, viz. a MapReduce layer and an HDFS layer. The nodes are nothing but commodity PCs, and a rack consists of at least 30 to 40 nodes; the location of a node can be tracked through its rack. Data shared with the cluster is replicated across nodes by the HDFS layer. HDFS is a distributed file system in which the data is stored in distributed form: one replica is produced on the local node, the other two are placed together on another rack, and any additional replicas are placed anywhere else in HDFS. This reduces the loss of data when a node fails, and a failed data node can also be detected. [3]
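As a small illustrative sketch (the file path is hypothetical), a client can request a per-file replication factor through the standard FileSystem API, while the name node decides the actual placement of the replicas:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Request three replicas for this (hypothetical) existing file;
    // the name node chooses the nodes and racks per its placement policy.
    fs.setReplication(new Path("/user/demo/input.txt"), (short) 3);
    fs.close();
  }
}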
1.1 Hadoop for Facebook
Facebook used an SQL query model for mapping and processing large data. The reason Facebook selected Hadoop is the high reliability it builds into each application. Because of the large amount of data shared every day, ...

1.3 Hadoop Distributed File System (HDFS)
In HDFS, every node is assigned a name. Hadoop has a single name node, while the other machines form a cluster of data nodes. Redundancy is maintained on the name node, and the data nodes serve the file blocks to other nodes. The entire cluster of nodes has a single namespace, and data coherency is maintained by writing once and reading many times. [4]
The entire metadata is held in the main memory of the name-node server. The metadata takes the form of the list of files, the file attributes, and the list of blocks and data nodes. A transaction log keeps a record of file creation and deletion. [5]
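As a hedged illustration of this metadata, the following Java sketch asks the name node for the attributes and block locations of a hypothetical file through the standard FileSystem API; the printed block locations show which data nodes hold each replica.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // File attributes come from the name node's in-memory metadata.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
    System.out.println("size=" + status.getLen()
        + " replication=" + status.getReplication());
    // Each block is mapped to the data nodes that hold its replicas.
    for (BlockLocation block :
         fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(block);  // offset, length, host names
    }
    fs.close();
  }
}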
Fig. 2 Hadoop cloud data flow

The K-mean clustering algorithm proceeds as follows (a minimal code sketch follows the list):
1. Choose K initial mean values, one per cluster.
2. Create clusters by assigning each observation to its nearest mean, partitioning the data.
3. Declare the new mean of each of the K clusters to be its centroid.
4. Repeat steps 2 and 3 until convergence.
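The following is a minimal univariate K-mean sketch in Java of the four steps above; the data values and the fixed random seed are made up for illustration.

import java.util.Arrays;
import java.util.Random;

public class KMeans1D {
  public static void main(String[] args) {
    double[] data = {1.0, 1.2, 0.8, 5.0, 5.2, 4.9, 9.0, 9.1, 8.8};
    int k = 3;
    Random rnd = new Random(42);

    // Step 1: choose K initial means (random points from the data).
    double[] means = new double[k];
    for (int j = 0; j < k; j++) {
      means[j] = data[rnd.nextInt(data.length)];
    }

    int[] assignment = new int[data.length];
    boolean changed = true;
    while (changed) {  // Step 4: repeat until the assignments converge.
      changed = false;
      // Step 2: partition the observations by their nearest mean.
      for (int i = 0; i < data.length; i++) {
        int best = 0;
        for (int j = 1; j < k; j++) {
          if (Math.abs(data[i] - means[j]) < Math.abs(data[i] - means[best])) {
            best = j;
          }
        }
        if (assignment[i] != best) {
          assignment[i] = best;
          changed = true;
        }
      }
      // Step 3: declare the new mean of each cluster to be its centroid.
      double[] sum = new double[k];
      int[] count = new int[k];
      for (int i = 0; i < data.length; i++) {
        sum[assignment[i]] += data[i];
        count[assignment[i]]++;
      }
      for (int j = 0; j < k; j++) {
        if (count[j] > 0) {
          means[j] = sum[j] / count[j];
        }
      }
    }
    System.out.println("means = " + Arrays.toString(means));
  }
}

Convergence is detected here when no assignment changes between iterations, which is one common stopping rule; the random initialization is exactly why, as noted below, the final result depends on the initial cluster values.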
The optimization and the final result depend on the initial cluster values, and the time required for convergence can increase exponentially as the number of clusters increases, although the algorithm has polynomial smoothed running time. The algorithm has the following advantages:
1. It is applicable to univariate data.
2. It can use the median instead of the mean to normalize the data.
3. A provable upper bound limit can be given for K-mean.
4. It is suitable for sharing text and image data.
5. The triangle inequality can be used to speed up K-mean clustering.
6. With some changes to the algorithm, an optimal number of clusters can be obtained, and the cluster size is deterministic.