
Big Data

02/02/2016
SuperValu Inc.
Deepika Sharnagat
deepika.sharnagat@tcs.com
Confidentiality Statement
Confidentiality and Non-Disclosure Notice
The information contained in this document is confidential and proprietary to TATA
Consultancy Services. This information may not be disclosed, duplicated or used for any
other purposes. The information contained in this document may not be released in whole or
in part outside TCS for any purpose without the express written permission of TATA
Consultancy Services.

Tata Code of Conduct


We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata Code
of Conduct. We request your support in helping us adhere to the Code in letter and spirit.
We request that any violation or potential violation of the Code by any person be promptly
brought to the notice of the Local Ethics Counsellor or the Principal Ethics Counsellor or the
CEO of TCS. All communication received in this regard will be treated and kept as
confidential.
Table of Contents

1. Big Data – Beginning Big Data


2. Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety
3. Big Data – Evolution of Big Data
4. Big Data – Basics of Big Data Architecture
5. Big Data – Buzz Words: What is NoSQL
6. Big Data – Buzz Words: What is Hadoop
7. Big Data – Buzz Words: What is MapReduce
8. Big Data – Buzz Words: What is HDFS
9. Big Data – Buzz Words: Importance of Relational Database in Big Data World
10. Big Data – Buzz Words: What is NewSQL
11. Big Data – Role of Cloud Computing in Big Data
12. Big Data – Operational Databases Supporting Big Data – RDBMS and NoSQL
13. Big Data – Data Mining with Hive – What is Hive? – What is HiveQL (HQL)?
14. Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin?
15. Big Data – Interacting with Hadoop – What is Sqoop? – What is Zookeeper?
16. Big Data – Basics of Big Data Analytics
17. Big Data – How to become a Data Scientist and Learn Data Science?
1. Big Data – Beginning Big Data

What is Big Data?


Does Big Data really mean data is big?

Big Data – Big Thing!


Big Data is one of the most talked-about technology trends today. The real challenge for large organizations is to get the maximum value out of the data they already have and to predict what kind of data to collect in the future. How to take existing data and make it meaningful, so that it provides accurate insight into the past, is a key discussion point in many executive meetings. With the explosion of data, the challenge has moved to the next level, and Big Data is now becoming a reality in many organizations.

Big Data – A Rubik’s Cube

I like to compare Big Data with a Rubik's cube; the two have many similarities. Just like a Rubik's cube, Big Data has many different solutions. Imagine a Rubik's cube solving challenge with many experts participating. If you take five cubes, scramble them all the same way and give them to five different experts, it is quite possible that all five will solve the cube in seconds, but if you watch closely you will notice that even though the final outcome is the same, the route taken to get there is not. Every expert starts at a different place and tries different methods. Some solve one color first and others another. Even though they follow the same kind of algorithm, they start and end at different places and their moves differ on many occasions. It is nearly impossible for two experts to take exactly the same route.
Big Market and Multiple Solutions
Big Data is exactly like a Rubik's cube: even though the goal of every organization and expert is the same, to get the maximum out of the data, the route and the starting point are different for each of them. As organizations evaluate and architect Big Data solutions, they are also learning about the opportunities related to Big Data. There is no single solution to Big Data, and no single vendor can claim to know everything about it. Honestly, Big Data is too big a concept, and there are many players: different architectures, different vendors and different technologies.
2. Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety

Data is forever. Think about it: are you using any application, as it is, that was built 10 years ago? Are you using any piece of hardware that was built 10 years ago? The answer is almost certainly no. However, if I ask whether you are using any data that was captured 50 years ago, the answer is almost certainly yes. For example, look at the history of a nation. I am from India, and we have documented history that goes back thousands of years. Or just look at our birth records; we still use them today. Data never gets old; it stays around forever. The applications that interpret and analyze data change, but in most cases the data remains in its original form.

As organizations have grown, the data associated with them has grown exponentially, and today that data comes with many complexities. Most large organizations have data in multiple applications and in different formats. The data is also spread out so widely that it is hard to categorize with a single algorithm or piece of logic. The mobile revolution we are experiencing right now has completely changed how we capture data and build intelligent systems. Large organizations are indeed facing challenges in keeping all their data on a platform that gives them a single, consistent view. This unique challenge, making sense of all the data coming in from different sources and deriving useful, actionable information from it, is what the Big Data world is about.
Defining Big Data

The 3 Vs that define Big Data are Volume, Velocity and Variety.

Volume
We currently see exponential growth in data storage because data is now far more than text. We find data in the form of videos, music and large images on our social media channels. It is very common for enterprises to have terabytes, even petabytes, of storage. As the data grows, the applications and architecture built to support it need to be re-evaluated quite often. Sometimes the same data is analyzed from multiple angles, and even though the original data is unchanged, the newly found intelligence creates a further explosion of data. This large volume is what the "big" in Big Data represents.

Velocity

Data growth and the social media explosion have changed how we look at data. There was a time when we believed that yesterday's data was recent; as a matter of fact, newspapers still follow that logic. However, news channels and radio changed how fast we receive news, and today people rely on social media to keep them updated with the latest happenings. On social media, a message that is only a few seconds old (a tweet, a status update and so on) is often no longer interesting; users discard old messages and pay attention to recent updates. Data movement is now almost real time, and the update window has shrunk to fractions of a second. This high velocity of data represents Big Data.

Variety

Data can be stored in multiple formats: in a database, in Excel, CSV or Access files, or, for that matter, in a simple text file. Sometimes the data is not even in a traditional format; it may be video, SMS, PDF or something we have not thought about yet. The organization needs to arrange it and make it meaningful. That would be easy if all the data were in the same format, but most of the time it is not. The real world has data in many different formats, and that is the challenge Big Data needs to overcome. This variety of data represents Big Data.

Big Data in Simple Words


Big Data is not just about lots of data. It is a concept that provides an opportunity to find new insight into your existing data, as well as guidelines for capturing and analyzing your future data. It makes a business more agile and robust, so it can adapt and overcome business challenges.
3. Big Data – Evolution of Big Data

Data in Flat File

In the early days, data was stored in flat files, and flat files had no structure. Retrieving any data from a flat file was a project in itself. There was no way to retrieve data efficiently, and data integrity was just a term that was discussed, without any modeling or structure around it. Databases residing in flat files had more issues than we would like to discuss today; any data processing in an application was close to a nightmare. Although the applications developed at that time were also not that advanced, the need for data was always there, and so was the need for proper data management.

Edgar F Codd and 12 Rules

Edgar Frank Codd was a British computer scientist who, while working for IBM, invented the relational model for database management, the theoretical basis for relational databases. He presented 12 rules for the relational database, and suddenly the chaotic database world saw discipline in those rules. The relational database was a promised land for users of unstructured databases: it introduced relationships between data and improved the performance of data retrieval. The database world immediately saw a major transformation, and vendors and database users alike started to adopt the relational database model.

Relational Database Management Systems

After Edgar F. Codd proposed his 12 rules for the RDBMS, many different vendors started to build applications and tools to support relationships within databases. This was a learning curve for many developers who had never worked with database modeling before. However, as time passed, pretty much everybody accepted the relational approach and started to evolve products that perform at their best within the boundaries of RDBMS concepts. This was a golden era for databases, and it gave the world true experts as well as some of the best products. The entity-relationship model also evolved at the same time: in software engineering, an entity-relationship model (ER model) is a data model for describing a database in an abstract way.

Enormous Data Growth

Everything was going fine with the RDBMS in the database world. As there were no major challenges, the adoption of RDBMS applications and tools was pretty much universal, and at times there was a race to make the developer's life easier with RDBMS management tools. Because the systems were extremely popular and easy to use, pretty much all data was stored in RDBMS systems. New-age applications were built, and social media took the world by storm. Every organization felt pressure to provide the best experience for its users based on the data it had. While all this was going on, data was growing in pretty much every organization and application.

Data Warehousing

The enormous data growth presented a big challenge for organizations that wanted to build intelligent systems on top of their data and provide a superior, near real time user experience to their customers. Organizations immediately started building data warehousing solutions where the data was stored and processed, and business intelligence became an everyday need. Data was received from the transaction systems and processed overnight to build intelligent reports. Although this is a great solution, it has its own set of challenges: the relational database model and data warehousing concepts were all built with traditional relational modeling in mind, and they still struggle when unstructured data is present.

Interesting Challenge

Every organization had the expertise to manage structured data, but the world had already shifted to unstructured data. There was intelligence in videos, photos, SMS, text, social media messages and various other data sources. All of these now needed to be brought to a single platform to build a uniform system that does what businesses need. The way we do business has also changed: there was a time when users only got the features that technology supported, whereas now users ask for a feature and the technology is built to support it. Real time intelligence from a fast-paced data flow is becoming a necessity.
A large amount (Volume) of diverse (Variety), high speed (Velocity) data: these are the properties of this new data. Traditional database systems are limited in resolving the challenges this new kind of data presents. Hence the need for Big Data science. We need innovation in how we handle and manage data, and creative ways to capture data and present it to users.
Big Data is Reality!
4. Big Data – Basics of Big Data Architecture

Big Data Cycle

Just like every other database-related application, a Big Data project has its own development cycle, and the three Vs discussed earlier certainly play an important role in deciding its architecture. Like every other project, a Big Data project also goes through similar phases: capturing, transforming, integrating and analyzing the data, and building actionable reporting on top of it.
While the process looks almost the same, due to the nature of the data the architecture is often totally different.

Building Blocks of Big Data Architecture


In a Big Data architecture, the various components are closely associated with each other. Many different data sources are part of the architecture, so extract, transform and integration layers are among its most essential parts. Most of the data is stored in relational as well as non-relational data marts and data warehousing solutions. As per business needs, the data is processed and converted into proper reports and visualizations for end users. Just like the software, the hardware is an essential part of a Big Data architecture: hardware infrastructure is extremely important, and failover instances as well as redundant physical infrastructure are usually implemented.

NoSQL in Data Management

NoSQL is a famous buzzword, and it really means "Not Only SQL" (sometimes read as "not relational SQL"). In a Big Data architecture, data arrives in any format: it can be unstructured, relational, in any other format or from any other data source. Relational technology alone is not enough to bring all this data together, so new tools, architectures and algorithms have been invented to take care of all kinds of data. These are collectively called NoSQL.
5. Big Data – What is NoSQL

What is NoSQL?
NoSQL stands for "Not Only SQL" (sometimes read as "not relational SQL"). Many people think NoSQL means there is no SQL at all, which is not true; the terms sound the same, but the meanings are totally different. NoSQL systems may use SQL, but they use more than SQL to achieve their goals. As per Wikipedia's NoSQL database definition: "A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases."

Why use NoSQL?


A traditional relational database usually deals with predictable, structured data. As the world has moved toward unstructured data, we often see the limitations of the traditional relational database in dealing with it. For example, nowadays we have data in the form of SMS messages, audio files, photos and video. It is difficult to manage these with a traditional relational database. I often see people using a BLOB field to store such data; a BLOB can store the data, but retrieving or processing unstructured data from a BLOB is extremely slow. A NoSQL database is the type of database that can handle the unstructured, unorganized and unpredictable data that our business needs.
Along with support for unstructured data, the other advantages of NoSQL databases are high performance and high availability.

Eventual Consistency

Note, additionally, that a NoSQL database may not provide 100% ACID (Atomicity, Consistency, Isolation, Durability) compliance. Although many NoSQL databases do not support ACID, they provide eventual consistency: over a period of time, all updates can be expected to propagate through the system, and the data will become consistent.

Taxonomy
Taxonomy is the practice of classifying things or concepts according to a set of principles. The common NoSQL taxonomy covers column stores, document stores, key-value stores and graph databases. Here are a few examples of each NoSQL category:
 Column: HBase, Cassandra, Accumulo
 Document: MongoDB, Couchbase, RavenDB
 Key-value: Dynamo, Riak, Azure Table Storage, Redis, Caché, GT.M
 Graph: Neo4j, AllegroGraph, Virtuoso, Bigdata
6. Big Data – What is Hadoop

What is Hadoop?

Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful distributed platform to store and manage Big Data. It is licensed under the Apache License, Version 2.0. It runs applications on large clusters of commodity hardware and can process thousands of terabytes of data across thousands of nodes. Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers. The major advantage of the Hadoop framework is that it provides reliability and high availability.

What are the core components of Hadoop?


There are two major components of the Hadoop framework, and each of them performs one of its two most important tasks.

 Hadoop MapReduce is the method that splits a larger data problem into smaller chunks and distributes them to many different commodity servers. Each server has its own set of resources and processes its chunk locally. Once a commodity server has processed its data, it sends the result back to the main server, where the results are collected. This is how large data is processed effectively and efficiently.

 Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and other file systems: when we move a file onto HDFS, it is automatically split into many small pieces, and these small chunks of the file are replicated and stored on other servers (usually three) for fault tolerance and high availability.

Besides the two core components above, the Hadoop project also contains the following modules:
 Hadoop Common: common utilities that support the other Hadoop modules
 Hadoop YARN: a framework for job scheduling and cluster resource management

A Multi-node Hadoop Cluster Architecture

Now let us quickly look at the architecture of a multi-node Hadoop cluster.

A small Hadoop cluster includes a single master node and multiple worker (slave) nodes. As discussed earlier, the entire cluster contains two layers: a MapReduce layer and an HDFS layer, and each layer has its own relevant components. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of a DataNode and TaskTracker. It is also possible for a worker node to be a data-only or compute-only node; in fact, that flexibility is a key feature of Hadoop.

Why Use Hadoop?


There are many advantages to using Hadoop. Let me quickly list them here:

 Robust and Scalable – We can add new nodes as needed, as well as modify them.
 Affordable and Cost Effective – We do not need any special hardware to run Hadoop; we can just use commodity servers.
 Adaptive and Flexible – Hadoop is built keeping in mind that it will handle structured and unstructured data.
 Highly Available and Fault Tolerant – When a node fails, the Hadoop framework automatically fails over to another node.

Why is Hadoop Named Hadoop?

Hadoop was created in 2005 by Doug Cutting and Mike Cafarella, and its development was later taken up at Yahoo, where Doug Cutting worked. Doug Cutting named Hadoop after his son's toy elephant.
7. Big Data – What is MapReduce

What is MapReduce?

MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Although MapReduce was originally Google's proprietary technology, it has become quite a generalized term in recent times.

MapReduce comprises a Map() and a Reduce() procedure. The Map() procedure performs filtering and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on it. This model is based on modified versions of the map and reduce functions commonly available in functional programming. Libraries implementing the Map() and Reduce() procedures have been written in many different languages; the most popular free implementation of MapReduce is Apache Hadoop.

Advantages of MapReduce Procedures

The MapReduce framework usually spans distributed servers and runs various tasks in parallel. Various components manage the communication between the nodes holding the data and provide high availability and fault tolerance. Programs written in the MapReduce functional style are automatically parallelized and executed on commodity machines. The framework takes care of the details of partitioning the data and executing the processes on the distributed servers at run time. If any failure occurs during this process, the framework provides high availability, and the remaining nodes take over the responsibility of the failed node.

As you can clearly see, the MapReduce framework provides much more than just the Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce framework processes many petabytes of data on thousands of processing machines.

How Does the MapReduce Framework Work?

A typical MapReduce deployment holds petabytes of data across thousands of nodes. Here is a basic explanation of the MapReduce procedures that use this massive pool of commodity servers.

Map() Procedure

There is always a master node in this infrastructure that takes the input. Right after taking the input, the master node divides it into smaller sub-inputs or sub-problems, which are distributed to worker nodes. A worker node then processes them and does the necessary analysis. Once a worker node has completed the work on its sub-problem, it returns the result to the master node.

Reduce() Procedure
All the worker nodes return the answers to the sub-problems assigned to them to the master node. The master node collects these answers and aggregates them into the answer to the original big problem it was given.

The MapReduce framework runs the Map() and Reduce() procedures in parallel and independently of each other. All the Map() procedures can run in parallel, and once each worker node has completed its task, it sends the result back to the master node to be combined into a single answer. This procedure can be very effective when it is applied to a very large amount of data (Big Data).
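To make the Map() and Reduce() procedures concrete, here is a minimal word-count sketch in Python, written in the Hadoop Streaming style: the mapper and the reducer read lines from standard input and write tab-separated key/value lines to standard output. The file names and the word-count task itself are illustrative assumptions, not something prescribed by the text above.

# mapper.py - the Map() step: emit (word, 1) for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Hadoop Streaming expects "key<TAB>value" lines on stdout
        print(word + "\t1")

# reducer.py - the Reduce() step: sum the counts for each word
# (Hadoop Streaming delivers the mapper output sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Locally, the same flow can be simulated with: cat input.txt | python mapper.py | sort | python reducer.py. On a cluster, Hadoop runs the mapper on many worker nodes in parallel and feeds the sorted output to the reducers, exactly as described above.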

The MapReduce framework has five different steps:

 Preparing the Map() input
 Executing the user-provided Map() code
 Shuffling the Map output to the Reduce processors
 Executing the user-provided Reduce() code
 Producing the final output
Here is the dataflow of the MapReduce framework:

 Input Reader
 Map Function
 Partition Function
 Compare Function
 Reduce Function
 Output Writer

MapReduce in a Single Statement

MapReduce is equivalent to a SELECT with GROUP BY in a relational database, applied to a very large database.
8. Big Data – What is HDFS

What is HDFS?
HDFS stands for Hadoop Distributed File System, and it is the primary storage system used by Hadoop. It provides high-performance access to data across Hadoop clusters and is usually deployed on low-cost commodity hardware. In commodity hardware deployments, server failures are very common, so HDFS is built for high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which helps reduce the impact of node failures.

HDFS splits big data into smaller pieces and distributes them across different nodes. It also copies each piece multiple times onto different nodes, so when any node holding the data crashes, the system can automatically use the data from another node and continue processing. This is the key feature of the HDFS system.
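The following Python snippet is a purely illustrative sketch of that idea (it is not how HDFS is actually implemented): it splits a file into fixed-size blocks and assigns each block to three of the available DataNodes. The block size, node names and file name are assumptions made for the example.

# toy_block_placement.py - illustration only: split a file into blocks and
# choose 3 DataNodes to hold replicas of each block
import itertools

BLOCK_SIZE = 128 * 1024 * 1024                    # 128 MB per block (assumed)
DATANODES = ["datanode1", "datanode2", "datanode3", "datanode4"]

def split_into_blocks(path, block_size=BLOCK_SIZE):
    # Read the file one block at a time
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block

def place_replicas(num_blocks, nodes=DATANODES, replication=3):
    # Assign each block to `replication` nodes, cycling through the cluster
    ring = itertools.cycle(nodes)
    return {block_id: [next(ring) for _ in range(replication)]
            for block_id in range(num_blocks)}

# Example: a file that splits into 4 blocks, each replicated on 3 nodes
print(place_replicas(4))

In real HDFS, the NameNode makes these placement decisions (taking racks into account), and the DataNodes store the actual blocks, as described in the next sections.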

Architecture of HDFS
HDFS has a master/slave architecture. An HDFS cluster always consists of a single NameNode. This NameNode is the master server; it manages the file system namespace and regulates access to files. In addition to the NameNode, there are multiple DataNodes, typically one DataNode per data server. In HDFS, a big file is split into one or more blocks, and those blocks are stored in a set of DataNodes.
The primary task of the NameNode is to open, close or rename files and directories and to regulate access to the file system, whereas the primary task of a DataNode is to read from and write to the file system. A DataNode is also responsible for the creation, deletion or replication of data based on instructions from the NameNode.
In practice, the NameNode and DataNode are pieces of software written in Java and designed to run on commodity machines.

Visual Representation of HDFS Architecture


Let us walk through how HDFS works. A client application (HDFS client) connects to the NameNode as well as to DataNodes, and its access to the DataNodes is regulated by the NameNode: the NameNode allows the client to connect directly to the appropriate DataNodes. A big data file is divided into multiple data blocks (let us assume those blocks are A, B, C and D). The client application then writes the data blocks directly to the DataNodes. It does not have to write to every node; it writes to one node, and the NameNode decides on which other DataNodes the data has to be replicated. In our example, the client writes directly to DataNode 1 and DataNode 3, and the data blocks are then automatically replicated to other nodes. All the bookkeeping information, such as which data block is placed on which DataNode, is written back to the NameNode.

High Availability During Disaster

Because multiple DataNodes hold the same data blocks, if any DataNode suffers a failure the overall process continues: another DataNode assumes the role of serving the specific data blocks that were on the failed node. This gives the system very high fault tolerance and high availability.

Notice that there is only a single NameNode in this architecture. If that node fails, the entire Hadoop application stops working, because it is the single node that stores all the metadata. As this node is so critical, it is usually replicated onto another cluster as well as onto another data rack. Although that replica is not operational in the architecture, it has all the data necessary to take over the NameNode's task if the NameNode fails.

The entire Hadoop architecture is built to function smoothly even when there are node failures or hardware malfunctions. It is built on the simple premise that the data is so big that no single piece of hardware can manage it properly. We need lots of commodity (cheap) hardware to manage our big data, and hardware failure comes with commodity servers. To reduce the impact of hardware failure, the Hadoop architecture is built to work around non-functioning hardware.
9. Big Data – Importance of Relational Database
in Big Data World

A Big Question?

NoSQL Movement
The NoSQL movement of recent times happened because of two important advantages of NoSQL databases:
1. Performance
2. Flexible Schema

In my personal experience, I have found both of the advantages listed above when using a NoSQL database. There are instances where I have found a relational database too restrictive, because my data was unstructured or in a datatype that my relational database did not support, and in those same situations I have found NoSQL solutions performing much better than relational databases.

Situations Where a Relational Database Outperforms

Ad-hoc reporting is one of the most common scenarios where NoSQL does not have an optimal solution. Reporting queries often need to aggregate on columns that are not indexed and that are chosen while the report is running; in this kind of scenario, NoSQL databases (document stores, distributed key-value stores) often do not perform well. For ad-hoc reporting, I have often found it much easier to work with relational databases.

SQL is one of the most popular computer languages of all time. In many cases, writing a query in SQL is much easier than writing queries in the languages NoSQL systems support. I believe this is the current situation, but in the future it may reverse if NoSQL query languages become equally popular.

ACID (Atomicity, Consistency, Isolation, Durability) – Not all NoSQL solutions offer ACID compliance. There are always situations (for example banking transactions or e-commerce shopping carts) where, without ACID, operations can be invalid and database integrity can be at risk. Even when the data volume truly qualifies as Big Data, there are always operations in the application that absolutely need a mature, ACID-compliant system.

The Mixed Bag

I have often heard the argument that all the big social media sites have nowadays moved away from relational databases. Actually, this is not entirely true. While researching Big Data and relational databases, I found that many popular social media sites use Big Data solutions alongside relational databases. Many use relational databases to deliver results to end users at run time, and many still use a relational database as their major backbone.

Here are a few examples:


 Facebook uses MySQL to display the timeline.
 Twitter uses MySQL.
 Tumblr uses sharded MySQL.
 Wikipedia uses MySQL for data storage.

There are many more prominent organizations running large-scale applications that use a relational database along with various Big Data frameworks to satisfy their various business needs.

Summary
I believe that the RDBMS is like vanilla ice cream: everybody loves it and everybody has it. NoSQL and other solutions are like chocolate or custom flavors: there is a huge base that loves and wants them, but not every ice cream maker can make them just right for everyone's taste. No matter how fancy an ice cream store is, plain vanilla is always available there. In the same way, there are always cases and situations in the Big Data story where the traditional relational database is part of the whole. In real-world scenarios there will always be a need for relational database concepts and their ideology. It is extremely important to accept the relational database as one of the key components of Big Data instead of treating it as a substandard technology.
10. Big Data – What is NewSQL

What is NewSQL?

NewSQL is shorthand for the new generation of scalable, high-performance SQL database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not a kind of database as such; it is about vendors who offer emerging data products with relational database properties (such as ACID transactions) along with high performance. Products from NewSQL vendors usually keep data in memory for speedy access and offer immediate scalability.

On the definition of NewSQL, Aslett writes:

“NewSQL” is our shorthand for the various new scalable/high performance SQL database
vendors. We have previously referred to these products as ‘ScalableSQL‘ to differentiate
them from the incumbent relational database products. Since this implies horizontal
scalability, which is not necessarily a feature of all the products, we adopted the term
‘NewSQL’ in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too
literally: the new thing about the NewSQL vendors is the vendor, not the SQL.

In other words, NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL systems. It combines the reliability of SQL with the speed and performance of NoSQL.

Categories of NewSQL

There are three major categories of NewSQL:

New Architectures – In this category each node owns a subset of the data, and queries are split into smaller queries that are sent to the nodes to process the data. E.g. NuoDB, Clustrix, VoltDB.

MySQL Engines – Highly optimized storage engines that keep the MySQL interface are examples of this category. E.g. InnoDB, Akiban.

Transparent Sharding – These systems automatically split the database across multiple nodes. E.g. ScaleArc.

Summary

In simple words, NewSQL is a kind of database that follows relational database principles and provides scalability like NoSQL.
11. Big Data – Role of Cloud Computing in Big
Data

What is Cloud?

"Cloud" has been the biggest buzzword around for the last few years. Cloud computing is a method of providing shared computing resources to applications that require dynamic resources. These resources include applications, compute, storage, networking, development and various deployment platforms. The fundamental idea of cloud computing is that it pools pretty much all the resources and delivers them to end users as a service.

Good examples of cloud computing and Big Data together are Google and Amazon.com; both have excellent Big Data offerings built on the cloud.

There are two main cloud deployment models: 1) the public cloud and 2) the private cloud.

Public Cloud
A public cloud is cloud infrastructure built by commercial providers (Amazon, Rackspace etc.): a highly scalable data center that hides the complex infrastructure from the consumer and provides various services.

Private Cloud
A private cloud is cloud infrastructure built by a single organization that manages a highly scalable data center internally.

Here is a quick comparison between the public cloud and the private cloud:

Aspect          | Public Cloud                       | Private Cloud
Initial cost    | Typically zero                     | Typically high
Running cost    | Unpredictable                      | Unpredictable
Customization   | Impossible                         | Possible
Privacy         | No (host has access to the data)   | Yes
Single sign-on  | Impossible                         | Possible
Scaling up      | Easy, within defined limits        | Laborious, but no limits


Hybrid Cloud

A hybrid cloud is cloud infrastructure composed of two or more clouds, such as a public and a private cloud. A hybrid cloud gives the best of both worlds, as it combines multiple cloud deployment models.

Cloud and Big Data – Common Characteristics

Many characteristics of cloud architecture and cloud computing are also essential for Big Data. The two overlap heavily, and in many places it simply makes sense to use the power of both architectures and build a highly scalable framework.

Here is a list of the characteristics of cloud computing that are important for Big Data:

 Scalability
 Elasticity
 Ad-hoc Resource Pooling
 Low Cost to Set Up Infrastructure
 Pay on Use or Pay as you Go
 Highly Available

Leading Big Data Cloud Providers


There are many players in the Big Data cloud space, but we will list a few of the well-known ones here.
Amazon

Amazon is arguably the most popular Infrastructure as a Service (IaaS) provider. The history of how Amazon entered this business is interesting: they started out with a massive infrastructure to support their own business and gradually realized that their resources were underutilized most of the time. They decided to get the maximum out of the resources they had, and hence launched the Amazon Elastic Compute Cloud (Amazon EC2) service in 2006. Their products have evolved a lot since, and cloud services are now one of their primary businesses besides retail.

Amazon also offers Big Data services under Amazon Web Services. Here is a list of the included services:

 Amazon Elastic MapReduce – processes very high volumes of data
 Amazon DynamoDB – a fully managed NoSQL (Not Only SQL) database service
 Amazon Simple Storage Service (S3) – a web-scale service designed to store and accommodate any amount of data
 Amazon High Performance Computing – provides low-latency, tuned high performance computing clusters
 Amazon Redshift – a petabyte-scale data warehousing service

Google

Though Google is known for its search engine, we all know it is much more than that. Its Big Data offerings include:

 Google Compute Engine – offers secure, flexible computing from energy-efficient data centers
 Google BigQuery – allows SQL-like queries to run against large datasets
 Google Prediction API – a cloud-based machine learning tool

Other Players

Besides Amazon and Google, there are other players in the Big Data market as well. Microsoft is addressing Big Data in the cloud with Microsoft Azure. Additionally, Rackspace and NASA together initiated OpenStack, whose goal is to provide a massively scaled, multi-tenant cloud that can run on any hardware.

Things to Watch
Cloud-based solutions integrate well with the Big Data story and are very economical to implement. However, there are a few things one should be careful about when deploying Big Data on cloud solutions.
Here is a list of a few things to watch:

 Data Integrity
 Initial Cost
 Recurring Cost
 Performance
 Data Access Security
 Location
 Compliance

Every company has a different approach to Big Data and different rules and regulations. Based on various factors, each can implement its own custom Big Data solution in the cloud.
12. Big Data – Operational Databases
Supporting Big Data – RDBMS and NoSQL
Even though we keep talking about Big Data architecture, it is crucial to understand that a Big Data system cannot exist in isolation. Many business needs can only be fulfilled with the help of operational databases. Just having a system that can analyze big data may not solve every single data problem.

Real World Example

Think about it this way: you are using Facebook and you have just updated your relationship status. Within the next few seconds, the same information is reflected in the timeline of your partner as well as a few of your closest friends. After a while, you notice that the same information is now also visible to your more distant friends. Later, when someone searches for all the relationship changes among their friends, your change shows up in that list as well. Now here is the question: do you think the Big Data architecture is making every single one of these changes? Do you think the immediate reflection of your relationship change for your family members is also because of Big Data technology? Actually, Facebook uses MySQL for the various updates in the timeline, as well as for the various events we trigger on the homepage. It is really difficult to do without operational databases in any real-world business.

Now we will look at a few examples of operational databases.

 Relational Databases
 NoSQL Databases
 Key-Value Pair Databases
 Document Databases
 Columnar Databases
 Graph Databases
 Spatial Databases

Relational Databases
The relational database is pretty much everywhere in businesses that have been around for many years. The importance and existence of the relational database will remain as long as there is meaningful structured data around. There are many different relational databases, for example Oracle, SQL Server, MySQL and many others. If you are looking for an open-source and widely accepted database, I suggest trying MySQL, as it has been very popular over the last few years. I also suggest trying PostgreSQL. Besides many other essential qualities, PostgreSQL has a very interesting licensing policy: the PostgreSQL license allows modification and distribution of the application in open or closed (source) form. One can make modifications and keep them private, or contribute them to the community. I believe this one quality makes it much more interesting to use, and it will play a very important role in the future.

Non-relational Databases (NoSQL)

NoSQL actually stands for "Not Only SQL" databases. There are plenty of NoSQL databases on the market, and selecting the right one is always challenging. Here are a few of the properties that are essential to consider when selecting the right NoSQL database for operational purposes:

 Data and query model
 Persistence of data and design
 Eventual consistency
 Scalability

Although all of the above properties are interesting to have in any NoSQL database, the one that attracts me most is eventual consistency.

Eventual Consistency
An RDBMS uses ACID (Atomicity, Consistency, Isolation, Durability) as the key mechanism for ensuring data consistency, whereas non-relational DBMSs use BASE for the same purpose. BASE stands for Basically Available, Soft state and Eventual consistency. Eventual consistency is widely deployed in distributed systems; it is a consistency model for distributed computing that expects the unexpected. In a large distributed system built on commodity servers, nodes are constantly joining and being removed, either intentionally or accidentally. Even when one or more nodes are down, the entire system is expected to keep functioning normally: applications should be able to perform updates as well as retrieve data successfully without any issue. This also means that the system is eventually expected to return the same updated data from all the functioning nodes. Irrespective of when a node joins the system, if it is marked to hold some data, it should eventually contain the same updated data.
Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. In other words, if no additional updates are made to a given data item, all reads of that item will eventually return the same value.

Key Value Pair Databases


Key-value pair databases are also known as KVP databases. A key is a field name, an attribute, an identifier; the content of that field is its value, the data that is being identified and stored.
KVP databases are a very simple implementation of NoSQL database concepts. They have no schema, so they are very flexible as well as scalable. The disadvantages of a key-value pair (KVP) database are that it typically does not follow the ACID (Atomicity, Consistency, Isolation, Durability) properties, and it requires data architects to plan for data placement, replication and high availability. In many KVP databases the data is stored as strings.

Here is a simple example of what data in a key-value database looks like:

Key     | Value
Name    | Pinal Dave
Color   | Blue
Twitter | @pinaldave
Name    | Nupur Dave
Movie   | The Hero
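As an illustrative sketch only, assuming a local Redis server and the redis-py client (neither of which is specified in the text above), the same kind of key-value pairs could be written and read back like this:

# kvp_example.py - store and read simple key/value pairs with redis-py
import redis

# Connect to a Redis server assumed to be running on localhost:6379
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a few keys, mirroring the table above
r.set("Name", "Pinal Dave")
r.set("Color", "Blue")
r.set("Twitter", "@pinaldave")

# Read a value back by its key
print(r.get("Twitter"))     # -> @pinaldave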

As the number of users of a key-value pair database grows, it starts getting difficult to manage the entire database. As there is no specific schema or set of rules associated with the database, there is a good chance the database grows exponentially as well. It is very important to select a key-value pair database that offers an additional set of tools to manage the data and provides finer control over various business aspects of it.

Riak

Riak is one of the most popular key-value databases. It is known for its scalability and performance with high-volume, high-velocity data. Additionally, it implements a mechanism for grouping keys and values, which further helps in building a manageable system.

Key-value databases are a good choice for social media, communities and caching layers in front of other databases. In simpler words, whenever we require flexible data storage with scalability in mind, KVP databases are a good option to consider.

Document Database
There are two different kinds of document databases:
1) those that store full document content (web pages, Word docs etc.), and
2) those that store document components.

It is the second type of document database we are talking about here. They use JavaScript Object Notation (JSON) or Binary JSON (BSON) for the structure of the documents. JSON is very easy to understand and very easy for applications to write. There are two major JSON structures used in document databases: 1) name-value pairs and 2) ordered lists.

MongoDB and CouchDB are two of the most popular open-source, non-relational document databases.

MongoDB

MongoDB databases are made up of collections; each collection is built of documents, and each document is composed of fields. MongoDB collections can be indexed for optimal performance. The MongoDB ecosystem is highly available and supports query services as well as MapReduce. It is often used in high-volume content management systems.
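As a minimal illustrative sketch, assuming a local MongoDB instance and the PyMongo driver (not mentioned in the text above; the database, collection and field names are made up), inserting and querying a document could look like this:

# document_db_example.py - insert and query a JSON-like document with PyMongo
from pymongo import MongoClient

# Connect to a MongoDB server assumed to be running on localhost:27017
client = MongoClient("mongodb://localhost:27017")
collection = client["demo"]["people"]      # database "demo", collection "people"

# Each document is a set of name-value pairs (fields)
collection.insert_one({"name": "Nupur Dave", "movie": "The Hero", "color": "Blue"})

# Query by any field; an index on "name" would speed this up
print(collection.find_one({"name": "Nupur Dave"}))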

CouchDB

CouchDB databases are composed of documents that consist of fields and attachments. It supports ACID properties. The main attraction of CouchDB is that it will continue to operate even when network connectivity is sketchy; due to this nature, CouchDB favors local data storage.

A document database is a good choice when users have to generate dynamic reports from elements that change very frequently. Good examples of document database usage are real time analytics in social networking and content management systems.

Columnar Databases
A relational database is a row store, or row-oriented, database. Columnar databases are column-oriented, or column store, databases. As discussed earlier, in Big Data we have many different kinds of data to store, and with a columnar database that is very easy to accommodate, because we can simply add a new column. HBase is one of the most popular columnar databases. It uses the Hadoop file system and MapReduce for its core data storage. However, remember that this is not a good solution for every application; it is particularly good for databases where high-volume incremental data is gathered and processed.
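As an illustrative sketch only, assuming an HBase Thrift gateway on localhost and the HappyBase Python client (neither is part of the text above, and the table and column family names are made up), writing and reading a cell could look like this:

# columnar_example.py - write and read a cell in HBase via HappyBase
import happybase

# Connect to an HBase Thrift gateway assumed to be on localhost:9090
connection = happybase.Connection("localhost")
table = connection.table("events")          # table assumed to exist with family "cf"

# Columns are addressed as b"family:qualifier"; new qualifiers can be added
# freely, which is what makes the column-oriented model so flexible
table.put(b"user1", {b"cf:clicks": b"42", b"cf:country": b"IN"})

row = table.row(b"user1")
print(row[b"cf:clicks"])                     # -> b'42'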
Graph Databases
For highly interconnected data, a graph database is a suitable choice. This kind of database has a node-and-relationship structure, and both nodes and relationships contain key-value pairs where data is stored. The major advantage of this database is that it supports fast navigation across the various relationships. For example, Facebook uses a graph model to list and explore the relationships between users. Neo4j is one of the most popular open-source graph databases. One notable limitation of some graph databases is that self-references (self-joins in RDBMS terms) are not supported, and there may be real-world scenarios that require them.
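A minimal illustrative sketch, assuming a local Neo4j server and the official neo4j Python driver (the credentials and node names are made up for the example), of creating two nodes, a relationship, and then navigating it:

# graph_example.py - create and navigate a relationship with the neo4j driver
from neo4j import GraphDatabase

# Connection details are assumptions for the example
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two Person nodes and a FRIEND relationship between them
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIEND]->(b)",
        a="Pinal", b="Nupur",
    )
    # Fast navigation: follow the FRIEND relationship from one node
    result = session.run(
        "MATCH (:Person {name: $a})-[:FRIEND]->(f) RETURN f.name AS friend",
        a="Pinal",
    )
    print([record["friend"] for record in result])

driver.close()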

Spatial Databases
We all use Foursquare, Google+ and Facebook check-ins for location-aware check-ins. All location-aware applications figure out the position of the phone with the help of the Global Positioning System (GPS). Think about it: so many different users, at different locations around the world, checking in at the same time. Additionally, the applications are now feature-rich, and users demand more and more information from them, for example about movies, coffee shops or places to see. These all run with the help of spatial databases. Spatial data is standardized by the Open Geospatial Consortium, known as the OGC. Spatial data helps answer many interesting questions, such as the distance between two locations or the area of an interesting place. When you think about it, it is clear that handling spatial data and returning meaningful results is a big task when there are millions of users moving dynamically from one place to another and requesting various kinds of spatial information. The PostGIS/OpenGIS suite is a very popular spatial database; it runs as a layer on top of the PostgreSQL RDBMS, which makes it quite unique, as it offers the best of both worlds.



13. Big Data – Data Mining with Hive – What
is Hive? – What is HiveQL (HQL)?
Now we will look at what Hive and HQL are in the Big Data story.

Yahoo started working on Pig for its application deployments on Hadoop, with the goal of managing its unstructured data. Similarly, Facebook started deploying its warehouse solutions on Hadoop, which resulted in Hive. The reason for going with Hive was that traditional warehousing solutions were getting very expensive.

What is Hive?

Hive is a data warehousing infrastructure for Hadoop. Its primary responsibility is to provide data summarization, query and analysis. It supports analysis of large datasets stored in Hadoop's HDFS as well as on the Amazon S3 file system. The best part of Hive is that it supports SQL-like access to structured data, known as HiveQL (or HQL), as well as big data analysis with the help of MapReduce. Hive is not built to give quick responses to queries; it is built for data mining applications, which can take from several minutes to several hours to analyze the data, and that is where Hive is primarily used.

Hive Organization

Data is organized in three different structures in Hive.

Tables: These are very similar to RDBMS tables and contain rows and columns. Hive is layered over the Hadoop Distributed File System (HDFS), so tables are mapped directly to directories of the file system. Hive also supports tables stored in other native file systems.

Partitions: A Hive table can have more than one partition. Partitions are mapped to subdirectories in the file system as well.

Buckets: In Hive, data may be further divided into buckets. Buckets are stored as files within the partitions in the underlying file system.

Hive also has a metastore, which stores all the metadata. It is a relational database containing various information related to the Hive schema (column types, owners, key-value data, statistics etc.). A MySQL database can be used for it.

What is HiveQL (HQL)?

The Hive query language provides basic SQL-like operations. Here are a few of the tasks HQL can do easily:

 Create and manage tables and partitions
 Support various relational, arithmetic and logical operators
 Evaluate functions
 Download the contents of a table to a local directory, or the results of queries to an HDFS directory
Here are examples of HQL queries:

SELECT upper(name), salesprice
FROM sales;

SELECT category, count(1)
FROM products
GROUP BY category;

When you look at the queries above, you can see that they are very similar to SQL.
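As an illustrative sketch only, assuming a HiveServer2 instance on localhost:10000 and the PyHive client (neither of which is part of the text above), the same kind of HQL query can be issued from Python:

# hive_query_example.py - run an HQL query through PyHive
from pyhive import hive

# Connection details are assumptions for the example
conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Hive compiles the GROUP BY below into MapReduce-style jobs behind the scenes
cursor.execute("SELECT category, count(1) FROM products GROUP BY category")
for category, cnt in cursor.fetchall():
    print(category, cnt)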
14. Big Data – Interacting with Hadoop – What
is PIG? – What is PIG Latin?

Yahoo started working on Pig for its application deployments on Hadoop, with the goal of managing its unstructured data.

What is Pig and What is Pig Latin?

Pig is a high-level platform for creating MapReduce programs used with Hadoop, and the language we use on this platform is called Pig Latin. Pig was designed to make Hadoop more user-friendly and approachable for power users and non-developers. Pig is an interactive execution environment supporting the Pig Latin language. Pig Latin supports loading and processing input data through a series of transformations to produce the desired results.

Pig has two different execution environments:

1) Local mode – all the scripts run on a single machine.
2) Hadoop mode – all the scripts run on a Hadoop cluster.

Pig Latin vs SQL

Pig essentially creates a set of map and reduce jobs under the hood, so users no longer have to write, compile and build their own solutions for Big Data. Pig is very similar to SQL in many ways. The Pig Latin language provides an abstraction layer over the data; it focuses on the data and not on the structure underneath. Pig Latin is a very powerful language: it can do operations such as loading and storing data, streaming data, filtering data, and various string-related data operations. The major difference between SQL and Pig Latin is that Pig Latin is procedural while SQL is declarative. In simpler words, Pig Latin is very similar to a SQL execution plan, and that makes it much easier for programmers to build various processes. Whereas SQL handles trees naturally, Pig Latin follows a directed acyclic graph (DAG). DAGs are used to model several different kinds of structures in mathematics and computer science.

15. Big Data – Interacting with Hadoop – What
is Sqoop? – What is Zookeeper?
There are two important components one should learn about when learning how to interact with Hadoop: Sqoop and ZooKeeper.

What is Sqoop?
Most businesses store their data in an RDBMS and other data warehouse solutions. They need a way to move that data into the Hadoop system for various processing and to return it from Hadoop back to the RDBMS. The data movement can happen in real time or in bulk at various intervals. We need a tool that can help us move this data from SQL to Hadoop and from Hadoop to SQL. Sqoop (SQL-to-Hadoop) is such a tool: it extracts data from non-Hadoop data sources, transforms it into a format Hadoop can use, and then loads it into HDFS. Essentially it is an ETL tool: it Extracts, Transforms and Loads from SQL to Hadoop. The best part is that it can also extract data from Hadoop and load it into relational (SQL) data stores. Essentially, Sqoop is a command line tool that moves data from SQL to Hadoop and from Hadoop to SQL. It is a command line interpreter that creates MapReduce jobs behind the scenes to import data from an external database into HDFS. It is a very effective and easy-to-learn tool, even for non-programmers.
What is Zookeeper?

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In other words, ZooKeeper is a replicated synchronization service with eventual consistency. In simpler words: in a Hadoop cluster there are many different nodes, and one node is the master. Let us assume the master node fails for some reason. In this case, the role of the master node has to be transferred to a different node. The main role of the master node is managing the writers, as that task requires persistence in the order of writing. In this kind of scenario, ZooKeeper assigns a new master node and makes sure that the Hadoop cluster keeps performing without any glitch. ZooKeeper is Hadoop's way of coordinating all the elements of these distributed systems.

Here are a few of the tasks for which Zookeeper is responsible:

• Zookeeper manages the entire workflow of starting and stopping the various nodes in the Hadoop cluster.
• When any process in the Hadoop cluster needs a certain configuration to complete its task, Zookeeper makes sure that the node gets the necessary configuration consistently.
• If the master node fails, Zookeeper can assign a new master node and make sure the cluster keeps working as expected.

There are many other tasks Zookeeper performs when it comes to coordination and communication in a Hadoop cluster. Basically, without the help of Zookeeper it is practically impossible to design a new fault-tolerant distributed application.
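As a minimal sketch of the underlying mechanism (the host name and znode path below are assumptions), the command-line shell that ships with ZooKeeper can create an ephemeral znode; because an ephemeral znode is deleted automatically when the session that created it ends, other nodes can watch it and take over the master role the moment it disappears:

# connect to the ZooKeeper ensemble with the bundled CLI (host/port assumed)
bin/zkCli.sh -server zkhost:2181

# inside the shell: mark the current master with an ephemeral znode;
# it vanishes automatically if this node's session ends (for example, if the node crashes)
create -e /cluster/master node-1

# other nodes can read (and watch) this znode to discover who the current master is
get /cluster/master

Hadoop ecosystem components such as HBase build their master election and configuration tracking on exactly this ephemeral-znode-and-watch pattern.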
16. Big Data – Basics of Big Data Analytics
When you have plenty of data around you, what is the first thing that comes to your mind?

“What does all this data mean?”

Exactly – the same thought comes to my mind as well. I have always wanted to know what all the data means and what meaningful information I can extract from it. Most Big Data projects are built to retrieve the intelligence hidden within all this data. Let us take the example of Facebook. When I look at my friends list on Facebook, I always want to ask many questions, such as:

• On which date do the largest number of my friends have a birthday?
• Which film is the favorite of most of my friends, so I can talk about it and engage them?
• What is the most liked travel destination among my friends?
• Which cuisine do my friends in India and the USA dislike the most, so that when they travel I do not take them out for it?

There are many more questions I can think of. This illustrates how important it is to analyze Big Data.

Here are a few of the kinds of analysis you can perform on Big Data:

Slicing and Dicing: This means breaking your data down into smaller sets and understanding them one set at a time. It also helps to present the information in a variety of user-digestible ways. For example, if you have data related to movies, you can slice and dice it along different dimensions such as actor, movie length, etc.

Real Time Monitoring: This is very crucial in social media when an event is happening and you want to measure its impact as it happens. For example, during a football match you can watch on Twitter what fans are saying about the match while it is being played.

Anomaly Prediction and Modeling: If the business is running normally, everything is fine, but if there are signs of trouble, everyone wants to know about them early. Big Data analysis of various patterns can be very helpful for predicting the future. It may not always be accurate, but certain hints and signals can be very useful. For example, a large amount of data can help conclude that heavy rain increases umbrella sales.

Text and Unstructured Data Analysis: Unstructured data is now becoming the norm and is a big part of the Big Data revolution. It is very important that we Extract, Transform and Load unstructured data and turn it into meaningful information. For example, by analyzing a large number of images, one can conclude that people like to wear certain colors in certain months.
Big Data Analytics Solutions

There are many different Big Data analytics solutions on the market. It is impossible to list all of them, so I will mention just a few here:

• Tableau – one of the most popular visualization tools in the Big Data market.
• SAS – a high-performance analytics and infrastructure company.
• IBM and Oracle – both offer a range of tools for Big Data analysis.
17. Big Data – How to become a Data Scientist and Learn Data Science?

In the new world of Big Data, it seems pretty much everyone wants to become a Data Scientist, and I have already met lots of people who claim that they are Data Scientists. When I ask what their role is, I get a wide variety of answers.

What is a Data Scientist?

Data Scientists are experts who understand various aspects of the business and know how to use data strategically to achieve business goals. They should have a solid foundation in data algorithms, modeling and statistical methodology.

What do Data Scientists do?

Data Scientists understand the data very well. They go beyond the regular data algorithms and uncover interesting trends from the available data. They innovate and draw entirely new meaning from existing data. They are artists in the disguise of computer analysts. They look at the data in traditional ways and also explore new ways of looking at it.

Data Scientists do not wait for data to arrive before designing their solutions. They think creatively; they think before the data has even entered the system. Data Scientists are visionary experts who understand the business needs and plan ahead of time, which tremendously helps in building solutions at rapid speed.

Besides being data experts, the defining quality of Data Scientists is “curiosity”. They always wonder what more they can get out of their existing data and how to get the maximum out of future incoming data.

Data Scientists do wonders with data, going well beyond the job description of a Data Analyst or Business Analyst.
Skills Required for Data Scientists
Here are a few of the skills a Data Scientist must have:

• Expert-level skills with statistical tools like SAS, Excel, R, etc.
• Understanding of mathematical models
• Hands-on experience with visualization tools like Tableau, PowerPivot, D3.js, etc.
• Analytical skills to understand business needs
• Communication skills

On the technology front, a Data Scientist should know the underlying technologies (such as Hadoop and distributions like Cloudera) as well as their entire ecosystem (programming languages, analysis and visualization tools, etc.).

Remember that becoming a successful Data Scientist requires truly excellent skills; just having a degree in a relevant field will not suffice.

Final Note

Data Scientist is indeed a very exciting job profile. Research suggests there are not enough Data Scientists in the world to handle the current data explosion. In the near future data is going to expand exponentially, and the need for Data Scientists will increase along with it. It is a job worth focusing on if you like data and the science of statistics.

Courtesy: EMC
References
https://en.wikipedia.org/wiki/Big_data
http://www.ibm.com/big-data/us/en/
http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data

Thank You
Contact

For more information, contact gsl.cdsfiodg@tcs.com (Email Id of ISU)

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India’s largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.

For more information, visit us at www.tcs.com.

IT Services
Business Solutions
Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content /
information contained here is correct at the time of publishing. No material from here may be copied, modified,
reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from
TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable
laws, and could result in criminal or civil penalties. Copyright © 2011 Tata Consultancy Services Limited
