- Big Data Characteristics: Velocity, Volume & Variety
- Structured and Unstructured Data
- Limitations of traditional large-scale systems
- How a distributed way of computing is superior (Scaling and Cost)
- Introduction to Hadoop and its ecosystem

Hands on: Installing virtualization software and running a Hadoop VM.

HDFS and Cluster overview
- HDFS Overview and Architecture
- Types of Nodes (Name Node, Checkpoint Node, Backup Node)
- Configuration files and important parameters
- Data Replication
- Safe Mode
- Data Flow (Read & Write)
- Interfaces to HDFS (HTTP, CLI and Java API)
- File Formats
  - SequenceFile
  - MapFile
  - Custom Input Formats and Output Formats
  - Avro Data Files
- Data Compression

Hands on: HDFS Commands (for Data I/O)

MapReduce Anatomy & Data flow
- Functional Programming model
- Pros and cons of different models
- Evolution and overview of MapReduce
- Data flow using Word Count
- Input and Output File Formats
- Hadoop Data Types
- Basic MapReduce API Concepts
- Input Splits, Shuffling, Sorting, Combining
- Custom Writables & WritableComparables
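
The Word Count data flow named in this module can be sketched without a cluster. The following is a minimal pure-Python simulation of the map, shuffle/sort, and reduce steps, not Hadoop's actual API; the function names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle: group all values by key, then sort by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    for word, count in word_count(["the quick brown fox", "the lazy dog"]):
        print(word, count)
```

In a real job the mapper and reducer run on different machines and the framework performs the shuffle; here everything runs in one process so the order of the phases is easy to follow.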

Hands on: Developing and deploying MR programs

- Working within Eclipse
- Working from CLI in different modes
  - Stand-alone
  - Pseudo-distributed
  - Fully distributed
- Streaming (in Python)

Advanced MapReduce
- Combiners & Partitioners
- JVM Reuse and others
- Distributed Cache
- Compression of the data at different phases
- Hadoop tunable parameters
- Hadoop profiling
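
To make the first topic concrete, here is a rough pure-Python sketch of what a combiner and a partitioner each do. It is not Hadoop code: the names `combine` and `partition` are hypothetical, and CRC32 is used as a stand-in hash only because Python's built-in `hash()` for strings varies between runs. Hadoop's default partitioner follows the same idea, a key hash taken modulo the number of reduce tasks.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate one mapper's (key, count) pairs locally,
    so fewer records have to cross the network during the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: pick the reducer for a key with a stable hash modulo
    the reducer count, so equal keys always reach the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    mapper_output = [("pig", 1), ("hive", 1), ("pig", 1)]
    for key, count in combine(mapper_output):
        print(key, count, "-> reducer", partition(key, 3))
```

A combiner is only safe when the reduce operation is associative and commutative, as summation is here.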

MR Best practices and Debugging

- Configuration files
- Memory allocation for different tasks
- Debugging MR Jobs (JUnit and MRUnit Testing Frameworks)
- Hadoop Counters and Web UI
- Understanding the different logs in Hadoop

Tutorials: Fundamental MR Algorithms

- Sorting
- Joining
  - Map-side Join
  - Reduce-side Join
- Inverted Index
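
As an illustration of the last algorithm in this tutorial, the sketch below builds an inverted index in plain Python. It assumes a small in-memory dictionary of document IDs to text; the function name `build_inverted_index` and the sample documents are hypothetical. In a MapReduce job the inner loop would correspond to the mapper emitting (word, document ID) pairs and the final grouping to the reducer.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # "Map" step: one (word, doc_id) association per word occurrence.
        for word in text.lower().split():
            index[word].add(doc_id)
    # "Reduce" step: collapse each word's document IDs into a sorted list.
    return {word: sorted(doc_ids) for word, doc_ids in sorted(index.items())}


if __name__ == "__main__":
    docs = {"d1": "hadoop pig", "d2": "hadoop hive"}
    for word, doc_ids in build_inverted_index(docs).items():
        print(word, doc_ids)
```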

Pig Dataflow Language

- What is Pig?
- Pig Console (Grunt)
- Pig Data Model
- Input and Output
- Relational Operations
- User Defined Functions
- Testing Scripts with PigUnit
- Writing Scripts to Perform Well

Hive Data Warehouse framework

- An Overview of Hadoop and MapReduce
- Hive in the Hadoop Ecosystem
- Java vs. Hive: The Word Count Algorithm
- Hive Architecture
- HiveQL: Data Definition
- HiveQL: Data Manipulation
- HiveQL: Queries, Views, Indexes
- Hive: Schema Design and Tuning
- Pig and Hive Comparison

Tutorials: Hadoop Data Operations

- Oozie
  - Creating workflows
  - Running workflows
- Sqoop
  - Getting data from RDBMS
- Flume
  - Getting log data

NoSQL Concepts
- Review of traditional Databases
- ACID vs. BASE
- Schema on Read vs. Schema on Write
- CAP Theorem
- Need for NoSQL Databases
- Key-value
- Columnar
- Document
- Graph

HBase Overview
- HBase & Cassandra concepts
- HBase Architecture
- HBase Data Modeling
- HBase Commands
- HBase Coprocessors (Endpoints, Observers)

Introduction to Mahout
- Classification
- Clustering
- Recommender Engine

* Session = 2 hrs