Hadoop Course

About Course
Master Big Data, NoSQL and Data Science skills together and become a successful Big Data Scientist,
with lifetime access to 16 courses. Start your journey now!
List of Courses present in this combo pack
Hadoop Architect Training: All-in-1 Combo Course (Hadoop Developer, Hadoop Analyst, Hadoop Administration
and Hadoop Testing)
R Programming Training
Mahout Training
Data Science Training: Building Recommender Systems
Statistics and Probability Training
Apache Solr Training
Splunk Training
Apache Storm Training
Splunk Admin Training
HBase Training
Cassandra Training
MongoDB Training
Apache Spark, Scala Training
Key Features:
A comprehensive, in-depth combo of Big Data + Data Science + NoSQL courses covering as many as
16 niche, highly endorsed and top-paying technology courses
Intensive learning through the Hadoop Architect Training All-in-1 Combo Course, which includes
Hadoop Developer, Hadoop Analyst, Hadoop Administration and Hadoop Testing, plus R Programming Training,
Mahout Training, Data Science Training: Building Recommender Systems, Statistics and Probability Training,
Apache Solr Training, Splunk Training, Apache Storm Training, Splunk Admin Training, HBase Training,
Cassandra Training, MongoDB Training and Apache Spark and Scala Training
237 hours of High-Quality in-depth Video E-Learning Sessions
404 hours of Lab Exercises
Intellipaat Proprietary VM for Lifetime and free cloud access for 6 months for performing exercises.
70% of extensive learning through Hands-on exercises, Project Work, Assignments and Quizzes
The training will prepare you for multiple Professional Certification Exams:
Cloudera Certification:
CCA Spark and Hadoop Developer, CCAH, R Certification, Mahout Certification, Cloudera Certification
(CCP:DS), Apache Storm Certification, Cloudera Apache HBase Certification, Apache Cassandra Professional
Certification, MongoDB Certification, Apache Spark Certification
24*7 Lifetime Support with Rapid Problem Resolution Guaranteed
Lifetime Access to Videos, Tutorials and Course Material
Guidance to Resume Preparation and Job Assistance
Step-by-step installation of multiple software packages
Course Completion Certificate from Intellipaat
About Big Data, Data Science Combo Course

Through this comprehensive course, learners acquire the skills required of a Big Data Scientist and gain
in-depth knowledge of the Development, Administration and Analysis profiles and of integrating multiple
systems. Gaining expertise in as many as 16 technologies in a single order is a strong ticket to your dream
job, a top-notch company and high earnings. Intellipaat's All-in-One Big Data and Data Science Combo course
covers widely endorsed technologies such as Hadoop, Spark, Storm, Scala, NoSQL, Mahout, Splunk, Solr,
Data Science, R Programming and core statistics and probability. This training course is an all-in-one
package designed to produce skilled, competent Big Data Scientists and Architects. Enrolling in this course
gives individuals in-depth knowledge and the opportunity to be noticed by top multinationals worldwide.
Project Work:
Hadoop Projects
1. Project
Working with Map Reduce, Hive, Sqoop
Problem Statement
It describes how to import MySQL data using Sqoop and query it using Hive, and how to run the
word count MapReduce job.
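As a rough illustration of the word count job mentioned above, the sketch below mimics the map, shuffle and reduce phases in plain Python rather than on a Hadoop cluster. The function names and sample lines are hypothetical and are not taken from the course material.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input.
    print(word_count(["big data big ideas", "data moves fast"]))
```

On a real cluster the mapper and reducer run as distributed tasks and the framework performs the shuffle; here everything runs in one process purely to show the data flow.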
2. Project
Work on Movie lens data for finding top records
Data
Movie Lens Dataset
Problem Statement
It includes:
Write a MapReduce program to find the top 10 movies from the u.data file
Create the same top 10 movies using PIG by loading u.data into pig
Create the same top 10 movies using HIVE by loading u.data into HIVE
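The brochure does not say how "top" is defined, so the following sketch assumes it means the movies with the most ratings. It also assumes the usual MovieLens u.data layout of four tab-separated fields (user id, movie id, rating, timestamp). The function name is hypothetical.

```python
from collections import Counter


def top_movies(lines, n=10):
    """Return the n most-rated movies as (movie_id, rating_count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        _user_id, movie_id, _rating, _timestamp = fields
        counts[movie_id] += 1
    # Sort by count (descending), then by movie id so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    # Hypothetical rows in the assumed u.data layout.
    sample = [
        "196\t242\t3\t881250949",
        "186\t302\t3\t891717742",
        "22\t242\t5\t878887116",
        "244\t51\t2\t880606923",
    ]
    print(top_movies(sample, n=2))
```

The same grouping and ordering is what the MapReduce, Pig and Hive versions of the exercise would express in their own syntax.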
3. Project
Hadoop Yarn Project End to End PoC
Problem Statement
It includes:
Import Movie data
Append the data
How to use sqoop commands to bring the data into the HDFS
End to End flow of transaction data
How to process real-world data or a huge amount of data (for example, movie data) using a
MapReduce program.
4. Project
Partitioning Tables
Problem Statement
It describes partitioning and how to perform it. It includes:
Manual Partitioning
Dynamic Partitioning
Bucketing
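To make the difference between partitioning and bucketing concrete, here is a minimal Python sketch, not Hive code. It assumes each row is routed to a partition named after a column value and then to a bucket chosen by taking an integer key modulo the bucket count. Hive itself hashes the bucketing column, so the plain modulo is only a stand-in, and all names are hypothetical.

```python
from collections import defaultdict


def partition_and_bucket(rows, partition_key, bucket_key, num_buckets):
    """Group rows by (partition, bucket), deriving both from the row itself."""
    layout = defaultdict(list)
    for row in rows:
        # Partition: one logical directory per distinct column value.
        partition = f"{partition_key}={row[partition_key]}"
        # Bucket: stand-in for a hash of the bucketing column (integer key assumed).
        bucket = row[bucket_key] % num_buckets
        layout[(partition, bucket)].append(row)
    return dict(layout)


if __name__ == "__main__":
    # Hypothetical rows.
    rows = [
        {"country": "US", "user_id": 1},
        {"country": "US", "user_id": 4},
        {"country": "IN", "user_id": 2},
    ]
    for key, members in partition_and_bucket(rows, "country", "user_id", 3).items():
        print(key, members)
```

Deriving the partition from each row resembles dynamic partitioning; with manual partitioning the partition value is stated explicitly when the data is loaded.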
5. Project
Sales Commission
Data
Sales
Problem Statement
This project calculates the commission according to the sales.

6. Project
Connecting Pentaho with Hadoop Ecosystem
Problem Statement
It includes:
Quick Overview of ETL and BI
Configuring Pentaho to work with Hadoop Distribution
Loading data into Hadoop cluster
Transforming data into Hadoop cluster
Extracting data from Hadoop Cluster

7. Project
Multi-node Cluster Setup
Problem Statement
It includes the following actions:
Hadoop multi-node cluster setup using Amazon EC2
Creating a 4-node cluster setup
Running Map Reduce Jobs on Cluster

8. Project
Hadoop Testing using MRUnit
Problem Statement
It describes how to test MapReduce code with MRUnit.
9. Project
Hadoop Weblog Analytics
Data
Weblogs
Problem Statement
The goal is to enable the participants to have a feel of the actual data sets in a production environment and how
to load the data into a Hadoop cluster using various techniques. Once data is loaded, the next goal is to perform
basic analytics on this data.
R Programming Project
Restaurant Revenue Prediction
Data
Revenue Data set
Problem Statement
It predicts annual restaurant sales based on objective measurements. It uses the following data fields:
Id
Opening Date
Type of the City
Type of the Restaurant
Three categories of Obfuscated Data
Revenue
It also includes:
Data Overview
Data Fields
Evaluation using RMSE
Feature Engineering / Selection
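The project is evaluated using RMSE (root mean squared error). A minimal sketch of that metric in Python follows; the function name and the sample revenue figures are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical actual vs predicted revenue values.
    print(rmse([4.2, 5.1, 3.9], [4.0, 5.5, 3.6]))
```

A lower RMSE means the predicted revenues sit closer to the actual ones; large individual errors are penalized more heavily because the errors are squared before averaging.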
Data Science Projects:

Project 1: Understanding Cold Start Problem in Data Science
Algorithms for Recommender
Ways of Recommendation
Types of Recommendation: Collaborative Filtering Based Recommendation, Content-Based Recommendation
Cold Start Problem
Project 2: Recommendation for Movie, Summary
Recommendation for movie
Two Types of Predictions: Rating Prediction, Item Prediction
Important Approaches: Memory Based and Model-Based
Knowing User Based Methods in K-Nearest Neighbor
Understanding Item Based Method
Matrix Factorization
Singular Value Decomposition
Data Science Project discussion
Collaborative Filtering
Business Variables Overview
SPT Project
Data Analysis Project
Data
Sales
Problem Statement
It includes the following actions:
Understand the business solutions
Discussion with the warehouse team
Data Collection & Storage
Data Cleaning
Build a Hypothesis Tree around the business problem
Produce the final result.
Apache Solr Project

Function Queries
Problem Statement: It describes how to use function queries in Solr. Suppose an index stores the
dimensions in meters (x, y, z) of some hypothetical boxes, with arbitrary names stored in the field boxname.
Suppose we want to search for boxes matching the name findbox, ranked according to box volume.
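One way to express the box search above is to combine a normal term match with a function query that multiplies the three dimension fields. The sketch below only builds the request parameters in Python and does not contact a Solr server. It assumes the standard query parser's `_val_` hook and the `product` function; the helper names and the `fl` field list are hypothetical.

```python
from urllib.parse import urlencode


def volume_ranked_params(box_name):
    """Build request parameters that match on boxname and score by x*y*z."""
    return {
        # _val_ embeds a function query in the standard query parser;
        # product(x,y,z) multiplies the three dimension fields.
        "q": f'boxname:{box_name} _val_:"product(x,y,z)"',
        # Assumed field list, so the score is returned with each box.
        "fl": "boxname,x,y,z,score",
    }


def to_query_string(params):
    """URL-encode the parameters for use in a Solr select request."""
    return urlencode(params)


if __name__ == "__main__":
    print(to_query_string(volume_ranked_params("findbox")))
```

Because the function value is added to the relevance score, boxes with a larger volume tend to rank higher among those whose name matches.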
Splunk Project:
The Splunk Project, after finishing this training course, will let you create a report and dashboard with
the text file having employee details.
You will perform various row operations to fetch data as per your requirements and use important
Splunk commands on the file to extract certain fields.
Other significant aspects of this project are editing the event, adding tags, searching event with tag
names and saving tag search.
Splunk Admin Project: Field Extraction
Problem Statement
It includes:
About Field Extraction
Field Extractor Utility
Field Extraction page in Splunk Web
Configure field extraction in configuration files etc.
Apache Storm Projects:
Real-time Project on Storm
The Project Bolt Blue Print
HBase Project: Integrate Hive and Java with HBase
Problem Statement
This project describes how to integrate Hive and Java with HBase. It includes the
following actions:
Installation of HBase
Creation of Table
Java Program to create the table in HBase
Managing the HBase Table with Hive
Bulk Import etc.
MongoDB Project: Java MongoDB Integration
Problem Statement
It creates a table and inserts a video file into it using a Java program. To do this, it performs the
following actions:
Installation of Java
Adding MongoDB Java Connector etc.
Apache Spark Projects:

Mini Projects
Project 1. List the items
Project 2. Sorting of Records
Project 3. Show a histogram of date vs users created. Optionally, use a rich visualization.
Project 4. Prepare a map of tags vs # of questions in each tag and display it.
Major Projects
Project 1 Movie Recommendation
Project 2 Twitter API Integration for tweet Analysis
Project 3 Data Exploration Using Spark SQL Wikipedia dataset
Curriculum
Hadoop
Module 1 Introduction to Big Data & Hadoop, Hadoop Ecosystem, Map Reduce and
HDFS
What is Big Data?
Factors constituting Big Data
Hadoop and its Ecosystem
MapReduce: Concepts of Map, Reduce, Ordering, Concurrency, Shuffle, Reducing
Hadoop Distributed File System (HDFS) Concepts and its Importance
Deep Dive in MapReduce: Execution Framework, Partitioner, Combiner, Data Types, Key-Value pairs
HDFS Deep Dive Architecture, Data Replication, Name Node, Data Node, Data Flow
Parallel Copying with DISTCP, Hadoop Archives
Assignment 1
Module 2 Hands-on Exercises
Installing Hadoop in Pseudo-Distributed Mode, Understanding Important Configuration Files, their
Properties and Daemon Threads
Accessing HDFS from Command Line
Map Reduce Basic Exercises
Understanding Big Data Hadoop Ecosystem
Introduction to Sqoop, use cases and Installation
Introduction to Hive, use cases and Installation
Introduction to Pig, use cases and Installation
Introduction to Oozie, use cases and Installation
Introduction to Flume, use cases and Installation
Introduction to Yarn
Assignment -2 and 3
Mini Project: Importing MySQL Data using Sqoop and Querying it using Hive
Module 3 Deep Dive in Map Reduce.

1. Mapper & Reducer
Relation between input splits and HDFS blocks.
MapReduce job submission flow of input splits.
2. How Mapper and Combiner Work
How Mappers and Combiners work
3. Shuffle & Sort Phase, Combiner & Partitioner
MapReduce in detail.
Comparison between YARN and MRv1.
MapReduce job execution.
MapReduce Combiner.
MapReduce Partitioner.
Shuffle & Sort Phase.
4. Job Scheduler
MapReduce job submission flow
Job launch process (Job)
Job launch process (Task)
Job launch process (Task tracker)
Job launch process (Task runner)
5. Joining of Files/Datasets
Joining data sets in MapReduce.
Distributed cache.
6. Reduce Joins
Counters
Reduce Join
7.
8.
9.
Input Format
Custom Input Format.
Inverted Indexing.
Map Reduce Inverted Indexing
Hadoop APIs
10. Explanation of MapReduce organization.
How the mapper processes data, with a detailed example

Testing module.
How to develop a MapReduce application.
Writing unit tests: best practices for developing and writing.
Debugging MapReduce applications.
Module 3.1
Project 1- Hands on exercise end to end PoC using Yarn or Hadoop 2.
1. Handling real-world bank transactions
2. Moving data using Sqoop to HDFS
3. Incremental update of data to HDFS
4. Running MapReduce program
5. Running Hive queries for data analytics
Project 2- Hands on exercise end to end PoC using Yarn or Hadoop 2.7
Running Map Reduce Code for Movie Rating and finding their fans and average rating
Assignment -4 and 5
Module 4 Deep Dive in Pig
1. Introduction to Pig
What Is Pig?
Pig's Features
Pig Use Cases
Interacting with Pig
2. Basic Data Analysis with Pig
Pig Latin Syntax
Loading Data
Simple Data Types
Field Definitions
Data Output
Viewing the Schema
Filtering and Sorting Data
Commonly-Used Functions
Hands-On Exercise: Using Pig for ETL Processing
3. Processing Complex Data with Pig
Complex/Nested Data Types
Grouping
Iterating Grouped Data
Hands-On Exercise: Analyzing Data with Pig
4. Multi-Dataset Operations with Pig
Techniques for Combining Data Sets
Joining Data Sets in Pig
Set Operations
Splitting Data Sets
Hands-On Exercise
5. Extending Pig
Macros and Imports
UDFs
Using Other Languages to Process Data with Pig
Hands-On Exercise: Extending Pig with Streaming and UDFs
6. Pig Jobs
Case studies of Fortune 500 companies, Electronic Arts and Walmart, with real data sets.
Assignment 6
Module 5 Deep Dive in Hive
1. Introduction to Hive
What Is Hive?
Hive Schema and Data Storage
Comparing Hive to Traditional Databases
Hive vs. Pig
Hive Use Cases
Interacting with Hive
2. Relational Data Analysis with Hive
Hive Databases and Tables
Basic HiveQL Syntax
Data Types
Joining Data Sets
Common Built-in Functions
Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
3. Hive Data Management
Hive Data Formats
Creating Databases and Hive-Managed Tables
Loading Data into Hive
Altering Databases and Tables
Self-Managed Tables
Simplifying Queries with Views
Storing Query Results
Controlling Access to Data
Hands-On Exercise: Data Management with Hive
4. Hive Optimization
Understanding Query Performance
Partitioning
Bucketing
Indexing Data
5. Extending Hive
User-Defined Functions
6. Hands-on Exercises: Playing with huge data and querying extensively.
7. User-Defined Functions, Optimizing Queries, Tips and Tricks for performance tuning
Assignment 7
Module 6 Impala
1. Introduction to Impala
What is Impala?
How Impala Differs from Hive and Pig
How Impala Differs from Relational Databases
Limitations and Future Directions
Using the Impala Shell
2. Choosing the Best (Hive, Pig, Impala)
3. Modeling and Managing Data with Impala and Hive
Data Storage Overview
Creating Databases and Tables
Loading Data into Tables
HCatalog
Impala Metadata Caching
4. Data Partitioning
Partitioning Overview
Partitioning in Impala and Hive
Module 7 (AVRO) Data Formats
Selecting a File Format
Hadoop Tool Support for File Formats
Avro Schemas
Using Avro with Hive and Sqoop
Avro Schema Evolution
Compression
Module 8 Introduction to HBase architecture
What is HBase
Where does it fit
What is NoSQL
Assignment -8
Apache Spark
Module 9 Why Spark? Explain Spark and Hadoop Distributed File System
What is Spark
Comparison with Hadoop
Components of Spark
Module 10 Spark Components, Common Spark Algorithms: Iterative Algorithms, Graph Analysis, Machine Learning
Apache Spark- Introduction, Consistency, Availability, Partition
Unified Stack Spark
Spark Components
Comparison with Hadoop: Scalding example, Mahout, Storm, graph
Module 11 Running Spark on a Cluster, Writing Spark Applications using Python, Java, Scala
Explain Python example
Show how to install Spark
Explain the driver program
Explain Spark context with an example
Define weakly typed variable
Combine Scala and Java seamlessly.
Explain concurrency and distribution.
Explain what a trait is.
Explain higher-order functions with an example.
Define OFI scheduler.
Advantages of Spark
Example of a lambda using Spark
Explain MapReduce with an example
Module 12 Hadoop Cluster Setup and Running Map Reduce Jobs
Module 13 Major Project Putting it all together and Connecting Dots
Putting it all together and Connecting Dots
Working with Large data sets, Steps involved in analyzing large data
Assignment 9, 10
Module 14 Advanced MapReduce
Delving Deeper Into The Hadoop API
More Advanced Map Reduce Programming, Joining Data Sets in Map Reduce
Graph Manipulation in Hadoop
Assignment 11, 12
Module 15 ETL Connectivity with Hadoop Ecosystem
How ETL tools work in Big data Industry
Connecting to HDFS from ETL tool and moving data from Local system to HDFS
Moving Data from DBMS to HDFS
Working with Hive with ETL Tool
Creating Map Reduce job in ETL tool
End to End ETL PoC showing Hadoop integration with ETL tool.
Module 16 Hadoop Cluster Configuration
Hadoop configuration overview and important configuration files
Configuration parameters and values
HDFS parameters, MapReduce parameters
Hadoop environment setup
Include and Exclude configuration files

Lab: MapReduce Performance Tuning
Module 17 Hadoop Administration and Maintenance
Namenode/Datanode directory structures and files
File system image and Edit log
The Checkpoint Procedure
Namenode failure and recovery procedure
Safe Mode
Metadata and Data backup
Potential problems and solutions / what to look for
Adding and removing nodes

Lab: MapReduce File system Recovery
Module 18 Hadoop Monitoring and Troubleshooting
Best practices of monitoring a Hadoop cluster
Using logs and stack traces for monitoring and troubleshooting
Using open-source tools to monitor Hadoop cluster
Module 19 Job Scheduling
How to schedule Hadoop Jobs on the same cluster
Default Hadoop FIFO Scheduler
Fair Scheduler and its configuration
Module 20 Hadoop Multi-Node Cluster Setup and Running MapReduce Jobs on Amazon EC2
Module 21 ZooKeeper
ZooKeeper Introduction
ZooKeeper use cases
ZooKeeper Services
ZooKeeper data model
Znodes and their types
Znodes operations
Znodes watches
Znodes reads and writes
Consistency Guarantees
Cluster management
Leader Election
Distributed Exclusive Lock
Important points
Module 22 Advanced Oozie
Why Oozie?
Installing Oozie
Running an example
Oozie- workflow engine
Example M/R action
Word count example
Workflow application
Workflow submission
Workflow state transitions
Oozie job processing
Oozie- HADOOP security
Why Oozie security?
Job submission to hadoop
Multi tenancy and scalability
Time line of Oozie job
Coordinator
Bundle
Layers of abstraction
Architecture
Use Case 1: time triggers
Use Case 2: data and time triggers
Use Case 3: rolling window
Module 23 Advanced Flume
Apache Flume
Big data ecosystem
Physically distributed Data sources
Changing structure of Data
Closer look
Anatomy of Flume
Core concepts
Event
Clients
Agents
Source
Channels
Sinks
Interceptors
Channel selector
Sink processor
Data ingest
Agent pipeline
Transactional data exchange
Routing and replicating
Why channels?
Use case- Log aggregation
Adding flume agent
Handling a server farm
Data volume per agent
Example describing a single node flume deployment
Module 24 Advanced HUE
HUE introduction
HUE ecosystem
What is HUE?
HUE real world view
Advantages of HUE
How to upload data in File Browser?
View the content
Integrating users
Integrating HDFS
Fundamentals of HUE FRONTEND
Module 25 Advanced Impala
IMPALA Overview: Goals
User view of Impala: Overview
User view of Impala: SQL
User view of Impala: Apache HBase
Impala architecture
Impala state store
Impala catalogue service
Query execution phases
Comparing Impala to Hive
Testing
Module 26 Hadoop Stack Integration Testing
Why Hadoop testing is important
Unit testing
Integration testing
Performance testing
Diagnostics
Nightly QA test
Benchmark and end to end tests
Functional testing
Release certification testing
Security testing
Scalability Testing
Commissioning and Decommissioning of Data Nodes Testing
Reliability testing
Release testing
Module 27 Roles and Responsibilities of Hadoop Testing
Understanding the requirement, preparation of the testing estimation, test cases, test data, test
bed creation, test execution, defect reporting, defect retest, daily status report delivery, test completion.
ETL testing at every stage (HDFS, Hive, HBase) while loading the input (logs/files/records etc.) using
Sqoop/Flume, which includes but is not limited to data verification and reconciliation.
User authorization and authentication testing (groups, users, privileges etc.)
Reporting defects to the development team or manager and driving them to closure.
Consolidating all the defects and creating defect reports.
Validating new features and issues in Core Hadoop.
Module 28 MRUnit Framework for Testing MapReduce Programs
Creating a testing framework called MRUnit for testing MapReduce programs.
Module 29 Unit Testing
Automation testing using Oozie.
Data validation using the QuerySurge tool.
Module 30 Test Execution of Hadoop (customized)
Test plan for HDFS upgrade
Test automation and result
Module 31 Test Plan, Strategy and Test Cases of Hadoop Testing
How to test installation and configuration
Module 32 High Availability, Federation, YARN and Security

Module 33 Job and Certification Support
Major Project on Big Data and Hadoop, Hadoop Development, Cloudera Certification Tips and
Guidance and Mock Interview Preparation, Practical Development Tips and Techniques, certification
preparation
Project Work
1.
Project
Working with Map Reduce, Hive, Sqoop
Problem Statement
It describes how to import MySQL data using Sqoop and query it using Hive, and how to run the
word count MapReduce job.
2. Project
Work on Movie lens data for finding top records
Data
Movie Lens dataset
Problem Statement
It includes:
Write a MapReduce program to find the top 10 movies from the u.data file
Create the same top 10 movies using PIG by loading u.data into pig
Create the same top 10 movies using HIVE by loading u.data into HIVE
3. Project
Hadoop Yarn Project End to End PoC
Problem Statement
It includes:
Import Movie data
Append the data
How to use Sqoop commands to bring the data into HDFS
End to End flow of transaction data
How to process real-world data or a huge amount of data (for example, movie data) using a
MapReduce program.
4. Project
Partitioning Tables
Problem Statement
It describes partitioning and how to perform it. It includes:
Manual Partitioning
Dynamic Partitioning
Bucketing
5. Project
Sales Commission
Data
Sales
Problem Statement
This project calculates the commission according to the sales.
6. Project
Connecting Pentaho with Hadoop Eco-system
Problem Statement
It includes:
Quick Overview of ETL and BI
Configuring Pentaho to work with Hadoop Distribution
Loading data into Hadoop cluster
Transforming data into Hadoop cluster
Extracting data from Hadoop Cluster
7. Project
Multinode Cluster Setup
Problem Statement
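It includes the following actions:
Hadoop multi-node cluster setup using Amazon EC2
Creating a 4-node cluster setup
Running MapReduce jobs on the cluster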
8. Project
Hadoop Testing using MRUnit
Problem Statement
It describes how to test MapReduce code with MRUnit.
9. Project
Hadoop Weblog Analytics
Data
Weblogs
Problem Statement
The goal is to enable the participants to have a feel of the actual data sets in a production environment and
how to load the data into a Hadoop cluster using various techniques. Once data is loaded, the next goal is
to perform basic analytics on this data.
R Programming
Module 1 How R Works
Data mining Using Statistical packages
A Few concepts Before Starting
Module 2 part 1 What is R-Packages
R-Calculator
Assigning Values To Variables
Vector Creation
Module 2 part 2 What is Sorting
Generating Repeats
What is rep Function
Generating Factor Levels
Sorting Process
Module 2 part 3 Transpose Function
Stack Function Used
Module 3 part 1 Functions & Reading Data from External Files
Merge Function
Strsplit Function
Matrices
Matrix Manipulation
Row Sums
Module 3 part 2 Generating Plots and Pie Charts
Line Plots
Bar Plots
Bar Plots For Population
Histogram
Pie Chart Components
Module 4 part 1 Analysis of Variance (ANOVA)
One Way Analysis of Variance
Two Way Analysis of Variance
Module 4 part 2 What is Cluster Analysis
K-Means Clustering
Cluster Algorithm Working
Module 5 part 1 Association Rule Mining Affinity Analysis
Association Rule Mining Affinity Analysis
Module 5 part 2 Two Variable Relationships
Linear Regression
Dependent And Independent Variables
Scatter Plots
Module 6 Part 1 Database connectivity & Logistic Regression
Logistic Regression
Examples of Logistic Regression
Logistic Regression in R
Prediction
Module 6 Part 2 ROC Curve in R
Confusion Matrix
ROC Curve in R
Sensitivity & Specificity
Data Base Connectivity RODBC
Reading Data to ODBC Tables
Function (Mean)
Examples Of Function
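The confusion matrix, sensitivity and specificity topics in this module can be illustrated with a few lines of code. The course teaches them in R; the sketch below uses Python only to show the arithmetic, assumes binary labels where 1 is the positive class, and uses hypothetical function names and data.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


def sensitivity(counts):
    """True positive rate: share of actual positives that were predicted positive."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


def specificity(counts):
    """True negative rate: share of actual negatives that were predicted negative."""
    return counts["tn"] / (counts["tn"] + counts["fp"])


if __name__ == "__main__":
    # Hypothetical labels and predictions.
    actual = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 0, 1, 0, 1, 0, 0, 1]
    counts = confusion_counts(actual, predicted)
    print(counts, sensitivity(counts), specificity(counts))
```

An ROC curve plots sensitivity against (1 - specificity) as the classification threshold varies, so these two quantities are its building blocks.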
Module 7 Integrating R with Hadoop
Methods to integrate two popular open source software packages for Big Data analytics: R and Hadoop
Integrating R with Hadoop using RHadoop and RMR package
Exploring RHIPE (R Hadoop Integrated Programming Environment)
Writing MapReduce Jobs in R and executing them on Hadoop
Project
Restaurant Revenue Prediction
Data
Revenue Data set
Problem Statement
It predicts annual restaurant sales based on objective measurements. It uses the following data fields:
Id
Opening Date
Type of the City
Type of the Restaurant
Three categories of Obfuscated Data
Revenue
It also includes:
Data Overview
Data Fields
Evaluation using RMSE
Feature Engineering / Selection
Mahout
Module 1 Mahout Overview
Classification and Recommendation
Clustering in Mahout
Pattern Mining
Understanding machine Learning
Using Model diagram to decide the approach
Data flow
Supervised and Unsupervised learning
Module 2 Mahout Recommendations
Concept of Recommendation
Recommendations by E-commerce site
Comparison between User Recommendations and Item recommendation
Define recommenders and Classifiers
Process of Collaborative Filtering
Explaining Pearson coefficient algorithm
Euclidean distance measure
Implementing a recommender using map reduce
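The Pearson coefficient and the Euclidean distance measure listed in this module are both ways to compare two users' ratings. The sketch below computes them in plain Python and does not use Mahout's API. It assumes the two rating lists cover the same items in the same order; the function names and ratings are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # undefined when one user gave identical ratings throughout
    return covariance / (spread_x * spread_y)


def euclidean(x, y):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    # Hypothetical ratings by two users for the same four movies.
    user_a = [5, 3, 4, 4]
    user_b = [4, 2, 3, 3]
    print(pearson(user_a, user_b), euclidean(user_a, user_b))
```

Pearson measures whether two users' tastes move together regardless of how generous each rater is, while Euclidean distance is sensitive to the absolute rating values.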
Module 3 Clustering Session 1
Defining Clustering
User-to-user similarity
Clustering Illustration
Euclidean distance measure
Distance measure vector
Understanding the process of Clustering
Vectorizing documents-Unstructured data
Document clustering
Sequence-to-sparse Utility
K-Means Clustering
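As a small illustration of the clustering process covered in this module, the following sketch runs k-means on 2-D points in plain Python. It is not Mahout code. It assumes the starting centroids are supplied by the caller so the result is deterministic, and all names and data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_of(cluster):
    """Coordinate-wise mean of the points in a non-empty cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = [centroid_of(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters


if __name__ == "__main__":
    # Hypothetical points and starting centroids.
    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

In practice the starting centroids are usually chosen randomly or by a seeding heuristic, and the distance measure can be swapped for another one.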

Module 4 Classification Session 1
Terminology
Predictor and Target variable
Classifiable Data
Key Challenges in Classification algorithm
Vectorizing Continuous data
Classification Examples
Logistic Regression and its examples
Module 4 Clustering and Classification Session 2
Clustering
Clustering Process
Transaction Clustering
Different techniques of Vectorization
Distance measure
Clustering algorithm: K-Means
Clustering Application-1
Clustering Application-2
Sentiment Analyzer
Module 5 Pattern Mining
Pearson Coefficient
Collaborative Filtering Process
Collaborative Filtering
Similarity Algorithms
Pearson Correlation
Euclidean Distance Measure -Frequent Pattern & Association rules
Frequent Pattern Growth
Session 6 Course Summary

Data Science
Module 1 Getting started with Data Science and Recommender Systems
Data Science Overview
Reasons to use Data Science
Project Lifecycle
Data Acquisition
Evaluation of Input Data
Transforming Data
Statistical and analytical methods to work with data
Machine Learning basics
Introduction to Recommender systems
Apache Mahout Overview
Module 2 Reasons to Use, Project Lifecycle
What is Data Science?
What Kind of Problems can you solve?
Data Science Project Life Cycle
Data Science-Basic Principles
Data Acquisition
Data Collection
Understanding Data- Attributes in a Data, Different types of Variables
Build the Variable type Hierarchy
Two Dimensional Problem
Correlation between the Variables, explained using the Paint tool
Outliers, Outlier Treatment
Boxplot, How to Draw a Boxplot
Module 3 Acquiring Data
Discussion on Boxplot- also Explain
Example to understand variable Distributions
What is Percentile? Example using Rstudio tool
How do we identify outliers?
How do we handle outliers?
Outlier Treatment: Using Capping/Flooring General Method
Distribution- What is Normal Distribution?
Why is the Normal Distribution so popular?
Uniform Distribution
Skewed Distribution
Transformation
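The capping/flooring method of outlier treatment mentioned in this module can be sketched as follows. The example assumes the cap and floor are taken from chosen percentiles of the data, computed with linear interpolation. The percentile choices, function names and sample values are hypothetical.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (pct / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def cap_and_floor(values, low_pct=5, high_pct=95):
    """Clamp every value into the [low percentile, high percentile] range."""
    floor = percentile(values, low_pct)
    cap = percentile(values, high_pct)
    return [min(max(v, floor), cap) for v in values]


if __name__ == "__main__":
    # Hypothetical data with one extreme value.
    print(cap_and_floor([1, 2, 3, 4, 100], low_pct=25, high_pct=75))
```

Values beyond the chosen percentiles are pulled back to the boundary rather than removed, so the sample size stays the same while the influence of extreme values is reduced.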
Module 4 Machine Learning in Data Science
Discussion about Boxplot and Outlier
Goal: Increase Profits of a Store
Areas of increasing the efficiency
Data Request
Business Problem: To maximize shop Profits
What are Interlinked variables
What is Strategy
Interaction b/w the Variables
Univariate analysis
Multivariate analysis
Bivariate analysis
Relation b/w Variables
Standardize Variables
What is Hypothesis?
Interpret the Correlation
Negative Correlation
Machine Learning
Module 5 Statistical and analytical methods dealing with data, Implementation of Recommenders using Apache Mahout and Transforming Data
Correlation b/w Nominal Variables
Contingency Table
What is Expected Value?
What is Mean?
How Expected Value differs from Mean
Experiment: Controlled Experiment, Uncontrolled Experiment
Degree of Freedom
Dependency b/w Nominal Variable & Continuous Variable
Linear Regression
Extrapolation and Interpolation
Univariate Analysis for Linear Regression
Building Model for Linear Regression
What does Pattern of Data mean?
Data Processing Operation
What is sampling?
Sampling Distribution
Stratified Sampling Technique
Disproportionate Sampling Technique
Balanced Allocation-part of Disproportionate Sampling
Systematic Sampling
Cluster Sampling
2 angles of Data Science: Statistical Learning, Machine Learning
Module 6 Testing and Assessment, Production Deployment and More
Multi variable analysis
Linear regression
Simple linear regression
Hypothesis testing
Speculation vs. claim(Query)
Sample
Step to test your hypothesis
performance measure
Generate null hypothesis
alternative hypothesis
Testing the hypothesis
Threshold value
Hypothesis testing explanation by example
Null Hypothesis
Alternative Hypothesis
Probability
Histogram of mean value
Revisit CHI-SQUARE independence test
Correlation between Nominal Variable
Module 7 Business Algorithms, Simple approaches to Prediction, Building model, Model deployment
Machine Learning
Importance of Algorithms
Supervised and Unsupervised Learning
Various Algorithms on Business
Simple approaches to Prediction
Predict Algorithms
Population data
sampling
Disproportionate Sampling
Steps in Model Building
Sample the data
What is K?
Training Data
Test Data
Validation data
Model Building
Find the accuracy
Rules
Iteration
Deploy the model
Linear regression
Module 8 Getting started with Segmentation of Prediction and Analysis
Clustering
Cluster and Clustering with Example
Data Points, Grouping Data Points
Manual Profiling
Horizontal & Vertical Slicing
Clustering Algorithm
Criteria to take into consideration before clustering
Graphical Example
Clustering & Classification: Exclusive Clustering, Overlapping Clustering, Hierarchical Clustering
Simple Approaches to Prediction
Different types of Distances: 1. Manhattan, 2. Euclidean, 3. Cosine Similarity
Clustering Algorithm in Mahout
Probabilistic Clustering
Pattern Learning
Nearest Neighbor Prediction
Nearest Neighbor Analysis
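To show how a distance measure feeds into nearest neighbor prediction, the sketch below implements Manhattan distance and cosine similarity in plain Python and uses a distance function to pick the label of the closest known point. The function names, labels and sample points are hypothetical.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined for a zero vector
    return dot / (norm_a * norm_b)


def nearest_neighbor_label(query, labeled_points, distance=manhattan):
    """Predict the label of the single closest labeled point."""
    _point, label = min(labeled_points, key=lambda item: distance(query, item[0]))
    return label


if __name__ == "__main__":
    # Hypothetical labeled points.
    known = [((0, 0), "low"), ((10, 10), "high")]
    print(nearest_neighbor_label((1, 1), known))
    print(cosine_similarity((1, 2), (2, 4)))
```

Note that cosine similarity grows as vectors become more alike, so it would need to be converted to a distance (for example, one minus the similarity) before being passed as the `distance` argument.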
Module 9 Integration of R and Hadoop
R introduction
How R is typically used
Features of R
Introduction to Big data
R+Hadoop
Ways to connect with R and Hadoop
Products
Case Study
Architecture
Steps for Installing RIMPALA
How to create IMPALA packages
Projects
Project 1: Understanding Cold Start Problem in Data Science
Algorithms for Recommender
Ways of Recommendation
Types of Recommendation: Collaborative Filtering Based Recommendation, Content-Based Recommendation
Cold Start Problem
Project 2: Recommendation for Movie, Summary
Recommendation for movie
Two Types of Predictions: Rating Prediction, Item Prediction
Important Approaches: Memory Based and Model Based
Knowing User Based Methods in K-Nearest Neighbor
Understanding Item Based Method
Matrix Factorization
Singular Value Decomposition
Data Science Project discussion
Collaborative Filtering
Business Variables Overview
Data Science Assignment
Real-time enterprise problem
Use of various datasets to solve this problem
Use of Variables for Problem Resolution
Building strategy to solve this problem with the available data
Descriptive Statistics
SPT
Module 1 Information of Statistics
What is statistics
How is this useful
What is this course for
Module 2 Data Conversion
Converting data into useful information
Collecting the data
Understand the data
Finding useful information in the data
Interpreting the data
Visualizing the data
Module 3 Terms of Statistics
Descriptive statistics
Let us understand some terms in statistics
Variable
Module 4 Plots
Dot Plots
Histogram
Stemplots
Box and whisker plots
Outlier detection from box plots and Box and whisker plots
Module 5 Statistics & Probability
What is probability
Set & rules of probability
Bayes Theorem
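Bayes' theorem updates a prior probability after seeing evidence. The sketch below works through the standard two-outcome form, P(A|B) = P(B|A) P(A) / P(B), in Python. The function name, parameter names and the diagnostic-test numbers are hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # Total probability of the evidence B across both cases.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


if __name__ == "__main__":
    # Hypothetical test: 1% prevalence, 90% detection rate, 5% false positives.
    print(posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05))
```

With a rare condition, even a fairly accurate test yields a modest posterior probability, because false positives from the large unaffected group outnumber the true positives.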
Module 6 Distributions
Probability Distributions
Few Examples
Student's t-Distribution
Sampling Distribution
Poisson Distribution
Module 7 Sampling
Stratified Sampling
Proportionate Sampling
Systematic Sampling
P Value
Stratified Sampling
Module 8 Tables & Analysis
Cross Tables
Bivariate Analysis
Multivariate Analysis
Dependence and Independence Tests (Chi-Square)
Analysis of Variance
Correlation between Nominal variables
Project
Data Analysis Project
Data
Sales
Problem Statement
It includes the following actions:
Understand the business solutions
Discussion with the warehouse team
Data Collection & Storage
Data Cleaning
Build a Hypothesis Tree around the business problem
Produce the final result.
Apache Solr
Module 1. The Fundamentals
About Solr
Installing and running Solr
Adding content to Solr
Reading a Solr XML response
Changing parameters in the URL
Using the browse interface

Module 2. Searching
Sorting results
Query parsers
More queries
Hardwiring request parameters
Adding fields to default search
Faceting
Result grouping
Module 3. Indexing
Adding your own content to Solr
Deleting data from Solr
Building a bookstore search
Adding book data
Exploring the book data
Dedupe update processor

Module 4. Updating your schema
Adding fields to the schema
Analyzing text
Module 5. Relevance
Field weighting
Phrase queries
Function queries
Fuzzier search
Sounds-like
Module 6. Extended features
More-like-this
Geospatial
Spell checking
Suggestions
Highlighting
Pseudo-fields
Pseudo-joins
Multilanguage
Module 7. Multicore
Adding more kinds of data

Module 8. SolrCloud
Introduction
How SolrCloud works
Commit strategies
ZooKeeper
Managing Solr config files
Project
Function Queries
Problem Statement
This project shows how to use function queries in Solr. Suppose an index stores the dimensions x, y and z,
in meters, of some hypothetical boxes, with arbitrary names stored in the field boxname. Suppose we want to
search for boxes matching the name findbox, ranked according to their volume.
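One way to express this request is to sort by the function `product(x,y,z)`, which multiplies the three dimension fields. The sketch below only builds the request URL with the Python standard library. The host, port and core name `boxes` are assumptions for illustration, and it presumes x, y and z are indexed as numeric fields.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr instance with a core named "boxes".
SOLR_SELECT_URL = "http://localhost:8983/solr/boxes/select"

def volume_ranked_query(name):
    """Build a select URL that matches boxes by name and ranks them by volume."""
    params = {
        "q": f"boxname:{name}",                       # match on the box name
        "fl": "boxname,x,y,z,volume:product(x,y,z)",  # return the volume as a pseudo-field
        "sort": "product(x,y,z) desc",                # largest volume first
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"

url = volume_ranked_query("findbox")
```

An alternative is to fold the function value into the relevance score rather than sorting on it, which blends the volume with the text match instead of replacing it.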
Splunk
Module 1 Basic Concepts of Splunk Development
Splunk development concepts
Roles and responsibilities of Splunk Developer
Module 2 Saving and Scheduling Searches
Exporting search results
Saving and sharing search results
Saving searches
Search scheduling
Module 3 Creating Alerts
Describing alerts
Alert Creation
View fired alerts
Module 4 Tags and Event Types
Understanding tags
Creating tags and using them in a search
Defining event types and their usefulness
Creating and using event types in a search
Module 5 Search Commands
Reviewing search commands and performing general search practices
Examine the anatomy of a search
Using various commands to perform searches: fields, table, rename, rex & erex, multikv
Module 6 Reporting Commands
Using the following commands and their functions:
1. top
2. rare
3. stats
4. addcoltotals
5. addtotals
Module 7 Visualizations
Explore the available visualizations
Create Charts and timecharts
Omit null values and format results
Module 8 Analyzing, Calculating and Formatting Results
Using eval command
Perform calculations
Value Conversion
Round values
Format values
Conditional statements
Filtering calculated results
Module 9 Correlating Events
Overview of Transactions
Search Transactions
Module 10 Enriching Data with Lookups
What are lookups?
Lookup file example
Creating a lookup table
Defining a lookup
Configuring an automatic lookup
Using the lookup in searches and reports
Module 11 Creating Reports and Dashboards
Creating reports and charts
Creating dashboards and adding reports
Module 12 Getting started with Parsing
Data Preview and Parsing Phase
Raw Data Manipulation
Extraction of Fields
Project
In the Splunk project, undertaken after finishing this training course, you will create a report and dashboard
from a text file containing employee details.
You will perform various row operations to fetch data as per your requirements and use important
Splunk commands on the file to extract certain fields.
Other significant aspects of this project are editing events, adding tags, searching events by tag
name and saving tag searches.
Splunk Admin
Module 1- Simple Splunk Environment
Installing Splunk
License Management
Data Inputs
App management
Module 2- Basic Production Environment
Introduction to Splunk Configuration Files
Universal Forwarder
Forwarder Management
Module 3 Various Data Inputs
Understanding Monitor Inputs
What are Network Inputs?
Define Modular and Scripted Inputs
Explaining Windows Inputs
What are Fine-tuning Inputs?
Module 4 Index and User Management
Concept of Indexing in Splunk
Maintenance and Optimization of Indexes
Users: Their Roles and Authentication
Module 5 Getting started with Parsing
Data Preview and Parsing Phase
Raw Data Manipulation
Extraction of Fields
Module 6 Search Scaling and Monitoring
Performing Distributed Search
Search Performance Tuning
Understanding Execution issues in large scale deployment
Distributed Management Console
Project
Field Extraction
Problem Statement
It includes:
About Field Extraction
Field Extractor Utility
Field Extraction page in Splunk Web
Configure field extraction in configuration files etc.
Apache Storm
Module 1 Understanding Architecture of Storm
Bayesian Law
Hadoop Distributed Computing
Big Data features
Legacy Architecture of Real Time System
Storm vs. Hadoop
Logical Dynamic and Components in Storm
Storm Topology
Execution Components in Storm
Stream Grouping
Tuple
Spout
Bolt-normalization bolt
Module 2 Installation of Apache storm
Installing Apache Storm
Module 3 Grouping
Different types of Grouping
Reliable and unreliable messaging
Fetching data: Direct connection and Enqueued messages
Bolt Lifecycle
Module 4 Overview of Trident
Trident Spouts and its types
Components and Interface of Trident spout
Trident Function, Filter & Aggregator
Module 5 Bootstrapping
Twitter Bootstrapping
Detailed learning on Bootstrapping
Concepts of Storm
Storm Development Environment
Projects
Real-time Project on Storm
The Project Bolt Blue Print
HBase
Module 1 HBase Overview
Getting started with HBase
Core Concepts of HBase
Understanding HBase with an Example
Module 2 Architecture of NoSQL
Why HBase?
Where to use HBase?
What is NoSQL?
Module 3 HBase Data Modeling
HDFS vs. HBase
HBase Use Cases
Data Modeling HBase
Module 4 HBase Cluster Components
HBase Architecture
Main components of HBase Cluster
Module 5 HBase API and Advanced Operations
HBase Shell
HBase API
Primary Operations
Advanced Operations
Module 6 Integration of Hive with HBase
Create a Table and Insert Data into it
Integration of Hive with HBase
Load Utility
Module 7 File loading with both load Utility
Putting Folder to VM
File loading with both load Utility
Project
Integrate Hive and Java with HBase
Problem Statement
This project describes how to integrate Hive and Java with HBase. It includes the following actions:
Installation of HBase
Creation of Table
Java Program to create the table in HBase
Managing the HBase Table with Hive
Bulk Import etc.
Cassandra
Module 1-Advantages and Usage of Cassandra
Brief Introduction of the course
Advantages and Usage of Cassandra
Module 2-CAP Theorem and NoSQL Database
Why NoSQL Database
Replication in RDBMS
Key Challenges with RDBMS
Schema
NoSQL (Not only SQL)
NoSQL Categories
Advantages & Limitations
Key Characteristics of NoSQL Database
CAP Theorem
Consistency
Module 3-Cassandra fundamentals, Data model, Installation and setup
What is Cassandra?
Non relational
Key deployment concept
What is column oriented database
Data Model column
What is column family
Installation
Module 4-Steps in Configuration
Token calculation
Configuration overview
Nodetool
Validators
Comparators
Expiring column
QA
Module 5-Summarization, nodetool commands, cluster, Indexes, Cassandra & MapReduce, Installing OpsCenter
Difference between Relational modeling & Cassandra modeling
Steps in Cassandra modeling
Time series modeling in Cassandra
Column family
Data modeling in Cassandra
Column family vs. Super column family
Counter column family
Partitioners
Partitioner strategies
Replication
Gossip protocols
Read operation
Consistency
Comparison
Module 6-Multi Cluster setup
Node settings
Setup of Multinode cluster
Row cache and Key cache
Read operation
System keyspace
Commands overview
Column family
VNodes
Module 7-Thrift/AVRO/JSON/Hector Client
Thrift
AVRO
JSON
Hector client
How to write Java code
Hector tag
Module 8-Datastax installation part, Secondary index
Node tool commands
Management of Cassandra
Secondary index
Cassandra & map reduce
Datastax installation part
Module 9-Cassandra API and Summarization and Thrift
API
Internals of connection pool
Client connectivity to Cassandra
Hector client key features
Hector client key concepts
Java code
Summarization
Thrift
MongoDB
Module 1 Getting started with NoSQL, MongoDB and their Installation
Database type description
What is NoSQL Database?
NoSQL Database Types
Challenges with RDBMS
Why do we require NoSQL?
What is MongoDB?
JSON/BSON Introduction
JSON Data Types
Example of JSON
Installation of MongoDB
Module 2 Part 1 NoSQL and its Importance
Database Type
OLTP
OLAP
NoSQL
Types of NoSQL Database
Challenges with RDBMS
Why NoSQL
ACID property
CAP Theorem
Base property
Introduction to JSON/BSON
JSON Data Types
Database collection & document
MongoDB use cases
Unacknowledged
Acknowledged
Journaled
Fsynced
Replica Acknowledged
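The five names above are MongoDB write concern levels. As a rough sketch, they can be written as write-concern documents like the ones below. This uses plain Python dictionaries rather than a driver; the value `w: 2` for Replica Acknowledged is just one example of "more than one member", and `fsync` is a legacy option from older MongoDB releases, so treat the exact values as assumptions to check against the version in use.

```python
# Write concern levels expressed as plain write-concern documents (no driver needed).
WRITE_CONCERNS = {
    "Unacknowledged":       {"w": 0},                 # fire and forget
    "Acknowledged":         {"w": 1},                 # primary confirms the write
    "Journaled":            {"w": 1, "j": True},      # confirmed after the journal write
    "Fsynced":              {"w": 1, "fsync": True},  # legacy: confirmed after flush to disk
    "Replica Acknowledged": {"w": 2},                 # example: primary plus one secondary
}

def waits_for_server(level):
    """True if the client waits for any acknowledgement at this level."""
    return WRITE_CONCERNS[level]["w"] > 0

# Levels ordered from weakest to strongest guarantee on the number of acknowledging members.
by_strength = sorted(WRITE_CONCERNS, key=lambda level: WRITE_CONCERNS[level]["w"])
```

Stronger levels trade write latency for durability, which is why the level is usually chosen per operation rather than globally.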
Module 2 Part 2 CRUD Operations
MongoDB CRUD Tutorial
Installation Rent
used ppt
JSON and its syntax
CRUD Introduction
Read and Write Operations
Write Operation Concern Levels
MongoDB CRUD Tutorials
MongoDB CRUD Reference
Hands on with CRUD Operations
Module 3 Part 1 Understanding Schema Design, Backup strategies, Data Modeling and Monitoring
Data Modeling in MongoDB
RDBMS vs. Data models
Data Modeling tools
Data modeling example & patterns
Model Tree Structures
Operational strategies
Backup strategies
Monitoring
Monitoring Commands
Monitoring of performance issues
Run time configuration
Export & import of data
Relationships between Documents
Model Specific Application Contexts
Data Model Reference
Hands on with MongoDB Data Modeling
Module 3 Part 2 Data Administration and Management
Data Management
Introduction to replica
Election of new primary
Replica set
Type of Replica
Hidden Replica
Arbiter Replica
Sharding
Concepts around Replication
Setting up replicated cluster
Setting up Sharded Cluster
Sharding Database, Collections
Hands on Exercise
Module 4 Indexes and Aggregation
Introduction to Indexes
Concepts around Indexes
Type of Indexes
Index Property
Introduction to Aggregation
Type of Aggregation
Use cases of Aggregation
Hands on Exercise
Module 5 Security in MongoDB
Security Risks to Databases
MongoDB Security Approach
MongoDB Security Concept
Access Control
Integration of MongoDB with Robomongo
Integration of MongoDB with Java
Module 6 MongoDB Integration with Jaspersoft, Load and Manage Unstructured Data (Videos, Images, Logs, Resumes etc.)
Integration of MongoDB with Jaspersoft
Additional Concept (GridFS, mongofiles)
Loading and Managing Unstructured Data (Videos, Images, Logs, Resumes etc.)
Project
Java MongoDB Integration
Problem Statement
This project creates a table (collection) and inserts a video file into it using a Java program. It involves the following actions:
Installation of Java
Adding MongoDB Java Connector etc.
Apache Spark
Module 1-Why Spark? Explain Spark and Hadoop Distributed File System
What is Spark
Comparison with Hadoop
Components of Spark
Module 2-Spark Components, Common Spark Algorithms-Iterative Algorithms, Graph Analysis, Machine Learning
Apache Spark- Introduction, Consistency, Availability, Partition
Unified Stack Spark
Spark Components
Comparison with Hadoop: Scalding example, Mahout, Storm, graph
Module 3-Running Spark on a Cluster, Writing Spark Applications using Python, Java, Scala
Explain Python example
Show how to install Spark
Explain driver program
Explaining SparkContext with example
Define weakly typed variable
Combine Scala and Java seamlessly
Explain concurrency and distribution
Explain what a trait is
Explain higher order function with example
Define FIFO scheduler
Advantages of Spark
Example of Lambda using Spark
Explain MapReduce with example
Module 4-RDD and its operation
Difference between RISC and CISC
Define Apache Mesos
Cartesian product between two RDDs
Define count
Define Filter
Define Fold
Define API Operations
Define Factors
Module 5-Spark, Hadoop, and the Enterprise Data Centre, Common Spark Algorithms
How a Hadoop cluster differs from Spark
Define writing data
Explain sequence file and its usefulness
Define protocol buffers
Define text file, CSV, Object Files and File System
Define sparse matrices
Explain RDD and Compression
Explain data stores and its usefulness
Module 6-Spark Streaming
Define Elasticsearch
Explain Streaming and its usefulness
Apache BookKeeper
Define DStream
Define MapReduce word count
Explain Parquet
Scala ORM
Define MLlib
Explain multigraph (GraphX) and its usefulness
Define property graph
Module 7-Persistence in Spark
Persistence
Motivation
Example
Transformation
Scala and Python
Examples: K-means
Latent Dirichlet Allocation (LDA)
Module 8-Broadcast and accumulator
Motivation
Broadcast Variables
Example: Join
Alternative if one table is small
Better version with broadcast
How to create a Broadcast
Accumulators motivation
Example: Join
Accumulator Rules
Custom accumulators
Another common use
Creating an accumulator using spark context object
Module 9-Spark SQL and RDD
Introduction
Spark SQL main capabilities
Spark SQL usage diagram
Spark SQL
Important topics in Spark SQL- Data frames
Twitter language analysis
Module 10-Operations/Accumulators/Traits
How parallelism Takes place
The Master Parameter
Join Operations Example
Accumulators
Traits
Module 11-Scheduling/Partitioning
Task Scheduling/ distribution
Scheduling Around Applications
Static Partitioning
Dynamic Sharing
Scheduling Within An Application
Fair Scheduling
High Availability Of Spark Master
Standby Masters With Zookeeper
Single Node Recovery With Local File System
High Order Functions
Module 12-Capacity Planning in Spark
Practicals : Creating Maps, Transformations
Capacity planning in Spark
Concurrency in Java
Concurrency in Scala
Module 13-Log Analysis
Array Buffers
Compact Buffer
Protocol Buffer
Log Analysis With Spark
First Log Analyzers In Spark
Mini Projects
Project 1. List the items
Project 2. Sorting of Records
Project 3. Show a histogram of date vs users created. Optionally, use a rich visualization like
Project 4. Prepare a map of tags vs # of questions in each tag and display it.
Major Projects
Project 1 Movie Recommendation
Project 2 Twitter API Integration for tweet Analysis
Project 3 Data Exploration Using Spark SQL Wikipedia dataset
Scala
Module 1-Introduction of Scala
Scala Overview
Module 2-Pattern Matching
Advantages of Scala
REPL (Read-Evaluate-Print Loop)
Language Features
Type Inference
Higher order function
Option
Pattern Matching
Collection
Currying
Traits
Application Space
Module 3-Executing the Scala code
Uses of the Scala interpreter
Example of static object timer in Scala
Testing of String equality in Scala
Implicit classes in Scala with examples
Recursion in Scala
Currying in Scala with examples
Classes in Scala
Module 4-Classes concept in Scala
Constructor
Constructor overloading
Properties
Abstract classes
Type hierarchy in Scala
Object equality
Val and var methods
Module 5-Case classes and pattern matching
Sealed traits
Case classes
Constant pattern in case classes
Wild card pattern
Variable pattern
Constructor pattern
Tuple pattern
Module 6-Concepts of traits with example
Java equivalents
Advantages of traits
Avoiding boilerplate code
Linearization of traits
Modelling a real world example
Module 7-Scala-Java Interoperability
How traits are implemented in Scala and Java
How extending multiple traits is handled
Module 8-Scala collections
Classification of Scala collections
Iterable
Iterator and iterable
List sequence example in Scala
Module 9-Mutable collections vs. Immutable collections
Array in Scala
List in Scala
Difference between list and list buffer
Array buffer
Queue in Scala
Dequeue in Scala
Mutable queue in Scala
Stacks in Scala
Sets and maps in Scala
Tuples
Module 10-Use Case bobsrockets package
Different import types
Selective imports
Testing-Assertions
ScalaTest case: ScalaTest FunSuite
JUnit test in Scala
Interface for JUnit via JUnit3Suite in ScalaTest
SBT
Directory structure for packaging a Scala application
