
About Course

Master the skills of Big Data, NoSQL and Data Science at once and become a successful Big Data Scientist
with access to 16 courses at once for a lifetime. Start your journey now!
List of Courses present in this combo pack

Hadoop Architect Training: All in 1 Combo Course: Hadoop Developer, Hadoop Analyst, Hadoop Administration
and Hadoop Testing
R Programming Training
Mahout Training
Data Science Training: Building Recommender Systems
Statistics and Probability Training
Apache Solr Training
Splunk Training
Apache Storm Training
Splunk Admin Training
HBase Training
Cassandra Training
MongoDB Training
Apache Spark, Scala Training
Key Features:

A comprehensive, in-depth combo of Big Data, Data Science and NoSQL courses covering as many as
16 niche, highly endorsed and top-paying technologies

Intensive learning through the Hadoop Architect Training All in 1 Combo Course, which includes
Hadoop Developer, Hadoop Analyst, Hadoop Administration and Hadoop Testing, along with R Programming Training,
Mahout Training, Data Science Training: Building Recommender Systems, Statistics and Probability Training,
Apache Solr Training, Splunk Training, Apache Storm Training, Splunk Admin Training, HBase Training,
Cassandra Training, MongoDB Training, and Apache Spark and Scala Training

237 hours of High-Quality in-depth Video E-Learning Sessions

404 hours of Lab Exercises

Intellipaat Proprietary VM for Lifetime and free cloud access for 6 months for performing exercises.

70% of extensive learning through Hands-on exercises, Project Work, Assignments and Quizzes

The training will prepare you for multiple Professional Certification Exams:
Cloudera Certification:
CCA Spark and Hadoop Developer, CCAH, R Certification, Mahout Certification, Cloudera Certification
(CCP:DS), Apache Storm Certification, Cloudera Apache HBase Certification, Apache Cassandra Professional
Certification, MongoDB Certification, Apache Spark Certification

24*7 Lifetime Support with Rapid Problem Resolution Guaranteed

Lifetime Access to Videos, Tutorials and Course Material

Guidance to Resume Preparation and Job Assistance

Step-by-step installation of multiple software packages

Course Completion Certificate from Intellipaat

About Big Data, Data Science Combo Course


Through this comprehensive course, learners can acquire the skills required of a Big Data Scientist
and gain in-depth knowledge of the Development, Administration and Analysis profiles and of the
integration of multiple systems. Gaining expertise in as many as 16 technologies through a single
order can open the door to your dream job at a top-notch company. Intellipaat's All in One Big
Data and Data Science Combo course covers widely endorsed technologies such as Hadoop, Spark,
Storm, Scala, NoSQL, Mahout, Splunk, Solr, Data Science, R Programming, and core statistics and
probability. This training course is a complete package for producing skilled, competent Big Data
Scientists and Architects. Enrolling in this course gives individuals in-depth knowledge and the chance to be
noticed by top multinationals worldwide.
Project Work:

Hadoop Projects

1. Project
Working with Map Reduce, Hive, Sqoop
Problem Statement

This project shows how to import MySQL data using Sqoop and query it using Hive, and how to run the
word count MapReduce job.
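The word count job mentioned above follows the usual map, shuffle and reduce steps. The snippet below is a minimal single-process sketch of that flow in plain Python, not the Hadoop Java API used in the course. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data big ideas", "data science"]))
# {'big': 2, 'data': 2, 'ideas': 1, 'science': 1}
```

On a real cluster the mapper and reducer run on different nodes and the framework performs the shuffle; here all three steps run in one process purely for illustration.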
2. Project
Work on MovieLens data for finding top records
Data
MovieLens Dataset
Problem Statement
It includes:

Write a MapReduce program to find the top 10 movies from the u.data file

Produce the same top 10 movies using Pig by loading u.data into Pig

Produce the same top 10 movies using Hive by loading u.data into Hive
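The brochure does not say how "top" is defined. The sketch below assumes it means the movies with the most ratings, and assumes the MovieLens 100K layout for u.data: tab-separated user id, item id, rating and timestamp. The function name `top_movies` and the sample rows are made up for illustration.

```python
from collections import Counter

def top_movies(lines, n=10):
    """Return the n most frequently rated movie ids as (movie_id, count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed records
        _user_id, movie_id, _rating, _timestamp = fields
        counts[movie_id] += 1
    # Sort by count descending, then by movie id so ties resolve deterministically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Made-up sample rows in the assumed u.data layout.
sample = ["1\t50\t5\t881250949", "2\t50\t4\t881250950", "3\t181\t3\t881250951"]
print(top_movies(sample, n=2))
# [('50', 2), ('181', 1)]
```

To try it on the real file, pass an open file handle, for example `with open("u.data") as f: top_movies(f)`. Ranking by average rating instead would only change what the counter accumulates.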

3. Project
Hadoop Yarn Project End to End PoC
Problem Statement
It includes:

Import Movie data

Append the data

How to use Sqoop commands to bring the data into HDFS

End to End flow of transaction data

How to process real-world data or a huge amount of data using a MapReduce program, for example
movie data
4. Project
Partitioning Tables
Problem Statement
It describes partitioning and how to perform it. It includes:

Manual Partitioning

Dynamic Partitioning

Bucketing
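As a rough intuition for the difference between the techniques listed above: partitioning groups rows by the value of a column, while bucketing spreads rows over a fixed number of buckets using a hash of a key. The sketch below imitates both in plain Python. It is not Hive code, and `zlib.crc32` is only a stand-in for Hive's own hash function. All names and sample rows are illustrative assumptions.

```python
import zlib
from collections import defaultdict

def partition_rows(rows, column):
    """Partitioning analogue: one group per distinct value of `column`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)

def bucket_of(value, num_buckets):
    """Bucketing analogue: hash the key, then take it modulo the bucket count.

    crc32 is a deterministic stand-in; Hive uses its own hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Made-up rows for illustration.
rows = [
    {"country": "IN", "user_id": 1},
    {"country": "US", "user_id": 2},
    {"country": "IN", "user_id": 3},
]
by_country = partition_rows(rows, "country")
print(sorted(by_country))   # the partition keys
print(bucket_of(42, 4))     # a bucket index between 0 and 3
```

With manual (static) partitioning the partition value is named explicitly when loading; with dynamic partitioning it is derived from the data, which is what `partition_rows` imitates.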
5. Project
Sales Commission
Data
Sales
Problem Statement

This project calculates commission according to sales.


6. Project
Connecting Pentaho with Hadoop Ecosystem
Problem Statement
It includes:

Quick Overview of ETL and BI

Configuring Pentaho to work with Hadoop Distribution

Loading data into Hadoop cluster

Transforming data in Hadoop cluster

Extracting data from Hadoop Cluster


7. Project
Multi-node Cluster Setup
Problem Statement
It includes the following actions:

Hadoop multi-node cluster setup using Amazon EC2: creating a 4-node cluster

Running Map Reduce Jobs on Cluster


8. Project
Hadoop Testing using MR
Problem Statement
It describes how to test MapReduce code with MRUnit.
9. Project
Hadoop Weblog Analytics
Data
Weblogs
Problem Statement
The goal is to enable the participants to have a feel of the actual data sets in a production environment and how
to load the data into a Hadoop cluster using various techniques. Once data is loaded, the next goal is to perform
basic analytics on this data.
R Programming Project
Restaurant Revenue Prediction
Data
Revenue data set
Problem Statement
It predicts annual restaurant sales based on objective measurements. It uses the following data fields:

Id

Opening Date

Type of the City

Type of the Restaurant

Three categories of Obfuscated Data

Revenue
It also includes:

Data Overview

Data Fields

Evaluation using RMSE

Feature Engineering / Selection
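The project is evaluated using RMSE (root mean squared error), which is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch follows, written in Python rather than R only to keep it self-contained. The function name `rmse` and the sample numbers are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up revenue figures: errors of 2 and 0 give sqrt((4 + 0) / 2).
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Because the errors are squared before averaging, a lower RMSE is better and large errors are penalized more heavily than small ones.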

Data Science Projects:


Project 1: Understanding the Cold Start Problem in Data Science

Algorithms for Recommender

Ways of Recommendation

Types of Recommendation: Collaborative Filtering Based Recommendation, Content-Based Recommendation

Cold Start Problem
Project 2: Recommendation for Movie, Summary

Recommendation for movie

Two Types of Predictions: Rating Prediction, Item Prediction

Important Approaches: Memory-Based and Model-Based

Knowing User-Based Methods in K-Nearest Neighbor

Understanding Item-Based Method

Matrix Factorization

Singular Value Decomposition

Data Science Project discussion

Collaborative Filtering

Business Variables Overview
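To make the user-based K-Nearest Neighbor idea from the project list concrete, here is a small memory-based collaborative filtering sketch. It is not the course's implementation. It assumes ratings are stored as a dictionary of user to {item: rating}, uses cosine similarity over co-rated items, and returns `None` when no neighbour has rated the item, which is a simple case of the cold start problem. All names and ratings are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict_rating(ratings, user, item, k=2):
    """Similarity-weighted average rating from the k most similar users who rated `item`."""
    neighbours = [
        (cosine_similarity(ratings[user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_similarity = sum(sim for sim, _ in neighbours)
    if total_similarity == 0:
        return None  # cold start: nobody comparable has rated this item
    return sum(sim * rating for sim, rating in neighbours) / total_similarity

# Made-up ratings for illustration.
ratings = {
    "ann": {"m1": 5, "m2": 3},
    "bob": {"m1": 5, "m2": 3, "m3": 4},
    "cat": {"m1": 1, "m2": 5, "m3": 2},
}
print(predict_rating(ratings, "ann", "m3"))  # between cat's 2 and bob's 4, closer to bob
print(predict_rating(ratings, "ann", "m9"))  # None
```

An item-based method would apply the same similarity idea to the columns of the rating matrix instead of the rows.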

SPT Project
Data Analysis Project
Data Sales
Problem Statement
It includes the following actions:

Understand the business solutions

Discussion with the warehouse team

Data Collection & Storage

Data Cleaning

Build a Hypothesis Tree around the business problem

Produce the final result.

Apache Solr Project


Function Queries
Problem Statement: It describes how to use function queries in Solr. Suppose an index stores the
dimensions x, y, z (in meters) of some hypothetical boxes, with arbitrary names stored in the field boxname.
Suppose we want to search for boxes matching the name findbox, ranked according to box volume.
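One possible way to express this ranking is to sort by Solr's `product` function over the three dimension fields. The sketch below only builds the request URL. It assumes x, y and z are single-valued numeric fields, and the host, port and core name (`boxes`) are hypothetical placeholders.

```python
from urllib.parse import urlencode

def box_volume_query(name):
    """Build Solr query parameters that match on boxname and sort by volume."""
    return {
        "q": f"boxname:{name}",
        "sort": "product(x,y,z) desc",  # function query: volume = x * y * z
        "fl": "boxname,x,y,z",
    }

params = box_volume_query("findbox")
# Hypothetical host and core name; adjust to the actual Solr installation.
url = "http://localhost:8983/solr/boxes/select?" + urlencode(params)
print(url)
```

Sorting by the function keeps the text match and the ranking separate. Folding the volume into the relevance score instead would be a different design choice.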

Splunk Project:

After finishing this training course, the Splunk Project will have you create a report and dashboard from
a text file containing employee details.

You will perform various row operations to fetch data as per your requirements and use important
Splunk commands on the file to extract certain fields.

Other significant aspects of this project are editing the event, adding tags, searching event with tag
names and saving tag search.

Splunk Admin Project Field Extraction


Problem Statement It includes:

About Field Extraction

Field Extractor Utility

Field Extraction page in Splunk Web

Configure field extraction in configuration files etc.

Apache Storm Projects:

Real-time Project on Storm

The Project Bolt Blue Print

HBase Project Integrate Hive and Java with HBase


Problem Statement: This project describes how to integrate Hive and Java with HBase. It includes
the following actions:

Installation of HBase

Creation of Table

Java Program to create the table in HBase

Managing the HBase Table with Hive

Bulk Import etc.

MongoDB Project Java MongoDB Integration


Problem Statement: It creates a table and inserts a video file using a Java program. For this, it performs
the following actions:

Installation of Java

Adding MongoDB Java Connector etc.

Apache Spark Projects:


Mini Projects
Project 1. List the items
Project 2. Sorting of Records
Project 3. Show a histogram of date vs users created. Optionally, use a rich visualization.
Project 4. Prepare a map of tags vs # of questions in each tag and display it.
Major Projects
Project 1. Movie Recommendation
Project 2. Twitter API Integration for Tweet Analysis
Project 3. Data Exploration Using Spark SQL (Wikipedia dataset)

Curriculum
Hadoop
Module 1 Introduction to Big Data & Hadoop, Hadoop Ecosystem, Map Reduce and
HDFS

What is Big Data?

Factors constituting Big Data

Hadoop and its Ecosystem

Map Reduce: Concepts of Map, Reduce, Ordering, Concurrency, Shuffle

Hadoop Distributed File System (HDFS) Concepts and its Importance

Deep Dive in Map Reduce Execution Framework, Partitioner, Combiner, Data Types, Key pairs

HDFS Deep Dive Architecture, Data Replication, Name Node, Data Node, Data Flow

Parallel Copying with DISTCP, Hadoop Archives

Assignment 1
Module 2 Hands-on Exercises

Installing Hadoop in Pseudo Distributed Mode, Understanding Important configuration files, their
Properties and Daemon Threads

Accessing HDFS from Command Line

Map Reduce Basic Exercises

Understanding Big Data Hadoop Ecosystem

Introduction to Sqoop, use cases and Installation

Introduction to Hive, use cases and Installation

Introduction to Pig, use cases and Installation

Introduction to Oozie, use cases and Installation

Introduction to Flume, use cases and Installation

Introduction to Yarn

Assignment -2 and 3
Mini Project: Importing MySQL Data using Sqoop and Querying it using Hive

Module 3 Deep Dive in Map Reduce.


1.

Mapper & Reducer

Relation between input splits and HDFS blocks.

Map reduce job submission flow of input splits.

2.

How Mapper and Combiner Work

How Mappers and Combiners work

3.

Shuffle & Sort Phase, Combiner & Partitioner.

Map Reduce in detail.

Comparison between YARN and MRv1

MapReduce job Execution.

MapReduce Combiner.

MapReduce Partitioner.

Shuffle & Sort Phase.

4.

Job Scheduler

Map reduce job submission flow

Job launch process (Job)

Job launch Process (task)

Job launch process (Task tracker)

Job launch process (Task runner)

5.

Joining Of Files/Datasets

Joining Data sets in Map Reduce.

Distributed cache.

6.

Reduce Joins

Counters

Reduce Join
7.

Input Format

Custom Input Format.

8.

Inverted Indexing

Map Reduce Inverted Indexing

9.

Hadoop APIs

10.

Explanation of MapReduce organization.

How the mapper processes data, with a detailed example

Testing module.

How to develop a Map Reduce Application.

Writing unit tests: best practices for developing and writing.

Debugging Map Reduce applications.

Module 3.1

Project 1- Hands on exercise end to end PoC using Yarn or Hadoop 2.

1.

Handling real-world bank transactions

2.

Moving data using Sqoop to HDFS

3.

Incremental update of data to HDFS

4.

Running Map Reduce Program

5.

Running Hive queries for data analytics

Project 2- Hands on exercise end to end PoC using Yarn or Hadoop 2.7
Running Map Reduce Code for Movie Rating and finding their fans and average rating

Assignment -4 and 5
Module 4 Deep Dive in Pig
1.

Introduction to Pig

What Is Pig?

Pig's Features

Pig Use Cases

Interacting with Pig

2.

Basic Data Analysis with Pig

Pig Latin Syntax

Loading Data

Simple Data Types

Field Definitions

Data Output

Viewing the Schema

Filtering and Sorting Data

Commonly-Used Functions

Hands-On Exercise: Using Pig for ETL Processing

3.

Processing Complex Data with Pig

Complex/Nested Data Types

Grouping

Iterating Grouped Data

Hands-On Exercise: Analyzing Data with Pig

4.

Multi-Dataset Operations with Pig

Techniques for Combining Data Sets

Joining Data Sets in Pig

Set Operations

Splitting Data Sets

Hands-On Exercise

5.

Extending Pig

Macros and Imports

UDFs

Using Other Languages to Process Data with Pig

Hands-On Exercise: Extending Pig with Streaming and UDFs


6.

Pig Jobs

Case studies of Fortune 500 companies (Electronic Arts and Walmart) with real
data sets.
Assignment 6
Module 5 Deep Dive in Hive
1.

Introduction to Hive

What Is Hive?

Hive Schema and Data Storage

Comparing Hive to Traditional Databases

Hive vs. Pig

Hive Use Cases

Interacting with Hive


2.

Relational Data Analysis with Hive

Hive Databases and Tables

Basic HiveQL Syntax

Data Types

Joining Data Sets

Common Built-in Functions

Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
3.

Hive Data Management

Hive Data Formats

Creating Databases and Hive-Managed Tables

Loading Data into Hive

Altering Databases and Tables

Self-Managed Tables

Simplifying Queries with Views

Storing Query Results

Controlling Access to Data

Hands-On Exercise: Data Management with Hive

4.

Hive Optimization

Understanding Query Performance

Partitioning

Bucketing

Indexing Data

5.

Extending Hive

User-Defined Functions

6.

Hands-on Exercises: Playing with huge data and Querying extensively.

7.

User-Defined Functions, Optimizing Queries, Tips and Tricks for performance tuning

Assignment 7
Module 6 Impala
1.

Introduction to Impala

What is Impala?

How Impala Differs from Hive and Pig

How Impala Differs from Relational Databases

Limitations and Future Directions

Using the Impala Shell

2.

Choosing the Best (Hive, Pig, Impala)

3.

Modeling and Managing Data with Impala and Hive

Data Storage Overview

Creating Databases and Tables

Loading Data into Tables

HCatalog

Impala Metadata Caching

4.

Data Partitioning

Partitioning Overview

Partitioning in Impala and Hive

Module 7 (Avro) Data Formats

Selecting a File Format

Hadoop Tool Support for File Formats

Avro Schemas

Using Avro with Hive and Sqoop

Avro Schema Evolution

Compression

Module 8 Introduction to HBase architecture

What is HBase

Where does it fit

What is NoSQL

Assignment -8
Apache Spark
Module 9 Why Spark? Explain Spark and Hadoop Distributed File System

What is Spark

Comparison with Hadoop

Components of Spark

Module 10 Spark Components, Common Spark Algorithms: Iterative Algorithms, Graph
Analysis, Machine Learning

Apache Spark- Introduction, Consistency, Availability, Partition

Unified Stack Spark

Spark Components

Comparison with Hadoop: Scalding example, Mahout, Storm, graph

Module 11 Running Spark on a Cluster, Writing Spark Applications using Python, Java,
Scala

Explain python example

Show how to install Spark

Explain driver program

Explaining spark context with example

Define weakly typed variable

Combine scala and java seamlessly.

Explain concurrency and distribution.

Explain what a trait is.

Explain higher order function with example.

Define OFI scheduler.

Advantages of Spark

Example of Lambda using Spark

Explain MapReduce with example

Module 12 Hadoop Cluster Setup and Running Map Reduce Jobs

Hadoop multi-node cluster setup using Amazon EC2: creating a 4-node cluster

Running Map Reduce Jobs on Cluster

Module 13 Major Project Putting it all together and Connecting Dots

Putting it all together and Connecting Dots

Working with Large data sets, Steps involved in analyzing large data

Assignment 9, 10
Module 14 Advanced MapReduce

Delving Deeper Into The Hadoop API

More Advanced Map Reduce Programming, Joining Data Sets in Map Reduce

Graph Manipulation in Hadoop

Assignment 11, 12
Module 15 ETL Connectivity with Hadoop Ecosystem

How ETL tools work in Big data Industry

Connecting to HDFS from ETL tool and moving data from Local system to HDFS

Moving Data from DBMS to HDFS

Working with Hive with ETL Tool

Creating Map Reduce job in ETL tool

End to End ETL PoC showing Hadoop integration with ETL tool.

Module 16 Hadoop Cluster Configuration

Hadoop configuration overview and important configuration file

Configuration parameters and values

HDFS parameters, MapReduce parameters

Hadoop environment setup

Include and Exclude configuration files


Lab: MapReduce Performance Tuning

Module 17 Hadoop Administration and Maintenance

Namenode/Datanode directory structures and files

File system image and Edit log

The Checkpoint Procedure

Namenode failure and recovery procedure

Safe Mode

Metadata and Data backup

Potential problems and solutions / what to look for

Adding and removing nodes


Lab: MapReduce File system Recovery

Module 18 Hadoop Monitoring and Troubleshooting

Best practices of monitoring a Hadoop cluster

Using logs and stack traces for monitoring and troubleshooting

Using open-source tools to monitor Hadoop cluster

Module 19 Job Scheduling

How to schedule Hadoop Jobs on the same cluster

Default Hadoop FIFO Scheduler

Fair Scheduler and its configuration

Module 20 Hadoop Multi Node Cluster Setup and Running Map Reduce Jobs on
Amazon EC2

Hadoop multi-node cluster setup using Amazon EC2: creating a 4-node cluster

Running Map Reduce Jobs on Cluster

Module 21 ZooKeeper

ZooKeeper Introduction

ZooKeeper use cases

ZooKeeper Services

ZooKeeper data Model

Znodes and their types

Znode operations

Znode watches

Znode reads and writes

Consistency Guarantees

Cluster management

Leader Election

Distributed Exclusive Lock

Important points

Module 22 Advanced Oozie

Why Oozie?

Installing Oozie

Running an example

Oozie- workflow engine

Example M/R action

Word count example

Workflow application

Workflow submission

Workflow state transitions

Oozie job processing

Oozie- HADOOP security

Why Oozie security?

Job submission to hadoop

Multi tenancy and scalability

Time line of Oozie job

Coordinator

Bundle

Layers of abstraction

Architecture

Use Case 1: time triggers

Use Case 2: data and time triggers

Use Case 3: rolling window

Module 23 Advanced Flume

Apache Flume

Big data ecosystem

Physically distributed Data sources

Changing structure of Data

Closer look

Anatomy of Flume

Core concepts

Event

Clients

Agents

Source

Channels

Sinks

Interceptors

Channel selector

Sink processor

Data ingest

Agent pipeline

Transactional data exchange

Routing and replicating

Why channels?

Use case- Log aggregation

Adding flume agent

Handling a server farm

Data volume per agent

Example describing a single node flume deployment

Module 24 Advanced HUE

HUE introduction

HUE ecosystem

What is HUE?

HUE real world view

Advantages of HUE

How to upload data in File Browser?

View the content

Integrating users

Integrating HDFS

Fundamentals of HUE FRONTEND

Module 25 Advanced Impala

IMPALA Overview: Goals

User view of Impala: Overview

User view of Impala: SQL

User view of Impala: Apache HBase

Impala architecture

Impala state store

Impala catalogue service

Query execution phases

Comparing Impala to Hive

Testing
Module 26 Hadoop Stack Integration Testing

Why Hadoop testing is important

Unit testing

Integration testing

Performance testing

Diagnostics

Nightly QA test

Benchmark and end to end tests

Functional testing

Release certification testing

Security testing

Scalability Testing

Commissioning and Decommissioning of Data Nodes Testing

Reliability testing

Release testing

Module 27 Roles and Responsibilities of Hadoop Testing

Understanding the Requirement, preparation of the Testing Estimation, Test Cases, Test Data, Test
bed creation, Test Execution, Defect Reporting, Defect Retest, Daily Status report delivery, Test completion.

ETL testing at every stage (HDFS, Hive, HBase) while loading the input (logs/files/records etc.) using
Sqoop/Flume, which includes, but is not limited to, data verification and reconciliation.

User Authorization and Authentication testing (Groups, Users, Privileges etc.)

Report defects to the development team or manager and drive them to closure.

Consolidate all the defects and create defect reports.

Validating new feature and issues in Core Hadoop.

Module 28 Framework called MR Unit for Testing of Map-Reduce Programs

Responsible for creating a testing Framework called MR Unit for testing of Map-Reduce programs.

Module 29 Unit Testing

Automation testing using the OOZIE.

Data validation using the QuerySurge tool.

Module 30 Test Execution of Hadoop (customized)

Test plan for HDFS upgrade

Test automation and result

Module 31 Test Plan Strategy Test Cases of Hadoop Testing

How to test install and configure

Module 32 High Availability Federation, Yarn and Security


Module 33 Job and Certification Support

Major Project on Big Data and Hadoop, Hadoop Development, Cloudera Certification Tips and
Guidance and Mock Interview Preparation, Practical Development Tips and Techniques, certification
preparation

Project Work
1.

Project
Working with Map Reduce, Hive, Sqoop

Problem Statement
It describes how to import MySQL data using Sqoop and query it using Hive, and how to run
the word count MapReduce job.

2. Project
Work on MovieLens data for finding top records

Data
MovieLens dataset

Problem Statement
It includes:

Write a MapReduce program to find the top 10 movies from the u.data file

Produce the same top 10 movies using Pig by loading u.data into Pig

Produce the same top 10 movies using Hive by loading u.data into Hive

3. Project
Hadoop Yarn Project End to End PoC

Problem Statement
It includes:

Import Movie data

Append the data

How to use Sqoop commands to bring the data into HDFS

End to End flow of transaction data

How to process real-world data or a huge amount of data using a MapReduce program, for example
movie data

4. Project
Partitioning Tables

Problem Statement
It describes partitioning and how to perform it. It includes:

Manual Partitioning

Dynamic Partitioning

Bucketing

5. Project
Sales Commission

Data
Sales

Problem Statement
This project calculates commission according to sales.

6. Project
Connecting Pentaho with Hadoop Eco-system

Problem Statement
It includes:

Quick Overview of ETL and BI

Configuring Pentaho to work with Hadoop Distribution

Loading data into Hadoop cluster

Transforming data in Hadoop cluster

Extracting data from Hadoop Cluster

7. Project
Multinode Cluster Setup

Problem Statement
It includes the following actions:

Hadoop multi-node cluster setup using Amazon EC2: creating a 4-node cluster

Running Map Reduce Jobs on Cluster

8. Project
Hadoop Testing using MR

Problem Statement
It describes how to test MapReduce code with MRUnit.

9. Project
Hadoop Weblog Analytics

Data

Weblogs

Problem Statement
The goal is to enable the participants to have a feel of the actual data sets in a production environment and
how to load the data into a Hadoop cluster using various techniques. Once data is loaded, the next goal is
to perform basic analytics on this data.

R Programming
Module 1 How R Works

Data mining Using Statistical packages

A Few concepts Before Starting

Module 2 part 1 What is R-Packages

R-Calculator

Assigning Values To Variables

Vector Creation

Module 2 part 2 What is Sorting

Generating Repeats

What is rep Function

Generating Factor Levels

Sorting Process

Module 2 part 3 Transpose Function

Stack Function Used

Module 3 part 1 Functions & Reading Data from External Files

Merge Function

Strsplit Function

Matrices

Matrix Manipulation

Row Sums

Module 3 part 2 Generating Plots and Pie Charts

Line Plots

Bar Plots

Bar Plots For Population

Histogram

Pie Chart Components

Module 4 part 1 Analysis of Variance (ANOVA)

One Way Analysis of Variance

Two Way Analysis of Variance

Module 4 part 2 What is Cluster Analysis

K-Means Clustering

Cluster Algorithm Working
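The way the k-means cluster algorithm works can be sketched in a few lines: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The version below is in Python rather than R and takes the starting centroids as an argument so the result is reproducible. The function name `kmeans` and the sample points are illustrative assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up points forming two obvious groups.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
# [(1.0, 1.5), (8.5, 8.0)]
```

In practice the starting centroids are usually chosen at random, so different runs can converge to different clusterings.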

Module 5 part 1 Association Rule Mining Affinity Analysis

Association Rule Mining Affinity Analysis

Module 5 part 2 Two Variable Relationships

Linear Regression

Dependent And Independent Variables

Scatter Plots

Module 6 Part 1 Database connectivity & Logistic Regression

Logistic Regression

Examples of Logistic Regression

Logistic Regression in R

Prediction

Module 6 Part 2 ROC Curve in R

Confusion Matrix

ROC Curve in R

Sensitivity & Specificity

Data Base Connectivity RODBC

Reading Data to ODBC Tables

Function (Mean)

Examples Of Function
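For the confusion matrix, sensitivity and specificity topics in this module, the definitions can be made concrete as follows: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The sketch below is in Python rather than R and assumes binary labels coded as 1 (positive) and 0 (negative). The function names and sample labels are illustrative.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """True positive rate: share of actual positives that were predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of actual negatives that were predicted negative."""
    return tn / (tn + fp)

# Made-up labels for illustration.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)
# 2 1 1 1
print(sensitivity(tp, fn), specificity(tn, fp))
```

An ROC curve plots sensitivity against 1 minus specificity as the classification threshold varies, so these two functions are its building blocks.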

Module 7 Integrating R with Hadoop

Methods to integrate two popular open source software packages for Big Data analytics: R and Hadoop

Integrating R with Hadoop using RHadoop and RMR package

Exploring RHIPE (R Hadoop Integrated Programming Environment)

Writing MapReduce Jobs in R and executing them on Hadoop

Project
Restaurant Revenue Prediction

Data
Revenue Data set

Problem Statement
It predicts annual restaurant sales based on objective measurements. It uses the following data fields:

Id

Opening Date

Type of the City

Type of the Restaurant

Three categories of Obfuscated Data

Revenue
It also includes:

Data Overview

Data Fields

Evaluation using RMSE

Feature Engineering / Selection

Mahout
Module 1 Mahout Overview

Classification and Recommendation

Clustering in Mahout

Pattern Mining

Understanding machine Learning

Using Model diagram to decide the approach

Data flow

Supervised and Unsupervised learning

Module 2 Mahout Recommendations

Concept of Recommendation

Recommendations by E-commerce site

Comparison between User Recommendations and Item recommendation

Define recommenders and Classifiers

Process of Collaborative Filtering

Explaining Pearson coefficient algorithm

Euclidean distance measure

Implementing a recommender using map reduce
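The Pearson coefficient listed in this module measures how strongly two users' ratings move together, from -1 (opposite tastes) to +1 (identical trends). The following is a plain Python sketch of the formula, not Mahout's implementation. Returning 0.0 when one side has no variance is a convention chosen here, and the function name is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length rating lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant ratings as uncorrelated
    return covariance / math.sqrt(var_x * var_y)

# Made-up ratings by two users for the same three items.
print(pearson([1, 2, 3], [2, 4, 6]))   # close to 1.0: same trend
print(pearson([1, 2, 3], [3, 2, 1]))   # close to -1.0: opposite trend
```

Because the means are subtracted first, Pearson correlation ignores the fact that some users rate everything higher than others, which is one reason it is popular for collaborative filtering.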

Module 3 Clustering Session 1

Defining Clustering

User-to-user similarity

Clustering Illustration

Euclidean distance measure

Distance measure vector

Understanding the process of Clustering

Vectorizing documents-Unstructured data

Module 3 Clustering Session 2

Document clustering

Sequence-to-sparse Utility

K-Means Clustering

Module 3 Clustering Session 3


Module 4 Classification Session 1

Terminology

Predictor and Target variable

Classifiable Data

Key Challenges in Classification algorithm

Vectorizing Continuous data

Classification Examples

Logistic Regression and its examples

Module 4 Clustering and Classification Session 2

Clustering

Clustering Process

Transaction Clustering

Different techniques of Vectorization

Distance measure

Clustering algorithm: K-Means

Clustering Application-1

Clustering Application-2

Sentiment Analyzer

Module 5 Pattern Mining

Pearson Coefficient

Collaborative Filtering Process

Collaborative Filtering

Similarity Algorithms

Pearson Correlation

Euclidean Distance Measure -Frequent Pattern & Association rules

Frequent Pattern Growth

Session 6 Course Summary


Data Science
Module 1 Getting started with Data Science and Recommender Systems

Data Science Overview

Reasons to use Data Science

Project Lifecycle

Data Acquisition

Evaluation of Input Data

Transforming Data

Statistical and analytical methods to work with data

Machine Learning basics

Introduction to Recommender systems

Apache Mahout Overview

Module 2 Reasons to Use, Project Lifecycle

What is Data Science?

What Kind of Problems can you solve?

Data Science Project Life Cycle

Data Science-Basic Principles

Data Acquisition

Data Collection

Understanding Data- Attributes in a Data, Different types of Variables

Build the Variable type Hierarchy

Two Dimensional Problem

Correlation between the Variables, explained using a Paint Tool

Outliers, Outlier Treatment

Boxplot, How to Draw a Boxplot

Module 3 Acquiring Data

Discussion on Boxplot- also Explain

Example to understand variable Distributions

What is Percentile? Example using the RStudio tool

How do we identify outliers?

How do we handle outliers?

Outlier Treatment: Using Capping/Flooring General Method

Distribution- What is Normal Distribution?

Why is the Normal Distribution so popular?

Uniform Distribution

Skewed Distribution

Transformation
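The capping/flooring method for outlier treatment listed above replaces values beyond chosen percentiles with the percentile values themselves. Here is a minimal Python sketch. The 5th and 95th percentile defaults and the linear-interpolation percentile are assumptions made for illustration, not settings taken from the course.

```python
def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in [0, 100])."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def cap_and_floor(values, low_q=5, high_q=95):
    """Clamp values below the low percentile or above the high percentile."""
    low, high = percentile(values, low_q), percentile(values, high_q)
    return [min(max(v, low), high) for v in values]

# Made-up data with one obvious outlier (100).
data = [1, 2, 3, 4, 100]
print(percentile(data, 50))          # the median
print(cap_and_floor(data, 0, 75))    # the outlier is pulled down to the 75th percentile
```

Unlike dropping outliers, capping keeps every row in the data set, which matters when the sample is small.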

Module 4 Machine Learning in Data Science

Discussion about Boxplot and Outlier

Goal: Increase Profits of a Store

Areas of increasing the efficiency

Data Request

Business Problem: To maximize shop Profits

What are Interlinked variables

What is Strategy

Interaction between the Variables

Univariate analysis

Multivariate analysis

Bivariate analysis

Relation between Variables

Standardize Variables

What is Hypothesis?

Interpret the Correlation

Negative Correlation

Machine Learning

Module 5 Statistical and analytical methods dealing with data, Implementation of
Recommenders using Apache Mahout and Transforming Data

Correlation between Nominal Variables

Contingency Table

What is Expected Value?

What is Mean?

How Expected Value differs from Mean

Experiment: Controlled Experiment, Uncontrolled Experiment

Degree of Freedom

Dependency between Nominal Variable & Continuous Variable

Linear Regression

Extrapolation and Interpolation

Univariate Analysis for Linear Regression

Building Model for Linear Regression

What does Pattern of Data mean?

Data Processing Operation

What is sampling?

Sampling Distribution

Stratified Sampling Technique

Disproportionate Sampling Technique

Balanced Allocation-part of Disproportionate Sampling

Systematic Sampling

Cluster Sampling

2 angles of Data Science: Statistical Learning, Machine Learning

Module 6 Testing and Assessment, Production Deployment and More

Multi variable analysis

Linear regression

Simple linear regression

Hypothesis testing

Speculation vs. claim (Query)

Sample

Steps to test your hypothesis

Performance measure

Generate null hypothesis

Alternative hypothesis

Testing the hypothesis

Threshold value

Hypothesis testing explanation by example

Null Hypothesis

Alternative Hypothesis

Probability

Histogram of mean value

Revisit CHI-SQUARE independence test

Correlation between Nominal Variable

Module 7 Business Algorithms, Simple approaches to Prediction, Building model,
Model deployment

Machine Learning

Importance of Algorithms

Supervised and Unsupervised Learning

Various Algorithms on Business

Simple approaches to Prediction

Predict Algorithms

Population data

sampling

Disproportionate Sampling

Steps in Model Building

Sample the data

What is K?

Training Data

Test Data

Validation data

Model Building

Find the accuracy

Rules

Iteration

Deploy the model

Linear regression

Module 8 Getting started with Segmentation of Prediction and Analysis

Clustering

Cluster and Clustering with Example

Data Points, Grouping Data Points

Manual Profiling

Horizontal & Vertical Slicing

Clustering Algorithm

Criteria to take into consideration before Clustering

Graphical Example

Clustering & Classification: Exclusive Clustering, Overlapping Clustering, Hierarchical Clustering

Simple Approaches to Prediction

Different types of Distances: 1. Manhattan, 2. Euclidean, 3. Cosine Similarity

Clustering Algorithm in Mahout

Probabilistic Clustering

Pattern Learning

Nearest Neighbor Prediction

Nearest Neighbor Analysis
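The three measures named in this module (Manhattan, Euclidean and cosine similarity) can be written directly from their definitions. The sketch below uses plain Python tuples as vectors. The function names are illustrative, and returning 0.0 for a zero-length vector in `cosine_similarity` is a convention chosen here.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(manhattan((0, 0), (3, 4)))          # 7
print(euclidean((0, 0), (3, 4)))          # 5.0
print(cosine_similarity((1, 0), (0, 1)))  # 0.0
```

Manhattan and Euclidean are distances (smaller means closer), while cosine is a similarity (larger means closer) and ignores vector length, which often suits sparse document vectors.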

Module 9 Integration of R and Hadoop

R introduction

How R is typically used

Features of R

Introduction to Big data

R+Hadoop

Ways to connect with R and Hadoop

Products

Case Study

Architecture

Steps for Installing RImpala

How to create Impala packages

Projects
Project 1: Understanding the Cold Start Problem in Data Science

Algorithms for Recommender

Ways of Recommendation

Types of Recommendation: Collaborative Filtering Based Recommendation, Content-Based Recommendation

Cold Start Problem

Project 2-Recommendation for Movie, Summary

Recommendation for movie

Two Types of Predictions Rating Prediction, Item Prediction

Important Approaches: Memory Based and Model Based

Knowing User Based Methods in K-Nearest Neighbor

Understanding Item Based Method

Matrix Factorization

Decomposition of Singular Value

Data Science Project discussion

Collaborative Filtering

Business Variables Overview

Data Science Assignment

Real-time enterprise problem

Use of various datasets to solve this problem

Use of Variables for Problem Resolution

Building strategy to solve this problem with the available data

Descriptive Statistics

SPT
Module 1 Information of Statistics

What is statistics

How is this useful

What is this course for

Module 2 Data Conversion

Converting data into useful information

Collecting the data

Understand the data

Finding useful information in the data

Interpreting the data

Visualizing the data

Module 3 Terms of Statistics

Descriptive statistics

Let us understand some terms in statistics

Variable

Module 4 Plots

Dot Plots

Histogram

Stemplots

Box and whisker plots

Outlier detection from box plots and Box and whisker plots
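
Outlier detection from a box plot is commonly based on the interquartile range (IQR): values beyond the whiskers, conventionally 1.5 x IQR from the quartiles, are flagged. The sketch below assumes that convention; the data and the function name `iqr_outliers` are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [12, 13, 14, 15, 15, 16, 17, 18, 95]
print(iqr_outliers(data))  # [95]
```

Different tools compute quartiles slightly differently, so borderline points may be classified differently from one package to another.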

Module 5 Statistics & Probability

What is probability

Set & rules of probability

Bayes Theorem
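
Bayes' theorem states that P(A|B) = P(B|A) P(A) / P(B). A small worked example may help; the screening-test numbers below are invented purely for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem, expanding P(B) with the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical test: 1% of people have the condition (prior),
# the test detects it 95% of the time, and gives 5% false positives.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with an accurate test, the low prior keeps the posterior probability at roughly 16%.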

Module 6 Distributions

Probability Distributions

Few Examples

Student's t-Distribution

Sampling Distribution

Poisson distribution

Module 7 Sampling

Stratified Sampling

Proportionate Sampling

Systematic Sampling

P Value

Module 8 Tables & Analysis

Cross Tables

Bivariate Analysis

Multivariate Analysis

Dependence and Independence tests (Chi-Square)

Analysis of Variance

Correlation between Nominal variables

Project
Data Analysis Project

Data
Sales

Problem Statement
It includes the following actions:

Understand the business solutions

Discussion with the warehouse team

Data Collection & Storage

Data Cleaning

Build a Hypothesis Tree around the business problem

Produce the final result.

Apache Solr
Module 1. The Fundamentals

About Solr

Installing and running Solr

Adding content to Solr

Reading a Solr XML response

Changing parameters in the URL

Using the browse interface


Module 2. Searching

Sorting results

Query parsers

More queries

Hardwiring request parameters

Adding fields to default search

Faceting

Result grouping
Module 3. Indexing

Adding your own content to Solr

Deleting data from Solr

Building a bookstore search

Adding book data

Exploring the book data

Dedupe update processor


Module 4. Updating your schema

Adding fields to the schema

Analyzing text
Module 5. Relevance

Field weighting

Phrase queries

Function queries

Fuzzier search

Sounds-like
Module 6. Extended features

More-like-this

Geospatial

Spell checking

Suggestions

Highlighting

Pseudo-fields

Pseudo-joins

Multilanguage
Module 7. Multicore

Adding more kinds of data


Module 8. SolrCloud

Introduction

How SolrCloud works

Commit strategies

ZooKeeper

Managing Solr config files

Project
Function Queries

Problem Statement
This project describes how to use function queries in Solr. Suppose an index stores the dimensions in meters
(x, y, z) of some hypothetical boxes, with arbitrary names stored in the field boxname. We want to search
for boxes matching the name findbox, ranked according to their volume.
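
One possible way to express this request is to sort by Solr's built-in `product` function applied to the three dimension fields. The sketch below only builds the request URL and does not contact a server; the host, the core name `boxes`, the `volume` alias, and the descending sort order are assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Solr host and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/boxes/select"


def volume_ranked_query(name):
    """Build a select URL that matches boxname and ranks results by x * y * z."""
    params = {
        "q": f"boxname:{name}",
        # Return the computed volume alongside the stored fields.
        "fl": "boxname,x,y,z,volume:product(x,y,z)",
        # Function query used as the sort key, largest volume first.
        "sort": "product(x,y,z) desc",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = volume_ranked_query("findbox")
print(url)
print(parse_qs(urlsplit(url).query)["sort"])  # ['product(x,y,z) desc']
```

Sorting keeps the match on boxname as a plain filter on the result set, while the function query alone decides the order.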

Splunk
Module 1 Basic Concepts of Splunk Development

Splunk development concepts

Roles and responsibilities of Splunk Developer

Module 2 Saving and Scheduling Searches

Exporting search results

Saving and sharing search results

Saving searches

Search scheduling

Module 3 Creating Alerts

Describing alerts

Alert Creation

View fired alerts

Module 4 Tags and Event Types

Understanding tags

Creating tags and using them in a search

Defining event types and their usefulness

Creating and using event types in a search

Module 5 Search Commands

Reviewing search commands and performing general search practices

Examine the anatomy of a search

Using various commands to perform searches: fields, table, rename, rex & erex, multikv

Module 6 Reporting Commands

Using the following commands and their functions:


1. top
2. rare
3. stats
4. addcoltotals
5. addtotals

Module 7 Visualizations

Explore the available visualizations

Create Charts and timecharts

Omit null values and format results

Module 8 Analyzing, Calculating and Formatting Results

Using eval command

Perform calculations

Value Conversion

Round values

Format values

Conditional statements

Filtering calculated results

Module 9 Correlating Events

Overview of Transactions

Search Transactions

Module 10 Enriching Data with Lookups

What are lookups?

Lookup file example

Creating a lookup table

Defining a lookup

Configuring an automatic lookup

Using the lookup in searches and reports

Module 11 Creating Reports and Dashboards

Creating reports and charts

Creating dashboards and adding reports

Module 12 Getting started with Parsing

Data Preview and Parsing Phase

Raw Data Manipulation

Extraction of Fields

Project

In the Splunk project, completed at the end of this training course, you will create a report and a
dashboard from a text file containing employee details.

You will perform various row operations to fetch the data you need and use important
Splunk commands on the file to extract specific fields.

Other significant aspects of this project are editing events, adding tags, searching for events by tag
name, and saving tag searches.

Splunk Admin
Module 1- Simple Splunk Environment

Installing Splunk

License Management

Data Inputs

App management

Module 2- Basic Production Environment

Introduction to Splunk Configuration Files

Universal Forwarder

Forwarder Management

Module 3 Various Data Inputs

Understanding Monitor Inputs

What are Network Inputs?

Define Modular and Scripted Inputs

Explaining Windows Inputs

What are Fine-tuning Inputs?

Module 4 Index and User Management

Concept of Indexing in Splunk

Maintenance and Optimization of Indexes

Users: Their Roles and Authentication

Module 5 Getting started with Parsing

Data Preview and Parsing Phase

Raw Data Manipulation

Extraction of Fields

Module 6 Search Scaling and Monitoring

Performing Distributed Search

Search Performance Tuning

Understanding Execution issues in large scale deployment

Distributed Management Console

Project
Field Extraction

Problem Statement
It includes:

About Field Extraction

Field Extractor Utility

Field Extraction page in Splunk Web

Configure field extraction in configuration files etc.

Apache Storm
Module 1 Understanding Architecture of Storm

Bayesian Law

Hadoop Distributed Computing

Big Data features

Legacy Architecture of Real Time System

Storm vs. Hadoop

Logical Dynamic and Components in Storm

Storm Topology

Execution Components in Storm

Stream Grouping

Tuple

Spout

Bolt-normalization bolt

Module 2 Installation of Apache Storm

Installing Apache Storm

Module 3 Grouping

Different types of Grouping

Reliable and unreliable messaging

Fetching data: Direct connection and Enqueued messages

Bolt Lifecycle

Module 4 Overview of Trident

Trident Spouts and its types

Components and Interface of Trident spout

Trident Function, Filter & Aggregator

Module 5 Bootstrapping

Twitter Bootstrapping

Detailed learning on Bootstrapping

Concepts of Storm

Storm Development Environment

Projects
Real-time Project on Storm
The Project Bolt Blue Print
HBase
Module 1 HBase Overview

Getting started with HBase

Core Concepts of HBase

Understanding HBase with an Example

Module 2 Architecture of NoSQL

Why HBase?

Where to use HBase?

What is NoSQL?

Module 3 HBase Data Modeling

HDFS vs. HBase

HBase Use Cases

Data Modeling HBase

Module 4 HBase Cluster Components

HBase Architecture

Main components of HBase Cluster

Module 5 HBase API and Advanced Operations

HBase Shell

HBase API

Primary Operations

Advanced Operations

Module 6 Integration of Hive with HBase

Create a Table and Insert Data into it

Integration of Hive with HBase

Load Utility

Module 7 File loading with both load Utility

Putting Folder to VM

File loading with both load Utility

Project
Integrate Hive and Java with HBase

Problem Statement
This project describes how to integrate Hive and Java with HBase. It includes the following actions:

Installation of HBase

Creation of Table

Java Program to create the table in HBase

Managing the HBase Table with Hive

Bulk Import etc.

Cassandra
Module 1-Advantages and Usage of Cassandra

Brief Introduction of the course

Advantages and Usage of Cassandra

Module 2-CAP Theorem and NoSQL Database

Why NoSQL Database

Replication in RDBMS

Key Challenges with RDBMS

Schema

NoSQL (Not only SQL)

NoSQL Category

Advantages & Limitations

Key Characteristics of NoSQL Database

CAP Theorem

Consistency

Module 3-Cassandra fundamentals, Data model, Installation and setup

What is Cassandra?

Non relational

Key deployment concept

What is column oriented database

Data Model column

What is column family

Installation

Module 4-Steps in Configuration

Token calculation

Configuration overview

Node tool

Validators

Comparators

Expiring column

QA
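
Token calculation for a cluster with one token per node is often done by spacing the tokens evenly around the ring. The sketch below assumes RandomPartitioner, whose token range runs from 0 to 2^127 - 1; other partitioners (for example Murmur3Partitioner) use a different range, and clusters using vnodes do not need manual tokens. The function name is hypothetical.

```python
def initial_tokens(node_count):
    """Evenly spaced initial tokens for RandomPartitioner (range 0 .. 2**127 - 1)."""
    return [i * (2 ** 127) // node_count for i in range(node_count)]


# Print one token per node for a hypothetical four-node cluster.
for node, token in enumerate(initial_tokens(4)):
    print(f"node {node}: initial_token = {token}")
```

Each value would be assigned to a different node so that every node owns an equal share of the ring.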

Module 5-Summarization, node tool commands, cluster, Indexes, Cassandra & MapReduce, Installing OpsCenter

Difference between Relational modeling & Cassandra modeling

Steps in Cassandra modeling

Time series modeling in Cassandra

Column family

Data modeling in Cassandra

Column family vs. Super column family

Counter column family

Partitioners

Partitioner strategies

Replication

Gossip protocols

Read operation

Consistency

Comparison

Module 6-Multi Cluster setup

Node settings

Setup of Multinode cluster

Row cache and Key cache

Read operation

System keyspace

Commands overview

Column family

VNodes

Module 7-Thrift/AVRO/JSON/Hector Client

Thrift

AVRO

JSON

Hector client

How to write Java code

Hector tag

Module 8-Datastax installation part, Secondary index

Node tool commands

Management of Cassandra

Secondary index

Cassandra & MapReduce

Datastax installation part

Module 9-Cassandra API and Summarization and Thrift

API

Internals of connection pool

Client connectivity to Cassandra

Hector client key features

Hector client key concepts

Java code

Summarization

Thrift

MongoDB
Module 1 Getting started with NoSQL, MongoDB and their Installation

Database type description

What is NoSQL Database?

NoSQL Database Types

Challenges with RDBMS

Why we require NoSQL data?

What is MongoDB

JSON/BSON Introduction

JSON Data Types

Example of JSON

Installation of MongoDB

Module 2 Part 1 NoSQL and its Importance

Database Type

OLTP

OLAP

NoSQL

Types of NoSQL Database

Challenges with RDBMS

Why NoSQL

ACID property

CAP Theorem

Base property

Introduction to JSON/BSON

JSON Data types

Database collection & document

MongoDB use cases

Unacknowledged

Acknowledged

Journaled

Fsynced

Replica Acknowledged

Module 2 Part 2 CRUD Operations

MongoDB CRUD Tutorial

Installation Rent

used ppt

JSON and its syntax

CRUD Introduction

Read and Write Operations

Write Operation Concern Levels

MongoDB CRUD Tutorials

MongoDB CRUD Reference

Hands on with CRUD Operations

Module 3 Part 1 Understanding Schema Design, Backup strategies, Data Modeling and Monitoring

Data Modeling in MongoDB

RDBMS vs. Data models

Data Modeling tools

Data modeling example & patterns

Model TREE structure

Operational strategies

Backup strategies

Monitoring

Monitoring Commands

Monitoring of performance issues

Run time configuration

Export & import of data

Relationships between Documents

Model Specific Application Contexts

Data Model Reference

Hands on with MongoDB Data Modeling

Module 3 Part 2 Data Administration and Management

Data Management

Introduction to replica

Election of new primary

Replica set

Type of Replica

Hidden Replica

Arbiter Replica

Sharding

Concepts around Replication

Setting up replicated cluster

Setting up Sharded Cluster

Sharding Database, Collections

Hands on Exercise

Module 4 Indexes and Aggregation

Introduction to Indexes

Concepts around Indexes

Type of Indexes

Index Property

Introduction to Aggregation

Type of Aggregation

Use cases of Aggregation

Hands on Exercise

Module 5 Security in MongoDB

Security Risks to Databases

MongoDB Security Approach

MongoDB Security Concept

Access Control

Integration of MongoDB with Robomongo

Integration of MongoDB with Java

Module 6 MongoDB Integration with Jaspersoft, Load and Manage Unstructured Data
(Videos, Images, Logs, Resumes etc.)

Integration of MongoDB with Jaspersoft

Additional Concept (GridFS mongo files)

Loading and Managing Unstructured Data (Videos, Images, Logs, Resumes etc.)

Project
Java MongoDB Integration

Problem Statement
This project creates a table to store a video file using a Java program. It includes the following actions:

Installation of Java

Adding MongoDB Java Connector etc.

Apache Spark
Module 1-Why Spark? Explain Spark and Hadoop Distributed File System

What is Spark

Comparison with Hadoop

Components of Spark

Module 2-Spark Components, Common Spark Algorithms-Iterative Algorithms, Graph Analysis, Machine Learning

Apache Spark- Introduction, Consistency, Availability, Partition

Unified Stack Spark

Spark Components

Comparison with Hadoop: Scalding example, Mahout, Storm, graph

Module 3-Running Spark on a Cluster, Writing Spark Applications using Python, Java,
Scala

Explain Python example

Show how to install Spark

Explain driver program

Explaining Spark context with example

Define weakly typed variable

Combine Scala and Java seamlessly.

Explain concurrency and distribution.

Explain what a trait is.

Explain higher order function with example.

Define OFI scheduler.

Advantages of Spark

Example of Lambda using Spark

Explain MapReduce with example
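
The MapReduce model mentioned in this module can be illustrated with the classic word count. The sketch below is plain Python that imitates the map, shuffle, and reduce phases on one machine; it does not use the Hadoop or Spark APIs, and the input lines and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


lines = ["Spark and Hadoop", "Spark and Scala"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'spark': 2, 'and': 2, 'hadoop': 1, 'scala': 1}
```

In a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle moves data between them over the network.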

Module 4-RDD and its operation

Difference between RISC and CISC

Define Apache Mesos

Cartesian product between two RDDs

Define count

Define Filter

Define Fold

Define API Operations

Define Factors

Module 5-Spark, Hadoop, and the Enterprise Data Centre, Common Spark Algorithms

How a Hadoop cluster is different from Spark

Define writing data

Explain sequence file and its usefulness

Define protocol buffers

Define text file, CSV, Object Files and File System

Define sparse matrices

Explain RDD and Compression

Explain data stores and its usefulness

Module 6-Spark Streaming

Define Elastic Search

Explain Streaming and its usefulness

Apache BookKeeper

Define DStream

Define MapReduce word count

Explain Parquet

Scala ORM

Define MLlib

Explain multi graphix and its usefulness

Define property graph

Module 7-Persistence in Spark

Persistence

Motivation

Example

Transformation

Scala and Python

Examples: K-means

Latent Dirichlet Allocation (LDA)

Module 8-Broadcast and accumulator

Motivation

Broadcast Variables

Example: Join

Alternative if one table is small

Better version with broadcast

How to create a Broadcast

Accumulators motivation

Example: Join

Accumulator Rules

Custom accumulators

Another common use

Creating an accumulator using spark context object
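
The idea behind the broadcast join covered in this module (the alternative when one table is small) and behind accumulators can be sketched without a cluster. The code below is plain Python, not the Spark API: the small table is turned into a lookup that every worker would receive once, and a simple counter stands in for an accumulator. All names and data are hypothetical.

```python
def broadcast_join(large_rows, small_table):
    """Join (key, value) rows against a small table held entirely in memory."""
    lookup = dict(small_table)  # in Spark this would be the broadcast variable
    missing = 0                 # plays the role of an accumulator
    joined = []
    for key, value in large_rows:
        if key in lookup:
            joined.append((key, value, lookup[key]))
        else:
            missing += 1        # workers only add; the driver reads the total
    return joined, missing


orders = [(1, "book"), (2, "pen"), (3, "lamp")]
customers = [(1, "Asha"), (2, "Ben")]

joined, missing = broadcast_join(orders, customers)
print(joined)   # [(1, 'book', 'Asha'), (2, 'pen', 'Ben')]
print(missing)  # 1
```

Because the small table travels to the workers, the large dataset never has to be shuffled across the network for the join.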

Module 9-Spark SQL and RDD

Introduction

Spark SQL main capabilities

Spark SQL usage diagram

Spark SQL

Important topics in Spark SQL- Data frames

Twitter language analysis

Module 10-Operations/Accumulators/Traits

How Parallelism Takes Place

The Master Parameter

Join Operations Example

Accumulators

Traits

Module 11-Scheduling/Partitioning

Task Scheduling/ distribution

Scheduling Around Applications

Static Partitioning

Dynamic Sharing

Scheduling Within An Application

Fair Scheduling

High Availability Of Spark Master

Standby Masters With Zookeeper

Single Node Recovery With Local File System

High Order Functions

Module 12-Capacity Planning in Spark

Practicals : Creating Maps, Transformations

Capacity planning in Spark

Concurrency in Java

Concurrency in Scala

Module 13-Log Analysis

Array Buffers

Compact Buffer

Protocol Buffer

Log Analysis With Spark

First Log Analyzers In Spark

Mini Projects
Project 1. List the items
Project 2. Sorting of Records
Project 3. Show a histogram of date vs users created. Optionally, use a rich visualization like
Project 4. Prepare a map of tags vs # of questions in each tag and display it.

Major Projects
Project 1 Movie Recommendation
Project 2 Twitter API Integration for tweet Analysis
Project 3 Data Exploration Using Spark SQL Wikipedia dataset

Scala
Module 1-Introduction of Scala

Scala Overview

Module 2-Pattern Matching

Advantages of Scala

REPL (Read-Evaluate-Print Loop)

Language Features

Type Inference

Higher order function

Option

Pattern Matching

Collection

Currying

Traits

Application Space

Module 3-Executing the Scala code

Uses of the Scala interpreter

Example of static object timer in Scala

Testing of String equality in Scala

Implicit classes in Scala with examples.

Recursion in Scala

Currying in Scala with examples.

Classes in Scala

Module 4-Classes concept in Scala

Constructor

Constructor overloading

Properties

Abstract classes

Type hierarchy in Scala

Object equality

Val and var methods

Module 5-Case classes and pattern matching

Sealed traits

Case classes

Constant pattern in case classes

Wild card pattern

Variable pattern

Constructor pattern

Tuple pattern

Module 6-Concepts of traits with example

Java equivalents

Advantages of traits

Avoiding boilerplate code

Linearization of traits

Modelling a real world example

Module 7-Scala-Java Interoperability

How traits are implemented in Scala and Java

How extending multiple traits is handled

Module 8-Scala collections

Classification of Scala collections

Iterable

Iterator and iterable

List sequence example in Scala

Module 9-Mutable collections vs. Immutable collections

Array in Scala

List in Scala

Difference between list and list buffer

Array buffer

Queue in Scala

Dequeue in Scala

Mutable queue in Scala

Stacks in Scala

Sets and maps in Scala

Tuples

Module 10-Use Case bobsrockets package

Different import types

Selective imports

Testing-Assertions

ScalaTest case: ScalaTest FunSuite

JUnit test in Scala

Interface for JUnit via JUnit3Suite in ScalaTest

SBT

Directory structure for packaging Scala application
