Advance Big Data Science using Python-R-Hadoop-Spark (2/3)
Total Duration: 90 hours + Practice
Hadoop core components - HDFS
  HDFS Overview & Data storage in HDFS
  Get the data into Hadoop from local machine (Data Loading Techniques) - vice versa

Hadoop core components - MapReduce (YARN)
  MapReduce Overview (Traditional way vs. MapReduce way)
  Concept of Mapper & Reducer
  Understanding MapReduce program skeleton
  Running MapReduce job in Command line

Hadoop Data Analysis Tools: Hadoop-PIG
  Introduction to PIG - MapReduce vs. Pig, Pig Use Cases
  Pig Latin Program & Execution
  Pig Latin: Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
  Use Pig to automate the design and implementation of MapReduce applications
  Data Analysis using PIG

Hadoop Data Analysis Tools: Hadoop-Hive
  Introduction to Hive - Hive vs. PIG - Hive Use Cases
  Discuss the Hive data storage principle
  Explain the File formats and Records formats supported by the Hive environment
  Perform operations with data in Hive
  HiveQL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
  Hive Script, Hive UDF

Hadoop Data Analysis Tools: Impala
  Introduction to Impala & Architecture
  How Impala executes Queries and its importance
  Hive vs. PIG vs. Impala
  Extending Impala with User Defined functions
  Improving Impala Performance

Hadoop Data Analysis Tools: HBase (NoSQL Database)
  Introduction to NoSQL Databases, types, and HBase
  HBase vs. RDBMS, HBase Components, HBase Architecture
  HBase Cluster Deployment

Hadoop: Introduction to other Apache Projects
  Introduction to Zookeeper/Oozie/Sqoop/Flume

SPARK: Introduction
  Introduction to Apache Spark
  Streaming Data vs. In Memory Data
  MapReduce vs. Spark
  Modes of Spark
  Spark Installation Demo
  Overview of Spark on a cluster
  Spark Standalone Cluster

Spark: Spark in practice
  Invoking Spark Shell
  Creating the Spark Context
  Loading a File in Shell
  Performing Some Basic Operations on Files in Spark Shell
  Building a Spark Project with sbt
  Running Spark Project with sbt
  Caching Overview
  Distributed Persistence
  Spark Streaming Overview (Example: Streaming Word Count)

Spark: Spark meets Hive
  Analyze Hive and Spark SQL Architecture
  Analyze Spark SQL
  Context in Spark SQL
  Implement a sample example for Spark SQL
  Integrating Hive and Spark SQL
  Support for JSON and Parquet File Formats
  Implement Data Visualization in Spark
  Loading of Data
  Hive Queries through Spark
  Performance Tuning Tips in Spark
  Shared Variables: Broadcast Variables & Accumulators

Data Science using SPARK Python
  Hadoop - Python Integration
  Spark - Python Integration (PySpark)

Spark - Python: Machine Learning - Predictive Modeling Basics
  Introduction to Machine Learning & Predictive Modeling
  Types of Business problems - Mapping of Techniques
  Major Classes of Learning Algorithms - Supervised vs. Unsupervised Learning
  Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
  Overfitting (Bias-Variance Trade-off) & Performance Metrics
  Types of validation (Bootstrapping, K-Fold validation, etc.)