Day 1: Introduction to Big Data Problems and Hadoop

This is an overview session focused on concepts, baseline understanding, and experience sharing, with some hands-on work and installation of a pre-built (canned) training virtual machine.

  • Thinking at Scale
  • Hadoop Ecosystem
  • MapReduce and HDFS Concepts
  • Grid Architecture and Grid Benefits
  • ETL on Grid
  • Hands-on familiarity with the environment (including training virtual machine installation)

Thinking at Scale:

This session addresses common challenges and general best practices for scaling with data.

  • Problems faced by Web 2.0 companies - introduction via the Google and Yahoo search examples
  • Limitations of SMP (symmetric multiprocessing) and MPP (massively parallel processing) architectures - scaling and cost problems
  • Patterns and concepts associated with large-data problems

Hadoop Ecosystem:

  • An introduction to the other projects surrounding Hadoop, which complete the greater ecosystem of large-data processing tools
  • Understanding the computing problems best solved with Hadoop
  • Hadoop conceptual and physical architecture

MapReduce and HDFS:

  • MapReduce paradigm - introduction
  • File storage paradigm - introduction (Hadoop Distributed File System, HDFS)
  • MapReduce job for aggregation (see the sketch below)
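
A minimal sketch of the classic aggregation job, word count, written against the org.apache.hadoop.mapreduce API; the class names and paths are illustrative, not part of the course material.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }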

Grid Architecture and Grid Benefits:

  • Hadoop (grid) architecture - presentation
  • Benefits of the grid (built-in redundancy)

ETL on Grid:

  • Problems faced by large data warehouses
  • How clusters and grids work in ETL
  • End-to-end architecture for ETL on the grid
  • Need for other components
  • What cannot be done on the grid?
  • Real-time processing: grid latency and ETL transformations

Hands-On Environment:

Install Hadoop (MapReduce) on the training virtual machine

Carry-Over Exercise: Identify MapReduce application scenarios in the client context

Day 2: Getting Started with Hadoop: Hands-On Workshop on MapReduce

Introduction to Hadoop class files and JAR files

Hadoop MapReduce API

MapReduce job creation (in Java) - hands-on (see the driver sketch below)
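
A sketch of a job driver using Tool and ToolRunner, the idiomatic way to create and submit a job from the command line (generic options such as -D are parsed automatically); it reuses the mapper and reducer from the Day 1 word-count sketch, and the class name is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCountDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Job job = new Job(getConf(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
        }
    }

It would be submitted with something like: hadoop jar wordcount.jar WordCountDriver -D mapred.reduce.tasks=2 input output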

Job tracking over HTTP (the JobTracker web interface)

Instrumentation of map and reduce jobs (see the counter sketch below)
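
One standard instrumentation technique is custom counters, which the framework aggregates across all tasks and displays in the web UI and the job client output. A sketch, assuming a hypothetical tab-separated record format:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Counts well-formed and malformed records as a side effect of mapping.
    public class InstrumentedMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                context.getCounter("Records", "MALFORMED").increment(1);
                return; // skip bad records instead of failing the task
            }
            context.getCounter("Records", "GOOD").increment(1);
            context.write(new Text(fields[0]), NullWritable.get());
        }
    }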

MapReduce Algorithms and Programs

Common MapReduce patterns and how they are solved (a sorting sketch follows the list):

   Sorting (actual working code, not pseudocode)

   Searching

   Indexing

   Joining
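
As a reference for the sorting pattern, a sketch that sorts one number per input line: the mapper emits each number as a key, the shuffle sorts the keys, and a single reducer writes them out in order. The single-reducer simplification is illustrative; a global sort across many reducers needs a range partitioner such as TotalOrderPartitioner.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NumberSort {

        // Mapper: the number itself becomes the key; the framework
        // sorts keys during the shuffle between map and reduce.
        public static class SortMapper
                extends Mapper<LongWritable, Text, LongWritable, NullWritable> {
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString().trim();
                if (line.isEmpty()) {
                    return;
                }
                context.write(new LongWritable(Long.parseLong(line)), NullWritable.get());
            }
        }

        // Reducer: keys arrive in sorted order; write one line per occurrence.
        public static class SortReducer
                extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
            @Override
            public void reduce(LongWritable key, Iterable<NullWritable> values, Context context)
                    throws IOException, InterruptedException {
                for (NullWritable v : values) { // preserve duplicates
                    context.write(key, NullWritable.get());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "number sort");
            job.setJarByClass(NumberSort.class);
            job.setMapperClass(SortMapper.class);
            job.setReducerClass(SortReducer.class);
            job.setNumReduceTasks(1); // one reducer => one globally sorted output file
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }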

Work-Through Case: Solution and Programming

Introduction to the Advanced Hadoop API

Advanced APIs with a case-based example to work through (a partitioner sketch follows the list):

    Distributed cache

    Input Formats and Output Formats

    Writable and WritableComparable

    Partitioners and Counters
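
A sketch of one item from this list, a custom Partitioner that routes keys to reducers by first letter so each reducer's output covers a contiguous alphabetical range; the class name is illustrative, and it would be wired in with job.setPartitionerClass(FirstLetterPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Maps 'a'..'z' keys onto the available reducers in alphabetical ranges.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String s = key.toString();
            if (s.isEmpty()) {
                return 0;
            }
            char first = Character.toLowerCase(s.charAt(0));
            if (first < 'a' || first > 'z') {
                return 0; // non-alphabetic keys all go to the first reducer
            }
            return (first - 'a') * numPartitions / 26;
        }
    }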

Day 3: Introduction to Pig

Experience Sharing: Typical MapReduce Applications in the Internet Industry

         Validate the MapReduce application scenarios shared by team members (the Day 1 carry-over exercise)

Introduction to Pig

Working with Pig

Apache access-log analysis case (see the sketch below)
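
A sketch of how such a case might be driven from Java through the PigServer API (local mode for the workshop); the file name, schema, and queries are illustrative assumptions, not the actual case material.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class LogAnalysis {
        public static void main(String[] args) throws Exception {
            // Local mode for experimentation; ExecType.MAPREDUCE runs on the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // The first space-separated field of an Apache access log is the
            // client IP; fields beyond the declared schema are dropped.
            pig.registerQuery("logs = LOAD 'access_log' USING PigStorage(' ') AS (ip:chararray);");
            pig.registerQuery("grouped = GROUP logs BY ip;");
            pig.registerQuery("hits = FOREACH grouped GENERATE group AS ip, COUNT(logs) AS n;");

            // Writes the (ip, count) pairs to the hits_by_ip output directory.
            pig.store("hits", "hits_by_ip");
        }
    }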

Introduction to Pig UDFs

Writing UDFs - user-defined functions (see the sketch below)
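
Pig UDFs are ordinarily Java classes extending EvalFunc; a minimal sketch, where the function name and behavior are hypothetical:

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Upper-cases a single chararray argument; returns null for empty input,
    // which Pig treats as "no result" for that row.
    public class ToUpper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

Packaged into a jar, the function is made available to a script with REGISTER and invoked by its fully qualified class name.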

Reporting requirements from a Hadoop deployment

How to create and extend Piggybank (the community UDF repository)

Best Practices

Dos and Don'ts

Good Cases and Bad Cases

Best Practices for Data Processing Pipelines

Interactive Q&A and experience sharing

Prerequisites for the Training:

Familiarity with working in a Linux environment and with Java programming

Experience with at least one full life cycle of a data integration and/or data analysis project