Course Outline

Section 1: Introduction to Hadoop

  • History and core concepts of Hadoop
  • Overview of the Hadoop ecosystem
  • Different distributions available
  • High-level architecture
  • Common misconceptions about Hadoop
  • Challenges associated with Hadoop
  • Hardware and software requirements
  • Lab: First look at Hadoop

Section 2: HDFS

  • Design and architecture
  • Key concepts (horizontal scaling, replication, data locality, rack awareness)
  • Daemons: NameNode, Secondary NameNode, DataNode
  • Communication mechanisms and heartbeats
  • Data integrity management
  • Read and write paths
  • Namenode High Availability (HA) and Federation
  • Labs: Interacting with HDFS
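
The HDFS concepts above (fixed-size blocks, replication, rack awareness) can be illustrated in plain Java. This is a simplified model of the default behavior, not Hadoop API code; the rack names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Framework-free sketch of HDFS block math and replica placement.
public class HdfsPlacementSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default HDFS block size (128 MiB)
    static final int REPLICATION = 3;                  // default replication factor

    // Number of blocks needed to store a file of the given size (ceiling division).
    public static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Simplified rack-aware placement: one replica on the writer's rack,
    // the other two together on a single remote rack.
    public static List<String> placeReplicas(String localRack, List<String> otherRacks) {
        List<String> placement = new ArrayList<>();
        placement.add(localRack);
        String remote = otherRacks.get(0);
        placement.add(remote);
        placement.add(remote);
        return placement;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println("1 GiB file -> " + blockCount(oneGiB) + " blocks"); // 8 blocks
        System.out.println("replicas on racks: "
            + placeReplicas("rack1", List.of("rack2", "rack3")));
    }
}
```

Placing two replicas on one remote rack (rather than three racks) is the trade-off HDFS makes between write bandwidth and fault tolerance: losing any single rack still leaves a live copy.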

Section 3: MapReduce

  • Core concepts and architecture
  • Daemons (MRv1): JobTracker and TaskTracker
  • Anatomy of a job: driver, mapper, shuffle/sort, and reducer
  • MapReduce Version 1 and Version 2 (YARN)
  • Internals of MapReduce
  • Introduction to writing MapReduce programs in Java
  • Labs: Running a sample MapReduce program
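
The map, shuffle/sort, and reduce phases listed above can be sketched in plain Java using word count, the canonical MapReduce example. This shows only the data flow, not the actual Hadoop Java API used in the labs.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Framework-free sketch of the MapReduce data flow for word count.
public class MapReduceSketch {

    // Map phase: each input line is split into (word, 1) pairs.
    // Shuffle/sort phase: pairs are grouped by key in sorted order (TreeMap).
    // Reduce phase: the values for each key are summed.
    public static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String line : lines) {                       // one mapper input record
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) continue;
                grouped.merge(word, 1, Integer::sum);     // group by key, sum values
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            wordCount(List.of("hadoop stores data", "hadoop processes data"));
        System.out.println(counts); // {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

In real MapReduce the mapper and reducer run as separate tasks on different nodes, and the shuffle moves intermediate pairs across the network; here all three phases collapse into one loop purely for illustration.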

Section 4: Pig

  • Pig versus Java MapReduce
  • Workflow of a Pig job
  • The Pig Latin programming language
  • ETL processes with Pig
  • Transformations and Joins
  • User Defined Functions (UDFs)
  • Labs: Writing Pig scripts to analyze data
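
To see why "Pig versus Java MapReduce" matters, compare a one-line Pig Latin aggregation with a rough plain-Java equivalent of the same grouping step (relation and field names are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Rough plain-Java equivalent of a Pig Latin group-and-count pipeline:
//   grouped = GROUP visits BY url;
//   counts  = FOREACH grouped GENERATE group, COUNT(visits);
public class PigVsJavaSketch {

    // Count how many times each URL appears in a list of visit records.
    public static Map<String, Long> countByUrl(List<String> urls) {
        Map<String, Long> counts = new TreeMap<>();
        for (String url : urls) {
            counts.merge(url, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByUrl(List.of("/home", "/docs", "/home")));
        // {/docs=1, /home=2}
    }
}
```

A full Java MapReduce version would additionally need a mapper class, a reducer class, and driver configuration; Pig expresses the same transformation in two lines, which is its main selling point for ETL work.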

Section 5: Hive

  • Architecture and design principles
  • Data types supported
  • SQL capabilities within Hive
  • Creating Hive tables and performing queries
  • Partitioning data
  • Performing joins
  • Text processing techniques
  • Labs: Various practical exercises on processing data with Hive
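
The partitioning topic above rests on one idea: a Hive partition is a directory under the table's location, so a WHERE clause on the partition column can skip whole directories. A plain-Java sketch of that layout (table and column names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how Hive partitioning maps to directory layout, and why filtering
// on the partition column prunes entire directories before any data is read.
public class HivePartitionSketch {

    // A partition like dt='2024-01-01' is stored as a directory
    // <warehouse>/<table>/dt=<value> under the table's location.
    public static String partitionPath(String table, String dt) {
        return "/user/hive/warehouse/" + table + "/dt=" + dt;
    }

    // Partition pruning: keep only partitions matching the predicate.
    public static List<String> prune(List<String> partitionValues, String wantedDt) {
        List<String> kept = new ArrayList<>();
        for (String dt : partitionValues) {
            if (dt.equals(wantedDt)) {
                kept.add(dt);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(partitionPath("logs", "2024-01-01"));
        System.out.println(prune(List.of("2024-01-01", "2024-01-02"), "2024-01-02"));
    }
}
```

This is why partitioning on a frequently filtered, low-cardinality column (such as a date) is the standard first optimization covered in the Hive labs.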

Section 6: HBase

  • Core concepts and architecture
  • HBase versus RDBMS versus Cassandra
  • HBase Java API
  • Handling time-series data in HBase
  • Schema design strategies
  • Labs: Interacting with HBase via the shell; programming with the HBase Java API; Schema design exercise
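
Two of the topics above (time-series data and schema design) hinge on the fact that HBase stores rows sorted by row key. A common design trick is to embed a reversed timestamp in the key so the newest rows sort first. The sketch below models that in plain Java with a sorted map; it is not HBase client API code, and the key layout is one hypothetical choice.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual model: an HBase table behaves like a map sorted by row key.
// Reversing the timestamp inside the key makes scans return newest-first.
public class HBaseRowKeySketch {

    // Reverse a timestamp so larger (newer) times produce smaller values.
    public static long reverseTimestamp(long millis) {
        return Long.MAX_VALUE - millis;
    }

    // Build a row key like "<sensorId>#<reversedTs>", zero-padded so that
    // lexicographic order matches numeric order.
    public static String rowKey(String sensorId, long millis) {
        return String.format("%s#%019d", sensorId, reverseTimestamp(millis));
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>(); // stand-in for an HBase table
        table.put(rowKey("sensor1", 1000L), "older reading");
        table.put(rowKey("sensor1", 2000L), "newer reading");
        // A scan from the start of sensor1's range hits the newest row first.
        System.out.println(table.firstEntry().getValue()); // newer reading
    }
}
```

Prefixing the key with the sensor ID also keeps each series contiguous on disk, which is the schema-design angle: row-key layout determines both scan order and data distribution across regions.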

Requirements

  • Proficiency in the Java programming language (as most practical exercises are conducted in Java)
  • Familiarity with the Linux environment (ability to navigate the Linux command line and edit files using vi or nano)

Lab environment

No Installation Required: Students do not need to install Hadoop software on their local machines. A fully operational Hadoop cluster will be provided for use.

Participants will need to have access to the following tools:

  • An SSH client (Linux and Mac systems come with built-in SSH clients; for Windows, PuTTY is recommended)
  • A web browser to access the cluster (Firefox is recommended)

Duration

28 hours
