Course Outline

1: HDFS (17%)

  • Outline the role of the HDFS daemons
  • Explain the standard operation of an Apache Hadoop cluster, covering both data storage and data processing
  • Identify the features of contemporary computing systems that drive the need for a system like Apache Hadoop
  • Classify the primary objectives behind HDFS design
  • Given a specific scenario, determine the suitable use case for HDFS Federation
  • Identify the components and daemons within an HDFS HA-Quorum cluster
  • Analyze the significance of HDFS security mechanisms (Kerberos)
  • Select the optimal data serialization method for a given scenario
  • Describe the pathways for file read and write operations
  • Identify the necessary commands for manipulating files within the Hadoop File System Shell

2: YARN and MapReduce version 2 (MRv2) (17%)

  • Comprehend the impact of upgrading a cluster from Hadoop 1 to Hadoop 2 on cluster configurations
  • Understand the deployment of MapReduce v2 (MRv2 / YARN), including all associated YARN daemons
  • Grasp the core design strategy for MapReduce v2 (MRv2)
  • Determine how YARN manages resource allocations
  • Identify the workflow of a MapReduce job executing on YARN
  • Determine the file modifications required to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN

3: Hadoop Cluster Planning (16%)

  • Highlight key considerations when selecting hardware and operating systems to host an Apache Hadoop cluster
  • Analyze options for selecting an appropriate OS
  • Understand kernel tuning and disk swapping mechanisms
  • Given a scenario and workload pattern, identify the hardware configuration best suited to the context
  • Given a scenario, determine the ecosystem components required for the cluster to meet SLA requirements
  • Cluster sizing: Given a scenario and execution frequency, identify workload specifics, including CPU, memory, storage, and disk I/O
  • Disk sizing and configuration: JBOD versus RAID, SANs, virtualization, and disk sizing requirements within a cluster
  • Network topologies: Understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario

4: Hadoop Cluster Installation and Administration (25%)

  • Given a scenario, identify how the cluster handles disk and machine failures
  • Analyze logging configuration and the format of logging configuration files
  • Understand the basics of Hadoop metrics and cluster health monitoring
  • Identify the function and purpose of available tools for cluster monitoring
  • Install all ecosystem components in CDH 5, including (but not limited to): Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig
  • Identify the function and purpose of available tools for managing the Apache Hadoop file system

5: Resource Management (10%)

  • Understand the overall design goals of each Hadoop scheduler
  • Given a scenario, determine how the FIFO Scheduler allocates cluster resources
  • Given a scenario, determine how the Fair Scheduler allocates cluster resources under YARN
  • Given a scenario, determine how the Capacity Scheduler allocates cluster resources

6: Monitoring and Logging (15%)

  • Understand the functions and features of Hadoop’s metric collection capabilities
  • Analyze the NameNode and JobTracker Web UIs
  • Understand how to monitor cluster daemons
  • Identify and monitor CPU usage on master nodes
  • Describe how to monitor swap and memory allocation on all nodes
  • Identify how to view and manage Hadoop’s log files
  • Interpret a log file

Requirements

  • Fundamental skills in Linux administration
  • Basic programming proficiency

Duration

  • 35 hours
