Course Outline

  • Introduction
    • History and core concepts of Hadoop
    • Ecosystem overview
    • Distributions
    • High-level architecture
    • Common myths about Hadoop
    • Challenges associated with Hadoop (hardware and software)
    • Labs: Discussing Big Data projects and challenges
  • Planning and installation
    • Selecting software and Hadoop distributions
    • Sizing the cluster and planning for future growth
    • Selecting appropriate hardware and network configurations
    • Rack topology
    • Installation procedures
    • Multi-tenancy
    • Directory structure and logs
    • Benchmarking
    • Labs: Installing the cluster and running performance benchmarks
  • HDFS operations
    • Core concepts (horizontal scaling, replication, data locality, rack awareness)
    • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring
    • Command-line and browser-based administration
    • Adding storage and replacing defective drives
    • Labs: Getting familiar with the HDFS command line
  • Data ingestion
    • Using Flume for logs and other data ingestion into HDFS
    • Using Sqoop to import data from SQL databases to HDFS, and to export back to SQL
    • Hadoop data warehousing with Hive
    • Copying data between clusters (distcp)
    • Utilising S3 as a complement to HDFS
    • Best practices and architectures for data ingestion
    • Labs: Setting up and using Flume and Sqoop
  • MapReduce operations and administration
    • Parallel computing before MapReduce: comparing HPC and Hadoop administration
    • MapReduce cluster loads
    • Nodes and daemons (JobTracker, TaskTracker)
    • Walkthrough of the MapReduce UI
    • MapReduce configuration
    • Job configuration
    • Optimising MapReduce
    • Ensuring robustness in MR: guidance for programmers
    • Labs: Running MapReduce examples
  • YARN: New architecture and capabilities
    • YARN design goals and implementation architecture
    • New components: ResourceManager, NodeManager, ApplicationMaster
    • Installing YARN
    • Job scheduling under YARN
    • Labs: Investigating job scheduling
  • Advanced topics
    • Hardware monitoring
    • Cluster monitoring
    • Adding and removing servers, upgrading Hadoop
    • Backup, recovery, and business continuity planning
    • Oozie job workflows
    • Hadoop high availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: Setting up monitoring
  • Optional tracks
    • Cloudera Manager: installation and usage for cluster administration, monitoring, and routine tasks. All exercises and labs in this track run in the Cloudera distribution environment (CDH5).
    • Ambari: installation and usage for cluster administration, monitoring, and routine tasks. All exercises and labs in this track run in the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).

Requirements

  • Familiarity with basic Linux system administration
  • Foundational scripting skills

Prior knowledge of Hadoop and distributed computing is not required; these concepts are introduced and explained during the course.

Lab environment

Zero install: participants do not need to install Hadoop software on their own machines. A working Hadoop cluster is provided for all students.

Participants will require the following:

  • An SSH client (Linux and Mac systems come with SSH clients pre-installed; for Windows, PuTTY is recommended)
  • A web browser to access the cluster. We recommend Firefox with the FoxyProxy extension installed.

Duration

21 hours
