Apache HDFS 2.3.0
Apache MapReduce (for MR1) 1.2.1
Apache YARN (for MR2) 2.3.0
Apache Hive 0.12.0
Cloudera Impala 2.0.0
Apache HBase 0.98.0
Apache Accumulo 1.6.0
Apache Solr 4.4.0
Apache Oozie 4.0.0
Cloudera Hue 3.5.0
Apache ZooKeeper 3.4.5
Apache Flume 1.5.0
Apache Sqoop 1.4.4
Apache Sentry (Incubating) 1.4.0-incubating
In short:
- Hadoop Common: A set of shared libraries
- HDFS: The Hadoop filesystem
- MapReduce: Parallel computation framework
- ZooKeeper: Configuration management and coordination
- HBase: Column-oriented database on HDFS
- Hive: Data warehouse on HDFS with SQL-like access
- Pig: Higher-level programming language for Hadoop computations
- Oozie: Orchestration and workflow management
- Mahout: A library of machine learning and data mining algorithms
- Flume: Collection and import of log and event data
- Sqoop: Imports data from relational databases
- Apache HDFS
- The Hadoop Distributed File System, or HDFS, is often considered the foundational component for the rest of the Hadoop ecosystem. HDFS is the storage layer for Hadoop and provides the ability to store massive amounts of data while growing storage capacity and aggregate bandwidth linearly. HDFS is a logical filesystem that spans many servers, each with multiple hard drives. This is important to understand from a security perspective because a given file in HDFS can span many or all servers in the Hadoop cluster, which means that client interactions with a given file might require communication with every node in the cluster. This is made possible by a key implementation feature of HDFS that breaks files up into blocks; each block of data for a given file can be stored on any physical drive on any node in the cluster. The important security takeaway is that all files in HDFS are broken up into blocks, and clients using HDFS communicate over the network with servers across the Hadoop cluster when reading and writing files. A minimal client sketch follows the component list below.
- NameNode
The NameNode is responsible for keeping track of all the metadata related to the files in HDFS, such as filenames, block locations, file permissions, and replication. From a security perspective, it is important to know that HDFS clients, such as those reading or writing files, always communicate with the NameNode.
- DataNode
The DataNode is responsible for the actual storage and retrieval of data blocks in HDFS. Clients reading a given file are told by the NameNode which DataNode in the cluster has the requested block of data. When writing data to HDFS, clients write a block of data to a DataNode determined by the NameNode. From there, that DataNode sets up a write pipeline to other DataNodes to complete the write based on the desired replication factor.
- JournalNode
The JournalNode is a special type of component for HDFS. When HDFS is configured for high availability (HA), JournalNodes take over the NameNode responsibility for writing HDFS metadata information. Clusters typically have an odd number of JournalNodes (usually three or five) to ensure a majority. For example, if a new file is written to HDFS, the metadata about the file is written to every JournalNode. When a majority of the JournalNodes successfully write this information, the change is considered durable.
- HttpFS
HttpFS is a component of HDFS that provides a proxy for clients to the NameNode and DataNodes. This proxy exposes a REST API that allows clients to use HDFS without direct connectivity to any of the other HDFS components. HttpFS is a key component in certain cluster architectures.
- NFS Gateway
The NFS gateway, as the name implies, allows clients to use HDFS as an NFS-mounted filesystem. The NFS gateway is a daemon process that handles the NFS protocol communication between clients and the underlying HDFS cluster. Much like HttpFS, the NFS gateway sits between HDFS and clients and therefore affords a security boundary that can be useful in certain cluster architectures.
- KMS
The Hadoop Key Management Server, or KMS, plays an important role in HDFS transparent encryption at rest. It acts as the intermediary between HDFS clients, the NameNode, and a key server, handling encryption operations such as decrypting data encryption keys and managing encryption zone keys.
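To make the client interaction described above concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The NameNode address, user path, and file contents are hypothetical placeholders; in practice the filesystem URI comes from the cluster's core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode URI; normally loaded from core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // The client contacts the NameNode for metadata, then streams blocks
        // directly to and from DataNodes across the cluster.
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/alice/example.txt");

        try (FSDataOutputStream out = fs.create(path, true)) {
            // Blocks are pipelined to DataNodes according to the replication factor.
            out.writeUTF("hello hdfs");
        }

        try (FSDataInputStream in = fs.open(path)) {
            // Block locations are resolved via the NameNode before the read.
            System.out.println(in.readUTF());
        }

        fs.close();

        // The equivalent read through the HttpFS proxy is a REST call of this general form
        // (assuming typical defaults):
        // GET http://httpfs.example.com:14000/webhdfs/v1/user/alice/example.txt?op=OPEN&user.name=alice
    }
}
```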
- Apache YARN
- Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications.
- Other processing frameworks and applications, such as Impala and Spark, use YARN as the resource management framework. While YARN provides a more general resource management framework, MapReduce is still the canonical application that runs on it. MapReduce that runs on YARN is considered version 2, or MR2 for short.
- Apache MapReduce
- MapReduce is the processing counterpart to HDFS and provides the most basic mechanism for batch processing of data. When MapReduce is executed on top of YARN, it is often called MapReduce2, or MR2. This distinguishes the YARN-based version of MapReduce from the standalone MapReduce framework, which has been retroactively named MR1. MapReduce jobs are submitted by clients to the MapReduce framework and operate over a subset of data in HDFS, usually a specified directory. MapReduce itself is a programming paradigm that allows chunks of data, or blocks in the case of HDFS, to be processed by multiple servers in parallel, independent of one another. While a Hadoop developer needs to know the intricacies of how MapReduce works, a security architect largely does not. What a security architect needs to know is that clients submit their jobs to the MapReduce framework and, from that point on, the MapReduce framework handles the distribution and execution of the client code across the cluster. Clients do not interact with any of the nodes in the cluster to make their job run. Jobs themselves require some number of tasks to be run to complete the work; each task is started on a given node by the MapReduce framework's scheduling algorithm. A minimal job-submission sketch follows below.
- A key point about MapReduce is that other Hadoop ecosystem components are frameworks and libraries on top of MapReduce: MapReduce handles the actual processing of data, while these frameworks and libraries abstract the MapReduce job execution from clients. Hive, Pig, and Sqoop are examples of components that use MapReduce in this fashion.
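As a concrete illustration of the submission flow (the sketch referenced above), the following is a minimal MR2 job driver using the standard MapReduce Java API together with the word-count mapper and reducer classes that ship with Hadoop. The input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Library mapper/reducer bundled with Hadoop: tokenize the input text, then sum the counts.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The job operates over a directory of data in HDFS (hypothetical paths).
        FileInputFormat.addInputPath(job, new Path("/data/raw/logs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount"));

        // Submission hands the job to the framework (the ResourceManager in MR2);
        // task scheduling and execution across the cluster happen without further client involvement.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```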
- Apache Hive
- The Apache Hive project was started by Facebook. The company saw the utility of MapReduce to process data but found limitations in adoption of the framework due to the lack of Java programming skills in its analyst communities. Most of Facebook’s analysts did have SQL skills, so the Hive project was started to serve as a SQL abstraction layer that uses MapReduce as the execution engine.
- Cloudera Impala
- Cloudera Impala is a massively parallel processing (MPP) framework that is purpose-built for analytic SQL. Impala reads data from HDFS and utilizes the Hive metastore for interpreting data structures and formats.
- New users to the Hadoop ecosystem often ask what the difference is between Hive and Impala, because both offer SQL access to data in HDFS. Hive was created to allow users who are familiar with SQL to process data in HDFS without needing to know anything about MapReduce. It was designed to abstract the innards of MapReduce and make the data in HDFS more accessible. Hive is largely used for batch access and ETL work. Impala, on the other hand, was designed from the ground up to be a fast analytic processing engine supporting ad hoc queries and business intelligence (BI) tools. There is utility in both Hive and Impala, and they should be treated as complementary components.
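As a rough sketch of how clients typically reach either engine, the example below issues a query through the HiveServer2 JDBC driver. The host names, port, database, table, and user are hypothetical; 10000 is only the common HiveServer2 default, and Impala's HiveServer2-compatible endpoint usually listens on a different port (often 21050), with connection parameters that depend on the cluster's security configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoopSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver, shipped with Hive.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint. For Impala, a similar URL pointing at an
        // impalad's HiveServer2-compatible port is commonly used instead.
        String url = "jdbc:hive2://hiveserver2.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "alice", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table; Hive compiles this into MapReduce jobs, while
             // Impala executes it directly with its own MPP engine.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```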
- Apache Sentry
Sentry is the component that provides fine-grained role-based access control (RBAC) to several of the other ecosystem components, such as Hive and Impala. While individual components may have their own authorization mechanisms, Sentry provides unified authorization that allows centralized policy enforcement across components. It is a critical component of Hadoop security. A short policy-definition sketch follows the component descriptions below.
- Sentry server
The Sentry server is a daemon process that facilitates policy lookups made by other Hadoop ecosystem components. Client components of Sentry are configured to delegate authorization decisions based on the policies put in place by Sentry.
- Policy database
The Sentry policy database is the location where all authorization policies are stored. The Sentry server uses the policy database to determine whether a user is allowed to perform a given action. Specifically, the Sentry server looks for a matching policy that grants access to a resource for the user. In earlier versions of Sentry, the policy database was a text file that contained all of the policies.
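With Sentry enabled for Hive, policies are typically defined with SQL GRANT/REVOKE statements issued by a user in Sentry's admin group. The sketch below runs such statements through the HiveServer2 JDBC driver; the role, group, database, and endpoint names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SentryPolicySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint; the connecting user must belong to a Sentry admin group.
        String url = "jdbc:hive2://hiveserver2.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hiveadmin", "");
             Statement stmt = conn.createStatement()) {
            // Create a role, tie it to a group, and grant the role read access to a database.
            stmt.execute("CREATE ROLE analyst");
            stmt.execute("GRANT ROLE analyst TO GROUP analysts");
            stmt.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst");
        }
    }
}
```

Members of the analysts group can then run SELECT queries against the sales database through Hive or Impala, with the Sentry server consulted for the authorization decision.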
- Apache HBase
- HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS, providing BigTable-like capabilities for Hadoop. HBase features compression, in-memory operation, and Bloom filters on a per-column basis. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and can be accessed through the Java API as well as through REST, Avro, or Thrift gateway APIs. HBase is a column-oriented key-value data store and typically utilizes HDFS as the underlying storage layer for its data.
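A minimal sketch of writing and reading a cell with the HBase 0.98-era Java client API. The table name, column family, qualifier, and row key are hypothetical, and the table is assumed to already exist with that column family.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical, pre-existing table with column family "cf".
        HTable table = new HTable(conf, "metrics");

        // Write one cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("host1-2015-01-01"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("cpu"), Bytes.toBytes("42"));
        table.put(put);

        // Read it back.
        Get get = new Get(Bytes.toBytes("host1-2015-01-01"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("cpu"))));

        table.close();
    }
}
```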
- Apache Accumulo
- Apache Accumulo is a sorted and distributed key/value store designed to be a robust, scalable, high-performance storage and retrieval system. Like HBase, Accumulo was originally based on the Google BigTable design, but was built on top of the Apache Hadoop ecosystem of projects (in particular, HDFS, ZooKeeper, and Apache Thrift). Accumulo uses roughly the same data model as HBase.
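For comparison with HBase, here is a minimal sketch of a write and scan using the Accumulo 1.6 client API. The instance name, ZooKeeper quorum, credentials, and table are hypothetical, and the table is assumed to already exist.

```java
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class AccumuloClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical instance name and ZooKeeper quorum.
        Connector conn = new ZooKeeperInstance("accumulo", "zk1.example.com:2181")
                .getConnector("alice", new PasswordToken("secret"));

        // Write one entry: row, column family, column qualifier, value.
        BatchWriter writer = conn.createBatchWriter("metrics", new BatchWriterConfig());
        Mutation mutation = new Mutation("host1-2015-01-01");
        mutation.put("cf", "cpu", "42");
        writer.addMutation(mutation);
        writer.close();

        // Scan it back; the Authorizations passed here control which cell visibilities are readable.
        Scanner scanner = conn.createScanner("metrics", new Authorizations());
        for (Entry<Key, Value> entry : scanner) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```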
- Apache Solr
- The Apache Solr project, and specifically SolrCloud, enables the search and retrieval of documents that are part of a larger collection that has been sharded across multiple physical servers. Search is one of the canonical use cases for big data and is one of the most common utilities used by anyone accessing the Internet. Solr is built on top of the Apache Lucene project, which actually handles the bulk of the indexing and search capabilities. Solr expands on these capabilities by providing enterprise search features such as faceted navigation, caching, hit highlighting, and an administration interface.
Solr has a single component, the server. There can be many Solr servers in a single deployment, which scale out linearly through the sharding provided by SolrCloud. SolrCloud also provides replication features to accommodate failures in a distributed environment.
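A minimal sketch of indexing and querying a document with SolrJ, the Solr Java client, using the 4.x-era HttpSolrServer API. The Solr URL, collection name, and fields are hypothetical, and the collection's schema is assumed to define those fields.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection; SolrCloud handles sharding behind it.
        HttpSolrServer solr = new HttpSolrServer("http://solr.example.com:8983/solr/documents");

        // Index a document (fields assumed to exist in the collection's schema).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Hadoop Security");
        solr.add(doc);
        solr.commit();

        // Query it back.
        QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
        for (SolrDocument result : response.getResults()) {
            System.out.println(result.getFieldValue("id") + ": " + result.getFieldValue("title"));
        }
        solr.shutdown();
    }
}
```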
- Apache Oozie
- Apache Oozie is a workflow management and orchestration system for Hadoop. It allows for setting up workflows that contain various actions, each of which can utilize a different component in the Hadoop ecosystem. For example, an Oozie workflow could start by executing a Sqoop import to move data into HDFS, then a Pig script to transform the data, followed by a Hive script to set up metadata structures. Oozie allows for more complex workflows, such as forks and joins that allow multiple steps to be executed in parallel, and other steps that rely on multiple steps to be completed before continuing. Oozie workflows can run on a repeatable schedule based on different types of input conditions such as running at a certain time or waiting until a certain path exists in HDFS.
Oozie consists of just a single server component, and this server is responsible for handling client workflow submissions, managing the execution of workflows, and reporting status.
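A minimal sketch of submitting a workflow with the Oozie Java client. The Oozie URL, HDFS paths, and user are hypothetical, and the workflow definition (workflow.xml) is assumed to already exist at the application path in HDFS.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server endpoint.
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

        // Job properties; the application path points at a directory in HDFS
        // containing a workflow.xml definition.
        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode.example.com:8020/user/alice/workflows/etl");
        props.setProperty("user.name", "alice");

        // Submit and start the workflow, then check its status.
        String jobId = oozie.run(props);
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println(jobId + " is " + job.getStatus());
    }
}
```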
- Apache ZooKeeper
- Apache ZooKeeper is a distributed coordination service that allows for distributed systems to store and read small amounts of data in a synchronized way. It is often used for storing common configuration information. Additionally, ZooKeeper is heavily used in the Hadoop ecosystem for synchronizing high availability (HA) services, such as NameNode HA and ResourceManager HA. ZooKeeper itself is a distributed system that relies on an odd number of servers called a ZooKeeper ensemble to reach a quorum, or majority, to acknowledge a given transaction. ZooKeeper has only one component, the ZooKeeper server.
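A minimal sketch of storing and reading a small piece of configuration with the ZooKeeper Java client. The ensemble address, znode path, and data are hypothetical; the open ACL is used only to keep the example short.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address, a 30-second session timeout, and a no-op watcher.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> { });

        // Store a small piece of configuration in a persistent znode.
        // OPEN_ACL_UNSAFE grants everyone access and is used here only for brevity.
        zk.create("/app-config", "batch.size=128".getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client connected to the ensemble sees the same synchronized view of this data.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data, "UTF-8"));

        zk.close();
    }
}
```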
- Apache Flume
- Apache Flume is an event-based ingestion tool that is used primarily for ingestion into Hadoop, but can actually be used completely independent of it. Flume, as the name would imply, was initially created for the purpose of ingesting log events into HDFS. The Flume architecture consists of three main pieces: sources, sinks, and channels. A Flume source defines how data is to be read from the upstream provider. This would include things like a syslog server, a JMS queue, or even polling a Linux directory. A Flume sink defines how data should be written downstream. Common Flume sinks include an HDFS sink and an HBase sink. Lastly, a Flume channel defines how data is stored between the source and sink. The two primary Flume channels are the memory channel and file channel. The memory channel affords speed at the cost of reliability, and the file channel provides reliability at the cost of speed. Flume consists of a single component, a Flume agent. Agents contain the code for sources, sinks, and channels. An important part of the Flume architecture is that Flume agents can be connected to each other, where the sink of one agent connects to the source of another.
- Apache Sqoop
Apache Sqoop provides the ability to do batch imports and exports of data to and from a traditional RDBMS, as well as other data sources such as FTP servers. Sqoop itself submits map-only MapReduce jobs that launch tasks to interact with the RDBMS in a parallel fashion. Sqoop is used both as an easy mechanism to initially seed a Hadoop cluster with data and as a tool for regular ingestion and extraction routines. Sqoop1 is a set of client libraries that are invoked from the command line using the sqoop binary.
- Cloudera Hue
- Cloudera Hue is a web application that exposes many of the Hadoop ecosystem components in a user-friendly way. Hue allows for easy access into the Hadoop cluster without requiring users to be familiar with Linux or the various command-line interfaces the components have. Hue has a number of different security controls available. Hue is composed of the following components:
- Hue server
- This is the main component of Hue. It is effectively a web server that serves web content to users. Users are authenticated at first logon and from there, actions performed by the end user are actually done by Hue itself on behalf of the user. This concept is known as impersonation.
- Kerberos Ticket Renewer
- As the name implies, this component is responsible for periodically renewing the Kerberos ticket-granting ticket (TGT), which Hue uses to interact with the Hadoop cluster when the cluster has Kerberos enabled.
http://wikibon.org/wiki/v/HBase,_Sqoop,_Flume_and_More:_Apache_Hadoop_Defined
Literature:
- Practical Hadoop Security by Bhushan Lakhe
- Securing Hadoop by Sudheesh Narayanan
- Hadoop Security by Ben Spivey and Joey Echeverria
- Big Data Forensics – Learning Hadoop Investigations by Joe Sremack