Most of the websites we use, we use under the impression that we have some measure of safety. We hope that a password-protected login screen will keep the bad guys out. Unfortunately, this isn’t entirely the case. One of the many ways hackers can access your account is called “session ID hijacking”. Essentially, when you log into your Facebook or eBay account, the server spits out a random combination of characters called your “session ID”, the point of which is to differentiate you from other users, and the page you’re currently on from other pages. It’s the computer version of “Welcome, Mr Smith, enjoy your stay.” If a hacker can get their hands on the right session ID, they can bypass the entire verification process, hop straight to “Welcome, Mr Smith”, and access all of your data with relative ease. Each session ID is supposed to be randomised so that no one can guess it. This is where Burp’s Sequencer tool comes in.

The Sequencer is used to test the overall “randomness” of a variable that an application’s server provides. Not only that, but it also runs a battery of statistical tests to check how easily the variable can be guessed. It’s most commonly used on session IDs, because these are usually the most important values to keep random on a website; however, other tokens, such as cookies, may be just as susceptible.
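Sequencer’s full battery of FIPS-style tests can’t be reproduced in a few lines, but the core idea, measuring how unpredictable a token scheme is, can be sketched. Here is a toy character-level entropy estimate (not Burp’s actual algorithm), with made-up example tokens:

```python
import math
from collections import Counter

def entropy_bits_per_char(tokens):
    """Estimate Shannon entropy (bits per character) across a sample of tokens.

    A toy version of the character-frequency analysis Sequencer performs;
    the real tool runs many more tests (bit-level, correlation, etc.)."""
    counts = Counter("".join(tokens))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A predictable, sequential "session ID" scheme scores far lower than
# tokens drawn from a wide character set.
weak = ["user0001", "user0002", "user0003", "user0004"]
strong = ["f3Zq9xKm", "Qa7Lp0Ws", "mN2vB8tR", "zX5cD1eY"]
print(entropy_bits_per_char(weak) < entropy_bits_per_char(strong))  # True
```

A low score here would mean an attacker needs to try far fewer guesses to land on a valid session ID.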

So, the first step to using the Sequencer is to find the page you want to test, either through the Spider or by clicking around manually. Send a request to the page and get a response back. On a login screen, this means entering any username-password combo just to get an answer from the site. Right-click the response and press “Send to Sequencer”.
Go to the ‘Sequencer’ tab. Don’t bother fiddling with all of the different options and menus; what you want to direct your attention to is the “Token Location Within Response” section in the “Live Capture” subtab. Here is where you select what it is that you want to test for randomness. If Burp hasn’t already found it for you in the “Cookie” or “Form Field” drop-down boxes, you can select it manually by clicking “Configure”, highlighting the token in the response, and clicking “OK”.

Now click “Start Live Capture”. Burp will send requests to the server and record the tokens it gets back. It may be a little slow, but if you aren’t in a rush, wait it out until it has collected around twenty thousand tokens so the analysis has enough data to be meaningful. The Sequencer gives you lots of different analyses; you can look at the individual tests by clicking through the tabs, but Burp also gives you an overall summary on the first page. Take note that the Sequencer only gives you the information; it doesn’t actually tell you what to do with it.

Burp Suite is an incredibly powerful security tool, and part of what makes it that powerful is its relative simplicity. Its more powerful tools such as the Spider or Intruder are quite intuitive, and it’s filled with a load of smaller, simple tools that make a security analyst’s job much easier. These tools may be a little bit limited or one-sided in their design, but that just makes them better for the job they’re doing. Scissors are no use for cutting trees, but we don’t use them for that anyway. One of these tools is the Repeater.

The Repeater is used to manually change small bits of the requests you send to the web application you’re testing, without waiting for them to load through a browser. Say you have a login page that you’re testing for vulnerabilities. The Repeater lets you quickly make changes to the request, which is powerful if you know what you’re doing and what results you’re expecting. To use the Repeater, get Burp up and running, turn Intercept to ON, go to the web page you want to test (let’s assume it’s a login page), and simply enter any two username and password values. We’re expecting you to get these wrong; the point is for Burp to capture what the request you’re sending out looks like. And before we move on, please make sure that you’re either working with a local copy of a website that won’t affect the real thing, or with the explicit consent of the site owners; otherwise, all of this is illegal. Anyway, find the request you want in the Target tab and Site Map subtab, right-click it, and press ‘Send to Repeater’.

Now go to the Repeater tab and you should see two panes, one called Request and the other called Response. Request is what you edit, and Response is what the website spits back at you. From here you can change anything you want about the request in any form, from the raw data to hexadecimal values. Just press ‘Go’ and you’ll see how the website would react. With premium, you can even render what the page looks like. Notice that in your browser, the website hasn’t changed. From here it’s up to you and your prior HTML knowledge to start picking at the site.
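Under the hood, a Repeater edit is just rewriting part of a request before resending it. As a rough illustration (not Burp’s internals), here is how one form field in a request body could be swapped out with the Python standard library; the field names and payload are invented:

```python
from urllib.parse import parse_qs, urlencode

def tweak_request_body(raw_body, field, new_value):
    """Repeater-style edit: change one form field in an HTTP request body
    and return the re-encoded body, ready to be resent."""
    params = parse_qs(raw_body, keep_blank_values=True)
    params[field] = [new_value]
    return urlencode(params, doseq=True)

# Hypothetical captured login body; swap the password for a test payload.
body = "username=admin&password=wrongpass"
print(tweak_request_body(body, "password", "' OR '1'='1"))
```

Repeater lets you do exactly this kind of surgery in its UI, then shows you the server’s response immediately, without a browser in the loop.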

Following celebrity news is a lot like watching a bad horror movie. You’re constantly wondering why every decision they make is just so stupid. Whether we’re watching Friday the 13th or TMZ, we always end up yelling “No! Stop!” at our screens. We lift our chins up boldly and proclaim “I’d never do such a thing!”. That, or we shrug our shoulders and mumble “Can’t be helped” if something random and extraordinary happens to them. That’s pretty much what I did when, a month ago, a huge LinkedIn password dump led to hackers gaining access to thousands of Twitter accounts, including Mark Zuckerberg’s (not that he uses his much anyway).

What I’m saying is we think our passwords are very secure, or at least secure enough, until it’s too late. This particular hack happened because people tend to use the same password everywhere, or at the very least very similar ones. In Mark Zuckerberg’s case, I can only imagine his LinkedIn and Twitter passwords were both Faceb00kRul3z. This is not the only way to gain access to someone’s account, however. A very common way is a brute-force attack: getting a very powerful computer to try every possible combination of characters in the hope that one will be a match. Burp Suite is a very powerful tool for doing this; just remember to only use it with the consent of the site owner and without malice.

The first thing you’ll want to do is load up Burp Suite (assuming you have it set up already).

Then, go to the web application you want to break into. Click around on it, or use Burp Spider, until you have enough information on the site or have found the page you want to enter. As an example, I’ll use DVWA, which is a free open-source web app made specifically to have its vulnerabilities exploited.

What you want to do now is just enter anything into both fields and click login. The point right now is not to guess the password, but to show Burp what the response to your invalid input is. Now open your Burp window, open up the Target tab and the Site Map subtab, and find the page and request that your invalid login attempt is in. Right-click on the request and click ‘Send to Intruder’.

Now Burp Intruder can work with the web page. Go to the Intruder tab and the Positions subtab. You should see the request script with some bits bolded in. That’s Burp letting you know where it found a login textbox or a cookie that it thinks you can work with. The bolded pieces of text are what will be fuzzed, so use the ‘Clear’ button on the right to clear any you want to leave alone. Above all the code there’s a drop-down bar that asks you what attack type you want.

There are four attack types:
  • Sniper is used when you only have one piece of code you want to break into (called a position); it throws data at it (called a payload) one item at a time.
  • Battering Ram works with several positions and inserts the same payload into them all at once.
  • Pitchfork uses several payload sets and advances them in lockstep: with two positions and two payload sets, it enters the first payload from each set into its position, then the second payload from each set, and so on.
  • Cluster Bomb tries every combination: it holds the first payload in one position while running through every payload in the other, then moves on to the second payload, and so on until it has tried them all.
Cluster Bomb is what we want to use, since we don’t know which usernames work with which passwords, so select that.
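The difference between Pitchfork and Cluster Bomb is easiest to see as iteration order. A sketch with two tiny, made-up payload sets:

```python
from itertools import product

# Hypothetical payload sets; real wordlists would be far longer.
usernames = ["admin", "root"]
passwords = ["letmein", "hunter2"]

# Pitchfork: sets advance in lockstep, pairing up the i-th items.
pitchfork = list(zip(usernames, passwords))

# Cluster Bomb: every payload in every position, all combinations.
cluster_bomb = list(product(usernames, passwords))

print(pitchfork)          # [('admin', 'letmein'), ('root', 'hunter2')]
print(len(cluster_bomb))  # 4
```

For a login form where any username might pair with any password, only the Cluster Bomb ordering covers every possibility, at the cost of many more requests.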


Now go into the Payloads subtab. The Payload Options section is where you’ll enter the payloads that you want to be used. Either enter them by hand, copy and paste them, or, if you have the premium version, load them from the ‘Add from list’ drop-down box, where Burp has already prepared some for you. You can change which set you’re editing in the drop-down option in the Payload Sets section.

After you’ve got all of that done, you’re ready to fuzz. Just press ‘Start Attack’ in the top right corner of the window and your login attempts will show up on the screen. In DVWA, a status of 302 means your login was invalid, while a status of 200 means you broke in.
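If you were scripting this check yourself instead of reading Intruder’s results table, the triage would look something like this sketch. Note the 302/200 convention is specific to DVWA’s login behaviour; other apps signal failure differently (response length, error strings), so always calibrate against a known-bad attempt first:

```python
def classify_attempt(status_code):
    """Rough triage of a login attempt by HTTP status, per DVWA's behaviour."""
    if status_code == 302:
        return "invalid"       # redirected back to the login form
    if status_code == 200:
        return "possible hit"  # page served directly; inspect the response
    return "inspect manually"  # errors, lockouts, rate limiting, etc.

print(classify_attempt(302))  # invalid
print(classify_attempt(200))  # possible hit
```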

And that’s it, now you just wait and hope for the best. You may have noticed that most of the passwords are quite similar, which would make a malicious hacker’s job much easier. If you can, change your password to something a little more complex, you’ll save yourself a world of regret later.

Using Burp Suite efficiently means understanding your tools and capabilities, but it also means having a full scope of the web application you’re about to start testing. Let’s say your friend asks you for help changing their bike’s handlebars. You’d have to use a wrench to unscrew the old handlebars, replace them with new ones, and screw the new ones in. You could be a master at screwing bolts in and you could have a PhD in wrenches, but if you’ve never seen a bicycle in your life, and you don’t know where the handlebars are, you won’t be much help to your friend. The idea is the same (albeit a little less silly) with penetration testing. 

You’ll want to look at your web application the same way the guys in Ocean’s Eleven look at a casino. If you’ve never seen an Ocean’s movie, it’s about a rag-tag group of thieves who go around robbing high-profile locations. It’s all very elaborate and entertaining, but there are a couple of similarities. Before doing anything, the gang gets blueprints of the building they want to break into. Sometimes they build life-size replicas of the vaults they want to crack. They gather as much information about their target as they can before making a decision.

This is what you should do as well. When tasked with penetrating a website, check everything. Find places where a user can enter input, like text boxes or buttons. Look for any links that may lead to other websites. Check for files and forms. Get a feel for how the components of the web application interact with other web applications as well as each other.

If this seems long and tedious, that’s because it is. Nobody has the time or the patience to click and prod every nook and cranny, which is why Burp has a built-in function for it. It’s called Burp Spider and its job is to make yours a whole lot easier. It crawls your site and tells you about all of the different elements it has to offer. Finding and identifying vulnerabilities is still up to you, but the program really does take some weight off your shoulders. Fair warning, however: the Spider can miss things, which is why you should always double-check what it gives you to make sure you have everything you need.

Using Burp Spider is easy. First, open up Burp and go to the desired URL. Go to the ‘Target’ tab and the ‘Site map’ subtab. Right-click the URL and select “Add to Scope”.

This tells Burp what exactly it should be working with. Anything within the “scope” is data that can be scanned and penetrated, anything outside is fluff. This way you can have lots of tabs open and only crawl what you need to crawl. The next step is to right-click that same URL and select “Spider this Branch”. More files should show up on the right-hand side.
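Crawling at its simplest is “fetch a page, harvest its links, repeat for every in-scope link”. Here is a bare-bones sketch of the link-harvesting step using only the standard library; the HTML snippet is made up, and Burp’s Spider of course also records forms, parameters, and much more:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from a page: the first step of any crawl.
    A real spider would fetch each in-scope link and parse it in turn."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical fragment of a fetched page.
page = '<a href="/login.php">Login</a> <a href="/setup.php">Setup</a>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/login.php', '/setup.php']
```

The scope setting above is what keeps a crawl like this from wandering off into third-party sites.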

And that’s it. You are now free to analyse the files Burp gives you and begin to manipulate them. I’ll soon be making more posts about the other functionalities Burp has that will help you become a better white-hat hacker.

Before we can get into the real nitty-gritty of what Burp Suite is and what it does, we’ll have to take baby steps getting into it. And the first step is configuring Burp Suite to work with our browsers. This Burp Suite setup guide will show you how. First, let’s open it up. I should mention that to run the Burp .jar file you need version 1.6 or later of Java. If you’re not sure which version you have, just type “java -version” into Command Prompt and it’ll tell you. Unless your computer has a virus made specifically to stop Burp Suite from running, you should see a splash screen, and then the project setup window.

I’m going to assume you didn’t already buy the premium version of Burp, so just click Next with ‘Temporary Project’ selected, then select ‘Use Burp Defaults’ and click Start Burp on the screen after that. Now we’re at the main window.

I remember the reaction I had the first time I came upon this page, which was “Woah”; that top bar has more tabs than I have immediate family members. Don’t you worry, dear reader, I’ll go over each tab one by one, and you’ll be a pro at this in no time. For now, we can ignore most of these and focus on what we’re trying to do right now, which is set up Burp with a browser of your choice. Let’s go to the second tab, ‘Proxy’, and then the ‘Options’ subtab under it; what we’re looking for is the Proxy Listeners table.

Check to make sure that in the Proxy Listeners table there is a running entry bound to interface If there isn’t, press the gear to the left of the table and then ‘Restore Defaults’.

The next thing we’re going to do is set up your browser to use Burp as an HTTP proxy server. It’s different for every browser, so I’ll just put them all and you can skip ahead to the browser you’re working with.

Internet Explorer:
Press the gear at the top right corner and then ‘Internet Options’, which will take you to the Internet Options window.

Go to the Connections tab at the top and press ‘LAN Settings’. Uncheck the ‘Automatically detect settings’ and ‘Use automatic configuration script’ boxes. Check the ‘Use a proxy server for your LAN’ box and enter the Burp proxy listener address and port, which are and 8080 by default. Uncheck the ‘Bypass proxy server for local addresses’ box if it’s checked. Click ‘Advanced’ and check the ‘Use the same proxy server for all protocols’ box, and make sure there are no entries in the ‘Exceptions’ field.

Chrome uses the same proxy settings as your computer, so you can just follow the instructions for Internet Explorer and Chrome will pick up on it as well.

Firefox:
Press the three lines in the top right corner, click on ‘Options’ and then ‘Advanced’ on the left. Click the ‘Network’ tab and click on the ‘Settings’ button under ‘Connection’, which opens the Connection Settings window.

Select ‘Manual proxy configuration’ and enter your Burp proxy listener address ( in the HTTP Proxy field and 8080 for the port. Check the ‘Use this proxy server for all protocols’ box and make sure the ‘No Proxy for’ field is empty.

After Setting Up Browser
I just made this subtitle so you wouldn’t get confused about where the Firefox heading ends. Anyway, try out what you have so far by going to any HTTP website (not HTTPS yet, I’ll get to that). The site shouldn’t load completely, and that’s what’s supposed to happen. Open up Burp again and go to ‘Proxy’ and then the ‘Intercept’ subtab under it. Your HTTP request should be there. This just means that Burp intercepted your HTTP request for tinkering. Click on the ‘Intercept is on’ button so it changes to ‘Intercept is off’, and that will allow the website to load. If you tried to load an HTTPS URL, though, you would get a warning from your browser. To allow you to work with HTTPS URLs, you need to install Burp’s CA certificate, which is done differently for each browser.
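Incidentally, the browser isn’t the only thing you can point at Burp. A script can route its traffic through the same listener, which is handy for testing the setup. A sketch using the standard library and Burp’s default listener address (assuming you kept the defaults above):

```python
import urllib.request

# Route stdlib HTTP requests through Burp's default proxy listener,
# the same address the browser was just configured with. With
# 'Intercept is on', each request will wait in Proxy > Intercept
# until you forward it.
burp = urllib.request.ProxyHandler({
    "http": "",
    "https": "",  # HTTPS also needs Burp's CA cert trusted
})
opener = urllib.request.build_opener(burp)
# opener.open("http://example.com/")  # uncomment with Burp running
```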

Internet Explorer
With Burp running, go to http://burp/ and click on CA Certificate at the top. Download the file and open it. Click ‘Install Certificate’, then ‘Next’, then ‘Place all certificates in the following store’ and ‘Browse’. Here it should give you a small window with a bunch of different folders. Select ‘Trusted Root Certification Authorities’ and then just click ‘Next’, ‘Finish’, and ‘Yes’ to complete the installation process. Restart IE and you should be able to go to any HTTPS website.

Just as before, Chrome uses the same settings as IE does so just follow the instructions for that.

Firefox
With Burp running, go to http://burp/ and click on CA Certificate at the top. Download the file, but you don’t have to open it. Press the three little lines at the top right and then ‘Options’. Click on the ‘Advanced’ tab, and then the ‘Certificates’ subtab. Click on ‘View Certificates’. Select the ‘Authorities’ tab, and ‘Import’. Find the file you downloaded just now and click ‘Open’. A dialog box should pop up; check ‘Trust this CA to identify web sites’ and click ‘OK’. Close everything, and after restarting Firefox you should be able to go to any HTTPS website.

In The End
If everything is running smoothly, you should be able to intercept HTTP and HTTPS websites without a hitch. In a couple of days I’ll start posting about the different bits and pieces of Burp, and what makes it such a powerful tool.

Penetration testing and vulnerability assessment are among the main aspects of security. Simply put, these terms are just fancy ways of saying that the only safe way to know how you can be hacked is to hack yourself. Companies hire security consultants to legally tear apart their websites piece by piece and put them back together again, stronger and more secure than they were before. Security consultants (and malicious hackers) employ several tools to do their jobs, one of which is Burp Suite.

Burp Suite is an interception proxy. A proxy is a program, computer, or server that acts as a hub through which your network accesses the internet. Proxies are usually used to anonymise the user by hiding his or her IP address and replacing it with the address of the proxy, allowing the user to hide their identity from the rest of the world. Burp Suite works on the same principle: it takes the internet traffic going through it and (here’s the fun part) lets us mess with that traffic. That’s where the “interception” part of “interception proxy” comes in. I’ll make a separate post on how to set up the program itself and how to configure it with your machine, because there are quite a few steps to do that; this post is just to help you understand what you can do with Burp.

Burp has a number of tools that you can use to perform a wide variety of tasks, ranging from simple to incredibly advanced. These tools are shown as subsections in the program.

  • The first is Spider, which you can use to crawl a site or web application. “Crawling” is the act of sifting through every page that a site has to offer in order to gain the full scope of the task. Without it, you might miss a couple of vulnerabilities that you could have caught. If you have the time for it, crawl manually without Spider, or at the very least don’t rely solely on the program to do it for you; it can make mistakes too.
  • Next is Scanner, a premium-only program that makes your job easier by scanning the site for any vulnerabilities. This is a pretty important tool and is worth Premium’s price point.
  • The Intruder tool comes next, and it’s a powerful one. This is your main attacking tool that you’ll use to prod and poke at a website to see what makes it tick. You can use it for a very large variety of purposes; for example, if the site lets a user sign up or log in, you can try to see which characters work, which don’t, and which crash the site or give administrator access by accident.
  • Repeater, similarly to Intruder, can be used to repeatedly (thus the name) issue HTTP requests into different input or manipulation fields.
  • Sequencer looks over the site’s random elements, the important stuff that you want to be encrypted or randomised, and analyses just how random it is.
  • Decoder, a relatively simple tool, decodes and encodes (translates) different types of data. It takes HTML, URL, Base64, GZIP, hexadecimal, ASCII hexadecimal, Octal, and Binary.
  • Finally, the Comparer tool makes comparisons between two pieces of data. If two pieces of data are much too long to check by eye, you can pop them both into the Comparer and it’ll tell you how they differ.
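The translations Decoder performs map directly onto standard-library calls. A small sketch of two of them, Base64 and URL encoding, on a made-up cookie value:

```python
import base64
from urllib.parse import quote, unquote

# A hypothetical cookie value you might want to inspect or tamper with.
secret = "user=admin; role=guest"

# Base64 encode/decode, as Decoder's Base64 option does.
b64 = base64.b64encode(secret.encode()).decode()

# URL (percent) encode/decode, as Decoder's URL option does.
url = quote(secret)

print(b64)           # dXNlcj1hZG1pbjsgcm9sZT1ndWVzdA==
print(unquote(url))  # user=admin; role=guest
```

Spotting a Base64-looking token in a request and decoding it this way is often the first hint that an application is hiding state somewhere it shouldn’t.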

This is a very, very, very basic look at what Burp Suite is and what it can do for you. I’ll be rolling out blog posts with specific instructions and examples for each tool in the coming weeks. Keep these in mind until then and remember to always stay on your toes. See you next week.

Think about what security means to you. It’s not too hard to come up with a few lessons or adages that help us stay safe in our everyday lives. Lock your door when leaving the house, walk in well-lit areas, know your emergency numbers. Security gives us a sense of comfort, knowing that we, our loved ones, and our assets, are safe. The strategies that companies and governments employ in order to maintain their security against physical, real-world threats are well-known and can be easily observed (although just as easily misinterpreted) by anyone. I’m talking guards, cameras, vaults. Big, glaring signs of power that have “Don’t Mess With Me” written all over them. Our hi-tech age, however, is changing things. People are communicating globally, entire libraries are uploaded to the cloud, and information has never been more abundant or easier to obtain. With this come new security risks, more subtle, and yet more devastating as well. I’m talking, as you may have guessed, about hacking.

Hacking as portrayed in movies and TV is at once exactly the same as and completely different from how it is in real life. This is because the term is so broad and generalised that it can encompass a myriad of individuals and professions. Hackers who live in their vans, sustaining themselves on a steady diet of Cheetos and Diet Pepsi paid for by selling email accounts acquired from phishing bots, do exist, along with suit-and-tie businessmen who make good money, legal money in fact, from hacking the world’s top companies and selling them the flaws. There also exist those who would release an entire database of user information to the world for no other reason than poops and giggles. A hacker can shut down a power station, or take control of a million PCs that’ll run DDoS attacks to shut down a bank’s website. Point is, whilst planning a security breach used to consist mostly of “shoot X, blow up Y”, the possibilities of digital crime now are endless.

These new digital dangers are the reason this blog was made. Every week or so, I will make a blog post summarising a concept in security. If a concept is too big for one post (or if I just really like it), then I’ll spread it out into several. I’ll try to keep the topics as varied as possible, from how the CIA plans to open the Boston Bomber’s iPhone to why you should never trust a Nigerian Prince begging for money. However, know that I am explaining these concepts purely with the intention to help protect and inform, not breach or destroy. You are forbidden, dear reader, from going out into the world and hacking into McDonalds’ Corporate office using a Starbucks’ WiFi. Be warned that this is not only unethical but more importantly illegal as all hell. Keep this in mind and remember to always stay on your toes. See you next week.

  • Apache HDFS: 2.3.0
  • Apache MapReduce (for MR1): 1.2.1
  • Apache YARN (for MR2): 2.3.0
  • Apache Hive: 0.12.0
  • Cloudera Impala: 2.0.0
  • Apache HBase: 0.98.0
  • Apache Accumulo: 1.6.0
  • Apache Solr: 4.4.0
  • Apache Oozie: 4.0.0
  • Cloudera Hue: 3.5.0
  • Apache ZooKeeper: 3.4.5
  • Apache Flume: 1.5.0
  • Apache Sqoop: 1.4.4
  • Apache Sentry (Incubating): 1.4.0-incubating

In short:

  • Hadoop Common: A set of shared libraries
  • HDFS: The Hadoop filesystem
  • MapReduce: Parallel computation framework
  • ZooKeeper: Configuration management and coordination
  • HBase: Column-oriented database on HDFS
  • Hive: Data warehouse on HDFS with SQL-like access
  • Pig: Higher-level programming language for Hadoop computations
  • Oozie: Orchestration and workflow management
  • Mahout: A library of machine learning and data mining algorithms
  • Flume: Collection and import of log and event data
  • Sqoop: Imports data from relational databases
  • The Hadoop Distributed File System, or HDFS, is often considered the foundation component for the rest of the Hadoop ecosystem. HDFS is the storage layer for Hadoop and provides the ability to store mass amounts of data while growing storage capacity and aggregate bandwidth in a linear fashion. HDFS is a logical filesystem that spans many servers, each with multiple hard drives. This is important to understand from a security perspective because a given file in HDFS can span many or all servers in the Hadoop cluster. This means that client interactions with a given file might require communication with every node in the cluster. This is made possible by a key implementation feature of HDFS that breaks up files into blocks. Each block of data for a given file can be stored on any physical drive on any node in the cluster. The important security takeaway is that all files in HDFS are broken up into blocks, and clients using HDFS will communicate over the network to all of the servers in the Hadoop cluster when reading and writing files.
    • NameNode
      The NameNode is responsible for keeping track of all the metadata related to the files in HDFS, such as filenames, block locations, file permissions, and replication. From a security perspective, it is important to know that clients of HDFS, such as those reading or writing files, always communicate with the NameNode.
    • DataNode
      The DataNode is responsible for the actual storage and retrieval of data blocks in HDFS. Clients of HDFS reading a given file are told by the NameNode which DataNode in the cluster has the block of data requested. When writing data to HDFS, clients write a block of data to a DataNode determined by the NameNode. From there, that DataNode sets up a write pipeline to other DataNodes to complete the write based on the desired replication factor.
    • JournalNode
      The JournalNode is a special type of component for HDFS. When HDFS is configured for high availability (HA), JournalNodes take over the NameNode responsibility for writing HDFS metadata information. Clusters typically have an odd number of JournalNodes (usually three or five) to ensure majority. For example, if a new file is written to HDFS, the metadata about the file is written to every JournalNode. When the majority of the JournalNodes successfully write this information, the change is considered durable.
    • HttpFS
      HttpFS is a component of HDFS that provides a proxy for clients to the NameNode and DataNodes. This proxy is a REST API and allows clients to communicate to the proxy to use HDFS without having direct connectivity to any of the other components in HDFS. HttpFS will be a key component in certain cluster architectures.
    • NFS Gateway
      The NFS gateway, as the name implies, allows for clients to use HDFS like an NFS-mounted filesystem. The NFS gateway is an actual daemon process that facilitates the NFS protocol communication between clients and the underlying HDFS cluster. Much like HttpFS, the NFS gateway sits between HDFS and clients and therefore affords a security boundary that can be useful in certain cluster architectures.
    • KMS
      The Hadoop Key Management Server, or KMS, plays an important role in HDFS transparent encryption at rest. Its purpose is to act as the intermediary between HDFS clients, the NameNode, and a key server, handling encryption operations such as decrypting data encryption keys and managing encryption zone keys.
  • Apache YARN
    •  Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications.
    • Other processing frameworks and applications, such as Impala and Spark, use YARN as the resource management framework. While YARN provides a more general resource management framework, MapReduce is still the canonical application that runs on it. MapReduce that runs on YARN is considered version 2, or MR2 for short.
  • Apache MapReduce
    • MapReduce is the processing counterpart to HDFS and provides the most basic mechanism to batch process data. When MapReduce is executed on top of YARN, it is often called MapReduce2, or MR2. This distinguishes the YARN-based version of MapReduce from the standalone MapReduce framework, which has been retroactively named MR1. MapReduce jobs are submitted by clients to the MapReduce framework and operate over a subset of data in HDFS, usually a specified directory. MapReduce itself is a programming paradigm that allows chunks of data, or blocks in the case of HDFS, to be processed by multiple servers in parallel, independent of one another. While a Hadoop developer needs to know the intricacies of how MapReduce works, a security architect largely does not. What a security architect needs to know is that clients submit their jobs to the MapReduce framework and, from that point on, the MapReduce framework handles the distribution and execution of the client code across the cluster. Clients do not interact with any of the nodes in the cluster to make their job run. Jobs themselves require some number of tasks to be run to complete the work. Each task is started on a given node by the MapReduce framework’s scheduling algorithm.
    • A key point about MapReduce is that other Hadoop ecosystem components are frameworks and libraries on top of MapReduce, meaning that MapReduce handles the actual processing of data, but these frameworks and libraries abstract the MapReduce job execution from clients. Hive, Pig, and Sqoop are examples of components that use MapReduce in this fashion.
  • Apache Hive
    • The Apache Hive project was started by Facebook. The company saw the utility of MapReduce to process data but found limitations in adoption of the framework due to the lack of Java programming skills in its analyst communities. Most of Facebook’s analysts did have SQL skills, so the Hive project was started to serve as a SQL abstraction layer that uses MapReduce as the execution engine.
  • Cloudera Impala
    • Cloudera Impala is a massively parallel processing (MPP) framework that is purpose-built for analytic SQL. Impala reads data from HDFS and utilizes the Hive metastore for interpreting data structures and formats.
    • New users to the Hadoop ecosystem often ask what the difference is between Hive and Impala because they both offer SQL access to data in HDFS. Hive was created to allow users that are familiar with SQL to process data in HDFS without needing to know anything about MapReduce. It was designed to abstract the innards of MapReduce to make the data in HDFS more accessible. Hive is largely used for batch access and ETL work. Impala, on the other hand, was designed from the ground up to be a fast analytic processing engine to support ad hoc queries and business intelligence (BI) tools. There is utility in both Hive and Impala, and they should be treated as complementary components.
  • Apache Sentry
    Sentry is the component that provides fine-grained role-based access controls (RBAC) to several of the other ecosystem components, such as Hive and Impala. While individual components may have their own authorization mechanism, Sentry
    provides a unified authorization that allows centralized policy enforcement across components. It is a critical component of Hadoop security.

    • Sentry server
      The Sentry server is a daemon process that facilitates policy lookups made by other Hadoop ecosystem components. Client components of Sentry are configured to delegate authorization decisions based on the policies put in place by Sentry.
    • Policy database
      The Sentry policy database is the location where all authorization policies are stored. The Sentry server uses the policy database to determine if a user is allowed to perform a given action. Specifically, the Sentry server looks for a matching policy that grants access to a resource for the user. In earlier versions of Sentry, the policy database was a text file that contained all of the policies.
  • Apache HBase
    • HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. HBase features compression, in-memory operation, and Bloom filters on a per-column basis. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro, or Thrift gateway APIs. HBase is a column-oriented key-value data store and typically utilizes HDFS as the underlying storage layer for data.
  • Apache Accumulo
    • Apache Accumulo is a sorted and distributed key/value store designed to be a robust, scalable, high-performance storage and retrieval system. Like HBase, Accumulo was originally based on the Google BigTable design, but was built on top of the Apache Hadoop ecosystem of projects (in particular, HDFS, ZooKeeper, and Apache Thrift). Accumulo uses roughly the same data model as HBase.
  • Apache Solr
    • The Apache Solr project, and specifically SolrCloud, enables the search and retrieval
      of documents that are part of a larger collection that has been sharded across multiple physical servers. Search is one of the canonical use cases for big data and is one of the most common utilities used by anyone accessing the Internet. Solr is built on top of the Apache Lucene project, which actually handles the bulk of the indexing and search capabilities. Solr expands on these capabilities by providing enterprise search features such as faceted navigation, caching, hit highlighting, and an administration interface.
      Solr has a single component, the server. There can be many Solr servers in a single deployment, which scale out linearly through the sharding provided by SolrCloud. SolrCloud also provides replication features to accommodate failures in a distributed environment.
  • Apache Oozie
    • Apache Oozie is a workflow management and orchestration system for Hadoop. It allows for setting up workflows that contain various actions, each of which can utilize a different component in the Hadoop ecosystem. For example, an Oozie workflow could start by executing a Sqoop import to move data into HDFS, then a Pig script to transform the data, followed by a Hive script to set up metadata structures. Oozie allows for more complex workflows, such as forks and joins that allow multiple steps to be executed in parallel, and other steps that rely on multiple steps to be completed before continuing. Oozie workflows can run on a repeatable schedule based on different types of input conditions such as running at a certain time or waiting until a certain path exists in HDFS.
      Oozie consists of just a single server component, and this server is responsible for handling client workflow submissions, managing the execution of workflows, and reporting status.
  • Apache ZooKeeper
    • Apache ZooKeeper is a distributed coordination service that allows for distributed systems to store and read small amounts of data in a synchronized way. It is often used for storing common configuration information. Additionally, ZooKeeper is heavily used in the Hadoop ecosystem for synchronizing high availability (HA) services, such as NameNode HA and ResourceManager HA. ZooKeeper itself is a distributed system that relies on an odd number of servers called a ZooKeeper ensemble to reach a quorum, or majority, to acknowledge a given transaction. ZooKeeper has only one component, the ZooKeeper server.
  • Apache Flume
    • Apache Flume is an event-based ingestion tool that is used primarily for ingestion into Hadoop, but can actually be used completely independent of it. Flume, as the name would imply, was initially created for the purpose of ingesting log events into HDFS. The Flume architecture consists of three main pieces: sources, sinks, and channels. A Flume source defines how data is to be read from the upstream provider. This would include things like a syslog server, a JMS queue, or even polling a Linux directory. A Flume sink defines how data should be written downstream. Common Flume sinks include an HDFS sink and an HBase sink. Lastly, a Flume channel defines how data is stored between the source and sink. The two primary Flume channels are the memory channel and file channel. The memory channel affords speed at the cost of reliability, and the file channel provides reliability at the cost of speed. Flume consists of a single component, a Flume agent. Agents contain the code for sources, sinks, and channels. An important part of the Flume architecture is that Flume agents can be connected to each other, where the sink of one agent connects to the source of another.
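The source/channel/sink wiring described above is defined in a Flume agent properties file. A minimal sketch might look like the following (the agent name `a1`, component names `r1`/`c1`/`k1`, port, and HDFS path are all illustrative choices, not required values):

```properties
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: read syslog events over TCP
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

# Channel: in-memory buffering (fast, but events are lost if the agent dies)
a1.channels.c1.type = memory

# Sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```

Swapping the memory channel for a file channel trades speed for durability, matching the trade-off described above.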
  • Apache Sqoop
    Apache Sqoop provides the ability to do batch imports and exports of data to and from a traditional RDBMS, as well as other data sources such as FTP servers. Sqoop itself submits map-only MapReduce jobs that launch tasks to interact with the RDBMS in a parallel fashion. Sqoop is used both as an easy mechanism to initially seed a Hadoop cluster with data, as well as a tool used for regular ingestion and extraction routines. Sqoop1 is a set of client libraries that are invoked from the command line using the sqoop binary.
  • Cloudera Hue
    • Cloudera Hue is a web application that exposes many of the Hadoop ecosystem components in a user-friendly way. Hue allows for easy access into the Hadoop cluster without requiring users to be familiar with Linux or the various command-line interfaces the components have. Hue has a number of different security controls available. Hue is comprised of the following components:
  • Hue server
    • This is the main component of Hue. It is effectively a web server that serves web content to users. Users are authenticated at first logon and from there, actions performed by the end user are actually done by Hue itself on behalf of the user. This concept is known as impersonation.
  • Kerberos Ticket Renewer
    • As the name implies, this component is responsible for periodically renewing the Kerberos ticket-granting ticket (TGT), which Hue uses to interact with the Hadoop cluster when the cluster has Kerberos enabled.


  • Practical Hadoop Security by Bhushan Lakhe
  • Securing Hadoop by Sudheesh Narayanan
  • Hadoop Security by Ben Spivey and Joey Echeverria
  • Big Data Forensics – Learning Hadoop Investigations by Joe Sremack
  • Zed Attack Proxy (ZAP) is a web application penetration testing tool
  • Used as a framework for automated security tests
  • It’s a cross-platform tool and can be used on UNIX, Windows, or Mac OS
  • ZAP is an intercepting proxy
  • It provides both active and passive scanners: the passive scanner just examines requests and responses, while the active scanner performs a wide range of attacks
  • It has an excellent report generation ability
  • ZAP can also find hidden directories and files using the Brute Force component (based on OWASP DirBuster code)
  • It can also fuzz parameters, and includes fuzzing libraries (fuzzdb & OWASP JBroFuzz)
  • ZAP has the following additional features:
    • Auto tagging: this feature tags messages so that you can easily see which messages have hidden fields
    • Port scanner, so you can see which ports are open on a computer
    • Parameter analysis: it analyzes all requests and shows you a summary of all the parameters the application uses
    • Smart card support: it’s very useful if an application you are testing uses smart cards or tokens for authentication
    • Session comparison
    • Invoke external applications
    • API + Headless mode
    • Dynamic SSL Certificates allow ZAP to intercept HTTPS traffic
    • Anti CSRF token handling
  • During initial installation ZAP offers to create an SSL Root CA certificate. This allows the proxy to intercept all HTTPS traffic; you will need it if you plan to test any application served over HTTPS. The steps are the following:
    • Generate SSL certificate
    • Save it
    • Import it to your browser
  • Don’t forget to amend Connection Settings in your browser and specify ZAP as your HTTP proxy
  • After a successful installation you can perform a basic penetration test
  • A basic penetration test
    • Configure your browser to use ZAP as a proxy
    • Explore the application manually
    • Use the Spider to find hidden content
    • See what issues the Passive Scanner has found
    • Use the Active Scanner to find vulnerabilities
    • Review all vulnerabilities that were found during Active Scanning
  • ZAP can be used for completely automated security tests in conjunction with Apache Ant and the Selenium framework
  • ZAP has three modes: Safe mode doesn’t allow you to do anything potentially dangerous, Protected mode allows you to do potentially dangerous things on items in Scope, and Standard mode allows you to do dangerous things on anything
  • ZAP can keep track of all HTTP sessions and allows you to switch between them
  • Web sockets are very popular nowadays, and ZAP currently has some of the best WebSocket support
  • Password encryption
    • Never store passwords in plain text
    • Ideal is one-way hashing: passwords should only ever be verified, never decrypted
    • Don’t use MD5 (or SHA-1) anymore; better hash choices are SHA-2 (SHA-256, SHA-512), Whirlpool, or Tiger. Note that AES and Blowfish are encryption ciphers, not hashes, though bcrypt is built on Blowfish
    • The best choice is bcrypt (based on Blowfish): it’s secure, free, easy to use, and deliberately slow
  • Salting passwords
    • A salt is additional data added to the password before hashing; the main purpose of salts is to defend against dictionary attacks
    • Salts can be made unique to each user
    • Salts can be created as pseudo-random strings (for example, using time functions); in that case the salts need to be saved in the database. Salts can be hashed as well
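A minimal sketch of per-user salting in Python, using the standard library’s `secrets` module for the salt and SHA-256 for the hash. The function names are made up for illustration, and a real deployment should prefer a deliberately slow algorithm such as bcrypt:

```python
import hashlib
import secrets

def hash_password(password, salt=None):
    # A random per-user salt defeats precomputed dictionary attacks:
    # the same password yields a different digest for every user.
    if salt is None:
        salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + password).encode()).hexdigest()
    return salt, digest

def verify_password(password, salt, stored_digest):
    # Re-hash the candidate with the stored salt and compare.
    return hash_password(password, salt)[1] == stored_digest

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))                   # False
```

Both the salt and the digest are stored in the database; only the plaintext password is never persisted.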
  • Password requirements
    • Require a minimum length, but do not limit the maximum length
    • Require non-alphanumeric characters
    • Ask user to confirm password
    • Report password strength to user
    • Do not record password hint
    • Security questions may be vulnerable to attack: internet research could reveal the answers, and a user’s friends or family members might know them
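The length and character requirements above can be sketched as a simple validator. The minimum length and the exact messages are illustrative choices, not a standard:

```python
import re

MIN_LENGTH = 10  # illustrative minimum; pick per policy

def password_issues(password):
    """Return a list of policy violations (an empty list means acceptable)."""
    issues = []
    if len(password) < MIN_LENGTH:
        issues.append("too short")
    if not re.search(r"[^a-zA-Z0-9]", password):
        issues.append("needs a non-alphanumeric character")
    return issues

print(password_issues("abc"))              # ['too short', 'needs a non-alphanumeric character']
print(password_issues("tr0ub4dor&horse"))  # []
```

Returning every violation at once (rather than just the first) makes it easy to report password strength back to the user, as the notes suggest.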
  • Brute force attacks
    • A hacker tries all possible passwords over and over again until the correct one is found
    • To strengthen passwords, allow all characters and long strings
    • Enforce a clipping level (lock out after repeated failures) and use slow password-hashing algorithms, as well as timing and throttling
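The “slow hashing” defense can be illustrated with the standard library’s PBKDF2 (bcrypt would require a third-party package). The iteration count below is illustrative; it is tuned so that each guess costs the attacker real CPU time:

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative; tune so one hash takes tens of milliseconds

def slow_hash(password, salt):
    # PBKDF2 applies the hash many times, making each brute-force guess costly.
    return hashlib.pbkdf2_hmac("sha256", password, salt, ITERATIONS)

salt = os.urandom(16)
stored = slow_hash(b"hunter2!", salt)

# Constant-time comparison avoids leaking information through timing.
print(hmac.compare_digest(stored, slow_hash(b"hunter2!", salt)))  # True
print(hmac.compare_digest(stored, slow_hash(b"guess", salt)))     # False
```

Throttling and lockouts limit online guessing; the slow hash limits offline guessing if the password database is ever stolen.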
  • SSL
    • Provides communication security
    • Verifies authenticity of remote server
    • Encrypts all data exchanged with server
    • Prevents snooping, session hijacking
    • Requires all assets on a page, such as JavaScript, CSS, and images, to also be served securely (avoiding mixed content)
    • With SSL you must encrypt all credit card transactions and any usernames/passwords being sent to the server
  • Protecting Cookies
  • Regulating Access Privilege
    • Least privileges
    • Make privileges easy to revoke
    • Restrict access to access privilege administration tools
    • Divide restricted actions into “privilege areas”
    • Regulate access by user access level or category
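The “privilege areas” idea above can be sketched as a simple role lookup with default deny. The area and role names here are made up for illustration:

```python
# Hypothetical privilege areas mapped to the roles allowed to use them.
PRIVILEGE_AREAS = {
    "view_reports": {"analyst", "admin"},
    "edit_users":   {"admin"},
    "grant_roles":  {"admin"},  # access-administration tools stay tightly held
}

def is_allowed(role, action):
    # Default deny: unknown actions and unknown roles get no access.
    return role in PRIVILEGE_AREAS.get(action, set())

print(is_allowed("analyst", "view_reports"))  # True
print(is_allowed("analyst", "edit_users"))    # False
```

Centralizing the mapping in one table also makes privileges easy to revoke: removing a role from a set takes effect everywhere that area is checked.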
  • Handling Forgotten Passwords
    • Ask about privileged information
    • Ask security challenge questions
    • Since the user’s email address serves as their identity, we can send an email with a reset token
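A sketch of issuing and checking such a reset token with the standard library’s `secrets` module. The function names and the one-hour lifetime are illustrative; in practice the server would store the token (ideally hashed) with its expiry and email it to the account’s address:

```python
import secrets
import time

TOKEN_TTL_SECONDS = 3600  # illustrative: tokens expire after one hour

def issue_reset_token():
    # A long random token that is infeasible to guess.
    token = secrets.token_urlsafe(32)
    expires_at = time.time() + TOKEN_TTL_SECONDS
    return token, expires_at

def token_valid(presented, stored, expires_at, now=None):
    now = time.time() if now is None else now
    # Constant-time comparison, and the token must not have expired.
    return secrets.compare_digest(presented, stored) and now < expires_at

token, expires_at = issue_reset_token()
print(token_valid(token, token, expires_at))     # True
print(token_valid("forged", token, expires_at))  # False
```

Tokens should also be single-use: once a password is reset, the stored token is deleted.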
  • Multi factor authentication
    • Authentication requires two or more factors
    • Something only the user knows, something only the user has, something only the user is
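The “something the user has” factor is commonly a TOTP authenticator app. A minimal sketch of the TOTP algorithm (RFC 6238: HMAC the number of elapsed 30-second time steps, then dynamically truncate), checked here against the RFC’s SHA-1 test secret:

```python
import hashlib
import hmac
import struct
import time

def totp(secret, for_time=None, step=30, digits=6):
    # Count 30-second steps since the Unix epoch.
    counter = int((time.time() if for_time is None else for_time) // step)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test secret; at time 59 the 8-digit SHA-1 code is 94287082,
# so the 6-digit code is its last six digits.
print(totp(b"12345678901234567890", for_time=59))  # 287082
```

Because both sides derive the code from a shared secret plus the current time, the server can verify possession of the device without any network round trip to it.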