Hadoop and big data
In this section, we'll consider why Hadoop is a good choice for storing and accessing big data.
Imagine you want to process data, and a lot of it. In our previous example, we considered a scenario in which machine-generated web log files are being produced, and we want to use the information in those files to perform some analytics and produce some (hopefully) compelling data visualizations.
Using R worked there, but if we extend the scenario so that we keep receiving web log files over time and those files keep growing, R might not be a feasible answer, because R typically loads the data it works on into the memory of a single machine.
Entering Hadoop
Hadoop (as the product documentation says) is not your average database. In fact, Hadoop can store all kinds of data from many servers, websites, and corporate vaults, as much as you might need or want to gather. In addition, Hadoop spreads your work across hundreds or thousands of processors and storage drives working in parallel. Let's take a look at two practical examples using Hadoop.
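Before turning to those examples, it may help to see the idea of spreading work across workers in miniature. The following is only a local sketch of the map, shuffle, and reduce pattern that Hadoop MapReduce is built around; it does not use Hadoop or any Hadoop API. The sample log lines, the function names, and the use of a thread pool to stand in for cluster nodes are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: each inner list stands in for one web log file
# (one "input split") that a real cluster would hand to a separate node.
LOG_SPLITS = [
    ['10.0.0.1 - - "GET /index.html" 200', '10.0.0.2 - - "GET /missing" 404'],
    ['10.0.0.3 - - "GET /index.html" 200', '10.0.0.1 - - "POST /login" 500'],
    ['10.0.0.4 - - "GET /about.html" 200'],
]


def map_split(lines):
    """Map phase: emit a (status_code, 1) pair for every line in one split."""
    # The status code is assumed to be the last space-separated field.
    return [(line.rsplit(" ", 1)[-1], 1) for line in lines]


def shuffle(mapped_splits):
    """Shuffle phase: group the emitted values by key across all splits."""
    grouped = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_status_codes(splits, workers=3):
    """Run the map phase on each split concurrently, then shuffle and reduce."""
    # Threads are only a stand-in here; Hadoop distributes this work
    # across separate machines rather than threads in one process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_split, splits))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(count_status_codes(LOG_SPLITS))
```

Because each split is mapped independently, adding more log files only adds more map tasks, which is what lets this style of processing grow with the data.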
AWS for Hadoop projects
If you are new to Hadoop, that is, you do not already have a Hadoop environment available, you can begin evaluating Hadoop by downloading and installing one of the free Hadoop distributions. A common recommendation is to start any initial evaluation by running Hadoop in either local (standalone) or pseudo-distributed mode on a single machine. However, I strongly recommend that readers who are new to Hadoop not spend time downloading and configuring it, and instead consider subscribing (perhaps temporarily) to Hadoop as a service.
There are a variety of viable Software as a Service (SaaS) options, of which Amazon is one of the very best. Amazon Elastic MapReduce (EMR) is a subscription web service that provides a managed Hadoop framework, making it easy, fast, and cost effective to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances. Additionally, Amazon EMR gives you a secure and reliable environment for workloads such as log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
By running Hadoop on Amazon EMR, you get the benefits of the cloud:
- You can provision clusters of virtual servers within minutes
- You can scale the number of virtual servers in your cluster to match your computation needs, and pay only for what you use
- You can integrate with other Amazon Web Services (AWS) offerings
- You can run open source projects that sit on top of the Hadoop architecture
- You can use popular business intelligence tools such as Microsoft Excel, MicroStrategy, QlikView, and Tableau with Amazon EMR to explore and visualize your data
For the Hadoop use case examples in this book, Amazon EMR was an easy choice.
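To make the provisioning and scaling points above more concrete, here is a minimal sketch of how an EMR cluster request might be assembled in Python and submitted with the boto3 library's `run_job_flow` call. It is not the setup used for the examples in this book. The cluster name, release label, instance type, region, and the helper names `build_cluster_request` and `launch_cluster` are all assumptions, and the two IAM role names are the EMR defaults, which must already exist in your account.

```python
def build_cluster_request(name, core_nodes,
                          release_label="emr-6.15.0",   # assumed release
                          instance_type="m5.xlarge"):   # assumed instance type
    """Build the keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                # Changing core_nodes is how the cluster size is scaled
                # in this sketch.
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Shut the cluster down once there are no steps left to run,
            # so that you pay only for what you use.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Default EMR role names; assumed to exist in the AWS account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):  # assumed region
    """Submit the request to Amazon EMR and return the new cluster ID."""
    import boto3  # imported here so building a request needs no AWS setup

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Only builds and prints the request; calling launch_cluster(request)
    # would create billable AWS resources.
    request = build_cluster_request("weblog-analysis", core_nodes=4)
    print(request)
```

Keeping the request builder separate from the launch call means the cluster definition can be inspected or adjusted, for example by raising `core_nodes`, before anything is actually provisioned.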