Big Data Visualization

About Hadoop

Let's start out with an explanation of Hadoop that is generally circulated.

As per the Apache Hadoop entry on wikipedia.org (2016):

"Hadoop is an open-source software "framework" for distributed storage and distributed processing (of very large datasets) on computer clusters built from commodity hardware."


Hadoop uses a processing model called MapReduce. In this design, one processor in a cluster of processors is designated the master, and it distributes (maps) tasks to the other, slave, processors, which work on your data in parallel; the partial results are then combined (reduced) into a single output. Seen this way, the name MapReduce (the mapping and reducing of processing tasks) makes sense.
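To make the idea concrete, here is a toy, pure-Python sketch of the map and reduce steps applied to counting words. It runs on a single machine and involves no Hadoop at all, and the sample documents are invented for illustration:

    from collections import defaultdict

    def map_step(document):
        # The mapper: emit a (word, 1) pair for every word seen.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_step(pairs):
        # The reducer: combine all values that share a key.
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    documents = ["big data is big", "data needs processing"]
    all_pairs = (pair for doc in documents for pair in map_step(doc))
    print(reduce_step(all_pairs))
    # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'processing': 1}

In Hadoop, the map step would run in parallel on the nodes holding the data and the framework would shuffle the emitted pairs to the reducers; here, both steps simply run in sequence.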

Hadoop is able to take your data and split it up (or distribute it) over a number of computers that have space or resources available.

These computers need not be high-end, high-powered devices; they can be easily available, average machines (which is why they are called commodity hardware). They just need to be named as part of a group, or cluster, available to the Hadoop framework. That's the first part of the magic.

The other side of Hadoop is that it keeps track of where every file was placed and is able to make it all available (seemingly as one coherent data store) with minimal response time. It is important to note that simply scattering files of data about isn't all that clever; what is clever is that Hadoop knows, at any moment, which computers are closest to the data it wants to access. This vastly cuts down on the network traffic that would otherwise be caused by hunting for specific data when it is needed (that is, now where did I put that file?).
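As a small illustration of that one-coherent-data-store behavior, here is a minimal sketch of a client reading an HDFS file as though it were local. It assumes the pyarrow library built with libhdfs support; the host, port, and file path are invented for illustration:

    from pyarrow import fs

    # Connect to the cluster's NameNode (hypothetical host and port).
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

    # The client asks for a path; HDFS works out which machines hold
    # the file's blocks and streams them back transparently.
    with hdfs.open_input_stream("/data/sales_2016.csv") as stream:
        print(stream.read(1024).decode("utf-8", errors="replace"))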

Hadoop FAQs (Rouse, WhatIs.com, 2015) include:

  • Hadoop is free
  • Hadoop is Java-based
  • Hadoop 1.0 was released in 2011 (the project itself dates back to 2006)
  • Hadoop is part of the Apache project sponsored by the Apache Software Foundation (ASF)
  • Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts
  • Hadoop was named after a toy elephant belonging to the son of its creator, Doug Cutting
  • Hadoop consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper
  • The Hadoop framework is used by major players, including Google, Yahoo, and IBM, largely for applications involving search engines and advertising
  • The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X

Certainly, there are many more interesting Hadoop FAQs, but I'm sure that you get the picture, so let's move on.

What else but Hadoop?

Is Hadoop the only choice for storing and processing your big data?

You'll perhaps find that Hadoop is the tool best known for handling big data. In fact, some hold the misconception that big data and Hadoop are synonymous: that to talk about big data is to talk about Hadoop, and to discuss Hadoop is to discuss big data.

Obviously, that notion is incorrect.

In fact, there are a number of alternatives to Hadoop, and some are gaining popularity every day. As with any technology choice, there are both pros and cons to implementing Hadoop, and those cons (among other reasons) are driving the interest in other options.

Two popular alternatives to Hadoop are Apache Spark and Cluster MapReduce.

Apache Spark is open source (like Hadoop), runs in-memory, promises faster processing than Hadoop, and offers its own unique application programming interface (API); a short sketch of Spark's API follows the list below. Cluster MapReduce was developed on top of the Hadoop MapReduce framework's concepts by an online advertising company that was using Hadoop but wanted more. Compared to Hadoop, Cluster MapReduce supposedly offers a more efficient solution that:

  • Offers more straightforward creation of data queries
  • Has a lighter footprint
  • Is easier to customize
  • Is more resilient to failures
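Returning to Spark: as a taste of its API, here is a minimal PySpark word count. It assumes a working Spark installation, and the input and output paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read the input, emit (word, 1) pairs, and sum the counts per word.
    lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("hdfs:///data/word_counts")
    spark.stop()

These are the same map and reduce ideas as Hadoop's MapReduce, but the intermediate data stays in memory rather than being written to disk between steps, which is where much of Spark's speed advantage comes from.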

IBM too!

It would be prudent to take a moment here to consider IBM's enterprise version of Hadoop.

IBM has taken the Hadoop concepts and created its own version of a Hadoop-like platform for big data projects, the IBM Open Platform, built from the most current Apache Hadoop open source content. It is offered as a free download, and (as one would expect) there is also a paid support offering, should you be interested (perhaps in an effort to instill confidence in organizations considering development with an open source tool?).

In addition, IBM offers the IBM BigInsights Quick Start edition, which combines their open platform with what they are calling enterprise-grade features for data visualization (and advanced analytics) projects.

Note

You can review this at: www.ibm.com/analytics/us/en/technology/hadoop.

So, there really are options other than Hadoop, and more will come.

Hadoop is, without question, extremely powerful, but it uses complex methods for moving data and isn't all that efficient when dealing with unstructured data (a data type that is increasingly prevalent today). Given the new options, the automatic association of big data with Hadoop mentioned earlier is becoming less common.

In this book, we are using Hadoop, but those who deal with big data or unstructured data would do well to scope out all the available options (including simple scripting alternatives) when considering their own needs.

Part of any project's decision-making process is becoming familiar with the detailed requirements. On the topic of data visualization, it is imperative to know your data intimately. One big decision to make first is: does your data really qualify as big data?

In an article titled The Big Data Conundrum: How to Define It?, Stuart Ward writes:

"Some organizations point out that large datasets are not always complex and small datasets are not always simple. Their point is that the complexity of a dataset is an important factor in deciding whether it is "big."

An interesting point here is that data that is highly complex in nature can be considered big data, and can require the big data mindset, even though you may not be dealing with large volumes of data.

Before getting started with Hadoop and our Hadoop example use cases, let's take a few moments to consider a simpler solution: plain data file processing with a scripting language.
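As a hint of what such a script might look like, here is a minimal Python sketch that tallies sales per region from a flat CSV file; the file name and column names are invented for illustration:

    import csv
    from collections import defaultdict

    totals = defaultdict(float)

    # One pass over the file, accumulating a total per region.
    with open("sales_2016.csv", newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])

    for region, amount in sorted(totals.items()):
        print(region, amount)

For data that fits comfortably on one machine, a script like this is often simpler to build, run, and maintain than a cluster.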