Hands-On Big Data Analytics with PySpark

Getting Your Big Data into the Spark Environment Using RDDs

This chapter gives a brief overview of how to get your big data into the Spark environment using resilient distributed datasets (RDDs). We will use a range of tools to interact with and modify this data so that useful insights can be extracted. We will first load the data into Spark RDDs, then carry out parallelization with Spark RDDs, and finally look at the basics of RDD operations.

In this chapter, we will cover the following topics:

  • Loading data onto Spark RDDs
  • Parallelization with Spark RDDs
  • Basics of RDD operations