Hands-On Big Data Analytics with PySpark

Getting data into Spark

  1. Load the KDD Cup data into PySpark using the SparkContext, sc, as shown in the following command:
raw_data = sc.textFile("./kddcup.data.gz")

  2. Inspect the raw_data variable to confirm that the data is now loaded:
raw_data

The output is demonstrated in the following code snippet:

./kddcup.data.gz MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

Entering the raw_data variable shows that it is a MapPartitionsRDD backed by the kddcup.data.gz file. Note that textFile is lazy: Spark has only recorded where the data file is located, and the file itself is not read until an action is performed on the RDD.
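To see what sc.textFile gives you conceptually, the following plain-Python sketch reads a gzip-compressed file line by line, which mirrors how Spark exposes a .gz text file as an RDD with one element per line (Spark decompresses gzip transparently, though a gzipped file loads as a single, non-splittable partition). The file name and sample records here are hypothetical, chosen only to resemble the KDD Cup CSV format:

```python
import gzip

# Hypothetical sample records resembling the KDD Cup CSV format.
sample_lines = [
    "0,tcp,http,SF,215,45076,0,0,0,normal.",
    "0,tcp,http,SF,162,4528,0,0,0,normal.",
]

# Write them to a gzip file (stand-in for kddcup.data.gz).
with gzip.open("sample.data.gz", "wt") as f:
    f.write("\n".join(sample_lines) + "\n")

# Read the file back one line at a time -- the same per-line view
# that sc.textFile("./kddcup.data.gz") provides as RDD elements.
with gzip.open("sample.data.gz", "rt") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines[0])  # first record, one CSV line per element
```

In Spark, the equivalent per-element view is obtained with an action such as raw_data.take(1), which triggers the actual read and returns the first line of the file.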

Now that we know how to load the data into Spark, let's learn about parallelization with Spark RDDs.