Big Data Analysis with Python

Pandas DataFrames and Grouped Data

As we learned in the previous chapter, when analyzing data with pandas we can either use the pandas plot functions or use Matplotlib directly. Pandas uses Matplotlib under the hood, so the two integrate well. Depending on the situation, we can plot directly from pandas or create a figure and an axes with Matplotlib and pass the axes to pandas. For example, a GroupBy operation splits the data into groups, one per GroupBy key. But how can we plot the results of a GroupBy? We have a few approaches at our disposal. If the DataFrame is already in the right format, we can use pandas directly, here restricted to three keys:

Note

The following code is a sample and will not get executed.

fig, ax = plt.subplots()

df = pd.read_csv('data/dow_jones_index.data')

df[df.stock.isin(['MSFT', 'GE', 'PG'])].groupby('stock')['volume'].plot(ax=ax)

Or we can just plot each GroupBy key on the same plot:

fig, ax = plt.subplots()

df.groupby('stock').volume.plot(ax=ax)
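A further option is to iterate over the groups and draw each one explicitly on the shared axes, which makes it easy to label every line. The following is only a minimal sketch: the DataFrame is made up for illustration (the tickers, the week column, and the volume values are assumptions, not taken from the Dow Jones file), and the Agg backend is selected only so that the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for the stock file
sample = pd.DataFrame({
    "stock": ["MSFT", "MSFT", "MSFT", "GE", "GE", "GE"],
    "week": [1, 2, 3, 1, 2, 3],
    "volume": [100, 120, 90, 200, 180, 210],
})

fig, ax = plt.subplots()

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs,
# so each group can be drawn as its own labelled line
for name, group in sample.groupby("stock"):
    ax.plot(group["week"], group["volume"], label=name)

ax.set_xlabel("week")
ax.set_ylabel("volume")
ax.legend()
```

Each line is added to the same axes, so the legend and the axis labels are configured once, after the loop.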

For the following activity, we will use what we've learned in the previous chapter and read a CSV file from a URL and parse it. The dataset is the Auto-MPG dataset (https://raw.githubusercontent.com/TrainingByPackt/Big-Data-Analysis-with-Python/master/Lesson02/Dataset/auto-mpg.data).

Note

This dataset is a modified version of the dataset provided in the StatLib library. The original dataset is available in the auto-mpg.data-original file.

The data concerns city-cycle fuel consumption in miles per gallon, in terms of three multivalued discrete and five continuous attributes.

Activity 4: Line Graphs with the Object-Oriented API and Pandas DataFrames

In this activity, we will create a time series line graph from the Auto-MPG dataset as a first example of plotting using pandas and the object-oriented API. This kind of graph is common in analysis and helps to answer questions such as "is the average horsepower increasing or decreasing with time?"

Now, follow these steps to plot a graph of average horsepower per year using pandas and the object-oriented API:

  1. Import the required libraries and packages into the Jupyter notebook.
  2. Read the Auto-MPG dataset into a pandas DataFrame.
  3. Provide the column names to simplify the dataset, as illustrated here:

    column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']

  4. Now read the new dataset with column names and display it.
  5. Convert the horsepower column to float and the year column to integer.
  6. Now plot the graph of average horsepower per year using pandas:

Figure 2.10: Line Graphs with the Object-Oriented API and Pandas DataFrame

Note

The solution for this activity can be found on page 205.

Note that we are using the plot functions from pandas but passing the axes that we created directly with Matplotlib as an argument. As we saw in the previous chapter, this is not required, but it allows you to configure the plot outside pandas and change its settings later. The same approach applies to the other kinds of graphs. Let's now work with scatter plots.
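To illustrate the idea of plotting from pandas onto an axes created with Matplotlib, here is a minimal sketch. It is not the book's solution to the activity: the small DataFrame is invented for illustration, and the Agg backend is chosen only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values, not the real Auto-MPG data
cars = pd.DataFrame({
    "year": [1970, 1970, 1971, 1971, 1972, 1972],
    "horsepower": [130.0, 150.0, 100.0, 110.0, 90.0, 100.0],
})

# One average value per year
avg_hp = cars.groupby("year")["horsepower"].mean()

fig, ax = plt.subplots()
avg_hp.plot(ax=ax)  # pandas draws onto the axes we created

# Because we hold the axes, we can keep configuring it afterwards
ax.set_ylabel("average horsepower")
ax.set_title("Average horsepower per year")
```

The labels are set on the axes after pandas has drawn the line, which is the kind of later configuration the paragraph above refers to.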

Scatter Plots

To understand the correlation between two variables, scatter plots are generally used because they allow the distribution of points to be seen. Creating a scatter plot with Matplotlib is similar to creating a line plot, but instead of using the plot method, we use the scatter method.

Let's look at an example using the Auto-MPG dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/):

fig, ax = plt.subplots()

ax.scatter(x=df['horsepower'], y=df['weight'])

Figure 2.11: Scatter plot using Matplotlib library

Note that we called the scatter method directly from the axes. In Matplotlib parlance, we added a scatter plot to the axes, ax, that belongs to the figure, fig. We can also easily add more dimensions to the plot, such as color and point size, with Seaborn:

import seaborn as sns

sns.scatterplot(data=df, x='horsepower', y='weight', hue='cylinders', size='mpg')

Figure 2.12: Scatter plot using Seaborn library

As we can see, scatter plots are quite helpful for understanding the relationship between two variables, or even more. We can infer, for example, that there is a positive correlation between horsepower and weight. We can also easily see an outlier with the scatter plot, which could be more complicated when working with other kinds of graphs. The same principles for grouped data and pandas DataFrames that we saw on the line graphs apply here for the scatter plot.

We can generate a scatter plot directly from pandas using the kind parameter:

df.plot(kind='scatter', x='horsepower', y='weight')

Or we can create a figure and an axes with Matplotlib and pass the axes to pandas:

fig, ax = plt.subplots()

df.plot(kind='scatter', x='horsepower', y='weight', ax=ax)

Activity 5: Understanding Relationships of Variables Using Scatter Plots

To continue our data analysis and learn how to plot data, let's look at a situation where a scatter plot can help. For example, let's use a scatter plot to answer the following question:

Is there a relationship between horsepower and weight?

To answer this question, we need to create a scatter plot with the data from Auto-MPG:

  1. Use the Auto-MPG dataset, already ingested.

    Note

    Please refer to the previous exercise for how to ingest the dataset.

  2. Use the object-oriented API for Matplotlib:

    %matplotlib inline

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()

  3. Create a scatter plot using the scatter method:

    ax.scatter(x=df['horsepower'], y=df['weight'])

    Note

    The solution for this activity can be found on page 208.

We can identify a roughly linear relationship between horsepower and weight, with some outliers with higher horsepower and lower weight. This is the kind of graph that would help an analyst interpret the data's behavior.
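A visual impression of a linear relationship can be backed up with a number. As a minimal sketch, the Pearson correlation coefficient between two columns can be computed with the pandas corr method. The five rows below are invented for illustration and are not the Auto-MPG data.

```python
import pandas as pd

# Hypothetical values chosen to be roughly linear
cars = pd.DataFrame({
    "horsepower": [60.0, 90.0, 110.0, 150.0, 200.0],
    "weight": [1800, 2500, 2900, 3600, 4300],
})

# Series.corr uses the Pearson coefficient by default;
# values near +1 indicate a strong positive linear relationship
r = cars["horsepower"].corr(cars["weight"])
print(round(r, 3))
```

A single coefficient can hide outliers, so it complements the scatter plot rather than replacing it.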

Histograms

Histograms are a bit different from the graphs that we've seen so far, as they visualize the distribution of a single variable instead of the relationship between two or more. A histogram approximates the probability distribution of a variable by dividing its range into fixed intervals, or bins, and counting the number of values that fall into each one.

The bins are consecutive and adjacent but don't need to have the same size, although this is the most common arrangement.

The choice of the number of bins and the bin size depends more on the data and the goal of the analysis than on any fixed, general rule. The larger the number of bins, the narrower each bin, and vice versa. When the data has a lot of noise or variation, for example, a small number of wide bins will show the general outline of the data, reducing the impact of the noise in a first analysis. A larger number of bins is more useful when the data has a higher density.
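The trade-off between the number of bins and their width can be seen without drawing anything. The following sketch uses NumPy's histogram function on randomly generated values; the sample (500 draws from a normal distribution) is an assumption made purely for illustration.

```python
import numpy as np

# Synthetic sample, used only to illustrate binning
rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=20, size=500)

results = {}
for n_bins in (5, 50):
    # counts[i] is the number of values between edges[i] and edges[i + 1]
    counts, edges = np.histogram(values, bins=n_bins)
    results[n_bins] = (counts, edges)
    width = edges[1] - edges[0]
    print(f"{n_bins:>2} bins: width={width:.1f}, largest count={counts.max()}")
```

Both runs count the same 500 values; only the width of the bins, and therefore how much detail or noise is visible, changes.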

Exercise 12: Creating a Histogram of Horsepower Distribution

As we strive to understand the data, we now want to see the horsepower distribution over all cars. Questions that a histogram is well suited to answer include: what is the most frequent value of a variable? Is the distribution centered or does it have a tail? Let's plot a histogram of the horsepower distribution:

  1. Import the required libraries into the Jupyter notebook and read the dataset from the Auto-MPG dataset repository:

    import pandas as pd

    import numpy as np

    import matplotlib as mpl

    import matplotlib.pyplot as plt

    import seaborn as sns

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

    df = pd.read_csv(url)

  2. Provide the column names to simplify the dataset as illustrated here:

    column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']

  3. Now read the new dataset with column names and display it:

    df = pd.read_csv(url, names=column_names, sep=r'\s+')

    df.head()

    The output is as follows:

    Figure 2.13: The auto-mpg dataset

  4. Convert the horsepower column to float, replacing the '?' placeholders for missing values with NaN, and convert the year column to a four-digit integer using the following commands:

    df.loc[df.horsepower == '?', 'horsepower'] = np.nan

    df['horsepower'] = pd.to_numeric(df['horsepower'])

    df['full_date'] = pd.to_datetime(df.year, format='%y')

    df['year'] = df['full_date'].dt.year

  5. Create a graph directly from the Pandas DataFrame using the plot function and kind='hist':

    df.horsepower.plot(kind='hist')

    Figure 2.14: Histogram plot

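  6. Plot the horsepower distribution with Seaborn to see where the values concentrate:

    # histplot replaces distplot, which is deprecated since Seaborn 0.11
    sns.histplot(df['horsepower'], kde=True)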

Figure 2.15: Histogram concentration plot

We can see in this graph that the distribution is skewed to the right (positively skewed): most values are concentrated on the left, with far more cars having a horsepower between 50 and 100 than above 200, and a long tail extends toward the higher values. This could be quite useful in understanding how some data varies in an analysis.
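The conversion in step 4 of the exercise can also be written more compactly. The sketch below is an alternative, not the book's method: it uses the errors='coerce' option of pd.to_numeric, which turns any non-numeric entry into NaN, and it assumes that every two-digit year belongs to the 1900s. The three rows of input are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated rows with a '?' placeholder
raw = io.StringIO(
    "18.0 130.0 70\n"
    "25.0 ?     71\n"
    "22.0 95.00 72\n"
)
sample = pd.read_csv(raw, names=["mpg", "horsepower", "year"], sep=r"\s+")

# errors="coerce" turns anything non-numeric (here '?') into NaN
sample["horsepower"] = pd.to_numeric(sample["horsepower"], errors="coerce")

# Assumption: all two-digit years are 19xx, so add the century directly
sample["year"] = sample["year"] + 1900

print(sample.dtypes)
```

Coercion is convenient but silent, so it is worth counting the resulting NaN values (for example, with sample['horsepower'].isna().sum()) to confirm that only the expected placeholders were converted.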

Boxplots

Boxplots are also used to see the variation in values, but now within each column, so we can see, for example, how values compare when grouped by another variable. Boxplots are sometimes called whisker plots or box-and-whisker plots because of the lines that extend vertically from the main box:

Figure 2.16: Boxplot

Source

https://en.wikipedia.org/wiki/File:Michelsonmorley-boxplot.svg

A boxplot uses the first and third quartiles to draw the box. The line in the middle of the box is the second quartile, that is, the median. The definition of the whiskers can vary (for example, one standard deviation above and below the mean of the data), but it is common to extend them up to 1.5 times the interquartile range (Q3 - Q1) from the edges of the box. Any point beyond the whiskers, either above or below, is plotted as a dot and is usually considered an outlier.
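The 1.5 times interquartile range rule can be worked through by hand. The sketch below computes the quartiles, the whisker limits, and the outliers for a small made-up array with NumPy. It assumes the common convention in which each whisker ends at the most extreme data point that still lies within the limit.

```python
import numpy as np

# Hypothetical data with one obviously extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                      # interquartile range, Q3 - Q1

lower_limit = q1 - 1.5 * iqr       # furthest a whisker may reach downward
upper_limit = q3 + 1.5 * iqr       # furthest a whisker may reach upward

# Whiskers stop at the most extreme points inside the limits
inside = data[(data >= lower_limit) & (data <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits would be drawn as an individual dot
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3, iqr)         # 3.25 5.5 7.75 4.5
print(whisker_low, whisker_high)   # 1 9
print(outliers)                    # [100]
```

Here the box spans 3.25 to 7.75, the limits are -3.5 and 14.5, and the value 100 is the only point plotted as an outlier.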

Exercise 13: Analyzing the Behavior of the Number of Cylinders and Horsepower Using a Boxplot

Sometimes we want not only to see the distribution of each variable, but also to see the variation of the variable of interest with respect to another attribute. We would like to know, for instance, how the horsepower varies given the number of cylinders. Let's create a boxplot with Seaborn, comparing the horsepower distribution to the number of cylinders:

  1. Import the required libraries into the Jupyter notebook and read the dataset from the Auto-MPG dataset repository:

    %matplotlib inline

    import pandas as pd

    import numpy as np

    import matplotlib as mpl

    import matplotlib.pyplot as plt

    import seaborn as sns

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

    df = pd.read_csv(url)

  2. Provide the column names to simplify the dataset, as illustrated here:

    column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']

  3. Now read the new dataset with column names and display it:

    df = pd.read_csv(url, names=column_names, sep=r'\s+')

    df.head()

    The output is as follows:

    Figure 2.17: The auto-mpg dataset

  4. Convert the horsepower column to float, replacing the '?' placeholders for missing values with NaN, and convert the year column to a four-digit integer using the following commands:

    df.loc[df.horsepower == '?', 'horsepower'] = np.nan

    df['horsepower'] = pd.to_numeric(df['horsepower'])

    df['full_date'] = pd.to_datetime(df.year, format='%y')

    df['year'] = df['full_date'].dt.year

  5. Create a boxplot using the Seaborn boxplot function:

    sns.boxplot(data=df, x="cylinders", y="horsepower")

    Figure 2.18: Boxplot using the Seaborn boxplot function

  6. Now, just for comparison purposes, create the same boxplot using pandas directly:

    df.boxplot(column='horsepower', by='cylinders')

Figure 2.19: Boxplot using pandas

On the analysis side, we can see that the range of horsepower values for 3-cylinder cars is smaller than for 8-cylinder cars. We can also see that the 6- and 8-cylinder groups have outliers. As for the plotting, the Seaborn function is more complete: it automatically uses a different color for each number of cylinders and labels the axes with the DataFrame column names.
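A comparison of spreads read off a boxplot can also be checked numerically. As a minimal sketch, the describe method applied to a GroupBy returns the quartiles of each group, from which the interquartile range follows. The six rows below are invented for illustration and do not come from the Auto-MPG data.

```python
import pandas as pd

# Hypothetical values for two cylinder counts
cars = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8, 8],
    "horsepower": [70.0, 80.0, 90.0, 140.0, 180.0, 220.0],
})

# One row of summary statistics (count, mean, quartiles, ...) per group
summary = cars.groupby("cylinders")["horsepower"].describe()

# Interquartile range per group: the height of each box in the boxplot
spread = summary["75%"] - summary["25%"]
print(spread)
```

In this made-up sample, the 8-cylinder group has the larger interquartile range, which would show up as a taller box.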