Python: Advanced Predictive Analytics

Visualizing a dataset by basic plotting

Plots are a great way to visualize a dataset and gauge possible relationships between its columns. Many kinds of plots can be drawn, for example, scatter plots, histograms, and boxplots.

Let's import the Customer Churn Model dataset and try some basic plots:

import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt')

While plotting any kind of plot, it helps to keep these things in mind:

  • If you are using IPython Notebook, write %matplotlib inline in an input cell and run it before plotting, so that the plot appears inline (in the output cell).
  • To save a plot in your local directory as a file, you can use the savefig method. Take the example, shown later in the Scatter plots section, in which we plot four scatter plots in a 2x2 panel. The figure object is named at the beginning of that snippet, as the figure variable returned by subplots. To save this image, one can write the following code:
    figure.savefig('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Scatter Plots.jpeg')

As you can see, when saving the file, one can specify the directory in which to save it, the name of the image, and the format in which to save it (jpeg in this case, taken from the file extension).
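
As a minimal sketch of the same idea, the snippet below saves one figure in two formats. The plotted values, the temporary output directory, and the file name scatter_plot are assumptions made for illustration; they are not part of the book's dataset or paths. The Agg backend is selected only so that the sketch runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical stand-in values, not taken from the Customer Churn Model dataset.
figure, ax = plt.subplots()
ax.scatter([100, 150, 200], [17.0, 25.5, 34.0])
ax.set_xlabel("Day Mins")
ax.set_ylabel("Day Charge")

out_dir = tempfile.mkdtemp()  # assumed output directory for this sketch
saved = {}
for ext in ("png", "pdf"):
    path = os.path.join(out_dir, "scatter_plot." + ext)
    # savefig picks the output format from the file extension;
    # dpi controls the resolution of raster formats such as png.
    figure.savefig(path, dpi=150)
    saved[ext] = path
```

Changing the extension is enough to change the format, so the same call works for jpeg as used above.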

Scatter plots

We suspect Day Mins and Day Charge to be highly correlated, as calls are generally charged based on their duration. To confirm or validate our hypothesis, we can draw a scatter plot of Day Mins against Day Charge. To draw this scatter plot, we write something similar to the following code:

data.plot(kind='scatter',x='Day Mins',y='Day Charge')

The output looks similar to the following figure, where the points lie on a straight line, confirming our suspicion that the two variables are (linearly) related. As we will see later in the chapter on linear regression, such a situation gives a perfect linear fit for the two variables:

Fig. 2.18: Scatter plot of Day Charge versus Day Mins

The same is the case when we plot Night Mins and Night Charge against one another. However, when we plot Night Calls against Night Charge or Day Calls against Day Charge, we don't see much of a relationship.
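
What the scatter plots suggest visually can also be checked numerically with a correlation coefficient. The sketch below uses synthetic data, since the real file is not reproduced here: the column names mirror the dataset, but the values, the flat per-minute rate of 0.17, and the frame name toy are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
day_mins = rng.uniform(0, 350, size=n)

# Synthetic stand-in for the churn data (assumed values, not the real file).
toy = pd.DataFrame({
    "Day Mins": day_mins,
    "Day Charge": 0.17 * day_mins,               # assumed flat per-minute rate
    "Day Calls": rng.integers(40, 160, size=n),  # drawn independently of the charge
})

corr = toy.corr()  # pairwise Pearson correlation between the columns
mins_vs_charge = corr.loc["Day Mins", "Day Charge"]
calls_vs_charge = corr.loc["Day Calls", "Day Charge"]
```

Here mins_vs_charge is 1 up to floating-point error because the charge is an exact multiple of the minutes, while calls_vs_charge stays close to 0. On the real data, the same check would be data[['Day Mins','Day Charge']].corr().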

Using the matplotlib library, we can get good-quality plots with a lot of flexibility. Let us see how to draw multiple plots (in different panels) in the same image:

import matplotlib.pyplot as plt
figure,axs = plt.subplots(2, 2,sharey=True,sharex=True)
data.plot(kind='scatter',x='Day Mins',y='Day Charge',ax=axs[0][0])
data.plot(kind='scatter',x='Night Mins',y='Night Charge',ax=axs[0][1])
data.plot(kind='scatter',x='Day Calls',y='Day Charge',ax=axs[1][0])
data.plot(kind='scatter',x='Night Calls',y='Night Charge',ax=axs[1][1])

Here, we are plotting four graphs in one image in a 2x2 panel using the subplots function of the matplotlib library. As you can see in the preceding snippet, we have defined the panel to be 2x2 and set the sharex and sharey parameters to True, so that all four panels share the same x and y axes. For each plot, we specify its location by passing the appropriate element of axs as the ax parameter of the plot method. The result looks similar to the following screenshot:

Fig. 2.19: Four plots in a 2x2 panel using the subplots method
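
Because subplots returns the axes as a 2x2 array, the four plot calls can also be written as a loop over column pairs. The following is a sketch under stated assumptions: the frame toy holds synthetic values, the per-minute rates 0.17 and 0.045 are made up, and the panel titles are an addition that the original snippet does not set.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the churn data (assumed values, not the real file).
toy = pd.DataFrame({
    "Day Mins": rng.uniform(0, 350, 50),
    "Night Mins": rng.uniform(0, 400, 50),
    "Day Calls": rng.integers(40, 160, 50),
    "Night Calls": rng.integers(40, 160, 50),
})
toy["Day Charge"] = 0.17 * toy["Day Mins"]       # assumed rate
toy["Night Charge"] = 0.045 * toy["Night Mins"]  # assumed rate

pairs = [
    ("Day Mins", "Day Charge"),
    ("Night Mins", "Night Charge"),
    ("Day Calls", "Day Charge"),
    ("Night Calls", "Night Charge"),
]

figure, axs = plt.subplots(2, 2, sharex=True, sharey=True)
# axs.flat walks the 2x2 array row by row: [0][0], [0][1], [1][0], [1][1].
for ax, (x_col, y_col) in zip(axs.flat, pairs):
    toy.plot(kind="scatter", x=x_col, y=y_col, ax=ax)
    ax.set_title(y_col + " vs " + x_col)
figure.tight_layout()
```

The loop keeps the panel order of the original snippet, so adding or swapping a pair only means editing the pairs list.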

Histograms

Plotting a histogram is a great way to visualize the distribution of a numerical variable. It shows the most frequent ranges (or bins, as they are called) in which the variable lies. One can also check whether the variable is normally distributed or skewed to one side.

Let's plot a histogram for the Day Calls variable. We can do so by writing the following code:

import matplotlib.pyplot as plt
plt.hist(data['Day Calls'],bins=8)
plt.xlabel('Day Calls Value')
plt.ylabel('Frequency')
plt.title('Frequency of Day Calls')

The plt.hist call is of prime importance. There we specify the variable for which we want to plot the histogram and the number of bins or ranges we want. The bins parameter can be passed as a fixed number or as a list of numbers to be used as bin edges. Suppose a numerical variable has a minimum value of 1 and a maximum value of 1000. While plotting a histogram for this variable, one can either specify bins=10 or bins=20, or one can specify bins=[0,100,200,300,…,1000] or bins=[0,50,100,150,200,…,1000].

The output of the preceding code snippet appears similar to the following snapshot:

Fig. 2.20: Histogram of the Day Calls variable
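
The two ways of passing bins can be compared without drawing anything, using numpy's histogram function, which accepts the same kind of bins argument as plt.hist. This is a sketch on synthetic values between 1 and 1000; the variable names and the random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.integers(1, 1001, size=500)  # 500 synthetic integers in [1, 1000]

# bins as a fixed number: numpy splits [min, max] into 10 equal-width bins.
counts_fixed, edges_fixed = np.histogram(values, bins=10)

# bins as a list of edges: 0, 100, 200 and so on up to 1000.
edges_explicit = list(range(0, 1001, 100))
counts_explicit, edges_out = np.histogram(values, bins=edges_explicit)
```

With a fixed number, the edges depend on the smallest and largest values actually observed. With an explicit list, the edges stay the same from one dataset to the next, which makes histograms easier to compare.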

Boxplots

Boxplots are another way to understand the distribution of a numerical variable. A boxplot displays the variable's quartiles.

Note

If the numbers in a distribution with 100 numbers are arranged in increasing order, the 1st quartile will sit at about the 25th position, the 3rd quartile at about the 75th position, and so on. The median will be the average of the 50th and 51st terms. (I hope you brush up on some of the statistics you have read until now, because we are going to use a lot of it, but here is a small refresher.) The median is the middle term when the numbers in the distribution are arranged in increasing order. The mode is the value that occurs with the maximum frequency, while the mean is the sum of all the numbers divided by their total count.
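
These definitions can be tried out on a small example. The sketch below uses the numbers 1 to 100 and a short made-up list; note that numpy's default percentile method interpolates between neighbouring terms, so the quartiles it reports fall between positions rather than exactly on the 25th or 75th term.

```python
import statistics

import numpy as np

values = np.arange(1, 101)  # the numbers 1..100, already in increasing order

# Default (linear interpolation) percentiles.
q1, median, q3 = np.percentile(values, [25, 50, 75])
# q1 lies between the 25th and 26th terms, q3 between the 75th and 76th,
# and the median is the average of the 50th and 51st terms (50.5).

small = [2, 3, 3, 5, 7]                   # made-up list with one repeated value
small_median = statistics.median(small)   # middle term of the sorted list: 3
small_mode = statistics.mode(small)       # most frequent value: 3
small_mean = statistics.mean(small)       # 20 divided by 5: 4
```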

Plotting a boxplot in Python is easy. We need to write this to plot a boxplot for Day Calls:

import matplotlib.pyplot as plt
plt.boxplot(data['Day Calls'])
plt.ylabel('Day Calls')
plt.title('Box Plot of Day Calls')

The output looks similar to the following snapshot:

Fig. 2.21: Box Plot for the Day Calls variable

The blue box is of prime importance. The lower horizontal edge of the box specifies the 1st quartile, while the upper horizontal edge specifies the 3rd quartile. The horizontal line in red specifies the median value. The difference between the 3rd and 1st quartile values is called the Inter Quartile Range or IQR. The lower and upper horizontal edges in black (the ends of the whiskers) specify the minimum and maximum values, respectively, excluding outliers: by default, matplotlib extends each whisker only to the most extreme value within 1.5*IQR of the box and draws anything beyond that as an individual point.
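
The numbers behind each part of the box can be inspected with the boxplot_stats helper in matplotlib.cbook, which computes box plot statistics with the same default whisker length of 1.5*IQR. The array below is made up for illustration and is not taken from the Day Calls column.

```python
import numpy as np
from matplotlib import cbook

# Made-up values with one unusually low and one unusually high observation.
calls = np.array([30, 95, 98, 100, 101, 102, 104, 105, 107, 110, 165])

stats = cbook.boxplot_stats(calls)[0]  # one dict per dataset; default whis=1.5

box_bottom = stats["q1"]        # lower edge of the box (1st quartile)
box_top = stats["q3"]           # upper edge of the box (3rd quartile)
median_line = stats["med"]      # line inside the box
whisker_low = stats["whislo"]   # lowest value within 1.5*IQR of the box
whisker_high = stats["whishi"]  # highest value within 1.5*IQR of the box
fliers = stats["fliers"]        # points beyond the whiskers, drawn individually
```

In this example the whiskers stop at 95 and 110, while 30 and 165 are reported as fliers, so with the default settings the whisker ends are the smallest and largest values that are not outliers.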

Boxplots are important for the following reasons:

  • Boxplots are potent tools for spotting outliers in a distribution. Any value that lies more than 1.5*IQR below the 1st quartile or more than 1.5*IQR above the 3rd quartile can be classified as an outlier.
  • For a categorical variable, boxplots are a great way to visualize and compare the distribution of each category at one go.
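
Both points can be sketched in a few lines of pandas. The frame toy, its values, and the categorical column Plan are hypothetical and stand in for the real dataset; the outlier rule applied is the usual one of flagging values more than 1.5*IQR below the 1st quartile or above the 3rd quartile.

```python
import pandas as pd

# Hypothetical data: "Plan" is an assumed categorical column for illustration.
toy = pd.DataFrame({
    "Plan": ["yes"] * 5 + ["no"] * 6,
    "Day Calls": [95, 98, 100, 101, 165, 30, 102, 104, 105, 107, 110],
})

calls = toy["Day Calls"]
q1 = calls.quantile(0.25)
q3 = calls.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # anything below this is flagged
upper = q3 + 1.5 * iqr  # anything above this is flagged
outliers = toy[(calls < lower) | (calls > upper)]

# Per-category summary: the median is the line a per-category boxplot would draw.
per_plan_median = toy.groupby("Plan")["Day Calls"].median()
```

To draw one box per category instead of summarizing it, pandas offers toy.boxplot(column='Day Calls', by='Plan'), which places the boxes side by side in a single plot.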

There are a variety of other types of plots that can be drawn depending on the problem at hand. We will learn about them as and when needed. For exploratory analysis, these three types are enough to provide the evidence we need to further or discard our initial hypotheses. These three types can have multiple variations, and together with the power of looping and panel-wise plotting, they make plotting, and hence the data exploration process, very efficient.