上QQ阅读APP看书，第一时间看更新

Statsmodels

The first library we'll cover is the statsmodels library (http://statsmodels.sourceforge.net/). Statsmodels is a Python package that is well documented and developed for exploring data, estimating models, and running statistical tests. Let's use it here to build a simple linear regression model of the relationship between sepal length and sepal width for the setosa species.

First, let's visually inspect the relationship with a scatterplot:

fig, ax = plt.subplots(figsize=(7,7)) 
ax.scatter(df['sepal width (cm)'][:50], df['sepal length (cm)'][:50]) 
ax.set_ylabel('Sepal Length') 
ax.set_xlabel('Sepal Width') 
ax.set_title('Setosa Sepal Width vs. Sepal Length', fontsize=14, y=1.02)

The preceding code generates the following output:

So, we can see that there appears to be a positive linear relationship; that is, as the sepal width increases, the sepal length does as well. We'll next run a linear regression on the data using statsmodels to estimate the strength of that relationship:

import statsmodels.api as sm 
 
y = df['sepal length'][:50] 
x = df['sepal width'][:50] 
X = sm.add_constant(x) 
 
results = sm.OLS(y, X).fit() 
print results.summary()

The preceding code generates the following output:

In the preceding diagram, we have the results of our simple regression model. Since this is a linear regression, the model takes the format of Y = Β₀+ Β₁X, where B₀ is the intercept and B₁ is the regression coefficient. Here, the formula would be Sepal Length = 2.6447 + 0.6909 * Sepal Width. We can also see that the R2 for the model is a respectable 0.558, and the p-value, (Prob), is highly significant—at least for this species.

Let's now use the results object to plot our regression line:

fig, ax = plt.subplots(figsize=(7,7)) 
ax.plot(x, results.fittedvalues, label='regression line') 
ax.scatter(x, y, label='data point', color='r') 
ax.set_ylabel('Sepal Length') 
ax.set_xlabel('Sepal Width') 
ax.set_title('Setosa Sepal Width vs. Sepal Length', fontsize=14, y=1.02) 
ax.legend(loc=2)

The preceding code generates the following output:

By plotting results.fittedvalues, we can get the resulting regression line from our regression.

There are a number of other statistical functions and tests in the statsmodels package, and I invite you to explore them. It is an exceptionally useful package for standard statistical modeling in Python. Let's now move on to the king of Python machine learning packages: scikit-learn.