Statsmodels
The first library we'll cover is the statsmodels library (http://statsmodels.sourceforge.net/). Statsmodels is a Python package that is well documented and developed for exploring data, estimating models, and running statistical tests. Let's use it here to build a simple linear regression model of the relationship between sepal length and sepal width for the setosa species.
First, let's visually inspect the relationship with a scatterplot:
fig, ax = plt.subplots(figsize=(7,7)) ax.scatter(df['sepal width (cm)'][:50], df['sepal length (cm)'][:50]) ax.set_ylabel('Sepal Length') ax.set_xlabel('Sepal Width') ax.set_title('Setosa Sepal Width vs. Sepal Length', fontsize=14, y=1.02)
The preceding code generates the following output:
So, we can see that there appears to be a positive linear relationship; that is, as the sepal width increases, the sepal length does as well. We'll next run a linear regression on the data using statsmodels to estimate the strength of that relationship:
import statsmodels.api as sm y = df['sepal length'][:50] x = df['sepal width'][:50] X = sm.add_constant(x) results = sm.OLS(y, X).fit() print results.summary()
The preceding code generates the following output:
In the preceding diagram, we have the results of our simple regression model. Since this is a linear regression, the model takes the format of Y = Β0+ Β1X, where B0 is the intercept and B1 is the regression coefficient. Here, the formula would be Sepal Length = 2.6447 + 0.6909 * Sepal Width. We can also see that the R2 for the model is a respectable 0.558, and the p-value, (Prob), is highly significant—at least for this species.
Let's now use the results object to plot our regression line:
fig, ax = plt.subplots(figsize=(7,7)) ax.plot(x, results.fittedvalues, label='regression line') ax.scatter(x, y, label='data point', color='r') ax.set_ylabel('Sepal Length') ax.set_xlabel('Sepal Width') ax.set_title('Setosa Sepal Width vs. Sepal Length', fontsize=14, y=1.02) ax.legend(loc=2)
The preceding code generates the following output:
By plotting results.fittedvalues, we can get the resulting regression line from our regression.
There are a number of other statistical functions and tests in the statsmodels package, and I invite you to explore them. It is an exceptionally useful package for standard statistical modeling in Python. Let's now move on to the king of Python machine learning packages: scikit-learn.