
Stepping back to go forward - another look at our data

So, we step back and take another look at the data. It seems that there is an inflection point between weeks 3 and 4. Let's separate the data and train two lines using week 3.5 as a separation point:

>>> inflection = int(3.5*7*24) # calculate the inflection point in hours
>>> xa = x[:inflection] # data before the inflection point
>>> ya = y[:inflection]
>>> xb = x[inflection:] # data after
>>> yb = y[inflection:]

>>> fa = sp.poly1d(sp.polyfit(xa, ya, 1))
>>> fb = sp.poly1d(sp.polyfit(xb, yb, 1))

>>> fa_error = error(fa, xa, ya)
>>> fb_error = error(fb, xb, yb)
>>> print("Error inflection=%f" % (fa_error + fb_error))

Error inflection=132950348.197616
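Here, error() is the sum-of-squared-distances helper defined earlier in the chapter; a minimal version looks roughly like this:

>>> import numpy as np
>>> def error(f, x, y):
...     return np.sum((f(x) - y) ** 2)  # squared distance between model and data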

The first line (solid) is trained on the data up to week 3.5, the inflection point, and the second line (dashed) is trained on the remaining data:
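The plot itself is not reproduced here; a minimal sketch along these lines would draw the two fits on top of the data, assuming matplotlib is available (the axis labels are only indicative):

>>> import matplotlib.pyplot as plt
>>> plt.scatter(x, y, s=10)                  # the raw web traffic data
>>> plt.plot(xa, fa(xa), linewidth=3)        # fit on the data before the inflection point
>>> plt.plot(xb, fb(xb), "--", linewidth=3)  # fit on the data after the inflection point
>>> plt.xlabel("Time (hours)")
>>> plt.ylabel("Hits/hour")
>>> plt.show()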

Clearly, the combination of these two lines seems to be a much better fit to the data than anything we have modeled before. But still, the combined error is higher than that of the higher-order polynomials. Can we trust the error in the end?

Asked differently, why do we trust the straight line fitted only to the last week of our data more than any of the more complex models? It is because we assume that it will capture future data better. If we plot the models into the future, we can see how right we are (d = 1 is again our initial straight line):
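That forecast plot can be produced with the plot_web_traffic() helper that we use again later in this section; here is a sketch, assuming the fits on the full data from the previous section are still available (the names f1, f2, f3, f10, and f100 are an assumption). The mx argument extends the x axis to six weeks so that the extrapolation becomes visible:

>>> plot_web_traffic(x, y, [f1, f2, f3, f10, f100],
...                  mx=np.linspace(0, 6 * 7 * 24, 100),
...                  ymax=10000)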

The models of degree 10 and 53 don't seem to expect a bright future for our start-up. They tried so hard to model the given data correctly that they are clearly useless for extrapolating beyond it. This is called overfitting.

On the other hand, the lower-degree models do not seem to be capable of capturing the data well enough. This is called underfitting.

So, let's be fair to the models of degree 2 and above and look at how they behave if we fit them only to the data of the last week. After all, we believe that the last week says more about the future than the data prior to it. The result can be seen in the following psychedelic chart, which further shows how bad the problem of overfitting is:

The following commands fit the models, print their errors, and produce the chart:

>>> fb1 = np.poly1d(np.polyfit(xb, yb, 1))
>>> fb2 = np.poly1d(np.polyfit(xb, yb, 2))
>>> fb3 = np.poly1d(np.polyfit(xb, yb, 3))
>>> fb10 = np.poly1d(np.polyfit(xb, yb, 10))
>>> fb100 = np.poly1d(np.polyfit(xb, yb, 100))

>>> print("Errors for only the time after inflection point")
>>> for f in [fb1, fb2, fb3, fb10, fb100]:
...     print("d=%i: %f" % (f.order, error(f, xb, yb)))

>>> plot_web_traffic(x, y, [fb1, fb2, fb3, fb10, fb100],
... mx=np.linspace(0, 6 * 7 * 24, 100),
... ymax=10000)

This prints the following errors for the models trained only on the time after the inflection point:

d = 1:    22140590.598233
d = 2:    19764355.660080
d = 3:    19762196.404203
d = 10:   18942545.482218
d = 53:   18293880.824253

Note that although we requested a polynomial of degree 100, f.order reports 53: polyfit typically warns that a fit of such a high degree is numerically ill-conditioned, and the resulting polynomial effectively has a lower order.
Still, judging from the errors of the models when trained only on the data from week 3.5 onwards, we should choose the most complex one (note that we also calculate the errors only on the data points after the inflection point).
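If we trusted that error alone, a minimal sketch like the following would make the choice explicit, simply picking the fit with the lowest error on the post-inflection data:

>>> candidates = [fb1, fb2, fb3, fb10, fb100]
>>> best = min(candidates, key=lambda f: error(f, xb, yb))  # lowest error wins
>>> print("Best by training error: d=%i" % best.order)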