Python and its packages for predictive modelling
In this section, we will discuss some commonly used packages for predictive modelling.
pandas: The most important and versatile package in data science is pandas, so it is no wonder that you see import pandas at the beginning of almost any data science code snippet, in this book and elsewhere. Among other things, the pandas package facilitates:
- Reading a dataset into a usable format (a data frame, in the case of Python)
- Calculating basic statistics
- Running basic operations such as subsetting a dataset, merging or concatenating two datasets, handling missing data, and so on
The various methods in pandas will be explained in this book as and when we use them.
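As a rough illustration of the operations listed above, the following sketch reads a small dataset, computes a statistic, subsets it, fills in a missing value, and merges it with a second dataset. The column names and values are made up for this example, and the CSV text is inlined (rather than read from a file) only so that the snippet is self-contained:

```python
import io

import pandas as pd

# Hypothetical data, inlined so the example runs without an external file
csv_text = """id,age,income
1,25,50000
2,32,
3,47,82000
"""

# Read the dataset into a data frame
customers = pd.read_csv(io.StringIO(csv_text))

# Basic statistics: mean of a column
mean_age = customers["age"].mean()

# Subsetting: keep only the rows where age is above 30
over_30 = customers[customers["age"] > 30]

# Handling missing data: fill the missing income with the column median
customers["income"] = customers["income"].fillna(customers["income"].median())

# Merging: join with a second (also hypothetical) dataset on the id column
cities = pd.DataFrame({"id": [1, 2, 3], "city": ["Pune", "Oslo", "Lima"]})
merged = customers.merge(cities, on="id")

print(merged)
```

In practice, pd.read_csv would usually be given a file path instead of an in-memory string.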
Note
To get an overview, navigate to the official page of pandas here: http://pandas.pydata.org/index.html
NumPy: NumPy, in many ways, is a MATLAB equivalent in the Python environment. It has powerful methods to do mathematical calculations and simulations. The following are some of its features:
- A powerful and widely used N-d array object
- A collection of powerful mathematical functions for linear algebra, Fourier transforms, and random number generation
- Its random number generators and N-d arrays can be combined to generate dummy datasets for demonstrating various procedures, a practice we will follow extensively in this book
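The idea of combining random number generators with N-d arrays to build a dummy dataset can be sketched as follows. The coefficients, sample size, and noise level are arbitrary choices made for this illustration, not values taken from the book:

```python
import numpy as np

# Seeded generator so the dummy dataset is reproducible
rng = np.random.default_rng(seed=42)

# 100 observations of 2 features, stored as a 2-d array
X = rng.normal(loc=0.0, scale=1.0, size=(100, 2))

# Build a target from assumed "true" coefficients plus a little noise
true_coef = np.array([3.0, -2.0])
noise = rng.normal(scale=0.1, size=100)
y = X @ true_coef + noise

# Linear algebra: recover the coefficients with least squares
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(coef)
```

Because the noise is small, the recovered coefficients should land close to the assumed ones.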
Note
To get an overview, navigate to the official page of NumPy at http://www.NumPy.org/
matplotlib: matplotlib is a Python library that easily generates high-quality 2-D plots. Again, it is very similar to MATLAB.
- It can be used to draw all kinds of common plots, such as histograms, stacked and unstacked bar charts, scatterplots, heat maps, box plots, power spectra, error charts, and so on
- It can be used to edit and manipulate all the plot properties such as title, axes properties, color, scale, and so on
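A minimal sketch of both points, plotting two common chart types and editing their properties, might look like the following. The data is randomly generated, the output file name is hypothetical, and the non-interactive Agg backend is selected here only so that the snippet also runs on a machine without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render to a file instead of a window

import matplotlib.pyplot as plt
import numpy as np

# Dummy data for the plots
rng = np.random.default_rng(seed=0)
values = rng.normal(size=500)

# One figure with two side-by-side plots
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram, with title, axis labels, and color set explicitly
ax_hist.hist(values, bins=20, color="steelblue")
ax_hist.set_title("Histogram of dummy values")
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("count")

# Scatterplot of each value against the next one, with a fixed axis range
ax_scatter.scatter(values[:-1], values[1:], s=8, color="darkorange")
ax_scatter.set_title("Consecutive values")
ax_scatter.set_xlim(-4, 4)
ax_scatter.set_ylim(-4, 4)

fig.tight_layout()
fig.savefig("example_plots.png")  # hypothetical output file name
```

In an interactive session, plt.show() can be used instead of saving the figure to a file.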
Note
To get an overview, navigate to the official page of matplotlib at: http://matplotlib.org
IPython: IPython provides an environment for interactive computing.
It provides a browser-based notebook that serves as a combined IDE and development environment, supporting code, rich media, inline plots, and model summaries. These notebooks and their content can be saved and used later, either to demonstrate the results as they are or to extract the code and execute it separately. It has emerged as a powerful tool for web-based tutorials, as the code and its results flow smoothly one after the other in this environment. We will use this environment at many places in this book.
Note
To get an overview, navigate to the official page of IPython here: http://ipython.org/
scikit-learn: scikit-learn is the mainstay of predictive modelling in Python. It is a robust collection of many widely used data science algorithms, together with methods to implement them. Some of the features of scikit-learn are as follows:
- It is built on Python packages such as NumPy, SciPy, and matplotlib, and it works well alongside pandas
- It is very simple and efficient to use
- It has methods to implement most of the predictive modelling techniques, such as linear regression, logistic regression, clustering, and decision trees
- It provides concise methods to predict outcomes from a fitted model and to measure the accuracy of those predictions
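The fit, predict, and measure workflow described above can be sketched with a logistic regression on a dummy dataset. The data, the split ratio, and the random seeds are assumptions made for this illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Dummy dataset: two features, and a binary label that depends on their sum
rng = np.random.default_rng(seed=1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a quarter of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Fit the model on the training rows
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict the outcome for the held-out rows and measure the accuracy
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(accuracy)
```

Most other scikit-learn estimators follow the same fit and predict pattern, so swapping in a different model typically changes only the line that constructs it.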
Note
To get an overview, navigate to the official page of scikit-learn here: http://scikit-learn.org/stable/index.html
Any other Python packages used in this book will be specific to the situation at hand and can be installed using the method described earlier in this section.