The Statistics and Calculus with Python Workshop

Python's Other Statistics Tools

In the previous chapter, we considered Python's three main libraries, which make up the majority of a common data science/scientific computing pipeline: NumPy for multi-dimensional array computation, pandas for tabular data manipulation, and Matplotlib for data visualization.

Along the way, we have also discussed a number of supporting tools that complement those three libraries well: seaborn for the implementation of complex visualizations, SciPy for statistical and scientific computing capability, and scikit-learn for advanced data analysis needs.

There are also other tools and libraries that did not fit into our discussions well but offer unique and powerful capabilities for particular tasks in scientific computing. In this section, we will briefly consider some of them to get a broader picture of which Python tools are available for which specific tasks.

These tools include:

  • statsmodels: This library was originally part of SciPy's overarching ecosystem but ultimately split off into its own project. It offers a wide range of statistical tests, analysis techniques, models, and plotting functionalities, all grouped into one comprehensive tool with a consistent API. It also includes time-series analysis capabilities, which SciPy somewhat lacks.

    The main website for statsmodels can be found here: http://www.statsmodels.org/stable/index.html.

  • PyMC3: Bayesian statistics is a subfield of statistics with many unique concepts and procedures that offer powerful capabilities for modeling and making predictions, but it is not well supported by the libraries that we have considered.

    PyMC3 implements Bayesian statistical modeling and probabilistic programming techniques, and it forms its own ecosystem with plotting, testing, and diagnostic capabilities. This makes it arguably one of the most popular probabilistic programming tools among Python users.

    More information on how to get started with PyMC3 can be found on its home page, at https://docs.pymc.io/.

  • SymPy: Moving away from statistics and machine learning, if you are looking for a Python library that supports symbolic mathematics, SymPy is most likely your best bet. The library covers a wide range of core mathematical subfields such as algebra, calculus, discrete math, and geometry, as well as physics-related applications. SymPy is also known for its simple API and extensible source code, making it a popular choice for users looking for a symbolic math library in Python.

    You can learn more about SymPy from its website at https://www.sympy.org/en/index.html.

  • Bokeh: Our last entry on this list is a visualization library. Unlike Matplotlib or seaborn, Bokeh is designed specifically for interactive visualizations displayed in web browsers. Bokeh is typically the go-to tool for visualization engineers who need to process a large amount of data in Python and want to present it as interactive reports or dashboards in web applications.

    To read the official documentation and see the gallery of some of its examples, you can visit the main website at https://docs.bokeh.org/en/latest/index.html.
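
As a small illustration of the symbolic mathematics mentioned in the SymPy entry above, the following sketch differentiates, integrates, and solves an equation symbolically. The expressions are chosen only for illustration and do not come from the text; diff(), integrate(), and solve() are standard SymPy functions.

```python
import sympy as sp

# Declare a symbolic variable; x is not bound to any numeric value.
x = sp.symbols("x")

# Symbolic differentiation of x^2 * sin(x) with respect to x
# (SymPy applies the product rule for us).
f = x**2 * sp.sin(x)
derivative = sp.diff(f, x)

# Exact definite integral of exp(-x) from 0 to infinity.
area = sp.integrate(sp.exp(-x), (x, 0, sp.oo))

# Exact roots of the quadratic x^2 - 5x + 6 = 0.
roots = sp.solve(x**2 - 5 * x + 6, x)

print(derivative)
print(area)   # 1
print(roots)  # [2, 3]
```

The results are exact symbolic objects rather than floating-point approximations, which is the main difference from the numerical tools covered earlier.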

These libraries offer great support to their respective subfields of statistics and mathematics. As always, you may find other tools that better fit your specific needs. One of the biggest advantages of using a programming language as popular as Python is that many developers are continually building new tools and libraries for a wide range of purposes. The libraries we have discussed so far will help us achieve most of the basic tasks in statistical computing and modeling, and from there we can incorporate other, more advanced tools to extend our projects further.

Before we close out this chapter, we will go through an activity as a way to reinforce some of the important concepts that we have learned so far.

Activity 3.01: Revisiting the Communities and Crimes Dataset

In this activity, we will once again consider the Communities and Crimes dataset that we analyzed in the previous chapter. This time, we will apply the concepts we have learned in this chapter to gain additional insights from this dataset:

  1. In the same directory that you stored the dataset in, create a new Jupyter notebook. Alternatively, you can download the dataset again at https://packt.live/2CWXPdD.
  2. In the first code cell, import the libraries that we will be using: numpy, pandas, matplotlib, and seaborn.
  3. As we did in the previous chapter, read in the dataset and print out its first five rows.
  4. Replace every '?' character with NumPy's nan object (np.nan).
  5. Focus on the following columns: 'population' (which includes the total population count of a given region), 'agePct12t21', 'agePct12t29', 'agePct16t24', and 'agePct65up', each of which includes the percentage of different age groups in that population.
  6. Write the code that creates new columns in our dataset that contain the actual number of people in these age groups. These should be the product of the data in the column 'population' and each of the age percentage columns.
  7. Use the groupby() method from pandas to compute the total number of people in different age groups for each state.
  8. Call the describe() method on our dataset to print out its various descriptive statistics.
  9. Focus on the 'burglPerPop', 'larcPerPop', 'autoTheftPerPop', 'arsonsPerPop', and 'nonViolPerPop' columns, each of which describes the number of various crimes (burglary, larceny, auto theft, arson, and non-violent crimes) committed per 100,000 people.
  10. Visualize the distribution of the data in each of these columns using boxplots, with all five boxplots in a single visualization. From the plot, identify which type of crime out of the five is the most common and which is the least common.
  11. Focus on the 'PctPopUnderPov', 'PctLess9thGrade', 'PctUnemployed', 'ViolentCrimesPerPop', and 'nonViolPerPop' columns. The first three describe the percentage of the population in a given region that falls into the corresponding categories (percentages of people living under the poverty level, over 25 years old with less than a ninth-grade education, and in the labor force but unemployed). The last two give us the number of violent and non-violent crimes per 100,000 people.
  12. Identify the pair of columns that correlate with each other the most. To answer this question, compute the appropriate statistical object and visualize it accordingly.
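
The data-preparation steps above (4, 6, and 7) and the correlation question in step 12 rely on a handful of pandas calls. The sketch below shows them on a tiny made-up DataFrame, not the actual dataset: the state labels and all values are hypothetical, and dividing by 100 assumes the age columns hold percentages on a 0-100 scale. It is one possible approach, not the book's solution, and the plotting steps are left out.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; every value here is invented.
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "population": [1000, 2000, 500, 1500],
    "agePct65up": ["10", "20", "?", "40"],  # '?' marks a missing value
    "PctPopUnderPov": [5.0, 10.0, 15.0, 20.0],
    "ViolentCrimesPerPop": [100.0, 250.0, 300.0, 400.0],
})

# Step 4: replace every '?' with np.nan, then convert the column to numbers.
toy = toy.replace("?", np.nan)
toy["agePct65up"] = pd.to_numeric(toy["agePct65up"])

# Step 6: head count for the age group (assumes a 0-100 percentage scale).
toy["num65up"] = toy["population"] * toy["agePct65up"] / 100

# Step 7: total head count per state; sum() skips NaN entries by default.
totals = toy.groupby("state")["num65up"].sum()

# Step 12: pairwise Pearson correlations between numeric columns.
corr = toy[["PctPopUnderPov", "ViolentCrimesPerPop"]].corr()

print(totals)
print(corr)
```

On the real dataset, the same calls would be applied to each of the age percentage columns and to all five columns listed in step 11.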

    Note

    The solution for this activity can be found on page 659.