Preface
About the Book
Do you find it difficult to understand how popular companies like WhatsApp and Amazon find valuable insights from large amounts of unorganized data? The Unsupervised Learning Workshop will give you the confidence to deal with cluttered and unlabeled datasets, using unsupervised algorithms in an easy and interactive manner.
The book starts by introducing the most popular clustering algorithms of unsupervised learning. You'll find out how hierarchical clustering differs from k-means, along with understanding how to apply DBSCAN to highly complex and noisy data. Moving ahead, you'll use autoencoders for efficient data encoding.
As you progress, you'll use t-SNE models to extract high-dimensional information into a lower dimension for better visualization, in addition to working with topic modeling for implementing Natural Language Processing. In later chapters, you'll find key relationships between customers and businesses using Market Basket Analysis, before going on to use Hotspot Analysis for estimating the population density of an area.
By the end of this book, you'll be equipped with the skills you need to apply unsupervised algorithms on cluttered datasets to find useful patterns and insights.
Audience
If you are a data scientist who is just getting started and want to learn how to implement machine learning algorithms to build predictive models, then this book is for you. To expedite the learning process, a solid understanding of the Python programming language is recommended, as you'll be editing classes and functions instead of creating them from scratch.
About the Chapters
Chapter 1, Introduction to Clustering, introduces clustering (the most well-known family of unsupervised learning algorithms), before digging into the simplest and most popular clustering algorithm—k-means.
Chapter 2, Hierarchical Clustering, covers another clustering technique, hierarchical clustering, and explains how it differs from k-means. The chapter teaches you two main approaches to this type of clustering: agglomerative and divisive.
Chapter 3, Neighborhood Approaches and DBSCAN, explores clustering approaches that involve neighbors. Unlike the two other clustering approaches, the neighborhood approaches allow outlier points that are not assigned to any particular cluster.
Chapter 4, Dimensionality Reduction and PCA, teaches you how to navigate large feature spaces by leveraging principal component analysis to reduce the number of features while maintaining the explanatory power of the whole feature space.
Chapter 5, Autoencoders, shows you how neural networks can be leveraged to find data encodings. Data encodings are like combinations of features that reduce the dimensionality of the feature space. Autoencoders also decode the data and put it back into its original form.
Chapter 6, t-Distributed Stochastic Neighbor Embedding, discusses the process of reducing high-dimensional datasets down to two or three dimensions for the purpose of visualization. Unlike PCA, t-SNE is a non-linear, probabilistic model.
Chapter 7, Topic Modeling, explores the fundamental methodology of natural language processing. You will learn how to work with text data and fit Latent Dirichlet Allocation and Non-negative Matrix Factorization models to tag topics relevant to the text.
Chapter 8, Market Basket Analysis, explores a classic analytical technique used in retail businesses. You will, in a scalable way, build association rules that explain the relationships between groups of items.
Chapter 9, Hotspot Analysis, teaches you to estimate the true population density of some random variable using sample data. This technique is applicable to many fields, including epidemiology, weather, crime, and demography.
Conventions
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"Plot the coordinate points using the scatterplot functionality we imported from matplotlib.pyplot."
Words that you see on the screen (for example, in menus or dialog boxes) appear in the same format.
A block of code is set as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist
seeds = pd.read_csv('Seed_Data.csv')
New terms and important words are shown like this:
"Unsupervised learning is the field of practice that helps find patterns in cluttered data and is one of the most exciting areas of development in machine learning today."
Long code snippets are truncated and the corresponding names of the code files on GitHub are placed at the top of the truncated code. The permalinks to the entire code are placed below the code snippet. It should look as follows:
Exercise1.04-Exercise1.05.ipynb
def k_means(X, K):
    # Keep track of history so you can see K-Means in action
    centroids_history = []
    labels_history = []
    rand_index = np.random.choice(X.shape[0], K)
    centroids = X[rand_index]
    centroids_history.append(centroids)
The complete code for this step can be found at https://packt.live/2JM8Q1S.
Code Presentation
Lines of code that span multiple lines are split using a backslash ( \ ). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.
For example:
history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
validation_split=0.2, shuffle=False)
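As a minimal sketch of this convention (the variable names below are illustrative and do not come from the book's code), the following compares an explicit backslash continuation with the implicit continuation that Python allows inside parentheses. Both forms evaluate to the same value:

```python
# Explicit continuation: the backslash joins the two physical lines.
total_explicit = 1 + 2 + 3 + \
                 4 + 5

# Implicit continuation: inside parentheses, no backslash is required.
total_implicit = (1 + 2 + 3 +
                  4 + 5)

# Both expressions are parsed as a single logical line.
print(total_explicit == total_implicit)
```

Inside parentheses, as in the model.fit call above, the backslash is optional; the book includes it so that every split line is marked the same way.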
Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:
# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])
Multi-line comments are enclosed by triple quotes, as shown below:
"""
Define a seed for the random number generator to ensure the
result will be reproducible
"""
seed = 1
np.random.seed(seed)
random.set_seed(seed)
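To illustrate why a seed makes results reproducible, here is a small self-contained sketch. It assumes Python's standard-library random module (which exposes seed, not set_seed) rather than the libraries used in the book, and the names first_run and second_run are illustrative:

```python
import random

"""
Seed the standard-library generator twice with the same value
and confirm that both runs produce the same numbers
"""
seed = 1  # same seed value as in the example above

random.seed(seed)
first_run = [random.random() for _ in range(3)]

random.seed(seed)
second_run = [random.random() for _ in range(3)]

# Identical seeds give identical sequences.
print(first_run == second_run)
```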
Setting up Your Environment
Before we explore the book in detail, we need to set up specific software and tools. In the following section, we shall see how to do that.
Hardware Requirements
For the optimal user experience, we recommend 8 GB RAM.
Installing Python
The following section will help you to install Python in Windows, macOS, and Linux systems.
Installing Python on Windows
- Find your desired version of Python on the official installation page at https://www.python.org/downloads/windows/.
- Ensure that you install the correct "-bit" version depending on your computer system, either 32-bit or 64-bit. You can find out this information in the System Properties window of your OS.
- After you download the installer, simply double-click the file and follow the user-friendly prompts on the screen.
Installing Python on Linux
- Open a Terminal and check whether Python 3 is already installed by running python3 --version.
- To install Python 3, run the following:
sudo apt-get update
sudo apt-get install python3.7
- If you encounter problems, there are numerous sources online that can help you troubleshoot the issue.
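Once Python is installed on any platform, the version can also be checked from inside the interpreter. The following is a minimal sketch; the helper name is_python3 is an illustrative choice, not part of the book's code:

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3."""
    return version_info[0] == 3


# Show the version of the interpreter that is running this script.
print(sys.version.split()[0])
print(is_python3())
```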
Installing Python on macOS
Here are the steps to install Python on macOS:
- Open the Terminal by pressing Cmd + Space, typing terminal in the search box that opens, and hitting Enter.
- Install Xcode through the command line by running xcode-select --install.
- The easiest way to install Python 3 is with Homebrew, which is installed through the command line by running /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Add Homebrew to your PATH environment variable. Open your profile in the command line by running sudo nano ~/.profile and inserting export PATH="/usr/local/opt/python/libexec/bin:$PATH" at the bottom.
- The final step is to install Python. In the command line, run brew install python.
- Note that if you install Anaconda, the latest version of Python will be installed automatically.
Installing pip
Recent Python installers usually bundle pip (the package manager for Python), but some installations, such as certain Linux distribution packages, omit it. If the pip command is not available on your system, you need to install it manually. Once pip is installed, the remaining libraries can be installed as mentioned in the Installing Libraries section. The steps to install pip are as follows:
- Go to https://bootstrap.pypa.io/get-pip.py and save the file as get-pip.py.
- Go to the folder where you have saved get-pip.py. Open the command line in that folder (Bash for Linux users and Terminal for Mac users).
- Execute the following command in the command line:
python get-pip.py
Please note that you should have Python installed before executing this command.
- Once pip is installed, you can install the desired libraries. To install pandas, you can simply execute pip install pandas. To install a specific version of a library, for example, version 0.24.2 of pandas, you can execute pip install pandas==0.24.2.
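After installing a library, you may want to confirm which version is present. The sketch below uses importlib.metadata from the standard library, which assumes Python 3.8 or later (newer than the 3.7 used elsewhere in this preface). The helper name installed_version is illustrative:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if pandas is installed, otherwise None.
print(installed_version("pandas"))
```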
Installing Anaconda
Anaconda is a Python distribution that bundles the conda package manager, which makes it easy to install and use the libraries needed for this course.
Installing Anaconda on Windows
- Anaconda installation for Windows is very user-friendly. Visit the download page to get the installation executable at https://www.anaconda.com/distribution/#download-section.
- Double-click the installer on your computer.
- Follow the prompts on screen to complete the installation of Anaconda.
- After installation, you can access Anaconda Navigator, which will be available alongside the rest of your applications as normal.
Installing Anaconda on Linux
- Visit the Anaconda download page to get the installation shell script, at https://www.anaconda.com/distribution/#download-section.
- To download the shell script directly to your Linux instance, you can use the curl or wget command-line tools. The example here shows how to use curl to retrieve the file located at the URL you found on the Anaconda download page:
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
- After downloading the shell script, you can run it with the following command:
bash Anaconda3-2019.03-Linux-x86_64.sh
Running the preceding command will move you to a very user-friendly installation process. You will be prompted on where you want to install Anaconda and how you wish Anaconda to work. In this case, you should just keep all the standard settings.
Installing Anaconda on macOS
- Anaconda installation for macOS is very user-friendly. Visit the download page to get the installation executable, at https://www.anaconda.com/distribution/#download-section.
- Make sure macOS is selected and click the Download button for the Anaconda installer.
- Follow the prompts on screen to complete the installation of Anaconda.
- After installation, you can access Anaconda Navigator, which will be available alongside the rest of your applications as normal.
Setting up a Virtual Environment
- After Anaconda is installed, you must create environments where you will install packages you wish to use. The great thing about Anaconda environments is that you can build individual environments for specific projects you're working on. To create a new environment, use the following command:
conda create --name my_packt_env python=3.7
Here, we are naming our environment my_packt_env and specifying the version of Python to be 3.7. Because each environment is isolated from the others, you can keep multiple versions of Python installed side by side, one per environment.
- Once the environment is created, you can activate it using the well-named activate command:
conda activate my_packt_env
That's it. You are now in your own customized environment that will allow you to install packages as needed for your projects. To exit your environment, you can simply use the conda deactivate command.
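To confirm from within Python which environment is active, you can inspect the interpreter prefix and the CONDA_DEFAULT_ENV environment variable that conda sets on activation. This is a minimal sketch; the helper name active_conda_env is illustrative:

```python
import os
import sys


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if unset."""
    return environ.get("CONDA_DEFAULT_ENV")


# sys.prefix points at the directory of the running interpreter's environment.
print("Interpreter prefix:", sys.prefix)
print("Active conda environment:", active_conda_env())
```

If the second line prints None, the script is probably running outside an activated conda environment.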
Installing Libraries
pip comes pre-installed with Anaconda. Once Anaconda is installed on your machine, all the required libraries can be installed using pip, for example, pip install numpy. Alternatively, you can install all the required libraries using pip install -r requirements.txt. You can find the requirements.txt file at https://packt.live/2CnpCEp.
The exercises and activities will be executed in Jupyter Notebooks. Jupyter is a Python library and can be installed in the same way as the other Python libraries – that is, with pip install jupyter, but fortunately, it comes pre-installed with Anaconda. To open a notebook, simply run the command jupyter notebook in the Terminal or Command Prompt.
In Chapter 9, Hotspot Analysis, the basemap module from mpl_toolkits is used to generate maps. This library can be difficult to install. The easiest way is to install Anaconda, which includes mpl_toolkits. Once Anaconda is installed, basemap can be installed using conda install basemap. If you want to avoid installing libraries repeatedly, and instead want to install them all at once, you can follow the instructions in the next section.
Setting up the Machine
If you install dependencies chapter by chapter, you may end up with library versions that differ from the ones used in this book. To keep your system in sync, we provide a requirements.txt file that lists the versions of the libraries used. Once you have installed the libraries from this file, you don't have to install any other libraries throughout the book. Assuming you have installed Anaconda by now, you can follow these steps:
- Download the requirements.txt file from GitHub.
- Go to the folder where requirements.txt is placed and open Command Prompt (Bash for Linux and Terminal for Mac).
- Execute the following command:
conda install --yes --file requirements.txt --channel conda-forge
It should install all the packages necessary for the coding activities in the book.
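If you want to see which versions a requirements file pins before installing from it, a short script can list them. The sketch below assumes lines of the form name==version; the sample contents are hypothetical and are not the book's actual requirements.txt, and the helper name parse_pins is illustrative:

```python
def parse_pins(requirements_text):
    """Map package names to pinned versions for lines shaped like name==version."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and entries without an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# Hypothetical file contents, used only to demonstrate the parser.
sample = """
# illustrative pins only
pandas==0.24.2
numpy==1.16.4
"""
print(parse_pins(sample))
```

To inspect the real file, read it with open("requirements.txt").read() and pass the result to parse_pins.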
Accessing the Code Files
You can find the complete code files of this book at https://packt.live/34kXeMw. You can also run many activities and exercises directly in your web browser by using the interactive lab environment at https://packt.live/2ZMUWW0.
We've tried to support interactive versions of all activities and exercises, but we recommend a local installation as well for instances where this support isn't available.
If you have any issues or questions about installation, please email us at workshops@packt.com.