Nearest neighbors
The Nearest neighbors algorithm predicts the class of a new sample by finding the samples in the training dataset that are most similar to it. We take those most similar samples and predict the same class that most of these nearby samples have. This vote is often a simple count, although more complicated methods do exist, such as weighted voting.
As an example, in the below diagram we wish to predict the class of the triangle, based on which class it is more like (represented here by having similar objects closer together). We seek the three nearest neighbors, which are the two diamonds and one square within the drawn circle. There are more diamonds than squares, and the predicted class for the triangle is, therefore, a diamond:
Nearest neighbors can be used for nearly any dataset; however, it can be computationally expensive to compute the distance between all pairs of samples. For example, if there are ten samples in the dataset, there are 45 unique distances to compute. However, if there are 1,000 samples, there are nearly 500,000! Various methods exist for improving this speed, such as the use of tree structures for distance computation. Some of these algorithms can be quite complex, but thankfully a version is already implemented in scikit-learn, enabling us to classify larger datasets. As these tree structures are the default in scikit-learn, we do not need to configure anything to use them.
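As a rough illustration, the number of unique pairwise distances for n samples is n(n-1)/2, and scikit-learn's algorithm parameter controls which tree structure (if any) is used. The default, 'auto', picks one for us, so the explicit setting below is only for demonstration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Unique pairwise distances for n samples: n * (n - 1) / 2
print(10 * 9 // 2)      # 45
print(1000 * 999 // 2)  # 499,500

# algorithm='auto' (the default) lets scikit-learn choose a structure;
# 'kd_tree', 'ball_tree' or 'brute' can also be requested explicitly
estimator = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
```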
Nearest neighbors can do poorly on datasets with categorical features, and another algorithm should be used for these instead. The issue is the difficulty of comparing differences between categorical values, something better left to an algorithm that gives weight to each feature's importance. Categorical features can be compared using certain distance metrics or pre-processing steps such as one-hot encoding, which we use in later chapters. Choosing the correct algorithm for the task is one of the difficult issues in data mining; often, it is easiest to test a set of algorithms and see which performs best on your task.
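As a minimal sketch of that pre-processing step, the example below one-hot encodes a made-up categorical colour feature before fitting a nearest neighbor classifier. It assumes a recent version of scikit-learn, where OneHotEncoder accepts string categories directly:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical dataset with a single categorical "colour" feature
X = np.array([['red'], ['green'], ['blue'], ['green'], ['red']])
y = np.array([0, 1, 1, 1, 0])

# One-hot encoding turns each category into its own binary feature,
# giving nearest neighbors a meaningful distance to compare
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X).toarray()

estimator = KNeighborsClassifier(n_neighbors=3)
estimator.fit(X_encoded, y)

# A new sample must be encoded with the same encoder before prediction
new_sample = encoder.transform(np.array([['green']])).toarray()
print(estimator.predict(new_sample))
```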