The Iris dataset
The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification.
The dataset is a collection of morphological measurements of several iris flowers. These measurements will enable us to distinguish multiple species of flower. Today, species are identified by their DNA fingerprints, but in the 1930s, DNA's role in genetics had not yet been discovered.
The following four attributes of each plant were measured:
- Sepal length
- Sepal width
- Petal length
- Petal width
In general, the individual numeric measurements we use to describe our data are called features. These features can be directly measured or computed from intermediate data.
This dataset has four features. Additionally, for each plant, the species is recorded. The problem we want to solve is: "Given these examples, if we see a new flower out in the field, could we make a good prediction about its species from its measurements?"
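To make the terminology concrete, here is a minimal sketch of how a handful of examples could be laid out in Python: one row of four features per plant, plus a parallel list of species labels. The six rows are taken from the commonly distributed copy of the dataset (measurements in centimeters). The names FEATURE_NAMES, features, and labels are illustrative choices, not something the text prescribes.

```python
# Column order follows the four attributes listed above (all in centimeters).
FEATURE_NAMES = ("sepal length", "sepal width", "petal length", "petal width")

# One row per plant, one column per feature.
features = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.4, 3.2, 4.5, 1.5),
    (6.3, 3.3, 6.0, 2.5),
    (5.8, 2.7, 5.1, 1.9),
]

# The recorded species for each row, in the same order as `features`.
labels = [
    "setosa",
    "setosa",
    "versicolor",
    "versicolor",
    "virginica",
    "virginica",
]

# Print each example as "species: feature=value, ...".
for row, species in zip(features, labels):
    described = ", ".join(
        f"{name}={value}" for name, value in zip(FEATURE_NAMES, row)
    )
    print(f"{species}: {described}")
```

Keeping the features and the labels in separate, aligned sequences mirrors the split between what we measure and what we want to predict.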
This is the classification problem: given labeled examples, can we design a rule to be later applied to other examples?
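As an illustration of what such a rule might look like, the sketch below uses a single threshold on petal length to separate one species from the other two. The function name is_setosa and the 2.0 cm cutoff are assumptions chosen for illustration, not a rule derived in the text. In practice the rule would be designed from all the labeled examples, not from three.

```python
# Position of petal length in a row ordered as
# (sepal length, sepal width, petal length, petal width).
PETAL_LENGTH = 2


def is_setosa(measurements, threshold=2.0):
    """Hypothetical rule: call a flower setosa if its petal is short.

    The default threshold (in cm) is an assumption for this sketch.
    """
    return measurements[PETAL_LENGTH] < threshold


# A few labeled examples used to sanity-check the rule.
labeled_examples = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
]

# Count how many labeled examples the rule gets right.
correct = sum(
    is_setosa(measurements) == (species == "setosa")
    for measurements, species in labeled_examples
)
print(f"{correct} of {len(labeled_examples)} labeled examples handled correctly")

# Apply the same rule to a new, unlabeled flower (made-up measurements).
new_flower = (5.0, 3.4, 1.5, 0.2)
print("setosa" if is_setosa(new_flower) else "not setosa")
```

This rule only answers a yes/no question about one species. Telling the remaining two species apart needs further rules or additional features.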
Later in the book, we will look at problems dealing with text. For the moment, the Iris dataset serves our purposes well. It is small (150 examples, four features each) and can be easily visualized and manipulated.