Sorting the Facets statistics overview
You can sort the features of the datasets in several interesting ways, as shown in Figure 3.2:
Figure 3.2: Sorting the features of the datasets
We will start by sorting the feature columns by feature order.
Sorting data by feature order
The feature order sorting option displays the features as defined in the DataFrame in the Reading the data files section of this chapter:
features = ["colored_sputum", "cough", "fever", "headache", "days",
"france", "chicago", "class"]
The order of features can be used as a way to explain why a decision is made.
XAI motivation for sorting features
You can use feature order to explain the reasoning behind a decision. For a given patient, we could sort the features in descending order, so that the first feature contains the highest probability value for that patient and the last feature contains the lowest.
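As a minimal sketch of this idea, assuming a patient's record is stored as a pandas Series of probability-like scores (the feature values below are hypothetical), the sort could look as follows:

import pandas as pd

# Hypothetical probability-like scores for a single patient
patient = pd.Series({"colored_sputum": 0.10, "cough": 0.80,
                     "fever": 0.95, "headache": 0.75, "days": 0.30})

# Sort the features in descending order: the first feature carries the
# strongest signal for this patient, the last feature the weakest
print(patient.sort_values(ascending=False))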
With that in mind, suppose we design our dataset with the features in the following order:
features = ["fever", "cough", "days", "headache", "colored_sputum",
"france", "chicago", "class"]
Such an order could help a general practitioner make a diagnosis by suggesting a diagnostic process, as follows:
- A patient has a mild fever and has been coughing for two days. The doctor cannot easily reach a diagnosis.
- After two days, the fever increases, the coughing increases, and the headaches are unbearable. The doctor can use the AI and XAI process described in Chapter 1, Explaining Artificial Intelligence with Python. The diagnosis could be that the patient has the West Nile virus.
Consider a different feature order:
features = ["colored_sputum", "fever", "cough", "days", "headache",
"france", "chicago", "class"]
colored_sputum and fever will immediately trigger an alert. Does this patient have pneumonia or bronchitis, or is the patient infected with one of the strains of coronavirus? The doctor sends the patient directly to the hospital for further medical examinations.
You can create as many scenarios as you wish with preprocessing scripts before loading and displaying the data of your project.
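For example, here is a minimal, hypothetical preprocessing sketch; the records are invented stand-ins, and in the chapter the DataFrame comes from the data files loaded earlier:

import pandas as pd

# Hypothetical records standing in for the chapter's dataset
df = pd.DataFrame({"colored_sputum": [1.0, 0.0], "cough": [3.5, 1.0],
                   "fever": [9.4, 8.4], "headache": [3.0, 0.1],
                   "days": [3, 2], "france": [0, 0],
                   "chicago": [1, 1], "class": [1, 0]})

# Reorder the columns so that the alert-triggering features come first
# when Facets displays the dataset by feature order
features = ["colored_sputum", "fever", "cough", "days", "headache",
            "france", "chicago", "class"]
df = df[features]
print(df.columns.tolist())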
We will now sort by non-uniformity.
Sorting by non-uniformity
The uniformity of a data distribution needs to be determined before deciding whether a dataset will provide reliable results or not. Facets measures the non-uniformity of a data distribution.
The following data distribution is uniform because the elements of the set are balanced between an equal number of 0 and 1 values:
dd1 = {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
dd1 could represent a coin toss dataset with heads (1) and tails (0). Predicting values with dd1 is relatively easy.
The values of dd1 do not vary much, and ML results obtained with it will be more reliable than those obtained with a non-uniform data distribution such as dd2:
dd2 = {1, 1, 1, 1, 5, 1, 1, 0, 0, 2, 3, 3, 9, 9, 9, 7}
It is difficult to predict values with dd2 because the values are not uniform. The values 2, 5, and 7 each appear only once, whereas the value 1 appears six times.
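Facets computes its own non-uniformity metric. As a rough, hypothetical way of quantifying the difference between dd1 and dd2 ourselves, we can compare the normalized Shannon entropy of their value counts, where 1.0 means the values are perfectly balanced:

import math
from collections import Counter

def normalized_entropy(data):
    # Shannon entropy of the value counts, scaled to [0, 1]
    counts = Counter(data)
    n = len(data)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))

dd1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
dd2 = [1, 1, 1, 1, 5, 1, 1, 0, 0, 2, 3, 3, 9, 9, 9, 7]

print(normalized_entropy(dd1))  # 1.0: heads and tails are balanced
print(normalized_entropy(dd2))  # lower: the values are unevenly spread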
When Facets sorts the features by non-uniformity, it displays the most non-uniform features first.
For this dataset, we will analyze the uniformity of the features to see which ones are the most stable.
Select Non-uniformity from the Sort by dropdown list:
Figure 3.3: Sorting the data by non-uniformity
Then click on Reverse order to see the data distributions:
Figure 3.4: The data distribution interface
We see that the first line has a better data distribution than the second and third ones.
We now click on expand to get a better visualization of each feature:
Figure 3.5: Selecting the standard view
cough has a relatively uniform data distribution with values spread out over the x axis:
Figure 3.6: Visualizing the data distribution of the features of the dataset
headache is not as uniform as cough:
Figure 3.7: An example of data that is not evenly distributed
Suppose a doctor has to diagnose a patient who has been coughing for several days and whose headache disappeared, then reappeared several days later. A headache that reappears after several days could mean that the patient has a virus, or it could mean nothing at all.
Facets provides more information on the data distribution. Uncheck expand to obtain an overview of the features:
Figure 3.8: Numerical information on the data distribution of the dataset
The fields displayed help to explain many aspects of the datasets:
- count on line 1 is the number of records in the training dataset.
- count on line 2 is the number of records in the test dataset.
- missing is the number of missing records. If the number of missing records is too large, then the dataset might be corrupt. You might have to check the dataset before continuing.
- mean indicates the average value of the numerical values of the feature.
- std dev measures the dispersion of the data. It represents the typical distance between the data points and the mean.
- zeros helps us to visualize the percentage of values that are equal to 0. If there are too many zero values, it might prove challenging to obtain reliable results.
- min represents the minimum value of a feature.
- median is the middle value of all of the values of a feature.
- max represents the maximum value of a feature.
For example, if the median is very close to the maximum value and far from the minimum value, your dataset might produce biased results.
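These fields can be cross-checked outside Facets. Here is a minimal sketch, assuming the training data sits in a pandas DataFrame; the df_train records below are hypothetical stand-ins for the chapter's dataset:

import numpy as np
import pandas as pd

# Hypothetical training records
df_train = pd.DataFrame({"fever": [9.4, 8.4, np.nan, 7.0],
                         "days": [3, 2, 0, 5]})

print(df_train.describe())           # count, mean, std, min, median (50%), max
print(df_train.isna().sum())         # missing values per feature
print((df_train == 0).mean() * 100)  # percentage of zeros per feature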
Analyzing the data distributions of the features will give you better insight into the AI model when things go wrong.
You can also sort the dataset by alphabetical order.
Sorting by alphabetical order
Sorting by alphabetical order can help you reach a feature faster, as shown in Figure 3.9:
Figure 3.9: Sorting features by alphabetical order
We can also find the features with missing or zero values.
Sorting by amount missing/zero
Missing records or features with zero values can distort the training of an AI model. Facets sorts the features by the number of missing or zero values:
Figure 3.10: Numeric features information
100% of the values of france are equal to 0, while 1.01% of the values of colored_sputum are missing. Observing these values leads to improving the quality of ML datasets. In turn, better datasets produce better outputs when training an ML model.
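Here is a quick sketch of how this sort could be reproduced with pandas; the DataFrame below is hypothetical and merely mimics the situation described above:

import numpy as np
import pandas as pd

# Hypothetical records: france is all zeros, colored_sputum has a gap
df = pd.DataFrame({"france": [0, 0, 0, 0],
                   "colored_sputum": [1.0, np.nan, 0.5, 2.0],
                   "fever": [9.4, 8.4, 10.1, 7.0]})

# Percentage of values per feature that are missing or equal to zero
missing_or_zero = (df.isna() | (df == 0)).mean() * 100
print(missing_or_zero.sort_values(ascending=False))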
We will now explore the distribution distance option.
Sorting by distribution distance
Calculating the distribution distance between the training set and the test set, for example, can be implemented with the Kullback-Leibler divergence, also named relative entropy.
We can calculate the distribution distance with three variables:
- S is the relative entropy
- X is the dtrain dataset
- Y is the dtest dataset
The equation used by SciPy for the Kullback-Leibler divergence is as follows:
S = sum(X * log(X / Y))
If the values of X or Y do not add up to 1, they will be normalized.
In the cell below Facets Overview, a few examples show that entropy increases as distribution distance increases.
We can start with two data distributions that are similar:
from scipy.stats import entropy

X = [1, 1, 1, 2, 1, 1, 4]  # first data distribution
Y = [1, 2, 3, 4, 2, 2, 5]  # second, similar data distribution
# entropy normalizes X and Y, then computes sum(X * log(X / Y))
print(entropy(X, Y))
The relative entropy is 0.05.
However, if the two data distributions begin to change, they will diverge, producing higher entropy values:
from scipy.stats import entropy

X = [10, 1, 1, 20, 1, 10, 4]  # the first distribution has drifted
Y = [1, 2, 3, 4, 2, 2, 5]     # the second distribution is unchanged
print(entropy(X, Y))
The relative entropy has increased. The value now is 0.53.
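To relate this to a dataset, the following hypothetical sketch (not Facets' internal implementation) estimates the distribution distance of a single feature between a training set and a test set by binning both samples into histograms over a shared range and applying the same entropy function:

import numpy as np
from scipy.stats import entropy

# Hypothetical fever readings for the training and test sets
train_fever = np.array([9.4, 8.4, 10.1, 7.0, 9.0, 8.8])
test_fever = np.array([6.5, 7.2, 7.0, 6.8, 7.5])

# Bin both samples over a shared range so the histograms line up
bins = np.linspace(6.0, 11.0, 6)
train_hist, _ = np.histogram(train_fever, bins=bins)
test_hist, _ = np.histogram(test_fever, bins=bins)

# A small constant avoids empty bins, which would make the divergence infinite
print(entropy(train_hist + 1e-9, test_hist + 1e-9))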
With this approach in mind, we can now examine the features Facets has sorted by descending relative entropy:
Figure 3.11: Sorting by distribution distance
If the training and test datasets diverge beyond a specific limit, this could explain why some of the predictions of an ML model are false.
We have explored the many functions of Facets Overview to detect missing data, the percentage of 0 values, non-uniformity, distribution distances, and more. We saw that the data distribution of each feature contained valuable information for fine-tuning the training and testing datasets.
We will now learn to build an instance of Facets Dive.