Feature Engineering Made Easy

Mathematical operations allowed

We have a few new abilities to work with at the ordinal level compared to the nominal level. At the ordinal level, we may still do basic counts as we did at the nominal level, but we can also introduce comparisons and orderings into the mix. For this reason, we may utilize new graphs at this level. We may use bar and pie charts like we did at the nominal level, but because we now have ordering and comparisons, we can calculate medians and percentiles. With medians and percentiles, stem-and-leaf plots, as well as box plots, are possible.

Some examples of data at the ordinal level include:

  • Ratings on a Likert scale (rating something on a scale from one to ten, for example)

  • Grade levels on an exam (F, D, C, B, A)
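To make the new operations concrete, here is a minimal sketch using only the Python standard library. The exam results are made up for illustration, and the names GRADE_ORDER and RANK are hypothetical helpers, not part of any dataset used in this chapter. It shows a comparison, an ordering, and a median on the grade levels listed above:

```python
from statistics import median_low

# Grade levels from worst to best; the list position encodes the order
GRADE_ORDER = ["F", "D", "C", "B", "A"]
RANK = {grade: position for position, grade in enumerate(GRADE_ORDER)}

# Made-up exam results, for illustration only
grades = ["B", "A", "C", "F", "B", "D", "A"]

# Comparison: is a B better than a C?
b_beats_c = RANK["B"] > RANK["C"]

# Ordering: sort the results from worst to best
ranked = sorted(grades, key=RANK.get)

# Median: the middle value of the ordered data.
# median_low always returns a value that occurs in the data,
# so the result maps back to a real grade.
median_grade = GRADE_ORDER[median_low(RANK[g] for g in grades)]

print(b_beats_c, ranked, median_grade)
```

Notice that we never average the grades: the ranks only tell us which grade comes before which, not how far apart they are.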

For a real-world example of data at the ordinal level, let's bring in a new dataset. This dataset holds key insights into how much people enjoy the San Francisco International Airport (SFO). It is publicly available on San Francisco's open data portal (https://data.sfgov.org/Transportation/2013-SFO-Customer-Survey/mjr8-p6m5):

# load in the data set
import pandas as pd
customer = pd.read_csv('../data/2013_SFO_Customer_survey.csv')

This CSV has many, many columns:

customer.shape

(3535, 95)

95 columns, to be exact. For more information on the columns available for this dataset, check out the data dictionary on the website (https://data.sfgov.org/api/views/mjr8-p6m5/files/FHnAUtMCD0C8CyLD3jqZ1-Xd1aap8L086KLWQ9SKZ_8?download=true&filename=AIR_DataDictionary_2013-SFO-Customer-Survey.pdf).

For now, let's focus on a single column, Q7A_ART. As described by the publicly available data dictionary, Q7A_ART is about artwork and exhibitions. The possible choices are 0, 1, 2, 3, 4, 5, 6 and each number has a meaning:

  • 1: Unacceptable
  • 2: Below Average
  • 3: Average
  • 4: Good
  • 5: Outstanding
  • 6: Have Never Used or Visited
  • 0: Blank

We can represent it as follows:

art_ratings = customer['Q7A_ART']
art_ratings.describe()


count    3535.000000
mean        4.300707
std         1.341445
min         0.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         6.000000
Name: Q7A_ART, dtype: float64
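Note that the mean of 4.30 reported above includes the 0 (Blank) and 6 (Have Never Used or Visited) codes, which are not ratings at all. One way to keep the meaning of each code in view is to map the numbers to their labels. The following is a minimal sketch on a handful of made-up responses (not rows from the real survey); the name ART_LABELS is a hypothetical helper built from the list above:

```python
import pandas as pd

# Code-to-label mapping copied from the data dictionary list above
ART_LABELS = {
    0: "Blank",
    1: "Unacceptable",
    2: "Below Average",
    3: "Average",
    4: "Good",
    5: "Outstanding",
    6: "Have Never Used or Visited",
}

# Made-up sample responses, NOT taken from the real survey
sample = pd.Series([4, 5, 0, 3, 6, 4, 1], name="Q7A_ART")

# Replace each numeric code with its label
labeled = sample.map(ART_LABELS)

# Counting is still valid, exactly as at the nominal level
print(labeled.value_counts())
```

Seen as labels, it is clear that "Blank" and "Have Never Used or Visited" do not belong on the same scale as the five ratings.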

Pandas treats the column as numerical because it is full of numbers. However, we must remember that even though the cells' values are numbers, those numbers represent categories, so this data belongs to the qualitative side, and more specifically to the ordinal level. If we remove the 0 and 6 categories, we are left with five ordinal categories that resemble the star ratings given to restaurants:

# only consider ratings 1-5
art_ratings = art_ratings[(art_ratings >=1) & (art_ratings <=5)]

We will then cast the values as strings:

# cast the values as strings
art_ratings = art_ratings.astype(str)

art_ratings.describe()

count     2656
unique       5
top          4
freq      1066
Name: Q7A_ART, dtype: object
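Casting to strings tells pandas to treat the values as categories, but plain strings do not record which rating is higher than which. As an alternative (not the approach used in this chapter), pandas offers an ordered categorical type. The following is a minimal sketch on made-up ratings; the variable names are hypothetical:

```python
import pandas as pd

# Made-up ratings on the 1-5 scale, NOT real survey rows
sample = pd.Series([4, 5, 3, 4, 1, 2, 5])

# Declare the five categories and their order explicitly
rating_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
ordinal = sample.astype(rating_type)

# Ordering makes min and max meaningful
lowest, highest = ordinal.min(), ordinal.max()

# Sorting follows the declared category order, so for this
# odd-length sample the middle element is the median rating
median_rating = ordinal.sort_values().iloc[len(ordinal) // 2]

# Comparisons work too: share of responses rated Good (4) or better
share_good = (ordinal >= 4).mean()

print(lowest, highest, median_rating, share_good)
```

With this type, pandas allows comparisons and sorting but refuses to compute a mean, which matches what the ordinal level permits.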

Now that we have our ordinal data in the right format, let's look at some visualizations:

# Can use pie charts, just like in nominal level
art_ratings.value_counts().plot(kind='pie')

The following is the result of the preceding code:

We can also visualize this as a bar chart as follows:

# Can use bar charts, just like in nominal level
art_ratings.value_counts().plot(kind='bar')

The following is the output of the preceding code:

However, now we can also introduce box plots since we are at the ordinal level:

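# Boxplots are available at the ordinal level
# convert the string labels back to numbers so that the box plot summarizes
# the ratings themselves rather than the five category counts
art_ratings.astype(float).plot(kind='box')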

The following is the output of the preceding code:

This box plot would not be possible for the Grade column in the salary data: that column is at the nominal level, where the values have no order, so a median cannot be found.
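A box plot is a picture of five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum. The following is a minimal sketch of computing them directly, on made-up ratings rather than the survey data. It assumes we want every result to be a rating that actually occurs, so it asks pandas for the 'lower' interpolation; a plotting library may interpolate between ratings instead, so its box edges can differ slightly:

```python
import pandas as pd

# Made-up ratings on the 1-5 scale, NOT real survey rows
sample = pd.Series([4, 5, 3, 4, 1, 2, 5])

# Minimum, first quartile, median, third quartile, maximum.
# interpolation='lower' picks an existing data point rather than
# averaging two neighbours, which suits ordinal values.
five_numbers = sample.quantile([0, 0.25, 0.5, 0.75, 1], interpolation="lower")

print(five_numbers)
```

Each of these values depends only on the order of the ratings, which is why they become available at the ordinal level and not before.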