The pandas DataFrame
A pandas Series can only have a single value associated with each index label. To have multiple values per index label we can use a data frame. A data frame represents one or more Series objects aligned by index label. Each series will be a column in the data frame, and each column can have an associated name.
The following creates a DataFrame object with two columns, built from the two temperature Series objects:
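A minimal sketch of this step. The names temps1, temps2, and temps_df and the temperature values themselves are illustrative assumptions, not taken from the text:

```python
import pandas as pd

# Illustrative temperature readings (made-up values, assumed for this sketch)
temps1 = pd.Series([80, 82, 85, 90, 83, 87])
temps2 = pd.Series([70, 75, 69, 83, 79, 77])

# The dictionary keys become the column names; the Series are aligned by index label
temps_df = pd.DataFrame({'Missoula': temps1, 'Philadelphia': temps2})
print(temps_df)
```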
The resulting data frame has two columns named Missoula and Philadelphia. These columns are new Series objects contained within the data frame with the values copied from the original Series objects.
Columns in a DataFrame object can be accessed using an array indexer [] with the name of the column or a list of column names. The following code retrieves the Missoula column:
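As a sketch, assuming a data frame named temps_df built from made-up temperature values:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# Indexing with a single column name returns that column as a Series
missoula = temps_df['Missoula']
print(missoula)
```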
And the following code retrieves the Philadelphia column:
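The same pattern applies, again assuming the illustrative temps_df with made-up values:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# Select the other column by name
philadelphia = temps_df['Philadelphia']
print(philadelphia)
```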
A Python list of column names can also be used to return multiple columns:
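A short sketch, assuming the illustrative temps_df with made-up values. Passing a list returns a DataFrame rather than a Series, with the columns in the order requested:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# A list of names selects several columns, here in reversed order
subset = temps_df[['Philadelphia', 'Missoula']]
print(subset)
```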
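If the name of a column is a valid Python identifier (it contains no spaces, for example) and does not clash with an existing DataFrame attribute, the column can also be accessed using property-style notation: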
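A sketch of attribute-style access, assuming the illustrative temps_df with made-up values:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# Attribute access returns the same Series as temps_df['Missoula']
by_property = temps_df.Missoula
print(by_property)
```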
Arithmetic operations between columns within a data frame are identical in operation to those on multiple Series. To demonstrate, the following code calculates the difference between temperatures using property notation:
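A sketch of the column arithmetic, assuming the illustrative temps_df with made-up values:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# Element-wise subtraction, aligned on index label just as with standalone Series
difference = temps_df.Missoula - temps_df.Philadelphia
print(difference)
```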
A new column can be added to a DataFrame simply by assigning another Series to a column using the array indexer [] notation. The following adds a new column to the DataFrame with the temperature differences:
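A sketch of the assignment, assuming the illustrative temps_df with made-up values and a new column named Difference:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# Assigning to a column name that does not exist yet appends a new column
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia
print(temps_df)
```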
The names of the columns in a DataFrame are accessible via the .columns property:
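A sketch, assuming the illustrative temps_df with made-up values and the Difference column already added:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia

# .columns is an Index object holding the column names in order
column_names = temps_df.columns
print(column_names)
```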
The DataFrame and Series objects can be sliced to retrieve specific rows. The following slices the second through fourth rows of temperature difference values:
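A sketch of the slice, assuming the illustrative temps_df with made-up values and a Difference column. The slice 1:4 covers positions 1, 2, and 3; the end position is excluded:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia

# Second through fourth rows of the Difference column
sliced = temps_df.Difference[1:4]
print(sliced)
```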
Entire rows from a data frame can be retrieved using the .loc and .iloc properties. .loc ensures that the lookup is by index label, whereas .iloc uses the 0-based position.
The following retrieves the second row of the data frame:
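A sketch using .iloc, assuming the illustrative temps_df with made-up values and a Difference column. Position 1 is the second row because positions are 0-based:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia

# Retrieve the second row by its 0-based position
second_row = temps_df.iloc[1]
print(second_row)
```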
Notice that the row is returned as a Series, with the column names of the data frame pivoted into the index labels of that Series. The following shows the index of the result:
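A sketch, assuming the illustrative temps_df with made-up values and a Difference column:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia

# The index of a single row holds the data frame's column names
row_index = temps_df.iloc[1].index
print(row_index)
```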
Rows can be explicitly accessed via index label using the .loc property. The following code retrieves a row by the index label:
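A sketch using .loc, assuming the illustrative temps_df with made-up values and a Difference column. Because this frame uses the default integer index, the label 3 happens to coincide with position 3; with a different index, the label would need to be one that exists in that index:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia

# Look up the row whose index label is 3
row_by_label = temps_df.loc[3]
print(row_by_label)
```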
Specific rows in a DataFrame object can be selected by passing a list of integer positions to .iloc. The following selects the values from the Difference column in rows at integer locations 1, 3, and 5:
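A sketch, assuming the illustrative temps_df with made-up values and a Difference column:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})
temps_df['Difference'] = temps_df.Missoula - temps_df.Philadelphia

# Pick rows at positions 1, 3, and 5, then take the Difference column
picked = temps_df.iloc[[1, 3, 5]].Difference
print(picked)
```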
Rows of a data frame can be selected based upon a logical expression that is applied to the data in each row. The following evaluates which values in the Missoula column are greater than 82 degrees, producing a Series of Boolean values:
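A sketch of the comparison, assuming the illustrative temps_df with made-up values:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# The comparison is applied element-wise and yields one True/False per row
mask = temps_df.Missoula > 82
print(mask)
```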
The result of such an expression can then be passed to the [] operator of a data frame (or a series), which returns only the rows where the expression evaluated to True:
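A sketch of the selection, assuming the illustrative temps_df with made-up values:

```python
import pandas as pd

# Illustrative data (assumed values, not from the text)
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]})

# Keep only the rows where the Missoula temperature exceeds 82
warm_days = temps_df[temps_df.Missoula > 82]
print(warm_days)
```

The original index labels are preserved in the result, so the selected rows can still be traced back to their place in temps_df.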
This technique is referred to as Boolean Selection in pandas terminology, and it will form the basis of selecting rows based upon values in specific columns. It is similar to a query in SQL using a WHERE clause but, as we will see, much more powerful.