Merging/joining datasets
Merging or joining is a mission critical step for predictive modelling and, more often than not, while working on actual problems, an analyst will be required to do it. The readers who are familiar with relational databases know how there are multiple tables connected by a common key column across which the required columns are scattered. There can be instances where two tables are joined by more than one key column. The merges and joins in Python are very similar to a table merge/join in a relational database except that it doesn't happen in a database but rather on the local computer and that these are not tables, rather data frames in pandas. For people familiar with Excel, you can find similarity with the VLOOKUP
function in the sense that both are used to get an extra column of information from a sheet/table joined by a key column.
There are various ways in which two tables/data frames can be merged/joined. The most commonly used ones are Inner Join, Left Join, Right Join, and so on. We will go in to detail and understand what each of these mean. But before that, let's go ahead and perform a simple merge to get a feel of how it is done.
We will be using a different dataset to illustrate the concept of merge and join. These datasets can be found in the Google Drive folder in Merge
and the Join
/Medals
folder. The main dataset Medals.csv
contains details of medals won by inpidual players at different Olympic events. The two subsidiary datasets contain details of the nationality and sports of the inpidual player. What if we want to see the nationality or sport played by the player together with all the other medal information for each player? The answer is to merge both the datasets and to get the relevant columns. In data science parlance, merging, joining, and mapping are used synonymously; although, there are minor technical differences.
Let us import all of them and have a cursory look at them:
import pandas as pd data_main=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/Medals/Medals.csv') data_main.head()
The Medals.csv
looks similar to the following screenshot:
Fig. 3.47: First few entries of the Medals dataset
As we can see, this is the information about the Olympic Year in which the medals were won, details of how many Gold, Silver, and Bronze medals were won and the Age of the player. There are 8,618 rows in the dataset. One more thing one might be interested to know about this dataset is how many unique athletes are there in the dataset, which will come in handy later when we learn and apply different kinds of joins:
a=data_main['Athlete'].unique().tolist() len(a)
The output of this snippet is 6956
, which means that there are many athletes for whom we have records in the datasets. The other entries come because many athletes may have participated in more than one Olympics.
Let us now import the Athelete Country Map.csv
and have a look at it:
country_map=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/Medals/Athelete_Country_Map.csv') country_map.head()
The output data frame looks similar to the following screenshot, with two columns: Athlete and Country:
Fig. 3.48: First few entries of the Athelete_Country_Map dataset
There are 6,970 rows in this dataset. If you try to find out the unique number of athletes in this data frame, it will still be 6,956. The 14 extra rows come from the fact that some players have played for two countries in different Olympics and have won medals. Search for Aleksandar Ciric
and you will find that he has played for both Serbia
and Serbia and Montenegro
.
Note
(Disclaimer: This might not be the actual case and this might be an issue with the mapping file, which can be taken care of by removing duplicate values, as we would show later in this chapter).
You can do this by using the following code snippet:
country_map[country_map['Athlete']=='Aleksandar Ciric']
Fig. 3.49: Subsetting the country_map data frame for Aleksandar Ciric
Let us finally import the Athelete Sports Map.csv
and have a look at it:
sports_map=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/Medals/Athelete_Sports_Map.csv') sports_map.head()
The sports_map
data frame looks as shown in the following screenshot:
Fig. 3.50: First few entries of the Athelete_Sports_Map dataset
There are 6,975 rows in this dataset because, yes you guessed it right, there are very few athletes in this mapping data frame who have played more than one game and have won medals. Watch out for athletes, such as Chen Jing
, Richard Thompson
and Matt Ryan
who have played more than one game.
This can be done by writing a code, such as the following snippet:
sports_map[(sports_map['Athlete']=='Chen Jing') | (sports_map['Athlete']=='Richard Thompson') | (sports_map['Athlete']=='Matt Ryan')]
The output looks similar to the following screenshot:
Fig. 3.51: Subsetting the sports_map data frame for athletes Richard Thompson and Matt Ryan
Let's now merge the data_main
and country_map
data frames to get the country for all the athletes. There is a merge
method in pandas, which facilitates this:
import pandas as pd merged=pd.merge(left=data_main,right=country_map,left_on='Athlete',right_on='Athlete') merged.head()
The output looks, as follows. It has a country column as expected:
Fig. 3.52: First few entries of the merged data frame. It has a country column.
The length of the merged data frame is 8,657, which is more than the total number of rows (8,618) in the data_main
data frame. This is because when we join these two data frames without any specified conditions, an inner join is performed wherein the join happens based on the common key-values present in both the data frames. Also, we saw that some athletes have played for two countries and the entries for such athletes will be duplicated for such athletes. If you look at Aleksandar Ciric
in the merged data frame, you will find something similar to this:
merged[merged['Athlete']=='Aleksandar Ciric']
Fig. 3.53 Subsetting the merged data frame for athlete Aleksandar Ciric
The problem is not with the type of join but with the kind of mapping file we have. This mapping file is one-many and hence the number increases because for each key multiple rows are created in such a case.
To rectify this issue, one can remove the duplicate entries from the country_map
data frame and then perform the merge with data_main
. Let's do that. This can be done using the drop_duplicates
method, as shown:
country_map_dp=country_map.drop_duplicates(subset='Athlete')
The length of the country_map_dp
is 6,956 rows, which is the same as the number of unique athletes. Let us now merge this with data_main
.
merged_dp=pd.merge(left=data_main,right=country_map_dp,left_on='Athlete',right_on='Athlete') len(merged_dp)
The number of rows in the merged_dp
is indeed 8,618, which is the actual number of rows in the data_main
.
The next step is to merge sports_map
with the merged_dp
to get the country and sports along with other details in the same data frame.
We have seen similar issue of increase in the number of rows for sports_map
, as was the case for country_map
data frame. To take care of that, let's remove the duplicates from the sports_map
before merging it with merged_dp
:
sports_map_dp=sports_map.drop_duplicates(subset='Athlete') len(sports_map_dp)
The length of the sports_map_dp
is 6,956, which is the same as the number of rows in the data_main
data frame, as expected.
The next step is to merge this with the merge_pd
data frame to get the sports played by the athlete in the final merged table:
merged_final=pd.merge(left=merged_dp,right=sports_map_dp,left_on='Athlete',right_on='Athlete') merged_final.head()
Fig. 3.54: First few entries of the merged_final dataset. The duplicates from country_map were deleted before the merge
As we can see, the Sport column is present in the merged_final
data frame after the merge. The merged_final
data frame has 8,618 rows as expected.
Let us now look at various kinds of merge/joins that we can apply to two data frames. Although you would come across many kinds of joins in different texts, it is sufficient to know the concept behind the three of them—Inner Join, Left Join, and Right Join. If you consider the two tables/data frames as sets, then these joins can be well represented by Venn Diagrams.
Inner Join
The characteristics of the Inner Join are as follows:
- Returns a data frame containing rows, which have a matching value in both the original data frames being merged.
- The number of rows will be equal to the minimum of the row numbers of the two data frames. If data frame A containing 100 rows is being merged with data frame B having 80 rows, the merged data frame will have 80 rows.
- The Inner Join can be thought of as an intersection of two sets, as illustrated in the following figure:
Fig. 3.55: Inner Join illustrated via a Venn diagram
Left Join
The characteristics of the Left Join are, as follows:
- Returns a data frame containing rows, which contains all the rows from the left data frame irrespective of whether it has a match in the right data frame or not.
- In the final data frame, the rows with no matches in the right data frame will return NAs in the columns coming from right data frame.
- The number of rows will be equal to the number of rows in the left data frame. If data frame A containing 100 rows is being merged with data frame B having 80 rows, the merged data frame would have 100 rows.
- The Left Join can be thought of as the set containing the entire left data frame, as illustrated in the following figure:
Fig. 3.56: Left Join illustrated via a Venn Diagram
Right Join
The characteristics of the Right Join are as follows:
- Returns a data frame containing rows, which contains all the rows from the right data frame irrespective of whether it has a match in the left data frame or not.
- In the final data frame, the rows with no matches in the left data frame will return NAs in the columns coming from left data frame.
- The number of rows will be equal to the number of rows in the left data frame. If data frame A containing 100 rows is being merged with data frame B having 80 rows, the merged data frame will have 80 rows
- The Right Join can be thought of as the set containing the entire right data frame, as illustrated in the following figure:
Fig. 3.57: Right Join illustrated via a Venn diagram
The comparison between join type and set operation is summarized in the following table:
Let us see some examples of how different kinds of mappings actually work. For that, a little data preparation is needed. Currently, both our mapping files contain matching entries for all the rows in the actual data frame data_main
. So, we can't see the effects of different kind of merges. Let's create a country and sports mapping file which doesn't have the information for some of the athletes and let's see how it reflects in the merged table. This can be done by creating a new data frame that doesn't have country/sports information for some of the athletes, as shown in the following code:
country_map_dlt=country_map_dp[(country_map_dp['Athlete']<>'Michael Phelps') & (country_map_dp['Athlete']<>'Natalie Coughlin') & (country_map_dp['Athlete']<>'Chen Jing') & (country_map_dp['Athlete']<>'Richard Thompson') & (country_map_dp['Athlete']<>'Matt Ryan')] len(country_map_dlt)
Using this snippet, we have created a country_map_dlt
data frame that doesn't have country mapping for five athletes, that is Michael Phelps
, Natalie Coughlin
, Chen Jing
, Richard Thompson
, and Matt Ryan
. The length of this data frame is 6,951; it is five less than the actual mapping file, indicating that the information for five athletes has been removed.
Let's do the same for sports_map
as well as the data_main
data frame using the following snippets:
sports_map_dlt=sports_map_dp[(sports_map_dp['Athlete']<>'Michael Phelps') & (sports_map_dp['Athlete']<>'Natalie Coughlin') & (sports_map_dp['Athlete']<>'Chen Jing') & (sports_map_dp['Athlete']<>'Richard Thompson') & (sports_map_dp['Athlete']<>'Matt Ryan')] len(sports_map_dlt) data_main_dlt=data_main[(data_main['Athlete']<>'Michael Phelps') & (data_main['Athlete']<>'Natalie Coughlin') & (data_main['Athlete']<>'Chen Jing') & (data_main['Athlete']<>'Richard Thompson') & (data_main['Athlete']<>'Matt Ryan')] len(data_main_dlt)
The length of data_main_dlt
becomes 8,605 because the data_main
contains multiple rows for an athlete.
An example of the Inner Join
One example of Inner join would be to merge data_main
data frame with country_map_dlt
. This can be done using the following snippet:
merged_inner=pd.merge(left=data_main,right=country_map_dlt,how='inner',left_on='Athlete',right_on='Athlete') len(merged_inner)
This merge should give us information for the athletes who are present in both the data frames. As the country_map_dlt
doesn't contain information about five athletes present in data_main
, these five athletes wouldn't be a part of the merged table.
The length of the merged_inner
comes out to be 8,605 (similar to data_main_dlt
) indicating that it doesn't contain information about those five athletes.
An example of the Left Join
One example of Left Join would be to merge data_main
data frame with country_map_dlt
. This can be done using the following snippet:
merged_left=pd.merge(left=data_main,right=country_map_dlt,how='left',left_on='Athlete',right_on='Athlete') len(merged_left)
This merge should give us the information about all the athletes that are present in the left data frame (data_main
) even if they aren't present in the right data frame (country_map_dlt
). So, the merged_left
data frame should contain 8,618 rows (similar to the data_main
) even if the country_map_dlt
doesn't contain information about five athletes present in data_main
. These five athletes will have a NaN value in the Country column.
The length of merged_left
indeed comes out to be 8,618. Let's check the merged_left
for an athlete whose information is not present in the country_map_dlt
. It should contain NaN for the Country column:
merged_left_slt=merged_left[merged_left['Athlete']=='Michael Phelps'] merged_left_slt
The output is similar to the following screenshot. It indeed contains NaN for Michael Phelps' Country because it doesn't have a mapping in country_map_dlt
:
Fig. 3.58: Merged_left data frame sub-setted for Michael Phelps contains NaN values, as expected
An example of the Right Join
One example of Right Join will be to merge data frame data_main
with country_map_dlt.
This can be done using the following snippet:
merged_right=pd.merge(left=data_main_dlt,right=country_map_dp,how='right',left_on='Athlete',right_on='Athlete') len(merged_right)
This should contain the NaN values for the columns coming from data_main_dlt
, in the rows where there is no athlete information in data_main_dlt
.
As shown in the following table:
Fig. 3.59: merged_right data frame sub-setted for Michael Phelps contains NaN values, as expected
There will be one row created for each athlete who is not there in the data_main_dlt
but is present in the country_map_dp
. Hence, there will be five extra rows, one for each deleted athlete. The number of rows in the merged_right
is thus equal to 8,610.
There are other joins like Outer Joins, which can be illustrated as the Union of two data frames. The Outer join would contain rows from both the data frames, even if they are not present in the other. It will contain NaN
for the columns which it can't get values for. It can be easily performed setting the how
parameter of the merge
method to outer
.
Summary of Joins in terms of their length
The effect of these joins can be more effectively explained if we summarize the number of samples present in the data frames that were used for merging and in the resultant data frames.
The first table provides the number of samples present in the data frames that were used for merging. All these data frames have been defined earlier in this section of the chapter: