The Pandas library provides a user-friendly API to concatenate two data frames together with the methods merge
and
join
.
When using these methods, it is possible to specify how the merge will be performed:
- The parameter
how
specifies the type of merge (left
, inner
, outer
, etc..).
- The parameter
on
specifies the column(s) on which the merge will be performed.
- The parameter
validate
specifies a way to verify if the merge result is what was expected.
import pandas as pd
age_df = pd.DataFrame({"user_id":[1,2,4], "age":[42,45, 35]})
name_df = pd.DataFrame({"user_id":[1,2,3,4], "name":["a","b","c","d"]})
result = age_df.merge(name_df, on="user_id", how="right", validate="1:1")
In the example above, both data frames will be merged together based on the column user_id
, specified by the parameter
on
.
The parameter how
set to right
states that the resulting data frame will contain all the user_id
s present in
the data frame name_df
(including 3
, which is absent from age_df
and will therefore be assigned a
NaN
value for the age
column).
Lastly, setting the validate
parameter to 1:1
means a check will be performed to verify that the column used for the
merge only contains unique keys in both data frames. If this check fails a MergeError
will be raised.
Here is the resulting data frame:
row |
user_id |
age |
name |
0 |
1 |
42 |
a |
1 |
2 |
45 |
b |
2 |
3 |
NaN |
c |
3 |
4 |
35 |
d |
More information about these methods and their parameters can be found in the pandas documentation: merge and join.
The how
, on
and validate
parameters are optional and pandas provides sensible default values.
This means merge
could be used as follow:
import pandas as pd
age_df = pd.DataFrame({"user_id":[1,2,4], "age":[42,45, 35]})
name_df = pd.DataFrame({"user_id":[1,2,3,4], "name":["a","b","c","d"]})
result = age_df.merge(name_df)
In this example:
- The
how
parameter defaults to inner
.
- The
on
parameter defaults to the columns which have a similar name, in our case user_id
.
- The
validate
parameter will be set to many_to_many
, meaning no validation will be performed.
Here is the resulting data frame:
row |
user_id |
age |
name |
0 |
1 |
42 |
a |
1 |
2 |
45 |
b |
2 |
4 |
35 |
d |
While the example above is perfectly valid, using the merge
and join
methods without providing the how
,
on
and validate
arguments has two main drawbacks:
- It makes the code intention unclear: without the
how
parameter set, it is unclear if the developer noticed that a
user_id
(3
) will be missing from the resulting data frame, or if it is done on purpose.
- It makes the code harder to maintain: if one of the data frame would change its
user_id
column name to id
, the code
would still run but the result would be entirely different.
In order to mitigate these drawbacks, setting the how
parameter to inner
would better convey that the intention is to
only keep user_id
s present in both data frame. Setting the on
parameter to user_id
could also avoid issues in
the future, for example if the input data frames came to change.
The information provided by these parameters is extremely valuable, especially when another developer is in charge of refactoring or debugging this
particular piece of code.
This is why it is a good practice to provide the parameters how
, on
(or left_on
and right_on
)
and validate
to the pandas' merge
and join
.