Piping with Pandas¶
In this article, I'll demonstrate how to use the pipe
method of the pandas.DataFrame
object. pipe
is really nifty if you want to make your preprocessing code more modular and easier to maintain. It also drastically increases the readability, so the handover of your code becomes much easier!
For this demonstration, I'll use the dataset below:
import pandas as pd
txt = """
Index Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 60 '2020/12/16' 98 120 215.2
4 60 '2020/12/17' 100 120 300.0
5 45 '2020/12/18' 90 112 NaN
6 60 '2020/12/21' 108 131 364.2
7 45 NaN 100 119 282.0
8 60 '2020/12/25' 102 126 334.5
9 60 2020/12/26 100 120 250.0
10 60 '2020/12/27' 92 118 241.0
11 60 '2020/12/28' 103 132 NaN
12 60 '2020/12/29' 100 132 280.0
"""
#Let's just get the data in pandas real quick!
df = pd.DataFrame([x.split() for x in txt.splitlines()[1:]])
df.head()
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | Index | Duration | Date | Pulse | Maxpulse | Calories |
1 | 0 | 60 | '2020/12/01' | 110 | 130 | 409.1 |
2 | 1 | 60 | '2020/12/02' | 117 | 145 | 479.0 |
3 | 2 | 60 | '2020/12/03' | 103 | 135 | 340.0 |
4 | 3 | 60 | '2020/12/16' | 98 | 120 | 215.2 |
This dataset sure needs some cleaning.
- The column names are in the first row.
- The
Duration
column needs to be cleaned and dates parsed properly. - We need to deal with missing values.
Now, the method pipe
takes in a function which should take a pd.DataFrame
(and perhaps some arguments), and return a pd.DataFrame
. Let's try to write a function that first fixes the columns and drops the index, which we don't really care about.
def prepare_column_names(dataf : pd.DataFrame) -> pd.DataFrame:
#Set the columns.
dataf.columns = dataf.iloc[0]
#Drop the first row and reset the index.
dataf = dataf.iloc[1:,].reset_index(drop=True)
#Drop the index column.
dataf = dataf.drop(['Index'], axis=1)
return dataf
Let's use the function prepare_column_names
on our dataframe.
df.pipe(lambda x : x.copy())\
.pipe(prepare_column_names)\
.head()
Duration | Date | Pulse | Maxpulse | Calories | |
---|---|---|---|---|---|
0 | 60 | '2020/12/01' | 110 | 130 | 409.1 |
1 | 60 | '2020/12/02' | 117 | 145 | 479.0 |
2 | 60 | '2020/12/03' | 103 | 135 | 340.0 |
3 | 60 | '2020/12/16' | 98 | 120 | 215.2 |
4 | 60 | '2020/12/17' | 100 | 120 | 300.0 |
That fixed our first problem! Note that I used pipe(lambda x : x.copy())
first. If we hadn't, then it would have altered df
itself. You can try and run the code yourself and see what happens without it. Now, let's write a function that cleans the Date
column,
def clean_datestring(dataf : pd.DataFrame) -> pd.DataFrame:
dataf['Date'] = dataf['Date'].str.replace("'", "")
return dataf
one that sets the dtypes of the dataframe,
def set_dtypes(dataf : pd.DataFrame,
dtypes : dict) -> pd.DataFrame:
#Set dtypes. Note that this function has an argument too.
#This should be passed to the pipe function when called.
for col, dtype in dtypes.items():
dataf[col] = dataf[col].astype(dtype)
return dataf
one that converts the date from str
to datetime64[ns]
,
def convert_date(dataf : pd.DataFrame) -> pd.DataFrame:
dataf['Date'] = pd.to_datetime(dataf['Date'],
format='%Y/%m/%d', errors='coerce')
return dataf
and finally, one that deals with the missing values in our dataframe.
def remove_na(dataf : pd.DataFrame) -> pd.DataFrame:
#Fill rows with NA calories with mean.
dataf['Calories'] = dataf['Calories'].fillna(dataf['Calories'].mean())
#Drop any remaining rows with NA.
dataf = dataf.dropna(how='any')
return dataf
Now we're the lucky owners of 5 functions that each can be placed in a seperate .py
file, and then we could import the preprocessing functions as we needed them. This code is easier to maintain, easier to change (e.g. changing Pulse
to float
) and it's just prettier to look at.
Of course, this might be a bit of an exaggerated example - maybe some of the functions could be combined - but I hope you get the idea.
df_clean = df.pipe(lambda x : x.copy())\
.pipe(prepare_column_names)\
.pipe(clean_datestring)\
.pipe(set_dtypes, dtypes={'Duration': int,
'Pulse': int,
'Maxpulse': int,
'Calories': float})\
.pipe(convert_date)\
.pipe(remove_na)
df_clean
Duration | Date | Pulse | Maxpulse | Calories | |
---|---|---|---|---|---|
0 | 60 | 2020-12-01 | 110 | 130 | 409.100000 |
1 | 60 | 2020-12-02 | 117 | 145 | 479.000000 |
2 | 60 | 2020-12-03 | 103 | 135 | 340.000000 |
3 | 60 | 2020-12-16 | 98 | 120 | 215.200000 |
4 | 60 | 2020-12-17 | 100 | 120 | 300.000000 |
5 | 45 | 2020-12-18 | 90 | 112 | 317.727273 |
6 | 60 | 2020-12-21 | 108 | 131 | 364.200000 |
8 | 60 | 2020-12-25 | 102 | 126 | 334.500000 |
9 | 60 | 2020-12-26 | 100 | 120 | 250.000000 |
10 | 60 | 2020-12-27 | 92 | 118 | 241.000000 |
11 | 60 | 2020-12-28 | 103 | 132 | 317.727273 |
12 | 60 | 2020-12-29 | 100 | 132 | 280.000000 |
and let's just check the dtypes as well.
df_clean.dtypes
0 Duration int64 Date datetime64[ns] Pulse int64 Maxpulse int64 Calories float64 dtype: object
![](https://i.kym-cdn.com/photos/images/newsfeed/000/591/928/94f.png)
Thanks for reading! If want to learn more about the pipe
method, I recommended checking out calmcode.io's tutorial here - they make some pretty cool decorators so they can log and time their functions!