Piping with Pandas¶

In this article, I'll demonstrate how to use the pipe method of the pandas.DataFrame object. pipe is really nifty if you want to make your preprocessing code more modular and easier to maintain. It also drastically increases the readability, so the handover of your code becomes much easier!

For this demonstration, I'll use the dataset below:

In [1]:

            
                Copied!
                
                    
                    
                
                

        
import pandas as pd

txt = """
Index Duration          Date  Pulse  Maxpulse  Calories
  0         60  '2020/12/01'    110       130     409.1
  1         60  '2020/12/02'    117       145     479.0
  2         60  '2020/12/03'    103       135     340.0
  3         60  '2020/12/16'     98       120     215.2
  4         60  '2020/12/17'    100       120     300.0
  5         45  '2020/12/18'     90       112       NaN
  6         60  '2020/12/21'    108       131     364.2
  7         45           NaN    100       119     282.0
  8         60  '2020/12/25'    102       126     334.5
  9         60    2020/12/26    100       120     250.0
  10        60  '2020/12/27'     92       118     241.0
  11        60  '2020/12/28'    103       132       NaN
  12        60  '2020/12/29'    100       132     280.0
"""

#Let's just get the data in pandas real quick!
df = pd.DataFrame([x.split() for x in txt.splitlines()[1:]])
df.head()
import pandas as pd

txt = """
Index Duration          Date  Pulse  Maxpulse  Calories
  0         60  '2020/12/01'    110       130     409.1
  1         60  '2020/12/02'    117       145     479.0
  2         60  '2020/12/03'    103       135     340.0
  3         60  '2020/12/16'     98       120     215.2
  4         60  '2020/12/17'    100       120     300.0
  5         45  '2020/12/18'     90       112       NaN
  6         60  '2020/12/21'    108       131     364.2
  7         45           NaN    100       119     282.0
  8         60  '2020/12/25'    102       126     334.5
  9         60    2020/12/26    100       120     250.0
  10        60  '2020/12/27'     92       118     241.0
  11        60  '2020/12/28'    103       132       NaN
  12        60  '2020/12/29'    100       132     280.0
"""

#Let's just get the data in pandas real quick!
df = pd.DataFrame([x.split() for x in txt.splitlines()[1:]])
df.head()

Out[1]:

	0	1	2	3	4	5
0	Index	Duration	Date	Pulse	Maxpulse	Calories
1	0	60	'2020/12/01'	110	130	409.1
2	1	60	'2020/12/02'	117	145	479.0
3	2	60	'2020/12/03'	103	135	340.0
4	3	60	'2020/12/16'	98	120	215.2

This dataset sure needs some cleaning.

The column names are in the first row.
The Duration column needs to be cleaned and dates parsed properly.
We need to deal with missing values.

Now, the method pipe takes in a function which should take a pd.DataFrame (and perhaps some arguments), and return a pd.DataFrame. Let's try to write a function that first fixes the columns and drops the index, which we don't really care about.

In [2]:

            
                Copied!
                
def prepare_column_names(dataf : pd.DataFrame) -> pd.DataFrame:
    #Set the columns.
    dataf.columns = dataf.iloc[0]

    #Drop the first row and reset the index.
    dataf = dataf.iloc[1:,].reset_index(drop=True)

    #Drop the index column.
    dataf = dataf.drop(['Index'], axis=1)
    return dataf
def prepare_column_names(dataf : pd.DataFrame) -> pd.DataFrame:
    #Set the columns.
    dataf.columns = dataf.iloc[0]

    #Drop the first row and reset the index.
    dataf = dataf.iloc[1:,].reset_index(drop=True)

    #Drop the index column.
    dataf = dataf.drop(['Index'], axis=1)
    return dataf

Let's use the function prepare_column_names on our dataframe.

In [3]:

            
                Copied!
                
df.pipe(lambda x : x.copy())\
  .pipe(prepare_column_names)\
  .head()
df.pipe(lambda x : x.copy())\
  .pipe(prepare_column_names)\
  .head()

Out[3]:

	Duration	Date	Pulse	Maxpulse	Calories
0	60	'2020/12/01'	110	130	409.1
1	60	'2020/12/02'	117	145	479.0
2	60	'2020/12/03'	103	135	340.0
3	60	'2020/12/16'	98	120	215.2
4	60	'2020/12/17'	100	120	300.0

That fixed our first problem! Note that I used pipe(lambda x : x.copy()) first. If we hadn't, then it would have altered df itself. You can try and run the code yourself and see what happens without it. Now, let's write a function that cleans the Date column,

In [4]:

            
                Copied!
                
def clean_datestring(dataf : pd.DataFrame) -> pd.DataFrame:
    dataf['Date'] = dataf['Date'].str.replace("'", "")
    return dataf
def clean_datestring(dataf : pd.DataFrame) -> pd.DataFrame:
    dataf['Date'] = dataf['Date'].str.replace("'", "")
    return dataf

one that sets the dtypes of the dataframe,

In [5]:

            
                Copied!
                
                    
                    
                
                

        
def set_dtypes(dataf : pd.DataFrame, 
               dtypes : dict) -> pd.DataFrame:
    #Set dtypes. Note that this function has an argument too.
    #This should be passed to the pipe function when called.
    for col, dtype in dtypes.items():
        dataf[col] = dataf[col].astype(dtype)
    return dataf
def set_dtypes(dataf : pd.DataFrame, 
               dtypes : dict) -> pd.DataFrame:
    #Set dtypes. Note that this function has an argument too.
    #This should be passed to the pipe function when called.
    for col, dtype in dtypes.items():
        dataf[col] = dataf[col].astype(dtype)
    return dataf

one that converts the date from str to datetime64[ns],

In [6]:

            
                Copied!
                
def convert_date(dataf : pd.DataFrame) -> pd.DataFrame:
    dataf['Date'] = pd.to_datetime(dataf['Date'], 
                                   format='%Y/%m/%d', errors='coerce')
    return dataf
def convert_date(dataf : pd.DataFrame) -> pd.DataFrame:
    dataf['Date'] = pd.to_datetime(dataf['Date'], 
                                   format='%Y/%m/%d', errors='coerce')
    return dataf

and finally, one that deals with the missing values in our dataframe.

In [7]:

            
                Copied!
                
def remove_na(dataf : pd.DataFrame) -> pd.DataFrame:
    #Fill rows with NA calories with mean.
    dataf['Calories'] = dataf['Calories'].fillna(dataf['Calories'].mean())

    #Drop any remaining rows with NA.
    dataf = dataf.dropna(how='any')
    return dataf
def remove_na(dataf : pd.DataFrame) -> pd.DataFrame:
    #Fill rows with NA calories with mean.
    dataf['Calories'] = dataf['Calories'].fillna(dataf['Calories'].mean())

    #Drop any remaining rows with NA.
    dataf = dataf.dropna(how='any')
    return dataf

Now we're the lucky owners of 5 functions that each can be placed in a seperate .py file, and then we could import the preprocessing functions as we needed them. This code is easier to maintain, easier to change (e.g. changing Pulse to float) and it's just prettier to look at.

Of course, this might be a bit of an exaggerated example - maybe some of the functions could be combined - but I hope you get the idea.

In [8]:

            
                Copied!
                
                    
                    
                
                

        
df_clean = df.pipe(lambda x : x.copy())\
             .pipe(prepare_column_names)\
             .pipe(clean_datestring)\
             .pipe(set_dtypes, dtypes={'Duration': int, 
                                       'Pulse': int, 
                                       'Maxpulse': int, 
                                       'Calories': float})\
             .pipe(convert_date)\
             .pipe(remove_na)

df_clean
df_clean = df.pipe(lambda x : x.copy())\
             .pipe(prepare_column_names)\
             .pipe(clean_datestring)\
             .pipe(set_dtypes, dtypes={'Duration': int, 
                                       'Pulse': int, 
                                       'Maxpulse': int, 
                                       'Calories': float})\
             .pipe(convert_date)\
             .pipe(remove_na)

df_clean

Out[8]:

	Duration	Date	Pulse	Maxpulse	Calories
0	60	2020-12-01	110	130	409.100000
1	60	2020-12-02	117	145	479.000000
2	60	2020-12-03	103	135	340.000000
3	60	2020-12-16	98	120	215.200000
4	60	2020-12-17	100	120	300.000000
5	45	2020-12-18	90	112	317.727273
6	60	2020-12-21	108	131	364.200000
8	60	2020-12-25	102	126	334.500000
9	60	2020-12-26	100	120	250.000000
10	60	2020-12-27	92	118	241.000000
11	60	2020-12-28	103	132	317.727273
12	60	2020-12-29	100	132	280.000000

and let's just check the dtypes as well.

In [9]:

            
                Copied!
                
df_clean.dtypes
df_clean.dtypes

Out[9]:

0
Duration             int64
Date        datetime64[ns]
Pulse                int64
Maxpulse             int64
Calories           float64
dtype: object

Thanks for reading! If want to learn more about the pipe method, I recommended checking out calmcode.io's tutorial here - they make some pretty cool decorators so they can log and time their functions!