
How To Get Missing Date In Columns Using Python Pandas

I have the following data frame, and I want to get the missing dates along with their keys in pandas.

      size  number    key      date
0  153.2 K   12345  Hello  20181001
1

Solution 1:

You can do this with some fancy reshaping like this:

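import numpy as np
import pandas as pd

# Pivot to one row per date, add the absent dates, then list the NaN cells.
# pivot takes keyword arguments; positional ones are rejected by pandas 2.0+.
(df.pivot(index='date', columns='key')
   .reindex(np.arange(df['date'].min(), df['date'].max()+1))
   .stack('key', dropna=False)
   .loc[lambda x: x['size'].isna()]
   .index
   .to_frame(index=False))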

Output:

       date    key
0  20181002  Hello
1  20181002     No

How?

  • Pivot the dataframe so that each date is a single row and each key gets its own columns

  • Reindex on the full range of dates, which adds all-NaN rows for the missing ones

  • Stack the key level back into the index, keeping the NaN values with dropna=False

  • Keep only the rows whose size is missing, using isna

  • Convert the remaining (date, key) index to a dataframe with to_frame
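
The steps above can be sketched one at a time on a small frame. This is only a minimal sketch under stated assumptions: the first row of the sample data comes from the question, every other row is hypothetical, and melt stands in for stack(dropna=False) so that the example does not depend on how a given pandas version treats stack's dropna argument.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: only the first row comes from the question.
df = pd.DataFrame({
    "size": ["153.2 K", "100.0 K", "120.5 K", "98.1 K", "87.3 K", "110.9 K"],
    "number": [12345, 12345, 12345, 12345, 12345, 12345],
    "key": ["Hello", "No", "Hello", "No", "Hello", "No"],
    "date": [20181001, 20181001, 20181003, 20181003, 20181004, 20181004],
})

# Step 1: one row per date, one column per (value column, key) pair.
wide = df.pivot(index="date", columns="key")

# Step 2: reindex on every integer between the first and last date,
# which adds an all-NaN row for 20181002.
full = wide.reindex(np.arange(df["date"].min(), df["date"].max() + 1))

# Steps 3 to 5: go back to long form and keep the (date, key) pairs
# whose size is missing. melt is used here instead of stack.
is_missing = full["size"].isna()  # boolean frame, dates x keys
long = is_missing.reset_index().melt(
    id_vars="date", var_name="key", value_name="missing"
)
missing = long.loc[long["missing"], ["date", "key"]].reset_index(drop=True)
print(missing)
```

Printing wide and full after each step shows where the NaN row for the absent date appears.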

Update addressing the date concern mentioned by @Cimbali below. Converting the column to real dates first means the reindex only adds valid calendar days:

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
(df.pivot(index='date', columns='key')
   .reindex(pd.date_range(df['date'].min(), df['date'].max(), freq='D'))
   .stack('key', dropna=False)
   .loc[lambda x: x['size'].isna()]
   .index
   .to_frame(index=False))

Output (the date column is labeled 0 because the index built by pd.date_range has no name):

           0    key
0 2018-10-02  Hello
1 2018-10-02     No
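
To see why real dates matter, it may help to compare both ways of building the range over a span that crosses a month boundary. The start and end values below are hypothetical and chosen only to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical range that crosses a month boundary: 30 Oct to 2 Nov 2018.
start, end = 20181030, 20181102

# Integer arithmetic also produces 20181032, 20181033, ... 20181100,
# none of which are calendar days.
as_ints = np.arange(start, end + 1)
print(len(as_ints))  # 73

# A real date range contains only the four days in between.
as_dates = pd.date_range(
    pd.to_datetime(str(start), format="%Y%m%d"),
    pd.to_datetime(str(end), format="%Y%m%d"),
    freq="D",
)
print(as_dates.strftime("%Y%m%d").astype(int).tolist())
# [20181030, 20181031, 20181101, 20181102]
```

With the integer range, every one of those invalid values would be reported as a missing date.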

Solution 2:

If we align the dates along one dimension, it becomes easier to see the common values (on the index) and where to fill (on the columns). We can do this with pivot_table. (The value here is just a placeholder with all 1s.)

>>> tab = df.assign(value=1).pivot_table(index='key', columns='date', values='value')
>>> tab
date   20181001  20181003  20181004
key                                
Hello         1         1         1
No            1         1         1

melt allows us to do the opposite transformation:

>>> tab.reset_index().melt(id_vars='key').drop(columns='value')
     key      date
0  Hello  20181001
1     No  20181001
2  Hello  20181003
3     No  20181003
4  Hello  20181004
5     No  20181004

So if we want an intermediate step to add missing dates, we should probably convert them to dates first and use pd.date_range:

>>> avail_dates = pd.to_datetime(tab.columns, format='%Y%m%d')
>>> avail_dates
DatetimeIndex(['2018-10-01', '2018-10-03', '2018-10-04'], dtype='datetime64[ns]', name='date', freq=None)
>>> all_dates = pd.date_range(avail_dates.min(), avail_dates.max(), freq='D')
>>> tab_filled = tab.reindex(all_dates.strftime('%Y%m%d').astype(int), axis='columns')
>>> tab_filled
       20181001  20181002  20181003  20181004
key                                          
Hello         1       NaN         1         1
No            1       NaN         1         1

Finally, keep only the new columns and apply the same melt trick:

>>> missing = tab_filled.drop(columns=tab.columns).reset_index().melt('key').drop(columns=['value'])
>>> missing
     key  variable
0  Hello  20181002
1     No  20181002
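
Note that dropping the original columns only reports dates that are absent for every key. If a date can be missing for some keys only, one possible variation is to melt all columns of the filled table and keep the NaN cells. The data below is hypothetical, and the variable names mirror the ones used above.

```python
import pandas as pd

# Hypothetical data: "No" has no row for 20181003, "Hello" does.
df = pd.DataFrame({
    "key": ["Hello", "No", "Hello", "Hello", "No"],
    "date": [20181001, 20181001, 20181003, 20181004, 20181004],
})

tab = df.assign(value=1).pivot_table(index="key", columns="date", values="value")

# Build the full daily range and express it as YYYYMMDD integers again.
parsed = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
all_dates = pd.date_range(parsed.min(), parsed.max(), freq="D")
tab_filled = tab.reindex(all_dates.strftime("%Y%m%d").astype(int), axis="columns")

# Melt every column and keep the cells that are NaN for a given key.
long = tab_filled.reset_index().melt(id_vars="key", var_name="date")
per_key_missing = long.loc[long["value"].isna(), ["key", "date"]].reset_index(drop=True)
print(per_key_missing)
```

Here 20181002 is reported for both keys, while 20181003 is reported for "No" only.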

Here’s a shorter variant on the same principle, where we first build the dates, then a synthetic dataframe that we can melt:

>>> dates = pd.date_range(
...     *pd.to_datetime(df['date'], format='%Y%m%d').agg(['min', 'max']), freq='D'
... ).strftime('%Y%m%d').astype(int)
>>> dates
Index([20181001, 20181002, 20181003, 20181004], dtype='int64')
>>> pd.DataFrame(index=pd.Index(df['key'].unique(), name='key'),
...              columns=dates.difference(df['date']))\
... .reset_index().melt('key').drop(columns=['value'])
     key  variable
0  Hello  20181002
1     No  20181002
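
The same idea can be wrapped in a small function. The sketch below is not part of the original answer: missing_dates is a hypothetical helper name, it assumes integer dates in YYYYMMDD form, and it builds the key and date pairs with pd.MultiIndex.from_product rather than melting an empty dataframe.

```python
import pandas as pd

def missing_dates(df, date_col="date", key_col="key"):
    """Pair every key with each calendar day absent from the whole frame.

    Hypothetical helper; assumes integer dates in YYYYMMDD form.
    """
    parsed = pd.to_datetime(df[date_col].astype(str), format="%Y%m%d")
    all_days = pd.date_range(parsed.min(), parsed.max(), freq="D")
    # Back to the integer representation used in the data.
    all_ints = all_days.strftime("%Y%m%d").astype(int)
    absent = all_ints.difference(pd.Index(df[date_col].unique()))
    # Cartesian product of the keys and the absent dates.
    grid = pd.MultiIndex.from_product(
        [df[key_col].unique(), absent], names=[key_col, date_col]
    )
    return grid.to_frame(index=False)

# Hypothetical sample data with the same keys and dates as above.
df = pd.DataFrame({
    "key": ["Hello", "No", "Hello", "No", "Hello", "No"],
    "date": [20181001, 20181001, 20181003, 20181003, 20181004, 20181004],
})
result = missing_dates(df)
print(result)
```

Like the variant above, this only reports dates that are missing from the whole frame, not dates missing for a single key.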

