How To Get Missing Date In Columns Using Python Pandas
Solution 1:
You can do this with some fancy reshaping like this:
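import numpy as np
import pandas as pd

# pivot takes keyword-only arguments since pandas 2.0, so name them explicitly.
(df.pivot(index='date', columns='key')
   .reindex(np.arange(df['date'].min(), df['date'].max()+1))
   .stack('key', dropna=False)
   .loc[lambda x: x['size'].isna()]
   .index
   .to_frame(index=False))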
Output:
       date    key
0  20181002  Hello
1  20181002     No
How?
Reshape the dataframe so that there is a single date per row.
Next, reindex the dataframe to fill in the missing dates.
Reshape the dataframe again by stacking key, keeping the NaN values.
Filter the dataframe down to the missing values using isna.
Convert the index to a dataframe with to_frame.
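The steps above can be sketched end to end as follows. The sample data is hypothetical, since the original frame is not shown; only the column names date, key and size come from the answer. As a deliberate variation, this sketch stacks a boolean isna mask instead of passing dropna=False, because the handling of that argument appears to vary between pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 20181002 is absent for both keys.
df = pd.DataFrame({
    "date": [20181001, 20181001, 20181003, 20181003, 20181004, 20181004],
    "key": ["Hello", "No"] * 3,
    "size": [10, 20, 30, 40, 50, 60],
})

# One row per date, one column per key.
wide = df.pivot(index="date", columns="key", values="size")

# Reindex over every integer between the first and last date; dates that
# were absent show up as all-NaN rows.
full = wide.reindex(np.arange(df["date"].min(), df["date"].max() + 1))
full.index.name = "date"  # make sure the index keeps its name

# Flag the NaN cells, then stack the boolean mask into long form.
mask = full.isna().stack()

# Keep the flagged (date, key) pairs and turn the index into columns.
missing = mask[mask].index.to_frame(index=False)
print(missing)
```

Naming each intermediate (wide, full, mask) makes it easy to print them one at a time and see what every reshaping step does.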
Update: to address the date concern raised by @Cimbali below, convert the column to real datetimes first and reindex with a daily date range:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
(df.pivot(index='date', columns='key')
   .reindex(pd.date_range(df['date'].min(), df['date'].max(), freq='D'))
   .stack('key', dropna=False)
   .loc[lambda x: x['size'].isna()]
   .index
   .to_frame(index=False))
Output:
           0    key
0 2018-10-02  Hello
1 2018-10-02     No
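The concern itself is not quoted here, but presumably it is that integer arithmetic on YYYYMMDD values only works within a single month. A small sketch with hypothetical endpoints that cross a month boundary shows the difference between np.arange and pd.date_range:

```python
import numpy as np
import pandas as pd

# Hypothetical endpoints spanning the end of October.
start, end = 20181030, 20181102

# Plain integer range: includes values such as 20181032 that are not dates.
as_ints = np.arange(start, end + 1)
print(len(as_ints))

# Calendar-aware range: only the real days in between.
as_dates = pd.date_range(
    pd.to_datetime(str(start), format="%Y%m%d"),
    pd.to_datetime(str(end), format="%Y%m%d"),
    freq="D",
)
print(as_dates.strftime("%Y%m%d").tolist())
```

The integer range has 73 entries while the calendar range has 4, so the integer version would report many bogus "missing dates" once the data spans more than one month.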
Solution 2:
If we align the dates along one dimension, it becomes easier to see the common values (on the index) and where to fill (on the columns). We can do this with pivot_table. (The value column here is just a placeholder filled with 1s.)
>>> tab = df.assign(value=1).pivot_table(index='key', columns='date', values='value')
>>> tab
date   20181001  20181003  20181004
key
Hello         1         1         1
No            1         1         1
melt allows us to do the opposite transformation:
>>> tab.reset_index().melt(id_vars='key').drop(columns='value')
     key      date
0  Hello  20181001
1     No  20181001
2  Hello  20181003
3     No  20181003
4  Hello  20181004
5     No  20181004
So if we want an intermediate step that adds the missing dates, we should convert the column labels to real dates first and use pd.date_range:
>>> avail_dates = pd.to_datetime(tab.columns, format='%Y%m%d')
>>> avail_dates
DatetimeIndex(['2018-10-01', '2018-10-03', '2018-10-04'], dtype='datetime64[ns]', name='date', freq=None)
>>> all_dates = pd.date_range(avail_dates.min(), avail_dates.max(), freq='D')
>>> tab_filled = tab.reindex(all_dates.strftime('%Y%m%d').astype(int), axis='columns')
>>> tab_filled
       20181001  20181002  20181003  20181004
key
Hello         1       NaN         1         1
No            1       NaN         1         1
Finally, keep only the new columns and do our melt trick:
>>> missing = tab_filled.drop(columns=tab.columns).reset_index().melt('key').drop(columns=['value'])
>>> missing
     key  variable
0  Hello  20181002
1     No  20181002
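For reuse, the whole sequence can be wrapped in one function. This is only a sketch: the function name missing_dates_per_key and the sample frame are hypothetical, and var_name is passed to melt so that the result column is called date rather than variable. Note that, like the steps above, it only reports dates that are absent for every key.

```python
import pandas as pd

def missing_dates_per_key(df):
    """Return one (key, date) row for each date absent from the whole frame."""
    # Keys on the index, observed dates on the columns; value is a placeholder.
    tab = df.assign(value=1).pivot_table(index="key", columns="date", values="value")

    # Parse the integer column labels as dates and build the full daily range.
    avail = pd.to_datetime(tab.columns.astype(str), format="%Y%m%d")
    all_dates = pd.date_range(avail.min(), avail.max(), freq="D")

    # Reindex the columns over the full range, then keep only the new ones.
    filled = tab.reindex(all_dates.strftime("%Y%m%d").astype(int), axis="columns")
    new_cols = filled.drop(columns=tab.columns)

    # Melt back to long form and discard the all-NaN placeholder values.
    return new_cols.reset_index().melt("key", var_name="date").drop(columns="value")

# Hypothetical sample data: 20181002 is absent for both keys.
df = pd.DataFrame({
    "date": [20181001, 20181001, 20181003, 20181003, 20181004, 20181004],
    "key": ["Hello", "No"] * 3,
    "size": [10, 20, 30, 40, 50, 60],
})
result = missing_dates_per_key(df)
print(result)
```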
Here's a shorter variant on the same principle, where we first build the dates, then a synthetic dataframe that we can melt:
>>> dates = pd.date_range(
... *pd.to_datetime(df['date'], format='%Y%m%d').agg(['min', 'max']), freq='D'
... ).strftime('%Y%m%d').astype(int)
>>> dates
Int64Index([20181001, 20181002, 20181003, 20181004], dtype='int64')
>>> pd.DataFrame(index=pd.Index(df['key'].unique(), name='key'),
... columns=dates.difference(df['date']))\
... .reset_index().melt('key').drop(columns=['value'])
     key  variable
0  Hello  20181002
1     No  20181002
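The same shorter variant as a self-contained script, for anyone who wants to run it outside the REPL. The sample frame is hypothetical, and the names first, last, absent and keys are introduced here only for readability.

```python
import pandas as pd

# Hypothetical sample data: 20181002 is absent for both keys.
df = pd.DataFrame({
    "date": [20181001, 20181001, 20181003, 20181003, 20181004, 20181004],
    "key": ["Hello", "No"] * 3,
    "size": [10, 20, 30, 40, 50, 60],
})

# Full daily range between the first and last observed date, as integers.
first, last = pd.to_datetime(df["date"].astype(str), format="%Y%m%d").agg(["min", "max"])
dates = pd.date_range(first, last, freq="D").strftime("%Y%m%d").astype(int)

# Dates in the full range that never occur in the data.
absent = dates.difference(pd.Index(df["date"]))

# Synthetic empty frame (keys x absent dates), melted into long form.
keys = pd.Index(df["key"].unique(), name="key")
missing = (
    pd.DataFrame(index=keys, columns=absent)
    .reset_index()
    .melt("key", var_name="date")
    .drop(columns="value")
)
print(missing)
```

Because the synthetic frame is built from the set difference directly, there is no need to pivot the original data at all.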