Resampling Within a Pandas MultiIndex

I have some hierarchical data which bottoms out into time series data which looks something like this:

df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df

                               value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        0       10
                   2012-01-02        1       11
                   2012-01-03        2       12
                   2012-01-04        3       13
        Savanna    2012-01-01        4       14
                   2012-01-02        5       15
                   2012-01-03        6       16
                   2012-01-04        7       17
Alabama Mobile     2012-01-01        8       18
                   2012-01-02        9       19
                   2012-01-03       10       20
                   2012-01-04       11       21
        Montgomery 2012-01-01       12       22
                   2012-01-02       13       23
                   2012-01-03       14       24
                   2012-01-04       15       25

I’d like to perform time resampling per city, so something like

df.resample("2D", how="sum")

would output

                             value_a  value_b
State   City       Date                        
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49

as is, df.resample('2D', how='sum') gets me

TypeError: Only valid with DatetimeIndex or PeriodIndex

Fair enough, but I’d sort of expect this to work:

>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex

at which point I’m really running out of ideas… is there some way stack and unstack might be able to help me?

Here is Solutions:

We have many solutions to this problem, But we recommend you to use the first solution because it is tested & true solution that will 100% work for you.

Solution 1

pd.Grouper
allows you to specify a “groupby instruction for a target object”. In
particular, you can use it to group by dates even if df.index is not a DatetimeIndex:

df.groupby(pd.Grouper(freq='2D', level=-1))

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex.
Moreover, you can use this in conjunction with other level values from the index:

level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
                      +[pd.Grouper(freq='2D', level=-1)]).sum())

It looks a bit awkward, but using_Grouper turns out to be much faster than my original
suggestion, using_reset_index:

import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       +[pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates], 
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))

yields

                               value_a  value_b
State   City       Date                        
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33

Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:

In [30]: df = make_df(10)

In [34]: len(df)
Out[34]: 5000

In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop

In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop

In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop

Solution 2

You need the groupby() method and provide it with a pd.Grouper for each level of your MultiIndex you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.

To resample date or timestamp levels, you need to set the freq argument with the frequency of choice — a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.

This should give you the DataFrame you need:

df.groupby([pd.Grouper(level='State'), 
            pd.Grouper(level='City'), 
            pd.Grouper(level='Date', freq='2D')]
          ).sum()

The Time Series Guide in the pandas documentation describes resample() as:

… a time-based groupby, followed by a reduction method on each of its groups.

Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.

The same paragraph points to the cookbook section on resampling for more advanced examples, where the ‘Grouping using a MultiIndex‘ entry is highly relevant for this question. Hope that helps.

Solution 3

An alternative using stack/unstack

df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
Alabama Mobile     2012-01-01       17       37
        Montgomery 2012-01-01       25       45
Georgia Savanna    2012-01-01        9       29
        Atlanta    2012-01-03        5       25
Alabama Mobile     2012-01-03       21       41
        Montgomery 2012-01-03       29       49
Georgia Savanna    2012-01-03       13       33

Notes:

  1. No idea about performance comparison
  2. Possible pandas bug – stack(level=[2,1]) worked, but stack(level=[1,2]) failed

Solution 4

This works:

df.groupby(level=[0,1]).apply(lambda x: x.set_index('Date').resample('2D', how='sum'))

                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33

If the Date column is strings, then convert to datetime beforehand:

df['Date'] = pd.to_datetime(df['Date'])

Solution 5

I had the same issue, was breaking my head for a while, but then I read the documentation of the .resample function in the 0.19.2 docs, and I see there’s a new kwarg called “level” that you can use to specify a level in a MultiIndex.

Edit: More details in the “What’s New” section.

Solution 6

I know this question is a few years old, but I had the same problem and came to a simpler solution that requires 1 line:

>>> import pandas as pd
>>> ts = pd.read_pickle('time_series.pickle')
>>> ts
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-07-01     1
                                  2012-07-02    13
                                  2012-07-03     1
                                  2012-07-04     1
                                  2012-07-05    10
                                  2012-07-06     4
                                  2012-07-07    47
                                  2012-07-08     0
                                  2012-07-09     3
                                  2012-07-10    22
                                  2012-07-11     3
                                  2012-07-12     0
                                  2012-07-13    22
                                  2012-07-14     1
                                  2012-07-15     2
                                  2012-07-16     2
                                  2012-07-17     8
                                  2012-07-18     0
                                  2012-07-19     1
                                  2012-07-20    10
                                  2012-07-21     0
                                  2012-07-22     3
                                  2012-07-23     0
                                  2012-07-24    35
                                  2012-07-25     6
                                  2012-07-26     1
                                  2012-07-27     0
                                  2012-07-28     6
                                  2012-07-29    23
                                  2012-07-30     0
                                                ..
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2014-06-02     0
                                  2014-06-03     1
                                  2014-06-04     0
                                  2014-06-05     0
                                  2014-06-06     0
                                  2014-06-07     0
                                  2014-06-08     2
                                  2014-06-09     0
                                  2014-06-10     0
                                  2014-06-11     0
                                  2014-06-12     0
                                  2014-06-13     0
                                  2014-06-14     0
                                  2014-06-15     0
                                  2014-06-16     0
                                  2014-06-17     0
                                  2014-06-18     0
                                  2014-06-19     0
                                  2014-06-20     0
                                  2014-06-21     0
                                  2014-06-22     0
                                  2014-06-23     0
                                  2014-06-24     0
                                  2014-06-25     4
                                  2014-06-26     0
                                  2014-06-27     1
                                  2014-06-28     0
                                  2014-06-29     0
                                  2014-06-30     1
                                  2014-07-01     0
dtype: int64
>>> ts.unstack().T.resample('W', how='sum').T.stack()
xxxxxx1  yyyyyyyyyyyyyyyyyyyyyy1  2012-06-25/2012-07-01      1
                                  2012-07-02/2012-07-08     76
                                  2012-07-09/2012-07-15     53
                                  2012-07-16/2012-07-22     24
                                  2012-07-23/2012-07-29     71
                                  2012-07-30/2012-08-05     38
                                  2012-08-06/2012-08-12    258
                                  2012-08-13/2012-08-19    144
                                  2012-08-20/2012-08-26    184
                                  2012-08-27/2012-09-02    323
                                  2012-09-03/2012-09-09    198
                                  2012-09-10/2012-09-16    348
                                  2012-09-17/2012-09-23    404
                                  2012-09-24/2012-09-30    380
                                  2012-10-01/2012-10-07    367
                                  2012-10-08/2012-10-14    163
                                  2012-10-15/2012-10-21    338
                                  2012-10-22/2012-10-28    252
                                  2012-10-29/2012-11-04    197
                                  2012-11-05/2012-11-11    336
                                  2012-11-12/2012-11-18    234
                                  2012-11-19/2012-11-25    143
                                  2012-11-26/2012-12-02    204
                                  2012-12-03/2012-12-09    296
                                  2012-12-10/2012-12-16    146
                                  2012-12-17/2012-12-23     85
                                  2012-12-24/2012-12-30    198
                                  2012-12-31/2013-01-06    214
                                  2013-01-07/2013-01-13    229
                                  2013-01-14/2013-01-20    192
                                                          ...
xxxxxxN  yyyyyyyyyyyyyyyyyyyyyyN  2013-12-09/2013-12-15      3
                                  2013-12-16/2013-12-22      0
                                  2013-12-23/2013-12-29      0
                                  2013-12-30/2014-01-05      1
                                  2014-01-06/2014-01-12      3
                                  2014-01-13/2014-01-19      6
                                  2014-01-20/2014-01-26     11
                                  2014-01-27/2014-02-02      0
                                  2014-02-03/2014-02-09      1
                                  2014-02-10/2014-02-16      4
                                  2014-02-17/2014-02-23      3
                                  2014-02-24/2014-03-02      1
                                  2014-03-03/2014-03-09      4
                                  2014-03-10/2014-03-16      0
                                  2014-03-17/2014-03-23      0
                                  2014-03-24/2014-03-30      9
                                  2014-03-31/2014-04-06      1
                                  2014-04-07/2014-04-13      1
                                  2014-04-14/2014-04-20      1
                                  2014-04-21/2014-04-27      2
                                  2014-04-28/2014-05-04      8
                                  2014-05-05/2014-05-11      7
                                  2014-05-12/2014-05-18      5
                                  2014-05-19/2014-05-25      2
                                  2014-05-26/2014-06-01      8
                                  2014-06-02/2014-06-08      3
                                  2014-06-09/2014-06-15      0
                                  2014-06-16/2014-06-22      0
                                  2014-06-23/2014-06-29      5
                                  2014-06-30/2014-07-06      1
dtype: int64

ts.unstack().T.resample('W', how='sum').T.stack() is all it took! Very easy and seems quite performant. The pickle I’m reading in is 331M, so this is a pretty beefy data structure; the resampling takes just a couple seconds on my MacBook Pro.

Solution 7

I haven’t checked the efficiency of this, but my instinctual way of performing datetime operations on a multi-index was by a kind of manual “split-apply-combine” process using a dictionary comprehension.

Assuming your DataFrame is unindexed. (You can do .reset_index() first), this works as follows:

  1. Group by the non-date columns
  2. Set “Date” as index and resample each chunk
  3. Reassemble using pd.concat

The final code looks like:

pd.concat({g: x.set_index("Date").resample("2D").mean()
                   for g, x in house.groupby(["State", "City"])})

Solution 8

I have tried this on my own, pretty short and pretty simple too (I will only work with 2 indexes, and you would get the full idea):

Step 1: resample the date but that would give you the date without the other index :

new=df.reset_index('City').groupby('crime', group_keys=False).resample('2d').sum().pad()

That would give you the date and its count

Step 2: get the categorical index in the same order as the the date :

col=df.reset_index('City').groupby('City', group_keys=False).resample('2D').pad()[['City']]

That would give you a new column with the city names and in the same order as the date.

Step 3: merge the dataframes together

new_df=pd.concat([new, col], axis=1)

It’s pretty simple, you can make it really shorter tho.

Note: Use and implement solution 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply