Rolling pivot table in Python

I have a df (a sample is pasted at the end). I am looking to find which tradePrice had the most tradeVolume over the last 10 minutes, or any other rolling period.

This is the pivot table done in Excel, based on the sample data attached.

                                    minute
tradePrice  Data                    0    1   10   Total
12548       Sum - tradeVolume                 3       3
            Count - tradePrice               2       2
12548.5     Sum - tradeVolume                 1       1
            Count - tradePrice               1       1
12549       Sum - tradeVolume                 1       1
            Count - tradePrice               1       1
12549.5     Sum - tradeVolume                95      95
            Count - tradePrice               5       5
12550       Sum - tradeVolume                 6       6
            Count - tradePrice               4       4
12559       Sum - tradeVolume      93                93
            Count - tradePrice     1                 1
12559.5     Sum - tradeVolume           1             1
            Count - tradePrice          1             1
12560       Sum - tradeVolume           5             5
            Count - tradePrice          4             4
12560.5     Sum - tradeVolume       3   5             8
            Count - tradePrice      3   2             5
12561       Sum - tradeVolume       4   5             9
            Count - tradePrice      2   3             5
12561.5     Sum - tradeVolume       3   2             5
            Count - tradePrice      3   1             4
12562       Sum - tradeVolume       9   7            16
            Count - tradePrice      8   1             9
12562.5     Sum - tradeVolume       6   2             8
            Count - tradePrice      2   2             4
12563       Sum - tradeVolume       2                 2
            Count - tradePrice      1                 1
Total       Sum - tradeVolume     120  27  106      253
Total       Count - tradePrice     20  14   13       47

The result needs to be a df, something like this (searching, per period, for the price with the most volume traded):

             Price     Volume
02:00:00 AM  12559     93
02:01:00 AM  12562     7
02:10:00 AM  12549.5   95

In order to get 1-minute results, I grouped the data and applied the following function:

def f(x):  # find the POC price and its volume within one group
    a = x['tradePrice'].value_counts().index[0]           # price with the most trades
    b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()  # total volume at that price
    return pd.Series([a, b], ['POC_Price', 'POC_Volume'])

groupbytime = str(Time) + "min"            # there is a column with this name, e.g. "1min"
groups = df.groupby(groupbytime, as_index=True)
df_POC = groups.apply(f)                   # apply the POC function to each group
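
Side note on f: value_counts ranks prices by the number of trades, so f picks the most frequently traded price. Since the expected output above is the price with the highest summed volume, a small variant would rank by volume directly (the name f_by_volume is mine):

def f_by_volume(x):  # variant that ranks by summed volume instead of trade count
    vol = x.groupby('tradePrice')['tradeVolume'].sum()
    return pd.Series([vol.idxmax(), vol.max()], ['POC_Price', 'POC_Volume'])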

My question is: how can I get the same result per rolling time period (which cannot be less than 1 min), so that the expected result (the price traded with the max volume) for the last 10 minutes is:

             Price     Volume
02:10:00 AM  12549.5   95

Thanks in advance!

The sample data:

    dateTime    tradePrice  tradeVolume 1min    time_of_day_10  time_of_day_30  date    hour    minute
0   2017-09-19 02:00:04 12559   93  2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
49  2017-09-19 02:00:11 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
50  2017-09-19 02:00:12 12563   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
51  2017-09-19 02:00:12 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
122 2017-09-19 02:00:34 12561.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
123 2017-09-19 02:00:34 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
127 2017-09-19 02:00:34 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
129 2017-09-19 02:00:35 12561   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
130 2017-09-19 02:00:35 12560.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
131 2017-09-19 02:00:35 12561.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
135 2017-09-19 02:00:39 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
136 2017-09-19 02:00:39 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
137 2017-09-19 02:00:43 12561.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
138 2017-09-19 02:00:43 12561   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
139 2017-09-19 02:00:43 12560.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
140 2017-09-19 02:00:43 12560.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
152 2017-09-19 02:00:45 12562   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
153 2017-09-19 02:00:46 12562.5 4   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
166 2017-09-19 02:00:58 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
167 2017-09-19 02:00:58 12562.5 2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
168 2017-09-19 02:01:00 12562   7   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
169 2017-09-19 02:01:00 12562.5 1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
170 2017-09-19 02:01:00 12562.5 1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
171 2017-09-19 02:01:00 12561.5 2   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
175 2017-09-19 02:01:03 12561   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
176 2017-09-19 02:01:03 12561   3   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
187 2017-09-19 02:01:07 12560.5 2   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
188 2017-09-19 02:01:08 12561   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
189 2017-09-19 02:01:10 12560   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
190 2017-09-19 02:01:10 12560   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
191 2017-09-19 02:01:10 12559.5 1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
192 2017-09-19 02:01:11 12560   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
193 2017-09-19 02:01:12 12560   2   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
194 2017-09-19 02:01:12 12560.5 3   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
593 2017-09-19 02:10:00 12550   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
594 2017-09-19 02:10:00 12549.5 12  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
604 2017-09-19 02:10:12 12548.5 1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
605 2017-09-19 02:10:15 12549.5 22  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
606 2017-09-19 02:10:16 12549.5 21  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
636 2017-09-19 02:10:45 12548   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
637 2017-09-19 02:10:47 12548   2   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
638 2017-09-19 02:10:47 12549.5 23  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
639 2017-09-19 02:10:48 12549.5 17  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
640 2017-09-19 02:10:49 12549   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
665 2017-09-19 02:10:58 12550   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
666 2017-09-19 02:10:58 12550   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
667 2017-09-19 02:10:58 12550   3   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10

If I understand your problem correctly, you need to choose a time granularity and a time window. Then you can do this with a combination of groupby + unstack + rolling.
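
Both pd.Grouper and a time-based rolling window need dateTime to be an actual datetime column, so if it is stored as strings, convert it first (a minimal prep step):

import pandas as pd

df['dateTime'] = pd.to_datetime(df['dateTime'])  # Grouper/rolling require datetime64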

First, the groupby:

time_grain = '1min'
df = df.groupby([pd.Grouper(key='dateTime', freq=time_grain), 'tradePrice']).tradeVolume.sum()

dateTime             tradePrice
2017-09-19 02:00:00  12559.0       93
                     12560.5        3
                     12561.0        4
                     12561.5        3
                     12562.0        9
                     12562.5        6
                     12563.0        2
2017-09-19 02:01:00  12559.5        1
                     12560.0        5
                     12560.5        5
                     12561.0        5
                     12561.5        2
                     12562.0        7
                     12562.5        2
2017-09-19 02:10:00  12548.0        3
                     12548.5        1
                     12549.0        1
                     12549.5       95
                     12550.0        6
Name: tradeVolume, dtype: int64

Then unstack + rolling:

window_size = '10min'
df = df.unstack('tradePrice').rolling(window_size).sum()

tradePrice           12548.0  12548.5  12549.0  12549.5  12550.0  12559.0  
dateTime                                                                    
2017-09-19 02:00:00      NaN      NaN      NaN      NaN      NaN     93.0   
2017-09-19 02:01:00      NaN      NaN      NaN      NaN      NaN     93.0   
2017-09-19 02:10:00      3.0      1.0      1.0     95.0      6.0      NaN   

tradePrice           12559.5  12560.0  12560.5  12561.0  12561.5  12562.0  
dateTime                                                                    
2017-09-19 02:00:00      NaN      NaN      3.0      4.0      3.0      9.0   
2017-09-19 02:01:00      1.0      5.0      8.0      9.0      5.0     16.0   
2017-09-19 02:10:00      1.0      5.0      5.0      5.0      2.0      7.0   

tradePrice           12562.5  12563.0  
dateTime                               
2017-09-19 02:00:00      6.0      2.0  
2017-09-19 02:01:00      8.0      2.0  
2017-09-19 02:10:00      2.0      NaN
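
One detail visible here: at 02:10 the 12559 column is NaN even though 02:00 is exactly 10 minutes earlier. Time-offset windows are left-open by default (they cover (t - 10min, t]), so the 02:00 bucket falls just outside. If you want both endpoints included, rolling accepts a closed argument, e.g. as a variant of the step above:

# include both window endpoints, so the 02:00 bucket still counts at 02:10
df = df.unstack('tradePrice').rolling(window_size, closed='both').sum()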

Finally, stack tradePrice back into the index and find the index with the highest value for every time period:

df = df.stack('tradePrice')
idx_list = df.groupby('dateTime').idxmax()
result = df.loc[idx_list]

dateTime             tradePrice
2017-09-19 02:00:00  12559.0       93.0
2017-09-19 02:01:00  12559.0       93.0
2017-09-19 02:10:00  12549.5       95.0
dtype: float64
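
Putting the three steps together, a reusable sketch could look like this (it starts again from the original trades dataframe; the function name and parameters are mine):

def rolling_poc(trades, time_grain='1min', window_size='10min'):
    # price with the highest rolling summed volume per time bucket
    vol = (trades.groupby([pd.Grouper(key='dateTime', freq=time_grain), 'tradePrice'])
                 .tradeVolume.sum()           # long Series: (dateTime, tradePrice) -> volume
                 .unstack('tradePrice')       # wide matrix: one column per price
                 .rolling(window_size).sum()  # trailing sums per price
                 .stack('tradePrice'))        # back to a long Series
    return vol.loc[vol.groupby('dateTime').idxmax()]

poc = rolling_poc(trades_df)  # trades_df being the original df from the question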

Note that rolling defaults to a minimum of one observation when you use it with a time offset. That's why you get three result rows.
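
min_periods counts observations rather than elapsed time, so there is no direct way to demand a fully covered window. One workaround (a sketch, using the result Series from above) is to drop timestamps that sit less than a full window after the start of the data:

# keep only windows with a full 10 minutes of history behind them
first_ts = result.index.get_level_values('dateTime').min()
full = result[result.index.get_level_values('dateTime') >= first_ts + pd.Timedelta('10min')]
# with the sample data, full keeps only the 02:10:00 row: (12549.5, 95.0)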

I think the biggest drawback of this approach is that, for a big dataframe with many distinct price points, it will take a lot of memory (because a new column is generated for every price point).
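
If memory becomes the bottleneck, one alternative sketch (mine, and slower, since each window is recomputed) skips the wide matrix and works directly on the long per-bucket Series produced by the groupby step:

def rolling_poc_long(per_bucket, window='10min'):
    # per_bucket: long Series indexed by (dateTime, tradePrice), as after the groupby step
    ts = per_bucket.index.get_level_values('dateTime')
    out = {}
    for t in ts.unique():
        # volumes inside the trailing window (t - window, t], summed per price
        win = per_bucket[(ts > t - pd.Timedelta(window)) & (ts <= t)]
        vol = win.groupby('tradePrice').sum()
        out[t] = (vol.idxmax(), vol.max())
    return pd.DataFrame(out, index=['POC_Price', 'POC_Volume']).T

This trades the wide (time buckets × prices) matrix for a loop over window ends, keeping memory linear in the number of (bucket, price) pairs.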
