Slicing Pandas rows with a string match is slow

2018-07-04 18:49:48

I basically want to learn a faster way to slice a Pandas dataframe with conditional slicing based on regex. For example, take the following df (the string columns have more than the few variations shown here; these are only for illustrative purposes):

index, string_col1, string_col2, value
0, 'apple', 'this', 10
1, 'pen', 'is', 123
2, 'pineapple', 'sparta', 20
3, 'pen pineapple apple pen', 'this', 234
4, 'apple', 'is', 212
5, 'pen', 'sparta', 50
6, 'pineapple', 'this', 69
7, 'pen pineapple apple pen', 'is',  79
8, 'apple pen', 'sparta again', 78
...
100000, 'pen pineapple apple pen', 'this is sparta', 392

I have to do Boolean conditional slicing according to the string_column using regex, while finding the indices with minimum and maximum in the value column, and then finally finding the difference between the min and max value. I do this by the following method, but it's SUPER SLOW when I have to match many different regex patterns:

pat1 = re.compile('apple')
pat2 = re.compile('sparta')
mask = (df['string_col1'].str.contains(pat1)) & (df['string_col2'].str.contains(pat2))
max_idx = df[mask].idxmax()
min_idx = df[mask].idxmin()
difference = df['value'].loc[max_idx] - df['value'].loc[min_idx]

I think to get one "difference" answer, I'm slicing the df too many times, but I can't figure out how to do it less. Furthermore, is there a faster way to slice it?

This is an optimization question since I know my code gets me what I need. Any tips will be appreciated!

You can make the logical comparison about 50x faster by using scipy.logical_and() on the underlying arrays instead of &:

a = pd.Series(sp.rand(10000) > 0.5)
b = pd.Series(sp.rand(10000) > 0.5)

%timeit sp.logical_and(a.values,b.values)
100000 loops, best of 3: 6.31 µs per loop

%timeit a & b
1000 loops, best of 3: 390 µs per loop
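A hedged aside: `scipy.logical_and` was only a re-export of NumPy's function and was deprecated in later SciPy releases, so the same comparison can be sketched with `numpy.logical_and`. The speedup most likely comes from working on the raw `.values` arrays, which skips pandas index alignment. The seed and array size below are illustrative assumptions, not part of the original timing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, illustrative only
a = pd.Series(rng.random(10000) > 0.5)
b = pd.Series(rng.random(10000) > 0.5)

# Combine the underlying NumPy arrays directly; the result is a plain ndarray.
fast_mask = np.logical_and(a.values, b.values)

# The pandas operator produces the same booleans, wrapped in a Series.
slow_mask = a & b

same = bool((fast_mask == slow_mask.values).all())
print(same)
```

The resulting ndarray can be used as a boolean indexer (for example `df[fast_mask]`) as long as both masks were built from the same dataframe, so the row order lines up.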

I've been trying to profile your example, but I'm actually getting pretty great performance on my synthetic data, so I may need some clarification. (Also, for some reason .idxmax() breaks for me whenever I have a string in my dataframe).

Here's my testing code:

import pandas as pd
import re
import numpy as np
import random
import IPython
from timeit import default_timer as timer

possibilities_col1 = ['apple', 'pen', 'pineapple', 'joseph', 'cauliflower']
possibilities_col2 = ['sparta', 'this', 'is', 'again']
entries = 100000
potential_words_col1 = 4
potential_words_col2 = 3
def create_function_col1():
    result = []
    for x in range(random.randint(1, potential_words_col1)):
        result.append(random.choice(possibilities_col1))
    return " ".join(result)

def create_function_col2():
    result = []
    for x in range(random.randint(1, potential_words_col2)):
        result.append(random.choice(possibilities_col2))
    return " ".join(result)

data = {'string_col1': pd.Series([create_function_col1() for _ in range(entries)]),
        'string_col2': pd.Series([create_function_col2() for _ in range(entries)]),
        'value': pd.Series([random.randint(1, 500) for _ in range(entries)])}


df = pd.DataFrame(data)
pat1 = re.compile('apple')
pat2 = re.compile('sparta')
pat3 = re.compile('pineapple')
pat4 = re.compile('this')
#IPython.embed()
start = timer()
# Parentheses let the expression span several lines
mask = (df['string_col1'].str.contains(pat1) &
        df['string_col1'].str.contains(pat3) &
        df['string_col2'].str.contains(pat2) &
        df['string_col2'].str.contains(pat4))
valid = df[mask]
# idxmax/idxmin return index labels, which is what .loc expects below
max_idx = valid['value'].idxmax()
min_idx = valid['value'].idxmin()
difference = df.loc[max_idx, 'value'] - df.loc[min_idx, 'value']
end = timer()
print("Difference: {}".format(difference))
print("# Valid: {}".format(len(valid)))
print("Time Elapsed: {}".format(end-start))

Can you explain how many conditions you're applying? Each regex I add costs a roughly linear amount of extra time (going from 2 to 3 regexes means about a 1.5x increase in run time). I'm also getting linear scaling in the number of entries and in both potential string lengths (the potential_words variables).

For reference, this code evaluates in about 0.15 seconds on my machine (1 million entries takes about 1.5 seconds).
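To check that scaling on your own data, a rough timing loop like the following may help. It is only a sketch: the word list, the row count, and the seed are made up for illustration, and the absolute times will differ from machine to machine.

```python
import re
from timeit import default_timer as timer

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed
words = np.array(['apple', 'pen', 'pineapple', 'joseph', 'cauliflower'])
col = pd.Series([" ".join(rng.choice(words, size=3)) for _ in range(20000)])
patterns = [re.compile(p) for p in ['apple', 'pen', 'pineapple', 'joseph']]

timings = []
for k in range(1, len(patterns) + 1):
    start = timer()
    mask = np.ones(len(col), dtype=bool)
    # AND together the first k pattern matches
    for pat in patterns[:k]:
        mask &= col.str.contains(pat).to_numpy(dtype=bool)
    timings.append((k, timer() - start))

for k, seconds in timings:
    print("{} pattern(s): {:.4f} s".format(k, seconds))
```

If the time per added pattern stays roughly constant, the regex matching itself is the dominant cost rather than the slicing.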

Edit: I'm an idiot and wasn't doing the same thing you were (I was taking the difference between values at the smallest and largest indices in the dataset, not the difference between the smallest and largest values), but fixing it didn't really add much in the way of runtime.

Edit 2: How does idxmax() know which column to select a maximum along in your example code?
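One way to sidestep that ambiguity, sketched under the assumption that only the difference is needed: select the `value` column once through the mask and call `max()` and `min()` on the resulting Series. The small frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "string_col1": ["apple", "pen", "pineapple", "apple pen", "pen"],
    "string_col2": ["sparta", "is", "sparta again", "this is sparta", "sparta"],
    "value": [10, 123, 20, 78, 50],
})

mask = (df["string_col1"].str.contains("apple")
        & df["string_col2"].str.contains("sparta"))

# Slice once, keeping only the numeric column.
values = df.loc[mask, "value"]

# The difference needs no index lookups at all.
difference = values.max() - values.min()

# On a Series, idxmax/idxmin are unambiguous if the row labels are also wanted.
max_idx, min_idx = values.idxmax(), values.idxmin()
print(difference, max_idx, min_idx)
```

This slices the dataframe a single time instead of once per `idxmax`/`idxmin` call.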

Apply each mask to the subset left by the previous one, so that every new filter runs on a smaller subset of the original dataframe:

import re

pat1 = re.compile('apple')
pat2 = re.compile('sparta')
# Filter on the first column, then test the second pattern only on the surviving rows
df1 = df[df['string_col1'].str.contains(pat1)]
df1 = df1[df1['string_col2'].str.contains(pat2)]
max_idx = df1['value'].idxmax()
min_idx = df1['value'].idxmin()
a, b = df1['value'].loc[max_idx], df1['value'].loc[min_idx]
difference = a - b
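The same idea extends to any number of patterns. Here is a minimal sketch, assuming the conditions can be written as (column, pattern) pairs; `filter_progressively` is a hypothetical helper name and the sample data is invented.

```python
import re

import pandas as pd


def filter_progressively(df, conditions):
    """Apply (column, pattern) filters one at a time, shrinking the frame each step."""
    subset = df
    for column, pattern in conditions:
        subset = subset[subset[column].str.contains(pattern)]
        if subset.empty:
            break  # nothing left to match, skip the remaining patterns
    return subset


df = pd.DataFrame({
    "string_col1": ["apple", "pen", "pineapple", "apple pen", "pen"],
    "string_col2": ["sparta", "is", "sparta again", "this is sparta", "sparta"],
    "value": [10, 123, 20, 78, 50],
})

conditions = [
    ("string_col1", re.compile("apple")),
    ("string_col2", re.compile("sparta")),
]

subset = filter_progressively(df, conditions)
difference = subset["value"].max() - subset["value"].min()
print(difference)
```

Putting the most selective pattern first should help the most, since every later regex then runs over fewer rows.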
