This post is based mainly on the article A Beginner’s Guide to Optimizing Pandas Code for Speed. It gives an introductory look at how to speed up the processing of a Pandas DataFrame.
The examples use a DataFrame whose head looks like this:
                               d  m1        m2
0               GbGXR/7198718882  66  0.670074
1         ylaMAz/121108977765122  74  0.497126
2         TmMGuz/841097771117122  39  0.360868
3        RkzCzz/8210712267122122  76  0.293050
4  sWxCNIji/11587120677873106105  14  0.893429
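The post does not show how the DataFrame was built. The sketch below is one hypothetical way to produce a frame with the same shape. It assumes, based on the rows shown, that the digits after the slash in d are the concatenated character codes of the letters before it, that m1 is a small non-negative integer, and that m2 is a float in [0, 1). The row count, the seed, the letter-run length, and the helper name make_d are all assumptions, not taken from the source.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
LETTERS = np.array(list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"))


def make_d(rng):
    """Build 'letters/codes': random letters, a slash, then each letter's ord() concatenated."""
    length = int(rng.integers(5, 9))  # 5 to 8 letters, an assumed range
    letters = "".join(rng.choice(LETTERS, size=length))
    codes = "".join(str(ord(c)) for c in letters)
    return f"{letters}/{codes}"


n_rows = 10_000  # hypothetical; the source does not state the row count
origin_df = pd.DataFrame({
    "d": [make_d(rng) for _ in range(n_rows)],
    "m1": rng.integers(0, 100, size=n_rows),
    "m2": rng.random(n_rows),
})
print(origin_df.head())
```

Timings measured on a frame built this way will not match the numbers in this post, since the original size and hardware are unknown.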
# The function must accept either a pd.Series or an np.ndarray as its argument,
# so it uses only ordinary arithmetic operators
def simple_function(v):
    return (v**2 - v) // 2 + (v**0.5) // 2
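Because the function uses only arithmetic operators, the same definition should work on a single number, a pd.Series, and a NumPy array. A minimal check, using the five m1 values from the head shown above as sample input:

```python
import numpy as np
import pandas as pd


def simple_function(v):
    return (v**2 - v) // 2 + (v**0.5) // 2


values = [66, 74, 39, 76, 14]  # m1 values from the head of the DataFrame

# One call per element on plain Python ints
scalar_results = [simple_function(x) for x in values]
# One call on the whole Series; the operators are applied element-wise
series_result = simple_function(pd.Series(values))
# One call on the whole NumPy array
array_result = simple_function(np.array(values))

print(scalar_results)
print(series_result.tolist())
print(array_result.tolist())
```

All three calls are expected to print the same numbers. The difference between them is how many Python-level function calls are needed, which is what the timings in this post measure.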
The goal is to split off the digits that follow the / in each value of the DataFrame's d column and use them to build a new id column.
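The split described above can be sketched with pandas string methods. This is one possible approach, not necessarily the one the original author used. The result is kept as strings, because some of the digit runs in the sample (for example 11587120677873106105) do not fit in a signed 64-bit integer.

```python
import pandas as pd

# Two sample values copied from the head of the DataFrame
df = pd.DataFrame({"d": ["GbGXR/7198718882", "ylaMAz/121108977765122"]})

# Split each value once on "/" and keep the part after it
df["id"] = df["d"].str.split("/", n=1).str[1]
print(df)
```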
Plain loop over row indices, 7.23 s on average.
%%timeit
m3 = []
df = origin_df.copy(deep=True)
for i in range(0, len(df)):
    m3.append(simple_function(df.iloc[i]['m1']))
df['m3'] = m3
1 loop, best of 3: 7.23 s per loop
iterrows, 3.27 s on average.
%%timeit
m3 = []
df = origin_df.copy(deep=True)
for _, row in df.iterrows():
    m3.append(simple_function(row['m1']))
df['m3'] = m3
1 loop, best of 3: 3.27 s per loop
apply, 29.5 ms on average.
%%timeit
df = origin_df.copy(deep=True)
df['m3'] = df['m1'].apply(simple_function)
10 loops, best of 3: 29.5 ms per loop
Pandas Series (vectorized), 7.88 ms on average.
%%timeit
df = origin_df.copy(deep=True)
df['m3'] = simple_function(df['m1'])
100 loops, best of 3: 7.88 ms per loop
NumPy arrays (vectorized), 5.31 ms on average.
%%timeit
df = origin_df.copy(deep=True)
df['m3'] = simple_function(df['m1'].values)
100 loops, best of 3: 5.31 ms per loop
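%%timeit is an IPython cell magic, so the cells above only run inside a notebook or IPython session. The sketch below is a rough way to repeat the comparison in a plain Python script using time.perf_counter, and it also checks that all five approaches return the same values. The DataFrame here is a small hypothetical stand-in (2,000 rows, arbitrary seed), and the wrapper function names are made up for illustration, so the absolute timings will differ from those reported above.

```python
import time

import numpy as np
import pandas as pd


def simple_function(v):
    return (v**2 - v) // 2 + (v**0.5) // 2


# Hypothetical stand-in for origin_df; only the m1 column is needed here
rng = np.random.default_rng(0)
df = pd.DataFrame({"m1": rng.integers(0, 100, size=2_000)})


def plain_loop(df):
    return [simple_function(df.iloc[i]["m1"]) for i in range(len(df))]


def with_iterrows(df):
    return [simple_function(row["m1"]) for _, row in df.iterrows()]


def with_apply(df):
    return df["m1"].apply(simple_function).tolist()


def with_series(df):
    return simple_function(df["m1"]).tolist()


def with_numpy(df):
    return simple_function(df["m1"].to_numpy()).tolist()


results = {}
for fn in (plain_loop, with_iterrows, with_apply, with_series, with_numpy):
    start = time.perf_counter()
    results[fn.__name__] = fn(df)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__:>14}: {elapsed * 1000:8.2f} ms")

# Every approach should produce the same values
reference = results["with_numpy"]
all_equal = all(np.allclose(r, reference) for r in results.values())
print("all results equal:", all_equal)
```

A single run per approach is noisy. For steadier numbers, the standard library's timeit module can repeat each call several times, which is closer to what %%timeit does.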