numpy
v/s pandas
One subtle aspect that lot of engineers miss is the access control
design of numpy
and pandas
.
numpy
uses row-major access control.
pandas
uses column-major access control.
Genrally its seen that engineers complain about slow-nature of numpy
compared to pandas
. One of the prime culprits of this issue is the lack of knowledge about the access control used in these libraries.
By default, numpy
uses row-major but it allows the user to specify the access control while creating the ndarray
. Below, are all senarios of accessing the row/col using pandas or numpy is shown.
Itertating numpy ndarray
using column-major
| import numpy as np
df_np = df.to_numpy()
n_rows, n_cols = df_np.shape
start = time.time()
for j in range(n_cols):
for item in df_np[:, j]:
pass
print(time.time() - start, "seconds")
|
Itertating numpy ndarray
using row-major (default)
| import numpy as np
df_np = df.to_numpy()
n_rows, n_cols = df_np.shape
start = time.time()
for i in range(n_rows):
for item in df_np[i]:
pass
print(time.time()-start, "seconds")
|