numpy vs. pandas

One subtle aspect that many engineers miss is the memory layout (storage order) used by numpy and pandas.

  1. numpy uses a row-major layout by default.
  2. pandas uses a column-major layout.

Engineers often complain that numpy is slow compared to pandas. One of the prime culprits is a lack of awareness of the memory layout these libraries use: iterating against the layout jumps between non-adjacent memory locations instead of reading elements that sit next to each other.

By default, numpy uses row-major order, but it lets the user specify the order when creating the ndarray. The scenarios below show row and column access using pandas or numpy.
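To make the row-major versus column-major distinction concrete, here is a minimal sketch using only numpy. The 2x3 array and its values are made up for illustration. The `order` argument of `np.array` selects the layout, and `ravel(order="K")` reads the elements back in the order they sit in memory.

```python
import numpy as np

# Hypothetical 2x3 example; dtype is fixed so the strides are predictable.
c_arr = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.int64)             # default: row-major (C order)
f_arr = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.int64, order="F")  # column-major (Fortran order)

# Both arrays hold the same values at the same indices...
print((c_arr == f_arr).all())  # True

# ...but the values are laid out differently in memory.
print(c_arr.ravel(order="K"))  # [0 1 2 3 4 5]
print(f_arr.ravel(order="K"))  # [0 3 1 4 2 5]

# Strides: bytes to skip to reach the next row / the next column.
print(c_arr.strides)  # (24, 8)  -> items of a row are adjacent
print(f_arr.strides)  # (8, 16)  -> items of a column are adjacent

# The flags report which layout an array has.
print(c_arr.flags["C_CONTIGUOUS"], f_arr.flags["F_CONTIGUOUS"])  # True True

# An existing array can be copied into row-major order when needed.
as_c = np.ascontiguousarray(f_arr)
print(as_c.flags["C_CONTIGUOUS"])  # True
```

Checking `flags` on an array returned by `df.to_numpy()` is a quick way to see which layout you were actually handed before choosing a loop order.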

Iterating a numpy ndarray in column-major order

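import time

import numpy as np

# df is assumed to be an existing pandas DataFrame
df_np = df.to_numpy()
n_rows, n_cols = df_np.shape

start = time.time()
# Outer loop over columns, inner loop over the items of each column
for j in range(n_cols):
    for item in df_np[:, j]:
        pass

print(time.time() - start, "seconds")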
>>> 0.0058 seconds

Iterating a numpy ndarray in row-major order (default)

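import time

import numpy as np

# df is assumed to be an existing pandas DataFrame
df_np = df.to_numpy()
n_rows, n_cols = df_np.shape

start = time.time()
# Outer loop over rows, inner loop over the items of each row
for i in range(n_rows):
    for item in df_np[i]:
        pass

print(time.time() - start, "seconds")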
>>> 0.0195 seconds
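
The two snippets above depend on an existing `df`. As a rough, self-contained way to repeat the comparison, the sketch below times both loop orders on a row-major and a column-major copy of the same array. The helper name `time_iteration`, the array shape, and the zero-filled data are assumptions for illustration and are not taken from the measurements above. No timings are quoted because they depend on the machine and on the array shape: with Python-level loops, the number of slices taken (one per row versus one per column) also affects the result.

```python
import time

import numpy as np


def time_iteration(arr, by_rows):
    """Return the seconds spent looping over every item, row by row or column by column."""
    n_rows, n_cols = arr.shape
    start = time.perf_counter()
    if by_rows:
        for i in range(n_rows):
            for item in arr[i]:
                pass
    else:
        for j in range(n_cols):
            for item in arr[:, j]:
                pass
    return time.perf_counter() - start


# Hypothetical data: the shape is arbitrary and chosen to keep the run short.
c_arr = np.zeros((2000, 50))      # row-major (numpy default)
f_arr = np.asfortranarray(c_arr)  # same values, column-major

results = {
    ("row-major", "by rows"): time_iteration(c_arr, by_rows=True),
    ("row-major", "by columns"): time_iteration(c_arr, by_rows=False),
    ("column-major", "by rows"): time_iteration(f_arr, by_rows=True),
    ("column-major", "by columns"): time_iteration(f_arr, by_rows=False),
}

for (layout, direction), seconds in results.items():
    print(f"{layout} array, {direction}: {seconds:.4f} seconds")
```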