Pandas difference of successive elements

Assume I have a DataFrame like so:

df = pd.DataFrame(data=np.random.random((10, 10)))

I need to create a DataFrame (call it diff) in which every row i satisfies the following:

diff[i] = df[i]-df[i-1]

I can do this iteratively, but that doesn't scale well. How would you do this in pandas as fast as possible?

Solution:

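IIUC, use DataFrame.diff. With periods=-1 each row has the next row subtracted from it (df[i] - df[i+1]); the default periods=1 subtracts the previous row (df[i] - df[i-1]), which matches the formula in the question: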

import numpy as np
import pandas as pd

np.random.seed(2022)

df = pd.DataFrame(data=np.random.random((3,3)))
print(df)
          0         1         2
0  0.009359  0.499058  0.113384
1  0.049974  0.685408  0.486988
2  0.897657  0.647452  0.896963

df1 = df.diff(-1)
print(df1)
          0         1         2
0 -0.040615 -0.186350 -0.373604
1 -0.847683  0.037956 -0.409975
2       NaN       NaN       NaN


df2 = df.diff()
print(df2)
          0         1         2
0       NaN       NaN       NaN
1  0.040615  0.186350  0.373604
2  0.847683 -0.037956  0.409975
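
To make the direction of each call concrete, here is a small sketch that checks DataFrame.diff against an explicit subtraction of a shifted copy. The names backward and forward are only illustrative, and the 3x3 frame is the same seeded example as above.

```python
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.random((3, 3)))

# diff() subtracts the previous row: df[i] - df[i-1]
backward = df - df.shift(1)
# diff(-1) subtracts the next row: df[i] - df[i+1]
forward = df - df.shift(-1)

# Both raise AssertionError if the frames differ
pd.testing.assert_frame_equal(df.diff(), backward)
pd.testing.assert_frame_equal(df.diff(-1), forward)
```

The row with no neighbour to subtract (the first for diff(), the last for diff(-1)) is filled with NaN in both forms.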

NumPy alternatives for better performance, using numpy.diff and the DataFrame constructor:

# np.diff computes a[i+1] - a[i]; negating the input gives a[i] - a[i+1], like df.diff(-1)
df1 = pd.DataFrame(np.diff(-df, axis=0, append=np.nan),
                   index=df.index, columns=df.columns)
print(df1)
          0         1         2
0 -0.040615 -0.186350 -0.373604
1 -0.847683  0.037956 -0.409975
2       NaN       NaN       NaN

df2 = pd.DataFrame(np.diff(df, axis=0, prepend=np.nan), 
                   index=df.index, columns=df.columns)
print(df2)
          0         1         2
0       NaN       NaN       NaN
1  0.040615  0.186350  0.373604
2  0.847683 -0.037956  0.409975
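
As a sanity check, the NumPy route can be wrapped in a small function and compared against DataFrame.diff. This is a sketch only: numpy_diff is a hypothetical helper name, not part of pandas or NumPy, and it assumes a frame with numeric columns.

```python
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.random((3, 3)))

def numpy_diff(frame):
    """Row-wise backward difference via numpy.diff (hypothetical helper)."""
    # prepend=np.nan adds a NaN row first, so the output keeps the input shape
    values = np.diff(frame.to_numpy(), axis=0, prepend=np.nan)
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)

result = numpy_diff(df)
# Raises AssertionError if the two approaches disagree
pd.testing.assert_frame_equal(result, df.diff())
```

Passing the original index and columns back to the constructor is what keeps the labels aligned, since numpy.diff returns a plain array.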

Performance:

np.random.seed(2022)

df = pd.DataFrame(data=np.random.random((3000,3000)))


In [75]: %timeit df.diff()
142 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [76]: %timeit pd.DataFrame(np.diff(df, axis=0, prepend=np.nan), index=df.index, columns=df.columns)
77.1 ms ± 469 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
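
The %timeit lines above need IPython. To repeat the comparison from a plain script, something like the following sketch with the standard timeit module could be used. The 500x500 size and the function names are assumptions chosen so that it runs quickly; absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2022)
# Smaller than the 3000x3000 frame above so the script finishes quickly (assumed size)
df = pd.DataFrame(np.random.random((500, 500)))

def pandas_diff():
    return df.diff()

def numpy_diff():
    return pd.DataFrame(np.diff(df, axis=0, prepend=np.nan),
                        index=df.index, columns=df.columns)

# Best of 5 repeats, 10 calls each, reported per call
timings = {}
for func in (pandas_diff, numpy_diff):
    best = min(timeit.repeat(func, number=10, repeat=5)) / 10
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```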