Quickest way to iterate over a pandas dataframe and perform an operation on a column, when what the operation is depends on the row

I have a table that is laid out somewhat like this:

t        linenum  [many other columns]
1234567  0
1234568  0
1234569  0
1234570  1
1234571  1

Except it is very, very large. The raw .dat files can get up to 20 GB. I have them converted into .h5 files so they are slightly smaller, but still large (about half the size, I'd say).

I want to add a "time within line" column: for each row, it subtracts the first time value of that row's line from the row's time, so I end up with something like this:


t        linenum  time within line
1234567  0        0
1234568  0        1
1234569  0        2
1234570  1        0
1234571  1        1

The thing is, while I know that doing an operation on the whole dataframe at once is much faster, I haven't been able to figure out how to do this without using a for loop, since the number that needs to be subtracted depends on linenum, and the loop takes ages. (Yesterday, I tested this on a file about 9 GB big, and I gave up and went home after it had been processing for half an hour, only to find this morning that my computer had restarted overnight, so the Jupyter server had to restart and I lost the processed dataframe...) Here are the relevant parts of the code I currently have:

import pandas as pd

file = [h5 file address]
df = pd.read_hdf(file)

# For each line, subtract that line's first time value from every row in it
for linenum in pd.unique(df['linenum']):
    mask = df['linenum'] == linenum
    first_t = int(df.loc[mask, 't'].iloc[0])
    df.loc[mask, 't_adjusted'] = df.loc[mask, 't'] - first_t

Is there any way to do this without a for loop, and if not, is there any way to make it faster? I’m trying to graph one of the other columns using matplotlib.pyplot.tricontourf, with linenum on the x axis and time within line on the y axis, if that’s relevant at all. There is another column I can use as a workaround because it’s approximately proportional to time within line but I’d prefer to find a way to use the time. Thank you!

Edit: Also, if it’s relevant, I am using Python 3.7. For some reason some of the computers my programs have to run on at my work are still on Windows 7 so I can’t update…

>Solution :

You can use a groupby on 'linenum' and then transform, which broadcasts the per-group result back to every row of that group:

df['timewithinline'] = df.groupby('linenum')['t'].transform(lambda x: x - x.min())

If the times are already sorted, you can use:

df['timewithinline'] = df.groupby('linenum')['t'].transform(lambda x: x - x.iloc[0])
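A minimal sketch on the sample data from the question. Note that you can also skip the Python-level lambda entirely by using the built-in 'min' aggregation and subtracting outside the groupby, which is usually faster on very large frames:

```python
import pandas as pd

# Sample matching the table in the question
df = pd.DataFrame({
    "t": [1234567, 1234568, 1234569, 1234570, 1234571],
    "linenum": [0, 0, 0, 1, 1],
})

# transform('min') returns each group's minimum aligned to the original rows,
# so the subtraction is a single vectorized operation over the whole column
df["timewithinline"] = df["t"] - df.groupby("linenum")["t"].transform("min")

print(df["timewithinline"].tolist())  # [0, 1, 2, 0, 1]
```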