Quickest way to iterate over a pandas dataframe and perform an operation on a column, when what the operation is depends on the row

I have a table that is laid out somewhat like this:

t        linenum  [many other columns]
1234567  0
1234568  0
1234569  0
1234570  1
1234571  1

Except it is very, very large. The raw .dat files can get up to 20 GB. I have them converted into .h5 files so they are slightly smaller, but still large (about half the size, I'd say).

I want to add a "time within line" column: for each row, it subtracts the first time value of that row's line from the row's time, so I end up with something like this:


t        linenum  time within line
1234567  0        0
1234568  0        1
1234569  0        2
1234570  1        0
1234571  1        1

The thing is, while I know that doing an operation on the whole dataframe at once is much faster, I haven't been able to figure out how to do this without using a for loop, since the number that needs to be subtracted depends on linenum, and the loop takes ages. (Yesterday, I tested this on a file about 9 GB big, and I gave up and went home after it had been processing for half an hour, only to find this morning that my computer had restarted overnight, so the Jupyter server had to restart and I lost the processed dataframe...) Here are the relevant parts of the code I currently have:

import pandas as pd

file = [h5 file address]
df = pd.read_hdf(file)

# For each line, subtract that line's first time value from every row in it
for linenum in pd.unique(df['linenum']):
    mask = df['linenum'] == linenum
    first_t = int(df.loc[mask, 't'].iloc[0])
    df.loc[mask, 't_adjusted'] = df.loc[mask, 't'] - first_t

Is there any way to do this without a for loop, and if not, is there any way to make it faster? I’m trying to graph one of the other columns using matplotlib.pyplot.tricontourf, with linenum on the x axis and time within line on the y axis, if that’s relevant at all. There is another column I can use as a workaround because it’s approximately proportional to time within line but I’d prefer to find a way to use the time. Thank you!

Edit: Also, if it’s relevant, I am using Python 3.7. For some reason some of the computers my programs have to run on at my work are still on Windows 7 so I can’t update…

>Solution :

You can use a groupby on 'linenum' and then transform, which broadcasts the per-group result back to every row of that group:

df['timewithinline'] = df.groupby('linenum')['t'].transform(lambda x: x - x.min())

If the times are already sorted, you can use:

df['timewithinline'] = df.groupby('linenum')['t'].transform(lambda x: x - x.iloc[0])
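A minimal sketch on the sample data from the question. Note that you can also skip the Python-level lambda entirely by using the built-in 'min' aggregation and subtracting outside the groupby, which is usually faster on very large frames:

```python
import pandas as pd

# Sample matching the table in the question
df = pd.DataFrame({
    "t": [1234567, 1234568, 1234569, 1234570, 1234571],
    "linenum": [0, 0, 0, 1, 1],
})

# transform('min') returns each group's minimum aligned to the original rows,
# so the subtraction is a single vectorized operation over the whole column
df["timewithinline"] = df["t"] - df.groupby("linenum")["t"].transform("min")

print(df["timewithinline"].tolist())  # [0, 1, 2, 0, 1]
```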