Very slow aggregate on Pandas 2.0 dataframe with pyarrow as dtype_backend

Let’s say I have the following dataframe:

code   price
AA1    10
AA1    20
BB2    30

And I want to perform the following operation on it:

df.groupby("code").aggregate({
    "price": "sum"
})

I have tried playing with the new pyarrow dtypes introduced in Pandas 2.0: I created three copies of the dataframe and, for each copy, measured the execution time of the operation above (average of 5 runs).


Code column dtype   Price column dtype   Execution time
object              float64              2.94 s
string[pyarrow]     double[pyarrow]      49.5 s
string[pyarrow]     float64              1.11 s

Can anyone explain why applying an aggregate function on a column with double pyarrow dtype is so slow compared to the standard numpy float64 dtype?

> Solution:

https://github.com/pandas-dev/pandas/issues/52070

It looks like groupby aggregation isn't natively implemented for Arrow-backed columns yet, so there is likely an arrow -> numpy conversion happening internally, leading to the loss of performance.
