Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

A lightweight approach to processing Django Queryset data

I am looking for a optimal way to perform simple data processing from Django Queryset. I would like to not need to install libraries with high volumes like Pandas or numpy. The number of rows in Queryset should not exceed 2000. The idea is to perform basic functions such as x1, x2, x3 below. I want to avoid a separate database query for each of them so I’m thinking to get all this data once from the database, convert it to a standard data structure and calculate the results.

Queryset Sample:

T1_id    T2_id   T1_value    T2_value
  1        2        2           0
  3        5        0           0
  4        1        1           1
  2        7        0           3

Pandas code equivalents:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

data = [[1, 2, 2, 0], [3, 5, 0, 0], [4, 1, 1, 1], [2, 7, 0, 3]]

df = pd.DataFrame(data, columns=['T1_id', 'T2_id', 'T1_value', 'T2_value'])

x1 = df[df['T1_id'] == 1]['T1_value'].mean()  # Mean of T1_value from rows where T1_id == 1
x2 = df[df['T1_id'] == 1]['T2_value'].sum()  # Sum of T2_value from rows where T1_id == 1
x3 = len(df[df['T1_id'] == 1])  # Number of rows where T1_id == 1

>Solution :

You can easily calculate these all with an .aggregate(…) [Django-doc]:

from django.db.models import Avg, Count, Sum

MyModel.objects.filter(T1_id=1).aggregate(
    x1=Avg('T1_value'), x2=Sum('T2_value'), x3=Count('*')
)

this will return a dictionary that maps the items to the corresponding result, for example:

{'x1': 14.25, 'x2': 1302, 'x3': 42}

By doing this as a query, the database will calculate this, and database is closer to the data and often is optimized to make aggregates, so will likely outperform pandas quite significantly. Especially since if we would use pandas to aggregate, it would first need to move the data from the database to pandas, which will likely take longer than the database determining the values directly.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading