Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to create a new data frame from a larger dataset

I am working with a dataset (10000 data points) that provides 100 different account numbers with transaction amounts, date and time of transactions etc.

From this dataset I want to create a separate data frame for one account number, which then contains all the transactions (ordered by time) that that account number made throughout the year.

I tried to do this by:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

group = df.groupby('account_num')

which then gives me

pandas.core.groupby.generic.DataFrameGroupBy

Then, when I want to get the group for a specific account number, say 51234:

group.get_group('51234') 

I receive an error:

KeyError: 51234

How can I make a separate data frame containing all the transaction for one single account number?

(Sorry if this is a very basic question, Im a newbie)

>Solution :

IIUC, you can get your output in a slightly different way. You can start by making sure your time column, which I assume is a date based on your description, is actually a datetime object, and then filtering your dataframe for the specific account number – there are plenty of ways to do this, a common one is loc, but in my case I use query. Then you can sort based on your date, using sort_values, and lastly you can use groupby on the year part of your date column:

# Convert your date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Filter and sort
>>> print(df.query('account_num == 51234')\
    .sort_values(by=['date'],ascending=True))

# Equivalently with loc  
print(
      df.loc[df['account_num'] == 51234]\
    .sort_values(by=['date'],ascending=True))
  
    account_num       date
0         51234 2020-01-01
1         51234 2020-02-01
2         51234 2020-03-01
7         51234 2020-08-01
9         51234 2020-08-01
11        51234 2020-08-01
13        51234 2020-08-01
3         51234 2021-04-01
4         51234 2021-05-01
5         51234 2023-06-01
6         51234 2023-07-01
8         51234 2023-07-01
10        51234 2023-07-01
12        51234 2023-07-01

# Filter, sort, and get yearly count
>>> print(
      df.query('account_num == 51234')\
    .sort_values(by=['date'],ascending=True)\
        .groupby(df['date'].dt.year).account_num.count())

date
2020    7
2021    2
2023    5

Based on the below sample DF:

{'account_num': {0: 51234,
  1: 51234,
  2: 51234,
  3: 51234,
  4: 51234,
  5: 51234,
  6: 51234,
  7: 51234,
  8: 51234,
  9: 51234,
  10: 51234,
  11: 51234,
  12: 51234,
  13: 51234,
  14: 512346,
  15: 512346,
  16: 512346,
  17: 512346,
  18: 512346,
  19: 512346,
  20: 512346,
  21: 512346,
  22: 512346,
  23: 13123,
  24: 13123,
  25: 13123,
  26: 13123,
  27: 13123,
  28: 13123,
  29: 13123,
  30: 13123,
  31: 13123},
 'date': {0: '01/01/2020',
  1: '02/01/2020',
  2: '03/01/2020',
  3: '04/01/2021',
  4: '05/01/2021',
  5: '06/01/2023',
  6: '07/01/2023',
  7: '08/01/2020',
  8: '07/01/2023',
  9: '08/01/2020',
  10: '07/01/2023',
  11: '08/01/2020',
  12: '07/01/2023',
  13: '08/01/2020',
  14: '09/01/2020',
  15: '10/01/2020',
  16: '11/01/2020',
  17: '12/01/2020',
  18: '13/01/2020',
  19: '14/01/2020',
  20: '15/01/2020',
  21: '16/01/2020',
  22: '17/01/2020',
  23: '18/01/2020',
  24: '19/01/2020',
  25: '20/01/2020',
  26: '21/01/2020',
  27: '22/01/2020',
  28: '23/01/2020',
  29: '24/01/2020',
  30: '25/01/2020',
  31: '26/01/2020'}}
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading