How to create a new data frame from a larger dataset

January 11, 2022

I am working with a dataset (10000 data points) that provides 100 different account numbers with transaction amounts, date and time of transactions etc.

From this dataset I want to create a separate data frame for one account number, which then contains all the transactions (ordered by time) that that account number made throughout the year.

I tried to do this by:

group = df.groupby('account_num')

which then gives me

pandas.core.groupby.generic.DataFrameGroupBy

Then, when I want to get the group for a specific account number, say 51234:

group.get_group('51234')

I receive an error:

KeyError: 51234

How can I make a separate data frame containing all the transaction for one single account number?

(Sorry if this is a very basic question, Im a newbie)

>Solution :

IIUC, you can get your output in a slightly different way. You can start by making sure your time column, which I assume is a date based on your description, is actually a datetime object, and then filtering your dataframe for the specific account number – there are plenty of ways to do this, a common one is loc, but in my case I use query. Then you can sort based on your date, using sort_values, and lastly you can use groupby on the year part of your date column:

# Convert your date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Filter and sort
>>> print(df.query('account_num == 51234')\
    .sort_values(by=['date'],ascending=True))

# Equivalently with loc  
print(
      df.loc[df['account_num'] == 51234]\
    .sort_values(by=['date'],ascending=True))
  
    account_num       date
0         51234 2020-01-01
1         51234 2020-02-01
2         51234 2020-03-01
7         51234 2020-08-01
9         51234 2020-08-01
11        51234 2020-08-01
13        51234 2020-08-01
3         51234 2021-04-01
4         51234 2021-05-01
5         51234 2023-06-01
6         51234 2023-07-01
8         51234 2023-07-01
10        51234 2023-07-01
12        51234 2023-07-01

# Filter, sort, and get yearly count
>>> print(
      df.query('account_num == 51234')\
    .sort_values(by=['date'],ascending=True)\
        .groupby(df['date'].dt.year).account_num.count())

date
2020    7
2021    2
2023    5

Based on the below sample DF:

{'account_num': {0: 51234,
  1: 51234,
  2: 51234,
  3: 51234,
  4: 51234,
  5: 51234,
  6: 51234,
  7: 51234,
  8: 51234,
  9: 51234,
  10: 51234,
  11: 51234,
  12: 51234,
  13: 51234,
  14: 512346,
  15: 512346,
  16: 512346,
  17: 512346,
  18: 512346,
  19: 512346,
  20: 512346,
  21: 512346,
  22: 512346,
  23: 13123,
  24: 13123,
  25: 13123,
  26: 13123,
  27: 13123,
  28: 13123,
  29: 13123,
  30: 13123,
  31: 13123},
 'date': {0: '01/01/2020',
  1: '02/01/2020',
  2: '03/01/2020',
  3: '04/01/2021',
  4: '05/01/2021',
  5: '06/01/2023',
  6: '07/01/2023',
  7: '08/01/2020',
  8: '07/01/2023',
  9: '08/01/2020',
  10: '07/01/2023',
  11: '08/01/2020',
  12: '07/01/2023',
  13: '08/01/2020',
  14: '09/01/2020',
  15: '10/01/2020',
  16: '11/01/2020',
  17: '12/01/2020',
  18: '13/01/2020',
  19: '14/01/2020',
  20: '15/01/2020',
  21: '16/01/2020',
  22: '17/01/2020',
  23: '18/01/2020',
  24: '19/01/2020',
  25: '20/01/2020',
  26: '21/01/2020',
  27: '22/01/2020',
  28: '23/01/2020',
  29: '24/01/2020',
  30: '25/01/2020',
  31: '26/01/2020'}}