Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Construction of a joint dataframe under specific conditions in python

From the following python dataframe:

country_ID            date        ID  visit_ENG  visit_FRA  visit_ESP      visit_time
   ENG   2022-02-04 16:30:21       3          1          0          0 0 days 01:00:00    
   ENG   2022-02-04 16:40:21       3          1          0          0 0 days 00:05:00    
   ENG   2022-02-06 16:35:21       3          1          0          0 1 days 19:55:00   
   ENG   2022-02-04 10:10:21       2          1          0          0             NaT   
   ESP   2022-02-04 15:10:21       2          1          1          1             NaT   
   ENG   2022-02-04 12:35:21       1          1          0          0             NaT   
   ENG   2022-02-04 16:10:21       1          0          0          0 0 days 03:35:00   
   ESP   2022-02-04 14:23:21       0          0          0          1             NaT   
   ESP   2022-02-04 15:27:21       0          1          0          0 0 days 01:04:00   
   FRA   2022-02-04 15:35:21       0          0          0          0             NaT   
   ENG   2022-02-04 16:35:21       0          0          0          0 0 days 00:35:00                

I have extracted the following information in the following subsets, using the following code:

  1. Sum of visit columns for each ID.
    visit_df = pd.DataFrame()
    visit_df['visit_ENG'] = df.groupby('ID')['visit_ENG'].sum()
    visit_df['visit_FRA'] = df.groupby('ID')['visit_FRA'].sum()
    visit_df['visit_ESP'] = df.groupby('ID')['visit_ESP'].sum()
    
    print(visit_df)

              visit_ENG  visit_FRA  visit_ESP
ID                                 
0                     1          0          1
1                     1          0          0
2                     2          1          1
3                     3          0          0
  1. Mean of each country_ID group for each ID.
    subset_avg = pd.to_timedelta(subset_out['visit_time'].dt.total_seconds() \
                                 .groupby([subset_out['ID'], subset_out['country_ID']])
                                 .mean(), unit='s').fillna(pd.Timedelta(days=0)).unstack()
                                 .add_prefix('avg_visit_')
    subset_avg.columns.names = ['']
    
    print(subset_avg)

               avg_visit_ESP         avg_visit_ENG
ID                                
0            0 days 01:04:00        0 days 00:35:00
1                        NaT        0 days 03:35:00
3                        NaT        0 days 15:00:00
  1. Standard deviation of each country_ID group for each ID.
    subset_std = pd.to_timedelta(df['visit_time'].dt.total_seconds() \
                                 .groupby([df['ID'],df['country_ID']])
                                 .std(),unit='s').fillna(pd.Timedelta(days=0)).unstack()
                                 .add_prefix('std_visit_')
    subset_std.columns.names = ['']
    
    print(subset_std)

             std_visit_ESP             std_visit_ENG
ID                                        
0          0 days 00:00:00           0 days 00:00:00
1                      NaT           0 days 00:00:00
3                      NaT           1 days 01:02:48.342559903

I would like to make a joint table with all the extracted data. I need to concatenate all the tables above (visit_df,subset_avg, subset_std), create a column for the ID and complete the data that does not appear in the tables as NaT. As you can see below:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID visit_ENG visit_FRA visit_ESP avg_visit_ESP avg_visit_ENG std_visit_ESP std_visit_ENG
0 1 0 1 0 days 01:04:00 0 days 00:35:00 0 days 00:00:00 0 days 00:00:00
1 1 0 0 NaT 0 days 03:35:00 NaT 0 days 00:00:00
2 2 1 1 NaT NaT NaT NaT
3 3 0 0 NaT 0 days 15:00:00 NaT 1 days 01:02:48.342559903

If there is a way to build the solution I ask for, but using other python commands that are more efficient, let me know.

>Solution :

Are you looking for pd.concat:

>>> pd.concat([visit_df, subset_avg, subset_std], axis=1).reset_index()
   ID  visit_ENG  visit_FRA  visit_ESP    avg_visit_ESP    avg_visit_ENG    std_visit_ESP              std_visit_ENG
0   0          1          0          1  0 days 01:04:00  0 days 00:35:00  0 days 00:00:00            0 days 00:00:00
1   1          1          0          0              NaT  0 days 03:35:00              NaT            0 days 00:00:00
2   2          2          1          1              NaN              NaN              NaN                        NaN
3   3          3          0          0              NaT  0 days 15:00:00              NaT  1 days 01:02:48.342559903

Tip for visit_id:

visit_df = df.groupby('ID').sum()
print(visit_df)

# Output
    visit_ENG  visit_FRA  visit_ESP
ID                                 
0           1          0          1
1           1          0          0
2           2          1          1
3           3          0          0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading