Construction of a joint dataframe under specific conditions in python

March 16, 2022

From the following python dataframe:

country_ID            date        ID  visit_ENG  visit_FRA  visit_ESP      visit_time
   ENG   2022-02-04 16:30:21       3          1          0          0 0 days 01:00:00    
   ENG   2022-02-04 16:40:21       3          1          0          0 0 days 00:05:00    
   ENG   2022-02-06 16:35:21       3          1          0          0 1 days 19:55:00   
   ENG   2022-02-04 10:10:21       2          1          0          0             NaT   
   ESP   2022-02-04 15:10:21       2          1          1          1             NaT   
   ENG   2022-02-04 12:35:21       1          1          0          0             NaT   
   ENG   2022-02-04 16:10:21       1          0          0          0 0 days 03:35:00   
   ESP   2022-02-04 14:23:21       0          0          0          1             NaT   
   ESP   2022-02-04 15:27:21       0          1          0          0 0 days 01:04:00   
   FRA   2022-02-04 15:35:21       0          0          0          0             NaT   
   ENG   2022-02-04 16:35:21       0          0          0          0 0 days 00:35:00

I have extracted the following information in the following subsets, using the following code:

Sum of visit columns for each ID.

    visit_df = pd.DataFrame()
    visit_df['visit_ENG'] = df.groupby('ID')['visit_ENG'].sum()
    visit_df['visit_FRA'] = df.groupby('ID')['visit_FRA'].sum()
    visit_df['visit_ESP'] = df.groupby('ID')['visit_ESP'].sum()
    
    print(visit_df)

              visit_ENG  visit_FRA  visit_ESP
ID                                 
0                     1          0          1
1                     1          0          0
2                     2          1          1
3                     3          0          0

Mean of each country_ID group for each ID.

    subset_avg = pd.to_timedelta(subset_out['visit_time'].dt.total_seconds() \
                                 .groupby([subset_out['ID'], subset_out['country_ID']])
                                 .mean(), unit='s').fillna(pd.Timedelta(days=0)).unstack()
                                 .add_prefix('avg_visit_')
    subset_avg.columns.names = ['']
    
    print(subset_avg)

               avg_visit_ESP         avg_visit_ENG
ID                                
0            0 days 01:04:00        0 days 00:35:00
1                        NaT        0 days 03:35:00
3                        NaT        0 days 15:00:00

Standard deviation of each country_ID group for each ID.

    subset_std = pd.to_timedelta(df['visit_time'].dt.total_seconds() \
                                 .groupby([df['ID'],df['country_ID']])
                                 .std(),unit='s').fillna(pd.Timedelta(days=0)).unstack()
                                 .add_prefix('std_visit_')
    subset_std.columns.names = ['']
    
    print(subset_std)

             std_visit_ESP             std_visit_ENG
ID                                        
0          0 days 00:00:00           0 days 00:00:00
1                      NaT           0 days 00:00:00
3                      NaT           1 days 01:02:48.342559903

I would like to make a joint table with all the extracted data. I need to concatenate all the tables above (visit_df,subset_avg, subset_std), create a column for the ID and complete the data that does not appear in the tables as NaT. As you can see below:

ID	visit_ENG	visit_FRA	visit_ESP	avg_visit_ESP	avg_visit_ENG	std_visit_ESP	std_visit_ENG
0	1	0	1	0 days 01:04:00	0 days 00:35:00	0 days 00:00:00	0 days 00:00:00
1	1	0	0	NaT	0 days 03:35:00	NaT	0 days 00:00:00
2	2	1	1	NaT	NaT	NaT	NaT
3	3	0	0	NaT	0 days 15:00:00	NaT	1 days 01:02:48.342559903

If there is a way to build the solution I ask for, but using other python commands that are more efficient, let me know.

>Solution :

Are you looking for pd.concat:

>>> pd.concat([visit_df, subset_avg, subset_std], axis=1).reset_index()
   ID  visit_ENG  visit_FRA  visit_ESP    avg_visit_ESP    avg_visit_ENG    std_visit_ESP              std_visit_ENG
0   0          1          0          1  0 days 01:04:00  0 days 00:35:00  0 days 00:00:00            0 days 00:00:00
1   1          1          0          0              NaT  0 days 03:35:00              NaT            0 days 00:00:00
2   2          2          1          1              NaN              NaN              NaN                        NaN
3   3          3          0          0              NaT  0 days 15:00:00              NaT  1 days 01:02:48.342559903

Tip for visit_id:

visit_df = df.groupby('ID').sum()
print(visit_df)

# Output
    visit_ENG  visit_FRA  visit_ESP
ID                                 
0           1          0          1
1           1          0          0
2           2          1          1
3           3          0          0