I am new here so don’t know how to use this site.
I have time-series data for 37404 ICU patients. Each patient has multiple rows. I want to downsample my dataframe and select only 2932 patients (keeping all rows for each selected patient ID). Can anyone help me? My data looks like this:
| HR | SBP | DBP | Sepsis | P_ID |
|---|---|---|---|---|
| 92 | 120 | 80 | 0 | 0 |
| 98 | 115 | 85 | 0 | 0 |
| 93 | 125 | 75 | 0 | 1 |
| 95 | 130 | 90 | 0 | 1 |
| 102 | 120 | 80 | 0 | 1 |
| 109 | 115 | 75 | 0 | 2 |
| 94 | 135 | 100 | 0 | 2 |
| 97 | 100 | 70 | 0 | 3 |
| 85 | 120 | 80 | 0 | 4 |
| 88 | 115 | 75 | 0 | 4 |
| 93 | 125 | 85 | 0 | 4 |
| 78 | 130 | 90 | 0 | 5 |
| 115 | 140 | 110 | 0 | 5 |
| 102 | 120 | 80 | 0 | 5 |
| 98 | 140 | 110 | 0 | 5 |
I know I should use some condition on P_ID column, but I am confused.
Thanks for the help.
>Solution :
Use numpy.random.choice (with NumPy imported as np) to pick 2932 random P_ID values without replacement, then keep the matching rows with Series.isin and boolean indexing:
df2 = df[df['P_ID'].isin(np.random.choice(df['P_ID'].unique(), size=2932, replace=False))]
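Alternative, using only pandas (drop_duplicates gives one row per patient, and sample picks n of them without replacement by default):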
df2 = df[df['P_ID'].isin(df['P_ID'].drop_duplicates().sample(n=2932))]
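A minimal sketch of the same idea on the sample rows from the question, picking 3 of the 6 patients instead of 2932. The names `n_patients` and `sampled_ids` and the `random_state=0` seed are assumptions added here only to make the result reproducible; they are not part of the original answer.

```python
import pandas as pd

# Toy data copied from the question (6 patients, 15 rows).
df = pd.DataFrame({
    "HR":  [92, 98, 93, 95, 102, 109, 94, 97, 85, 88, 93, 78, 115, 102, 98],
    "SBP": [120, 115, 125, 130, 120, 115, 135, 100, 120, 115, 125, 130, 140, 120, 140],
    "DBP": [80, 85, 75, 90, 80, 75, 100, 70, 80, 75, 85, 90, 110, 80, 110],
    "Sepsis": [0] * 15,
    "P_ID": [0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
})

n_patients = 3  # hypothetical; would be 2932 for the real data

# One entry per patient, then a random subset of those IDs.
# random_state is an assumption added so the draw can be repeated.
sampled_ids = df["P_ID"].drop_duplicates().sample(n=n_patients, random_state=0)

# Keep every row that belongs to a sampled patient.
df2 = df[df["P_ID"].isin(sampled_ids)]

print(df2["P_ID"].nunique())  # number of patients kept
print(df2)
```

Because the filter is applied to the whole `P_ID` column, every row of a selected patient is kept and the original row order is preserved.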
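EDIT: If the selected patients should also appear in random order (each patient's rows kept together, patients in the sampled order), use a right merge: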
df1 = df['P_ID'].drop_duplicates().sample(n=2932).to_frame('P_ID')
df2 = df.merge(df1, how='right')
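A small sketch of what the right merge does to the ordering, using a reduced version of the question's data. It relies on the documented behaviour that `how='right'` preserves the key order of the right frame; the `random_state=1` seed and the `patient_order` variable are assumptions added for illustration.

```python
import pandas as pd

# Reduced toy data: only HR and P_ID from the question.
df = pd.DataFrame({
    "HR":   [92, 98, 93, 95, 102, 109, 94, 97, 85, 88, 93, 78, 115, 102, 98],
    "P_ID": [0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
})

# Sample 3 patient IDs (seed is an assumption, for repeatability).
df1 = df["P_ID"].drop_duplicates().sample(n=3, random_state=1).to_frame("P_ID")

# Right merge on the shared P_ID column: only sampled patients survive,
# and they appear in the order in which they were sampled.
df2 = df.merge(df1, how="right")

# Order in which patients appear in the result.
patient_order = df2["P_ID"].drop_duplicates().tolist()
print(df1["P_ID"].tolist())
print(patient_order)
```

Note that the merge resets the index, so `df2` no longer carries the original row labels; the `isin` variants keep them.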