I have a very specific dataset it looks something like this:
| record_id | event_id | instrument | repeat_inst |
|---|---|---|---|
| PI0005 | v03_abc_1 | NaN |
1 |
| PI0005 | v03_abc_1 | i_sensor | NaN |
| PI0005 | v03_abc_1 | NaN |
NaN |
| PI0005 | v02_abc_33 | i_sensor | NaN |
| PI0005 | v02_abc_33 | NaN |
NaN |
| PI0006 | v02_abc_1 | i_sensor | 1 |
| PI0006 | v02_abc_1 | NaN |
NaN |
How do I make it look like this:
| record_id | event_id | instrument | repeat_inst |
|---|---|---|---|
| PI0005 | v03_abc_1 | i_sensor | 1 |
| PI0005 | v02_abc_33 | i_sensor | NaN |
| PI0006 | v02_abc_2 | i_sensor | 1 |
Where rows with the same record_id and event_id get merged together, where NaN values are replaced with the other value, and if both values are NaN, then NaN can be kept (like in the forth and fifth row in the original dataframe).
Assume that only one of the related cells have a value and all others have NaN.
This should apply to all columns of the data, there are thousands of columns and rows.
I tried using group by, but don’t know how to continue.
>Solution :
With R
library(dplyr)
df1 %>%
group_by(record_id, event_id) %>%
summarise(across(everything(), ~.x[!is.na(.x)][1]))