I scraped this massive (10M+ entries) Twitter dataset using academictwitteR, and as I prepare to do some network analysis, I’ve come up against an issue: when a tweet is a response to another user, the dataset identifies that user only by their user ID (see mock-up below). What I am trying to do across this dataset is a conditional replace whereby the user ID in the "In_response_to" column is replaced by the corresponding username.
Current database
| ID_column | Username | In_response_to |
|---|---|---|
| ID12345 | JohnA | NA |
| ID54321 | JaneB | ID12345 |
| ID51243 | MarkE | ID54321 |
Desired outcome
| ID_column | Username | In_response_to |
|---|---|---|
| ID12345 | JohnA | NA |
| ID54321 | JaneB | JohnA |
| ID51243 | MarkE | JaneB |
I have looked extensively through SO and other forums for a solution, but I haven’t managed to find one. Being relatively new to R, I am sure the answer will be staring me in the face…
>Solution :
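library(dplyr)

# Rebuild the mock-up from the question (whitespace-separated, no header row)
data_df <- read.table(text = '
ID12345 JohnA NA
ID54321 JaneB ID12345
ID51243 MarkE ID54321
', header = FALSE, col.names = c('ID_column', 'Username', 'In_response_to'))

# Named vector: names are the user IDs, values are the usernames
lookup_list <- data_df$Username |> setNames(data_df$ID_column)

# Swap each ID in In_response_to for its username; NA stays NA
data_df |>
  mutate(In_response_to = recode(In_response_to, !!!lookup_list))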
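
With 10M+ rows, a vectorised lookup may scale better than handing `recode()` one replacement per ID. Below is a minimal sketch of that idea using base R's `match()`, assuming the column names from the mock-up; `idx` and `result_df` are illustrative names, not part of the original answer. If a user appears in several rows, `match()` uses the first occurrence.

```r
# Mock-up from the question; column names are taken from it
data_df <- data.frame(
  ID_column      = c("ID12345", "ID54321", "ID51243"),
  Username       = c("JohnA", "JaneB", "MarkE"),
  In_response_to = c(NA, "ID12345", "ID54321"),
  stringsAsFactors = FALSE
)

# Position of each replied-to ID within ID_column (NA where there is no match)
idx <- match(data_df$In_response_to, data_df$ID_column)

# Take the username at that position; keep the original value when the
# replied-to ID is not in the dataset (NA stays NA either way)
result_df <- data_df
result_df$In_response_to <- ifelse(
  is.na(idx),
  data_df$In_response_to,
  data_df$Username[idx]
)

result_df
```

IDs of users who were replied to but never scraped are left as they are, so they can be spotted and handled separately before building the network.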