Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Simulate effect of confounder on two random variables

I want to generate some data in order to show partial correlation to control for a confounder.

Specifically, I want to generate data about two uncorrelated random variables (let’s say speech and memory) and use a third variable to influence them both (age).

I’d expect to observe a strong correlation between speech and memory, due to the confounder age, and no correlation between the same two variables if I control for age (that is, calculate a partial correlation on age).

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

That said, I can’t generate the strong correlation with my code.


age <- rep(1:10, 10)

speech <- age * abs(rnorm(100))
memory <- age * abs(rnorm(100))

cor(speech, memory) # correlation, it should be high but it's not

residuals_speech <- lm(speech ~ age)$residuals
residuals_memory <- lm(memory ~ age)$residuals

cor(residuals_speech, residuals_memory) # partial correlation controlling for age, it should be around zero

>Solution :

The random noise needs to be additive, not multiplicative:

set.seed(1)

age <- rep(1:10, 10)

speech <- age + rnorm(100)
memory <- age + rnorm(100)

Now we have a high correlation between speech and memory:

cor(speech, memory)
#> [1] 0.9067772

But the correlation of the residuals once age is taken into account is close to zero, as expected.

residuals_speech <- lm(speech ~ age)$residuals
residuals_memory <- lm(memory ~ age)$residuals

cor(residuals_speech, residuals_memory) 
#> [1] 0.0001755437

To get speech and memory to be on different scales with different amounts of noise (as real data might be), you can multiply age and the random noise by scalar values. This will retain the high correlation caused by the confounder.

set.seed(1)

age <- rep(1:10, 10)

speech <- 5 * age + 3 * rnorm(100)
memory <- 2.1 * age - 2 * rnorm(100)

cor(speech, memory)
#> [1] 0.9361466

residuals_speech <- lm(speech ~ age)$residuals
residuals_memory <- lm(memory ~ age)$residuals

cor(residuals_speech, residuals_memory) 
#> [1] -0.0001755437

Created on 2022-11-21 with reprex v2.0.2

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading