I have a dataset with several hundred thousand rows of GPS coordinates and associated variables. The GPS coordinates from the source are at the center of given city blocks, rather than located at a specific address. I need to jitter these coordinates by up to one-half long-block to spread all the points around in the block for easier data visualization down the road.
Code and reproducible sample are included below.
# Code for StackOverflow question
# Required package[s]
require(tidyverse)
# Generate minimal reproducible example
centerlong <- c(-90.28192, -90.28192, -90.28192, -90.28192, -90.28192,
                -90.31374, -90.31374, -91.51432, -92.12345, -93.12345)
centerlat <- c(44.12345, 44.12345, 44.12345, 44.28567, 44.28567,
               43.98243, 43.98243, 45.00249, 42.12345, 41.12345)
df <- data.frame(centerlong, centerlat)
# Jitter GPS coordinates by 1/2 long block
df <- df %>%
  mutate(Longitude = runif(1)*((centerlong+0.0005)-(centerlong-0.0005))+(centerlong-0.0005)) %>%
  mutate(Latitude = runif(1)*((centerlat+0.0005)-(centerlat-0.0005))+(centerlat-0.0005))
My problem is that with my above code, it’s taking all of the GPS coordinates which are the same and jittering them all to the exact same new value, rather than jittering each row individually.
What I’m seeing:
| centerlong | Longitude |
| --- | --- |
| -90.28192 | -90.28196 |
| -90.28192 | -90.28196 |
| -90.28192 | -90.28196 |
| [...] | [...] |
What I want to see:
| centerlong | Longitude |
| --- | --- |
| -90.28192 | -90.28080 |
| -90.28192 | -90.28142 |
| -90.28192 | -90.28105 |
| [...] | [...] |
I have also tried to generate these values without using mutate:
df$Longitude <- runif(1)*((df$centerlong+0.0005)-(df$centerlong-0.0005))+(df$centerlong-0.0005)
I’m not sure how to correct this behavior. It seems like runif() is just generating the single number for the whole df instead of generating a new number for each row in the df. I know I’m missing something simple, but I’ve been digging around the internet for a few hours, now, without much success.
>Solution :
Let’s look at your code and mathematically simplify the expression:
runif(1)*((centerlong+0.0005)-(centerlong-0.0005))+(centerlong-0.0005)
## call centerlong `x`, and call 0.0005 `j`
runif(1) * ((x + j) - (x - j)) + (x - j)
runif(1) * (x - x + j + j) + x - j
runif(1) * 2 * j + x - j
So the expression is a valid rescaling: it maps a number drawn between 0 and 1 onto the interval from x - j to x + j. The arithmetic is not the problem.
You are correct that runif(1) is generating a single number. Its first argument is n, the number of values to generate, so runif(1) asks for exactly one number, which R then recycles across every row. That is why every row is shifted by the same offset. You want runif(nrow(df)) to generate a random number for each row, or, inside mutate(), you can use the dplyr helper function n() and write runif(n()).
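To see the difference between one draw and one draw per element in isolation, here is a minimal base R sketch. The vector x, the offset j, and the seed are illustrative values chosen for this example, not part of the original data.

```r
# Illustrative only: three identical centre coordinates
x <- c(-90.28192, -90.28192, -90.28192)
j <- 0.0005

set.seed(42)  # arbitrary seed so the run is repeatable

# A single draw is recycled across all elements: every offset is identical
single_draw <- x + runif(1, min = -j, max = j)

# One draw per element: each element gets its own offset
per_element <- x + runif(length(x), min = -j, max = j)

length(unique(single_draw))  # one distinct value
length(unique(per_element))  # three distinct values, barring a vanishingly unlikely tie
```

The same recycling happens inside mutate(), which is why runif(n()) is needed there.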
I would suggest setting your jitter distance in a variable so you decrease the risk of typos in repeating it and you can easily change it if you want to try out a different distance. Then you can do something like this:
# Jitter distance in degrees; change it here to try a different spread
j <- 0.0005
df <- df %>%
  mutate(
    Longitude = centerlong + runif(n = n(), min = -j, max = j),
    Latitude = centerlat + runif(n = n(), min = -j, max = j)
  )
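If you would rather not use mutate(), a base R version along the lines of runif(nrow(df)) might look like the sketch below, followed by two quick sanity checks. The data frame is a cut-down copy of the sample from the question, and the seed and the rounding tolerance are assumptions added for this example.

```r
# Cut-down version of the question's sample data
df <- data.frame(
  centerlong = c(-90.28192, -90.28192, -90.28192, -90.31374, -91.51432),
  centerlat  = c(44.12345, 44.12345, 44.12345, 43.98243, 45.00249)
)

j <- 0.0005   # jitter distance in degrees
set.seed(1)   # arbitrary seed, only for repeatability

# One draw per row via nrow(df), so no value is recycled
df$Longitude <- df$centerlong + runif(nrow(df), min = -j, max = j)
df$Latitude  <- df$centerlat  + runif(nrow(df), min = -j, max = j)

# Every jittered point should stay within j of its block centre
# (small tolerance for floating-point rounding)
within_long <- all(abs(df$Longitude - df$centerlong) <= j + 1e-9)
within_lat  <- all(abs(df$Latitude  - df$centerlat)  <= j + 1e-9)

# Rows that share a centre should no longer share a jittered value
no_duplicates <- anyDuplicated(df$Longitude) == 0

print(c(within_long, within_lat, no_duplicates))
```

All three checks should come back TRUE; if no_duplicates is FALSE, the draw is probably still being recycled.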