Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

R: Find out which observations are located in each "bar" of the histogram

I am working with the R programming language. Suppose I have the following data:

     a = rnorm(1000,10,1)
     b = rnorm(200,3,1)
     c = rnorm(200,13,1)
    
    d = c(a,b,c)
index <- 1:1400

my_data = data.frame(index,d)

I can make the following histograms of the same data by adjusting the "bin" length (via the "breaks" option):

hist(my_data, breaks = 10, main = "Histogram #1, Breaks = 10")
 hist(my_data, breaks = 100, main = "Histogram #2, Breaks = 100")
 hist(my_data, breaks = 5, main = "Histogram #3, Breaks = 5")

enter image description here

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

My Question: In each one of these histograms there are a different number of "bars" (i.e. bins). For example, in the first histogram there are 8 bars and in the third histogram there are 4 bars. For each one of these histograms, is there a way to find out which observations (from the original file "d") are located in each bar?

Right now, I am trying to manually do this, e.g. (for histogram #3)

histogram3_bar1 <- my_data[which(my_data$d < 5 & my_data$d > 0), ]
histogram3_bar2 <- my_data[which(my_data$d < 10 & my_data$d > 5), ]
histogram3_bar3 <- my_data[which(my_data$d < 15 & my_data$d > 10), ]
histogram3_bar4 <- my_data[which(my_data$d < 15 & my_data$d > 20), ]


head(histogram3_bar1)

    index        d
1001  1001 4.156393
1002  1002 3.358958
1003  1003 1.605904
1004  1004 3.603535
1006  1006 2.943456
1007  1007 1.586542

But is there a more "efficient" way to do this?

Thanks!

>Solution :

hist itself can provide for the solution to the question’s problem, to find out which data points are in which intervals. hist returns a list with first member breaks

First, make the problem reproducible by setting the RNG seed.

set.seed(2021)
a = rnorm(1000,10,1)
b = rnorm(200,3,1)
c = rnorm(200,13,1)
d = c(a,b,c)

Now, save the return value of hist and have findInterval tell the bins where each data points are in.

h1 <- hist(d, breaks = 10)
f1 <- findInterval(d, h1$breaks)

h1$breaks
# [1] -2  0  2  4  6  8 10 12 14 16

head(f1)
#[1] 6 7 7 7 7 6

The first six observations are intervals 6 and 7 with end points 8, 10 and 12, as can be seen indexing d by f1:

head(d[f1])
#[1]  8.07743 10.26174 10.26174 10.26174 10.26174  8.07743

As for whether the intervals given by end points 8, 10 and 12 are left- or right-closed, see help("findInterval").

As a final check, table the values returned by findInterval and see if they match the histogram’s counts.

table(f1)
#f1
#  1   2   3   4   5   6   7   8   9 
#  2  34 130  34  17 478 512 169  24 
h1$counts
#[1]   2  34 130  34  17 478 512 169  24

To have the intervals for each data point, the following

bins <- data.frame(bin = f1, min = h1$breaks[f1], max = h1$breaks[f1 + 1L])
head(bins)
#  bin min max
#1   6   8  10
#2   7  10  12
#3   7  10  12
#4   7  10  12
#5   7  10  12
#6   6   8  10
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading