Overlapping Histograms in Seaborn

I’m plotting two histograms using Seaborn’s histplot function. The first histogram represents my entire dataset, while the second one is a subset of the first. However, the second histogram doesn’t seem to overlap with the first one as expected. Here’s the code I’m using:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

data = np.sin(np.arange(0, 6*np.pi, 0.1)) * 100
sns.scatterplot(x=[np.mean(data)], y=[0])
sns.lineplot(data)

population_size = 10000
sample_size = 100
total_means = []
for _ in range(population_size):
    # draw one sample (with replacement) and record its mean
    total_means.append(np.mean(np.random.choice(data, sample_size)))

total_means = np.array(total_means)
sns.histplot(total_means, kde=True)

# Q. Find the interval in which 68% of the sample means will lie
z1 = norm.ppf(.50 - .68/2)
se = np.array(data).std() / sample_size ** .5
x1 = z1 * se + np.array(data).mean()
z2 = norm.ppf(.50 + .68/2)
x2 = z2 * se + np.array(data).mean()
print(x1, x2)

plt.xticks(np.arange(total_means.min(), total_means.max(), 10))
plt.xticks(np.arange(0, 500, 100))
sns.histplot(total_means, kde=True)
sns.histplot(total_means[(total_means >= x1) & (total_means <= x2)], kde=True, color='r')

On Stack Overflow, it’s recommended to avoid posting the complete code. However, I have some data that could be used to quickly address the issue without needing to generate new data.

In my code, the last two lines plot two histograms. However, from the resulting graph, it’s clear that these histograms do not overlap as expected.

sns.histplot(total_means, kde=True)
sns.histplot(total_means[(total_means >= x1) & (total_means <= x2)], kde=True, color='r')

[Image: histogram graph]

Solution:

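The problem is that the red plot uses narrower bins. Each histplot call computes its own default bins from the data it receives (via np.histogram_bin_edges()), and the subset covers a smaller range. Narrower bins mean fewer values fall into each bin, so the red bars do not line up with the bars of the full histogram.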
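To see the effect numerically, the bin edges can be computed directly with NumPy. This is a minimal sketch that assumes seaborn's default corresponds to bins="auto" and uses standard normal stand-in data with a fixed seed rather than the asker's sample means.

```python
import numpy as np

# Stand-in data (an assumption, not the asker's sample means)
rng = np.random.default_rng(0)
total_means = rng.normal(size=2000)
subset = total_means[(total_means >= -1) & (total_means <= 1)]

# Compute the default-style bin edges separately for each dataset
full_edges = np.histogram_bin_edges(total_means, bins="auto")
subset_edges = np.histogram_bin_edges(subset, bins="auto")

# All bins within one histogram have the same width
full_width = full_edges[1] - full_edges[0]
subset_width = subset_edges[1] - subset_edges[0]

print(f"full:   {len(full_edges) - 1} bins of width {full_width:.3f}")
print(f"subset: {len(subset_edges) - 1} bins of width {subset_width:.3f}")
```

The subset's bin width comes out smaller than that of the full data, which is why its bars cannot be compared one-to-one with the bars of the full histogram.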

Here is some code to show the difference between using default bins and using the same bins for both histograms. With shared bins, the first and last bars of the subset are lower than the corresponding bars of the full histogram, because the limits of the subset fall inside those bins rather than on their edges.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

total_means = np.random.randn(2000)

fig, (ax1, ax2) = plt.subplots(ncols=2)

sns.histplot(total_means, kde=True, ax=ax1)
sns.histplot(total_means[(total_means >= -1) & (total_means <= 1)], kde=True, ax=ax1)
ax1.set_title('default bins')

bins = np.linspace(total_means.min(), total_means.max(), 30)
sns.histplot(total_means, bins=bins, kde=True, ax=ax2)
sns.histplot(total_means[(total_means >= -1) & (total_means <= 1)], bins=bins, kde=True, ax=ax2)
ax2.set_title('same bins for both')

plt.show()
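The shared-bin behaviour can also be checked without plotting. The following sketch counts values per bin with np.histogram, using the same kind of stand-in data and a fixed seed (both assumptions for illustration).

```python
import numpy as np

# Stand-in data (an assumption, not the asker's sample means)
rng = np.random.default_rng(0)
total_means = rng.normal(size=2000)
subset = total_means[(total_means >= -1) & (total_means <= 1)]

# One set of edges shared by both histograms
bins = np.linspace(total_means.min(), total_means.max(), 30)
full_counts, _ = np.histogram(total_means, bins=bins)
subset_counts, _ = np.histogram(subset, bins=bins)

# With shared edges, no subset bar can be taller than the full bar
print(np.all(subset_counts <= full_counts))  # True

# Bins lying entirely inside [-1, 1] hold the same values in both histograms
inside = (bins[:-1] >= -1) & (bins[1:] <= 1)
print(np.array_equal(subset_counts[inside], full_counts[inside]))  # True
```

Only the bins that straddle -1 or 1 differ between the two histograms, which matches the lower first and last bars of the subset in the plot.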

[Image: sns.histplot with same bins]
