
R: Guessing the "Format" of a Dataset?

I am working with the R programming language.

I am trying to follow this tutorial (https://rpubs.com/chidungkt/505486), but the dataset it requires no longer appears to be available. I am therefore trying to guess the format of the dataset and simulate a fake dataset in the same format so that I can continue with the tutorial.

I spent some time analyzing the structure of the code to infer the format of the dataset. This is what I came up with:

Age = c("0-10", "0-10", "11-20", "11-20", "21-30", "21-30", "31-40", "31-40", "41-50", "41-50", "51-60", "51-60")

Gender = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F")

Value = as.integer(rnorm(12, 100,10))

vn_2018_pop = data.frame(Age, Gender, Value)

     Age Gender Value
1   0-10      M   125
2   0-10      F   103
3  11-20      M    84
4  11-20      F   105
5  21-30      M    96
6  21-30      F    88
7  31-40      M    88
8  31-40      F   120
9  41-50      M   106
10 41-50      F   118
11 51-60      M   105
12 51-60      F   112

Based on this dataset, I tried to run the R code from the tutorial:

# Load some packages for scraping data and data manipulation:
library(rvest)
library(magrittr)
library(tidyverse)
library(extrafont)

my_colors <- c("#2E74C0", "#CB454A")
my_font <- "Roboto Condensed"

vn_2018_pop %>% ggplot(aes(Age, Value, fill = Gender)) + 
  geom_col(position = "stack") + 
  coord_flip() + 
  scale_y_continuous(breaks = seq(-5000000, 5000000, 1000000), 
                     limits = c(-5000000, 5000000), 
                     labels = c(paste0(seq(5, 0, -1), "M"), paste0(1:5, "M"))) + 
  theme_minimal() + 
  scale_fill_manual(values = my_colors, name = "", labels = c("Female", "Male")) + 
  guides(fill = guide_legend(reverse = TRUE)) + 
  theme(panel.grid.major.x = element_line(linetype = "dotted", size = 0.2, color = "grey40")) + 
  theme(panel.grid.major.y = element_blank()) + 
  theme(panel.grid.minor.y = element_blank()) + 
  theme(panel.grid.minor.x = element_blank()) + 
  theme(legend.position = "top") + 
  theme(plot.title = element_text(family = my_font, size = 28)) + 
  theme(plot.subtitle = element_text(family = my_font, size = 13, color = "gray40")) + 
  theme(plot.caption = element_text(family = my_font, size = 12, colour = "grey40", face = "italic")) + 
  theme(plot.margin = unit(c(1.2, 1.2, 1.2, 1.2), "cm")) + 
  theme(axis.text = element_text(size = 13, family = my_font)) + 
  theme(legend.text = element_text(size = 12, face = "bold", color = "grey30", family = my_font)) + 
  labs(x = NULL, y = NULL, 
       title = "Population Pyramids of Vietnam in 2018",
       subtitle = "A population pyramid illustrates the age-sex structure of a country's population and may provide insights about\npolitical and social stability, as well as economic development. Countries with young populations need to\ninvest more in schools, while countries with older populations need to invest more in the health sector.",
       caption = "Data Source: https://www.census.gov")

The code runs, but the plot it produces is empty.

Can someone please show me what I am doing wrong and what I can do to fix this problem?

Thanks!

Solution:

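The plot looks empty because the simulated values (around 100) are tiny compared with the axis limits of -5,000,000 to 5,000,000, so the bars are too thin to see. Fix the scale of your data by multiplying the values by e.g. 4e4, and make the values for males negative so that their bars extend to the left of zero: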

library(tidyverse)
library(extrafont)

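# Note: set.seed() only makes the data reproducible if it runs before the rnorm() call that creates Value
set.seed(123)

# Scale the simulated values (around 100) up to the millions that the axis limits expect
vn_2018_pop$Value <- 4e4 * vn_2018_pop$Value
# Negate the male values so that their bars extend to the left of zero
vn_2018_pop$Value[vn_2018_pop$Gender == "M"] <- -vn_2018_pop$Value[vn_2018_pop$Gender == "M"]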

my_colors <- c("#2E74C0", "#CB454A")
my_font <- "Roboto Condensed"

vn_2018_pop %>% ggplot(aes(Age, Value, fill = Gender)) +
  geom_col(position = "stack") +
  coord_flip() +
  scale_y_continuous(
    breaks = seq(-5000000, 5000000, 1000000),
    limits = c(-5000000, 5000000),
    labels = c(paste0(seq(5, 0, -1), "M"), paste0(1:5, "M"))
  ) +
  theme_minimal() +
  scale_fill_manual(values = my_colors, name = "", labels = c("Female", "Male")) +
  guides(fill = guide_legend(reverse = TRUE)) +
  theme(panel.grid.major.x = element_line(linetype = "dotted", size = 0.2, color = "grey40")) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.grid.minor.y = element_blank()) +
  theme(panel.grid.minor.x = element_blank()) +
  theme(legend.position = "top") +
  theme(plot.title = element_text(family = my_font, size = 28)) +
  theme(plot.subtitle = element_text(family = my_font, size = 13, color = "gray40")) +
  theme(plot.caption = element_text(family = my_font, size = 12, colour = "grey40", face = "italic")) +
  theme(plot.margin = unit(c(1.2, 1.2, 1.2, 1.2), "cm")) +
  theme(axis.text = element_text(size = 13, family = my_font)) +
  theme(legend.text = element_text(size = 12, face = "bold", color = "grey30", family = my_font)) +
  labs(
    x = NULL, y = NULL,
    title = "Population Pyramids of Vietnam in 2018",
    subtitle = "A population pyramid illustrates the age-sex structure of a country's population and may provide insights about\npolitical and social stability, as well as economic development. Countries with young populations need to\ninvest more in schools, while countries with older populations need to invest more in the health sector.",
    caption = "Data Source: https://www.census.gov"
  )
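
As a side note, the hand-built labels vector passed to scale_y_continuous() has to stay in sync with the breaks. One possible alternative, sketched here in base R, is to derive the labels from the break positions. The helper name label_millions is hypothetical and not part of the tutorial or of ggplot2:

```r
# Sketch: derive pyramid axis labels from the break positions
breaks <- seq(-5e6, 5e6, by = 1e6)

# Hypothetical helper: show magnitudes in millions and drop the sign,
# so that both sides of the pyramid read as positive counts
label_millions <- function(x) paste0(abs(x) / 1e6, "M")

axis_labels <- label_millions(breaks)
print(axis_labels)
```

Because ggplot2 scales accept a function for the labels argument, it should also be possible to pass labels = label_millions directly instead of a precomputed vector.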

[Figure: the population pyramid produced by the solution code]
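
Since the tutorial's original data is unavailable, the dataset can also be simulated at a suitable scale from the start rather than rescaled afterwards. The following is a minimal sketch under assumptions: the mean of 4 million and standard deviation of 400,000 are not from the tutorial, they simply mirror the 4e4 scaling used above, and the variable names age_groups, axis_limits and is_male are illustrative.

```r
# Sketch: simulate a pyramid-ready dataset in one step (base R only)
set.seed(123)  # seed before rnorm() so the draw is reproducible

age_groups  <- c("0-10", "11-20", "21-30", "31-40", "41-50", "51-60")
axis_limits <- c(-5e6, 5e6)  # limits used in scale_y_continuous()

vn_2018_pop <- data.frame(
  Age    = rep(age_groups, each = 2),
  Gender = rep(c("M", "F"), times = length(age_groups))
)

# Assumed scale: same mean and spread as 4e4 * rnorm(12, 100, 10)
vn_2018_pop$Value <- round(rnorm(nrow(vn_2018_pop), mean = 4e6, sd = 4e5))

# Males are plotted to the left of zero
is_male <- vn_2018_pop$Gender == "M"
vn_2018_pop$Value[is_male] <- -vn_2018_pop$Value[is_male]

# Values outside the scale limits are removed by default, so check them up front
stopifnot(all(vn_2018_pop$Value >= axis_limits[1] &
              vn_2018_pop$Value <= axis_limits[2]))

print(vn_2018_pop)
```

With a data frame built this way, the plotting code from the answer can be run unchanged, without the two rescaling lines.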
