How to check if a PDF is a scanned image or contains text in R

I am running a structural equation model on a large number of PDFs (>1000) in R.

However, some PDFs have a readable text layer and others are scanned images, i.e. I need to run those through an OCR function.

Therefore, I need a way to automatically identify which PDFs contain text and which don't. Specifically, I want something that returns whether a given PDF should be run through OCR.

Does anyone know of any functions or packages in R that might help with this? I can find a couple of solutions for Python, but haven't been able to find any for R.

Solution:

You could use an approach like this (as @danlooo already suggested, but I wanted to spell it out):

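files <- list.files("/home/johannes/pdfs/",
                    pattern = "\\.pdf$",   # escape the dot so it matches a literal "."
                    full.names = TRUE)

pdfs_l <- lapply(files, function(f) {
  # pdf_text() returns a character vector with one element per page
  out <- pdftools::pdf_text(f)
  # I set the test to an arbitrary number of characters; it works for me, but you
  # may want to fine-tune it. Summing over pages gives one TRUE/FALSE per file,
  # so the if() below also works for multi-page PDFs.
  contains_text <- sum(nchar(trimws(out))) > 15
  if (!contains_text) {
    # no usable text layer: fall back to OCR (needs the tesseract package)
    out <- pdftools::pdf_ocr_text(f)
  }
  data.frame(text = out, ocr = !contains_text)
})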

pdfs_l |>
  dplyr::bind_rows() |>
  dplyr::mutate(text = trimws(text)) |>
  tibble::as_tibble()
#> # A tibble: 22 × 2
#>    text                                                                    ocr  
#>    <chr>                                                                   <lgl>
#>  1 "TEAM MEMBERS:\n                                                      … FALSE
#>  2 "WS 21/22                                                             … FALSE
#>  3 "WS 21/22                                                             … FALSE
#>  4 "TEAM MEMBERS:\n                                                      … FALSE
#>  5 "TEAM MEMBERS:\n                                                      … FALSE
#>  6 "Key Concepts in Political Communication\n    @Agenda Setting, Priming… FALSE
#>  7 "Key Concepts in Political Communication\n    @Agenda Setting, Priming… FALSE
#>  8 "ELECTIONS AND CAMPAIGNS\n                                            … FALSE
#>  9 ""                                                                      TRUE 
#> 10 ""                                                                      TRUE 
#> # … with 12 more rows

Created on 2022-02-10 by the reprex package (v2.0.1)
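
If you only want the yes/no answer the question asks for, without running OCR right away, the same test can be pulled out into a small function. This is a minimal sketch: the names `needs_ocr()` and `pdf_needs_ocr()` are made up for illustration, the 15-character threshold is the same arbitrary value as above, and counting only non-whitespace characters is my own assumption about what "contains text" should mean.

```r
# Hypothetical helpers; only pdftools::pdf_text() is an existing function.

# Decide from already-extracted page text whether a document looks scanned.
# `pages` is a character vector with one element per page.
needs_ocr <- function(pages, min_chars = 15) {
  # Count non-whitespace characters across all pages (assumption: layout
  # whitespace and newlines should not count as text).
  n_chars <- sum(nchar(gsub("\\s+", "", pages)))
  n_chars <= min_chars
}

# Wrapper that reads the text layer of one file and applies the check.
pdf_needs_ocr <- function(path, min_chars = 15) {
  needs_ocr(pdftools::pdf_text(path), min_chars = min_chars)
}

# Toy input, no PDF required:
needs_ocr(c("", "   \n"))                                # TRUE
needs_ocr(c("TEAM MEMBERS:\n  Alice, Bob", "WS 21/22"))  # FALSE
```

With the `files` vector from above, `vapply(files, pdf_needs_ocr, logical(1))` should give one logical per file, which you can then use to decide which files to send to `pdftools::pdf_ocr_text()`.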
