Replace Text in XML files with placeholder text

I have performed text mining on files that I am now preparing for publication. Several XML files contain text within segments (see the basic example below). Due to copyright restrictions, the published files must not contain the whole text, but someone who has the texts should still be able to ‘reconstruct’ the files. So that basic text mining (= counting lengths) remains possible, the segment length must not change. I am therefore looking for a way to replace every word except the first and last one in each segment with dummy / placeholder text.

Basic example:

Input:

<text>
<div>
<seg xml:id="A">Lorem ipsum dolor sit amet</seg>
<seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg>
</div>
</text>

Output:

<text>
<div>
<seg xml:id="A">Lorem blank blank blank amet</seg>
<seg xml:id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>

Solution:

rapply can recursively replace values in a nested list.

Let data.xml be a file containing your input.

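library(tidyverse)
library(xml2)

read_xml("data.xml") %>%
  as_list() %>%
  # rapply() visits every text leaf of the nested list
  rapply(how = "replace", function(x) {
    tokens <-
      x %>%
      str_split(" ") %>%
      simplify()

    n_tokens <- length(tokens)

    # Nothing to mask in segments of one or two words
    if (n_tokens <= 2) {
      return(x)
    }

    # Keep the first and last word, replace everything in between
    c(
      tokens[[1]],
      rep("blank", n_tokens - 2),
      tokens[[n_tokens]]
    ) %>%
      paste0(collapse = " ")
  }) %>%
  as_xml_document() %>%
  write_xml("data2.xml")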

Output file data2.xml (note that after the round trip through as_list() the xml:id attribute is written back as id):

<?xml version="1.0" encoding="UTF-8"?>
<text>
  <div>
    <seg id="A">Lorem blank blank blank amet</seg>
    <seg id="B">sed blank blank blank blank invidunt</seg>
  </div>
</text>
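
If the xml:id attribute has to survive unchanged, one possible variation is to edit the segment text in place with xml2 instead of converting the document to a list. The following is only a sketch under stated assumptions: mask_middle is a hypothetical helper name (not part of any package), every seg element is assumed to contain plain text without nested markup, and the sample document is inlined as a string rather than read from data.xml.

```r
library(xml2)

# Hypothetical helper: keep the first and last word of a string and
# replace every word in between with a placeholder.
mask_middle <- function(text, placeholder = "blank") {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  n <- length(tokens)
  # Segments with at most two words have nothing to mask
  if (n <= 2) return(paste(tokens, collapse = " "))
  paste(c(tokens[1], rep(placeholder, n - 2), tokens[n]), collapse = " ")
}

# Sample input inlined for illustration; read_xml("data.xml") would work too
doc <- read_xml(paste0(
  '<text><div>',
  '<seg xml:id="A">Lorem ipsum dolor sit amet</seg>',
  '<seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg>',
  '</div></text>'
))

# Select all seg elements and overwrite their text in place
# (assumption: each seg holds a single text node)
segs <- xml_find_all(doc, "//seg")
xml_text(segs) <- vapply(xml_text(segs), mask_middle, character(1),
                         USE.NAMES = FALSE)

print(xml_text(segs))
```

Because the document is never converted to a list, attributes should be left as they were in the input. The result can then be saved with write_xml(doc, "data2.xml").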