I am writing an automation script in Python to loop through old HTML documentation files and run some RegEx commands, prior to converting the files to rST. I’ve run into a roadblock trying to wrap certain patterns in <pre> and </pre> tags.
I need to find each group occurrence of the below HTML pattern and insert a <pre> tag before and a </pre> tag after.
Pattern:
- p tag with class name of "CodeReference", repeated 1 or more times
<p class="CodeReference">
Sample HTML:
<h3>
Could be any HTML here
</h3>
<p class="CodeReference">
First line</p>
<p class="CodeReference">
Second line</p>
<p class="CodeReference">
Last line</p>
<div>
More random HTML down here as well
</div>
Desired outcome:
<h3>
Could be any HTML here
</h3>
<pre>
<p class="CodeReference">
First line</p>
<p class="CodeReference">
Second line</p>
<p class="CodeReference">
Last line</p>
</pre>
<div>
More random HTML down here as well
</div>
My challenge currently is there’s no prior pattern to reference a positive look-behind with, so I need to capture each group of <p class="CodeReference"> patterns and wrap the entire group in <pre></pre> tags.
Said differently, in each group of <p class="CodeReference"> I need to find the first occurrence and insert a <pre> tag in front of it. Then, in each group of <p class="CodeReference">, find the last occurrence and insert a </pre> tag after it.
Here is what I’ve tried so far (using Python):
code_block = re.sub(r'(?<!(<\/p>\n))<p class=\"CodeReference\">', r'<pre>\g<0>', code_block)
The line above captures the first occurrence, because that tag is not preceded by a closing </p> tag. However, it doesn’t capture the last occurrence, and it feels like I’m going about this the wrong way. I’m open to multiple RegEx statements; it doesn’t need to be a one-liner. I just don’t know how to capture this group of paragraph tags and reference its first and last occurrences.
Any help would be appreciated, thank you!
Solution:
For best results here, you might want to look into an HTML parser such as the third-party Beautiful Soup library. If you must use regex, and assuming that the CodeReference paragraphs don’t contain nested <p> tags, you may try the following approach:
import re

inp = """<h3>
Could be any HTML here
</h3>
<p class="CodeReference">
First line</p>
<p class="CodeReference">
Second line</p>
<p class="CodeReference">
Last line</p>
<div>
More random HTML down here as well
</div>"""
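# Match each run of consecutive CodeReference paragraphs (re.S lets "." span
# newlines) and wrap the whole run in a single <pre>...</pre> pair.
output = re.sub(r'((?:<p class="CodeReference">.*?</p>\s*)+)', r'<pre>\n\1</pre>\n', inp, flags=re.S)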
print(output)
This prints:
<h3>
Could be any HTML here
</h3>
<pre>
<p class="CodeReference">
First line</p>
<p class="CodeReference">
Second line</p>
<p class="CodeReference">
Last line</p>
</pre>
<div>
More random HTML down here as well
</div>
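Since the pattern is dense, here is a minimal sketch of the same idea written with re.VERBOSE so each piece can carry a comment. The names CODE_GROUP and wrap_code_references are hypothetical, chosen only for illustration, and the replacement string (no blank lines around the <pre> tags) is an assumption about the formatting you want. Because the whole run of paragraphs is captured as group 1, re.sub handles every separate run in the document in one pass.

```python
import re

# Hypothetical name; the same pattern as the one-liner, spelled out piece by piece.
CODE_GROUP = re.compile(
    r"""
    (                               # group 1: one whole run of paragraphs
      (?:
        <p\ class="CodeReference">  # opening tag (space escaped for VERBOSE mode)
        .*?                         # paragraph body, as short as possible
        </p>                        # closing tag
        \s*                         # whitespace up to the next tag
      )+                            # one or more paragraphs back to back
    )
    """,
    re.S | re.X,
)


def wrap_code_references(html: str) -> str:
    """Wrap each run of consecutive CodeReference paragraphs in <pre> tags."""
    # Assumed formatting: <pre> on its own line directly before and after the run.
    return CODE_GROUP.sub(r"<pre>\n\1</pre>\n", html)


# Two separate runs, to show that each one gets its own <pre>...</pre> pair.
sample = (
    '<h3>Title</h3>\n'
    '<p class="CodeReference">\nline 1</p>\n'
    '<p class="CodeReference">\nline 2</p>\n'
    '<div>text</div>\n'
    '<p class="CodeReference">\nline 3</p>\n'
)
result = wrap_code_references(sample)
print(result)
```

In a batch script, you could call a helper like this on the contents of each file before handing the result to the rST conversion step. Note that the trailing \s* swallows any whitespace after the last paragraph of a run, so the replacement puts a single newline back.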