re.findall outputs blanks along with correct

I’m trying to get a flat list as output, with no subgroups or empty strings. I’d like to stick with a regex-only solution, because my current approach using re.split and array manipulation is janky and fairly slow.

HTML file: (Notice that thing3 and thing4 are preceded by /b/ instead of /a/.)

<!DOCTYPE html>
<html>
    <head></head>   
    <body>
        <a href="example.com/a/thing1"></a>
        <a href="example.com/a/thing2"></a>
        <a href="example.com/b/thing3"></a>
        <a href="example.com/b/thing4" ><img src="/thing4.png"></a>
    </body>
</html>

Python file:

import re

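# Read the whole file; the with-block closes it afterwards.
with open("help.html", "r") as f:
    html = f.read()
# Raw string so the backslashes reach the regex engine unchanged.
links = re.findall(r'((?<=\.com\/a\/).*(?="))|((?<=\.com\/b\/).*(?=" ><))|((?<=\.com\/b\/).*(?="><\/a))', html)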

print(links)

What will output when I run the above py file:

[('thing1', '', ''), ('thing2', '', ''), ('', '', 'thing3'), ('', 'thing4', '')]

What I want it to output:

['thing1', 'thing2', 'thing3', 'thing4']

>Solution :

You just have to remove the capturing groups. As the documentation for re.findall states:

Empty matches are included in the result.

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
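
To make the quoted rule concrete, here is a minimal sketch of how the number of capturing groups changes the shape of the result. The sample string and the four patterns are made up for illustration and are not taken from the question.

```python
import re

text = "a1 b2"  # hypothetical sample input

# No groups: a list of strings matching the whole pattern.
no_groups = re.findall(r"[ab]\d", text)

# Exactly one group: a list of strings matching that group only.
one_group = re.findall(r"[ab](\d)", text)

# Several groups: a list of tuples; a group that did not take part is ''.
many_groups = re.findall(r"(a\d)|(b\d)", text)

# Non-capturing groups (?:...) keep the result as plain strings.
non_capturing = re.findall(r"(?:a\d)|(?:b\d)", text)

print(no_groups)      # ['a1', 'b2']
print(one_group)      # ['1', '2']
print(many_groups)    # [('a1', ''), ('', 'b2')]
print(non_capturing)  # ['a1', 'b2']
```

The third case is what happens in the question: each alternative sits in its own capturing group, so every match comes back as a 3-tuple with empty strings for the alternatives that did not match.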

One of the capturing groups is ((?<=\.com\/a\/).*(?=")), so remove its outermost parentheses, and do the same for the other two groups:

links = re.findall(r'(?<=\.com\/a\/).*(?=")|(?<=\.com\/b\/).*(?=" ><)|(?<=\.com\/b\/).*(?="><\/a)', html)

Output:

['thing1', 'thing2', 'thing3', 'thing4']
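
As a self-contained check, the sketch below inlines the four anchors from the question instead of reading help.html, then runs the group-free pattern. It also tries a shorter variant, which is an assumed simplification and not part of the original answer: it supposes that every link of interest sits under /a/ or /b/ and that the href value ends at the next double quote.

```python
import re

# Anchors copied from the question's HTML, inlined so the example runs on its own.
html = '''
<a href="example.com/a/thing1"></a>
<a href="example.com/a/thing2"></a>
<a href="example.com/b/thing3"></a>
<a href="example.com/b/thing4" ><img src="/thing4.png"></a>
'''

# Same three alternatives as in the question, without capturing groups.
# ("/" needs no escaping in Python regexes, so "\/" is written as "/".)
pattern = r'(?<=\.com/a/).*(?=")|(?<=\.com/b/).*(?=" ><)|(?<=\.com/b/).*(?="></a)'
links = re.findall(pattern, html)

# Assumed simplification: one fixed-width lookbehind covering /a/ and /b/,
# then everything up to (not including) the closing double quote.
short_links = re.findall(r'(?<=\.com/[ab]/)[^"]*', html)

print(links)        # ['thing1', 'thing2', 'thing3', 'thing4']
print(short_links)  # ['thing1', 'thing2', 'thing3', 'thing4']
```

The shorter pattern avoids the greedy .* and the three separate lookaheads, but it would need adjusting if other path prefixes or single-quoted attributes appear in the real file.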