Problem constructing a regular expression [Python, Pandas]

I have a data frame in which each row of one column looks like this:

<title>Some text</title>

<selftext>Some text</selftext>

The above is a single row in one column.
The problem is that not every row looks like this. I need to remove the rows that do not match this format.

I tried the code below:

pattern = "<title>[a-zA-Z0-9]</title>\n\n<selftext>[a-zA-Z0-9]</selftext>"
for row in df.column_name:
    if row == pattern:
        print(row)

but no rows are printed, although some should be.
What am I doing wrong?
Does anyone know?

Solution:

My first guess at what is wrong with the pattern: the character class [a-zA-Z0-9] has no quantifier, so it matches exactly one character. Add + to allow one or more characters inside the title and selftext tags:

pattern = "<title>[a-zA-Z0-9]+</title>\n\n<selftext>[a-zA-Z0-9]+</selftext>"
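A quick way to see the effect of the quantifier is to test both variants against short strings. This is only a sketch: the sample values and the variable names are made up for illustration and do not come from the question's data.

```python
import re

# Same character class, without and with the "+" quantifier.
one_char = "<title>[a-zA-Z0-9]</title>"
one_or_more = "<title>[a-zA-Z0-9]+</title>"

# Hypothetical sample values, for illustration only.
single = "<title>a</title>"
several = "<title>abc</title>"

# Without "+", exactly one character is allowed between the tags.
single_ok = bool(re.fullmatch(one_char, single))
several_without_plus = bool(re.fullmatch(one_char, several))

# With "+", one or more characters are allowed.
several_with_plus = bool(re.fullmatch(one_or_more, several))

print(single_ok, several_without_plus, several_with_plus)
```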

Also, you never actually applied the pattern as a regular expression; you only compared strings with ==. That comparison is true only if the row is literally identical to the pattern string, character classes and all, so nothing matched.

Use it like this:

import re
pattern = "<title>[a-zA-Z0-9]+</title>\n\n<selftext>[a-zA-Z0-9]+</selftext>"
for row in df.column_name:
    if re.match(pattern, row):
        print(row)

Edit: The character class above accepts only ASCII letters and digits, so a value containing a space, such as "Some text", still fails to match. Unless you really want to restrict the content to that exact character set, I would recommend making the pattern much broader. Element text in XML may contain almost any character other than a literal <, so you can simply match everything up to the next opening tag. While you're at it, you can use * instead of + to allow empty tags, which can also occur in XML.

import re
pattern = "<title>[^<]*</title>\n\n<selftext>[^<]*</selftext>"
for row in df.column_name:
    if re.match(pattern, row):
        print(row)
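Since the goal in the question is to remove the non-matching rows rather than print the matching ones, the same pattern can also be applied to the whole column at once. The following is a minimal sketch under assumptions: the sample data is invented, column_name is the placeholder name from the question, and Series.str.fullmatch (available in pandas 1.1 and later) is used instead of re.match so that rows with extra content after the closing selftext tag are rejected too.

```python
import pandas as pd

# Hypothetical data; "column_name" follows the placeholder used in the question.
df = pd.DataFrame({"column_name": [
    "<title>Some text</title>\n\n<selftext>Some text</selftext>",
    "just some unrelated text",
    "<title></title>\n\n<selftext>Body</selftext>",
    None,
]})

pattern = r"<title>[^<]*</title>\n\n<selftext>[^<]*</selftext>"

# fullmatch requires the entire cell to match the pattern;
# na=False treats missing values as non-matching.
mask = df["column_name"].str.fullmatch(pattern, na=False)

# Keep only the rows whose cell has the expected shape.
filtered = df[mask].reset_index(drop=True)
print(filtered)
```

If trailing content after the closing tag should be tolerated, str.match (which only anchors at the start, like re.match) could be used in place of str.fullmatch.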