How can I convert `A_B_C_DEF` to `ABC_DEF`?

I have strings of this form:

A_B_CDEF_GHI
A_B_C_DEF_G_H_I
ABC_D_E_F_GHI
ABCDEFG_H_I
A_B_C

I need to convert those to the following:

AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC

So the rules are:

  • (._){2,} should be converted to XXX_ if it’s not at the end of the string.
  • If (_.){2,} occurs at the end of a string, it should be converted to _XXX.
  • If (_.){2,}. is the entire string, all underscores should be removed.

I’ve gotten to (((.)_){2,}), which does match the first rule, but how can I replace it with the non-underscore characters it found?

The python tag is present because that’s where the code is, and I know regex dialects depend on the language.

Solution:

The dot in your example code matches any character including an underscore. You can make the pattern a bit more specific instead.
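To make that point concrete, here is a minimal sketch comparing the dot-based shape from the question with a stricter character class. The variable names and the `[A-Z]` variant are illustrative assumptions, not part of the original code.

```python
import re

# Pattern shape from the question, written with a non-capturing group.
dot_pattern = r'(?:._){2,}'
# Hypothetical stricter variant: only an uppercase letter may precede each underscore.
strict_pattern = r'(?:[A-Z]_){2,}'

# The dot happily consumes underscores, so a string of only underscores matches.
dot_on_underscores = re.fullmatch(dot_pattern, '____') is not None
# The character class rejects that input but still accepts letter/underscore pairs.
strict_on_underscores = re.fullmatch(strict_pattern, '____') is not None
strict_on_letters = re.fullmatch(strict_pattern, 'A_B_') is not None

print(dot_on_underscores, strict_on_underscores, strict_on_letters)
```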

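You can match runs of two or more A-Z characters first to get them out of the way, and capture in a group a single A-Z followed by one or more repetitions of _ and a single A-Z.

Then, in the replacement, remove the underscores from the capture group's text when the group matched, and leave every other match unchanged.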

_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)
  • _?[A-Z]{2,}_? Match 2 or more occurrences of A-Z, surrounded by optional underscores
  • | or
  • ( Capture group 1
    • [A-Z] Match a single A-Z
    • (?:_[A-Z](?![A-Z]))+ Repeat 1+ times _ and A-Z asserting not A-Z to the right
  • ) Close group 1
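As a rough illustration of the breakdown above, the following sketch lists each match together with the contents of group 1 for one sample string. The input string and variable names are chosen only for the example.

```python
import re

pattern = re.compile(r'_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)')

# Collect (whole match, group 1) pairs; group 1 is None whenever the
# first alternative (a run of two or more letters) matched instead.
matches = [(m.group(), m.group(1)) for m in pattern.finditer("A_B_CDEF_G_H_I")]

for whole, single_letter_run in matches:
    print(repr(whole), repr(single_letter_run))
```

Only the matches where group 1 is set are candidates for having their underscores removed.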

See a regex demo and a Python demo

For example:

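import re

# Alternative 1 skips runs of two or more letters; alternative 2 captures
# single letters separated by underscores in group 1.
pattern = r'_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)'
s = ("A_B_CDEF_GHI\n"
     "A_B_C_DEF_G_H_I\n"
     "ABC_D_E_F_GHI\n"
     "ABCDEFG_H_I\n"
     "A_B_C")

# If group 1 matched, strip its underscores; otherwise keep the match unchanged.
res = re.sub(pattern, lambda x: x.group(1).replace("_", "") if x.group(1) else x.group(), s)
print(res)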

Output

AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC

For a slightly broader match than A-Z, you could use a negated character class that matches any character except whitespace or an underscore:

_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)
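A short sketch of how the broader pattern might be used, assuming the same replacement callback as in the A-Z version. The helper name `collapse` and the lowercase and digit inputs are hypothetical, added only to show that non-uppercase characters are handled too.

```python
import re

broad_pattern = re.compile(r'_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)')

def collapse(text):
    """Remove underscores between single non-underscore, non-whitespace characters."""
    return broad_pattern.sub(
        # Group 1 is set only for runs of single characters separated by "_".
        lambda m: m.group(1).replace("_", "") if m.group(1) else m.group(),
        text,
    )

print(collapse("a_b_cdef_g_h"))
print(collapse("1_2_x9"))
```

Because whitespace is excluded from the class, a multi-line input is still processed line by line in effect, as with the A-Z pattern.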