I have strings of this form:
A_B_CDEF_GHI
A_B_C_DEF_G_H_I
ABC_D_E_F_GHI
ABCDEFG_H_I
A_B_C
I need to convert those to the following:
AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC
So the rules are:
- (._){2,} should be converted to XXX_ if it’s not at the end of the string.
- If (_.){2,} occurs at the end of a string, it should be converted to _XXX.
- If (_.){2,}. is the entire string, all underscores should be removed.
I’ve gotten to (((.)_){2,}), which does match the first rule, but how can I replace it with the non-underscore characters it found?
The python tag is present because that’s where the code is, and I know regex dialects depend on the language.
>Solution :
The dot in your example code matches any character including an underscore. You can make the pattern a bit more specific instead.
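To make the point about the dot concrete, here is a minimal sketch. The input 'A___' is made up for illustration and is not one of the strings from the question.

```python
import re

# By default the dot matches any character except a newline, underscore included.
dot_matches_underscore = re.fullmatch(r'.', '_') is not None
class_matches_underscore = re.fullmatch(r'[A-Z]', '_') is not None

# Made-up input: with the dot, a run of underscores also counts as
# "character + underscore" pairs, so the loose pattern matches here.
loose = re.search(r'(._){2,}', 'A___')
strict = re.search(r'([A-Z]_){2,}', 'A___')

print(dot_matches_underscore, class_matches_underscore)
print(loose.group() if loose else None, strict)
```

With [A-Z] in place of the dot, the underscores themselves can no longer be consumed as the "character" part of a pair.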
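You can first match every run of two or more A-Z characters to get those out of the way, and capture in a group each single A-Z that is followed by one or more repetitions of _ and a single A-Z.
Then, when the capture group has matched, replace each _ in it with an empty string; otherwise leave the match unchanged.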
_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)
- _?[A-Z]{2,}_? Match 2 or more occurrences of A-Z surrounded by optional underscores
- | Or
- ( Capture group 1
- [A-Z] Match a single A-Z
- (?:_[A-Z](?![A-Z]))+ Repeat 1+ times _ and A-Z, asserting not A-Z to the right
- ) Close group 1
For example:
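import re

# Alternative 1 consumes runs of 2+ letters (with optional surrounding underscores) unchanged.
# Alternative 2 captures single letters joined by underscores in group 1.
pattern = r'_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)'
s = ("A_B_CDEF_GHI\n"
"A_B_C_DEF_G_H_I\n"
"ABC_D_E_F_GHI\n"
"ABCDEFG_H_I\n"
"A_B_C")
# Strip the underscores only when group 1 matched; otherwise return the match as it is.
res = re.sub(pattern, lambda x: x.group(1).replace("_", "") if x.group(1) else x.group(), s)
print(res)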
Output
AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC
For a slightly broader match than the characters A-Z, you could use a negated character class that matches any character except a whitespace character or an underscore:
_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)
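As a rough sketch of how the broader pattern behaves, it can be plugged into the same re.sub call. The helper name collapse and the sample inputs below are hypothetical and chosen only for illustration.

```python
import re

# Same structure as the A-Z pattern, with [^_\s] in place of [A-Z].
broad_pattern = r'_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)'

def collapse(s):
    """Remove underscores between single characters; keep other matches unchanged.

    `collapse` is a hypothetical helper name, not part of the original answer.
    """
    return re.sub(
        broad_pattern,
        # Group 1 is set only for runs of single characters joined by underscores.
        lambda m: m.group(1).replace("_", "") if m.group(1) else m.group(),
        s,
    )

# Made-up inputs with lowercase letters and digits.
for sample in ("a_b_cdef_ghi", "1_2_3", "x_y_z_FOO_p_q"):
    print(sample, "->", collapse(sample))
```

Because whitespace is excluded from the character class, newlines in a multi-line input are never matched and still separate the individual strings.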