I have a dataset prepared for training in fastText, and I want to remove the sublabels from it.
For example:
__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data.
Any help is much appreciated.
Thanks.
I tried this:
r'(?<=__label__[^_]+)\w+'
but it isn't working.
Exact code:
ptrn = r'(?<=__label__[^_]+)\w+'
re.sub(ptrn, '', test_String)
and this error occurred:
error                                     Traceback (most recent call last)
c:\Users\THoseini\Desktop\projects\ensani_classification\tes4t.ipynb Cell 3 in <cell line: 3>()
      1 ptrn = r'(?<=__label__[^_]+)\w+'
----> 3 re.sub(ptrn, '', test_String)

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:209, in sub(pattern, repl, string, count, flags)
    202 def sub(pattern, repl, string, count=0, flags=0):
    203     """Return the string obtained by replacing the leftmost
    204     non-overlapping occurrences of the pattern in string by the
    205     replacement repl. repl can be either a string or a callable;
    206     if a string, backslash escapes in it are processed. If it is
    207     a callable, it's passed the Match object and must return
    208     a replacement string to be used."""
--> 209     return _compile(pattern, flags).sub(repl, string, count)

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:303, in _compile(pattern, flags)
    301 if not sre_compile.isstring(pattern):
    302     raise TypeError("first argument must be string or compiled pattern")
--> 303 p = sre_compile.compile(pattern, flags)
    304 if not (flags & DEBUG):
    305     if len(_cache) >= _MAXCACHE:
    306         # Drop the oldest item

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\sre_compile.py:792, in compile(p, flags)
--> 198 raise error("look-behind requires fixed-width pattern")
    199 emit(lo) # look behind
    200 _compile(code, av[1], flags)

error: look-behind requires fixed-width pattern
>Solution :
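Try this regex:
(__label__[^_\s]+)_\w+
And sample code in Python:
import re
test_string = """__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data."""
ptrn = r'(__label__[^_\s]+)_\w+'
print(re.sub(ptrn, r'\1', test_string))
re.sub() (short for "substitute") returns a copy of the string in which every match of the pattern is replaced. Python's re module only supports fixed-width look-behind, so instead of a look-behind the label prefix is captured in group 1 and put back through \1 in the replacement.
[^character_group] is a negated character class: it matches any single character that is not in character_group. \w matches any word character (including the underscore), and \s matches any whitespace character. The explicit _ before \w+ means only labels that actually have a sublabel are changed, so __label__label3 is left as it is. Without it, (__label__[^_\s]+)\w+ would backtrack and cut __label__label3 down to __label__label.
The output is as expected:
__label__label1 __label__label2 __label__label3 __label__label1 sometext some sentce som data.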
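To apply the same idea to a whole training file, the substitution could be wrapped in a small helper. This is only a sketch: the names SUBLABEL and strip_sublabels are made up for illustration, and the pattern used here requires the _ separator so that labels without a sublabel are left unchanged. It also checks that the look-behind pattern from the question is rejected when it is compiled, before any text is processed.

```python
import re

# Hypothetical helper names; the pattern requires the "_" separator,
# so a label without a sublabel does not match and is left untouched.
SUBLABEL = re.compile(r'(__label__[^_\s]+)_\w+')


def strip_sublabels(line):
    """Return the line with every __label__X_Y token shortened to __label__X."""
    return SUBLABEL.sub(r'\1', line)


# The variable-width look-behind from the question fails at compile time.
try:
    re.compile(r'(?<=__label__[^_]+)\w+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

sample = "__label__label1_sublabel1 __label__label3 some text"
cleaned = strip_sublabels(sample)
print(cleaned)
print(lookbehind_ok)
```

For a file, the helper could be called on each line while reading, writing the cleaned lines to a new file so the original dataset stays available.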