I used an NLP chunker that incorrectly splits the terms ‘C++’ and ‘C#’ into: C (NN), + (SYM), + (SYM), C (NN), # (SYM).
The resulting incorrectly chunked list looks like this:
l = [['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'], ['C', 'NN'], ['#', 'NN']]
I would like to post-process this list by finding each inner list whose string at index 0 is ‘C’ and is followed by ‘+’, ‘+’ or ‘#’, and then concatenating these strings, so that ‘C’, ‘+’, ‘+’ becomes ‘C++’. This has to be generalisable: it should work with lists that contain many different words and still concatenate only the desired strings.
desired result:
l_desired = [['C++', 'NN'], ['C#', 'NN']]
I can identify the items in the list independently (index 0) but I don’t know how to go about identifying the desired sequence. My idea was to use the next() function, although I do not know where to begin.
>Solution :
You can loop over the list and check whether each token starts with a letter. If it does, append the item as a new entry; otherwise, concatenate the token onto the last entry:
from string import ascii_letters

letters = set(ascii_letters)

out = []
for e in l:
    if e[0][0] in letters:
        out.append(e.copy())  # making a copy not to affect original list
    elif out:  # this is to check that out is not empty (edge case)
        out[-1][0] += e[0]
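One thing to keep in mind with the letter check: any token that does not start with an ASCII letter gets merged into the previous item, not only ‘+’ and ‘#’. A small sketch of that behaviour, using a made-up input list (`sample` and its tags are hypothetical, not from the question):

```python
from string import ascii_letters

letters = set(ascii_letters)

# Hypothetical input: a digit token follows a word
sample = [['Python', 'NN'], ['3', 'CD'], ['C', 'NN'], ['#', 'NN']]

out = []
for e in sample:
    if e[0][0] in letters:
        out.append(e.copy())   # token starts with a letter: new entry
    elif out:
        out[-1][0] += e[0]     # anything else is glued onto the last entry

print(out)  # [['Python3', 'NN'], ['C#', 'NN']]
```

If merging digits or punctuation this way is not wanted, an explicit set of symbols is the safer choice.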
Or, using an explicit set of symbols that should be merged onto the previous item:
symbols = set('+#')

out = []
for e in l:
    if e[0] in symbols and out:
        out[-1][0] += e[0]
    else:
        out.append(e.copy())
output:
[['C++', 'NN'], ['C#', 'NN']]
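To check that this generalises to lists containing other words, the symbol-set version can be wrapped in a function. This is only a sketch: the name `merge_trailing_symbols`, its `symbols` parameter and the sample sentence are assumptions for illustration, not part of the original question.

```python
def merge_trailing_symbols(tokens, symbols=frozenset('+#')):
    """Glue symbol tokens onto the preceding token, keeping that token's tag."""
    merged = []
    for text, tag in tokens:
        if text in symbols and merged:
            merged[-1][0] += text        # extend the previous token
        else:
            merged.append([text, tag])   # new list, so the input is not modified
    return merged


# Hypothetical chunker output for a longer sentence
tagged = [['I', 'PRP'], ['like', 'VBP'], ['C', 'NN'], ['+', 'SYM'], ['+', 'SYM'],
          ['and', 'CC'], ['C', 'NN'], ['#', 'NN']]

result = merge_trailing_symbols(tagged)
print(result)
# [['I', 'PRP'], ['like', 'VBP'], ['C++', 'NN'], ['and', 'CC'], ['C#', 'NN']]
```

A symbol at the very start of the list has nothing to attach to, so it is kept as its own item.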