I'm having trouble finding a way to get every two-letter combination (character bigram) of each string in a DataFrame column. Everything I have found so far deals with words rather than letters. The expected output is shown in the example table below.
I have tried the three lines below.
df['bigram'] = list(zip(df['string'], df['string'][1:]))
Generated this error
ValueError: Length of values (15570) does not match length of index (15571)
df['bigram'] = list(ngrams(df['string'], n=2))
Generated this error
ValueError: Length of values (15570) does not match length of index (15571)
df['bigram']=re.findall(r'[a-zA-z]{2}', df['string'])
Generated this error
TypeError: expected string or bytes-like object
Example:
string | output |
---|---|
hello | he, el, ll, lo |
world | wo, or, rl, ld |
Solution:
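You need to loop over the strings. Your attempts operate on the whole column at once: zip and ngrams pair up adjacent rows (hence one value fewer than the index), and re.findall expects a single string rather than a Series. Apply the function to each string instead: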
import pandas as pd
from nltk import ngrams

df = pd.DataFrame({'string': ['abc', 'abcdef']})
# Apply ngrams to each string, not to the whole column
df['bigram'] = df['string'].apply(lambda x: list(ngrams(x, n=2)))
Output:
   string                                    bigram
0     abc                          [(a, b), (b, c)]
1  abcdef  [(a, b), (b, c), (c, d), (d, e), (e, f)]
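For context, here is a minimal sketch of why the first attempt in the question fails, using a small made-up DataFrame (the sample values are assumptions, not the asker's data). Iterating a Series walks over rows, so zip pairs each row with the next row and yields one item fewer than there are rows. On a single string, the same zip walks over characters.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({'string': ['hello', 'world', 'abc']})

# zip over the Series pairs each ROW with the following row
row_pairs = list(zip(df['string'], df['string'][1:]))
print(row_pairs)
print(len(row_pairs), len(df))  # one pair fewer than rows, hence the ValueError

# zip over one string pairs each CHARACTER with the following character
word = df['string'].iloc[0]
char_pairs = [a + b for a, b in zip(word, word[1:])]
print(char_pairs)
```

The length mismatch in the error message (15570 values against an index of 15571) is consistent with this row-wise pairing.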
If you want a string:
# range(len(x)-1) yields one start index per adjacent pair of characters
df['bigram'] = [', '.join([x[i:i+2] for i in range(len(x)-1)])
                for x in df['string']]
Output:
   string              bigram
0     abc              ab, bc
1  abcdef  ab, bc, cd, de, ef
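If you would rather avoid the nltk dependency, the same idea can be wrapped in a small helper. This is only a sketch: `char_bigrams` and its `sep` parameter are hypothetical names introduced here, and it assumes every value in the column is a `str` (missing values would need separate handling).

```python
import pandas as pd

def char_bigrams(text, sep=', '):
    """Join every pair of adjacent characters in `text` with `sep`.

    Hypothetical helper; assumes `text` is a str.
    """
    # zip pairs each character with its successor;
    # strings shorter than two characters produce an empty string
    return sep.join(a + b for a, b in zip(text, text[1:]))

df = pd.DataFrame({'string': ['hello', 'world', 'a']})
df['bigram'] = df['string'].map(char_bigrams)
print(df)
```

Using `Series.map` with a named function keeps the per-string logic testable on its own, for example `char_bigrams('hello')` gives the `he, el, ll, lo` string from the question's expected output.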