Two-Letter Bigrams in a Pandas DataFrame

I'm having trouble finding a way to get every two-letter combination (character bigram) of each string in a DataFrame column. Everything I have found so far deals with word n-grams rather than letters. The expected output is below.

string output
hello  he, el, ll, lo
world  wo, or, rl, ld

I have tried the three lines below, one at a time.

df['bigram'] = list(zip(df['string'], df['string'][1:]))

Generated this error

ValueError: Length of values (15570) does not match length of index (15571)

df['bigram'] = list(ngrams(df['string'], n=2))

Generated this error

ValueError: Length of values (15570) does not match length of index (15571)

df['bigram'] = re.findall(r'[a-zA-Z]{2}', df['string'])

Generated this error

TypeError: expected string or bytes-like object

Example:

string output
hello he, el, ll, lo
world wo, or, rl, ld

>Solution :

You need to loop over the strings:

import pandas as pd
from nltk import ngrams

df = pd.DataFrame({'string': ['abc', 'abcdef']})

df['bigram'] = df['string'].apply(lambda x: list(ngrams(x, n=2)))

Output:

   string                                    bigram
0     abc                          [(a, b), (b, c)]
1  abcdef  [(a, b), (b, c), (c, d), (d, e), (e, f)]
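
As a rough illustration of why zipping the whole column fails while zipping each string works, here is a minimal sketch. It assumes a plain Python list in place of df['string'] (iterating a Series yields its values in the same way), and the names strings, row_pairs and char_pairs are made up for the example.

```python
# A plain list stands in for df['string'] (an assumption to keep the
# sketch dependency-free).
strings = ['hello', 'world']

# Zipping the column with itself shifted by one pairs neighbouring ROWS,
# so the result is one element shorter than the column. That is the
# source of the "Length of values ... does not match" error.
row_pairs = list(zip(strings, strings[1:]))
print(row_pairs)  # [('hello', 'world')]

# Zipping each string with itself shifted by one pairs neighbouring
# CHARACTERS, which is what a letter bigram is.
char_pairs = [[a + b for a, b in zip(s, s[1:])] for s in strings]
print(char_pairs)  # [['he', 'el', 'll', 'lo'], ['wo', 'or', 'rl', 'ld']]
```

The same reasoning applies to ngrams(df['string'], n=2): called on the column, it builds pairs of rows, not pairs of letters.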

If you want a string:

# a string of length n has n - 1 bigrams, so i runs over range(len(x) - 1)
df['bigram'] = [', '.join([x[i:i+2] for i in range(len(x)-1)])
                for x in df['string']]

Output:

   string              bigram
0     abc              ab, bc
1  abcdef  ab, bc, cd, de, ef
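
If the column may contain missing values or very short strings, the slicing idea could be wrapped in a small helper. This is only a sketch: char_ngrams is a hypothetical name, and returning an empty string for non-string input is an assumption about the desired behaviour, not something stated in the question.

```python
def char_ngrams(text, n=2, sep=', '):
    """Join all overlapping n-character slices of text with sep."""
    # Assumption: missing values (None, NaN) map to an empty string.
    if not isinstance(text, str):
        return ''
    # A string of length L has L - n + 1 slices of length n;
    # range() is empty when the string is shorter than n.
    return sep.join(text[i:i + n] for i in range(len(text) - n + 1))


print(char_ngrams('hello'))       # he, el, ll, lo
print(char_ngrams('world', n=3))  # wor, orl, rld
```

It can then be applied per row with df['bigram'] = df['string'].apply(char_ngrams).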
