I would like to split the string in every row into character bigrams and use those bigrams as columns, in order to encode the original string column.
I have a dataset like this one:
| A |
|---|
| blue |
| red |
| black |
I want my result to look like this:
| A | bl | lu | ue | re | ed | la | ac | ck |
|---|---|---|---|---|---|---|---|---|
| blue | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| red | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| black | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
I have tried splitting up A, but it does not split the strings into characters.
>Solution :
Here’s one way to do it:
import pandas as pd

# sample data
f = pd.DataFrame({'A': ['blue', 'red', 'black']})

def bigram(s, n=2):
    # all overlapping character n-grams of s (bigrams when n=2)
    return [s[i:i+n] for i in range(len(s) - n + 1)]

# one row per (string, bigram) pair, then count bigrams per string
f['bgm'] = f['A'].apply(bigram)
f = f.explode('bgm').reset_index(drop=True)
f = pd.crosstab(f['A'], f['bgm']).reset_index()
f.columns.name = None
print(f)
       A  ac  bl  ck  ed  la  lu  re  ue
0  black   1   1   1   0   1   0   0   0
1   blue   0   1   0   0   0   1   0   1
2    red   0   0   0   1   0   0   1   0
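Note that `pd.crosstab` sorts both the rows and the bigram columns alphabetically, and it counts occurrences, so a string that contains the same bigram twice would get a 2 rather than a 1. If the original row order, first-appearance column order, and strict 0/1 values from the question matter, a minimal sketch along these lines may work. The helper names `char_ngrams` and `encode_bigrams` are made up for illustration and are not part of pandas.

```python
import pandas as pd

def char_ngrams(s, n=2):
    """Return the overlapping character n-grams of s, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def encode_bigrams(df, col="A", n=2):
    """Hypothetical helper: append one 0/1 indicator column per n-gram."""
    grams = [char_ngrams(s, n) for s in df[col]]
    # dict.fromkeys de-duplicates while keeping first-appearance order
    columns = list(dict.fromkeys(g for row in grams for g in row))
    indicators = pd.DataFrame(
        [[int(c in row) for c in columns] for row in grams],
        columns=columns,
        index=df.index,
    )
    # keep the original column first, in the original row order
    return pd.concat([df[[col]], indicators], axis=1)

df = pd.DataFrame({"A": ["blue", "red", "black"]})
result = encode_bigrams(df)
print(result)
```

For the sample data this keeps the rows as blue, red, black and orders the columns as A, bl, lu, ue, re, ed, la, ac, ck, which matches the layout asked for in the question.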