Is it possible to create a Bag of Words from characters or substrings instead of words?

Is it possible to create a BoW but instead of searching for words I do it for substrings?

I’m working on a Python program where I input an array of various names (instead of full sentences) and try to apply BoW to it. The problem is that because BoW is designed for words in sentences, the program treats the names as sentences.

Example:
If I have the words Farahoka, Csanoha, April, Bas, Phrahonee and I’m looking for the substring aho.


How could I do this?

Edit:
It seems that my question is not that clear, so I’ll do my best to explain what the task is and what I need to do.

I have a list of various names in an array, and I’m trying to find a way to vectorize the letters, or maybe a way to separate the names into syllables.

Example:

In BoW, if I have The sky is blue today, it will be separated into [The, sky, is, blue, today]. For my problem I’m trying to do something similar: separate/find substrings within words.

Using the previous example, I want to take the word today and search for the substring ay.

Is it possible to do it without using things like if 'ay' in today or today.endswith('ay')?

In theory I need to use a unigram model for this in order to learn weights for a predictor, but everything I can find online is focused on words, not substrings.

Solution:

You don’t have much choice but to loop over the elements.

The exact output you expect is unclear, but you could do one of the following.

Searching for matches:

words = ['Farahoka', 'Csanoha', 'April', 'Bas', 'Phrahonee']

[w for w in words if 'aho' in w]
# ['Farahoka', 'Phrahonee']

Testing whether any word contains the substring:

any('aho' in w for w in words)
# True

If you’re looking for something a bit more generic, you could compute all three-grams of your words:

from nltk import ngrams
from collections import Counter

counts = sum((Counter(map(''.join, ngrams(w.lower(), 3)))
              for w in words), Counter())

Output:

Counter({'far': 1,
         'ara': 1,
         'rah': 2,
         'aho': 2,
         'hok': 1,
         'oka': 1,
         'csa': 1,
         'san': 1,
         'ano': 1,
         'noh': 1,
         'oha': 1,
         'apr': 1,
         'pri': 1,
         'ril': 1,
         'bas': 1,
         'phr': 1,
         'hra': 1,
         'hon': 1,
         'one': 1,
         'nee': 1})
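If you’d rather not depend on nltk, the same character trigram counts can be built with the standard library alone. This is a minimal sketch of that equivalent (the helper name char_ngrams is made up for illustration):

```python
from collections import Counter

words = ['Farahoka', 'Csanoha', 'April', 'Bas', 'Phrahonee']

def char_ngrams(word, n=3):
    # Slide a window of length n over the lowercased word.
    w = word.lower()
    return (w[i:i + n] for i in range(len(w) - n + 1))

counts = Counter()
for w in words:
    counts.update(char_ngrams(w))

counts['aho']
# 2
```

If you later need these counts as feature vectors for a predictor, scikit-learn’s CountVectorizer also accepts analyzer='char' together with an ngram_range, which builds exactly this kind of character-level bag of words.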