I am trying to create a polars dataframe which is a frequency table of words in a list of words. Something like this:
from collections import defaultdict
word_freq = defaultdict(int)
for word in list_of_words:
    word_freq[word] += 1
Except, instead of a dictionary I would like it to be a polars dataframe with two columns: word, count.
I would also like to know the best way to convert this dict to a DataFrame (in cases where that is needed).
>Solution :
There is collections.Counter which simplifies this:
from collections import Counter
words = ['foo', 'foo', 'bar', 'baz', 'baz']
counts = Counter(words)
Counter({'foo': 2, 'bar': 1, 'baz': 2})
To create a DataFrame (passing orient='row' tells Polars that each tuple is one row):
import polars as pl
pl.DataFrame(list(counts.items()), schema=['word', 'count'], orient='row')
shape: (3, 2)
┌──────┬───────┐
│ word ┆ count │
│ ---  ┆ ---   │
│ str  ┆ i64   │
╞══════╪═══════╡
│ foo  ┆ 2     │
│ bar  ┆ 1     │
│ baz  ┆ 2     │
└──────┴───────┘
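For the dict-to-DataFrame case asked about in the question, another option is to split the mapping into two parallel columns first, since a column-oriented dict is a shape pl.DataFrame accepts directly. A minimal sketch, where to_columns is a hypothetical helper name (not part of Polars) and the Polars step is skipped if the library is not installed:

```python
from collections import Counter


def to_columns(freq):
    """Split a word -> count mapping into two parallel lists (hypothetical helper)."""
    return {"word": list(freq.keys()), "count": list(freq.values())}


words = ["foo", "foo", "bar", "baz", "baz"]
columns = to_columns(Counter(words))
print(columns)

try:
    import polars as pl
except ImportError:
    pl = None  # Polars not installed; the column dict above is still usable

if pl is not None:
    # A dict of column name -> list of values maps onto DataFrame columns.
    df = pl.DataFrame(columns)
    print(df)
```

The same helper works for a plain dict or a defaultdict, because it only relies on keys() and values() returning entries in the same order.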
You could also do the counting in Polars with .value_counts() (the output below names the column counts; newer Polars releases name it count):
pl.Series('word', words).value_counts()
shape: (3, 2)
┌──────┬────────┐
│ word ┆ counts │
│ ---  ┆ ---    │
│ str  ┆ u32    │
╞══════╪════════╡
│ foo  ┆ 2      │
│ bar  ┆ 1      │
│ baz  ┆ 2      │
└──────┴────────┘
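If the frequency table should be ordered by count, both approaches can do that. This is a rough sketch rather than the only way: Counter.most_common() gives the ordering in plain Python, and value_counts takes a sort flag. The Polars step is skipped if the library is not installed.

```python
from collections import Counter

words = ["foo", "foo", "bar", "baz", "baz"]

# Pure-Python reference: (word, count) pairs, highest count first.
ranked = Counter(words).most_common()
print(ranked)

try:
    import polars as pl
except ImportError:
    pl = None  # Polars not installed; skip the DataFrame part

if pl is not None:
    # sort=True orders the rows by count, descending.
    vc = pl.Series("word", words).value_counts(sort=True)
    print(vc)
```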