I’m trying to load the contents of two pickled files in a directory into dicts, which are then appended to a list. For reference, there are only two .pkl files in the directory, and each pickled object loads as a list. However, when I append the dicts to the list, I get duplicate results. Any idea why?
import os
import pickle
import pandas as pd

y_labels = ('anime.pkl', 'manga.pkl')

def process_docs(path, label):
    docs = os.listdir(path)
    data = []
    for doc in docs:
        with open(f'{path}/{doc}', 'rb') as f:
            text = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(text)})
    return data

data = []
for label in y_labels:
    data.extend(process_docs('keywords', label))
df = pd.DataFrame(data)
ACTUAL OUTPUT:
[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': 'a b c'}]
[{'label': 'manga.pkl', 'text': '1 2 3'}, {'label': 'manga.pkl', 'text': '1 2 3'}]
EXPECTED OUTPUT:
[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': '1 2 3'}]
>Solution :
That’s because you are reading the same directory twice. Your code calls process_docs('keywords', label) once per label, so twice in total. Each call runs docs = os.listdir(path) with path set to 'keywords', so docs is the same both times. The function then loops over docs and appends the contents of every file in it, without using label to decide which file to open. As a result, you get duplicated results.
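To see the duplication in isolation, here is a small sketch that records which (label, file) pairs the nested loops visit. The temporary directory and its contents are made up for illustration and only stand in for the 'keywords' folder.

```python
import os
import pickle
import tempfile

y_labels = ('anime.pkl', 'manga.pkl')

# Hypothetical stand-in for the 'keywords' directory.
path = tempfile.mkdtemp()
for name, words in zip(y_labels, (['a', 'b', 'c'], ['1', '2', '3'])):
    with open(os.path.join(path, name), 'wb') as f:
        pickle.dump(words, f)

# Same shape as the question's code: an outer loop over the labels and
# an inner loop over every file in the directory.
pairs = []
for label in y_labels:
    for doc in os.listdir(path):
        pairs.append((label, doc))

print(len(pairs))  # 2 labels x 2 files = 4 rows, not 2
```

Every file is opened once per label, which is why each file's contents show up more than once.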
To get your expected result, you only need to iterate over the docs and labels together, once. You do not need two for loops. For example, you can do the following.
data = []
path = 'keywords'
# os.listdir returns names in arbitrary order; sort them so they line up with y_labels
docs = sorted(os.listdir(path))
for i, label in enumerate(y_labels):
    doc = docs[i]
    with open(f'{path}/{doc}', 'rb') as f:
        text = pickle.load(f)
    data.append({'label': label, 'text': ' '.join(text)})
df = pd.DataFrame(data)
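Pairing files with labels by position relies on the order of the directory listing, and os.listdir makes no ordering guarantee. Since the labels here happen to be the file names themselves, an alternative sketch is to open each file by its label directly. The helper name load_labelled and the temporary demo directory are hypothetical and only there to make the example runnable on its own.

```python
import os
import pickle
import tempfile

y_labels = ('anime.pkl', 'manga.pkl')

def load_labelled(path, labels):
    """Return one {'label', 'text'} dict per label, reading path/label."""
    data = []
    for label in labels:
        # Assumption: each label is also the name of a pickle file in `path`.
        with open(os.path.join(path, label), 'rb') as f:
            text = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(text)})
    return data

# Hypothetical demo data standing in for the 'keywords' directory.
demo_dir = tempfile.mkdtemp()
for name, words in zip(y_labels, (['a', 'b', 'c'], ['1', '2', '3'])):
    with open(os.path.join(demo_dir, name), 'wb') as f:
        pickle.dump(words, f)

data = load_labelled(demo_dir, y_labels)
print(data)
```

The resulting list can be passed to pd.DataFrame(data) just as before. This version does not call os.listdir at all, so extra files in the directory or a different listing order should not affect the result.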