Why is appending a dict to list not working here?

I’m trying to load the contents of two pickled files in a directory into dicts, which are then appended to a list. For reference, there are only two .pkl files in the directory, and each pickled object is loaded as a list. However, when I append the dicts to the list, I get duplicate results. Any idea why?

import os
import pickle
import pandas as pd


y_labels = ('anime.pkl', 'manga.pkl')


def process_docs(path, label):
    docs = os.listdir(path)
    data = []
    for doc in docs:
        with open(f'{path}/{doc}', 'rb') as f:
            text = pickle.load(f)
            data.append({'label': label, 'text': ' '.join(text)})
    return data


data = []
for label in y_labels:
    data.extend(process_docs('keywords', label))
df = pd.DataFrame(data)

ACTUAL OUTPUT:

[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': 'a b c'}] 
[{'label': 'manga.pkl', 'text': '1 2 3'}, {'label': 'manga.pkl', 'text': '1 2 3'}]

EXPECTED OUTPUT:

[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': '1 2 3'}]

Solution:

That’s because you are reading the same directory twice. Your code calls process_docs('keywords', label) once per label, so twice in total. Each call runs docs = os.listdir(path) with path set to 'keywords', so docs is the same both times. The function then loops over docs and appends the contents of the same files under whichever label was passed in. As a result, you get duplicated results.
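To see this in isolation, here is a minimal sketch that reproduces the pattern in a throwaway directory. The file contents, the build_demo_dir helper, and the use of tempfile are assumptions made up for the demo, and the listing is sorted only to keep the order stable; otherwise process_docs follows the logic in the question.

```python
import os
import pickle
import tempfile


def process_docs(path, label):
    """Same logic as the question: tag every file in `path` with `label`."""
    data = []
    # sorted() is an addition for the demo so the row order is stable.
    for doc in sorted(os.listdir(path)):
        with open(os.path.join(path, doc), 'rb') as f:
            text = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(text)})
    return data


def build_demo_dir():
    """Hypothetical helper: create a temp directory with two pickled lists."""
    path = tempfile.mkdtemp()
    samples = (('anime.pkl', ['a', 'b', 'c']), ('manga.pkl', ['1', '2', '3']))
    for name, words in samples:
        with open(os.path.join(path, name), 'wb') as f:
            pickle.dump(words, f)
    return path


demo_path = build_demo_dir()
rows = []
for label in ('anime.pkl', 'manga.pkl'):
    # Each call lists the whole directory again, so both files are read
    # once per label.
    rows.extend(process_docs(demo_path, label))

# Two labels x two files = four rows.
for row in rows:
    print(row)
```

Every file is read once per label, so the list ends up with twice as many rows as there are files.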

To get your expected result, iterate over the docs and labels together in a single loop, pairing each label with one file. You do not need two for loops. For example, you can do the following.

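# Reuses the imports and y_labels from the question.
data = []
path = 'keywords'
# os.listdir() returns names in arbitrary order, so sort them to get a
# stable pairing with y_labels.
docs = sorted(os.listdir(path))
# Pair each label with exactly one file and read that file once.
for label, doc in zip(y_labels, docs):
    with open(os.path.join(path, doc), 'rb') as f:
        text = pickle.load(f)
    data.append({'label': label, 'text': ' '.join(text)})
df = pd.DataFrame(data)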
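
Another option, assuming each entry in y_labels really is the name of a file inside the directory (as 'anime.pkl' and 'manga.pkl' suggest), is to skip os.listdir entirely and open each file by its label. This is only a sketch: load_labeled_docs is a hypothetical helper name, and the temp directory and its contents are made up for the demo.

```python
import os
import pickle
import tempfile


def load_labeled_docs(path, labels):
    """Read one pickle per label, assuming each label is a file name in `path`."""
    data = []
    for label in labels:
        with open(os.path.join(path, label), 'rb') as f:
            text = pickle.load(f)  # expected to be a list of strings
        data.append({'label': label, 'text': ' '.join(text)})
    return data


# Demo setup with made-up contents; in the question the directory is 'keywords'.
demo_path = tempfile.mkdtemp()
for name, words in (('anime.pkl', ['a', 'b', 'c']), ('manga.pkl', ['1', '2', '3'])):
    with open(os.path.join(demo_path, name), 'wb') as f:
        pickle.dump(words, f)

result = load_labeled_docs(demo_path, ('anime.pkl', 'manga.pkl'))
print(result)
```

This does not depend on directory listing order at all, and the resulting list can be passed to pd.DataFrame just like data above.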