Grouping filenames by multiple categories into list of lists

Given a directory containing the following files:

pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k
pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k
pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k
pcasvm_dataset_window_blackman_nperseg_16384_distance_1_speed_200k
pcasvm_dataset_window_hamming_nperseg_4096_distance_1_speed_25k
pcasvm_dataset_window_hamming_nperseg_8192_distance_5_speed_25k
pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k
...

I can read these in with the following comprehension:

datasets = [d for d in os.listdir('path/to/dir')]

However, what I want to do is analyse these datasets in groups, with the groups being:

window (e.g. blackman, hann) and nperseg (e.g. 4096, 8192)

The problem is how best to do this reasonably quickly, given that the real directory contains a large number of datasets. Would a dictionary be ideal? For example:

{
    "blackman": {
        4096: [file1, file2, file3],
        8192: [...],
        ...
    },
    ...
}

Thanks!

Solution:

If I understand you correctly, you can use re to parse filenames and dict.setdefault to group them:

import re

file_names = [
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k",
    "pcasvm_dataset_window_blackman_nperseg_16384_distance_1_speed_200k",
    "pcasvm_dataset_window_hamming_nperseg_4096_distance_1_speed_25k",
    "pcasvm_dataset_window_hamming_nperseg_8192_distance_5_speed_25k",
    "pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k",
]

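# Capture the window name and the nperseg value from each file name.
pat = re.compile(r"window_([^_]+)_nperseg_([^_]+)")

# Nested mapping: window -> nperseg -> list of file names.
out = {}
for name in file_names:
    m = pat.search(name)
    if m:
        # setdefault creates the inner dict/list on first use, then returns it.
        out.setdefault(m.group(1), {}).setdefault(m.group(2), []).append(name)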

print(out)

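Prints (reformatted here for readability; note that the nperseg keys are strings, because regex groups are strings):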

{
    "blackman": {
        "4096": [
            "pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k",
            "pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k",
        ],
        "8192": [
            "pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k"
        ],
        "16384": [
            "pcasvm_dataset_window_blackman_nperseg_16384_distance_1_speed_200k"
        ],
    },
    "hamming": {
        "4096": [
            "pcasvm_dataset_window_hamming_nperseg_4096_distance_1_speed_25k"
        ],
        "8192": [
            "pcasvm_dataset_window_hamming_nperseg_8192_distance_5_speed_25k"
        ],
    },
    "hann": {
        "4096": ["pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k"]
    },
}
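
If integer nperseg keys (as in the question's sketch) or a plain list of lists (as in the title) are preferred, the same idea can be written with collections.defaultdict. The following is only a sketch under assumptions: the helper names group_by_window_and_nperseg and as_list_of_lists are made up for illustration, and the pattern assumes nperseg is always a run of digits.

```python
import re
from collections import defaultdict

# Assumption: nperseg is always numeric, so the second group matches digits only.
PATTERN = re.compile(r"window_([^_]+)_nperseg_(\d+)")


def group_by_window_and_nperseg(names):
    """Return {window: {nperseg as int: [file names]}} for matching names."""
    groups = defaultdict(lambda: defaultdict(list))
    for name in names:
        m = PATTERN.search(name)
        if m is None:
            continue  # skip names that do not follow the naming scheme
        window, nperseg = m.group(1), int(m.group(2))
        groups[window][nperseg].append(name)
    # Convert to plain dicts so that a missing key raises KeyError again.
    return {window: dict(by_nperseg) for window, by_nperseg in groups.items()}


def as_list_of_lists(groups):
    """Return one list of file names per (window, nperseg) pair, in sorted order."""
    return [
        groups[window][nperseg]
        for window in sorted(groups)
        for nperseg in sorted(groups[window])
    ]


file_names = [
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_1_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_4096_distance_2_speed_25k",
    "pcasvm_dataset_window_blackman_nperseg_8192_distance_1_speed_100k",
    "pcasvm_dataset_window_blackman_nperseg_16384_distance_1_speed_200k",
    "pcasvm_dataset_window_hamming_nperseg_4096_distance_1_speed_25k",
    "pcasvm_dataset_window_hamming_nperseg_8192_distance_5_speed_25k",
    "pcasvm_dataset_window_hann_nperseg_4096_distance_1_speed_25k",
]

groups = group_by_window_and_nperseg(file_names)
print(groups["blackman"][4096])   # the two blackman/4096 file names
print(len(as_list_of_lists(groups)))  # number of (window, nperseg) groups
```

Both versions make a single pass over the file names, so either should scale to a large directory; the choice is mostly about whether the keys should be strings or integers.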