PyTorch: delete feature columns from a dataset


I have the dataset below and would like to delete the features A through F.
The dataset was converted from pandas DataFrames:

import datasets
from datasets import Dataset
# X_train, X_test and X_val are pandas DataFrames
dataset = datasets.DatasetDict({
    "train": Dataset.from_pandas(X_train),
    "test": Dataset.from_pandas(X_test),
    "val": Dataset.from_pandas(X_val),
})

Printing the dataset gives the following:

DatasetDict({
train: Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'text', '__index_level_0__', 'label'],
    num_rows: 1173
})
test: Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'text', '__index_level_0__', 'label'],
    num_rows: 1369
})
val: Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'text', '__index_level_0__', 'label'],
    num_rows: 1369
})

})

The result I want looks like this:

DatasetDict({
train: Dataset({
    features: ['text', '__index_level_0__', 'label'],
    num_rows: 1173
})
test: Dataset({
    features: ['text', '__index_level_0__', 'label'],
    num_rows: 1369
})
val: Dataset({
    features: ['text', '__index_level_0__', 'label'],
    num_rows: 1369
})

})

>Solution :

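What you need is the remove_columns() method from the datasets library. It is available on both Dataset and DatasetDict objects, so you can drop columns at this level instead of doing it in pandas beforehand. It accepts a single column name or a list of names and returns a new object rather than modifying the existing one, which is why the result is assigned back: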

dataset = dataset.remove_columns("label")

For your case, it would be:

dataset = dataset.remove_columns(['A', 'B', 'C', 'D', 'E', 'F'])

You can have a look here: https://huggingface.co/docs/datasets/process
