How to properly make a train/test split using `torchdata`?

I’ve been using the torchdata library (v0.6.0) to construct datapipes for my machine learning model, but I can’t seem to figure out how torchdata expects its users to make a train/test split.

Supposing I have a datapipe dp, my first attempt was to use the Sampler datapipe along with a torch.utils.data.SubsetRandomSampler (which is what I expected from this part of the documentation), but this doesn’t work how I would’ve thought:

>>> dp = IterableWrapper(range(5))
>>> Sampler(dp, SubsetRandomSampler([0, 1, 2]))
Traceback (most recent call last):
...
TypeError: 'SubsetRandomSampler' object is not callable

Maybe torchdata has its own samplers that I’m not familiar with.

The only other way I can think of doing this would be to use a Demultiplexer, but this feels unclean to me, because we have to enumerate then "de-enumerate":

>>> train_len = len(dp) * 0.8
>>> dp1, dp2 = dp.enumerate().demux(num_instances=2, classifier_fn=lambda x: x[0] >= train_len)
>>> dp1, dp2 = (d.map(lambda x: x[1]) for d in (dp1, dp2))

Is there an "intended" way of doing this with torchdata which I’m missing?

Solution:

PyTorch’s tutorial on using DataPipes answers the question:

import torchdata.datapipes.iter as pipes
from torch.utils.data import DataLoader, random_split

# initialize DataPipe with dummy values
dp = pipes.IterableWrapper(range(5))

# create train/test split sizes (assuming 80/20 split)
train_size = int(len(dp) * 0.8)
test_size = len(dp) - train_size

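# split dataset into train/test sets
# (torch.utils.data.random_split indexes into its input, so the iterable
# DataPipe is materialized into a list first)
train_dataset, test_dataset = random_split(list(dp), [train_size, test_size])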

# create batch sizes for train and test dataloaders
# (loading everything into memory, no minibatches)
batch_train, batch_test = len(train_dataset), len(test_dataset)

# create train and test dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_train, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_test)

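# train model (model and loss_fn are assumed to be defined elsewhere,
# and each sample is assumed to be an (input, target) pair)
for i, j in train_dataloader:
    ...
    preds = model(i)
    loss = loss_fn(preds, j)
    ...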
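
If the split needs to be reproducible, torch.utils.data.random_split also accepts a generator argument. The following is a minimal sketch, assuming the data fits in memory; data and split_with_seed are illustrative names and do not come from the tutorial:

```python
import torch
from torch.utils.data import random_split

# stand-in for the materialized contents of a DataPipe (illustrative data)
data = list(range(10))
train_size = int(len(data) * 0.8)
test_size = len(data) - train_size


def split_with_seed(seed):
    """Return (train, test) lists produced by a seeded random_split."""
    generator = torch.Generator().manual_seed(seed)
    train, test = random_split(data, [train_size, test_size], generator=generator)
    # Subset supports len() and integer indexing, so copy the items out explicitly
    train_items = [train[i] for i in range(len(train))]
    test_items = [test[i] for i in range(len(test))]
    return train_items, test_items


train_items, test_items = split_with_seed(42)
```

Calling split_with_seed again with the same seed should give the same partition, which helps when comparing runs.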

If you want to use the built-in random_split() method of DataPipe:

train_dataset, test_dataset = dp.random_split(total_length=len(dp), weights={"train": 0.8, "test": 0.2}, seed=42)

train_dataloader = DataLoader(train_dataset, batch_size=batch_train, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_test)
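
For intuition only, here is a pure-Python sketch of how a seeded, weighted split can be done over a stream that cannot be indexed. streaming_split is a hypothetical helper written for illustration; it is not torchdata's implementation, and the split sizes it produces are only approximately 80/20:

```python
import random
from typing import Dict, Iterable, List, TypeVar

T = TypeVar("T")


def streaming_split(
    items: Iterable[T], weights: Dict[str, float], seed: int
) -> Dict[str, List[T]]:
    """Assign each item of a stream to one named split, chosen by weight."""
    rng = random.Random(seed)  # seeded so the assignment is repeatable
    names = list(weights)
    probs = [weights[name] for name in names]
    splits: Dict[str, List[T]] = {name: [] for name in names}
    for item in items:
        # pick exactly one split for this item; no indexing into the stream
        (name,) = rng.choices(names, weights=probs, k=1)
        splits[name].append(item)
    return splits


splits = streaming_split(range(100), {"train": 0.8, "test": 0.2}, seed=42)
```

Each item lands in exactly one split, and reusing the seed reproduces the same assignment, which is the property that matters for a train/test split over an iterable source.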