I’ve been using the torchdata library (v0.6.0) to construct datapipes for my machine learning model, but I can’t seem to figure out how torchdata expects its users to make a train/test split.
Supposing I have a datapipe dp, my first attempt was to use the Sampler datapipe along with a torch.utils.data.SubsetRandomSampler (which is what I expected from this part of the documentation), but this doesn’t work how I would’ve thought:
>>> dp = IterableWrapper(range(5))
>>> Sampler(dp, SubsetRandomSampler([0, 1, 2]))
Traceback (most recent call last):
...
TypeError: 'SubsetRandomSampler' object is not callable
Maybe torchdata has its own samplers I’m not familiar with.
The only other way I can think of doing this would be to use a Demultiplexer, but this feels unclean to me, because we have to enumerate then "de-enumerate":
>>> train_len = len(dp) * 0.8
>>> dp1, dp2 = dp.enumerate().demux(num_instances=2, classifier_fn=lambda x: x[0] >= train_len)
>>> dp1, dp2 = (d.map(lambda x: x[1]) for d in (dp1, dp2))
Is there an "intended" way of doing this with torchdata which I’m missing?
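For reference, the enumerate / classify / "de-enumerate" idea behind the demux call can be sketched in plain Python without torchdata. This is only an illustration of the index-based split, not torchdata code; `index_split` is a hypothetical helper name.

```python
from itertools import tee


def index_split(source, train_len):
    """Split an iterable by position: items before train_len, then the rest."""
    # Duplicate the enumerated stream so each half can be consumed independently.
    first, second = tee(enumerate(source))
    head = (item for i, item in first if i < train_len)
    tail = (item for i, item in second if i >= train_len)
    return head, tail


data = range(5)
train_len = len(data) * 0.8  # a float threshold, as in the question
train, test = index_split(data, train_len)
train_items, test_items = list(train), list(test)
```

Note that a split like this is deterministic: the first 80% of the stream always goes to the first output, so it only makes sense when the source order is already shuffled or irrelevant.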
>Solution :
PyTorch’s tutorial on using DataPipes answers the question:
import torchdata.datapipes.iter as pipes
from torch.utils.data import DataLoader, random_split
# initialize DataPipe with dummy values
dp = pipes.IterableWrapper(range(5))
# create train/test split ratio sizes (assuming 80/20 split)
train_size = int(len(dp) * 0.8)
test_size = len(dp) - train_size
# split dataset into train/test sets
# (torch.utils.data.random_split indexes into its input, so this assumes dp supports indexing)
train_dataset, test_dataset = random_split(dp, [train_size, test_size])
# create batch sizes for train and test dataloaders
# (loading everything into memory, no minibatches)
batch_train, batch_test = len(train_dataset), len(test_dataset)
# create train and test dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_train, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_test)
# train model
for i, j in train_dataloader:
    ...
    preds = model(i)
    loss = loss_fn(preds, j)
    ...
If you want to use the built-in random_split() method of DataPipe:
train_dataset, test_dataset = dp.random_split(total_length=len(dp), weights={"train": 0.8, "test": 0.2}, seed=42)
train_dataloader = DataLoader(train_dataset, batch_size=batch_train, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_test)
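To make the role of the seed in a streaming split more concrete, here is a minimal plain-Python sketch of the general idea: every split replays the same seeded sequence of random assignments and keeps only its own elements. This is not torchdata's implementation; `random_split_stream` and `source_factory` are hypothetical names, and the proportions are only approximate in this sketch.

```python
import random


def random_split_stream(source_factory, weights, seed):
    """Return one generator function per split name.

    Each split re-iterates the source with an identically seeded RNG,
    so every element is assigned to exactly one split.
    """
    names = list(weights)
    probs = [weights[name] for name in names]

    def make(name):
        def gen():
            rng = random.Random(seed)  # same seed => same assignments
            for item in source_factory():
                if rng.choices(names, weights=probs)[0] == name:
                    yield item
        return gen

    return {name: make(name) for name in names}


splits = random_split_stream(
    lambda: range(10), {"train": 0.8, "test": 0.2}, seed=42
)
train_items = list(splits["train"]())
test_items = list(splits["test"]())
```

Because the assignment depends only on the seed and the element's position, the two splits never overlap and can be regenerated identically on every epoch.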