Custom Data Guide
TensorBloom supports loading your own datasets for training. This guide covers all supported data formats.
Quick Start
- Drag a Data node onto the canvas
- In Properties, select your data source
- Connect the Data node’s x handle to your model’s first layer
- Connect the Data node’s y handle to your Loss function’s target input
- Train
Built-in Datasets
Select a preset from the dropdown — no configuration needed:
| Dataset | Domain | Input Shape | Classes | Description |
|---|---|---|---|---|
| MNIST | Vision | [1, 28, 28] | 10 | Handwritten digits |
| Fashion-MNIST | Vision | [1, 28, 28] | 10 | Clothing items |
| CIFAR-10 | Vision | [3, 32, 32] | 10 | Natural images (10 categories) |
| CIFAR-100 | Vision | [3, 32, 32] | 100 | Natural images (100 categories) |
| TinyShakespeare | Text | [256] | 65 | Character-level Shakespeare |
| WikiText-2 | Text | [128] | 30,000 | Word-level Wikipedia |
| IMDB | Text | [100] | 2 | Movie review sentiment |
| AG News | Text | [100] | 4 | News classification |
| SpeechCommands | Audio | [80, 100] | 35 | Keyword spotting |
Datasets are downloaded automatically on first use to ./data/.
Custom Tensor Files
For your own data saved as PyTorch tensors, NumPy arrays, or SafeTensors.
Supported Formats
| Format | Extension | How to Create |
|---|---|---|
| PyTorch | .pt | torch.save({"features": X, "labels": y}, "data.pt") |
| NumPy | .npz | np.savez("data.npz", features=X, labels=y) |
| NumPy | .npy | np.save("data.npy", X) |
| SafeTensors | .safetensors | save_file({"features": X}, "data.safetensors") |
Creating a Dataset
```python
import torch

# Classification
X = torch.randn(1000, 3, 32, 32)   # 1000 images, 3 channels, 32x32
y = torch.randint(0, 10, (1000,))  # 10 classes
torch.save({"images": X, "labels": y}, "my_dataset.pt")

# Regression
X = torch.randn(500, 13)           # 500 samples, 13 features
y = torch.randn(500, 1)            # continuous target
torch.save({"inputs": X, "targets": y}, "regression_data.pt")

# Autoencoder (no labels)
X = torch.randn(1000, 1, 28, 28)
torch.save({"data": X}, "autoencoder_data.pt")
```
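The .pt examples above have direct equivalents in the other supported formats. A minimal sketch, assuming NumPy is installed and SafeTensors optionally so:

```python
import numpy as np

# The same classification dataset as above, saved in the other supported formats.
X = np.random.randn(1000, 3, 32, 32).astype("float32")
y = np.random.randint(0, 10, size=(1000,)).astype("int64")

# .npz archive: every keyword becomes a named tensor when you click Scan Data
np.savez("my_dataset.npz", features=X, labels=y)

# .npy holds exactly one unnamed tensor
np.save("features_only.npy", X)

# SafeTensors wants torch tensors and string keys (needs the safetensors package)
try:
    import torch
    from safetensors.torch import save_file
    save_file({"features": torch.from_numpy(X), "labels": torch.from_numpy(y)},
              "my_dataset.safetensors")
except ImportError:
    pass  # optional: the .npz above is already loadable
```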
Loading in TensorBloom
- Select Custom Tensors as the source
- Browse for your file or type the path
- Click Scan Data
- The app discovers all tensors in your file and shows them with shapes, types, and statistics
- Assign each tensor a role:
- Model Input — feeds into your neural network
- Loss Target — compared against the model’s output by the loss function
- Not Used — ignored during training
Tensor Requirements
- The first dimension of every tensor is the sample (batch) dimension, so all tensors must have the same number of samples
- Input tensors are cast to float32 by default
- Target tensors for classification should be int64/long (class indices)
- Target tensors for regression should be float32
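These rules can be checked before loading. Here is a small sanity-check script; the check_dataset helper is hypothetical, not part of TensorBloom:

```python
import torch

# Hypothetical pre-flight check (not part of TensorBloom) for the rules above.
def check_dataset(tensors):
    sizes = {name: t.shape[0] for name, t in tensors.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"sample counts differ: {sizes}")
    for name, t in tensors.items():
        print(f"{name}: shape={tuple(t.shape)} dtype={t.dtype}")

X = torch.randn(1000, 3, 32, 32)            # float32 inputs
y = torch.randint(0, 10, (1000,))           # int64 class indices
check_dataset({"images": X, "labels": y})   # passes

try:
    check_dataset({"images": X, "bad": torch.randn(999, 1)})  # mismatched first dim
except ValueError as e:
    print("rejected:", e)
```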
Folder of Files
You can also point to a folder containing multiple tensor files. TensorBloom scans all .pt, .npz, .npy, and .safetensors files and merges their tensors.
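The scan-and-merge step can be pictured as collecting named tensors from each file into one dictionary. A rough sketch; TensorBloom's exact merge rules may differ:

```python
from pathlib import Path
import numpy as np
import torch

# Build a small example folder with two tensor files.
folder = Path("tensor_folder")
folder.mkdir(exist_ok=True)
torch.save({"images": torch.randn(50, 1, 28, 28)}, folder / "images.pt")
np.save(folder / "labels.npy", np.random.randint(0, 10, size=(50,)))

# Illustrative scan: gather every named tensor across the folder.
merged = {}
for path in sorted(folder.iterdir()):
    if path.suffix == ".pt":
        merged.update(torch.load(path))
    elif path.suffix == ".npy":
        merged[path.stem] = torch.from_numpy(np.load(path))
    elif path.suffix == ".npz":
        merged.update({k: torch.from_numpy(v) for k, v in np.load(path).items()})

print(sorted(merged))   # ['images', 'labels']
```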
ImageFolder (Local Images)
For image classification with your own photos:
```
my_images/
├── train/
│   ├── cats/
│   │   ├── img001.jpg
│   │   └── img002.jpg
│   └── dogs/
│       ├── img003.jpg
│       └── img004.jpg
└── val/
    ├── cats/
    │   └── img005.jpg
    └── dogs/
        └── img006.jpg
```
- Select ImageFolder source
- Set the path to your folder
- Configure image size and color mode (RGB/Grayscale)
- Optionally enable data augmentation (random flip, crop, rotation, color jitter)
HuggingFace Datasets
Note: HuggingFace support is experimental in v0.1. Simple vision and text datasets work. Complex/nested datasets (multi-modal, structured data with nested dicts) require a custom preprocessing script.
Load datasets from the HuggingFace Hub:
- Select HuggingFace source
- Enter the dataset name (e.g., mnist, imdb, ag_news)
- Click Inspect Data to discover columns, types, and splits
- Configure column mapping (image/text/label columns)
- For datasets with complex columns, provide a preprocessing script in the code editor
What works
- Vision datasets with Image + ClassLabel columns (CIFAR-10, MNIST, etc.)
- Text classification with string + ClassLabel columns (IMDB, AG News, etc.)
- Simple tabular datasets with scalar columns
What doesn’t work yet
- Datasets with nested/dict columns (SQuAD, multi-modal datasets)
- Streaming datasets (large datasets that don’t fit in memory)
- Datasets requiring custom tokenizers
For unsupported formats, export your data as a .pt file and use Custom Tensors instead.
Requires pip install datasets.
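Exporting to a .pt file takes only a few lines once you have pulled your columns out of the datasets library (e.g. ds["feature"] and ds["label"] after load_dataset). The small lists below are stand-ins for real columns:

```python
import torch

# Stand-ins for columns extracted from a HuggingFace dataset,
# e.g. ds["feature"] / ds["label"] after load_dataset(...).
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
labels = [0, 1, 0]

X = torch.tensor(features, dtype=torch.float32)   # Model Input role
y = torch.tensor(labels, dtype=torch.int64)       # Loss Target role
torch.save({"features": X, "labels": y}, "exported.pt")

reloaded = torch.load("exported.pt")
print(reloaded["features"].shape, reloaded["labels"].dtype)
```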
CSV Files
For tabular data in CSV format:
- Select Custom CSV source
- Set the path to your CSV file
- Configure delimiter, header, target column
- Optionally select feature columns and normalization
The CSV should have feature columns followed by a target column. The first row can be a header.
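A minimal CSV in this layout can be generated as follows; the column names are made up for illustration:

```python
import csv

# Feature columns first, target column last, optional header row.
rows = [
    ["sqft", "bedrooms", "age", "price"],   # header row
    [1400, 3, 12, 250000],
    [2100, 4, 5, 410000],
    [900, 2, 30, 145000],
]
with open("houses.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
# In TensorBloom: set the target column to "price"; the rest are features.
```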
Validation Split
All data sources support a Val Split parameter (default 0.2). This splits your training data into training and validation sets. The validation loss is reported after each epoch to detect overfitting.
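The effect of the default 0.2 split can be sketched with torch's random_split; TensorBloom's internal splitting may differ in detail:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# A Val Split of 0.2 reserves 20% of samples for validation.
full = TensorDataset(torch.randn(1000, 13), torch.randn(1000, 1))
n_val = int(len(full) * 0.2)
train_set, val_set = random_split(full, [len(full) - n_val, n_val])
print(len(train_set), len(val_set))   # 800 200
```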
Tips
- Shape mismatch? TensorBloom auto-fixes in_features and in_channels when you click Start Training
- Wrong loss function? If your target is class labels but you’re using MSELoss, the preflight check will warn you
- Large datasets? Use mixed precision training (Advanced > Mixed Precision) to reduce VRAM usage
- Reproducibility? Set a random seed in Advanced > Random Seed