Data Loading

Data Augmentation

PyTorch Connectomics uses MONAI dictionary transforms for augmentation. The common path is to configure augmentations in YAML and let the Lightning data factory build the transform pipeline:

from connectomics.config import load_config
from connectomics.data.augmentation import build_train_transforms

cfg = load_config("tutorials/minimal.yaml")
transforms = build_train_transforms(cfg, keys=["image", "label"], skip_loading=True)

sample = {"image": image, "label": label}
augmented = transforms(sample)

For custom pipelines, compose MONAI transforms with the connectomics-specific *d dictionary transforms:

from monai.transforms import Compose, RandFlipd
from connectomics.data.augmentation import RandCutBlurd, RandMisAlignmentd

transforms = Compose([
    RandFlipd(keys=["image", "label"], prob=0.5, spatial_axis=0),
    RandMisAlignmentd(keys=["image", "label"], prob=0.5, displacement=16),
    RandCutBlurd(keys=["image"], prob=0.7, length_ratio=0.6),
])

sample = {"image": image, "label": label}
augmented = transforms(sample)

The standard keys are image, label, label_aux, and mask. Spatial transforms that receive multiple keys sample one random transform and apply it consistently to every specified key.
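The "sample once, apply to every key" behaviour can be illustrated with a minimal NumPy sketch. The helper `random_flip_dict` is hypothetical and is not the MONAI or connectomics implementation; it only shows why a single random draw keeps the image and label aligned.

```python
import numpy as np

def random_flip_dict(sample, keys, prob, spatial_axis, rng):
    """Flip every listed key along one axis, using a single random draw."""
    out = dict(sample)
    # One draw decides the outcome for all keys, so image and label stay aligned.
    if rng.random() < prob:
        for key in keys:
            out[key] = np.flip(sample[key], axis=spatial_axis).copy()
    return out

rng = np.random.default_rng(0)
image = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
label = (image > 11).astype(np.uint8)

# prob=1.0 forces the flip so the effect is easy to inspect.
flipped = random_flip_dict(
    {"image": image, "label": label},
    keys=["image", "label"], prob=1.0, spatial_axis=0, rng=rng,
)
```

If each key drew its own random number instead, the image could be flipped while the label was not, which would silently corrupt the training pair.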

Augmentations are configured under data.augmentation:

default:
  data:
    augmentation:
      profile: aug_standard
      misalignment:
        enabled: true
        prob: 0.5
        displacement: 16

Each transform has an enabled flag. To turn off a specific transformation, set:

default:
  data:
    augmentation:
      misalignment:
        enabled: false

Rejection Sampling

Rejection sampling in the dataloader is applied for the following two purposes:

1 - Adding more attention to sparse targets

For some datasets and tasks, the foreground mask is sparse in the volume (e.g., synapse detection). We therefore use rejection sampling to reduce the proportion of (or completely avoid) regions without foreground pixels. This design makes the model pay more attention to foreground pixels, which alleviates false negatives but may introduce more false positives. Configure rejection sampling under data.dataloader:

default:
  data:
    dataloader:
      reject_sampling:
        size_thres: 1000
        p: 0.95

The size_thres: 1000 setting means that a randomly sampled volume containing more than 1,000 non-background voxels is treated as a foreground volume and returned by the rejection sampling function. If it contains fewer than 1,000 such voxels, the function rejects it with probability p (0.95 here) and samples another volume. size_thres defaults to -1, which disables rejection sampling.
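The logic described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the library's sampler: the function name `sample_with_rejection`, the `max_tries` cap, and the fallback to the last crop are all assumptions added for the example.

```python
import numpy as np

def sample_with_rejection(label_volume, patch_size, size_thres, p, rng, max_tries=100):
    """Draw random crops and reject sparse ones with probability p."""
    crop = None
    for _ in range(max_tries):
        # Pick a random corner so the crop fits inside the volume.
        start = [rng.integers(0, dim - ps + 1)
                 for dim, ps in zip(label_volume.shape, patch_size)]
        window = tuple(slice(s, s + ps) for s, ps in zip(start, patch_size))
        crop = label_volume[window]
        if size_thres < 0:
            return crop  # rejection sampling disabled
        if np.count_nonzero(crop) > size_thres:
            return crop  # enough foreground: always accepted
        if rng.random() >= p:
            return crop  # sparse crop kept with probability 1 - p
    return crop  # assumption: fall back to the last crop after max_tries

rng = np.random.default_rng(0)
label = np.zeros((32, 32, 32), dtype=np.uint8)
label[8:24, 8:24, 8:24] = 1  # one 4,096-voxel foreground block

# p=1.0 rejects every sparse crop, so only foreground crops are returned.
crop = sample_with_rejection(label, (16, 16, 16), size_thres=1000, p=1.0, rng=rng)
# size_thres=-1 switches rejection off and returns the first crop drawn.
any_crop = sample_with_rejection(label, (16, 16, 16), size_thres=-1, p=0.95, rng=rng)
```

With p below 1, a fraction of background-only crops still reaches the model, which is one way to limit the extra false positives mentioned above.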

2 - Handling partially annotated data

Some datasets are only partially labeled, and the unlabeled regions should be excluded from the loss calculation. In that case, specify the path to the valid mask using data.train.mask and data.val.mask. The valid mask volume should have the same shape as the label volume, with non-zero values denoting annotated regions. By default, a sampled volume whose valid ratio is below 0.5 is rejected.
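A minimal sketch of both ideas, assuming the valid ratio is the fraction of non-zero voxels in the mask crop. The helpers `valid_ratio`, `accept_crop`, and `masked_mean` are hypothetical names for illustration and do not correspond to library functions.

```python
import numpy as np

def valid_ratio(mask_crop):
    """Fraction of voxels in the crop that are marked as annotated."""
    return np.count_nonzero(mask_crop) / mask_crop.size

def accept_crop(mask_crop, min_valid_ratio=0.5):
    """Keep a crop only if enough of it is annotated."""
    return valid_ratio(mask_crop) >= min_valid_ratio

def masked_mean(loss_map, mask):
    """Average a per-voxel loss over annotated voxels only."""
    valid = mask > 0
    return float(loss_map[valid].mean()) if valid.any() else 0.0

half_labeled = np.zeros((4, 4, 4), dtype=np.uint8)
half_labeled[:2] = 1          # valid ratio 0.5: accepted
quarter_labeled = np.zeros((4, 4, 4), dtype=np.uint8)
quarter_labeled[:1] = 1       # valid ratio 0.25: rejected

loss_map = np.ones((4, 4, 4), dtype=np.float32)
loss_map[2:] = 100.0          # large loss values in the unlabeled half
loss = masked_mean(loss_map, half_labeled)  # unlabeled voxels do not contribute
```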

Filename and Lazy Datasets

The old TileDataset path has been removed. Large datasets now use one of the current dataset implementations exported from connectomics.data.datasets:

connectomics.data.datasets.CachedVolumeDataset for volumes that fit in RAM.
connectomics.data.datasets.LazyH5VolumeDataset and connectomics.data.datasets.LazyZarrVolumeDataset for crop-on-read HDF5/Zarr training without preloading the full volume.
connectomics.data.datasets.MonaiFilenameDataset for pre-tiled PNG/TIFF-style file lists.

For filename-based datasets, prepare a JSON file with image and label paths:

import json
from pathlib import Path

root = Path("path/to/dataset")
n_images = 2000
data_dict = {
    "base_path": str(root),
    "images": [f"images/im{idx:04d}.png" for idx in range(n_images)],
    "masks": [f"labels/seg{idx:04d}.png" for idx in range(n_images)],
}

js_path = "filename_dataset.json"
with open(js_path, 'w') as fp:
    json.dump(data_dict, fp)
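Before training, it can help to check that the JSON file is internally consistent. The sketch below assumes that the listed paths are relative to base_path and that images and masks are paired by position; `check_filename_json` is a hypothetical helper, not part of the library.

```python
import json
import tempfile
from pathlib import Path

def check_filename_json(js_path, image_key="images", label_key="masks"):
    """Return a list of problems found in a filename-dataset JSON file."""
    with open(js_path) as fp:
        data = json.load(fp)
    problems = []
    images = data.get(image_key, [])
    labels = data.get(label_key, [])
    if len(images) != len(labels):
        problems.append(f"{len(images)} images but {len(labels)} labels")
    # Assumption: every entry is a path relative to base_path.
    base = Path(data.get("base_path", "."))
    for rel in list(images) + list(labels):
        if not (base / rel).is_file():
            problems.append(f"missing file: {rel}")
    return problems

# Demo on a throwaway directory with one deliberately missing image.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "labels").mkdir()
    (root / "images" / "im0000.png").touch()
    (root / "labels" / "seg0000.png").touch()
    spec = {
        "base_path": str(root),
        "images": ["images/im0000.png", "images/im0001.png"],
        "masks": ["labels/seg0000.png"],
    }
    demo_json = root / "filename_dataset.json"
    demo_json.write_text(json.dumps(spec))
    problems = check_filename_json(demo_json)
```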

Then select the filename dataset in the Hydra config:

default:
  data:
    train:
      dataset_type: filename
      json: filename_dataset.json
      image_key: images
      label_key: masks
      split_ratio: 0.9

For large HDF5 or Zarr volumes, prefer lazy crop-on-read instead of file tiling:

default:
  data:
    dataloader:
      use_lazy_h5: true
      # or: use_lazy_zarr: true
      patch_size: [128, 128, 128]
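The idea behind crop-on-read can be illustrated without HDF5 or Zarr. The sketch below uses a memory-mapped .npy file as a stand-in, so it is not how LazyH5VolumeDataset or LazyZarrVolumeDataset are implemented; `random_crop_lazy` is a hypothetical helper, and the volume and patch sizes are reduced to keep the demo light.

```python
import os
import tempfile
import numpy as np

def random_crop_lazy(volume, patch_size, rng):
    """Read one random patch from an array-like volume that lives on disk."""
    start = [int(rng.integers(0, dim - ps + 1))
             for dim, ps in zip(volume.shape, patch_size)]
    window = tuple(slice(s, s + ps) for s, ps in zip(start, patch_size))
    # Only the selected window is copied into RAM.
    return np.array(volume[window])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "volume.npy")
    data = np.random.default_rng(0).random((64, 64, 64), dtype=np.float32)
    np.save(path, data)
    # mmap_mode="r" opens the file without loading the voxels.
    volume = np.load(path, mmap_mode="r")
    patch = random_crop_lazy(volume, (32, 32, 32), np.random.default_rng(1))
    del volume  # release the memory map before the directory is removed
```

Each training step then costs one patch-sized read instead of holding the full volume in memory, which is the trade-off that makes lazy datasets preferable for volumes that do not fit in RAM.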

The Lightning data factory chooses the concrete dataset from these config fields:

from connectomics.config import load_config
from connectomics.training.lightning import create_datamodule

cfg = load_config("tutorials/minimal.yaml")
datamodule = create_datamodule(cfg)

Handling 2D Data

There are two ways to run inference with a trained 2D model. The first is to load a 3D volume directly; the inference pipeline then predicts each slice one by one and stacks the results back into a 3D volume. For representations that depend on the dimensionality of the input (e.g., an affinity map has three channels for 3D masks but only two channels for 2D masks), the number of output channels is consistent with the 2D model. The second is to load 2D PNG or TIFF images directly. Below is the configuration for streaming 2D inputs at inference time:

test:
  data:
    test:
      dataset_type: filename
      json: datasets/test_files.json
    dataloader:
      patch_size: [1, 256, 256]
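The first approach described above, slice-by-slice prediction on a 3D volume, can be sketched as follows. Both functions are hypothetical: `toy_affinity_2d` stands in for a trained 2D model and simply derives two affinity channels (y and x) from a label slice, and the (C, Z, Y, X) output layout is an assumption made for the example.

```python
import numpy as np

def toy_affinity_2d(seg):
    """Stand-in for a 2D model: returns two channels (y and x affinities)."""
    aff = np.zeros((2,) + seg.shape, dtype=np.float32)
    # A pixel is connected to its neighbour if both carry the same non-zero label.
    aff[0, 1:, :] = (seg[1:, :] == seg[:-1, :]) & (seg[1:, :] > 0)
    aff[1, :, 1:] = (seg[:, 1:] == seg[:, :-1]) & (seg[:, 1:] > 0)
    return aff

def predict_volume_slicewise(volume, predict_2d):
    """Apply a 2D predictor to every z-slice and stack into (C, Z, Y, X)."""
    slices = [predict_2d(volume[z]) for z in range(volume.shape[0])]
    return np.stack(slices, axis=1)

volume = np.zeros((4, 8, 8), dtype=np.int32)
volume[:, 2:6, 2:6] = 1  # one object spanning all four slices

pred = predict_volume_slicewise(volume, toy_affinity_2d)
```

The output keeps the two channels of the 2D model even though the input is a 3D volume, matching the channel behaviour described above.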

The filename JSON should list every input image:

{
  "base_path": "/data/test",
  "images": [
    "slice_0001.png",
    "slice_0002.png",
    "slice_0003.png",
    "slice_0004.png"
  ]
}

A useful Linux command to list the PNG images in a folder is:

ls -d "$(pwd -P)"/*.png > path.txt
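Because the dataset expects JSON rather than a plain text list, the same listing can also be produced in Python and written directly in the format shown above. This is a small sketch: `build_test_json` is a hypothetical helper, and sorting by filename is an assumption about the desired slice order.

```python
import json
import tempfile
from pathlib import Path

def build_test_json(folder, js_path):
    """List the PNG files in a folder and write them as a filename JSON."""
    folder = Path(folder).resolve()
    # Assumption: lexicographic filename order matches the slice order.
    images = sorted(p.name for p in folder.glob("*.png"))
    spec = {"base_path": str(folder), "images": images}
    with open(js_path, "w") as fp:
        json.dump(spec, fp, indent=2)
    return spec

# Demo on a throwaway folder containing two slices and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["slice_0002.png", "slice_0001.png", "notes.txt"]:
        (root / name).touch()
    spec = build_test_json(root, root / "test_files.json")
    written = json.loads((root / "test_files.json").read_text())
```

Zero-padded filenames such as slice_0001.png keep lexicographic order and numeric order identical, so the slices are listed in the intended sequence.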