Data Preparation

Data preparation turns genomic regions and labels into a tokenized Hugging Face DatasetDict with train, validation, and test splits.

CLI

Grouped CLI
bertnado prepare-data \
  --file-path data/regions.parquet \
  --target-column bound \
  --fasta-file data/genome.fa \
  --tokenizer-name PoetschLab/GROVER \
  --output-dir output/dataset \
  --task-type binary_classification \
  --threshold 0.5
Standalone command
bertnado-data \
  --file-path data/regions.parquet \
  --target-column bound \
  --fasta-file data/genome.fa \
  --tokenizer-name PoetschLab/GROVER \
  --output-dir output/dataset \
  --task-type binary_classification \
  --threshold 0.5

Python API

Prepare data from Python
from pathlib import Path

from bertnado.api import prepare_dataset

prepare_dataset(
    file_path=Path("data/regions.parquet"),
    target_column="bound",
    fasta_file=Path("data/genome.fa"),
    output_dir=Path("output/dataset"),
    task_type="binary_classification",
    tokenizer_name="PoetschLab/GROVER",
    threshold=0.5,
)

Input Format

--file-path should point to a Parquet file whose index contains genomic regions in this format:

Parquet index
chr1:100000-101024
chr1:101024-102048
chr2:250000-251024

The Parquet file must also contain the target column passed with --target-column.
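
As a rough illustration of this layout, the sketch below builds region strings for consecutive windows and writes them as the index of a Parquet file. The helper names tile_regions and write_regions, the 1024 bp window width, and the label values are hypothetical choices for the example, not part of BertNado; write_regions assumes pandas and a Parquet engine such as pyarrow are installed.

```python
def tile_regions(chrom: str, start: int, end: int, width: int = 1024) -> list[str]:
    """Return 'chrom:start-end' strings for consecutive windows of `width` bp."""
    return [
        f"{chrom}:{s}-{s + width}"
        for s in range(start, end - width + 1, width)
    ]


def write_regions(
    regions: list[str],
    labels: list[float],
    path: str,
    target_column: str = "bound",  # hypothetical column name, matching the examples above
) -> None:
    """Write a Parquet file whose index holds the region strings."""
    import pandas as pd  # imported lazily so tile_regions stays dependency-free

    frame = pd.DataFrame({target_column: labels}, index=regions)
    frame.to_parquet(path)  # the string index is stored alongside the column
```

Calling tile_regions("chr1", 100000, 102048) yields the first two index values shown above; passing that list with two label values to write_regions would produce a file of the expected shape.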

BertNado parses each index value into:

| Field | Source |
| --- | --- |
| chromosome | Text before : |
| start | Number between : and - |
| end | Number after - |
| labels | The target column |
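
The parsing step can be sketched as follows. This is a minimal illustration of the mapping in the table, not BertNado's actual parser; the names Region and parse_region are hypothetical, and rejecting intervals where end is not greater than start is an assumption of this example.

```python
import re
from typing import NamedTuple

# chromosome: text before ":", start: digits before "-", end: digits after "-"
REGION_RE = re.compile(r"^(?P<chromosome>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


def parse_region(value: str) -> Region:
    """Split a 'chr:start-end' index value into its three fields."""
    match = REGION_RE.match(value)
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {value!r}")
    start, end = int(match["start"]), int(match["end"])
    if end <= start:
        # Assumption of this sketch: empty or reversed intervals are rejected.
        raise ValueError(f"end must be greater than start: {value!r}")
    return Region(match["chromosome"], start, end)
```

For example, parse_region("chr1:100000-101024") returns a Region with chromosome "chr1", start 100000, and end 101024.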

The FASTA file is used to fetch the DNA sequence for each interval.

Task Types

| Task type | Label behavior |
| --- | --- |
| regression | Uses the target values as continuous labels. |
| binary_classification | Converts target values to 0 or 1 using --threshold. |
| multilabel_classification | Converts comma-separated target values to integer label lists. |
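
The three behaviors in the table can be sketched in a few lines. The function to_label is a hypothetical helper written for this page, and the handling of values exactly equal to the threshold is an assumption: this sketch treats only values strictly above the threshold as positive, which may differ from what BertNado does.

```python
def to_label(value, task_type: str, threshold: float = 0.5):
    """Convert one raw target value into a label for the given task type."""
    if task_type == "regression":
        # Continuous label, used as-is.
        return float(value)
    if task_type == "binary_classification":
        # Assumption: strictly greater than the threshold maps to 1.
        return int(float(value) > threshold)
    if task_type == "multilabel_classification":
        # "0,2,5" -> [0, 2, 5]; empty fragments are skipped.
        return [int(part) for part in str(value).split(",") if part.strip()]
    raise ValueError(f"unknown task type: {task_type!r}")
```

With the default threshold, a target of 0.7 becomes 1 and a target of 0.2 becomes 0, while a multilabel target of "0,2,5" becomes the list [0, 2, 5].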

For binary classification:

Binary classification
bertnado-data \
  --file-path data/regions.parquet \
  --target-column bound \
  --fasta-file data/genome.fa \
  --output-dir output/dataset \
  --task-type binary_classification \
  --threshold 0.5

Chromosome Splits

BertNado uses fixed chromosome-aware splits:

| Split | Chromosomes |
| --- | --- |
| Train | All chromosomes except chr8 and chr9 |
| Validation | chr8 |
| Test | chr9 |

Make sure your dataset has enough examples on chr8 and chr9. An empty validation or test split leaves nothing to evaluate on and will cause problems later during training or evaluation.
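
One way to check this before running data preparation is to count how many regions would land in each split. The sketch below mirrors the split rule above; assign_split and split_counts are hypothetical helpers, not BertNado functions.

```python
from collections import Counter


def assign_split(chromosome: str) -> str:
    """Map a chromosome name to its split under the fixed rule above."""
    if chromosome == "chr8":
        return "validation"
    if chromosome == "chr9":
        return "test"
    return "train"


def split_counts(regions) -> dict:
    """Count regions per split from 'chr:start-end' strings."""
    counts = Counter({"train": 0, "validation": 0, "test": 0})
    for region in regions:
        chromosome = str(region).split(":", 1)[0]
        counts[assign_split(chromosome)] += 1
    return dict(counts)
```

If pandas is available, passing the index of the input file, for example split_counts(pandas.read_parquet("data/regions.parquet").index), reports a zero for any split that would come out empty.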

Outputs

The prepared dataset is saved to disk:

Dataset output
output/dataset/
|-- train/
|-- validation/
`-- test/

Each split contains fetched sequences, labels, and tokenizer outputs such as input_ids and attention_mask.
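
To inspect the result from Python, a sketch along these lines may help. It assumes the directory was written in the standard Hugging Face on-disk format, so that datasets.load_from_disk can read it; missing_splits and load_prepared are hypothetical helper names.

```python
from pathlib import Path

EXPECTED_SPLITS = ("train", "validation", "test")


def missing_splits(output_dir) -> list[str]:
    """Return the names of expected split directories that are absent."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SPLITS if not (root / name).is_dir()]


def load_prepared(output_dir):
    """Load the prepared dataset, failing early if a split directory is missing."""
    from datasets import load_from_disk  # Hugging Face `datasets`, imported lazily

    missing = missing_splits(output_dir)
    if missing:
        raise FileNotFoundError(f"missing split directories: {missing}")
    # Assumption: the output was saved in the standard DatasetDict layout.
    return load_from_disk(str(output_dir))
```

After load_prepared("output/dataset"), indexing the returned object with "train" and then 0 should show the fields of a single example, including input_ids and attention_mask.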

BertNado also writes label distribution plots:

Label plots
output/dataset/
|-- label_distribution_train.png
|-- label_distribution_val.png
`-- label_distribution_test.png

Binary classification also writes:

Binary classification extras
output/dataset/
|-- class_distribution.png
`-- class_weights.json

class_weights.json is used automatically during binary classification training when no explicit positive-class weight is provided.
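
For intuition about what a positive-class weight represents, the sketch below computes one common choice: the ratio of negative to positive examples. This is an assumption for illustration only. The formula BertNado uses and the layout of class_weights.json are not described on this page, and positive_class_weight is a hypothetical helper.

```python
def positive_class_weight(labels) -> float:
    """Ratio of negatives to positives, a common positive-class weight."""
    labels = list(labels)
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples; the weight is undefined")
    return negatives / positives
```

With three negative examples and one positive example, this gives a weight of 3.0, so each positive contributes three times as much to the loss as each negative.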