Data Preparation¶
Data preparation turns genomic regions and labels into a tokenized Hugging Face
DatasetDict with train, validation, and test splits.
CLI¶
bertnado prepare-data \
--file-path data/regions.parquet \
--target-column bound \
--fasta-file data/genome.fa \
--tokenizer-name PoetschLab/GROVER \
--output-dir output/dataset \
--task-type binary_classification \
--threshold 0.5
The same step is also available as the bertnado-data command, which takes the same options:
bertnado-data \
--file-path data/regions.parquet \
--target-column bound \
--fasta-file data/genome.fa \
--tokenizer-name PoetschLab/GROVER \
--output-dir output/dataset \
--task-type binary_classification \
--threshold 0.5
Python API¶
from pathlib import Path

from bertnado.api import prepare_dataset

prepare_dataset(
    file_path=Path("data/regions.parquet"),
    target_column="bound",
    fasta_file=Path("data/genome.fa"),
    output_dir=Path("output/dataset"),
    task_type="binary_classification",
    tokenizer_name="PoetschLab/GROVER",
    threshold=0.5,
)
Input Format¶
--file-path should point to a Parquet file whose index contains genomic
regions in the format chromosome:start-end, for example chr1:1000-2000.
The Parquet file must also contain the target column passed with
--target-column.
BertNado parses each index value into:
| Field | Source |
|---|---|
| chromosome | Text before : |
| start | Number before - |
| end | Number after - |
| labels | The target column |
The FASTA file is used to fetch the DNA sequence for each interval.
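The index parsing described above can be sketched in a few lines of Python. This is an illustrative sketch only, not BertNado's actual implementation: the `Region` type and the `parse_region` helper are hypothetical names, and the strict `chromosome:start-end` pattern is an assumption based on the field table.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical container for one parsed index value."""
    chromosome: str
    start: int
    end: int


# Assumed pattern: text before ':', then two integers separated by '-'.
_REGION_RE = re.compile(r"^(?P<chromosome>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(value: str) -> Region:
    """Split a 'chromosome:start-end' string into its three fields."""
    match = _REGION_RE.match(value.strip())
    if match is None:
        raise ValueError(f"Expected 'chromosome:start-end', got {value!r}")
    return Region(match["chromosome"], int(match["start"]), int(match["end"]))


if __name__ == "__main__":
    print(parse_region("chr8:1000-2000"))
```

Running a check like this over the Parquet index before preparing data is a cheap way to catch malformed region strings early.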
Task Types¶
| Task type | Label behavior |
|---|---|
| regression | Uses the target values as continuous labels. |
| binary_classification | Converts target values to 0 or 1 using --threshold. |
| multilabel_classification | Converts comma-separated target values to integer label lists. |
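The label conversions in the table can be illustrated with a small Python sketch. It is not BertNado's code: `convert_label` is a hypothetical helper, and the choice to map values strictly greater than the threshold to 1 is an assumption, since the comparison used at the boundary is not specified here.

```python
from typing import List, Union

Label = Union[float, int, List[int]]


def convert_label(value, task_type: str, threshold: float = 0.5) -> Label:
    """Convert one raw target value according to the task type."""
    if task_type == "regression":
        # Continuous labels are passed through as floats.
        return float(value)
    if task_type == "binary_classification":
        # Assumption: values strictly above the threshold become 1.
        return int(float(value) > threshold)
    if task_type == "multilabel_classification":
        # "1,3,5" -> [1, 3, 5]; empty fragments are skipped.
        return [int(part) for part in str(value).split(",") if part.strip()]
    raise ValueError(f"Unknown task type: {task_type!r}")


if __name__ == "__main__":
    print(convert_label(0.8, "binary_classification"))
    print(convert_label("1,3,5", "multilabel_classification"))
```

If your targets sit exactly on the threshold, check how BertNado treats the boundary before relying on this sketch.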
For binary classification:
bertnado-data \
--file-path data/regions.parquet \
--target-column bound \
--fasta-file data/genome.fa \
--output-dir output/dataset \
--task-type binary_classification \
--threshold 0.5
Chromosome Splits¶
BertNado uses fixed chromosome-aware splits:
| Split | Chromosomes |
|---|---|
| Train | All chromosomes except chr8 and chr9 |
| Validation | chr8 |
| Test | chr9 |
Make sure your dataset has enough examples on chr8 and chr9. If the region
file has no intervals on one of these chromosomes, the validation or test
split will be empty, which will cause problems later during training or
evaluation.
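A quick way to check split sizes before preparing data is to count regions per split using the chromosome rules in the table. The following is a minimal sketch under that table's rules; `assign_split` and `count_splits` are hypothetical helpers, not part of the BertNado API.

```python
from typing import Dict, Iterable

# Chromosome assignments taken from the split table above.
VALIDATION_CHROMOSOMES = {"chr8"}
TEST_CHROMOSOMES = {"chr9"}


def assign_split(chromosome: str) -> str:
    """Return the split name for a chromosome."""
    if chromosome in VALIDATION_CHROMOSOMES:
        return "validation"
    if chromosome in TEST_CHROMOSOMES:
        return "test"
    return "train"


def count_splits(regions: Iterable[str]) -> Dict[str, int]:
    """Count 'chromosome:start-end' strings per split."""
    counts = {"train": 0, "validation": 0, "test": 0}
    for region in regions:
        chromosome = region.split(":", 1)[0]
        counts[assign_split(chromosome)] += 1
    return counts


if __name__ == "__main__":
    index = ["chr1:0-512", "chr8:100-612", "chr8:900-1412", "chrX:10-522"]
    counts = count_splits(index)
    empty = [name for name, n in counts.items() if n == 0]
    print(counts, "empty splits:", empty)
```

In this sample index the test split is empty because no region lies on chr9, which is exactly the situation to fix before training.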
Outputs¶
The prepared dataset is saved to disk in the directory given by --output-dir.
Each split contains fetched sequences, labels, and tokenizer outputs such as
input_ids and attention_mask.
BertNado also writes label distribution plots:
output/dataset/
|-- label_distribution_train.png
|-- label_distribution_val.png
`-- label_distribution_test.png
Binary classification also writes class_weights.json, which is used
automatically during binary classification training when no explicit
positive-class weight is provided.