# BertNado
BertNado is a modular framework for fine-tuning Hugging Face DNA language models such as GROVER, NT2, and DNABERT variants on genomic prediction tasks. It supports full fine-tuning and parameter-efficient transfer learning strategies such as LoRA.
## Features
- Model support for GROVER, NT2, DNABERT, and other Hugging Face-compatible DNA language models.
- Task flexibility for regression, binary classification, multi-label classification, and masked DNA modeling.
- Chromosome-aware train, validation, and test splits to reduce data leakage.
- Efficient fine-tuning with LoRA and other parameter-efficient transfer learning approaches.
- Hyperparameter optimization through Weights & Biases sweeps.
- Evaluation outputs for ROC curves, precision-recall curves, and confusion matrices in binary classification workflows.
- Model interpretation with SHAP and Captum Layer Integrated Gradients.
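Chromosome-aware splitting assigns whole chromosomes to a single split so that overlapping or homologous regions cannot leak between train and test. BertNado's actual implementation is not shown here; this is a minimal sketch of the idea, with illustrative record shapes and chromosome choices.

```python
# Minimal sketch of a chromosome-aware split: regions from the same
# chromosome never appear in more than one split, which prevents
# near-duplicate sequences from leaking between train and test.
# Record layout and chromosome assignments are illustrative only.

def chromosome_split(regions, val_chroms, test_chroms):
    """Partition (chrom, start, end, label) records by chromosome."""
    splits = {"train": [], "validation": [], "test": []}
    for region in regions:
        chrom = region[0]
        if chrom in test_chroms:
            splits["test"].append(region)
        elif chrom in val_chroms:
            splits["validation"].append(region)
        else:
            splits["train"].append(region)
    return splits

regions = [
    ("chr1", 100, 612, 1),
    ("chr1", 900, 1412, 0),
    ("chr2", 50, 562, 1),
    ("chr8", 300, 812, 0),
    ("chrX", 10, 522, 1),
]
splits = chromosome_split(regions, val_chroms={"chr8"}, test_chroms={"chrX"})
```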
## Installation
For the documentation tooling:
## Quickstart
The end-to-end workflow prepares genomic regions, runs a sweep, trains the best configuration, evaluates predictions, and extracts feature attributions.
BertNado uses Weights & Biases for sweeps and training logs. Run `wandb login` locally, or set the `WANDB_API_KEY` environment variable on servers and CI before starting the sweep. The sweep metric is also used to choose the best checkpoint during training.
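For non-interactive environments, the authentication step looks like this (the key value is a placeholder, not a real credential):

```shell
# On a workstation, authenticate interactively once:
#   wandb login
# On servers and CI, export the API key instead; the value below is a
# placeholder for your own key.
export WANDB_API_KEY="your-api-key-here"
echo "W&B key configured: ${WANDB_API_KEY:+yes}"   # prints: W&B key configured: yes
```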
```bash
bertnado-data \
  --file-path test/data/mock_data.parquet \
  --target-column bound \
  --fasta-file test/data/mock_genome.fasta \
  --tokenizer-name PoetschLab/GROVER \
  --output-dir output/dataset \
  --task-type binary_classification \
  --threshold 0.5
```
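For binary classification, `--threshold` determines where a continuous target column is cut into the two classes. The sketch below illustrates that semantics only; BertNado's internal rule (e.g. strict versus inclusive comparison) may differ.

```python
# Illustration of what a binarization threshold does for a
# binary-classification target: scores at or above the threshold
# become the positive class. This mirrors the intent of
# --threshold 0.5, not BertNado's exact internal implementation.

def binarize(scores, threshold=0.5):
    return [1 if s >= threshold else 0 for s in scores]

labels = binarize([0.12, 0.5, 0.87, 0.49], threshold=0.5)
```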
```bash
bertnado-sweep \
  --config-path test/data/mock_sweep_config.json \
  --output-dir output/sweep \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --sweep-count 2 \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize \
  --task-type binary_classification
```
`--config-path` is the W&B sweep recipe. The mock path is only an example; use your own JSON config for real experiments.
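A W&B sweep recipe follows the standard Weights & Biases sweep configuration schema (`method`, `metric`, `parameters`). The hyperparameter names below follow common Hugging Face training conventions but are illustrative; check the mock config shipped with the repository for the keys BertNado actually expects.

```json
{
  "method": "bayes",
  "metric": {"name": "eval/roc_auc", "goal": "maximize"},
  "parameters": {
    "learning_rate": {"min": 1e-5, "max": 5e-4},
    "per_device_train_batch_size": {"values": [8, 16, 32]},
    "num_train_epochs": {"values": [2, 3, 5]}
  }
}
```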
```bash
bertnado-train \
  --output-dir output/train \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --best-config-path output/sweep/best_sweep_config.json \
  --task-type binary_classification \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize
```
```python
from bertnado.api import (
    extract_features,
    predict_and_evaluate,
    prepare_dataset,
    run_sweep,
    train_model,
)

# 1. Tokenize genomic regions and build chromosome-aware splits.
prepare_dataset(
    file_path="test/data/mock_data.parquet",
    target_column="bound",
    fasta_file="test/data/mock_genome.fasta",
    output_dir="output/dataset",
    task_type="binary_classification",
    threshold=0.5,
)

# 2. Search hyperparameters with a W&B sweep.
sweep = run_sweep(
    config_path="test/data/mock_sweep_config.json",  # W&B sweep recipe JSON
    output_dir="output/sweep",
    dataset="output/dataset",
    project_name="project",
    task_type="binary_classification",
    sweep_count=2,
    metric_name="eval/roc_auc",
    metric_goal="maximize",
)

# 3. Train the best configuration found by the sweep.
train_model(
    output_dir="output/train",
    dataset="output/dataset",
    best_config_path=sweep["best_config_path"],
    project_name="project",
    task_type="binary_classification",
    metric_name=sweep["metric_name"],
    metric_goal=sweep["metric_goal"],
)

# 4. Evaluate predictions (ROC, precision-recall, confusion matrix).
predict_and_evaluate(
    model_dir="output/train/model",
    dataset_dir="output/dataset",
    output_dir="output/predictions",
    task_type="binary_classification",
)

# 5. Extract feature attributions with SHAP and LIG.
extract_features(
    model_dir="output/train/model",
    dataset_dir="output/dataset",
    output_dir="output/feature_analysis",
    task_type="binary_classification",
    method="both",  # run both SHAP and Layer Integrated Gradients
    target_class=1,
)
```
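The metric optimized throughout the quickstart, `eval/roc_auc`, is the area under the ROC curve. Equivalently, it is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count half). The pure-Python sketch below illustrates the metric itself; it is not BertNado's implementation.

```python
# ROC AUC via pairwise ranking: fraction of (positive, negative) pairs
# where the positive example receives the higher score, ties worth 0.5.

def roc_auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```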
## CLI
Use the command-line interface when you want a reproducible shell workflow for data preparation, sweeps, full training, prediction, and feature analysis.
## Workflow
Use the workflow guide when you want a deeper explanation of each stage, including expected data format, sweep configuration, training outputs, prediction artifacts, and SHAP/LIG attribution files.
## API
Use the Python API when you want to orchestrate BertNado workflows from notebooks, scripts, pipelines, or tests.
## Outputs

- Figures are saved to `output/figures/`.
- SHAP scores are saved to `output/shap/`.
- LIG attributions are saved to `output/lig/`.
- Trained models are saved to `output/models/` or the configured training output directory.
## Project Structure

```text
bertnado/
|-- api.py                     # Programmatic API
|-- cli.py                     # Command-line interface
|-- data/
|   `-- prepare_dataset.py     # Dataset creation and tokenization
|-- evaluation/
|   |-- predict.py             # Predict from trained models
|   `-- feature_extraction.py  # SHAP / LIG-based interpretation
`-- training/
    |-- finetune.py            # Fine-tuning using best config
    |-- full_train.py          # Full training loop
    |-- model.py               # PEFT/LoRA model architecture
    |-- sweep.py               # W&B sweep setup
    |-- trainers.py            # Trainer wrappers
    `-- metrics.py             # Metric computation
```