Skip to content

BertNado

BertNado logo

BertNado is a modular framework for fine-tuning Hugging Face DNA language models such as GROVER, NT2, and DNABERT variants on genomic prediction tasks. It supports full fine-tuning and parameter-efficient transfer learning strategies such as LoRA.

Features

  • Model support for GROVER, NT2, DNABERT, and other Hugging Face-compatible DNA language models.
  • Task flexibility for regression, binary classification, multi-label classification, and masked DNA modeling.
  • Chromosome-aware train, validation, and test splits to reduce data leakage.
  • Efficient fine-tuning with LoRA and other parameter-efficient transfer learning approaches.
  • Hyperparameter optimization through Weights & Biases sweeps.
  • Evaluation outputs for ROC curves, precision-recall curves, and confusion matrices in binary classification workflows.
  • Model interpretation with SHAP and Captum Layer Integrated Gradients.

Installation

Install BertNado
git clone https://github.com/CChahrour/BertNado.git
cd BertNado
pip install -e .

For the documentation tooling:

Install documentation dependencies
pip install -e ".[dev]"

Quickstart

The end-to-end workflow prepares genomic regions, runs a sweep, trains the best configuration, evaluates predictions, and extracts feature attributions.

BertNado uses Weights & Biases for sweeps and training logs. Run wandb login locally, or set WANDB_API_KEY on servers and CI before starting the sweep. The sweep metric is also used to choose the best checkpoint during training.

1. Prepare the dataset
bertnado-data \
  --file-path test/data/mock_data.parquet \
  --target-column bound \
  --fasta-file test/data/mock_genome.fasta \
  --tokenizer-name PoetschLab/GROVER \
  --output-dir output/dataset \
  --task-type binary_classification \
  --threshold 0.5
2. Run a sweep
bertnado-sweep \
  --config-path test/data/mock_sweep_config.json \
  --output-dir output/sweep \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --sweep-count 2 \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize \
  --task-type binary_classification

--config-path is the W&B sweep recipe. The mock path is only an example; use your own JSON config for real experiments.

3. Train the best model
bertnado-train \
  --output-dir output/train \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --best-config-path output/sweep/best_sweep_config.json \
  --task-type binary_classification \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize
4. Predict and evaluate
bertnado-predict \
  --tokenizer-name PoetschLab/GROVER \
  --model-dir output/train/model \
  --dataset-dir output/dataset \
  --output-dir output/predictions \
  --task-type binary_classification
5. Interpret the model
bertnado-feature \
  --tokenizer-name PoetschLab/GROVER \
  --model-dir output/train/model \
  --dataset-dir output/dataset \
  --output-dir output/feature_analysis \
  --task-type binary_classification \
  --method both \
  --target-class 1
Run the same workflow from Python
from bertnado.api import (
    extract_features,
    predict_and_evaluate,
    prepare_dataset,
    run_sweep,
    train_model,
)

prepare_dataset(
    file_path="test/data/mock_data.parquet",
    target_column="bound",
    fasta_file="test/data/mock_genome.fasta",
    output_dir="output/dataset",
    task_type="binary_classification",
    threshold=0.5,
)

sweep = run_sweep(
    config_path="test/data/mock_sweep_config.json",  # W&B sweep recipe JSON
    output_dir="output/sweep",
    dataset="output/dataset",
    project_name="project",
    task_type="binary_classification",
    sweep_count=2,
    metric_name="eval/roc_auc",
    metric_goal="maximize",
)

train_model(
    output_dir="output/train",
    dataset="output/dataset",
    best_config_path=sweep["best_config_path"],
    project_name="project",
    task_type="binary_classification",
    metric_name=sweep["metric_name"],
    metric_goal=sweep["metric_goal"],
)

predict_and_evaluate(
    model_dir="output/train/model",
    dataset_dir="output/dataset",
    output_dir="output/predictions",
    task_type="binary_classification",
)

extract_features(
    model_dir="output/train/model",
    dataset_dir="output/dataset",
    output_dir="output/feature_analysis",
    task_type="binary_classification",
    method="both",
    target_class=1,
)

CLI

Use the command-line interface when you want a reproducible shell workflow for data preparation, sweeps, full training, prediction, and feature analysis.

Open the CLI guide

Workflow

Use the workflow guide when you want a deeper explanation of each stage, including expected data format, sweep configuration, training outputs, prediction artifacts, and SHAP/LIG attribution files.

Open the workflow guide

API

Use the Python API when you want to orchestrate BertNado workflows from notebooks, scripts, pipelines, or tests.

Open the API guide

Outputs

  • Figures are saved to output/figures/.
  • SHAP scores are saved to output/shap/.
  • LIG attributions are saved to output/lig/.
  • Trained models are saved to output/models/ or the configured training output directory.

Project Structure

Package layout
bertnado/
|-- api.py                      # Programmatic API
|-- cli.py                      # Command-line interface
|-- data/
|   `-- prepare_dataset.py      # Dataset creation and tokenization
|-- evaluation/
|   |-- predict.py              # Predict from trained models
|   `-- feature_extraction.py   # SHAP / LIG-based interpretation
`-- training/
    |-- finetune.py             # Fine-tuning using best config
    |-- full_train.py           # Full training loop
    |-- model.py                # PEFT/LoRA model architecture
    |-- sweep.py                # W&B sweep setup
    |-- trainers.py             # Trainer wrappers
    `-- metrics.py              # Metric computation