
API

The bertnado.api module provides a lightweight Python interface over the same workflows exposed by the CLI.

For a detailed explanation of the data, sweep, training, prediction, and feature-attribution stages, see the workflow guide.

Full Workflow

This script runs the full BertNado workflow from Python: dataset preparation, hyperparameter sweep, final training, prediction, evaluation, and feature attribution.

The config_path argument in the sweep step points to a Weights & Biases sweep configuration JSON file. The mock path below is just an example; see the CLI sweep config section for a complete template. The sweep metric is also used by Hugging Face Trainer to choose the best checkpoint inside each run.

BertNado logs sweeps and training runs to Weights & Biases. Run wandb login once on local machines, or set WANDB_API_KEY in non-interactive environments before calling run_sweep or train_model.
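In non-interactive environments, the API key can also be set from Python before any W&B call runs. WANDB_API_KEY is the standard W&B environment variable; the placeholder value below is obviously not a real key:

```python
import os

# Set the W&B API key before any sweep or training call runs.
# In CI, read the key from a secret store rather than hard-coding it.
os.environ.setdefault("WANDB_API_KEY", "your-api-key-here")
```

`setdefault` keeps an already-configured key (for example, one injected by the CI runner) intact.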

api_full_workflow.py
from pathlib import Path

from bertnado.api import (
    extract_features,
    predict_and_evaluate,
    prepare_dataset,
    run_sweep,
    train_model,
)

PROJECT_NAME = "bertnado"
MODEL_NAME = "PoetschLab/GROVER"
TASK_TYPE = "binary_classification"

DATA_DIR = Path("test/data")
OUTPUT_DIR = Path("output")

DATASET_DIR = OUTPUT_DIR / "dataset"
SWEEP_DIR = OUTPUT_DIR / "sweep"
TRAIN_DIR = OUTPUT_DIR / "train"
PREDICTIONS_DIR = OUTPUT_DIR / "predictions"
FEATURE_DIR = OUTPUT_DIR / "feature_analysis"


def main() -> None:
    prepare_dataset(
        file_path=DATA_DIR / "mock_data.parquet",
        target_column="bound",
        fasta_file=DATA_DIR / "mock_genome.fasta",
        output_dir=DATASET_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
        threshold=0.5,
    )

    sweep = run_sweep(
        config_path=DATA_DIR / "mock_sweep_config.json",
        output_dir=SWEEP_DIR,
        dataset=DATASET_DIR,
        project_name=PROJECT_NAME,
        task_type=TASK_TYPE,
        model_name=MODEL_NAME,
        sweep_count=10,
        metric_name="eval/roc_auc",
        metric_goal="maximize",
    )

    train_model(
        output_dir=TRAIN_DIR,
        dataset=DATASET_DIR,
        best_config_path=sweep["best_config_path"],
        project_name=PROJECT_NAME,
        task_type=TASK_TYPE,
        model_name=MODEL_NAME,
        metric_name=sweep["metric_name"],
        metric_goal=sweep["metric_goal"],
    )

    predict_and_evaluate(
        model_dir=TRAIN_DIR / "model",
        dataset_dir=DATASET_DIR,
        output_dir=PREDICTIONS_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
    )

    extract_features(
        model_dir=TRAIN_DIR / "model",
        dataset_dir=DATASET_DIR,
        output_dir=FEATURE_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
        method="both",
        target_class=1,
    )


if __name__ == "__main__":
    main()

Step-by-Step Calls

Import workflow functions
from bertnado.api import (
    extract_features,
    predict_and_evaluate,
    prepare_dataset,
    run_sweep,
    train_model,
)
Prepare a binary classification dataset
prepare_dataset(
    file_path="test/data/mock_data.parquet",
    target_column="bound",
    fasta_file="test/data/mock_genome.fasta",
    output_dir="output/dataset",
    task_type="binary_classification",
    tokenizer_name="PoetschLab/GROVER",
    threshold=0.5,
)
Run a W&B sweep
sweep = run_sweep(
    config_path="test/data/mock_sweep_config.json",
    output_dir="output/sweep",
    dataset="output/dataset",
    project_name="bertnado",
    task_type="binary_classification",
    model_name="PoetschLab/GROVER",
    sweep_count=10,
    metric_name="eval/roc_auc",
    metric_goal="maximize",
)

config_path is the sweep recipe, not input data. It tells BertNado which metric to optimize and which hyperparameters to sample. metric_name and metric_goal are optional overrides; when omitted, BertNado uses the sweep config metric or the task default.
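A minimal sweep recipe in the standard W&B sweep-config shape might look like the sketch below. The hyperparameter names (learning_rate, per_device_train_batch_size) are illustrative assumptions; see the CLI sweep config section for the schema BertNado actually expects:

```python
import json
from pathlib import Path

# Hypothetical sweep recipe: search method, target metric, and the
# hyperparameter space to sample. Parameter names are examples only.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/roc_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "per_device_train_batch_size": {"values": [8, 16, 32]},
    },
}

Path("mock_sweep_config.json").write_text(json.dumps(sweep_config, indent=2))
```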

Train the final model
train_model(
    output_dir="output/train",
    dataset="output/dataset",
    best_config_path=sweep["best_config_path"],
    project_name="bertnado",
    task_type="binary_classification",
    model_name="PoetschLab/GROVER",
    metric_name=sweep["metric_name"],
    metric_goal=sweep["metric_goal"],
)

If the config was produced by run_sweep, the metric arguments are optional because the resolved optimization metric is already saved in best_sweep_config.json.

Predict and evaluate
predict_and_evaluate(
    model_dir="output/train/model",
    dataset_dir="output/dataset",
    output_dir="output/predictions",
    task_type="binary_classification",
    tokenizer_name="PoetschLab/GROVER",
)
Extract feature attributions
extract_features(
    model_dir="output/train/model",
    dataset_dir="output/dataset",
    output_dir="output/feature_analysis",
    task_type="binary_classification",
    tokenizer_name="PoetschLab/GROVER",
    method="both",
    target_class=1,
)

Convenience Aliases

The API includes aliases for the most common naming styles:

Alias              Target
prepare_data       prepare_dataset
train              train_model
full_train         train_model
predict            predict_and_evaluate
feature_analysis   extract_features
analyze_features   extract_features
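An alias like this is just a second name bound to the same function object, so the two names are interchangeable. A sketch of the pattern with a stand-in function (not bertnado's actual source):

```python
def prepare_dataset(file_path, target_column, **kwargs):
    """Stand-in for the real workflow function."""
    return f"prepared {file_path} with target {target_column}"

# The alias is simply another name for the same function object.
prepare_data = prepare_dataset
```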

Reference

Programmatic API for BertNado workflows.

This module mirrors BertNado's CLI commands with plain Python functions. Use it when you want to run dataset preparation, sweeps, training, evaluation, and feature attribution from notebooks, scripts, or larger Python applications.

The heavy workflow dependencies are imported lazily inside each function so import bertnado.api stays lightweight.
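The lazy-import pattern defers each heavy dependency until the function that needs it is actually called. An illustrative sketch (not bertnado's source), with json standing in for a heavy library such as torch or transformers:

```python
def predict_and_evaluate(model_dir, dataset_dir, **kwargs):
    # The dependency is imported inside the function body, so merely
    # importing this module never pays the import cost.
    import json

    return json.dumps({"model_dir": str(model_dir), "dataset_dir": str(dataset_dir)})
```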

prepare_dataset(file_path: PathLike, target_column: str, fasta_file: PathLike, output_dir: PathLike, task_type: TaskType, tokenizer_name: str = DEFAULT_TOKENIZER_NAME, threshold: float = 0.5) -> Any

Prepare and tokenize a chromosome-aware dataset.

This is the Python equivalent of the bertnado-data CLI command. It reads the input target table, extracts DNA sequences from the FASTA file, creates chromosome-aware train/validation/test splits, tokenizes the sequences, and writes the prepared dataset to output_dir.
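"Chromosome-aware" means whole chromosomes, not individual rows, are assigned to each split, so no region from a held-out chromosome can leak into training. A simplified sketch of the idea (the chromosome-to-split assignment below is made up for illustration):

```python
# Toy regions: (chromosome, start, target)
regions = [
    ("chr1", 100, 1), ("chr1", 900, 0),
    ("chr2", 250, 1), ("chr3", 400, 0),
    ("chr4", 700, 1),
]

# Assign entire chromosomes to splits so the splits never share a chromosome.
split_of = {"chr1": "train", "chr2": "train", "chr3": "validation", "chr4": "test"}

splits = {"train": [], "validation": [], "test": []}
for chrom, start, target in regions:
    splits[split_of[chrom]].append((chrom, start, target))
```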

Parameters:

- file_path (PathLike, required): Path to the input Parquet file containing genomic regions and target values.
- target_column (str, required): Name of the column in file_path to use as the prediction target.
- fasta_file (PathLike, required): Path to the genome FASTA file used to extract sequences.
- output_dir (PathLike, required): Directory where the prepared dataset should be written.
- task_type (TaskType, required): Learning task. Must be "binary_classification", "multilabel_classification", or "regression".
- tokenizer_name (str): Hugging Face tokenizer name or local tokenizer path. Defaults to "PoetschLab/GROVER".
- threshold (float): Decision threshold used when converting targets for binary classification. Defaults to 0.5.

Returns:

- Any: The value returned by :meth:bertnado.data.prepare_dataset.DatasetPreparer.prepare.

Raises:

- ValueError: If task_type is not one of BertNado's supported task types.

run_sweep(config_path: PathLike, output_dir: PathLike, dataset: PathLike, project_name: str, task_type: TaskType, model_name: str = DEFAULT_MODEL_NAME, sweep_count: int = 10, metric_name: str | None = None, metric_goal: MetricGoal | None = None) -> dict[str, Any]

Run a W&B hyperparameter sweep and save the best run config.

This is the Python equivalent of the bertnado-sweep CLI command. It loads a W&B sweep configuration, creates a sweep, runs sweep_count trials, selects the best run using the configured metric, and writes that run's configuration to best_sweep_config.json in output_dir.

Parameters:

- config_path (PathLike, required): Path to the W&B sweep configuration JSON file. The file can include a metric object with name and goal fields; when that object is omitted, BertNado uses a task-specific default metric.
- output_dir (PathLike, required): Directory where best_sweep_config.json should be saved.
- dataset (PathLike, required): Path to a dataset prepared by :func:prepare_dataset.
- project_name (str, required): W&B project name used for sweep creation and run lookup.
- task_type (TaskType, required): Learning task. Must be "binary_classification", "multilabel_classification", or "regression".
- model_name (str): Hugging Face model name or local model path. Defaults to "PoetschLab/GROVER".
- sweep_count (int): Number of W&B agent trials to run. Defaults to 10.
- metric_name (str | None): Optional metric to optimize, such as "eval/roc_auc" or "eval/loss". Overrides the sweep config metric when provided. Defaults to None.
- metric_goal (MetricGoal | None): Optional optimization direction. Must be "maximize" or "minimize". Overrides the sweep config goal when provided. Defaults to None.

Returns:

- dict[str, Any]: Sweep metadata with sweep_id, best_run_id, metric_name, metric_goal, metric_for_best_model, metric_value, best_config, and best_config_path.

Raises:

- ValueError: If task_type is not one of BertNado's supported task types.
- RuntimeError: If the sweep completes without any recorded runs.
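Selecting the best run from a finished sweep reduces to an argmax or argmin over the recorded metric values. A sketch, assuming runs are available as (run_id, metric_value) pairs rather than bertnado's actual internal representation:

```python
def pick_best_run(runs, metric_goal):
    """Return the run id with the best metric value.

    runs: list of (run_id, metric_value) pairs.
    metric_goal: "maximize" or "minimize".
    """
    if not runs:
        raise RuntimeError("sweep completed without any recorded runs")
    best = max if metric_goal == "maximize" else min
    return best(runs, key=lambda r: r[1])[0]
```

For example, with goal "maximize", a run logging eval/roc_auc of 0.90 beats one logging 0.81.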

train_model(output_dir: PathLike, dataset: PathLike, best_config_path: PathLike, project_name: str, task_type: TaskType, model_name: str = DEFAULT_MODEL_NAME, pos_weight: float | list[float] | None = None, metric_name: str | None = None, metric_goal: MetricGoal | None = None, **training_kwargs: Any) -> Any

Train a final model from a saved sweep configuration.

This is the Python equivalent of the bertnado-train CLI command. It loads the best hyperparameter configuration from best_config_path, trains the selected model on the prepared dataset, and writes model artifacts to output_dir.

Parameters:

- output_dir (PathLike, required): Directory where training artifacts and the final model should be saved.
- dataset (PathLike, required): Path to a dataset prepared by :func:prepare_dataset.
- best_config_path (PathLike, required): Path to best_sweep_config.json from :func:run_sweep.
- project_name (str, required): W&B project name used for training run logging.
- task_type (TaskType, required): Learning task. Must be "binary_classification", "multilabel_classification", or "regression".
- model_name (str): Hugging Face model name or local model path. Defaults to "PoetschLab/GROVER".
- pos_weight (float | list[float] | None): Optional positive-class weight for imbalanced classification. Pass a scalar, a list of per-class weights, or a tensor-like object with a to method. Ignored by regression tasks. Defaults to None.
- metric_name (str | None): Optional metric used to choose the best checkpoint, such as "eval/roc_auc" or "eval/loss". When omitted, BertNado uses the metric saved by :func:run_sweep or a task default. Defaults to None.
- metric_goal (MetricGoal | None): Optional optimization direction. Must be "maximize" or "minimize". When omitted, BertNado uses the saved goal or infers one from the metric. Defaults to None.
- training_kwargs (Any): Extra Hugging Face TrainingArguments keyword arguments, such as warmup_ratio, lr_scheduler_type, gradient_accumulation_steps, eval_steps, or save_steps.

Returns:

- Any: The value returned by :meth:bertnado.training.full_train.FullTrainer.train.

Raises:

- ValueError: If task_type is not one of BertNado's supported task types.
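A common heuristic for a scalar pos_weight is the ratio of negative to positive examples, which upweights the rare positive class in the loss. The helper below sketches that heuristic; it is not something bertnado computes for you:

```python
def suggest_pos_weight(labels):
    """Negative-to-positive ratio, a common pos_weight heuristic.

    labels: iterable of 0/1 binary labels.
    """
    labels = list(labels)
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples; pos_weight is undefined")
    return negatives / positives
```

With 90 negatives and 10 positives, this suggests pos_weight=9.0, making each positive example count nine times as much as a negative one.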

predict_and_evaluate(model_dir: PathLike, dataset_dir: PathLike, output_dir: PathLike, task_type: TaskType, tokenizer_name: str = DEFAULT_TOKENIZER_NAME, threshold: float = 0.5) -> Any

Run prediction on the test split and write evaluation outputs.

This is the Python equivalent of the bertnado-predict CLI command. It loads a trained model, evaluates it against the prepared dataset's test split, and writes prediction/evaluation artifacts such as metrics and plots to output_dir.

Parameters:

- model_dir (PathLike, required): Directory containing the trained BertNado model.
- dataset_dir (PathLike, required): Directory containing a dataset prepared by :func:prepare_dataset.
- output_dir (PathLike, required): Directory where predictions, metrics, and figures should be saved.
- task_type (TaskType, required): Learning task. Must be "binary_classification", "multilabel_classification", or "regression".
- tokenizer_name (str): Hugging Face tokenizer name or local tokenizer path. Defaults to "PoetschLab/GROVER".
- threshold (float): Decision threshold used for binary or multilabel classification predictions. Defaults to 0.5.

Returns:

- Any: The value returned by :meth:bertnado.evaluation.predict.Evaluator.evaluate.

Raises:

- ValueError: If task_type is not one of BertNado's supported task types.
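For binary and multilabel tasks, the threshold is applied to predicted probabilities. A sketch of that decision step, assuming sigmoid probabilities and an inclusive comparison (whether bertnado uses >= or > at exactly the threshold is not specified here):

```python
import math

def logits_to_labels(logits, threshold=0.5):
    """Convert raw logits to 0/1 labels via sigmoid then thresholding."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p >= threshold else 0 for p in probs]
```

Raising the threshold trades recall for precision: fewer regions are called positive, but with higher confidence.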

extract_features(model_dir: PathLike, dataset_dir: PathLike, output_dir: PathLike, task_type: TaskType, tokenizer_name: str = DEFAULT_TOKENIZER_NAME, method: FeatureMethod = 'lig', target_class: int = 1, max_examples: int | None = None, n_steps: int = 50) -> Any

Run SHAP, LIG, or both feature-attribution methods.

This is the Python equivalent of the bertnado-feature CLI command. It loads a trained model and prepared dataset, computes attribution scores with SHAP, Layer Integrated Gradients (LIG), or both, and writes the analysis outputs to output_dir.

Parameters:

- model_dir (PathLike, required): Directory containing the trained BertNado model.
- dataset_dir (PathLike, required): Directory containing a dataset prepared by :func:prepare_dataset.
- output_dir (PathLike, required): Directory where feature-attribution outputs should be saved.
- task_type (TaskType, required): Learning task. Must be "binary_classification", "multilabel_classification", or "regression".
- tokenizer_name (str): Hugging Face tokenizer name or local tokenizer path. Defaults to "PoetschLab/GROVER".
- method (FeatureMethod): Attribution method to run. Must be "shap", "lig", or "both". Defaults to "lig".
- target_class (int): Class index to explain for classification tasks. Defaults to 1.
- max_examples (int | None): Optional maximum number of examples to process. Use None to fall back to the implementation default. Defaults to None.
- n_steps (int): Number of integration steps for LIG. Defaults to 50.

Returns:

- Any: The value returned by :meth:bertnado.evaluation.feature_extraction.Attributer.extract.

Raises:

- ValueError: If task_type or method is not supported.
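Integrated Gradients approximates an integral of gradients along a straight path from a baseline to the input, and n_steps controls the resolution of that numeric approximation. A one-dimensional sketch using the midpoint rule, for f(x) = x² with baseline 0, where the exact attribution equals f(x) − f(baseline):

```python
def integrated_gradient(f_grad, baseline, x, n_steps=50):
    """Midpoint-rule approximation of integrated gradients in 1-D.

    f_grad: gradient function of the model output with respect to the input.
    """
    total = 0.0
    for k in range(1, n_steps + 1):
        # Evaluate the gradient at the midpoint of each sub-interval.
        point = baseline + ((k - 0.5) / n_steps) * (x - baseline)
        total += f_grad(point)
    return (x - baseline) * total / n_steps

# f(x) = x**2, so f'(x) = 2 * x; the exact attribution from baseline 0
# to x = 3 is f(3) - f(0) = 9.
attribution = integrated_gradient(lambda x: 2 * x, baseline=0.0, x=3.0, n_steps=50)
```

Larger n_steps makes the sum a closer approximation of the integral at proportionally higher cost, which is the trade-off the n_steps parameter exposes.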

analyze_features(model_dir: PathLike, dataset_dir: PathLike, output_dir: PathLike, task_type: TaskType, tokenizer_name: str = DEFAULT_TOKENIZER_NAME, method: FeatureMethod = 'lig', target_class: int = 1, max_examples: int | None = None, n_steps: int = 50) -> Any

Run feature attribution using the CLI-style function name.

This convenience wrapper calls :func:extract_features with the same arguments. It exists for users who prefer the feature analysis wording from the CLI.

Parameters:

- model_dir (PathLike, required): Directory containing the trained BertNado model.
- dataset_dir (PathLike, required): Directory containing a dataset prepared by :func:prepare_dataset.
- output_dir (PathLike, required): Directory where feature-attribution outputs should be saved.
- task_type (TaskType, required): Learning task. Must be "binary_classification", "multilabel_classification", or "regression".
- tokenizer_name (str): Hugging Face tokenizer name or local tokenizer path. Defaults to "PoetschLab/GROVER".
- method (FeatureMethod): Attribution method to run. Must be "shap", "lig", or "both". Defaults to "lig".
- target_class (int): Class index to explain for classification tasks. Defaults to 1.
- max_examples (int | None): Optional maximum number of examples to process. Use None to fall back to the implementation default. Defaults to None.
- n_steps (int): Number of integration steps for LIG. Defaults to 50.

Returns:

- Any: The value returned by :func:extract_features.

Raises:

- ValueError: If task_type or method is not supported.