# API
The bertnado.api module provides a lightweight Python interface over the same
workflows exposed by the CLI.
For a detailed explanation of the data, sweep, training, prediction, and feature-attribution stages, see the workflow guide.
## Full Workflow
This script runs the full BertNado workflow from Python: dataset preparation, hyperparameter sweep, final training, prediction, evaluation, and feature attribution.
The `config_path` argument in the sweep step points to a Weights & Biases sweep
configuration JSON file. The mock path below is just an example; see the
CLI sweep config section for a complete template.
The sweep metric is also used by the Hugging Face `Trainer` to choose the best
checkpoint inside each run.
BertNado logs sweeps and training runs to Weights & Biases. Run `wandb login`
once on local machines, or set `WANDB_API_KEY` in non-interactive environments
before calling `run_sweep` or `train_model`.
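For example, a non-interactive job can export the key before the API is called; the key value below is a placeholder, and in practice it should come from a CI secret store rather than source code:

```python
import os

# Placeholder value for illustration only -- supply a real key from your
# W&B account settings, ideally injected via your CI secret store.
os.environ.setdefault("WANDB_API_KEY", "your-api-key")
```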
```python
from pathlib import Path

from bertnado.api import (
    extract_features,
    predict_and_evaluate,
    prepare_dataset,
    run_sweep,
    train_model,
)

PROJECT_NAME = "bertnado"
MODEL_NAME = "PoetschLab/GROVER"
TASK_TYPE = "binary_classification"

DATA_DIR = Path("test/data")
OUTPUT_DIR = Path("output")
DATASET_DIR = OUTPUT_DIR / "dataset"
SWEEP_DIR = OUTPUT_DIR / "sweep"
TRAIN_DIR = OUTPUT_DIR / "train"
PREDICTIONS_DIR = OUTPUT_DIR / "predictions"
FEATURE_DIR = OUTPUT_DIR / "feature_analysis"


def main() -> None:
    # 1. Prepare and tokenize the chromosome-aware dataset.
    prepare_dataset(
        file_path=DATA_DIR / "mock_data.parquet",
        target_column="bound",
        fasta_file=DATA_DIR / "mock_genome.fasta",
        output_dir=DATASET_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
        threshold=0.5,
    )

    # 2. Run the W&B hyperparameter sweep and keep the best run's config.
    sweep = run_sweep(
        config_path=DATA_DIR / "mock_sweep_config.json",
        output_dir=SWEEP_DIR,
        dataset=DATASET_DIR,
        project_name=PROJECT_NAME,
        task_type=TASK_TYPE,
        model_name=MODEL_NAME,
        sweep_count=10,
        metric_name="eval/roc_auc",
        metric_goal="maximize",
    )

    # 3. Train the final model from the best sweep configuration.
    train_model(
        output_dir=TRAIN_DIR,
        dataset=DATASET_DIR,
        best_config_path=sweep["best_config_path"],
        project_name=PROJECT_NAME,
        task_type=TASK_TYPE,
        model_name=MODEL_NAME,
        metric_name=sweep["metric_name"],
        metric_goal=sweep["metric_goal"],
    )

    # 4. Predict on the test split and write evaluation artifacts.
    predict_and_evaluate(
        model_dir=TRAIN_DIR / "model",
        dataset_dir=DATASET_DIR,
        output_dir=PREDICTIONS_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
    )

    # 5. Compute SHAP and LIG feature attributions.
    extract_features(
        model_dir=TRAIN_DIR / "model",
        dataset_dir=DATASET_DIR,
        output_dir=FEATURE_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
        method="both",
        target_class=1,
    )


if __name__ == "__main__":
    main()
```
## Step-by-Step Calls
```python
sweep = run_sweep(
    config_path="test/data/mock_sweep_config.json",
    output_dir="output/sweep",
    dataset="output/dataset",
    project_name="bertnado",
    task_type="binary_classification",
    model_name="PoetschLab/GROVER",
    sweep_count=10,
    metric_name="eval/roc_auc",
    metric_goal="maximize",
)
```
`config_path` is the sweep recipe, not input data. It tells BertNado which
metric to optimize and which hyperparameters to sample. `metric_name` and
`metric_goal` are optional overrides; when omitted, BertNado uses the sweep
config metric or the task default.
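A minimal sweep config in the standard W&B schema might look like the following; the hyperparameter names here are illustrative, and the CLI sweep config section has the authoritative template:

```python
import json

# Standard W&B sweep schema: a search method, the metric to optimize, and the
# hyperparameter search space. The parameter names are examples, not
# BertNado's required set.
sweep_config = {
    "method": "random",
    "metric": {"name": "eval/roc_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-4},
        "num_train_epochs": {"values": [2, 3, 4]},
    },
}

with open("sweep_config.json", "w") as fh:
    json.dump(sweep_config, fh, indent=2)
```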
```python
train_model(
    output_dir="output/train",
    dataset="output/dataset",
    best_config_path=sweep["best_config_path"],
    project_name="bertnado",
    task_type="binary_classification",
    model_name="PoetschLab/GROVER",
    metric_name=sweep["metric_name"],
    metric_goal=sweep["metric_goal"],
)
```
If the config was produced by `run_sweep`, the metric arguments are optional
because the resolved optimization metric is already saved in
`best_sweep_config.json`.
## Convenience Aliases

The API includes aliases for the most common naming styles:

| Alias | Target |
|---|---|
| `prepare_data` | `prepare_dataset` |
| `train` | `train_model` |
| `full_train` | `train_model` |
| `predict` | `predict_and_evaluate` |
| `feature_analysis` | `extract_features` |
| `analyze_features` | `extract_features` |
## Reference
Programmatic API for BertNado workflows.
This module mirrors BertNado's CLI commands with plain Python functions. Use it when you want to run dataset preparation, sweeps, training, evaluation, and feature attribution from notebooks, scripts, or larger Python applications.
The heavy workflow dependencies are imported lazily inside each function, so
`import bertnado.api` stays lightweight.
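That lazy-import pattern can be sketched generically; the function name and the `statistics` stand-in below are illustrative, not bertnado internals:

```python
def run_heavy_step(values):
    # Heavy dependencies (torch, transformers, ...) would be imported here,
    # inside the function body, so importing the enclosing module costs
    # almost nothing; the import price is paid on the first call instead.
    import statistics  # lightweight stand-in for a heavy dependency

    return statistics.mean(values)
```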
### `prepare_dataset`

```python
prepare_dataset(
    file_path: PathLike,
    target_column: str,
    fasta_file: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    threshold: float = 0.5,
) -> Any
```
Prepare and tokenize a chromosome-aware dataset.
This is the Python equivalent of the `bertnado-data` CLI command. It
reads the input target table, extracts DNA sequences from the FASTA file,
creates chromosome-aware train/validation/test splits, tokenizes the
sequences, and writes the prepared dataset to `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `PathLike` | Path to the input Parquet file containing genomic regions and target values. | *required* |
| `target_column` | `str` | Name of the column in `file_path` that holds the target values. | *required* |
| `fasta_file` | `PathLike` | Path to the genome FASTA file used to extract sequences. | *required* |
| `output_dir` | `PathLike` | Directory where the prepared dataset should be written. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `threshold` | `float` | Decision threshold used when converting continuous targets for binary classification. | `0.5` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying dataset-preparation call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `task_type` has an unsupported value. |
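The effect of `threshold` on continuous targets can be sketched as follows; whether BertNado treats the boundary with `>=` or `>` is an assumption here, not confirmed by the source:

```python
def binarize(targets, threshold=0.5):
    # Assumed semantics: values at or above the threshold map to the
    # positive class (1), everything below maps to the negative class (0).
    return [1 if t >= threshold else 0 for t in targets]

labels = binarize([0.1, 0.5, 0.9])
```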
### `run_sweep`

```python
run_sweep(
    config_path: PathLike,
    output_dir: PathLike,
    dataset: PathLike,
    project_name: str,
    task_type: TaskType,
    model_name: str = DEFAULT_MODEL_NAME,
    sweep_count: int = 10,
    metric_name: str | None = None,
    metric_goal: MetricGoal | None = None,
) -> dict[str, Any]
```
Run a W&B hyperparameter sweep and save the best run config.
This is the Python equivalent of the `bertnado-sweep` CLI command. It
loads a W&B sweep configuration, creates a sweep, runs `sweep_count`
trials, selects the best run using the configured metric, and writes that
run's configuration to `best_sweep_config.json` in `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config_path` | `PathLike` | Path to the W&B sweep configuration JSON file. The file can include a metric definition and the hyperparameter search space. | *required* |
| `output_dir` | `PathLike` | Directory where `best_sweep_config.json` is written. | *required* |
| `dataset` | `PathLike` | Path to a dataset prepared by `prepare_dataset`. | *required* |
| `project_name` | `str` | W&B project name used for sweep creation and run lookup. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `model_name` | `str` | Hugging Face model name or local model path. | `DEFAULT_MODEL_NAME` |
| `sweep_count` | `int` | Number of W&B agent trials to run. | `10` |
| `metric_name` | `str \| None` | Optional metric to optimize, such as `eval/roc_auc`. | `None` |
| `metric_goal` | `MetricGoal \| None` | Optional optimization direction; must be `maximize` or `minimize`. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Sweep metadata with keys including `best_config_path`, `metric_name`, and `metric_goal`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `metric_goal` has an unsupported value. |
| `RuntimeError` | If the sweep completes without any recorded runs. |
### `train_model`

```python
train_model(
    output_dir: PathLike,
    dataset: PathLike,
    best_config_path: PathLike,
    project_name: str,
    task_type: TaskType,
    model_name: str = DEFAULT_MODEL_NAME,
    pos_weight: float | list[float] | None = None,
    metric_name: str | None = None,
    metric_goal: MetricGoal | None = None,
    **training_kwargs: Any,
) -> Any
```
Train a final model from a saved sweep configuration.
This is the Python equivalent of the `bertnado-train` CLI command. It
loads the best hyperparameter configuration from `best_config_path`, trains
the selected model on the prepared dataset, and writes model artifacts to
`output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `PathLike` | Directory where training artifacts and the final model should be saved. | *required* |
| `dataset` | `PathLike` | Path to a dataset prepared by `prepare_dataset`. | *required* |
| `best_config_path` | `PathLike` | Path to `best_sweep_config.json` produced by `run_sweep`. | *required* |
| `project_name` | `str` | W&B project name used for training run logging. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `model_name` | `str` | Hugging Face model name or local model path. | `DEFAULT_MODEL_NAME` |
| `pos_weight` | `float \| list[float] \| None` | Optional positive-class weight for imbalanced classification. Pass a scalar, a list of per-class weights, or a tensor-like object. | `None` |
| `metric_name` | `str \| None` | Optional metric used to choose the best checkpoint, such as `eval/roc_auc`. | `None` |
| `metric_goal` | `MetricGoal \| None` | Optional optimization direction; must be `maximize` or `minimize`. | `None` |
| `training_kwargs` | `Any` | Extra Hugging Face training keyword arguments forwarded to the underlying trainer. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying training call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `metric_goal` has an unsupported value. |
### `predict_and_evaluate`

```python
predict_and_evaluate(
    model_dir: PathLike,
    dataset_dir: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    threshold: float = 0.5,
) -> Any
```
Run prediction on the test split and write evaluation outputs.
This is the Python equivalent of the `bertnado-predict` CLI command. It
loads a trained model, evaluates it against the prepared dataset's test
split, and writes prediction/evaluation artifacts such as metrics and plots
to `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `PathLike` | Directory containing the trained BertNado model. | *required* |
| `dataset_dir` | `PathLike` | Directory containing a dataset prepared by `prepare_dataset`. | *required* |
| `output_dir` | `PathLike` | Directory where predictions, metrics, and figures should be saved. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `threshold` | `float` | Decision threshold used for binary or multilabel classification predictions. | `0.5` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying prediction-and-evaluation call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `task_type` has an unsupported value. |
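How a decision threshold turns raw model outputs into hard labels can be sketched as follows; the sigmoid step and the helper name are illustrative, not BertNado internals:

```python
import math

def probabilities_to_labels(logits, threshold=0.5):
    # Sigmoid maps each logit to a probability in (0, 1); the cutoff then
    # yields a 0/1 label per output (per label, in the multilabel case).
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p >= threshold else 0 for p in probs]

preds = probabilities_to_labels([-2.0, 0.3, 1.5])
```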
### `extract_features`

```python
extract_features(
    model_dir: PathLike,
    dataset_dir: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    method: FeatureMethod = 'lig',
    target_class: int = 1,
    max_examples: int | None = None,
    n_steps: int = 50,
) -> Any
```
Run SHAP, LIG, or both feature-attribution methods.
This is the Python equivalent of the `bertnado-feature` CLI command. It
loads a trained model and prepared dataset, computes attribution scores with
SHAP, Layer Integrated Gradients (LIG), or both, and writes the analysis
outputs to `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `PathLike` | Directory containing the trained BertNado model. | *required* |
| `dataset_dir` | `PathLike` | Directory containing a dataset prepared by `prepare_dataset`. | *required* |
| `output_dir` | `PathLike` | Directory where feature-attribution outputs should be saved. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `method` | `FeatureMethod` | Attribution method to run; must be `shap`, `lig`, or `both`. | `'lig'` |
| `target_class` | `int` | Class index to explain for classification tasks. | `1` |
| `max_examples` | `int \| None` | Optional maximum number of examples to process; `None` processes all examples. | `None` |
| `n_steps` | `int` | Number of integration steps for LIG. | `50` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying feature-attribution call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `method` has an unsupported value. |
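What `n_steps` controls can be seen in a toy one-dimensional integrated-gradients sketch; the quadratic model and the midpoint rule below are illustrative choices, not the real LIG implementation:

```python
def integrated_gradient_1d(x, baseline=0.0, n_steps=50):
    # Integrated gradients in 1-D: (x - baseline) times the average gradient
    # along the straight path from baseline to x, approximated with n_steps
    # midpoint samples. More steps means a finer (slower) approximation.
    def grad(v):  # gradient of the toy model f(v) = v**2
        return 2.0 * v

    total = 0.0
    for k in range(n_steps):
        point = baseline + ((k + 0.5) / n_steps) * (x - baseline)
        total += grad(point)
    return (x - baseline) * total / n_steps

# For f(x) = x**2 the exact attribution is f(3) - f(0) = 9.
attribution = integrated_gradient_1d(3.0, n_steps=50)
```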
### `analyze_features`

```python
analyze_features(
    model_dir: PathLike,
    dataset_dir: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    method: FeatureMethod = 'lig',
    target_class: int = 1,
    max_examples: int | None = None,
    n_steps: int = 50,
) -> Any
```
Run feature attribution using the CLI-style function name.
This convenience wrapper calls `extract_features` with the same
arguments. It exists for users who prefer the "feature analysis" wording
from the CLI.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `PathLike` | Directory containing the trained BertNado model. | *required* |
| `dataset_dir` | `PathLike` | Directory containing a dataset prepared by `prepare_dataset`. | *required* |
| `output_dir` | `PathLike` | Directory where feature-attribution outputs should be saved. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `method` | `FeatureMethod` | Attribution method to run; must be `shap`, `lig`, or `both`. | `'lig'` |
| `target_class` | `int` | Class index to explain for classification tasks. | `1` |
| `max_examples` | `int \| None` | Optional maximum number of examples to process; `None` processes all examples. | `None` |
| `n_steps` | `int` | Number of integration steps for LIG. | `50` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by `extract_features`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `method` has an unsupported value. |