# API
The bertnado.api module provides a lightweight Python interface over the same
workflows exposed by the CLI.
For a detailed explanation of the data, sweep, training, prediction, and feature-attribution stages, see the workflow guide.
## Full Workflow
This script runs the full BertNado workflow from Python: dataset preparation, hyperparameter sweep, final training, prediction, evaluation, and feature attribution.
The `config_path` argument in the sweep step points to a Weights & Biases sweep
configuration JSON file. The mock path below is just an example; see the
CLI sweep config section for a complete template.
The sweep metric is also used by the Hugging Face `Trainer` to choose the best
checkpoint inside each run.
BertNado logs sweeps and training runs to Weights & Biases. Run `wandb login`
once on local machines, or set `WANDB_API_KEY` in non-interactive environments
before calling `run_sweep` or `train_model`.
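For example, a non-interactive job can export the key before the API is called; the key value below is a placeholder, and in practice it should come from a CI secret store rather than source code:

```python
import os

# Placeholder value for illustration only -- supply a real key from your
# W&B account settings, ideally injected via your CI secret store.
os.environ.setdefault("WANDB_API_KEY", "your-api-key")
```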
```python
from pathlib import Path

from bertnado.api import (
    extract_features,
    predict_and_evaluate,
    prepare_dataset,
    run_sweep,
    train_model,
)

PROJECT_NAME = "bertnado"
MODEL_NAME = "PoetschLab/GROVER"
TASK_TYPE = "binary_classification"

DATA_DIR = Path("test/data")
OUTPUT_DIR = Path("output")
DATASET_DIR = OUTPUT_DIR / "dataset"
SWEEP_DIR = OUTPUT_DIR / "sweep"
TRAIN_DIR = OUTPUT_DIR / "train"
PREDICTIONS_DIR = OUTPUT_DIR / "predictions"
FEATURE_DIR = OUTPUT_DIR / "feature_analysis"


def main() -> None:
    # 1. Prepare and tokenize the chromosome-aware dataset.
    prepare_dataset(
        file_path=DATA_DIR / "mock_data.parquet",
        target_column="bound",
        fasta_file=DATA_DIR / "mock_genome.fasta",
        output_dir=DATASET_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
        threshold=0.5,
    )

    # 2. Run the W&B hyperparameter sweep and keep the best run's config.
    sweep = run_sweep(
        config_path=DATA_DIR / "mock_sweep_config.json",
        output_dir=SWEEP_DIR,
        dataset=DATASET_DIR,
        project_name=PROJECT_NAME,
        task_type=TASK_TYPE,
        model_name=MODEL_NAME,
        sweep_count=10,
        metric_name="eval/roc_auc",
        metric_goal="maximize",
    )

    # 3. Train the final model from the best sweep configuration.
    train_model(
        output_dir=TRAIN_DIR,
        dataset=DATASET_DIR,
        best_config_path=sweep["best_config_path"],
        project_name=PROJECT_NAME,
        task_type=TASK_TYPE,
        model_name=MODEL_NAME,
        metric_name=sweep["metric_name"],
        metric_goal=sweep["metric_goal"],
    )

    # 4. Predict on the test split and write evaluation artifacts.
    predict_and_evaluate(
        model_dir=TRAIN_DIR / "model",
        dataset_dir=DATASET_DIR,
        output_dir=PREDICTIONS_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
    )

    # 5. Compute SHAP and LIG feature attributions.
    extract_features(
        model_dir=TRAIN_DIR / "model",
        dataset_dir=DATASET_DIR,
        output_dir=FEATURE_DIR,
        task_type=TASK_TYPE,
        tokenizer_name=MODEL_NAME,
        method="both",
        target_class=1,
    )


if __name__ == "__main__":
    main()
```
## Step-by-Step Calls
```python
sweep = run_sweep(
    config_path="test/data/mock_sweep_config.json",
    output_dir="output/sweep",
    dataset="output/dataset",
    project_name="bertnado",
    task_type="binary_classification",
    model_name="PoetschLab/GROVER",
    sweep_count=10,
    metric_name="eval/roc_auc",
    metric_goal="maximize",
)
```
`config_path` is the sweep recipe, not input data. It tells BertNado which
metric to optimize and which hyperparameters to sample. `metric_name` and
`metric_goal` are optional overrides; when omitted, BertNado uses the sweep
config metric or the task default.
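A minimal sweep config in the standard W&B schema might look like the following; the hyperparameter names here are illustrative, and the CLI sweep config section has the authoritative template:

```python
import json

# Standard W&B sweep schema: a search method, the metric to optimize, and the
# hyperparameter search space. The parameter names are examples, not
# BertNado's required set.
sweep_config = {
    "method": "random",
    "metric": {"name": "eval/roc_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-4},
        "num_train_epochs": {"values": [2, 3, 4]},
    },
}

with open("sweep_config.json", "w") as fh:
    json.dump(sweep_config, fh, indent=2)
```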
```python
train_model(
    output_dir="output/train",
    dataset="output/dataset",
    best_config_path=sweep["best_config_path"],
    project_name="bertnado",
    task_type="binary_classification",
    model_name="PoetschLab/GROVER",
    metric_name=sweep["metric_name"],
    metric_goal=sweep["metric_goal"],
)
```
If the config was produced by `run_sweep`, the metric arguments are optional
because the resolved optimization metric is already saved in
`best_sweep_config.json`.
## Convenience Aliases

The API includes aliases for the most common naming styles:

| Alias | Target |
|---|---|
| `prepare_data` | `prepare_dataset` |
| `train` | `train_model` |
| `full_train` | `train_model` |
| `predict` | `predict_and_evaluate` |
| `feature_analysis` | `extract_features` |
| `analyze_features` | `extract_features` |
## Reference
Programmatic API for BertNado workflows.
This module mirrors BertNado's CLI commands with plain Python functions. Use it when you want to run dataset preparation, sweeps, training, evaluation, and feature attribution from notebooks, scripts, or larger Python applications.
The heavy workflow dependencies are imported lazily inside each function, so
`import bertnado.api` stays lightweight.
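That lazy-import pattern can be sketched generically; the function name and the `statistics` stand-in below are illustrative, not bertnado internals:

```python
def run_heavy_step(values):
    # Heavy dependencies (torch, transformers, ...) would be imported here,
    # inside the function body, so importing the enclosing module costs
    # almost nothing; the import price is paid on the first call instead.
    import statistics  # lightweight stand-in for a heavy dependency

    return statistics.mean(values)
```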
### `prepare_dataset`

```python
prepare_dataset(
    file_path: PathLike,
    target_column: str,
    fasta_file: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    threshold: float = 0.5,
) -> Any
```
Prepare and tokenize a chromosome-aware dataset.
This is the Python equivalent of the `bertnado-data` CLI command. It
reads the input target table, extracts DNA sequences from the FASTA file,
creates chromosome-aware train/validation/test splits, tokenizes the
sequences, and writes the prepared dataset to `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `PathLike` | Path to the input Parquet file containing genomic regions and target values. | *required* |
| `target_column` | `str` | Name of the column in `file_path` that holds the target values. | *required* |
| `fasta_file` | `PathLike` | Path to the genome FASTA file used to extract sequences. | *required* |
| `output_dir` | `PathLike` | Directory where the prepared dataset should be written. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `threshold` | `float` | Decision threshold used when converting continuous targets for binary classification. | `0.5` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying dataset-preparation call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `task_type` has an unsupported value. |
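The effect of `threshold` on continuous targets can be sketched as follows; whether BertNado treats the boundary with `>=` or `>` is an assumption here, not confirmed by the source:

```python
def binarize(targets, threshold=0.5):
    # Assumed semantics: values at or above the threshold map to the
    # positive class (1), everything below maps to the negative class (0).
    return [1 if t >= threshold else 0 for t in targets]

labels = binarize([0.1, 0.5, 0.9])
```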
### `run_sweep`

```python
run_sweep(
    config_path: PathLike,
    output_dir: PathLike,
    dataset: PathLike,
    project_name: str,
    task_type: TaskType,
    model_name: str = DEFAULT_MODEL_NAME,
    sweep_count: int = 10,
    metric_name: str | None = None,
    metric_goal: MetricGoal | None = None,
) -> dict[str, Any]
```
Run a W&B hyperparameter sweep and save the best run config.
This is the Python equivalent of the `bertnado-sweep` CLI command. It
loads a W&B sweep configuration, creates a sweep, runs `sweep_count`
trials, selects the best run using the configured metric, and writes that
run's configuration to `best_sweep_config.json` in `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config_path` | `PathLike` | Path to the W&B sweep configuration JSON file. The file can include a metric definition and the hyperparameter search space. | *required* |
| `output_dir` | `PathLike` | Directory where `best_sweep_config.json` is written. | *required* |
| `dataset` | `PathLike` | Path to a dataset prepared by `prepare_dataset`. | *required* |
| `project_name` | `str` | W&B project name used for sweep creation and run lookup. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `model_name` | `str` | Hugging Face model name or local model path. | `DEFAULT_MODEL_NAME` |
| `sweep_count` | `int` | Number of W&B agent trials to run. | `10` |
| `metric_name` | `str \| None` | Optional metric to optimize, such as `eval/roc_auc`. | `None` |
| `metric_goal` | `MetricGoal \| None` | Optional optimization direction; must be `maximize` or `minimize`. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Sweep metadata with keys including `best_config_path`, `metric_name`, and `metric_goal`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `metric_goal` has an unsupported value. |
| `RuntimeError` | If the sweep completes without any recorded runs. |
### `train_model`

```python
train_model(
    output_dir: PathLike,
    dataset: PathLike,
    best_config_path: PathLike,
    project_name: str,
    task_type: TaskType,
    model_name: str = DEFAULT_MODEL_NAME,
    pos_weight: float | list[float] | None = None,
    metric_name: str | None = None,
    metric_goal: MetricGoal | None = None,
    **training_kwargs: Any,
) -> Any
```
Train a final model from a saved sweep configuration.
This is the Python equivalent of the `bertnado-train` CLI command. It
loads the best hyperparameter configuration from `best_config_path`, trains
the selected model on the prepared dataset, and writes model artifacts to
`output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `output_dir` | `PathLike` | Directory where training artifacts and the final model should be saved. | *required* |
| `dataset` | `PathLike` | Path to a dataset prepared by `prepare_dataset`. | *required* |
| `best_config_path` | `PathLike` | Path to `best_sweep_config.json` produced by `run_sweep`. | *required* |
| `project_name` | `str` | W&B project name used for training run logging. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `model_name` | `str` | Hugging Face model name or local model path. | `DEFAULT_MODEL_NAME` |
| `pos_weight` | `float \| list[float] \| None` | Optional positive-class weight for imbalanced classification. Pass a scalar, a list of per-class weights, or a tensor-like object. | `None` |
| `metric_name` | `str \| None` | Optional metric used to choose the best checkpoint, such as `eval/roc_auc`. | `None` |
| `metric_goal` | `MetricGoal \| None` | Optional optimization direction; must be `maximize` or `minimize`. | `None` |
| `training_kwargs` | `Any` | Extra Hugging Face training keyword arguments forwarded to the underlying trainer. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying training call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `metric_goal` has an unsupported value. |
### `predict_and_evaluate`

```python
predict_and_evaluate(
    model_dir: PathLike,
    dataset_dir: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    threshold: float = 0.5,
) -> Any
```
Run prediction on the test split and write evaluation outputs.
This is the Python equivalent of the `bertnado-predict` CLI command. It
loads a trained model, evaluates it against the prepared dataset's test
split, and writes prediction/evaluation artifacts such as metrics and plots
to `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `PathLike` | Directory containing the trained BertNado model. | *required* |
| `dataset_dir` | `PathLike` | Directory containing a dataset prepared by `prepare_dataset`. | *required* |
| `output_dir` | `PathLike` | Directory where predictions, metrics, and figures should be saved. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `threshold` | `float` | Decision threshold used for binary or multilabel classification predictions. | `0.5` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying prediction-and-evaluation call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `task_type` has an unsupported value. |
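How a decision threshold turns raw model outputs into hard labels can be sketched as follows; the sigmoid step and the helper name are illustrative, not BertNado internals:

```python
import math

def probabilities_to_labels(logits, threshold=0.5):
    # Sigmoid maps each logit to a probability in (0, 1); the cutoff then
    # yields a 0/1 label per output (per label, in the multilabel case).
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p >= threshold else 0 for p in probs]

preds = probabilities_to_labels([-2.0, 0.3, 1.5])
```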
### `extract_features`

```python
extract_features(
    model_dir: PathLike,
    dataset_dir: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    method: FeatureMethod = 'lig',
    target_class: int = 1,
    max_examples: int | None = None,
    n_steps: int = 50,
) -> Any
```
Run SHAP, LIG, or both feature-attribution methods.
This is the Python equivalent of the `bertnado-feature` CLI command. It
loads a trained model and prepared dataset, computes attribution scores with
SHAP, Layer Integrated Gradients (LIG), or both, and writes the analysis
outputs to `output_dir`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `PathLike` | Directory containing the trained BertNado model. | *required* |
| `dataset_dir` | `PathLike` | Directory containing a dataset prepared by `prepare_dataset`. | *required* |
| `output_dir` | `PathLike` | Directory where feature-attribution outputs should be saved. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `method` | `FeatureMethod` | Attribution method to run; must be `shap`, `lig`, or `both`. | `'lig'` |
| `target_class` | `int` | Class index to explain for classification tasks. | `1` |
| `max_examples` | `int \| None` | Optional maximum number of examples to process; `None` processes all examples. | `None` |
| `n_steps` | `int` | Number of integration steps for LIG. | `50` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by the underlying feature-attribution call. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `method` has an unsupported value. |
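What `n_steps` controls can be seen in a toy one-dimensional integrated-gradients sketch; the quadratic model and the midpoint rule below are illustrative choices, not the real LIG implementation:

```python
def integrated_gradient_1d(x, baseline=0.0, n_steps=50):
    # Integrated gradients in 1-D: (x - baseline) times the average gradient
    # along the straight path from baseline to x, approximated with n_steps
    # midpoint samples. More steps means a finer (slower) approximation.
    def grad(v):  # gradient of the toy model f(v) = v**2
        return 2.0 * v

    total = 0.0
    for k in range(n_steps):
        point = baseline + ((k + 0.5) / n_steps) * (x - baseline)
        total += grad(point)
    return (x - baseline) * total / n_steps

# For f(x) = x**2 the exact attribution is f(3) - f(0) = 9.
attribution = integrated_gradient_1d(3.0, n_steps=50)
```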
### `analyze_features`

```python
analyze_features(
    model_dir: PathLike,
    dataset_dir: PathLike,
    output_dir: PathLike,
    task_type: TaskType,
    tokenizer_name: str = DEFAULT_TOKENIZER_NAME,
    method: FeatureMethod = 'lig',
    target_class: int = 1,
    max_examples: int | None = None,
    n_steps: int = 50,
) -> Any
```
Run feature attribution using the CLI-style function name.
This convenience wrapper calls `extract_features` with the same
arguments. It exists for users who prefer the "feature analysis" wording
from the CLI.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `PathLike` | Directory containing the trained BertNado model. | *required* |
| `dataset_dir` | `PathLike` | Directory containing a dataset prepared by `prepare_dataset`. | *required* |
| `output_dir` | `PathLike` | Directory where feature-attribution outputs should be saved. | *required* |
| `task_type` | `TaskType` | Learning task; must be a supported `TaskType` value such as `binary_classification`. | *required* |
| `tokenizer_name` | `str` | Hugging Face tokenizer name or local tokenizer path. | `DEFAULT_TOKENIZER_NAME` |
| `method` | `FeatureMethod` | Attribution method to run; must be `shap`, `lig`, or `both`. | `'lig'` |
| `target_class` | `int` | Class index to explain for classification tasks. | `1` |
| `max_examples` | `int \| None` | Optional maximum number of examples to process; `None` processes all examples. | `None` |
| `n_steps` | `int` | Number of integration steps for LIG. | `50` |

Returns:

| Type | Description |
|---|---|
| `Any` | The value returned by `extract_features`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If an argument such as `task_type` or `method` has an unsupported value. |