lncrnapy.train

Classification

Functions for training a deep learning model for the classification of RNA transcripts as either protein-coding or long non-coding.

lncrnapy.train.classification.epoch_classifier(model, dataloader, loss_function, optimizer, scaler): Trains model for a single epoch.

lncrnapy.train.classification.evaluate_classifier(model, data, loss_function, metrics={'Accuracy': <function accuracy_score>, 'F1 (macro)': <function <lambda>>, 'Precision (ncRNA)': <function <lambda>>, 'Precision (pcRNA)': <function <lambda>>, 'Recall (ncRNA)': <function <lambda>>, 'Recall (pcRNA)': <function <lambda>>}): Simple evaluation function to keep track of in-training progress.

lncrnapy.train.classification.train_classifier(model, train_data, valid_data, epochs, n_samples_per_epoch=None, batch_size=8, optimizer=None, weighted_loss=True, random_reading_frame=True, logger=None, metrics={'Accuracy': <function accuracy_score>, 'F1 (macro)': <function <lambda>>, 'Precision (ncRNA)': <function <lambda>>, 'Precision (pcRNA)': <function <lambda>>, 'Recall (ncRNA)': <function <lambda>>, 'Recall (pcRNA)': <function <lambda>>})

Trains model for the classification task, using train_data, for the specified number of epochs.

Parameters:

model (torch.nn.Module | lncrnapy.modules.Classifier) – Neural network that is to be trained.
train_data (lncrnapy.data.Data) – Data to use for training, must call set_tensor_features first. After every training epoch, the performance of the model on a subset of the training set is determined. The length of this subset is min(len(train_data), len(valid_data)).
valid_data (lncrnapy.data.Data) – Data to use for validation, must call set_tensor_features first.
epochs (int) – How many epochs (data run-throughs) to train for.
n_samples_per_epoch (int) – If specified, indicates the number of samples per training epoch. If None, will sample the full training set.
batch_size (int) – Number of examples per batch (default is 8).
optimizer (torch.optim) – Optimizer to update the network’s weights during training. If None (default), will use Adam with learning rate 0.0001.
weighted_loss (bool) – Whether to apply weighted loss to correct for class imbalance (default is True).
random_reading_frame (bool) – If True (default) and model.base_arch==CSEBERT, trains the model with sequences that have been frameshifted by a random offset in the range [0, kernel_size].
logger (lncrnapy.train.loggers) – Logger object whose log method will be called at every epoch. If None (default), will use LoggerBase, which only keeps track of the history.
metrics (dict[str:callable]) – Metrics (name + function) that will be evaluated at every epoch.
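The weighted_loss option corrects for class imbalance, but how the weights are derived is not stated here. The sketch below assumes inverse-frequency weighting, a common choice; the helper name inverse_frequency_weights and the toy labels are hypothetical and not part of lncrnapy.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Hypothetical helper: weight each class by n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy label set: three pcRNA transcripts for every ncRNA transcript.
labels = ["pcRNA"] * 6 + ["ncRNA"] * 2
weights = inverse_frequency_weights(labels)
```

With such weights, errors on the rarer class contribute more to the loss, so the model is not rewarded for always predicting the majority class.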

Loggers

Logger objects used by training functions. Loggers should inherit from LoggerBase and contain the following methods:

* set_columns: Specifies which columns will be logged.
* log: Called every epoch, logs new results.

class lncrnapy.train.loggers.EarlyStopping(metric_name, filepath, maximize=True)

Special logger that saves the model if it has the best-so-far performance.

`metric_name`

Column name of the metric that the early stopping is based on.

Type:: str

`filepath`

Path (including file name) where the model should be saved.

Type:: str

`sign`

Whether the goal is max-/minimization (1/-1).

Type:: int

`best_score`

Current best score of metric.

Type:: float

`epoch`

Epoch counter.

Type:: int
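The sign and best_score attributes suggest a simple comparison rule. The sketch below is an assumption about how such best-so-far tracking could work, not lncrnapy's implementation; the class BestScoreTracker, its update method, and the save callback are hypothetical names.

```python
class BestScoreTracker:
    """Hypothetical sketch of best-so-far tracking; saving is left to a callback."""

    def __init__(self, metric_name, maximize=True):
        self.metric_name = metric_name
        self.sign = 1 if maximize else -1
        # Start from the worst possible score for the chosen direction.
        self.best_score = self.sign * float("-inf")
        self.epoch = 0

    def update(self, results, save=lambda: None):
        self.epoch += 1
        score = results[self.metric_name]
        # Multiplying by sign turns minimization into maximization.
        improved = self.sign * score > self.sign * self.best_score
        if improved:
            self.best_score = score
            save()
        return improved

tracker = BestScoreTracker("F1 (macro)|valid", maximize=True)
saved_at = []
for f1 in (0.61, 0.72, 0.70, 0.75):
    tracker.update({"F1 (macro)|valid": f1},
                   save=lambda: saved_at.append(tracker.epoch))
```

In this toy run the save callback fires at epochs 1, 2 and 4, the epochs where the score improves on the previous best.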

class lncrnapy.train.loggers.LoggerBase

Base logger that adds a new row to its history DataFrame with every log.

`history`

Contains all logged data throughout training.

Type:: pd.DataFrame

`columns`

Column names of the values that are received every log.

Type:: list[str]

finish(): Finishes logging, reports training time and final performance.

log(epoch_results, model): Logs epoch_results for given model.

start(metrics): This method should be called right before the training loop starts. It sets columns according to the specified metrics, and starts the timrer. The class assumes the loss function as first logged value and train/validation results).

class lncrnapy.train.loggers.LoggerDistribution(filepath, apply_to=None)

class lncrnapy.train.loggers.LoggerList(*args)

Combine multiple loggers into one, executing all of their actions per logging event while keeping track of a single, shared history.

`loggers`

List of loggers from lncrnapy.train.loggers.

Type:: list

log(epoch_results, model): Logs epoch_results for given model.

start(metrics): This method should be called right before the training loop starts. It sets columns according to the specified metrics, and starts the timrer. The class assumes the loss function as first logged value and train/validation results).

class lncrnapy.train.loggers.LoggerPlot(dir_path, metric_names=None)

Plots and saves the results as figures at every epoch.

`dir_path`

Path to new/existing directory in which figures will be stored.

Type:: str

`metric_names`

Indicates which metrics to plot per epoch.

Type:: list[str]

`history`

Contains all logged data throughout training.

Type:: pd.DataFrame

`columns`

Column names of the values that are received every log.

Type:: list[str]

plot_history(metric_name, filepath=None, figsize=None): Plots the history data for a given metric_name, saving the resulting figure to filepath if one is specified.

start(metrics): This method should be called right before the training loop starts. It sets columns according to the specified metrics, and starts the timrer. The class assumes the loss function as first logged value and train/validation results).

class lncrnapy.train.loggers.LoggerPrint(metric_names=None)

Prints the results per epoch.

`epoch`

Epoch counter.

Type:: int

`metric_names`

Indicates which metrics to print per epoch.

Type:: list[str]

`history`

Contains all logged data throughout training.

Type:: pd.DataFrame

`columns`

Column names of the values that are received every log.

Type:: list[str]

class lncrnapy.train.loggers.LoggerTokenCounts(vocab_size, filepath, valid=True): Plots, at every epoch, the true token count vs the predicted token counts, based on the ‘Counts’ metric.

class lncrnapy.train.loggers.LoggerWrite(filepath, metric_names=None)

Writes the results to a file at every epoch.

`filepath`

Path to the new .csv file to which the results are written.

Type:: str

`metric_names`

Indicates which metrics to write per epoch.

Type:: list[str]

`history`

Contains all logged data throughout training.

Type:: pd.DataFrame

`columns`

Column names of the values that are received every log.

Type:: list[str]

Learning Rate Schedule

Simple learning rate schedule implementation.

Huang et al. (2022) https://nlp.seas.harvard.edu/annotated-transformer

class lncrnapy.train.lr_schedule.LrSchedule(optimizer, d_model, warmup_steps): Linearly increases the learning rate for the first warmup_steps, then decreases the learning rate proportionally to 1/sqrt(step_number).
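The shape of this schedule can be sketched as a plain function. The formula below is the one used in the cited Annotated Transformer; that LrSchedule computes exactly this expression is an assumption, and noam_rate is a hypothetical name.

```python
def noam_rate(step, d_model, warmup_steps):
    """Learning-rate factor: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rates before, at, and after the end of warm-up.
rates = [noam_rate(s, d_model=512, warmup_steps=4000)
         for s in (1, 2000, 4000, 8000)]
```

The two arguments of min meet at step == warmup_steps, which is where the rate peaks. Doubling the step count after that point divides the rate by sqrt(2).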

Masked Convolution Modeling

Masked Language Modeling pre-training task for nucleotide sequences that are encoded using Convolutional Sequence Encoding.

References

MycoAI: Romeijn et al. (2024) https://doi.org/10.1111/1755-0998.14006
Huang et al. (2022) https://nlp.seas.harvard.edu/annotated-transformer

lncrnapy.train.masked_conv_modeling.epoch(model, dataloader, p_mlm, p_mask, p_random, mask_size, loss_function, optimizer, scaler, lr_scheduler): Trains model for a single epoch.

lncrnapy.train.masked_conv_modeling.evaluate(model, data, p_mlm, p_mask, p_random, mask_size, loss_function, metrics): Evaluation function to keep track of in-training progress for MCM.

lncrnapy.train.masked_conv_modeling.get_consecutive_indices(indices, mask_size, max_len): Expands indices with up to mask_size follow-up indices. Stays within the maximum bound as specified by max_len.

lncrnapy.train.masked_conv_modeling.get_random_nucs(X_shape): Returns a tensor of random 4D-DNA encoded nucleotides of X_shape.

lncrnapy.train.masked_conv_modeling.index_to_bool(indices, shape): Creates a boolean Tensor of specified shape, where indices are True.

lncrnapy.train.masked_conv_modeling.mask_batch(X, kernel_size, p_mlm, p_mask, p_random, mask_size): Masks a batch of sequence data for MCM.

lncrnapy.train.masked_conv_modeling.train_masked_conv_modeling(model, train_data, valid_data, epochs, n_samples_per_epoch=None, batch_size=8, p_mlm=0.15, p_mask=0.8, p_random=0.1, warmup_steps=32000, loss_function=None, mask_size=1, random_reading_frame=True, logger=None, metrics={'Accuracy': <function accuracy_score>})

Trains model for the Masked Language Modeling task, using train_data, for the specified number of epochs. Assumes that sequence data is provided in four channels (using Data.set_tensor_features('4D-DNA')) and that the model is of type MaskedConvModel.

Parameters:

model (torch.nn.Module | lncrnapy.modules.MaskedConvModel) – Neural network that is to be trained.
train_data (lncrnapy.data.Data) – Data to use for training, must call set_tensor_features('4D-DNA') first. After every training epoch, the performance of the model on a random subset of the training set is determined. The length of this subset is min(len(train_data), len(valid_data)).
valid_data (lncrnapy.data.Data) – Data to use for validation, must call set_tensor_features first.
epochs (int) – How many epochs (data run-throughs) to train for.
n_samples_per_epoch (int) – If specified, indicates the number of samples per training epoch. If None, will sample the full training set.
batch_size (int) – Number of examples per batch (default is 8).
p_mlm (float) – Probability for a nucleotide to be selected for MLM (default is 0.15).
p_mask (float) – Probability for a nucleotide to be masked when selected (default 0.8).
p_random (float) – Probability for a nucleotide to be randomly replaced when selected (default is 0.1).
warmup_steps (int) – Number of training steps in which the learning rate linearly increases. After this number of steps, the learning rate decreases proportionally to the inverse square root of the step number (default is 32000).
loss_function (torch.nn.Module) – Loss function that is to be optimized, assuming logits (so no Softmax) and ignore_index=-1. Uses torch.nn.CrossEntropyLoss if None (default).
mask_size (int) – Number of contiguous nucleotides that make up a mask (default is 1).
random_reading_frame (bool) – If True (default), trains the model with sequences that have been frameshifted by a random offset in the range [0, kernel_size].
logger (lncrnapy.train.loggers) – Logger object whose log method will be called at every epoch. If None (default), will use LoggerBase, which only keeps track of the history.
metrics (dict[str:callable]) – Metrics (name + function) that will be evaluated at every epoch.

Masked Token Modeling

Masked Language Modeling pre-training task for tokenized nucleotide sequences.

References

MycoAI: Romeijn et al. (2024) https://doi.org/10.1111/1755-0998.14006
Huang et al. (2022) https://nlp.seas.harvard.edu/annotated-transformer

lncrnapy.train.masked_token_modeling.epoch(model, dataloader, p_mlm, p_mask, p_random, loss_function, optimizer, scaler, lr_scheduler): Trains model for a single epoch.

lncrnapy.train.masked_token_modeling.evaluate(model, data, p_mlm, p_mask, p_random, loss_function, metrics): Evaluation function to keep track of in-training progress for MLM.

lncrnapy.train.masked_token_modeling.mask_batch(X, vocab_size, p_mlm, p_mask, p_random): Masks a batch of sequence data for MLM.
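The roles of p_mlm, p_mask and p_random can be sketched on a plain list of token ids. This is a BERT-style illustration under assumptions, not the library's tensor implementation: mask_tokens is a hypothetical name, mask_id and ignore_id are placeholder values, and selected tokens that are neither masked nor replaced are assumed to stay unchanged.

```python
import random

def mask_tokens(tokens, vocab_size, p_mlm=0.15, p_mask=0.8, p_random=0.1,
                mask_id=0, ignore_id=-1, rng=None):
    """Hypothetical sketch of BERT-style masking on a list of token ids."""
    rng = rng or random.Random()
    inputs, targets = list(tokens), [ignore_id] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= p_mlm:
            continue  # not selected: input unchanged, no loss at this position
        targets[i] = tok  # the model must predict the original token here
        r = rng.random()
        if r < p_mask:
            inputs[i] = mask_id  # replace with the mask token
        elif r < p_mask + p_random:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise the selected token is left unchanged
    return inputs, targets

tokens = [5, 6, 7, 8, 9] * 20
inputs, targets = mask_tokens(tokens, vocab_size=10, rng=random.Random(0))
```

Positions whose target equals ignore_id are excluded from the loss, so only selected positions contribute to training.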

lncrnapy.train.masked_token_modeling.train_masked_token_modeling(model, train_data, valid_data, epochs, n_samples_per_epoch=None, batch_size=8, p_mlm=0.15, p_mask=0.8, p_random=0.1, warmup_steps=32000, loss_function=None, logger=None, metrics={'Accuracy': <function accuracy_score>, 'Counts': <function <lambda>>, 'F1 (macro)': <function <lambda>>, 'Precision (macro)': <function <lambda>>, 'Recall (macro)': <function <lambda>>})

Trains model for the Masked Language Modeling task, using train_data, for the specified number of epochs. Assumes that sequence data is tokenized (see lncrnapy.features.tokenizers) and that the model is of type MaskedTokenModel.

Parameters:

model (torch.nn.Module | lncrnapy.modules.MaskedTokenModel) – Neural network that is to be trained.
train_data (lncrnapy.data.Data) – Data to use for training, must call set_tensor_features first. After every training epoch, the performance of the model on a subset of the training set is determined. The length of this subset is min(len(train_data), len(valid_data)).
valid_data (lncrnapy.data.Data) – Data to use for validation, must call set_tensor_features first.
epochs (int) – How many epochs (data run-throughs) to train for.
n_samples_per_epoch (int) – If specified, indicates the number of samples per training epoch. If None, will sample the full training set.
batch_size (int) – Number of examples per batch (default is 8).
p_mlm (float) – Probability for a token to be selected for MLM (default is 0.15).
p_mask (float) – Probability for a token to be masked when selected (default is 0.8).
p_random (float) – Probability for a token to be randomly replaced when selected (default is 0.1).
warmup_steps (int) – Number of training steps in which the learning rate linearly increases. After this number of steps, the learning rate decreases proportionally to the inverse square root of the step number (default is 32000).
loss_function (torch.nn.Module) – Loss function that is to be optimized, assuming logits (so no Softmax) and ignore_index=utils.TOKENS['PAD']. Uses torch.nn.CrossEntropyLoss if None (default).
label_smoothing (float) – How much weight should be subtracted from the target token and divided over the remaining tokens, for regularization (default is 0.1).
logger (lncrnapy.train.loggers) – Logger object whose log method will be called at every epoch. If None (default), will use LoggerBase, which only keeps track of the history.
metrics (dict[str:callable]) – Metrics (name + function) that will be evaluated at every epoch.
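The label_smoothing description above can be made concrete with a small worked example. The sketch follows the wording given there: the removed weight is divided over the remaining tokens only. The function name smoothed_target is hypothetical. Note that torch.nn.CrossEntropyLoss spreads its label_smoothing mass uniformly over all classes, including the target, so its numbers differ slightly.

```python
def smoothed_target(target, vocab_size, label_smoothing=0.1):
    """Target distribution after moving label_smoothing mass off the true token."""
    off_value = label_smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target] = 1.0 - label_smoothing
    return dist

# Five-token vocabulary, true token at index 2.
dist = smoothed_target(target=2, vocab_size=5)
```

Here the true token keeps 0.9 of the probability mass and each of the four other tokens receives 0.025.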

Metrics

Contains predefined metrics sets, implemented as dictionaries with metric names as keys, and the corresponding function to calculate them as values. Note that all of these functions assume an input tuple: (y_true, y_pred).
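A custom metric set in the same layout might look like the sketch below. It assumes each function is called as fn(y_true, y_pred) and that class 1 encodes pcRNA; both are assumptions. The helpers are written in plain Python for illustration and are not the functions lncrnapy uses.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive):
    # True labels of all samples that were predicted as the positive class.
    predicted = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted:
        return 0.0
    return sum(t == positive for t in predicted) / len(predicted)

# Hypothetical metric set: metric name -> callable(y_true, y_pred).
my_metrics = {
    "Accuracy": accuracy,
    "Precision (pcRNA)": lambda y_true, y_pred: precision(y_true, y_pred, positive=1),
}

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
results = {name: fn(y_true, y_pred) for name, fn in my_metrics.items()}
```

Wrapping class-specific metrics in a lambda keeps every entry on the same two-argument call convention.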

lncrnapy.train.metrics.classification_metrics = {'Accuracy': <function accuracy_score>, 'F1 (macro)': <function <lambda>>, 'Precision (ncRNA)': <function <lambda>>, 'Precision (pcRNA)': <function <lambda>>, 'Recall (ncRNA)': <function <lambda>>, 'Recall (pcRNA)': <function <lambda>>}

Default lncRNA classification metrics: accuracy, precision and recall (for pcRNA and lncRNA), and F1 (macro-averaged over both classes).

lncrnapy.train.metrics.frame_agreement(y_true, y_pred): Fraction of times in which y_true and y_pred agree in terms of reading frame

lncrnapy.train.metrics.frame_consistency(y_true, y_pred): Fraction of times in which y_pred agrees with itself in terms of reading frame

lncrnapy.train.metrics.mcm_metrics = {'Accuracy': <function accuracy_score>}

Default MCM evaluation metrics: accuracy.

lncrnapy.train.metrics.mtm_metrics = {'Accuracy': <function accuracy_score>, 'Counts': <function <lambda>>, 'F1 (macro)': <function <lambda>>, 'Precision (macro)': <function <lambda>>, 'Recall (macro)': <function <lambda>>}

Default MTM evaluation metrics: accuracy, precision, recall, and F1 (macro-averaged), as well as counts per token.

lncrnapy.train.metrics.orf_prediction_metrics = {'Distribution (ORF (end))': <function <lambda>>, 'Distribution (ORF (start))': <function <lambda>>, 'Frame agreement': <function frame_agreement>, 'Frame consistency': <function frame_consistency>, 'MAE (ORF (end))': <function <lambda>>, 'MAE (ORF (start))': <function <lambda>>, 'RMSE (ORF (end))': <function <lambda>>, 'RMSE (ORF (start))': <function <lambda>>}: Metrics designed for ORF prediction. Includes the RMSE/MAE separately for start and end coordinates of the ORF, as well as frame agreement and consistency.

lncrnapy.train.metrics.regression_distribution(y_true, y_pred): Calculates the distribution of values in y_pred

lncrnapy.train.metrics.regression_metrics = {'Distribution': <function <lambda>>, 'MAE': <function <lambda>>, 'RMSE': <function <lambda>>}: Default regression metrics ((root) mean absolute/squared error)
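For reference, the two error measures named above can be written out in plain Python. This is a sketch of the standard definitions, not the lambdas stored in regression_metrics, and the toy values are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 30.0, 44.0]
mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

RMSE is never smaller than MAE because squaring weights large errors more heavily; in this toy example the single error of 4 pulls RMSE above the MAE of 2.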

Mixed Precision

For automatically en-/disabling mixed precision depending on whether or not cuda is recognized.

class lncrnapy.train.mixed_precision.DummyScaler

A dummy gradient scaler that does not do anything and serves as a placeholder for when gradient scaling is not desired.

scale(loss): Identity function

step(optimizer): Optimizer step

unscale_(optimizer): Empty function

update(): Empty function

lncrnapy.train.mixed_precision.get_amp_args(device): Returns mixed precision keyword arguments (as dictionary) depending on whether or not device.type == 'cuda'.

lncrnapy.train.mixed_precision.get_gradient_scaler(device): Returns gradient scaler or dummy object depending on whether or not device.type == 'cuda'.

Regression

Functions for training a lncrnapy deep learning model for regression.

lncrnapy.train.regression.epoch_regressor(model, dataloader, loss_function, optimizer, scaler): Trains model for a single epoch.

lncrnapy.train.regression.evaluate_regressor(model, data, loss_function, standardizer=None, metrics={'Distribution': <function <lambda>>, 'MAE': <function <lambda>>, 'RMSE': <function <lambda>>}): Simple evaluation function to keep track of in-training progress.

lncrnapy.train.regression.train_regressor(model, train_data, valid_data, epochs, batch_size=64, loss_function=None, optimizer=None, standardizer=None, n_samples_per_epoch=None, logger=None, metrics={'Distribution': <function <lambda>>, 'MAE': <function <lambda>>, 'RMSE': <function <lambda>>})

Trains model for the regression task, using train_data, for the specified number of epochs.

Parameters:

model (torch.nn.Module | lncrnapy.modules.Classifier) – Neural network that is to be trained.
train_data (lncrnapy.data.Data) – Data to use for training, must call set_tensor_features first. After every training epoch, the performance of the model on a subset of the training set is determined. The length of this subset is min(len(train_data), len(valid_data)).
valid_data (lncrnapy.data.Data) – Data to use for validation, must call set_tensor_features first.
epochs (int) – How many epochs (data run-throughs) to train for.
batch_size (int) – Number of examples per batch (default is 64).
loss_function (torch.nn.Module) – Loss function that is to be optimized. If None, falls back to Mean Squared Error loss (torch.nn.MSELoss) (default is None).
optimizer (torch.optim) – Optimizer to update the network’s weights during training. If None (default), will use Adam with learning rate 0.0001.
standardizer (lncrnapy.train.standardizer.Standardizer) – If specified, will use this standardizer to transform the data back to its original scale during epoch evaluation (default is None).
logger (lncrnapy.train.loggers) – Logger object whose log method will be called at every epoch. If None (default), will use LoggerBase, which only keeps track of the history.
metrics (dict[str:callable]) – Metrics (name + function) that will be evaluated at every epoch.
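The standardizer option maps predictions back to the original target scale before metrics are computed. The sketch below assumes z-score standardization; ToyStandardizer and its method names are hypothetical and may differ from lncrnapy.train.standardizer.Standardizer.

```python
import statistics

class ToyStandardizer:
    """Hypothetical z-score standardizer; not lncrnapy's Standardizer."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values):
        # Undo the scaling so errors are expressed in original units.
        return [v * self.std + self.mean for v in values]

targets = [100.0, 200.0, 300.0]
standardizer = ToyStandardizer().fit(targets)
scaled = standardizer.transform(targets)
restored = standardizer.inverse_transform(scaled)
```

Evaluating on restored values keeps metrics such as MAE and RMSE interpretable in the units of the original targets.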