Running Scripts

These scripts use lncRNA-BERT for classification and for generating embeddings, and also support (pre-)training and fine-tuning custom models. For feature-based classification or more advanced use cases, users can write their own scripts using the API. An example notebook is provided here.

Classify

Performs lncRNA classification: labels each RNA sequence as either coding or non-coding.

python -m lncrnapy.scripts.classify [-h] [--model_file MODEL_FILE] [--output_file OUTPUT_FILE] [--encoding_method {cse,bpe,kmer,nuc}] [--bpe_file BPE_FILE] [--k K] [--batch_size BATCH_SIZE] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] fasta_file [fasta_file ...]
Positional arguments:
fasta_file

Path to FASTA file of RNA sequences or pair of paths to two FASTA files containing protein- and non-coding RNAs, respectively. (str)

Optional arguments
-h, --help

Show help message.

--model_file MODEL_FILE

Trained classifier model, specified by the ID of a model hosted on the Hugging Face Hub, or a path to a local directory containing model weights. (str="luukromeijn/lncRNA-BERT-kmer-k3-finetuned")

--output_file OUTPUT_FILE

Name of .csv/.h5 output file. (str)

--encoding_method {cse,bpe,kmer,nuc}

Sequence encoding method. (str="kmer")

--bpe_file BPE_FILE

Path to the BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")

--k K

Specifies k when k-mer encoding is used. (int=3)

--batch_size BATCH_SIZE

Number of samples per prediction step. (int=8)

--context_length CONTEXT_LENGTH

Number of input positions. For cse/k-mer encoding, this translates to a maximum of (768-1)*k input nucleotides. (int=768)

--data_dir DATA_DIR

Parent directory to use for any of the paths specified in these arguments (except for --model_file). (str="")

--results_dir RESULTS_DIR

Parent directory to use for the results folder of this script. (str="")
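
As a usage illustration, the sketch below assembles a classify invocation from Python using only flags documented above, and works out the nucleotide limit implied by --context_length. It rests on a few assumptions: the package is taken to be importable as lncrnapy, the file names sequences.fasta and predictions.csv are hypothetical placeholders, and the helper names are not part of the package. The limit formula generalizes the documented (768-1)*k to other context lengths, which is also an assumption.

```python
import shlex
import sys

# Assumption: the package is importable as "lncrnapy"; adjust if your install differs.
CLASSIFY_MODULE = "lncrnapy.scripts.classify"


def max_input_nucleotides(context_length=768, k=3):
    """Upper bound on input nucleotides for cse/k-mer encoding: (context_length - 1) * k."""
    return (context_length - 1) * k


def build_classify_command(fasta_files, output_file, k=3, batch_size=8):
    """Return an argv list for the classify script, using k-mer encoding."""
    if len(fasta_files) not in (1, 2):
        raise ValueError("expected one FASTA file or a pcRNA/ncRNA pair")
    return [
        sys.executable, "-m", CLASSIFY_MODULE,
        "--output_file", output_file,
        "--encoding_method", "kmer",
        "--k", str(k),
        "--batch_size", str(batch_size),
        *fasta_files,
    ]


# Hypothetical file names, used only for illustration.
command = build_classify_command(["sequences.fasta"], "predictions.csv")
print(shlex.join(command))        # the command line that would be run
print(max_input_nucleotides())    # nucleotide limit for the default settings
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.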

Train

Trains a model for lncRNA classification, either from scratch or by fine-tuning a pre-trained model.

python -m lncrnapy.scripts.train [-h] [--exp_prefix EXP_PREFIX] [--pretrained_model PRETRAINED_MODEL] [--encoding_method {cse,bpe,kmer,nuc}] [--epochs EPOCHS] [--n_samples_per_epoch N_SAMPLES_PER_EPOCH] [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--d_model D_MODEL] [--N N] [--d_ff D_FF] [--h H] [--dropout DROPOUT] [--hidden_cls_layers HIDDEN_CLS_LAYERS [HIDDEN_CLS_LAYERS ...]] [--n_kernels N_KERNELS] [--kernel_size KERNEL_SIZE] [--bpe_file BPE_FILE] [--k K] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] [--model_dir MODEL_DIR] [--no_weighted_loss] [--no_random_reading_frame] [--freeze_network] [--freeze_kernels] [--input_linear] [--no_input_relu] fasta_pcrna_train fasta_ncrna_train fasta_pcrna_valid fasta_ncrna_valid
Positional arguments:
fasta_pcrna_train

Path to FASTA file with pcRNA training sequences. (str)

fasta_ncrna_train

Path to FASTA file with ncRNA training sequences. (str)

fasta_pcrna_valid

Path to FASTA file with pcRNA sequences used for validating the model after every epoch. (str)

fasta_ncrna_valid

Path to FASTA file with ncRNA sequences used for validating the model after every epoch. (str)

Optional arguments
-h, --help

Show help message.

--exp_prefix EXP_PREFIX

Added prefix to model/experiment name. (str)

--pretrained_model PRETRAINED_MODEL

If specified, fine-tunes this pre-trained model instead of training one from scratch. Note that this causes model-related hyperparameters, such as d_model and N, to be ignored. Specified by the ID of a model hosted on the Hugging Face Hub, or a path to a local directory containing model weights. (str="")

--encoding_method {cse,bpe,kmer,nuc}

Sequence encoding method. (str="cse")

--epochs EPOCHS

Number of epochs to train for. (int=100)

--n_samples_per_epoch N_SAMPLES_PER_EPOCH

Number of training samples per epoch. (int=10000)

--batch_size BATCH_SIZE

Number of samples per optimization step. (int=8)

--learning_rate LEARNING_RATE

Learning rate used by Adam optimizer. (float=1e-5)

--weight_decay WEIGHT_DECAY

Weight decay used by Adam optimizer. (float=0.0)

--d_model D_MODEL

BERT embedding dimensionality. (int=768)

--N N

Number of BERT transformer blocks. (int=12)

--d_ff D_FF

Number of nodes in BERT FFN sublayers. (int=4*d_model)

--h H

Number of BERT self-attention heads. (int=int(d_model/64))

--dropout DROPOUT

Dropout probability in CLS output head. (float=0)

--hidden_cls_layers HIDDEN_CLS_LAYERS [HIDDEN_CLS_LAYERS ...]

Space-separated list with number of hidden nodes in ReLU-activated classification head layers. (int=[])

--n_kernels N_KERNELS

Specifies number of kernels when convolutional sequence encoding is used. (int=768)

--kernel_size KERNEL_SIZE

Specifies kernel size when convolutional sequence encoding is used. (int=9)

--bpe_file BPE_FILE

Path to the BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")

--k K

Specifies k when k-mer encoding is used. (int=6)

--context_length CONTEXT_LENGTH

Number of input positions. For cse/k-mer encoding, this translates to a maximum of (768-1)*k input nucleotides. (int=768)

--data_dir DATA_DIR

Parent directory to use for any of the paths specified in these arguments. (str="")

--results_dir RESULTS_DIR

Parent directory to use for the results folder of this script. (str="")

--model_dir MODEL_DIR

Directory in which to save the trained model. The model with the highest macro F1-score on the validation dataset is saved. (str=f"{data_dir}/models")

--no_weighted_loss

Turns off the weighted loss that corrects for pcRNA/ncRNA class imbalance. (bool)

--no_random_reading_frame

Turns off sampling in random reading frame for convolutional sequence encoding. (bool)

--freeze_network

Freezes all weights from the pre-trained model and bases the classification on the mean embeddings of this model. This only works with the --pretrained_model flag. (bool)

--freeze_kernels

Freezes all convolutional sequence encoding weights from the pre-trained model. This only works with the --pretrained_model flag. (bool)

--input_linear

Forces linear projection of kernels onto d_model dimensions in convolutional sequence encoding. (bool)

--no_input_relu

Turns off ReLU activation of kernels in convolutional sequence encoding. (bool)
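
The defaults for --d_ff and --h are derived from --d_model rather than fixed. A minimal sketch of that derivation follows; the helper name resolve_model_dims is hypothetical and not part of the package, and only the two formulas come from the option descriptions above.

```python
def resolve_model_dims(d_model=768, d_ff=None, h=None):
    """Fill in d_ff and h from d_model when they are not given explicitly."""
    if d_ff is None:
        d_ff = 4 * d_model        # documented default: 4*d_model
    if h is None:
        h = int(d_model / 64)     # documented default: int(d_model/64)
    return {"d_model": d_model, "d_ff": d_ff, "h": h}


print(resolve_model_dims())              # defaults for d_model=768
print(resolve_model_dims(d_model=256))   # a smaller model
print(resolve_model_dims(d_model=256, h=8))  # explicit values win over derived ones
```

Keep in mind that these model-related hyperparameters are ignored when --pretrained_model is given.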

Pre-Train

Pre-training script for a Nucleotide Language Model. Several encoding methods and hyperparameter settings are supported.

python -m lncrnapy.scripts.pretrain [-h] [--exp_prefix EXP_PREFIX] [--encoding_method {cse,bpe,kmer,nuc}] [--epochs EPOCHS] [--n_samples_per_epoch N_SAMPLES_PER_EPOCH] [--batch_size BATCH_SIZE] [--warmup_steps WARMUP_STEPS] [--d_model D_MODEL] [--N N] [--d_ff D_FF] [--h H] [--dropout DROPOUT] [--n_kernels N_KERNELS] [--kernel_size KERNEL_SIZE] [--bpe_file BPE_FILE] [--k K] [--p_mlm P_MLM] [--p_mask P_MASK] [--p_random P_RANDOM] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] [--model_dir MODEL_DIR] [--mask_size MASK_SIZE] [--no_random_reading_frame] [--input_linear] [--no_input_relu] [--no_output_linear] [--output_relu] fasta_train fasta_valid
Positional arguments:
fasta_train

Path to FASTA file with pre-training sequences. (str)

fasta_valid

Path to FASTA file with sequences to use for validating model performance after every epoch. (str)

Optional arguments
-h, --help

Show help message.

--exp_prefix EXP_PREFIX

Added prefix to model/experiment name. (str="MLM")

--encoding_method {cse,bpe,kmer,nuc}

Sequence encoding method. (str="cse")

--epochs EPOCHS

Number of epochs to pre-train for. (int=500)

--n_samples_per_epoch N_SAMPLES_PER_EPOCH

Number of training samples per epoch. (int=10000)

--batch_size BATCH_SIZE

Number of samples per optimization step. (int=8)

--warmup_steps WARMUP_STEPS

Number of optimization steps during which the learning rate increases linearly. After this number of steps, the learning rate decreases proportionally to the inverse square root of the step number. (int=8)

--d_model D_MODEL

BERT embedding dimensionality. (int=768)

--N N

Number of BERT transformer blocks. (int=12)

--d_ff D_FF

Number of nodes in BERT FFN sublayers. (int=4*d_model)

--h H

Number of BERT self-attention heads. (int=int(d_model/64))

--dropout DROPOUT

Dropout probability in MLM output head. (float=0)

--n_kernels N_KERNELS

Specifies number of kernels when convolutional sequence encoding is used. (int=768)

--kernel_size KERNEL_SIZE

Specifies kernel size when convolutional sequence encoding is used. (int=9)

--bpe_file BPE_FILE

Path to the BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")

--k K

Specifies k when k-mer encoding is used. (int=6)

--p_mlm P_MLM

Selection probability per token/nucleotide in MLM. (float=0.15)

--p_mask P_MASK

Mask probability for selected token/nucleotide. (float=0.8)

--p_random P_RANDOM

Random replacement chance per token/nucleotide. (float=0.1)

--context_length CONTEXT_LENGTH

Number of input positions. For cse/k-mer encoding, this translates to a maximum of (768-1)*k input nucleotides. (int=768)

--data_dir DATA_DIR

Parent directory to use for any of the paths specified in these arguments. (str="")

--results_dir RESULTS_DIR

Parent directory to use for the results folder of this script. (str="")

--model_dir MODEL_DIR

Directory in which to save the pre-trained model. The model with the highest accuracy on the validation dataset is saved. (str=f"{data_dir}/models")

--mask_size MASK_SIZE

Number of contiguous nucleotides that make up a mask. (int=1)

--no_random_reading_frame

Turns off sampling in random reading frame for convolutional sequence encoding. (bool)

--input_linear

Forces linear projection of kernels onto d_model dimensions in convolutional sequence encoding. (bool)

--no_input_relu

Turns off ReLU activation of kernels in convolutional sequence encoding. (bool)

--no_output_linear

Turns off linear projection of embeddings onto n_kernels dimensions before the masked convolution output layer. (bool)

--output_relu

Forces ReLU activation of embeddings before masked convolution output layer. (bool)
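
Two of the mechanics described above, the learning-rate schedule behind --warmup_steps and the token corruption controlled by --p_mlm, --p_mask and --p_random, can be sketched in plain Python. This is an illustration under stated assumptions, not the script's implementation: the schedule is normalized to peak at 1.0 because the absolute learning-rate scale is not documented here, p_mask and p_random are interpreted BERT-style as probabilities applied to already-selected tokens, and the function names and the "[MASK]" token string are hypothetical.

```python
import math
import random


def lr_scale(step, warmup_steps):
    """Relative learning rate: linear warmup, then inverse-square-root decay.

    Normalized so the peak value 1.0 is reached at step == warmup_steps.
    """
    if step < 1:
        raise ValueError("step counts from 1")
    if step <= warmup_steps:
        return step / warmup_steps
    return math.sqrt(warmup_steps / step)


def mask_tokens(tokens, vocab, mask_token="[MASK]",
                p_mlm=0.15, p_mask=0.8, p_random=0.1, rng=None):
    """Return (corrupted tokens, selected positions) for masked language modeling."""
    if rng is None:
        rng = random.Random()
    corrupted, selected = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() >= p_mlm:
            continue                          # not selected for prediction
        selected.append(i)
        draw = rng.random()
        if draw < p_mask:
            corrupted[i] = mask_token         # replace with the mask token
        elif draw < p_mask + p_random:
            corrupted[i] = rng.choice(vocab)  # replace with a random token
        # otherwise the selected token is kept unchanged
    return corrupted, selected


# Schedule shape for a short warmup of 4 steps.
print([round(lr_scale(step, 4), 3) for step in (1, 2, 4, 16)])

# Corrupt a toy nucleotide sequence with a fixed seed for repeatability.
nucleotides = list("ACGU")
print(mask_tokens(list("ACGUACGUACGU"), nucleotides, rng=random.Random(0)))
```

With the default probabilities, a selected token is masked 80% of the time, randomly replaced 10% of the time, and left as-is otherwise.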

Generate Embeddings

Computes sequence embeddings for the input dataset using the specified model.

python -m lncrnapy.scripts.embeddings [-h] [--model_file MODEL_FILE] [--output_file OUTPUT_FILE] [--output_plot_file OUTPUT_PLOT_FILE] [--encoding_method {cse,bpe,kmer,nuc}] [--bpe_file BPE_FILE] [--k K] [--pooling {CLS,mean,max}] [--dim_red {tsne,pca,umap,None}] [--batch_size BATCH_SIZE] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] fasta_file [fasta_file ...]
Positional arguments:
fasta_file

Path to FASTA file of RNA sequences or pair of paths to two FASTA files containing protein- and non-coding RNAs, respectively. (str)

Optional arguments
-h, --help

Show help message.

--model_file MODEL_FILE

(Pre-)trained model, specified by the ID of a model hosted on the Hugging Face Hub, or a path to a local directory containing model weights. (str="luukromeijn/lncRNA-BERT-kmer-k3-pretrained")

--output_file OUTPUT_FILE

Name of .csv/.h5 output file. (str)

--output_plot_file OUTPUT_PLOT_FILE

If specified, plots the first two dimensions of the (reduced) sequence embeddings and saves them to this file. (str)

--encoding_method {cse,bpe,kmer,nuc}

Sequence encoding method. (str="kmer")

--bpe_file BPE_FILE

Path to the BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")

--k K

Specifies k when k-mer encoding is used. (int=3)

--pooling {CLS,mean,max}

Type of pooling to apply. If "CLS", will extract embeddings from the CLS token. (str="mean")

--dim_red {tsne,pca,umap,None}

Type of dimensionality reduction to apply to retrieved embeddings. If None, will not reduce dimensions. (str="tsne")

--batch_size BATCH_SIZE

Number of samples per prediction step. (int=8)

--context_length CONTEXT_LENGTH

Number of input positions. For cse/k-mer encoding, this translates to a maximum of (768-1)*k input nucleotides. (int=768)

--data_dir DATA_DIR

Parent directory to use for any of the paths specified in these arguments (except for --model_file). (str="")

--results_dir RESULTS_DIR

Parent directory to use for the results folder of this script. (str="")
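
To make the --pooling choices concrete, here is a small pure-Python sketch of how per-token embeddings can be reduced to one vector per sequence. It is an illustration rather than the script's code: the function name pool_embeddings is hypothetical, the CLS token is assumed to sit at position 0, and mean/max are taken over every vector passed in (whether the script excludes CLS or padding positions is not stated here).

```python
def pool_embeddings(token_embeddings, pooling="mean"):
    """Reduce a list of per-token vectors to one sequence-level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    if pooling == "CLS":
        return list(token_embeddings[0])    # assumption: CLS token is at position 0
    columns = list(zip(*token_embeddings))  # one tuple per embedding dimension
    if pooling == "mean":
        return [sum(col) / len(col) for col in columns]
    if pooling == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling: {pooling!r}")


# Three tokens with two-dimensional embeddings (toy values).
tokens = [[1.0, 0.0], [3.0, 4.0], [2.0, -1.0]]
for method in ("CLS", "mean", "max"):
    print(method, pool_embeddings(tokens, method))
```

Whatever the pooling choice, the result has the same dimensionality as a single token embedding, which is what the optional dimensionality reduction then operates on.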

Byte Pair Encoding

Fits a Byte Pair Encoding (BPE) model to a dataset.

python -m lncrnapy.scripts.bpe [-h] [--bpe_file BPE_FILE] [--data_dir DATA_DIR] fasta_train vocab_size
Positional arguments:
fasta_train

Path to FASTA file of RNA sequences to be used for fitting the BPE model. (str)

vocab_size

Pre-defined number of tokens in vocabulary. (str)

Optional arguments
-h, --help

Show help message.

--bpe_file BPE_FILE

Name of BPE output file. (str=f"features/{vocab_size}.bpe")

--data_dir DATA_DIR

Parent directory to use for any of the paths specified in these arguments. (str="")
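
For readers unfamiliar with what fitting a BPE model to a vocabulary size involves, the toy sketch below shows the general idea on nucleotide strings: start from single nucleotides and repeatedly merge the most frequent adjacent token pair until the vocabulary reaches the requested size. It is a generic illustration of byte pair encoding, not the implementation used by the script; fit_toy_bpe and merge_pair are hypothetical names, and the tie-breaking rule is an arbitrary choice made for determinism.

```python
from collections import Counter


def merge_pair(seq, a, b):
    """Replace every adjacent occurrence of (a, b) in seq with the merged token a+b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def fit_toy_bpe(sequences, vocab_size):
    """Return (vocabulary, merges) learned by greedy pair merging."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    vocab = sorted({tok for seq in corpus for tok in seq})
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))    # count adjacent token pairs
        if not pairs:
            break                              # nothing left to merge
        # Most frequent pair; ties broken by the pair itself for determinism.
        (a, b), _ = max(pairs.items(), key=lambda item: (item[1], item[0]))
        merges.append((a, b))
        if a + b not in vocab:
            vocab.append(a + b)
        corpus = [merge_pair(seq, a, b) for seq in corpus]
    return vocab, merges


vocab, merges = fit_toy_bpe(["ACGACGACG", "ACGT"], vocab_size=6)
print(vocab)
print(merges)
```

A larger vocab_size yields longer tokens and therefore shorter tokenized sequences, which interacts with the --context_length limit of the other scripts.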