Running Scripts
Scripts are provided for using lncRNA-BERT for classification and for generating embeddings, as well as for (pre-)training and fine-tuning custom models. For feature-based classification or more advanced use cases, users can write their own custom scripts using the API. An example notebook is provided here.
Classify
Classifies RNA sequences as either coding or non-coding.
python -m lncnrapy.scripts.classify [-h] [--model_file MODEL_FILE] [--output_file OUTPUT_FILE] [--encoding_method {cse,bpe,kmer,nuc}] [--bpe_file BPE_FILE] [--k K] [--batch_size BATCH_SIZE] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] fasta_file [fasta_file ...]
- Positional arguments:
- fasta_file
Path to FASTA file of RNA sequences or pair of paths to two FASTA files containing protein- and non-coding RNAs, respectively. (str)
- Optional arguments
- -h, --help
Show help message.
- --model_file MODEL_FILE
Trained classifier model, specified by the id of a model hosted on the HuggingFace Hub, or a path to a local directory containing model weights. (str="luukromeijn/lncRNA-BERT-kmer-k3-finetuned")
- --output_file OUTPUT_FILE
Name of .csv/.h5 output file. (str)
- --encoding_method {cse,bpe,kmer,nuc}
Sequence encoding method. (str="kmer")
- --bpe_file BPE_FILE
Filepath to BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")
- --k K
Specifies k when k-mer tokenization is used. (int=3)
- --batch_size BATCH_SIZE
Number of samples per prediction step. (int=8)
- --context_length CONTEXT_LENGTH
Number of input positions. For cse/k-mer encoding, this translates to a maximum of (context_length - 1) * k input nucleotides. (int=768)
- --data_dir DATA_DIR
Parent directory to use for any of the paths specified in these arguments (except for --model_file). (str="")
- --results_dir RESULTS_DIR
Parent directory to use for the results folder of this script. (str="")
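As a usage sketch (the input and output file names below are placeholders, not part of the package), classifying a FASTA file with the default fine-tuned k-mer model might look like:

```shell
# Classify sequences in rna.fasta with the default model
# (luukromeijn/lncRNA-BERT-kmer-k3-finetuned, fetched from the
# HuggingFace Hub) and write predictions to predictions.csv.
python -m lncnrapy.scripts.classify \
    --output_file predictions.csv \
    --batch_size 16 \
    rna.fasta
```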
Train
Trains (or fine-tunes) a model (optionally pre-trained) for lncRNA classification.
python -m lncnrapy.scripts.train [-h] [--exp_prefix EXP_PREFIX] [--pretrained_model PRETRAINED_MODEL] [--encoding_method {cse,bpe,kmer,nuc}] [--epochs EPOCHS] [--n_samples_per_epoch N_SAMPLES_PER_EPOCH] [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--d_model D_MODEL] [--N N] [--d_ff D_FF] [--h H] [--dropout DROPOUT] [--hidden_cls_layers HIDDEN_CLS_LAYERS [HIDDEN_CLS_LAYERS ...]] [--n_kernels N_KERNELS] [--kernel_size KERNEL_SIZE] [--bpe_file BPE_FILE] [--k K] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] [--model_dir MODEL_DIR] [--no_weighted_loss] [--no_random_reading_frame] [--freeze_network] [--freeze_kernels] [--input_linear] [--no_input_relu] fasta_pcrna_train fasta_ncrna_train fasta_pcrna_valid fasta_ncrna_valid
- Positional arguments:
- fasta_pcrna_train
Path to FASTA file with pcRNA training sequences. (str)
- fasta_ncrna_train
Path to FASTA file with ncRNA training sequences. (str)
- fasta_pcrna_valid
Path to FASTA file with pcRNA sequences used for validating the model after every epoch. (str)
- fasta_ncrna_valid
Path to FASTA file with ncRNA sequences used for validating the model after every epoch. (str)
- Optional arguments
- -h, --help
Show help message.
- --exp_prefix EXP_PREFIX
Added prefix to model/experiment name. (str)
- --pretrained_model PRETRAINED_MODEL
If specified, fine-tunes this pre-trained model instead of training one from scratch. Note that this causes model-related hyperparameters, such as d_model and N, to be ignored. Specified by the id of a model hosted on the HuggingFace Hub, or a path to a local directory containing model weights. (str="")
- --encoding_method {cse,bpe,kmer,nuc}
Sequence encoding method. (str="cse")
- --epochs EPOCHS
Number of epochs to train for. (int=100)
- --n_samples_per_epoch N_SAMPLES_PER_EPOCH
Number of training samples per epoch. (int=10000)
- --batch_size BATCH_SIZE
Number of samples per optimization step. (int=8)
- --learning_rate LEARNING_RATE
Learning rate used by Adam optimizer. (float=1e-5)
- --weight_decay WEIGHT_DECAY
Weight decay used by Adam optimizer. (float=0.0)
- --d_model D_MODEL
BERT embedding dimensionality. (int=768)
- --N N
Number of BERT transformer blocks. (int=12)
- --d_ff D_FF
Number of nodes in BERT FFN sublayers. (int=4*d_model)
- --h H
Number of BERT self-attention heads. (int=int(d_model/64))
- --dropout DROPOUT
Dropout probability in CLS output head. (float=0)
- --hidden_cls_layers HIDDEN_CLS_LAYERS [HIDDEN_CLS_LAYERS ...]
Space-separated list with number of hidden nodes in ReLU-activated classification head layers. (int=[])
- --n_kernels N_KERNELS
Specifies number of kernels when convolutional sequence encoding is used. (int=768)
- --kernel_size KERNEL_SIZE
Specifies kernel size when convolutional sequence encoding is used. (int=9)
- --bpe_file BPE_FILE
Filepath to BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")
- --k K
Specifies k when k-mer encoding is used. (int=6)
- --context_length CONTEXT_LENGTH
Number of input positions. For cse/k-mer encoding, this translates to a maximum of (context_length - 1) * k input nucleotides. (int=768)
- --data_dir DATA_DIR
Parent directory to use for any of the paths specified in these arguments. (str="")
- --results_dir RESULTS_DIR
Parent directory to use for the results folder of this script. (str="")
- --model_dir MODEL_DIR
Directory to save the trained model to. The model with the highest macro F1-score on the validation dataset is saved. (str=f"{data_dir}/models")
- --no_weighted_loss
Turns off the loss weighting that corrects for the pcRNA/ncRNA class imbalance. (bool)
- --no_random_reading_frame
Turns off sampling in a random reading frame for convolutional sequence encoding. (bool)
- --freeze_network
Freezes all weights of the pre-trained model and bases the classification on the mean embeddings of this model. Only works with the --pretrained_model flag. (bool)
- --freeze_kernels
Freezes all convolutional sequence encoding weights from the pre-trained model. Only works with the --pretrained_model flag. (bool)
- --input_linear
Forces linear projection of kernels onto d_model dimensions in convolutional sequence encoding. (bool)
- --no_input_relu
Turns off ReLU activation of kernels in convolutional sequence encoding. (bool)
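A fine-tuning sketch (all FASTA file names are placeholders; the model id is the pre-trained checkpoint referenced elsewhere in these docs, used here illustratively):

```shell
# Fine-tune the pre-trained k-mer (k=3) model for coding/non-coding
# classification; matching the encoding method to the checkpoint.
python -m lncnrapy.scripts.train \
    --pretrained_model luukromeijn/lncRNA-BERT-kmer-k3-pretrained \
    --encoding_method kmer --k 3 \
    --epochs 10 --batch_size 8 --learning_rate 1e-5 \
    pcrna_train.fasta ncrna_train.fasta \
    pcrna_valid.fasta ncrna_valid.fasta
```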
Pre-Train
Pre-training script for a Nucleotide Language Model. Several encoding methods and hyperparameter settings are supported.
python -m lncnrapy.scripts.pretrain [-h] [--exp_prefix EXP_PREFIX] [--encoding_method {cse,bpe,kmer,nuc}] [--epochs EPOCHS] [--n_samples_per_epoch N_SAMPLES_PER_EPOCH] [--batch_size BATCH_SIZE] [--warmup_steps WARMUP_STEPS] [--d_model D_MODEL] [--N N] [--d_ff D_FF] [--h H] [--dropout DROPOUT] [--n_kernels N_KERNELS] [--kernel_size KERNEL_SIZE] [--bpe_file BPE_FILE] [--k K] [--p_mlm P_MLM] [--p_mask P_MASK] [--p_random P_RANDOM] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] [--model_dir MODEL_DIR] [--mask_size MASK_SIZE] [--no_random_reading_frame] [--input_linear] [--no_input_relu] [--no_output_linear] [--output_relu] fasta_train fasta_valid
- Positional arguments:
- fasta_train
Path to FASTA file with pre-training sequences. (str)
- fasta_valid
Path to FASTA file with sequences to use for validating model performance after every epoch. (str)
- Optional arguments
- -h, --help
Show help message.
- --exp_prefix EXP_PREFIX
Added prefix to model/experiment name. (str="MLM")
- --encoding_method {cse,bpe,kmer,nuc}
Sequence encoding method. (str="cse")
- --epochs EPOCHS
Number of epochs to pre-train for. (int=500)
- --n_samples_per_epoch N_SAMPLES_PER_EPOCH
Number of training samples per epoch. (int=10000)
- --batch_size BATCH_SIZE
Number of samples per optimization step. (int=8)
- --warmup_steps WARMUP_STEPS
Number of optimization steps during which the learning rate increases linearly. After this number of steps, the learning rate decreases proportionally to the inverse square root of the step number. (int=8)
- --d_model D_MODEL
BERT embedding dimensionality. (int=768)
- --N N
Number of BERT transformer blocks. (int=12)
- --d_ff D_FF
Number of nodes in BERT FFN sublayers. (int=4*d_model)
- --h H
Number of BERT self-attention heads. (int=int(d_model/64))
- --dropout DROPOUT
Dropout probability in MLM output head. (float=0)
- --n_kernels N_KERNELS
Specifies number of kernels when convolutional sequence encoding is used. (int=768)
- --kernel_size KERNEL_SIZE
Specifies kernel size when convolutional sequence encoding is used. (int=9)
- --bpe_file BPE_FILE
Filepath to BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")
- --k K
Specifies k when k-mer encoding is used. (int=6)
- --p_mlm P_MLM
Selection probability per token/nucleotide in MLM. (float=0.15)
- --p_mask P_MASK
Mask probability for selected token/nucleotide. (float=0.8)
- --p_random P_RANDOM
Random replacement chance per token/nucleotide. (float=0.1)
- --context_length CONTEXT_LENGTH
Number of input positions. For cse/k-mer encoding, this translates to a maximum of (context_length - 1) * k input nucleotides. (int=768)
- --data_dir DATA_DIR
Parent directory to use for any of the paths specified in these arguments. (str="")
- --results_dir RESULTS_DIR
Parent directory to use for the results folder of this script. (str="")
- --model_dir MODEL_DIR
Directory to save the pre-trained model to. The model with the highest accuracy on the validation dataset is saved. (str=f"{data_dir}/models")
- --mask_size MASK_SIZE
Number of contiguous nucleotides that make up a mask. (int=1)
- --no_random_reading_frame
Turns off sampling in a random reading frame for convolutional sequence encoding. (bool)
- --input_linear
Forces linear projection of kernels onto d_model dimensions in convolutional sequence encoding. (bool)
- --no_input_relu
Turns off ReLU activation of kernels in convolutional sequence encoding. (bool)
- --no_output_linear
Forces linear projection of embeddings onto n_kernels dimensions before masked convolution output layer. (bool)
- --output_relu
Forces ReLU activation of embeddings before masked convolution output layer. (bool)
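A pre-training sketch (FASTA paths are placeholders; d_model and N are reduced below the defaults here purely to illustrate overriding model hyperparameters):

```shell
# Pre-train a smaller BERT from scratch with convolutional sequence
# encoding (CSE), using the documented kernel defaults explicitly.
python -m lncnrapy.scripts.pretrain \
    --encoding_method cse --n_kernels 768 --kernel_size 9 \
    --d_model 256 --N 4 \
    --epochs 100 --batch_size 8 \
    train.fasta valid.fasta
```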
Generate Embeddings
Retrieves sequence embeddings from the specified model for the input dataset.
python -m lncnrapy.scripts.embeddings [-h] [--model_file MODEL_FILE] [--output_file OUTPUT_FILE] [--output_plot_file OUTPUT_PLOT_FILE] [--encoding_method {cse,bpe,kmer,nuc}] [--bpe_file BPE_FILE] [--k K] [--pooling {CLS,mean,max}] [--dim_red {tsne,pca,umap,None}] [--batch_size BATCH_SIZE] [--context_length CONTEXT_LENGTH] [--data_dir DATA_DIR] [--results_dir RESULTS_DIR] fasta_file [fasta_file ...]
- Positional arguments:
- fasta_file
Path to FASTA file of RNA sequences or pair of paths to two FASTA files containing protein- and non-coding RNAs, respectively. (str)
- Optional arguments
- -h, --help
Show help message.
- --model_file MODEL_FILE
(Pre-)trained model, specified by the id of a model hosted on the HuggingFace Hub, or a path to a local directory containing model weights. (str="luukromeijn/lncRNA-BERT-kmer-k3-pretrained")
- --output_file OUTPUT_FILE
Name of .csv/.h5 output file. (str)
- --output_plot_file OUTPUT_PLOT_FILE
If specified, plots the first two dimensions of the (reduced) sequence embeddings and saves them to this file. (str)
- --encoding_method {cse,bpe,kmer,nuc}
Sequence encoding method. (str="kmer")
- --bpe_file BPE_FILE
Filepath to BPE model generated with the BPE script. Required when Byte Pair Encoding is used. (str="")
- --k K
Specifies k when k-mer tokenization is used. (int=3)
- --pooling {CLS,mean,max}
Type of pooling to apply. If "CLS", will extract embeddings from the CLS token. (str="mean")
- --dim_red {tsne,pca,umap,None}
Type of dimensionality reduction to apply to the retrieved embeddings. If None, dimensions are not reduced. (str="tsne")
- --batch_size BATCH_SIZE
Number of samples per prediction step. (int=8)
- --context_length CONTEXT_LENGTH
Number of input positions. For cse/k-mer encoding, this translates to a maximum of (context_length - 1) * k input nucleotides. (int=768)
- --data_dir DATA_DIR
Parent directory to use for any of the paths specified in these arguments (except for --model_file). (str="")
- --results_dir RESULTS_DIR
Parent directory to use for the results folder of this script. (str="")
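An embedding-extraction sketch (the pcRNA/ncRNA FASTA pair and the output file names are placeholders):

```shell
# Extract mean-pooled embeddings with the default pre-trained model,
# reduce them to 2D with t-SNE, and save a scatter plot.
python -m lncnrapy.scripts.embeddings \
    --output_file embeddings.h5 \
    --output_plot_file embeddings.png \
    --pooling mean --dim_red tsne \
    pcrna.fasta ncrna.fasta
```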
Byte Pair Encoding
Fits a Byte Pair Encoding (BPE) model to a dataset.
python -m lncnrapy.scripts.bpe [-h] [--bpe_file BPE_FILE] [--data_dir DATA_DIR] fasta_train vocab_size
- Positional arguments:
- fasta_train
Path to FASTA file of RNA sequences to be used for fitting the BPE model. (str)
- vocab_size
Pre-defined number of tokens in vocabulary. (str)
- Optional arguments
- -h, --help
Show help message.
- --bpe_file BPE_FILE
Name of BPE output file. (str=f"features/{vocab_size}.bpe")
- --data_dir DATA_DIR
Parent directory to use for any of the paths specified in these arguments. (str="")
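A sketch combining the BPE and Classify scripts (file names and the vocabulary size of 1024 are placeholders; the output path follows the documented default f"features/{vocab_size}.bpe"):

```shell
# Fit a BPE model with a 1024-token vocabulary on the training data...
python -m lncnrapy.scripts.bpe train.fasta 1024

# ...then classify with Byte Pair Encoding, pointing at the fitted model.
python -m lncnrapy.scripts.classify \
    --encoding_method bpe --bpe_file features/1024.bpe \
    rna.fasta
```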