From: Protein sequence analysis in the context of drug repurposing
| | One-Hot Encoder | Sequence Graph Transform | SeqVec | ProtTrans (BERT) |
|---|---|---|---|---|
| Method | Binary encoding | Distance encoding | DL bidirectional contextual LM | DL masked contextual LM |
| Input | Sequence token list | Sequence token list | Encoded token list | Encoded token list |
| Output | 1 tensor | 1 feature vector | 3 tensors (1 per layer) | 1 tensor |
| Embedding feature vector size | sequence length × 21 | 441 | sequence length × 1024 | sequence length × 1024 |
| Implementation libraries | - | sgt package | PyTorch / AllenNLP | PyTorch / TensorFlow |
| Architecture | - | - | biLSTM (recurrent NN) | Transformer |
| Layers (nodes) | - | - | 1 CNN (1024) + 2 biLSTM (1024 nodes each) | 30 stacked Transformer encoder layers |
| Supervision | - | - | - | Available via fine-tuning on structure and localization tasks |
| Pretrained models | - | - | Yes | Yes |
| Databases used in pretraining | - | - | UniRef50 (33M sequences) | Big Fantastic Database (BFD) |
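The one-hot column of the table can be illustrated with a minimal sketch: each residue becomes a binary row over a 21-symbol alphabet (20 standard amino acids plus one catch-all), yielding the (sequence length × 21) matrix stated above. The alphabet string and the use of `X` as the catch-all symbol are assumptions for illustration, not a prescribed convention.

```python
import numpy as np

# Assumed 21-symbol alphabet: 20 standard amino acids + 'X' for any
# non-standard or unknown residue (the catch-all choice is illustrative).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a protein sequence as a (len(seq), 21) binary matrix,
    one row per residue, with a single 1 in the residue's column."""
    idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    mat = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        # Unknown residues fall back to the 'X' column.
        mat[pos, idx.get(aa, idx["X"])] = 1.0
    return mat
```

Each row sums to exactly 1, so the encoding carries no notion of residue similarity or context, which is the main limitation the learned embeddings in the other columns address.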
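The 441-dimensional SGT vector in the table (21 × 21 ordered symbol pairs) can be sketched in simplified form: for each ordered pair of alphabet symbols, accumulate an exponentially decaying weight over all position pairs where the first symbol precedes the second, then normalize by the pair count. This is a hedged approximation of the transform's idea, not the `sgt` package's exact implementation; the `kappa` decay parameter and the normalization are illustrative assumptions.

```python
import numpy as np

# Assumed 21-symbol alphabet (20 amino acids + 'X'), matching the
# 21 x 21 = 441 feature size quoted in the table.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"

def sgt_features(seq: str, kappa: float = 1.0) -> np.ndarray:
    """Simplified Sequence Graph Transform: for every ordered symbol
    pair (u, v), sum exp(-kappa * (j - i)) over positions i < j with
    seq[i] == u and seq[j] == v, normalized by the number of such
    pairs. Returns a flat 441-dimensional feature vector."""
    idx = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    weights = np.zeros((21, 21))
    counts = np.zeros((21, 21))
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            u, v = idx[seq[i]], idx[seq[j]]
            weights[u, v] += np.exp(-kappa * (j - i))
            counts[u, v] += 1
    # Average the decayed weights where pairs exist; zero elsewhere.
    phi = np.divide(weights, counts, out=np.zeros_like(weights),
                    where=counts > 0)
    return phi.ravel()
```

Unlike the per-residue encodings, this produces one fixed-size vector per sequence regardless of length, which is why the table lists a constant 441 rather than a length-dependent size.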