Table 1 Characteristics of the final selected embedding methods. Rows describe the method (the encoding strategy used, as established in the "Embedding sequence data to numerical vectors" section), the input and output (the data types consumed and produced by each method), and the embedding feature size (the dimensions of the final numeric representations). A description of the model architectures and the open code libraries used for embedding generation is also given. Some methods provide pre-trained models; the databases of protein sequences used in those cases are listed as well.

From: Protein sequence analysis in the context of drug repurposing
|   | One-Hot Encoder | Sequence Graph Transform | SeqVec | BERT ProtTrans |
|---|---|---|---|---|
| Method | Binary encoding | Distance encoding | DL bidirectional contextual LM | DL masked contextual LM |
| Input | Sequence token list | Sequence token list | Encoded token list | Encoded token list |
| Output | 1 tensor | 1 feature vector | 3 tensors (1 per layer) | 1 tensor |
| Embedding feature vector size | sequence length × 21 | 441 | sequence length × 1024 | sequence length × 1024 |
| Implementation libraries | - | sgt package | PyTorch / AllenNLP | PyTorch / TensorFlow |
| Architecture | - | - | biLSTM (recurrent neural network) | Transformer |
| Layers (nodes) | - | - | 1 CNN (1024) + 2 biLSTM (1024 nodes each) | 30 stacked encoder layers |
| Supervision | - | - | - | Available (structural/localization supervision) |
| Pre-trained models | - | - | Yes | Yes |
| Databases used in pre-training | - | - | UniRef50 (33M sequences) | BFD (Big Fantastic Database) |
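To make the two leftmost size entries concrete: the one-hot tensor is sequence length × 21 because every residue position carries a binary indicator over a 21-symbol alphabet, while the Sequence Graph Transform's fixed 441-dimensional vector is length-independent, since 441 = 21 × 21 gives one feature per ordered pair of alphabet symbols. The following is a minimal NumPy sketch of the one-hot scheme; the use of "X" as the 21st catch-all symbol (and the `one_hot` helper itself) is an illustrative assumption, as the table only specifies the width.

```python
import numpy as np

# 20 standard amino acids plus "X" for unknown/non-standard residues
# (assumed 21st symbol; the table only states a width of 21).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(sequence: str) -> np.ndarray:
    """Encode a protein sequence as a (sequence length x 21) binary tensor."""
    encoding = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for pos, residue in enumerate(sequence):
        # Any symbol outside the alphabet is mapped to "X".
        encoding[pos, INDEX.get(residue, INDEX["X"])] = 1.0
    return encoding

print(one_hot("MKTAYIAK").shape)  # (8, 21)
```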
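For the two language-model columns, per-residue embeddings of shape sequence length × 1024 can be extracted in a few lines once a pre-trained checkpoint is loaded. The sketch below uses the publicly released ProtTrans BERT checkpoint on Hugging Face (Rostlab/prot_bert, pre-trained on BFD); the checkpoint name and the handling of special tokens are assumptions drawn from the public model card, not necessarily the authors' exact pipeline.

```python
import torch
from transformers import BertModel, BertTokenizer

# ProtBERT expects upper-case, space-separated residues.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAK"
inputs = tokenizer(" ".join(sequence), return_tensors="pt")
with torch.no_grad():
    # Hidden states have shape (1, sequence length + 2, 1024):
    # the two extra positions are the [CLS] and [SEP] special tokens.
    hidden = model(**inputs).last_hidden_state

# Drop the special tokens to obtain (sequence length x 1024).
embedding = hidden[0, 1:-1]
print(embedding.shape)  # torch.Size([8, 1024])
```

SeqVec differs only in its output convention: its ELMo-style embedder returns one 1024-wide tensor per layer (the CNN plus two biLSTMs), which is why the table lists "3 tensors (1 per layer)" in its output row.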