Table 1 Characteristics of the final selected embedding methods. Rows describe the method (the encoding strategy used, as established in the "Embedding sequence data to numerical vectors" section), the input and output (the data types consumed and produced by each method), and the embedding feature size (the dimensions of the final numeric representations). A description of the model architectures and the open code libraries used for embedding generation is also given. Some methods provide pre-trained models; the databases of protein sequences used in those cases are listed as well.

From: Protein sequence analysis in the context of drug repurposing
|   | One-Hot Encoder | Sequence Graph Transform | SeqVec | BERT ProtTrans |
|---|---|---|---|---|
| Method | Binary encoding | Distance encoding | DL bidirectional contextual LM | DL masked contextual LM |
| Input | Sequence token list | Sequence token list | Encoded token list | Encoded token list |
| Output | 1 tensor | 1 feature vector | 3 tensors (1 per layer) | 1 tensor |
| Embedding feature vector size | sequence length × 21 | 441 | sequence length × 1024 | sequence length × 1024 |
| Implementation libraries | - | sgt package | PyTorch / AllenNLP | PyTorch / TensorFlow |
| Architecture | - | - | biLSTM (recurrent neural network) | Transformer |
| Layers (nodes) | - | - | 1 CNN (1024) + 2 biLSTM (1024 nodes each) | 30 stacked encoder layers |
| Supervision | - | - | - | Available (structural/localization supervision) |
| Pre-trained models | - | - | Yes | Yes |
| Databases used in pre-training | - | - | UniRef50 (33M sequences) | BFD (Big Fantastic Database) |
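To make the two leftmost size entries concrete: the one-hot tensor is sequence length × 21 because every residue position carries a binary indicator over a 21-symbol alphabet, while the Sequence Graph Transform's fixed 441-dimensional vector is length-independent, since 441 = 21 × 21 gives one feature per ordered pair of alphabet symbols. The following is a minimal NumPy sketch of the one-hot scheme; the use of "X" as the 21st catch-all symbol (and the `one_hot` helper itself) is an illustrative assumption, as the table only specifies the width.

```python
import numpy as np

# 20 standard amino acids plus "X" for unknown/non-standard residues
# (assumed 21st symbol; the table only states a width of 21).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(sequence: str) -> np.ndarray:
    """Encode a protein sequence as a (sequence length x 21) binary tensor."""
    encoding = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for pos, residue in enumerate(sequence):
        # Any symbol outside the alphabet is mapped to "X".
        encoding[pos, INDEX.get(residue, INDEX["X"])] = 1.0
    return encoding

print(one_hot("MKTAYIAK").shape)  # (8, 21)
```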
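For the two language-model columns, per-residue embeddings of shape sequence length × 1024 can be extracted in a few lines once a pre-trained checkpoint is loaded. The sketch below uses the publicly released ProtTrans BERT checkpoint on Hugging Face (Rostlab/prot_bert, pre-trained on BFD); the checkpoint name and the handling of special tokens are assumptions drawn from the public model card, not necessarily the authors' exact pipeline.

```python
import torch
from transformers import BertModel, BertTokenizer

# ProtBERT expects upper-case, space-separated residues.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAK"
inputs = tokenizer(" ".join(sequence), return_tensors="pt")
with torch.no_grad():
    # Hidden states have shape (1, sequence length + 2, 1024):
    # the two extra positions are the [CLS] and [SEP] special tokens.
    hidden = model(**inputs).last_hidden_state

# Drop the special tokens to obtain (sequence length x 1024).
embedding = hidden[0, 1:-1]
print(embedding.shape)  # torch.Size([8, 1024])
```

SeqVec differs only in its output convention: its ELMo-style embedder returns one 1024-wide tensor per layer (the CNN plus two biLSTMs), which is why the table lists "3 tensors (1 per layer)" in its output row.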