BERT contextual embeddings for taxonomic classification of bacterial DNA sequences

Helaly, MA; Rady, Sherine; Aref, MM

Abstract


Biological taxonomic classification is an important task for identifying and discovering organisms and for inferring their evolutionary relationships. The order and structure of a biological sequence's components play a primary role in determining its identity and function. To differentiate efficiently between bacterial categories, the interactions and positions of the components within sequences must be captured, which is a central challenge in biological sequence classification. Accordingly, a considerable amount of recent research has explored efficient representations of biological sequences, such as spectral k-mer representations, one-hot encoding, Hilbert space curves and classical word embeddings such as Word2Vec. This paper addresses the taxonomic classification of bacterial 16S rRNA genes at five resolutions that map to hierarchical taxonomic ranks. A Bidirectional Encoder Representations from Transformers (BERT) model is pretrained on biological sequences, which to the best of our knowledge is the first time BERT has been trained with such sequences. A complete prediction model, BioSeqBERT-CNN, is then proposed: it first extracts contextual embedding representations of DNA sequences using the pretrained BERT model, and the extracted representations are then used for taxonomic classification by a Convolutional Neural Network (CNN). To boost the deep learning classification performance, a data augmentation step is applied. Classification with the original dataset at the most fine-grained rank produced an accuracy of 93.5%, which surpasses that of recent works by 1.5–24.3%. With data augmentation, an accuracy of 99.9% is achieved at the same rank, exceeding recent works by between 7.9% and 30.7%. These results show promising performance and encourage further study of contextual embeddings for representing biological sequences with deep learning networks.
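
As an illustration of two of the baseline representations named in the abstract (the spectral k-mer representation and one-hot encoding), the following is a minimal Python sketch. It is not the authors' code: the function names, the choice of k = 3 and the stride-1 overlapping tokenization are assumptions made for the example, and the abstract does not state how sequences are tokenized before being passed to BERT.

```python
from collections import Counter
from itertools import product


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_spectrum(seq, k=3):
    """Count every possible k-mer over the ACGT alphabet (4**k entries).

    k-mers containing other symbols (for example N) are ignored.
    """
    counts = Counter(kmer_tokens(seq, k))
    return {"".join(p): counts.get("".join(p), 0)
            for p in product("ACGT", repeat=k)}


def one_hot(seq):
    """Encode each base as a 4-element indicator vector in A, C, G, T order.

    Unknown symbols map to an all-zero vector.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [[1 if index.get(base) == j else 0 for j in range(4)]
            for base in seq.upper()]


if __name__ == "__main__":
    example = "ACGTAC"  # toy sequence, not taken from the paper's dataset
    print(kmer_tokens(example))
    print(sum(kmer_spectrum(example).values()))
    print(one_hot(example[:2]))
```

Both representations are fixed: a given k-mer or base always maps to the same vector regardless of its surroundings. The contextual embeddings studied in the paper differ in that the vector for a token depends on the rest of the sequence.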


Other data

Title BERT contextual embeddings for taxonomic classification of bacterial DNA sequences
Authors Helaly, MA; Rady, Sherine; Aref, MM
Keywords DNA;Taxonomic classification;BERT;Contextual embedding;Deep Learning;Convolutional Neural Network
Issue Date 2022
Publisher PERGAMON-ELSEVIER SCIENCE LTD
Journal Expert Systems with Applications 
Volume 208
ISSN 0957-4174
DOI 10.1016/j.eswa.2022.117972
Scopus ID 2-s2.0-85134592839
Web of Science ID WOS:000835482200002


Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.