UNCOVERING THE EFFECTS OF DATA VARIATION ON PROTEIN SEQUENCE CLASSIFICATION USING DEEP LEARNING

Afify, Yasmine M.; Rasha Ismail; Badr, Nagwa; Alaaeldin, Farida

Abstract


Bioinformaticians find it increasingly difficult to analyze and study protein similarity as the number
of proteins grows. Protein sequence analysis helps in predicting protein functions, so the analysis
process must be able to categorize proteins correctly based on their sequences. Features can be
extracted from protein sequences using a variety of methods. The goal of this study is to investigate
the effect of different data variations on the classification performance of a deep learning model
that employs 3D data. First, several research questions were formulated regarding the impact of the
following criteria on the proposed deep learning classification process: dataset size, intrinsic mode
function (IMF) importance, feature size, and preprocessing. Second, comprehensive experiments were
conducted to answer the research questions. Six feature extraction methods were used to create 3D
features of two sizes (7x7x7 and 9x9x9), which were then fed into a convolutional neural network (CNN).
Three datasets that differ in type, size, and class balance were used. Performance was assessed with
the standard metrics of accuracy, precision, recall, and F1-score. The experimental results support
four conclusions. First, the 7x7x7 feature matrix has a positive correlation between its dimensions,
which improved the results. Second, using the sum of the first three IMF components had a better impact
than using only the first IMF component. Third, the classification process did not benefit from feature
normalization on the small datasets, whereas it did on the large dataset. Finally, the dataset size had
a significant impact on training the CNN model, with training accuracy reaching 84.03%.
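
The four assessment metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the abstract does not say how per-class scores are combined, so macro-averaging (an unweighted mean over classes) is an assumption here, and the function name `macro_metrics` and the class labels in the usage note are hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score.

    Macro-averaging is an assumption; the abstract does not state
    which averaging scheme the study used.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        # Each ratio falls back to 0.0 when its denominator is zero.
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

For example, with the made-up labels `y_true = ["kinase", "kinase", "protease", "protease"]` and `y_pred = ["kinase", "protease", "protease", "protease"]`, three of four predictions are correct, so the accuracy is 0.75. On an imbalanced dataset the macro-averaged scores can differ noticeably from accuracy, which is one reason to report all four metrics.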


Other data

Title UNCOVERING THE EFFECTS OF DATA VARIATION ON PROTEIN SEQUENCE CLASSIFICATION USING DEEP LEARNING
Authors Afify, Yasmine M. ; Rasha Ismail ; Badr, Nagwa ; Alaaeldin, Farida 
Keywords Deep Learning;Proteins;EMD;IMF;Feature Matrix
Issue Date 1-May-2022
Journal International Journal of Intelligent Computing and Information Sciences (IJICIS)
Volume 22
Issue 2
Start page 112
End page 125
ISSN 2535-1710
DOI 10.21608/ijicis.2022.123177.1168


Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.