Data Mining Techniques in Gene Expressions
Basma Ali Maher
Abstract
In recent years, rapid developments in genetics have generated a huge amount of biological data. Microarray gene expression data is an important instance of such data. It is high-dimensional: a small number of samples is accompanied by a large number of genes. Applying machine learning techniques to knowledge discovery in such data has therefore become a rich area of research. The mining phase is usually divided into two steps: gene selection (feature reduction) and classification.
Gene selection is the process of finding the genes most strongly related to a particular class. It reduces not only the dimensionality but also the risk that irrelevant genes degrade the classification process. Many machine learning approaches are used for feature reduction; this study focuses on the t-test and class separability. Classification, on the other hand, is an important data mining problem with a wide range of applications. It concerns learning to assign data to predetermined categories. It is applied to discriminate between diseases, to predict outcomes based on gene expression patterns, and perhaps even to identify the best treatment for a given genetic signature. Many machine learning approaches are used for classification; this study focuses on the support vector machine and k-nearest neighbor.
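The two ranking criteria named above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract gives no formulas, so the t-test is assumed to be Welch's two-sample statistic, and class separability is assumed to be the ratio of between-class to within-class sum of squares (one common definition). The gene names, values, and labels are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_score(values, labels):
    """Absolute Welch t-statistic of one gene between exactly two classes."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("t_score expects exactly two classes")
    a = [v for v, y in zip(values, labels) if y == classes[0]]
    b = [v for v, y in zip(values, labels) if y == classes[1]]
    diff = abs(mean(a) - mean(b))
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / se

def class_separability(values, labels):
    """Between-class over within-class sum of squares (assumed definition)."""
    overall = mean(values)
    between = within = 0.0
    for c in set(labels):
        group = [v for v, y in zip(values, labels) if y == c]
        m = mean(group)
        between += len(group) * (m - overall) ** 2
        within += sum((v - m) ** 2 for v in group)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

def rank_genes(expression, labels, score):
    """Return gene names sorted from most to least informative under `score`."""
    return sorted(expression, key=lambda g: score(expression[g], labels), reverse=True)

# Hypothetical toy data: one list of expression values per gene, one value per sample.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # clearly separates the classes
    "gene_b": [2.0, 3.5, 2.8, 2.9, 2.1, 3.4],  # nearly identical class means
    "gene_c": [0.5, 0.7, 0.6, 1.5, 1.9, 1.4],  # separates, but less strongly
}
by_t = rank_genes(expression, labels, t_score)
by_cs = rank_genes(expression, labels, class_separability)
print(by_t, by_cs)
```

Both criteria reward genes whose class means lie far apart relative to the spread within each class, which is why they tend to agree on clearly informative genes.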
Support Vector Machine (SVM) plays a very important role in data mining classification problems. The structure of an SVM depends on its kernel function; the most commonly used kernels are linear and polynomial. If there are more than two classes in the data set, binary SVMs are not sufficient to solve the whole problem. To solve a multi-class classification problem, the problem must be converted into a number of binary classification problems. There are two usual approaches: the “one against all” scheme and the “one against one” scheme.
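The kernels and the two decomposition schemes can be illustrated with a short sketch. It does not train an SVM; it only shows the kernel formulas and how a multi-class label vector is split into binary problems. The polynomial kernel's `degree` and `coef0` parameters, the class names, and the +1/-1 encoding are assumptions chosen for illustration.

```python
from itertools import combinations

def linear_kernel(x, z):
    """k(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """k(x, z) = (x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree

def one_against_all(labels):
    """One binary problem per class: that class (+1) versus all others (-1)."""
    return {c: [1 if y == c else -1 for y in labels] for c in sorted(set(labels))}

def one_against_one(labels):
    """One binary problem per pair of classes, restricted to samples of the pair.

    Each entry maps (a, b) to (sample indices, binary labels with a=+1, b=-1).
    """
    problems = {}
    for a, b in combinations(sorted(set(labels)), 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        problems[(a, b)] = (idx, [1 if labels[i] == a else -1 for i in idx])
    return problems

# Hypothetical four-class label vector.
labels = ["A", "B", "C", "D", "A", "C"]
ova = one_against_all(labels)
ovo = one_against_one(labels)
print(len(ova), len(ovo))
```

For k classes, one-against-all yields k binary problems that each use every sample, while one-against-one yields k(k-1)/2 smaller problems, each using only the samples of two classes.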
K-Nearest Neighbor (KNN), in turn, shows outstanding performance in many cases of microarray gene expression classification. Using KNN requires three key elements: (1) a set of training data, (2) a set of labels for the training data (identifying the class of each data entry), and (3) the value of K, which determines the number of nearest neighbors.
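The three elements map directly onto the arguments of a minimal KNN classifier, sketched below. Euclidean distance and majority voting are assumptions, since the abstract does not state which distance or voting rule is used; the sample values are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_data, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Indices of the training samples, ordered from nearest to farthest.
    order = sorted(range(len(train_data)), key=lambda i: dist(train_data[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, the class encountered first (the nearer neighbor) wins.
    return votes.most_common(1)[0][0]

# (1) training data, (2) training labels, (3) the value of K.
train_data = [[1.0, 1.1], [0.9, 1.0], [5.0, 4.8], [5.2, 5.1]]
train_labels = ["normal", "normal", "tumor", "tumor"]
k = 3

print(knn_predict(train_data, train_labels, [4.9, 5.0], k))
```

An odd K is a common choice for two-class problems because it avoids tied votes.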
This study proposes a new hybrid reduction approach to improve cancer classification accuracy. It uses two gene selection techniques to confirm the most informative genes and to discard irrelevant genes that affect the classification accuracy. Specifically, the study applies two machine learning (ML) gene ranking techniques, the t-test and Class Separability (CS), and two ML classifiers, K-nearest neighbor (KNN) and support vector machine (SVM), to explore and analyze the process of mining microarray gene expression profiles. Based on these analyses, it proposes a hybrid ML reduction approach to enhance the classification accuracy.
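The abstract does not say how the two rankings are combined. One possible reading of "confirm the most informative genes" is to keep only the genes that both techniques place among their top candidates. The sketch below illustrates that reading only; the function `hybrid_select`, the cutoff `top_n`, and the gene names are hypothetical and should not be taken as the thesis's actual procedure.

```python
def hybrid_select(ranking_a, ranking_b, top_n):
    """Keep genes that appear in the top `top_n` of both rankings.

    The result preserves the order of `ranking_a`.
    """
    top_b = set(ranking_b[:top_n])
    return [gene for gene in ranking_a[:top_n] if gene in top_b]

# Hypothetical rankings, best gene first.
ranking_t = ["g1", "g2", "g3", "g4", "g5"]   # e.g. from a t-test ranking
ranking_cs = ["g2", "g1", "g5", "g3", "g4"]  # e.g. from a class separability ranking

selected = hybrid_select(ranking_t, ranking_cs, top_n=3)
print(selected)
```

The selected genes would then be passed to a classifier such as KNN or SVM in place of the full gene set.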
Other data
| Title | Data Mining Techniques in Gene Expressions |
| Other Titles | أساليب التنقيب فى بيانات السلاسل الجينية |
| Authors | Basma Ali Maher |
| Issue Date | 2014 |