An Enhanced Automatic Model Based On Semantic Annotation For Text Documents
Eman Ismail Sayed
Abstract
Recently, the amount of data available on the web has been increasing rapidly, making it difficult to find relevant data in huge data sets. Text document annotation offers a solution to this type of problem. It attaches additional information to the text in the form of notes or comments. Annotation makes it easier to find the main topics of a document. Moreover, it helps the reader to get an overview of the document and understand it.
Due to the spread of social media applications such as Facebook and Twitter, millions of short texts are produced daily. Text classification is therefore used to discover knowledge from this unstructured text data. Short text documents (STDs) have special characteristics: they are noisy and sparse because their words are rarely repeated. Traditional methods of classifying such documents are based on the Bag of Words (BOW) method, which indexes text documents as independent features. Each feature is a single term or word in a document. A document is represented as a vector in feature space, and this vector contains the word weights, which are the numbers of occurrences of each word in the document. Classification of STDs based on BOW has several drawbacks. STDs do not provide enough word co-occurrence or shared context. The BOW representation of such documents is mostly sparse because many weights are empty. As a result, the traditional BOW method fails to achieve good accuracy. Moreover, the BOW method treats synonyms as different features and does not consider the relations between words and documents. Therefore, semantic knowledge is introduced as background to focus on the semantic relationships between the document words (terms).
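The BOW drawbacks described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw occurrence counts as weights; the function name `bow_vectors` and the two sample texts are hypothetical and do not come from the thesis.

```python
from collections import Counter


def bow_vectors(docs):
    """Return (vocabulary, vectors) using raw term counts as weights."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One weight per vocabulary term; absent terms get weight 0.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Two short texts about the same topic, phrased with different words.
docs = ["cheap car for sale", "buy an automobile today"]
vocabulary, vectors = bow_vectors(docs)

# Count the features that have a non-zero weight in both documents.
shared = sum(1 for a, b in zip(*vectors) if a and b)

print(vocabulary)
print(vectors)
print("shared features:", shared)
```

In this toy case each vector is half zeros, and "car" and "automobile" occupy separate features, so the two documents share no non-zero feature even though they are related.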
In this work, two effective models for semantic annotation are proposed. The first model is classification based on enrichment representation (CBER). It is composed of the proposed semantic analysis based on WordNet (SAWN) model and the word vector term frequency (WVTF). WVTF is a BOW representation of text documents. SAWN maps the text documents to WordNet to extract concepts, which are the terms defined in WordNet. SAWN chooses the most suitable synonym for each document concept by examining the surrounding concepts in the same document. Thus, concepts with the same meaning increase the weight of their synonyms. Furthermore, the semantic relationships between concepts are exploited to resolve ambiguity problems such as polysemy and synonymy.
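The idea of choosing a sense from the surrounding terms and letting synonyms reinforce one weight can be sketched as follows. This is not the SAWN model itself: `TOY_LEXICON` is a small hypothetical stand-in for WordNet, and the overlap-counting rule in `choose_sense` is an assumed, simplified disambiguation heuristic.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for WordNet:
# term -> {candidate concept: words that hint at that concept}.
TOY_LEXICON = {
    "car": {"automobile": {"drive", "road", "engine"}},
    "automobile": {"automobile": {"drive", "road", "engine"}},
    "bank": {
        "financial_institution": {"money", "loan", "deposit"},
        "river_bank": {"river", "water", "shore"},
    },
}


def choose_sense(term, context):
    """Pick the concept whose hint words overlap most with the context."""
    senses = TOY_LEXICON.get(term)
    if not senses:
        # Terms missing from the lexicon keep their surface form.
        return term
    return max(senses, key=lambda sense: len(senses[sense] & context))


def concept_weights(doc):
    """Count concepts instead of raw words, so synonyms share one weight."""
    tokens = doc.lower().split()
    weights = Counter()
    for i, term in enumerate(tokens):
        # The context is every other token in the same document.
        context = set(tokens[:i] + tokens[i + 1:])
        weights[choose_sense(term, context)] += 1
    return weights


weights = concept_weights("car loan from the bank for an automobile")
print(weights)
```

Here "car" and "automobile" both contribute to the single concept `automobile`, and "bank" resolves to the financial sense because "loan" appears in the same document. Terms absent from the toy lexicon are passed through unchanged.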
The CBER model enriches the STDs with semantic weights to solve disambiguation problems without increasing the number of document features. It considers all document terms. However, some terms may not be defined in WordNet. So that,
Other data
| Title | An Enhanced Automatic Model Based On Semantic Annotation For Text Documents |
| Other Titles | تحسين عملية استنتاج العلامات النصية من الوثائق اعتمادا على المعنى الدلالى (Improving the Process of Inferring Text Annotations from Documents Based on Semantic Meaning) |
| Authors | Eman Ismail Sayed |
| Issue Date | 2017 |
Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.