TOWARDS MINING WEB CONTENT OUTLIERS
Ayman Hassan Tanira
Abstract
The task of outlier detection is to find the small fraction of data that is exceptional when compared with the rest of a large dataset. Finding outliers in huge data repositories is like finding needles in a haystack. Existing outlier detection algorithms were designed for mining numeric data and cannot be applied directly to Web datasets, because the Web contains data of different types such as text, hypertext, images, video, and audio.
Web content outliers are Web documents whose contents differ from those of other documents taken from the same category. Mining Web document outliers may lead to the identification of competitors and emerging business trends in electronic commerce, improve the quality of results obtained from a Web search engine, and help clean corpora used in Web document classification.
This thesis concentrates on enhancing current approaches for detecting Web document outliers. It introduces a Web document outlier mining system aiming t[...] required for identifying the closest neighbors for every document in the collection.
The experimental results on two different datasets with embedded motifs showed that FindWDO with N-grams outperforms similar algorithms in the same domain with respect to the accuracy of results.
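To make the idea of N-gram-based outlier detection concrete, the following is a minimal sketch and not the thesis's FindWDO algorithm, whose details are not given in this abstract. It assumes character trigrams as document profiles, a weighted Jaccard dissimilarity between profiles, and an outlier score equal to the mean dissimilarity to the k closest neighbors. The function names `char_ngrams`, `dissimilarity`, and `outlier_scores` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a Counter of overlapping character n-grams (assumed profile)."""
    # Lowercase and collapse whitespace so formatting does not affect the profile.
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def dissimilarity(a, b):
    """Weighted Jaccard dissimilarity: 0.0 for identical profiles, 1.0 for disjoint."""
    shared = sum((a & b).values())  # element-wise minimum of counts
    total = sum((a | b).values())   # element-wise maximum of counts
    return 1.0 - shared / total if total else 0.0


def outlier_scores(docs, n=3, k=2):
    """Score each document by its mean dissimilarity to its k nearest neighbors.

    This scoring rule is an illustrative assumption, not the thesis's method.
    """
    profiles = [char_ngrams(d, n) for d in docs]
    scores = []
    for i, p in enumerate(profiles):
        dists = sorted(
            dissimilarity(p, q) for j, q in enumerate(profiles) if j != i
        )
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores


if __name__ == "__main__":
    docs = [
        "cheap flights and hotel deals",
        "cheap hotel deals and flights",
        "flights hotel deals cheap",
        "quantum chromodynamics lattice",
    ]
    for doc, score in zip(docs, outlier_scores(docs)):
        print(f"{score:.3f}  {doc}")
```

Under these assumptions, a document from the same category whose content differs from the rest receives a higher score, because even its closest neighbors share few N-grams with it. The pairwise comparison is quadratic in the number of documents, which is acceptable for a small illustration only.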
Other data
| Title | TOWARDS MINING WEB CONTENT OUTLIERS |
| Other Titles | نحو التنقيب فى المحتوى خارج السياق على الويب |
| Authors | Ayman Hassan Tanira |
| Issue Date | 2007 |
Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.