ARABIC DOCUMENT LAYOUT ANALYSIS USING MACHINE LEARNING AND CONNECTED COMPONENTS BASED FEATURES

Rana Sobhy Mostafa Saad

Abstract

Document Layout Analysis (DLA) is a key preprocessing stage for optical character recognition (OCR). It locates and delimits the text and non-text regions of a document image. Arabic DLA has received less attention than DLA for other languages, owing to the lack of appropriate publicly available research datasets. A full DLA pipeline comprises several stages: input document preprocessing, document Physical Layout Analysis (PLA), document Logical Layout Analysis (LLA), and representation of the analysis output.
In this thesis, geometric features of connected components (CCs) are used to represent Arabic document images. These CC features are classified into text and non-text components by Support Vector Machine (SVM) and Random Forest (RF) classifiers in order to perform PLA on scanned Arabic book pages.
Experiments on BCE-v1 and on other researchers' datasets showed remarkable performance for both the SVM-based and RF-based solutions. Comparison with other classical and state-of-the-art systems demonstrated the strength of the proposed system and its promise for application to wider problem domains.
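
To make the idea of CC geometric features concrete, the following is a minimal sketch and not the thesis implementation. It labels 8-connected foreground components in a small binary image and computes a few simple geometric descriptors per component. The function name `connected_component_features`, the toy `page` array, and the specific features chosen (bounding box, area, aspect ratio, density) are assumptions made for illustration; the abstract does not list the actual feature set.

```python
from collections import deque


def connected_component_features(image):
    """Return one feature dict per 8-connected foreground component.

    `image` is a list of rows holding 0 (background) or 1 (foreground).
    The feature set here is illustrative only (an assumption).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    features = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill collects the pixels of one component.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                min_r, max_r = min(min_r, y), max(max_r, y)
                min_c, max_c = min(min_c, x), max(max_c, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            box_h = max_r - min_r + 1
            box_w = max_c - min_c + 1
            features.append({
                "bbox": (min_r, min_c, max_r, max_c),
                "width": box_w,
                "height": box_h,
                "area": area,                       # foreground pixel count
                "aspect_ratio": box_w / box_h,      # bounding-box width / height
                "density": area / (box_w * box_h),  # fill ratio of the bounding box
            })
    return features


# Toy binary "page": a 2x2 blob and a 3-pixel vertical stroke.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
components = connected_component_features(page)
```

In a pipeline of the kind the abstract describes, per-component vectors like these would be passed to a trained text/non-text classifier, for example scikit-learn's `SVC` or `RandomForestClassifier`. That pairing is one plausible realization, not a detail confirmed by the source.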

Other data

Title	ARABIC DOCUMENT LAYOUT ANALYSIS USING MACHINE LEARNING AND CONNECTED COMPONENTS BASED FEATURES
Other Titles	تحليل هيئة الوثائق العربية باستخدام تعلم الآلة وسمات المكونات المترابطة
Authors	Rana Sobhy Mostafa Saad
Issue Date	2018

Attached Files

File	Size	Format
V2806.pdf	589.86 kB	Adobe PDF
