Informative Content Extraction Project Report


A Project Report on
Informative Content Extraction undergone at
National Institute of Technology, Surathkal,
Karnataka
under the guidance of
Dinesh Naik,
Assistant Professor
Submitted by
Faeem Shaikh
11IT22
VII Sem B.Tech (IT) in partial fulfillment for the award of the degree of
BACHELOR OF TECHNOLOGY in INFORMATION TECHNOLOGY
Department of Information Technology
National Institute of Technology Karnataka, Surathkal
2014-2015.
Abstract
Internet web pages contain several items that cannot be classified as "informative content", e.g., search and filtering panels, navigation links, advertisements, and so on. Most clients and end-users search for the informative content and largely do not seek the non-informative content. As a result, the need for Informative Content Extraction …

You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect one; then you just have to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility[4].
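As a minimal sketch of the parser-swapping idea, the snippet below parses a small HTML fragment with Python's built-in parser and pulls out paragraph text; the fragment itself is an invented example, and "lxml" or "html5lib" could be substituted for "html.parser" if those libraries are installed.

```python
from bs4 import BeautifulSoup

# A tiny invented page: one navigation-style block, one content block.
html = '<html><body><p class="nav">Home | About</p><p>Actual article text.</p></body></html>'

# "html.parser" is Python's built-in parser; Beautiful Soup can also be
# handed "lxml" (faster) or "html5lib" (more lenient) here.
soup = BeautifulSoup(html, "html.parser")

# Extract the text of every paragraph block in document order.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

The same extraction code runs unchanged whichever parser is named, which is what makes comparing parsing strategies cheap.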
2 Literature Survey
Cai-Nicolas Ziegler[9] and teammates have proposed an approach that allows fully automated extraction of news content from HTML pages. The basic concept is to extract coherent blocks of text from HTML pages using DOM parsing, and to compute linguistic and structural features for each block. These features are then forwarded to classifiers that decide whether to keep or discard the block at hand; to this end, they use diverse popular classification models for learning feature thresholds[9]. FastContentExtractor is a fast algorithm to automatically detect content blocks in web pages, built by improving ContentExtractor[7]. Instead of storing all input web pages of a website, Son Bao Pham and teammates have automatically created a template to store information of
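Ziegler's keep-or-discard decision per block can be sketched with a toy threshold classifier. The feature set (word count and link density) and the threshold values below are illustrative assumptions for the sketch, not values taken from the cited papers, and the blocks are invented stand-ins for what DOM parsing would produce.

```python
# Hypothetical per-block data as DOM parsing might yield it: the block's
# text plus a precomputed count of words inside hyperlinks.
blocks = [
    {"text": "Read the full story about the election results here today.", "link_words": 0},
    {"text": "Home | News | Sports | Contact", "link_words": 4},
]

def link_density(block):
    # Fraction of the block's words that sit inside links.
    words = block["text"].split()
    return block["link_words"] / max(len(words), 1)

def is_informative(block, max_link_density=0.3, min_words=5):
    # Simple learned-threshold stand-in: keep long, link-poor blocks.
    return (len(block["text"].split()) >= min_words
            and link_density(block) <= max_link_density)

kept = [b["text"] for b in blocks if is_informative(b)]
print(kept)
```

In the real systems the thresholds are learned by the classification models rather than fixed by hand; the structure of the decision (features in, keep/discard out) is the same.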