A Project Report on
Informative Content Extraction undergone at
National Institute of Technology, Surathkal,
Karnataka
under the guidance of
Dinesh Naik,
Assistant Professor
Submitted by
Faeem Shaikh
11IT22
VII Sem B.Tech (IT) in partial fulfilment for the award of the degree of
BACHELOR OF TECHNOLOGY in INFORMATION TECHNOLOGY
Department of Information Technology
National Institute of Technology Karnataka, Surathkal
2014-2015.
Abstract
Internet web pages contain several items that cannot be classified as the "informative content", e.g., search and filtering panels, navigation links, advertisements, and so on. Most clients and end-users search for the informative content, and largely do not seek the non-informative content. As a result, the need for Informative Content Extraction
You don't have to think about encodings unless the document doesn't specify an encoding and Beautiful Soup can't detect one; in that case, you just have to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility[4].
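As a minimal sketch of how Beautiful Soup can be used for informative content extraction, the example below parses a small HTML page, strips typical non-informative elements (navigation, scripts, styles), and extracts the remaining text. The HTML snippet and the choice of stripped tags are illustrative assumptions, not part of any cited approach; the parser name passed to BeautifulSoup ("html.parser", "lxml", or "html5lib") selects the underlying parsing strategy mentioned above.

```python
from bs4 import BeautifulSoup

# Illustrative page mixing informative and non-informative blocks.
html = (
    "<html><body>"
    "<nav>Home | About | Contact</nav>"
    "<p>Real article text.</p>"
    "<script>trackVisitor();</script>"
    "</body></html>"
)

# "html.parser" is the built-in parser; "lxml" or "html5lib" could be
# substituted here to trade speed for flexibility.
soup = BeautifulSoup(html, "html.parser")

# Remove common non-informative elements before extracting text.
for tag in soup(["nav", "script", "style"]):
    tag.decompose()

print(soup.get_text(strip=True))
```

Running this prints only the paragraph text, with the navigation links and script discarded.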
2 Literature Survey
Cai-Nicolas Ziegler[9] and teammates have proposed an approach that allows fully automated extraction of news content from HTML pages. The basic concept is to extract coherent blocks of text from HTML pages, using DOM parsing, and to compute linguistic and structural features for each block. These features are then forwarded to classifiers that decide whether to keep or discard the block at hand. To this end, they use diverse popular classification models for learning feature thresholds[9]. FastContentExtractor is a fast algorithm to automatically detect content blocks in web pages by improving
ContentExtractor[7]. Instead of storing all input web pages of a website, Son Bao Pham and teammates have automatically created a template to store information of