1. Introduction
One of the biggest free online knowledge bases is Wikipedia, which is growing constantly. Wikipedia is a project founded in 2001 to create a free online encyclopaedia. Because everybody can contribute to Wikipedia by creating or editing articles, it quickly grew into the largest and most popular reference work on the internet [Wika]. The English version alone has more than 5 million different articles. Even though everybody can create articles, the quality of the content is quite good, as everybody can also edit or report articles. In 2005 the magazine Nature conducted an independent peer review in which it compared 42 Wikipedia articles with articles of the English encyclopaedia Britannica. Compared to Wikipedia, articles
Already in September 2005, the average number of words in a Wikipedia article was around 800 [Wikb]. It is safe to assume that this number has grown over the years, making it impossible to get a quick first overview of a topic using Wikipedia articles. Besides the large number of words in a single article, most users do not even know what they are actually looking for, or what could be interesting information in a certain context.
Visualizations of concepts and their relations can help users find better information about a concept or gain insight into a topic. Due to our visual perception abilities, a visual representation of data is often more effective than written text [Maz09].
Furthermore, projects like DBpedia have already attempted to structure the information from Wikipedia, and their results can be visualized in a way that gives the user an overview of a topic and supports them in their search.
Depending on the context, a corpus can consist of different types of text documents, such as articles, blog posts, books, or tweets, among others.
Bag-of-Words Model The bag-of-words model (BOW) is a model used in natural language processing and information retrieval. In this model a text is represented as a bag (i.e. a multiset) of its words, without keeping the order of the words.
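As a minimal illustration, the following Python sketch builds a bag-of-words representation for a single text. The lower-casing and whitespace tokenization are simplifying assumptions made for the example, not part of the model itself.

```python
from collections import Counter

def bag_of_words(text):
    """Return the bag-of-words representation of a text as a
    mapping word -> count; word order is discarded."""
    tokens = text.lower().split()  # simplistic whitespace tokenization
    return Counter(tokens)

bow = bag_of_words("the cat sat on the mat")
print(bow)  # e.g. Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```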
Vector Space Model The vector space model is a model for representing text documents as vectors, in which each entry indicates the occurrence of a term within the text. As in the bag-of-words model, the order of terms is disregarded. If a term occurs in a document, its entry value in the document vector is non-zero. This entry value is also called the term weight and can either be binary (with 1 indicating that the term occurs in the document) or a weighted value based on a weighting function.
To create a vector space model for a given corpus, one first has to extract all unique terms to form a vocabulary. Each document vector then has the size of this vocabulary, and the i-th entry of document vector j is set to a non-zero value if the i-th term of the vocabulary appears in the j-th document.
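A small sketch of this construction for a toy corpus, under the same simplifying assumptions as above (lower-casing, whitespace tokenization); the function name and example corpus are made up for illustration, and the term weight can be switched between a binary value and the raw term frequency.

```python
def build_vector_space(corpus, binary=True):
    """Build a simple vector space model for a list of texts.

    Returns the vocabulary and one vector per document, where the
    i-th entry of vector j is the weight of the i-th vocabulary
    term in the j-th document (binary occurrence or term frequency).
    """
    tokenized = [doc.lower().split() for doc in corpus]

    # Step 1: extract all unique terms to form the vocabulary.
    vocabulary = sorted({term for doc in tokenized for term in doc})

    # Step 2: build one vector of vocabulary size per document.
    vectors = []
    for doc in tokenized:
        if binary:
            vec = [1 if term in doc else 0 for term in vocabulary]
        else:
            vec = [doc.count(term) for term in vocabulary]
        vectors.append(vec)
    return vocabulary, vectors

corpus = ["the cat sat on the mat", "the dog chased the cat"]
vocab, vectors = build_vector_space(corpus, binary=False)
print(vocab)    # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
```

In practice a weighting function such as tf-idf would typically replace the raw counts, but the binary and frequency weights shown here correspond directly to the two cases described above.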