Building a Wikipedia N-GRAM Corpus

Jorge Ramón Fonseca Cacho, Ben Cisneros, and Kazem Taghva
Department of Computer Science, University of Nevada, Las Vegas, Las Vegas, USA
{Jorge.FonsecaCacho,Kazem.Taghva}@unlv.edu, cisneo1@unlv.nevada.edu

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 277-294, 2021. https://doi.org/10.1007/978-3-030-55187-2_23

Abstract. In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We then apply these approaches to build a 1 to 5-gram corpus data set, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora like the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.

Keywords: NGRAM · NLP · Wikipedia · OCR · Wiki

1 Introduction

In the words of The Economist, "The world's most valuable resource is no longer oil, but data" [1]; however, unlike oil, data is a commodity that is typically not shared or sold, and yet it is highly sought after in both industry and academia. At the same time, an immense influx of data is being generated every second on the Internet. While this may seem paradoxical, it is not: acquiring data is easy, but converting that raw data into something useful, well, therein lies the rub.

Google did the Natural Language Processing (NLP) community a great service when it released the Google Web 1T corpus in 2006 [2]. The data set contained "approximately 1 trillion word tokens of text from publicly accessible Web pages" [2]. That dataset was limited to the English language, but Google later released the Web 1T 5-gram, 10 European Languages version in 2009 [3]. Both of these datasets enabled a plethora of NLP projects that would not have been achievable before, but then they stopped: there was no V2 version of the datasets, possibly because the majority of those projects had no need for one. In a way, Google Web 1T is a time capsule of what the English-speaking Internet was in 2006.

Moving forward to the present time, we are now beginning 2020. The world and its information have changed since 2006, and one great source that maintains the state of the world, as written by its inhabitants, is Wikipedia.org. The English Wikipedia has 6 million articles with approximately 3.48 billion words in all of its English articles [4]. It is steadily growing at a rate of 20,000 new articles and 1 GB of text data per year [4]. While some of this data is historical or ordinary, a lot of information on recent events is being added. This makes Wikipedia a great resource for learning about current events and current culture from the last 14
years that Web 1T does not possess. An example relevant to NLP is the added phrases and vocabulary that have been created during this time. Whether it is completely new words like 'ransomware', 'yeet', or 'brexit', an existing word that now has a new meaning like 'lit', or phrases like 'big data', the English language is constantly evolving and Wikipedia captures this well.

This brings us to the current project, which is to build a Wikipedia n-gram corpus. The goal is to complement existing n-gram corpora, like Web 1T, with a dataset that contains n-grams with modern language and information that is otherwise not available in corpora created in the past. Furthermore, in the vein of reproducible research [5], we provide the Python source code needed to re-run our scripts and generate newer versions of the dataset as Wikipedia grows. While this is not the first attempt at generating a Wikipedia n-gram corpus [6], our goal is to simplify the process, make it efficient, maintain it, and update it as new data dumps from Wikipedia are offered.

The organization of the paper is as follows. First we discuss Wikipedia in detail; then we cover more about Google Web 1T and n-gram corpora; then we delve into the process of creating the dataset, from the original XML data dumps to the final product, while discussing multiple approaches, how they perform, and their pros and cons; and we eventually conclude by discussing future work, which is primarily focused on making the data set accessible to researchers and on improving the process in terms of space and time complexity.

2 A Brief Discussion on Wikipedia

As per the official Wikipedia article about Wikipedia,

    Wikipedia is a multilingual online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors using a wiki-based editing system. It is the largest and most popular general reference work on the World Wide Web, and is one of the most popular websites ranked by Alexa as of October 2019. It features exclusively free content and no commercial ads, and is owned and supported by the Wikimedia Foundation, a non-profit organization funded primarily through donations. Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger. Sanger coined its name, as a portmanteau of 'wiki' (the Hawaiian word for 'quick') and 'encyclopedia'. Initially an English-language encyclopedia, versions of Wikipedia in other languages were quickly developed. With at least 5,980,302 articles, the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages, and by February 2014, it had reached 18 billion page views and nearly 500 million unique visitors per month [7].
3 Applications of Google Web 1T and N-GRAM Corpus

The Google Web 1T 5-gram corpus is a data set "contributed by Google Inc., [that] contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams" [2]. As mentioned earlier, there are 1 trillion word tokens, and the source of the text is the publicly accessible web pages Google accessed to support its search engine. The data was published in 2006 [2], but it is possible that the data itself was web crawled at an earlier, undisclosed date. As we mentioned in a previous paper, "What makes Google-1T great is that having it based on Web pages allows it to have a broader variety of subjects that enable it to understand more context than other corpus. The variety does come at a price as n-grams with fewer than 40 occurrences are omitted [8]. This means that many correct 3-grams that could be very specialized to a topic are missing. A possible solution to this would be to add subject-specific 3-grams to the data set for the documents being corrected. There are 13,588,391 unigrams and 977,069,902 trigrams in the corpus contained in approximately 20.8 GB of uncompressed text files." [9,10].

This research was done while utilizing the dataset to correct Optical Character Recognition (OCR) generated errors during the post-processing stage of OCR, by taking advantage of the idea of context-based corrections, where neighboring words can be used to identify likely candidates with which to correct a word containing an error. The concept is simple and used every day: if one sees a misspelling, one can create a list of candidate words with a close Levenshtein edit distance [11] (representing the words that require the fewest changes to transform the original word into the candidate), and then, based on the context of the sentence, paragraph, and even document, choose the correct candidate. One way we can quantify context is by seeing how frequent an n-gram is in our Web 1T corpus. If an n-gram containing the candidate word has a high frequency, it is a popular phrase, meaning it is more likely to be a correct phrase rather than a non-existent one.

Ultimately, our Wikipedia dataset is the solution to our problem with missing specialized topics or low-frequency occurrences, as Wikipedia covers a great spectrum of topics that may not be covered in Web 1T, and our dataset does not remove low-frequency n-grams, allowing us to take advantage of these rare n-gram occurrences.
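To make the context-based correction idea concrete, the following is a minimal illustrative sketch in Python (simplified, and not the exact code used for the work cited above) that ranks correction candidates by the frequency of the 3-gram they would form with their neighbors. The vocabulary, the in-memory trigram_freq dictionary, and the edit-distance cutoff of 2 are illustrative assumptions; in practice the frequencies would come from a corpus such as Web 1T or the Wikipedia dataset described below.

def edit_distance(a, b):
    """Levenshtein distance [11] computed with a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]


def best_candidate(left, error, right, vocabulary, trigram_freq, max_distance=2):
    """Pick the candidate whose surrounding 3-gram is most frequent."""
    candidates = [w for w in vocabulary if edit_distance(error, w) <= max_distance]
    # Fall back to the original token if no candidate is close enough.
    return max(candidates or [error],
               key=lambda w: trigram_freq.get(f"{left} {w} {right}", 0))


# Toy example: 'botn' is an OCR error; context picks 'born' over 'barn'.
freqs = {"was born in": 950, "was barn in": 1}
print(best_candidate("was", "botn", "in", ["born", "barn", "burn"], freqs))  # -> born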
4 Creating the Dataset

Here we explain the process of downloading the Wikipedia encyclopedia, extracting the article text, creating 1 to 5-grams, and then combining duplicates and generating frequencies for each.

4.1 Wikipedia Database Dumps

Downloading Wikipedia may sound fearsome, but it is quite trivial. Wikipedia offers a complete copy of the entire wikitext source and metadata embedded in a single large XML file. These snapshots are usually provided at least once a month [12].

"Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL)" [13].

For the initial version of our dataset, we downloaded the latest dump available at the time from https://dumps.wikimedia.org/enwiki/latest/, which was enwiki-latest-pages-articles.xml.bz2, dated 02-Nov-2019 23:25, with a compressed size of 16.6 GB (16,589,476,447 bytes). After we extracted it, it expanded to 74.4 GB (74,373,625,381 bytes). This version includes only the current revisions of each article, and excludes any talk or user pages [13].

4.2 Wikipedia Extractor

WikiExtractor is a tool that extracts and cleans text from a Wikipedia database dump. It performs template expansion by preprocessing the whole dump and extracting template definitions, which are then processed quickly thanks to its ability to use multiprocessing to deal with articles in parallel [14].

To run WikiExtractor, we feed it the original 74.4 GB XML dump file downloaded from Wikipedia: enwiki-latest-pages-articles.xml. The extractor then outputs files of approximately 1 MB in size, each containing several articles/documents. These are organized into folders containing 100 files each (about 100 MB per folder). A total of 134 folders are created, with a total of 13,321 files. The full output is approximately 13.9 GB, which roughly matches Wikipedia's claim that the article text is about 16 GB [4]. The size discrepancy between the original XML file, Wikipedia's claim of 16 GB, and the extractor's output comes from the output being strictly the plain text of each article. No tables, infoboxes, charts, lists, or other visuals are included. Some Wikipedia web pages hold only an image that is referred to in proper articles; these are all removed as well. Citation content is also removed, since each citation was inside the article. Any content at the bottom, such as links to related information, is removed, since ultimately we want text that we can extract n-grams from, and the regular article text is the best source of that. These text removals account for the last 2 GB of text data. Figure 1 shows the historic size increase of Wikipedia's article text.

For our specific database dump, WikiExtractor extracted 5,963,484 articles and took 106.7 min, running on an HDD/SSD device with an i7-6700k and 32 GB of RAM. However, the VM that ran it only had access to 24 GB of RAM.
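As a side note, the compressed dump from Sect. 4.1 can be inspected without first decompressing it to disk. The following sketch is not part of our pipeline (WikiExtractor performs the real extraction); it simply streams the bz2 file with the Python standard library and prints the first few page titles. The local file path is an assumption.

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # file name from Sect. 4.1; path is an assumption


def localname(tag):
    """Strip the MediaWiki export namespace, e.g. '{...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path):
    """Stream (title, wikitext) pairs from the bz2-compressed XML dump."""
    with bz2.open(path, "rb") as fh:
        title, text = None, None
        for _, elem in ET.iterparse(fh, events=("end",)):
            name = localname(elem.tag)
            if name == "title":
                title = elem.text or ""
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                elem.clear()  # keep memory flat while streaming ~74 GB of XML


if __name__ == "__main__":
    for i, (title, _) in enumerate(iter_pages(DUMP)):
        print(title)
        if i == 9:  # just peek at the first ten pages
            break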
Fig. 1. Increase of the size of the English Wikipedia article text in gigabytes [15].

WikiExtractor output:

INFO: 62223554  Ekaterina Kuzmina
INFO: 62223557  Go Astro Boy Go!
INFO: 62223606  V Gate, Queensland
INFO: 62223626  Statue of Maurice J. Tobin
INFO: 62223631  Geoffrey W. Lewis
INFO: 62223643  Beilba, Queensland
INFO: 62223655  Pony Hills, Queensland
INFO: 62223660  Eumamurrin, Queensland
INFO: 62223661  Rachel Fairburn
INFO: 62223662  1974 New Mexico Lobos football team
INFO: 62223678  Durham Downs, Queensland
INFO: 62223683  Deputy Minister of Domestic Trade and Consumer Affairs
INFO: 62223685  Hungarian Revolution Memorial
INFO: 62223690  Myosotis afropalustris
INFO: 62223692  1975 New Mexico Lobos football team
INFO: 62223702  Mahendra Prakash Singh Bhogta
INFO: 62223722  Naval Base Commander, Clyde
INFO: 62223723  North Jackson High School
INFO: Finished 3-process extraction of 5963484 articles in 6402.6s (931.4 art/s)
INFO: total of page: 10469833, total of article page: 5963484; total of used article page: 5963484

4.3 Sanitizing Data Further

Once we have the initial clean data separated into 1 MB files (folder name: text), we proceed to do additional cleaning in order to prepare to generate our n-grams. As mentioned in previous research, "There is also the issue of phrases involving special words where ideally rather than representing something like 'born
in 1993' we could represent it as 'born in YEAR' as otherwise correcting text with a 3-gram such as 'born in 1034' where born has an error could prove hard as the frequency of that specific 3-gram with that year may be low or non-existent. A similar idea was attempted at the University of Ottawa with promising results [16]" [10]. It is useful to combine special terms to increase their otherwise rare frequency. We have applied this concept by creating two versions of our corpus: one with numbers left intact, and another with all exclusively numeric tokens converted into the keyword NUM, and all alphanumeric tokens such as 'A25' converted into ANUM. As we will see later, the NUM token became the fifth most common token in our unigrams with over 63 million instances, and ANUM the thirteenth most common with a unigram frequency of 17 million.

In addition to this, we also lowercased all text in order not to create separate tokens for the same word with different casing. We then removed several non-ASCII characters, such as math symbols and other foreign-language alphabets that appear in the English Wikipedia, e.g., when an article includes the name of a foreigner written in their native language along with the English equivalent. We also removed punctuation such as quotes, parentheses, brackets, commas, periods, colons, and other non-alphabetical characters appearing in the ASCII table. Most of these we removed entirely, but some we replaced with a space, which essentially splits a token into two. The clearest case of this is hyphenated and underscored words, like 'California-Arizona', where that token is split into the tokens 'California' and 'Arizona'. This concept is popular when recognizing acronyms and works very well [17]; however, in some instances it may also decrease the importance of the hyphen, or may cause problems when an OCR word has a hyphen in it, as the hyphen may be removed entirely by some software, while in others it is critical to correcting the word [18].

Finally, certain symbols like '+', '=', and '%' are replaced with their English meanings, such as '+' being converted into 'PLUS' (uppercase). This allows the n-gram corpus to be made strictly of alphabetical characters, or alphanumeric characters in the version with numbers.

As one can see, the goal of sanitizing our data in this step is to try to combine tokens that are similar, to more accurately reflect their frequency in the corpus. When this step is complete, we have a folder named cleantext that contains the text ready to be parsed. This step took 2:41:07 to process 2,184,219,778 words and was done in wikisanitize.py.
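The following is a minimal sketch of the token normalization just described. It is not wikisanitize.py itself, and the exact symbol map and punctuation handling are illustrative assumptions (only the '+' to PLUS mapping is stated above).

import re

# '+' -> PLUS is described above; the names for '=' and '%' are assumptions.
SYMBOL_WORDS = {"+": " PLUS ", "=": " EQUALS ", "%": " PERCENT "}


def sanitize(text):
    text = text.lower()
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    text = text.replace("-", " ").replace("_", " ")    # split hyphenated/underscored tokens
    text = re.sub(r"[^\x00-\x7f]", "", text)           # drop non-ASCII characters
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)         # drop remaining punctuation
    tokens = []
    for tok in text.split():
        if tok.isdigit():
            tokens.append("NUM")                       # purely numeric token
        elif any(c.isdigit() for c in tok):
            tokens.append("ANUM")                      # alphanumeric token such as 'a25'
        else:
            tokens.append(tok)
    return tokens


print(sanitize("Born in 1993, near California-Arizona (A25 route), +5%"))
# ['born', 'in', 'NUM', 'near', 'california', 'arizona', 'ANUM', 'route', 'PLUS', 'NUM', 'PERCENT']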
4.4 Generating N-GRAMS

Generating n-grams in this step is quite trivial: it is merely a sliding window over the text that reads one word at a time, after which duplicate entries are combined to generate a frequency. The challenge comes in doing this in an efficient way that fits in main memory, in order to avoid a massive slowdown. The initial attempt was to generate a MySQL database that contained five tables (1gms, 2gms, 3gms, 4gms, and 5gms), with foreign keys to the 1gms table in order to ensure that all words found in any n-gram were contained in the unigram table. Each word, or token, was its own VARCHAR. In addition, each table had a column for the specific n-gram's frequency. After trying to insert the cleaned text into this SQL schema, we found that it took approximately 24 h to insert 100 MB of data. This was a no-go, and could be due to many reasons, such as indexing occurring as each n-gram is inserted. While we tried efficiency edits like batch inserting (to reduce I/O time), we found they made very little difference. Next we tried a bare-bones approach of reading in the data files one at a time and generating a list of n-grams in Python, unsorted, with duplicates not combined. We then wrote this list to a file.

Reading from Folder: AA to Folder: FD. The last file is 'cleanwiki_21'
Folder Path is 'cleantext/' So for example, opening as: 'cleantext/AA'
Folders to Process: 134
New Doc: id="12"
New Doc: id="25"
Words Processed: 10000, Words Inserted: 9991
New Doc: id="39"
...
New Doc: id="62223702"
New Doc: id="62223722"
New Doc: id="62223723"
Total words processed: 2201083528
Execution Time: 3:20:03
Finished wikigengms.py

Next we take that file and load it again, and then sort it using Python's default list sorting algorithm, an implementation of Timsort [19], on a machine with 500 GB of RAM. Timsort is a hybrid stable sorting algorithm derived from merge sort and insertion sort [20]. After we sort the list, as we write to the output files, we combine duplicate entries to generate frequencies; since the list is already sorted, we can easily count the frequencies as we work down it. Five files are generated (1gms to 5gms) with our final output.
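A condensed sketch of this generate, sort, and combine step is shown below, assuming the sanitized text has already been tokenized. The file names follow the 1gms-5gms convention, but the in-memory handling and the tab-separated output layout are simplifications rather than the exact behavior of wikigengms.py and the sorting script.

from itertools import groupby


def generate_ngrams(tokens, n):
    """Slide a window of size n across the token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def write_frequencies(ngrams, out_path):
    """Sort (Python's list.sort is Timsort), then count runs of identical n-grams."""
    ngrams.sort()
    with open(out_path, "w", encoding="ascii") as out:
        for gram, run in groupby(ngrams):
            out.write(f"{gram}\t{sum(1 for _ in run)}\n")


tokens = "the cat sat on the mat the cat sat".split()
for n in range(1, 6):
    write_frequencies(generate_ngrams(tokens, n), f"{n}gms.txt")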
The following is part of the logs generated when producing each of the final gms files. Each one runs independently of the others, and they were produced sequentially; however, this process could easily be parallelized to run all five at the same time.

Total ngrams processed: 2,160,185,080 (for each of the 5 files)

1gms.txt:
Size: 70 MB
Final ngrams in output file: 6,065,712
Duplicates: 2,154,119,368
Execution Time: 0:41:50

2gms.txt:
Size: 2.5 GB
Final ngrams in output file: 154,611,147
Duplicates: 2,005,560,611
Execution Time: 1:13:57

3gms.txt:
Size: 13 GB
Final ngrams in output file: 626,535,837
Duplicates: 1,533,622,599
Execution Time: 1:46:11

4gms.txt:
Size: 31 GB
Final ngrams in output file: 1,197,440,896
Duplicates: 962,704,218
Execution Time: 1:56:51

5gms.txt:
Size: 49 GB
Final ngrams in output file: 1,598,342,983
Duplicates: 561,788,809
Execution Time: 2:21:51

Total: 95 GB

5 Analyzing the Dataset

The entire process for the given Wikipedia XML file took approximately 33 h. Of this, 106.7 min were spent in the WikiExtractor and 31 h, 17 min in the rest; this is because the process is sequential for everything but the WikiExtractor part. This does not include decompressing the original XML. As we can see, the total size of our 1- to 5-grams is 95 GB. The unigrams are only 70 MB, since there are only 6,065,712 unique unigrams. The following is a list of the 30 most frequent unigrams.
the 157160236
of 73885711
in 64204312
and 63508387
NUM 63148510
a 46299864
to 45055271
was 27543755
is 21427402
for 18611616
on 18429944
as 18043296
ANUM 17008403
with 15898792
by 15718598
he 14020879
that 12617141
at 12275031
from 11685231
his 10994064
it 10357717
an 8610000
were 7269770
which 6629340
are 6607169
this 5697236
also 5608869
be 5439755
had 5282118
first 5169967

In addition to our previous observation on NUM and ANUM, we can see that the remainder of the list is made up of stop words. Even so, we can see how large the drop-off after the first word is, with 'the' having just over 157 million tokens and 'of' having less than half of that, with just under 74 million tokens. In addition, here are the three most frequent n-grams for each of the 2-grams, 3-grams, and 4-grams:

of the 22264226
in the 16136295
in NUM 12552882

in NUM the 1511740
in NUM and 1255827
one of the 1137647

from NUM to NUM 891927
a member of the 398556
between NUM and NUM 371341
We were able to obtain these frequencies by sorting each gms file by frequency to see the most common n-grams. Notice that we do not have a sorted 5gms.txt. That is because we sorted the files using the sort function but had to give it a custom key function to sort by frequency; the added complexity was too much for the 500 GB of RAM to manage, and we ran out of memory. We will discuss solving this issue later. The sorting of the 1- to 4-grams took 54 h (194,894 s). Most of this was spent on the 2- to 4-grams, since the 70 MB 1gms.txt took only 15 s to sort. Note that this time is strictly for taking the final output files from before, sorting them by frequency, and writing them to output files.

We also created a smaller dataset in which we removed all n-grams with a frequency of 3 or less. For this we took the sanitized text from before and only ran the sorting algorithm, and as we wrote the output to the file combining duplicates, as we did before, we excluded duplicates with a frequency of 3 or less from the output file. This run took 34 h, 52 min, and 22 s.

The final files were 23 MB, 392 MB, 1,009 MB, 1.2 GB, and 905 MB for 1- to 5-grams respectively. It is interesting that there were more 4-grams than 5-grams in this version. That is mostly because 5-grams are less common than 4-grams, so many of the 5-grams in our original experiment are occurrences with a frequency of less than 4.
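A hedged sketch of this frequency sorting and pruning follows. It assumes a tab-separated "n-gram<TAB>frequency" file layout, which is a simplification rather than the exact format of our output files, and it prunes from an already-counted file instead of during the combine step.

def sort_by_frequency(in_path, out_path, min_freq=1):
    """Sort a gms file by descending frequency, optionally dropping rare n-grams."""
    with open(in_path, encoding="ascii") as f:
        rows = [line.rsplit("\t", 1) for line in f.read().splitlines() if line]
    rows.sort(key=lambda r: int(r[1]), reverse=True)  # the custom key function
    with open(out_path, "w", encoding="ascii") as out:
        for gram, freq in rows:
            if int(freq) >= min_freq:
                out.write(f"{gram}\t{freq}\n")


sort_by_frequency("1gms.txt", "1gms_sorted.txt")              # full sorted copy
sort_by_frequency("1gms.txt", "1gms_pruned.txt", min_freq=4)  # drop n-grams with frequency <= 3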
6 Alternative Approaches to Improve Retrieval or Gms Generation

Our most immediate goal is to apply MapReduce to the process in order to be able to use Cloudera and Hadoop to parallelize it and finish much faster. This way we can continuously generate the n-grams for the latest XML dumps. In addition to that, we still intend to create an SQL database to make the data easily queryable via a web front end. Another interest is in making retrieval and queries faster by using data structures like tries or B-trees to make the n-grams faster to search.

6.1 The Experimental Idea

In this section, we explore a method for analyzing individual words and for exploring the relationships and connections between words. These relationships involve creating n-grams, allowing us to see which words tend to appear after others, or which words appear near each other. This process consists of building up individual words rather than pairs of adjacent ones. We also implement two different search tree data structures to search on the basis of comparisons between the search word (the first word in the n-gram), starting from the root of the current subtree, and the distribution of other words adjacent to the searched word. Currently, our approach stores n-grams of data (initially for n = {1, 2, 3}), with the ability to change which type of n-grams to use, and with the ability to query the stored data as quickly as possible. It should be noted that the requirements of this project will change in the future to include other n-gram lengths, without having to change the code much (or at all).

6.2 Trie

A trie [21-23] is a tree-like data structure based on key space decomposition, where data is only stored in leaf nodes. A trie node stores a character that corresponds to a certain string given by the path traced from the root to the node, and (typically) an array of pointers, one for each symbol of the English alphabet; therefore, the nodes of the trie have a maximum fan-out of 26. Words with the same stem (prefix) share the memory field corresponding to the stem. For example, consider the sentence "Car House Home". If we wanted to put these three words into a trie, it would look like Fig. 2.

Fig. 2. An example of a trie using arrays as transition tables.

Only the root node shows its 26 subtrees explicitly, with each subtree associated with a symbol in the alphabet. The traversal starts at the root node. Then, one by one, a character is taken from the given string to evaluate the next tree node to go to. Naturally, the node with the same character is chosen to walk. Note that every step of such walking consumes one character from the given string and goes down the tree step by step. If a leaf node is reached, then we arrive at the end for that string. On the other hand, if we get stuck at some node, whether because there is no path marked with the current character we are inspecting, or because the search stopped at an internal node, then the trie does not recognize that string.

Our implementation, however, is somewhat different from how a trie is usually described. We used an adaptive approach (similar to the data structures described in [24-27]) to optimize the space overhead, and redesigned the trie to use keys that are different types of arrays. In other words, instead of using an array of pointers to every possible symbol of the next character in the string, we use a hash array to determine the number of children a node needs to store, as opposed to how many it can possibly store, thereby reducing the memory cost
created by the standard trie data structure, as the number of children in a node may change.

The standard trie uses O(1) direct lookup at decision nodes with an array of transitions to other nodes indexed by symbols. Unfortunately, the space requirement is substantial, as memory needs to be allocated for each character in the search word. We use an array hash to store pieces of data that have a key (a sequence of characters used to identify a single word) and a possible value (a red-black based tree containing the information of the individual words that follow). In both cases, the complexity of creating a trie can be expected to be O(M × L), where M is the number of words in the tree, and L is the average length of a word. Similarly, searching for words requires us to perform L steps for each of the M words. Unlike the standard trie, the main difference here lies within the context of our leaf nodes. Figure 3 demonstrates the representation of a trie with an adaptive approach.

Fig. 3. An instance of a trie with its internal nodes, list of red-black trees, and outgoing links to n-grams (the example stores the unigram 'car', the bigram 'car home', and the trigram 'car home house', each with frequency 1).
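As a rough illustration of the adaptive idea, and not our actual implementation, the sketch below allocates children on demand in a per-node dict rather than a fixed 26-slot array, and keeps a follower-frequency map at each completed word; a plain Python dict stands in for the hash arrays and red-black trees described above.

class TrieNode:
    __slots__ = ("children", "is_word", "followers")

    def __init__(self):
        self.children = {}   # only the symbols actually seen, not all 26
        self.is_word = False
        self.followers = {}  # next word -> frequency


class AdaptiveTrie:
    def __init__(self):
        self.root = TrieNode()

    def _node(self, word, create=False):
        node = self.root
        for ch in word:
            if ch not in node.children:
                if not create:
                    return None
                node.children[ch] = TrieNode()
            node = node.children[ch]
        return node

    def add_sentence(self, sentence):
        """Insert each word and record which word immediately follows it."""
        words = sentence.lower().split()
        for i, word in enumerate(words):
            node = self._node(word, create=True)
            node.is_word = True
            if i + 1 < len(words):
                nxt = words[i + 1]
                node.followers[nxt] = node.followers.get(nxt, 0) + 1

    def bigram_frequency(self, first, second):
        node = self._node(first)
        return node.followers.get(second, 0) if node and node.is_word else 0


trie = AdaptiveTrie()
trie.add_sentence("Car House Home")
print(trie.bigram_frequency("car", "house"))  # -> 1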
We group words into various types of n-grams of length k using a red-black tree data structure at each leaf node. This makes the running time of basic operations, such as inserting and searching, O(log2(n)). We can, therefore, easily examine the most frequent words, and the words that come before or after, because their order depends on the height of the tree. This further enables us to know how often a sequence of n consecutive characters occurs in a sentence from a given word. Furthermore, rather than only analyzing the top few at a time, we can simultaneously investigate all of the relationships between words. This is because words are arranged into a network or graph. We refer to a graph here not as a diagram, but as a combination of linked nodes. Figure 4 shows an example of the resulting graph after inserting the words "Car House Home".

Fig. 4. This graph illustrates the relationships and connections between words.

To make our current design more useful, we account for nodes joining and leaving the network. The first step in the join code is to look up the predecessor (if one is present) of the new node using the normal lookup method of the standard trie. The new node is then inserted after its predecessor. To ensure that all lookups work without fail, the appropriate portion of the key (which we refer to as edge information) is copied to the new node's predecessor before the new node changes its next-node reference to point to another new node. The leave code is similar to the join code, except that we do the join process in the opposite direction.

6.3 Structure

The nodes of the trie at a given level can be viewed as siblings connected in a list; the trie in Fig. 5 can be structurally interpreted in this way. Each leaf node is a container consisting of a consecutive sequence of words, called n-grams. The leaves are themselves a type of red-black hash tree, whose nodes keep references to words sorted according to their natural ordering. These words come immediately after the ending of a word in a sentence, which is represented by a leaf node in the trie. Figure 6 shows the trie with the edge information that corresponds to the words that follow a leaf node. A leaf node, therefore, has the following additional information associated with it: a standard
array list (an array of transitions from every node) and references to branch nodes.

Fig. 5. A trie viewed as a left-child right-sibling binary tree, where vertical dotted lines represent sibling links.

Fig. 6. The edge information of a leaf node in the trie.

Since we store the edge information with branch nodes, it is important to consider this information as being associated with the edge coming from its parent into the branch node. The reason is that branch nodes are never roots; thus, we know which edges to follow, and when edges are followed while moving down the trie. For example, in Fig. 3, when moving from node r to node e (the closest e to r), we use a digit k, where k represents a contiguous sequence of k words from a given sentence, to determine which child of the subtree h is to be used. After obtaining the child in question, we then look for the next word in that child, which will give us another child, and so on. While this may seem to be a less efficient lookup algorithm, it is not, because a small k-value means that the constant factor will eventually dominate.

6.4 Construction

To insert a word into the trie, the branching position can be found by traversing the trie with the string, one character at a time. The node where there is no branch
to go to becomes the place to insert a new leaf node. For example, if we wanted to insert the word 'cat', we would insert a branch node between nodes a and r. Figure 7 shows the resulting subtree.

Fig. 7. The resulting subtree after inserting the word cat.

Conversely, creating an n-gram of length k requires using the array list as a transition table, in which the rows correspond to the nodes, and the columns correspond to the transition words. For a transition to a previous node, for example, the k-value is used as the starting index from the newly-created leaf node t (cat) to the previous node e (house). In order to go from e to t, we use an out-of-bounds separator to combine k-1 n-grams into a single string (e.g., "car|home|house", which are all the previous children in the trie). Using this string, we can determine the edge information belonging to the correct red-black tree, which we can then use to update the frequency of the n-gram. Figure 8 shows the transition from node t to node e.

Fig. 8. Each word corresponds to a node within the tree structure, and each outgoing (directed) edge represents a transition.
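To make the separator idea concrete, here is a small sketch, not our implementation, in which the preceding words are joined with '|' to form the edge-information key under which an n-gram's frequency is stored and updated; a plain dict again stands in for the red-black tree.

SEPARATOR = "|"


def update_ngram(edge_info, previous_words, increment=1):
    """previous_words, e.g. ['car', 'home', 'house'], keys the n-gram count."""
    key = SEPARATOR.join(previous_words)
    edge_info[key] = edge_info.get(key, 0) + increment
    return edge_info[key]


def ngram_frequency(edge_info, previous_words):
    return edge_info.get(SEPARATOR.join(previous_words), 0)


edges = {}
update_ngram(edges, ["car", "home", "house"])
print(ngram_frequency(edges, ["car", "home", "house"]))  # -> 1
print(ngram_frequency(edges, ["car", "house"]))          # -> 0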
When searching for a word, or a sequence of words, we use the edge information to terminate unsuccessful searches (potentially) before reaching a leaf node or falling off the trie. As with the standard trie, the search is done by following a path from the root. Suppose we want to search for the bigram "Car House". Since the skip value for the root node is always the empty string, we use the first letter (e.g. the letter 'c') in the string to determine which subtree to move to. When all of the characters in the string are used and a leaf node is reached, we come to the first word ("car") of the bigram. By examining the information of the outgoing edges stored in the leaf node r, we determine that, when making the move from node c to the node representing the first word, the search terminates if the trie does not contain any red-black tree whose key is equal to the edge information "car|house".

6.5 Performance

In this section we experimentally compare 1-grams, 2-grams, and 3-grams. Our primary focus is to see whether we achieve faster results, or results that require less memory, running sequentially. (We were not able to run anything higher than 3-grams, as it would consume more than 500 GB of memory.) This involves feeding the sanitized version of the text, cleantext, to the trie instead of to the n-gram generation, sort, combine, and output method.

1gms:
Runtime: 1,545 seconds
Memory Usage: 6,644 MB

1gms and 2gms:
Runtime: 5,341 seconds
Memory Usage: 32,273 MB

1gms, 2gms, and 3gms:
Runtime: 21,731 seconds
Memory Usage: 166,584 MB

One of the first points to make is that this approach is far more expensive in terms of memory than the working approach of sorting n-grams. However, the performance on the 3-grams may be a sign that this approach could be faster than the sorting method. This cannot be confirmed until we have enough memory to test it. We expect that the Hadoop implementation will be far faster either way, due to parallelization. It is worth mentioning that the ability to use this trie for retrieval is the true goal of our adaptive approach: to quickly query the n-grams for a frequency, or to look for predecessors or successors for NLP.

7 Conclusion and Future Work

Data is the clay from which a sculpture is made; without the sculpting, it is not beautiful or useful. In our case, we have taken Wikipedia's XML dump files,
cleaned them up, and generated a 1 to 5-gram corpus with the hope of supplementing larger n-gram corpora like the Google Web 1T corpus for NLP uses. We described our implementation and the final dataset. We hope to make this dataset accessible to anyone who wishes to query it, but we are also making it available in its entirety. Future work remains in this project, with the goal of improving the efficiency, in terms of space and time, of creating the dataset. It is not enough to have made the dataset once, as the key to its usefulness lies in its relevance, and that can only happen if it is constantly regenerated with newer XML dumps. Because of this, we will look into implementing a parallelized version of our implementation to speed it up using Hadoop's MapReduce, which lends itself ideally to combining the n-grams and sorting them either by frequency or alphabetically. Due to our ongoing goal to produce and encourage accessible reproducible research [5,28], our implementation source code, along with our ongoing experiments and results, is available in multiple repositories including Docker, Zenodo, and Git (search: unlvcs).

References

1. The Economist: The world's most valuable resource is no longer oil, but data. The Economist, New York, NY, USA (2017)
2. Brants, T., Franz, A.: Web 1T 5-gram version 1 (2006)
3. Brants, T., Franz, A.: Web 1T 5-gram, 10 European languages version 1. Linguistic Data Consortium (2009)
4. Wikipedia Contributors: Size of Wikipedia, the free encyclopedia (2019). https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia. Accessed 12 Dec 2019
5. Fonseca Cacho, J.R., Taghva, K.: Reproducible research in document analysis and recognition. In: Information Technology-New Generations, pp. 389-395. Springer (2018)
6. Artiles, J., Sekine, S.: Tagged and cleaned Wikipedia (TC Wikipedia) and its Ngram. https://nlp.cs.nyu.edu/wikipedia-data/. Accessed 12 Dec 2019
7. Wikipedia Contributors: Wikipedia, the free encyclopedia (2019). https://en.wikipedia.org/wiki/Wikipedia. Accessed 12 Dec 2019
8. Evert, S.: Google Web 1T 5-grams made easy (but not for the computer). In: Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop, pp. 32-40. Association for Computational Linguistics (2010)
9. Fonseca Cacho, J.R.: Improving OCR post processing with machine learning tools. Ph.D. dissertation, University of Nevada, Las Vegas (2019)
10. Fonseca Cacho, J.R., Taghva, K., Alvarez, D.: Using the Google Web 1T 5-gram corpus for OCR error correction. In: 16th International Conference on Information Technology-New Generations (ITNG 2019), pp. 505-511. Springer (2019)
11. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, vol. 10, no. 8, pp. 707-710 (1966)
12. Wikipedia Contributors: Wikimedia downloads, Wikipedia, the free encyclopedia (2019). https://dumps.wikimedia.org/backup-index.html. Accessed 12 Dec 2019
13. Wikipedia Contributors: Database download, Wikipedia, the free encyclopedia (2019). https://en.wikipedia.org/wiki/Wikipedia:Database_download. Accessed 12 Dec 2019
14. Attardi, G., Fuschetto, A.: WikiExtractor 2.75 [software], 4 March 2017 (2012). http://attardi.github.io/wikiextractor/. Accessed 12 Dec 2019
15. Häggström, M.: File: Wikipedia article size in gigabytes.png, Wikipedia, the free encyclopedia (2019). https://en.wikipedia.org/wiki/File:Wikipedia_article_size_in_gigabytes.png. Accessed 12 Dec 2019
16. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T 3-grams. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pp. 1241-1249. Association for Computational Linguistics (2009)
17. Taghva, K., Gilbreth, J.: Recognizing acronyms and their definitions. Int. J. Doc. Anal. Recogn. 1(4), 191-198 (1999)
18. Taghva, K., Stofsky, E.: OCRSpell: an interactive spelling correction system for OCR errors in text. Int. J. Doc. Anal. Recogn. 3(3), 125-137 (2001)
19. Peters, T.: Timsort description (2015)
20. Auger, N., Nicaud, C., Pivoteau, C.: Merge strategies: from merge sort to Timsort (2015)
21. De La Briandais, R.: File searching using variable length keys. In: Papers Presented at the March 3-5, 1959, Western Joint Computer Conference, pp. 295-298. ACM (1959)
22. Brass, P.: Advanced Data Structures, vol. 193. Cambridge University Press, Cambridge (2008)
23. Knuth, D.E.: The Art of Computer Programming: Vol. 3, Sorting and Searching, 2nd printing (1975)
24. Ferrández, A., Peral, J.: MergedTrie: efficient textual indexing. PLOS ONE 14(4), 1-19 (2019). https://doi.org/10.1371/journal.pone.0215288
25. Heinz, S., Zobel, J., Williams, H.E.: Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst. (TOIS) 20(2), 192-223 (2002)
26. Askitis, N., Zobel, J.: Redesigning the string hash table, burst trie, and BST to exploit cache. J. Exp. Algorithmics (JEA) 15, 1-7 (2010)
27. Bagwell, P.: Ideal hash trees. Technical report (2001)
28. Fonseca Cacho, J.R., Taghva, K.: The state of reproducible research in computer science (to appear)