Traditional introductory statistics courses tend to focus on analyzing data in a numeric format or on frequencies of categorical data (categorical information presented in numeric form). However, there is endless information stored and communicated through text. Humans are constantly bombarded with textual information in the form of news articles, e-mails and social media, among others. In most cases there is far more textual information available on any given subject than a human can process manually. As such, employing computers to assist in sifting through text is a valuable skill. Given notions such as intention and subtext, it may at first seem odd to ask a computer to interpret text. However, techniques have been developed to address these concepts, and they will be illustrated below.
This exercise is meant to offer a framework for introducing students to text mining techniques. To illustrate these concepts, we will select a literary work and walk through the process of extracting the text from an online source, decomposing it into characters, words, paragraphs and other units, and analyzing the resulting data set for attributes of interest. The hope is that, upon completing a similar exercise, a student will be equipped to extract, process and analyze online text for their own purposes.
As the motivation of this project is to dive into the world of text mining, it seems natural to choose a book or other literary work that is publicly accessible in electronic form. To help guide the analysis portion, a piece with some noteworthy feature is an optimal choice. The novel Gadsby by Ernest Vincent Wright is famous for being “A Story of 50,000 Words Without Using the Letter ‘E’”. The absence of the letter “E” offers an interesting quirk that can be verified using text mining techniques. In the process of testing this claim, a student will be required to separate the data set into words, characters or other groupings. Having the data already processed into an easily manipulated form offers a good starting point for further analysis, which is limited only by the student’s imagination.
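Once the text has been split into words, the verification itself is a one-liner. The sketch below assumes the words have already been extracted into a character vector; the sample vector shown here is purely illustrative.

```r
library(stringr)

# Illustrative stand-in for the vector of words extracted from the novel
words <- c("a", "story", "of", "fifty", "thousand", "words")

# Count how many words contain the letter "e" (after lowercasing);
# for Gadsby's body text we would expect this count to be zero
sum(str_detect(str_to_lower(words), "e"))
```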
Gadsby is available for public use on Wikisource at https://en.wikisource.org/wiki/Gadsby. We can analyze features of the novel such as word and character frequencies and other categorical attributes, along with the novel’s content. For these purposes, we will need data sets consisting of all the words and all the characters, structured so that they can be sorted and searched. We can use the tidytext package to organize the text in the tidy text format: tables with one token (usually a word) per row. Constructing a tidy data set facilitates easier manipulation, as we will then be able to perform tasks such as sorting and searching.
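As a brief sketch of what the tidy format looks like, the snippet below tokenizes a stand-in sentence (not yet the extracted novel text) with tidytext’s unnest_tokens(), once by word and once by character.

```r
library(dplyr)
library(tidytext)

# Illustrative data frame: one row per chapter, raw text in a column
raw_text <- tibble(
  chapter = 1,
  text = "A story without a common symbol of writing"
)

# unnest_tokens() produces the tidy one-token-per-row format,
# lowercasing the text and stripping punctuation along the way
tidy_words <- raw_text %>%
  unnest_tokens(word, text)

# The same function can tokenize by character instead of by word
tidy_chars <- raw_text %>%
  unnest_tokens(char, text, token = "characters")
```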
There are numerous packages and resources in R that can be used to extract information from webpages. Before selecting an approach, we should examine the source and look for structural features that can be used to automate extraction. Opening the Wikisource page and clicking through the chapters, we can see that the book is organized into 43 webpages, one per chapter, with URLs of the form “https://en.wikisource.org/wiki/Gadsby/Chapter_X”, where “X” is the chapter number. Given this clean URL structure, we can iterate through the chapters when extracting, as sketched below. However, the rendered page alone offers little insight into how to perform the extraction within that loop. As can be seen in Figure 1 below, the webpage looks very clean, but we need to examine the source code to identify how the data is stored and to search for any patterns that can be used to extract and clean it.
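Since the chapter URLs differ only in the trailing number, building all 43 of them programmatically is straightforward:

```r
# The chapter URLs differ only in the trailing number, so we can
# construct the full set of 43 with a single vectorized call
base_url <- "https://en.wikisource.org/wiki/Gadsby/Chapter_"
chapter_urls <- paste0(base_url, 1:43)

head(chapter_urls, 3)
#> [1] "https://en.wikisource.org/wiki/Gadsby/Chapter_1"
#> [2] "https://en.wikisource.org/wiki/Gadsby/Chapter_2"
#> [3] "https://en.wikisource.org/wiki/Gadsby/Chapter_3"
```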
Figure 1: Sample Wikisource Chapter Layout (Chapter 1)
Figure 2 below shows a small piece of the HTML source containing the text; the first few sentences are visible. However, the text is not stored cleanly: hyperlinks are embedded on specific words within the paragraphs, and text formatting tags appear as well. These features will complicate the extraction and cleaning process, as we must ensure that we do not capture these unwanted elements along with the text.
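One common option for handling this (an assumption on our part, since no package has been committed to yet) is the rvest package, whose html_text2() collapses embedded hyperlinks and formatting tags down to their visible text. A minimal sketch for a single chapter:

```r
library(rvest)

# Read one chapter page and pull the text out of every <p> node;
# html_text2() flattens links and formatting tags to plain text
page <- read_html("https://en.wikisource.org/wiki/Gadsby/Chapter_1")
paragraphs <- page %>%
  html_elements("p") %>%
  html_text2()
```

Note that selecting every <p> node may also catch paragraphs outside the story text (headers, notices), so some filtering of the results would still be needed.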