Bibliography

Ball, C. N. (1994).  Automated Text Analysis: Cautionary Tales. Literary and Linguistic Computing. 9, 295–302.

Ball begins this paper by arguing that it is the combination of online textual objects and computational analysis tools that gives scholars the ability to analyze vast amounts of data in a short time. However, Ball asserts that this ease of access does not mean that the results are always accurate or reliable. The balance of Ball's paper delves into some of the key issues with text analysis research: corpus design, recall, limitations of individual tools, and hidden variables. Despite focusing on the pitfalls of this research, Ball concludes by making clear that the aim is not to discourage this type of research but rather to promote awareness in order to better equip scholars.

Blei, D. M. (2013).  Topic Modeling and Digital Humanities. Journal of Digital Humanities. 2,

Blei begins his article with a savvy and concise definition of topic modelling: "a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus." Topic modelling, unlike other methods of data mining, does not require that the researcher identify the topics or categories in advance; rather, the algorithm uncovers this structure. Topic modelling is one mode of a larger field called "probabilistic modelling." Blei ends by sketching a potential humanities research scenario in which topic modelling would be a useful practice.

Brett, M. R. (2012).  Topic Modeling: A Basic Introduction. Journal of Digital Humanities. 2,

Brett's article aims to introduce and exemplify topic modelling tools. Categorizing topic modelling as a form of text mining, Brett presents it as a way of finding patterns in a corpus. In order to describe how topic modelling works, Brett offers this analogy: "imagine working through an article with a set of highlighters. As you read through the article, you use a different color for the key words of themes within the paper as you come across them. When you were done, you could copy out the words as grouped by the color you assigned them. That list of words is a topic, and each color represents a different topic." Brett lists the "ingredients" necessary to successfully use topic modelling: a large corpus, familiarity with that corpus, a tool designed for topic modelling, and the knowledge to understand your results. While not necessarily useful as evidence, Brett argues, topic modelling is a great discovery tool.
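
The highlighter analogy can be made concrete with a minimal Python sketch. It is an illustration of the analogy only, not of Brett's tools or of a topic modelling algorithm: the word list and color assignments are hypothetical and made by hand, whereas a topic modelling tool would infer such groupings statistically from the corpus.

```python
from collections import defaultdict

# Hypothetical hand "highlighting": each key word is paired with the color a
# reader chose for it. A topic modelling tool would infer these groupings
# statistically; here they are assigned by hand purely to mirror the analogy.
highlighted = [
    ("whale", "yellow"),
    ("ship", "yellow"),
    ("harpoon", "yellow"),
    ("sermon", "green"),
    ("chapel", "green"),
    ("sea", "yellow"),
    ("prayer", "green"),
]


def group_by_color(pairs):
    """Copy out the highlighted words grouped by color; each group is one 'topic'."""
    topics = defaultdict(list)
    for word, color in pairs:
        topics[color].append(word)
    return dict(topics)


topics = group_by_color(highlighted)
for color, words in topics.items():
    print(color, "->", ", ".join(words))
```

Each printed line corresponds to one "topic" in Brett's sense: a list of words that share a color.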

Meeks, E. (Submitted).  More Networks in the Humanities or Did books have DNA?. Digital Humanities Specialist.

Elijah Meeks begins by breaking digital humanities research into three "pillars": text analysis, spatial analysis, and network analysis. Homing in on network analysis, Meeks defines it as "useful for the modeling and analysis of relationships between a wide variety of objects." The most common network layout - force-directed - relies on three factors: the size of the node (which exerts repulsion), the strength of the connection (which draws nodes together), and gravity (which allows these factors to play against each other). The problem with force-directed graphs, Meeks points out, is the random and changing positions of nodes. Meeks argues that devising a more accurate conception of gravity will help to stabilize these visualizations.
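
The three factors described above can be sketched as a single update step of a toy force-directed layout. This is a minimal illustration under stated assumptions, not Meeks's method or any particular library's algorithm: the function name `layout_step`, the force formulas (inverse-square repulsion, linear attraction, linear pull toward the origin), and the `gravity` and `step` defaults are all hypothetical choices made for the example.

```python
import math


def layout_step(pos, size, edges, gravity=0.05, step=0.01):
    """Advance a toy force-directed layout by one step.

    pos:   {node: (x, y)} current positions
    size:  {node: float} node size, used as repulsive strength
    edges: {(a, b): float} connection strength, used as attractive strength
    """
    force = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)

    # Repulsion: every pair of nodes pushes apart, scaled by their sizes.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            push = size[a] * size[b] / dist ** 2
            fx, fy = push * dx / dist, push * dy / dist
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy

    # Attraction: connected nodes pull together, scaled by edge strength.
    for (a, b), weight in edges.items():
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        force[a][0] += weight * dx
        force[a][1] += weight * dy
        force[b][0] -= weight * dx
        force[b][1] -= weight * dy

    # Gravity: every node is pulled gently toward the origin.
    for n in nodes:
        force[n][0] -= gravity * pos[n][0]
        force[n][1] -= gravity * pos[n][1]

    return {n: (pos[n][0] + step * force[n][0],
                pos[n][1] + step * force[n][1]) for n in nodes}


# Usage: two connected nodes and one unconnected node, iterated 200 times.
positions = {"a": (-1.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 5.0)}
sizes = {"a": 1.0, "b": 1.0, "c": 1.0}
links = {("a", "b"): 1.0}
for _ in range(200):
    positions = layout_step(positions, sizes, links)
print(positions)
```

Because the result depends on the starting positions, a layout seeded at random will settle differently on each run, which is the instability Meeks describes; the sketch uses fixed starting positions so that it is repeatable.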

Van Hulle, D. (2004).  Textual Awareness: A Genetic Study of Late Manuscripts by Joyce, Proust, and Mann.

Van Hulle begins the introduction to this book by differentiating between writing as product and writing as process. The approach of genetic criticism emphasizes writing as a process that strives towards invention. Van Hulle argues that genetic criticism brings forth a way to enhance the author's awareness of the writing process rather than focusing on the "artificial results." By privileging the material evidence for the creative process, genetic criticism is linked to scholarly editing in the sense that both are concerned with representing texts. In this book, Van Hulle defines and applies genetic criticism to three modernist writers (Joyce, Proust, and Mann) in order to clarify and exemplify the applicability of the theory.

Zillig, B. L. Pytlik (2009).  TEI Analytics: converting documents into a TEI format for cross-collection text analysis. Literary and Linguistic Computing. 24, 187–192.

In theory, Pytlik Zillig argues, TEI-XML documents can be combined into a large corpus and searched across because they are written in the same mark-up language. However, in practice, most TEI-XML mark-up schemes are highly customized and are, therefore, not interoperable. In order to remedy this incompatibility and to expand the usefulness of TEI, Pytlik Zillig introduces MONK: an application with the goal of amalgamating digital projects into a single, searchable interface. MONK has developed a common TEI format - TEI Analytics - that is P5 compliant and helps to facilitate interoperability by relying on the projects' common denominators rather than their individual customizations. As Pytlik Zillig remarks in closing, TEI-A is all about "making patterns already present across large numbers of textual objects noticeable."

Muralidharan, A., & Hearst M. A. (2012).  Supporting exploratory text analysis in literature study. Literary and Linguistic Computing. fqs044.

In this article, Aditi Muralidharan and Marti Hearst discuss the text analysis tool WordSeer. Muralidharan and Hearst claim that WordSeer fills a specific gap in text analysis tools because WordSeer was designed with the specifics of literary research questions in mind. This stands in opposition to the appropriation of general text analysis software to answer literary questions. Muralidharan and Hearst conceptualize WordSeer as a tool that encourages scholarly queries, provides tools for both distant and close reading, and works within the exploratory process of reading, interpreting, and understanding. In order to demonstrate this, Muralidharan and Hearst present a case study using Shakespeare’s corpus and the research question “what are some things that are his and some things that are hers?”

Yu, B. (2008).  An evaluation of text classification methods for literary study. Literary and Linguistic Computing. 23, 327–343.

In this article, Bei Yu tests the accuracy and consistency of using computational algorithms to classify literary texts based on subject. Yu argues that even though algorithms have been critically examined in the past, the test data sets are almost always non-literary prose. Yu’s experiments set forth a new challenge in using poetry and fictional monographs to examine the usefulness of Naïve Bayes and SVM algorithms. The experiment looks at the classification of texts as erotic/non-erotic and sentimental/non-sentimental. Feature reduction, stopwords, and stemming are all taken into consideration across the study’s experiments.

Bradley, J. (2003).  Finding a Middle Ground between ‘Determinism’ and ‘Aesthetic Indeterminacy’: a Model for Text Analysis Tools. Literary and Linguistic Computing. 18, 185–207.

In this article, Bradley discusses the balance between the computer’s ability to solve complex, formal tasks and the humanities’ bias towards indeterminacy. Bradley argues that researchers in the humanities use computers for one of two distinct types of tasks: manipulating data in order to facilitate research or performing complex textual analysis. Bradley demonstrates how analog research techniques can be replicated and rejuvenated in digital practices. In conclusion, Bradley argues that the questions that remain are not whether there are digital environments able to facilitate traditional research practices but (a) whether scholars are willing to use them and (b) whether they are successful.

Green, H. E. (2013).  Under the Workbench: An analysis of the use and preservation of MONK text mining research software. Literary and Linguistic Computing. fqt014.

In this article, Green explores how scholars are using text-mining software by conducting a detailed study of user reports from the tool MONK (Metadata Offer New Knowledge). MONK is an NEH-funded, online humanities research tool that facilitates user-directed data mining. Green reviewed web analytics from 18 months of user interaction and conducted five interviews with researchers who used the tool in order to create a holistic understanding of how the tool was being deployed in practice. Green was interested in interrogating workflow, page direction, and user location. In general, Green discovered that MONK was being used as a gateway into more substantive research.

Savoy, J. (2013).  Authorship attribution based on a probabilistic topic model. Information Processing & Management. 49, 341–354.

In this article, Jacques Savoy tests the efficiency and reliability of using LDA (Latent Dirichlet Allocation) as a means of authorship attribution. Savoy opens the article by discussing the basic components of authorship attribution: a frequent word data set and a distance measurement. Noting that there is a limited number of test corpora for authorship attribution, Savoy compiles and describes the two test corpora used in his study: selections from the English-language newspaper the Glasgow Herald and selections from the Italian-language newspaper La Stampa. Savoy conducts authorship attribution tests on these data sets using Delta, chi-square, Kullback-Leibler, and Naive Bayes calculations as well as LDA. While LDA is generally used to categorize texts into topics, Savoy concludes that it can be useful for authorship attribution.

Pang, B., & Lee L. (2008).  Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval. 2, 1–135.

In this article, Pang and Lee address how the growing availability of opinion-rich resources has opened up new opportunities for information-gathering that reveal what people think. Personal blogs and review sites have motivated a "sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text." Pang and Lee's survey covers techniques and approaches oriented towards information-seeking systems. With a focus on method, Pang and Lee address the challenges raised by this type of research. In conclusion, Pang and Lee suggest topics for future work and discuss available resources.

Roberts, C. W. (2000).  A Conceptual Framework for Quantitative Text Analysis. Quality and Quantity. 34, 259–274.

In this article, Roberts provides a history and critical exploration of quantitative text analysis methods. Roberts seeks to "provide long-needed structure on a wide spectrum of text analysis methodologies" - something that he sees as lacking or being misrepresented in current scholarship. The objective of Roberts' article is to aid researchers in selecting the appropriate text analysis method for their project. Roberts discusses such examples as contingency analysis, instrumental analysis, thematic analysis, semantic analysis, and network analysis. By way of conclusion, Roberts cautions against common scholarly pitfalls in conducting quantitative text analysis, such as neglecting context and misinterpreting the data.

Rockwell, G. (2003).  What is Text Analysis, Really?. Literary and Linguistic Computing. 18, 209–219.

In this article, Rockwell explores the role of tools in text analysis. Rockwell asserts that our current conception of what a text is and how to explore it is often incongruent with the types of tools designed for analysis. Because of this, computational tools have not had an impact on the literary community. Rockwell argues that, like concordancing tools of decades past, computational tools should facilitate an interaction with the text that spurs new questions about the object. Rockwell argues that, ideally, tools will be built on the principle that good research is disciplined play. Rockwell discusses TAPoR as an example of an environment that encourages exploration. As a playpen or laboratory, TAPoR brings together groups to discover texts.

Sinclair, S. (2003).  Computer‐Assisted Reading: Reconceiving Text Analysis. Literary and Linguistic Computing. 18, 175–184.

In this article, Sinclair discusses the still-fraught relationship between the humanities and computing. Sinclair argues that scholars still see the sciences and humanities as opposing paradigms and that in order to rectify this relationship we must begin to see the computer as a resource for all types of scholarship. Sinclair encourages researchers to see computers as facilitating reading, exploration, and play when it comes to text analysis. He concedes that many of the current text analysis tools do little to convince skeptics that "good humanities research can be done with a computer." Sinclair introduces HyperPo as a model computational tool and as a case study to demonstrate how a tool can facilitate meaningful textual criticism. To conclude, Sinclair encourages scholars to explore texts with a bias towards serendipity - especially in using computational techniques.

Goldstone, A. (2012).  What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?. Journal of Digital Humanities. 2,

In this article, Ted Underwood and Andrew Goldstone discuss the methodology and results of conducting topic modelling on the PMLA journal. Underwood, Goldstone, and three other colleagues set out on this literary experiment in hopes of gaining a fuller understanding of the evolution of the discipline across time. In their explanation of the various topic patterns presented in the PMLA corpus, Underwood and Goldstone suggest that there is merit in visualizing topic models as networks. A network allows topics and vocabulary to cross between each other, creating an interconnected graphic. Finally, Underwood and Goldstone propose that topic models - while generally used to reveal what is being written about - can be used to detail how something is being written about: "topic modeling can identify discourses as well as subject categories and embedded languages."

Templeton, C. (2011).  Topic Modeling in the Humanities: An Overview. Maryland Institute for Technology in the Humanities.

In this blog post, Clay Templeton discusses topic modelling as a tool of "hermeneutic empowerment" insofar as the technique allows researchers to draw out the hidden structure of a corpus. Templeton's piece maps out the "genealogy of topic modelling in the humanities" by discussing key features and research initiatives related to this type of scholarship. Templeton acknowledges that topic modelling is often associated with "distant reading" because it isn't about the content of individual documents but rather about finding patterns among a group of documents. Templeton concludes his blog post by pointing outwards to scholarship that has successfully utilized topic modelling.

van Dalen-Oskam, K. (2012).  Names in novels: an experiment in computational stylistics. Literary and Linguistic Computing.

In this essay, van Dalen-Oskam explores onomastics - the study of names in literature - through a quantitative lens. In the past, onomastics has been studied qualitatively using a small corpus. However, in this research initiative, van Dalen-Oskam considers the stylistics, use, and function of names in literature across a 20-year time period. This essay interrogates a corpus of 64 Dutch and English novels. van Dalen-Oskam uses case studies, graphics, and numerical data to launch arguments about the effectiveness of her techniques. Overall, van Dalen-Oskam cites cultural transfer, corpus size, access to texts, and the lack of standards as the key issues she came up against in her research. By way of conclusion, van Dalen-Oskam suggests the creation of a customized, virtual research space to alleviate these issues.

Blei, D. M., Ng A. Y., & Jordan M. I. (2003).  Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022.

In this highly technical article written by computer science scholars, Blei, Ng, and Jordan discuss topic modelling textual corpora through LDA (latent Dirichlet allocation). This article begins by defining LDA and then moves through the steps necessary to carry out this type of analysis. Blei, Ng, and Jordan help their audience understand the common problems with LDA and propose troubleshooting solutions. The article concludes with an exemplar analysis, complete with illustrative figures.

Meeks, E., & Weingart S. (2012).  The Digital Humanities Contribution to Topic Modeling. Journal of Digital Humanities. 2,

In this introduction to a special issue of the Journal of Digital Humanities, Weingart and Meeks assert that this collection of essays aims to present "how to do topic modeling, what to use, its dangers, and some excellent examples of topic models in practice." Weingart and Meeks recount an abbreviated history of topic modelling: it arrived from computer science about 15 years ago, LDA originated in 2002-2003, and over the past several years topic modelling has gained momentum as the popular tool MALLET came to fruition and Stanford's work began to explode. With a hope to "make topic modeling more accessible for new digital humanities scholars", Weingart and Meeks present the journal in three sections: concepts, applications and critiques, and tools.