Textual Analysis

TextDNA

TextDNA is a free tool designed to facilitate large-scale analysis of linguistic data. Built on the Sequence Surveyor system, "TextDNA supports the comparison of ordered sets of linguistic data by visualizing the sequences as colored rows and elements within the set as colored blocks within each row". TextDNA identifies patterns within a dataset, enables comparison across corpora, and is compatible with the data stored in Google N-Grams.
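
TextDNA itself is a visualization system, but the comparison it renders — how an element's position shifts between ordered sets — can be sketched in a few lines of Python. The function and word lists below are hypothetical illustrations, not TextDNA code.

```python
# Hypothetical sketch (not TextDNA's code): compare two ranked word
# sequences, as TextDNA's colored rows do, by tracking how far each
# word moves in rank between corpora.
def rank_shifts(seq_a, seq_b):
    """Map each word in seq_a to its rank change in seq_b (None if absent)."""
    pos_b = {word: i for i, word in enumerate(seq_b)}
    return {word: (pos_b[word] - i if word in pos_b else None)
            for i, word in enumerate(seq_a)}

# Invented example data: top words from two hypothetical decades.
decade_1900s = ["the", "of", "and", "railway", "empire"]
decade_2000s = ["the", "of", "and", "internet", "empire"]
shifts = rank_shifts(decade_1900s, decade_2000s)
```

A word mapped to 0 holds its rank across both sets; a word mapped to None has dropped out entirely — exactly the kind of pattern a colored-block view makes visible at a glance.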

MONK (Metadata Offer New Knowledge)

MONK (Metadata Offer New Knowledge) is a digital environment for humanities scholars. It is designed to assist with the discovery and analysis of patterns within texts, incorporating full-text content from corpora such as ECCO, EEBO and Early American Fiction directly into the tool. The MONK Workbench is the primary environment, permitting users to create worksets, perform analytics and save their results. In addition, the MONK Workbench can be used in conjunction with the Flamenco faceted browser and Zotero. [Credit to TAPoR for this exceptional annotation]

PDF Extract

PDF Extract is an open source set of tools that "allow you to identify and extract the individual references from a scholarly journal article". PDF Extract uses the visual cues that an article's formatting provides to "identify semantically important areas of a PDF" and extract the appropriate material. PDF Extract was created to assist "small and medium-sized publishers to meet CrossRef's linking requirements and to participate in CrossRef's Cited-by service".
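
As a rough illustration of how formatting cues can mark a reference section — not PDF Extract's actual algorithm, and with invented names and data throughout — consider lines from a parsed page tagged with their font size:

```python
# Hypothetical sketch: use a visual cue (a "References" heading set in
# a larger font) to decide which lines of a parsed PDF page belong to
# the bibliography. This is an illustration, not PDF Extract's code.
def reference_lines(lines):
    """lines: list of (text, font_size) tuples from a parsed page."""
    in_refs = False
    refs = []
    for text, size in lines:
        if text.strip().lower() == "references" and size >= 12:
            in_refs = True            # heading detected via its larger font
            continue
        if in_refs:
            refs.append(text)
    return refs

page = [("Methods were sound.", 10),
        ("References", 14),
        ("[1] Smith, J. (2010). On PDFs.", 9)]
```

Real extraction also weighs position, indentation, and line spacing, but the principle is the same: layout carries semantics that plain text loses.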

Tesseract

Tesseract is a free raw OCR engine originally developed by HP Labs and now maintained by Google. It works with the Leptonica Image Processing Library, and is capable of reading a variety of image formats. It can convert images to text in over 40 languages. [Credit to TAPoR for this exceptional annotation]
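Tesseract is typically driven from the command line with an image, an output base name, and a language code. The sketch below builds that invocation from Python; the file names are placeholders, and the call runs only if the binary is actually installed.

```python
import shutil
import subprocess

# Hedged sketch: invoking the tesseract CLI from Python.
# "page.png" is a placeholder image; output goes to page.txt.
cmd = ["tesseract", "page.png", "page", "-l", "eng"]
if shutil.which("tesseract"):          # run only when the binary exists
    subprocess.run(cmd, check=True)
```

Swapping "eng" for another installed language code (e.g. "deu", "fra") selects one of Tesseract's other recognition languages.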

PhiloLogic

PhiloLogic is a "primary full-text search, retrieval and analysis tool" developed at the University of Chicago. Its developers note that the "wide array of XML data specifications and the recent deployment of basic XML processing tools provides an important opportunity for the collaborative development of higher-level, interoperable tools for Humanities Computing applications".
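
One staple of full-text retrieval tools of this kind is the keyword-in-context (KWIC) concordance. The following is a minimal stand-alone sketch of that idea, not PhiloLogic's implementation:

```python
import re

# Hypothetical sketch of keyword-in-context retrieval: return every
# occurrence of a keyword with a few words of context on either side.
def kwic(text, keyword, window=2):
    """List each hit with `window` words of context before and after."""
    words = re.findall(r"\w+", text.lower())
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            hits.append(" ".join(words[max(0, i - window):i + window + 1]))
    return hits

sample = "The whale surfaced. The crew saw the whale and shouted."
```

Production systems add indexing so that such lookups stay fast across millions of words, but the retrieval contract — hit plus context — is the same.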

Lexomics

Lexomics is text-mining software that applies computational techniques and statistical analysis to literary questions. It searches texts for word patterns and determines how different parts of a work relate to one another. The web-based Lexomics tools "enable you to 'scrub' (clean) your Unicode text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, tokenize with character- or word-ngrams or TF-IDF weighting, and choose from a suite of analysis tools for investigating those texts".
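
The scrub/cut/tokenize pipeline quoted above can be sketched in plain Python. The function names below are mine, not the Lexomics API:

```python
import re

# Hedged sketch of a Lexomics-style pipeline: scrub a text, cut it
# into fixed-size chunks, and build character n-grams.
def scrub(text):
    """Lowercase and strip everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def cut(words, size):
    """Cut a token list into consecutive chunks of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def char_ngrams(word, n):
    """Character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = scrub("Hwaet! We Gardena in geardagum...").split()
chunks = cut(tokens, 2)
```

Chunking matters because downstream statistics (word frequencies, cluster analysis) compare chunks against each other to see which parts of a work behave alike.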

Brat Rapid Annotation Tool

"Brat is a web-based tool for text annotation". Brat is designed particularly for structured annotation, where the annotation categories are fixed and well defined so that the results can support automated computer processing and interpretation.
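
Brat stores annotations in a plain-text standoff format alongside the source document: each text-bound annotation line carries an ID, a category, character offsets, and the annotated span. The parser below is a simplified sketch of reading that format, handling only text-bound ("T") lines:

```python
# Sketch of reading brat's standoff ".ann" format (simplified: only
# text-bound "T" annotations, ignoring relations and events).
def parse_ann(ann_text):
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):      # skip non-text-bound lines
            continue
        ann_id, meta, surface = line.split("\t")
        category, start, end = meta.split(" ")
        spans.append((ann_id, category, int(start), int(end), surface))
    return spans

# Invented example: two annotations over "Mary traveled to London."
ann = "T1\tPerson 0 4\tMary\nT2\tLocation 17 23\tLondon"
```

Because the offsets point back into the untouched source text, the original document never has to be modified — which is what makes the format friendly to automated processing.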

GeoNames

GeoNames is a massive geographical database that contains "over 10 million geographical names and consists of over 9 million unique features". GeoNames integrates place names in various languages with physical features (area, elevation, longitude/latitude) and country data (population, currency, postal codes, national flag). GeoNames is a collaborative project that encourages user participation by allowing users to "manually edit, correct and add new names using a user friendly wiki interface".
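
GeoNames also exposes its database through web services. The sketch below composes a query for the search service; the free tier requires registering a username, so "demo" is a placeholder to replace with your own:

```python
from urllib.parse import urlencode

# Hedged sketch: composing a query for the GeoNames search web service.
# "demo" is a placeholder username; register your own for real use.
params = {"q": "Edmonton", "maxRows": 1, "username": "demo"}
url = "http://api.geonames.org/searchJSON?" + urlencode(params)
# Fetching `url` (e.g. with urllib.request) returns JSON records with
# name, country, coordinates, and population fields.
```

Keeping the actual fetch out of the sketch avoids a network dependency; in practice you would pass `url` to `urllib.request.urlopen` and decode the JSON response.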

CATMA: Computer Aided Textual Markup & Analysis

CATMA is a "practical and intuitive tool for literary scholars, students and other parties with an interest in text analysis and literary research". CATMA facilitates efficient literary analysis by "helping perform many of the procedures [...] that normally have to be carried out entirely manually". CATMA's key features include advanced search within the text, visualization of the distribution of items of interest, the ability to analyze a whole corpus of texts in one step, easy toggling between modules, and freely definable Tagsets.
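
The "distribution of items of interest" feature amounts to counting hits across consecutive slices of a text. A minimal hypothetical sketch of that idea (not CATMA's implementation):

```python
# Hypothetical sketch: count a keyword's occurrences in equal
# word-count slices of a text, the raw data behind a distribution plot.
def distribution(text, keyword, segments=4):
    """Hit counts for `keyword` across `segments` slices of `text`."""
    words = text.lower().split()
    size = max(1, len(words) // segments)
    return [words[i:i + size].count(keyword)
            for i in range(0, len(words), size)]

profile = distribution("sea calls the sailor the sea answers him", "sea")
```

Plotting such a profile per chapter or per tag is what turns a manual skim into a quick visual check of where a motif clusters.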

Scrapy

Scrapy is "an open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way". Scrapy crawls sites and retrieves structured, useful data for purposes such as data mining, information processing, or historical archiving. It can extract data from nearly any website by letting users write their own Spiders: classes that define how to locate and retrieve a site's data.
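
In Scrapy itself, a Spider is a Python class (subclassing `scrapy.Spider`) with a `parse` method that turns responses into items. To keep the illustration self-contained, the sketch below shows the same crawl-and-extract idea using only the standard library — it is not Scrapy's API:

```python
from html.parser import HTMLParser

# Standard-library sketch (NOT Scrapy's API) of what a spider's parse
# step does: walk a page's HTML and pull out structured data, here
# every link target.
class LinkSpider(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                    # collect href attributes of anchors
            self.links.extend(v for k, v in attrs if k == "href")

page = '<html><body><a href="/about">About</a> <a href="/data">Data</a></body></html>'
spider = LinkSpider()
spider.feed(page)
```

A real Scrapy Spider additionally schedules the collected links as follow-up requests, which is what turns single-page extraction into a crawl.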