Building a Better Mousetrap – The Digital Humanists’ Manifesto

“Literary criticism is exemplary” (Ullyot and Bradley 144). This opening line from Ullyot and Bradley’s section of Shakespeare’s Language in Digital Media serves as the impetus for the Augmented Criticism Lab’s (ACL) sonnet database project. The central question driving the creation of the database is simple: how can technology enable definitive critical statements? Tools like the Natural Language Toolkit[i] (NLTK) exist to aid in the analysis of digital texts, but tools for storing and querying digital text are almost non-existent. Project Gutenberg[ii] stores a vast quantity of raw text in various formats; however, these texts are not encoded for programmatic analysis. With the Gutenberg corpus it is possible to examine an entire text against another entire text, but an examination focused on the specific contents of a given text is difficult, if not impossible, using Gutenberg alone. The Gutenberg corpus contains data (page headers, footnotes, and editors’ introductions) useless for digital analysis, and it lacks data (delimiters for chapters, sections, and lines; and titles for individual poems) required for an accurate computational analysis. The primary aim of the ACL sonnet database is to address these issues of specificity: rather than storing a vast quantity of raw text, the database is designed to store only the sonnets contained within a text, in a format that allows programmatic analysis. In other words, the database strives to gather and meta-tag all sonnets in the English language in a format designed to enable ‘definitive’ critical claims about its contents. Specificity of meta-data (data about data) is what separates the ACL sonnet database from other text-focused digital corpora. Other databases allow a user to search and obtain the contents of a single text, and some of the better databases allow a user to search the contents of many texts, but they lack a method to simultaneously ‘download’ the results of a search in a meta-tagged format for digital analysis. The EEBO corpus[iii] contains tens of thousands of digital texts, but unless a scholar is willing to download and meta-tag each document one by one, it is impossible to perform a holistic analysis of the corpus’s contents. The EEBO database is perfectly designed for a human user; the ACL sonnet database is designed with both humans and computers in mind.
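To make the problem with raw text concrete, consider the minimal cleanup a Project Gutenberg file requires before any analysis can begin. The Python sketch below strips the licence boilerplate surrounding the body text; the START/END marker lines are a Gutenberg convention whose exact wording varies between files, so even this trivial step requires per-text verification, and the result still lacks the chapter, section, and line delimiters a computational analysis needs.

    # A minimal sketch, assuming Gutenberg's conventional "*** START OF ..."
    # and "*** END OF ..." marker lines; their exact wording varies by file.
    def strip_gutenberg_boilerplate(raw_text: str) -> str:
        """Return only the body text between the START and END markers."""
        lines = raw_text.splitlines()
        start, end = 0, len(lines)
        for i, line in enumerate(lines):
            if "*** START OF" in line:    # licence header ends here
                start = i + 1
            elif "*** END OF" in line:    # licence footer begins here
                end = i
                break
        # The body is now free of Gutenberg's header and footer, but chapters,
        # sections, and individual poems remain undelimited.
        return "\n".join(lines[start:end])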

The primary difference between the ACL sonnet database and databases like the EEBO is the integration of a RESTful (Representational State Transfer) compliant public API[iv] (Application Programming Interface). This type of API is the standard protocol for moving structured data from one computer to another in a format both computers can understand. The ACL’s implementation of a RESTful API allows a user to perform queries on the database from a Python script, an R project, or another computer language without the use of the database’s browser-based website. However, all the features of the API are also available on the site itself: the API is a computational interface, whereas the site is a human interface. The results of a user’s programmatic query to the API are sent back to their computer as a JSON[v] (JavaScript Object Notation) formatted file their computer program can interact with natively[vi]. The ACL site’s various API endpoints[vii] allow a user to gather only the specific data (sonnets) they require for their analysis; the JSON formatted response allows a user to perform an analysis across a large data set without the need to first strip away important details about the data. For example, a user can perform an analysis on all the text (i.e. the lines of poetry) in the database and easily locate the specific author and title of each sonnet in their results. Rather than analyzing a corpus of many different texts condensed into a single text file, the use of JSON files allows an analysis of many texts without the need to first condense them into a single file. While this type of JSON-based API is extremely common on public websites, the academic world has yet to integrate them into its websites and databases. This lack of standardized digital corpora already presents a major problem to those working in the digital humanities: failure to implement open-source, publicly available, meta-tagged APIs for digital corpora will force any future project in the field to spend most of its available time and budget collecting and collating data. Moreover, the creation of non-public digital collections of texts inevitably leads to a massive amount of duplication: digital humanists will spend scarce research dollars collecting and tagging texts already digitized by another researcher. The ACL sonnet database is the first step towards an API-based open-source collection of text, and it is my hope that many similar projects will follow the ACL’s example. Modifying the code used to operate the ACL database to integrate forms beyond sonnets is a task of trivial difficulty; the future of the digital humanities relies on individuals and groups willing to adapt, maintain, and contribute to open-source software projects like the ACL. While those with coding skill are always important, those willing to encode and sanitize (make readable/regular) raw text are much more important to the field. Gathering a huge volume of raw text to analyze is only the first—and perhaps the easiest—step of an algorithmic analysis; the text is practically useless until it is tagged, sorted, cleaned, and regularized.
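As an illustration of such a programmatic query, the Python sketch below retrieves every sonnet from the /sonnets/all endpoint mentioned in the notes and reads the result as native Python objects. The JSON field names used here (lastName, title) are my illustrative assumptions, not the API’s documented schema.

    # A hedged sketch of a programmatic API query; field names are assumed.
    import requests

    response = requests.get("https://database.acriticismlab.org/sonnets/all")
    response.raise_for_status()
    sonnets = response.json()  # JSON arrives as a native list of dicts

    # Every result keeps its meta-data intact, so the author and title of
    # any line flagged by an analysis can be recovered immediately.
    for sonnet in sonnets[:3]:
        print(sonnet.get("lastName"), "-", sonnet.get("title"))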

Data regularization presents a problem to any project working with textual data, especially those working with texts from earlier periods. This problem seems straightforward, but digital analysis requires an attention to details a human reader can easily overlook. For example, how should spelling be dealt with when a non-standard spelling is important to the metre of the line? When Shakespeare uses an acute accent, “agéd” instead of “aged,” to add an extra syllable to a line, the database must somehow account for this variation. Moreover, when performing an analysis, should agéd and aged count as the same word? Can the problem be solved by simply allowing agéd = aged? If so, how many other uses of an acute accent would we have to map onto such a system of substitution? The problem cannot be solved by a simple substitution, nor can it be solved by simply removing the accented letter: in both instances the loss of essential data is too great. Furthermore, how should non-standard spelling be corrected? Which standard spelling should be used? British English? Canadian English? American English? How is ‘standard’ defined? It is easy to switch “neuer” to “never,” but what happens when the spelling change is not obvious? What about misspelled proper names? How should words like “amazeth” be lemmatized? Any extant database for lemmatization will need to include archaic tenses, or a newly created database will need to account for them. How should metrical contractions like “prick’d” and “imprison’d” be dealt with? Should we remove the metrical regularity by correcting the term to its dictionary spelling, or should we map all possible metrical contractions to some form of substitution scheme? If we choose the former, we lose the ability to algorithmically analyze the metre of the poem; if we choose the latter, we must create a database to correct for all such substitutions when performing a lemmatized analysis of word frequency. The solution is not simple, and I do not have answers to these questions. This serves as an example of how a seemingly simple problem becomes a hugely complex issue when one focuses on the details. Furthermore, these examples do not represent a comprehensive list of all the problems a large-scale text digitization project will face.
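To see how quickly essential data disappears, the sketch below applies the obvious quick fix, Unicode accent stripping, to one of the examples above. The accent is removed cleanly, but the extra syllable it marked vanishes with it; this illustrates the loss, it does not propose a solution.

    # A sketch of why naive normalization fails: stripping the acute accent
    # maps the disyllabic "agéd" onto "aged," silently corrupting any later
    # scansion of the line.
    import unicodedata

    def strip_accents(word: str) -> str:
        decomposed = unicodedata.normalize("NFD", word)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_accents("agéd") == "aged")  # True: the metrical marker is gone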

Next, I will address some of the solutions to specific problems I discovered while creating the ACL database, and, by extension, problems any similar database project undertaken in the digital humanities will face. First, what is meta-tagging, and what should any such tags contain? A meta-tag is, put simply, a piece of information related to another piece of information. For example, each sonnet in the ACL database is ‘tagged’ with the author’s first and last name, the sonnet’s title, where the sonnet came from, how many lines it contains, who added it to the database, when it was added to the database, and so on. However, determining what information each sonnet should be tagged with is not a simple undertaking. In the initial construction of the ACL database I relied on the Text Encoding Initiative’s (TEI) P5 Guidelines[viii] to determine the structure of the tags; the TEI guidelines form the de facto standard used by many large-scale text digitization projects. To create a valid TEI formatted XML file, the file must contain the author’s first and last name, the names of any editors and/or contributors, a description of the text’s source, a publication statement, and a title (The TEI Consortium). However, these standards are not specific enough for use in an augmented analysis. The next step was to determine the domain of the problem the database needs to solve. For this step, I asked my fellow English 523 students and Dr. Michael Ullyot what kinds of questions the database should answer. The common thread in these discussions came down to two questions: what is a sonnet? and is the sonnet a form or a genre? From these questions I was able to separate the domain of the problem into a set of elements (or features) the database needs to capture. The chosen meta-tags must aid in computational analysis and database queries. Moreover, the tags must be unique enough to prevent a sonnet matching an existing sonnet’s title, source, and author last name from being re-added. Based on these requirements, I chose the following meta-tags (a sketch of a complete record follows the list):

  • The author’s first name (optional).
  • The author’s last name (mandatory).
  • The year of initial publication (optional).
  • The title of the sonnet (mandatory; the sonnet’s first line is used when a title is not provided).
  • A timestamp of when the sonnet was added or last updated (automatic).
  • A publication statement of the sonnet’s publication rights (mandatory).
  • A description of the sonnet’s source (mandatory).
  • The username of the user who added the sonnet, or who last edited it (automatic).
  • The total number of lines in the sonnet (automatic).
  • The period of initial publication (mandatory).

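A single record carrying these tags might look like the sketch below; the field names and values are illustrative assumptions, not the database’s actual schema.

    # A hypothetical meta-tagged record; field names are illustrative only.
    sonnet_record = {
        "firstName": "William",                  # optional
        "lastName": "Shakespeare",               # mandatory
        "publicationYear": 1609,                 # optional
        "title": "Sonnet 18",                    # mandatory (or first line)
        "updatedAt": "2018-11-01T12:00:00",      # automatic timestamp
        "publicationStmt": "Public domain.",     # mandatory rights statement
        "sourceDesc": "Shake-speares Sonnets, Thomas Thorpe, 1609.",
        "addedBy": "jsmith",                     # automatic username
        "lineCount": 14,                         # automatic
        "period": "renaissance",                 # mandatory
    }
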
Optional meta-tags allow for some flexibility within the data, and required meta-tags enforce a regularity across the varied contents of the database. Without a baseline of regularity, deriving definitive results from the data set is impossible; without regularization, the database becomes a motley assortment of poems with no definitive context one can utilize in an analysis, defeating its entire purpose. For the database to definitively provide a user with, for example, all of Shakespeare’s sonnets, every sonnet in the database must contain the author’s last name. In addition, each last name must be added to the database in a consistent format. To illustrate this problem, Elizabeth Barrett Browning’s name could be added in two ways:

  • First name: “Elizabeth” | Last name: “Barrett Browning”
  • First name: “Elizabeth Barrett” | Last name: “Browning”

As this shows, regularization of inputs is extremely important. Entering information into the database requires a detailed consideration of every possible manifestation the data may present, not only the intuitive manifestations common to any given corpus. These regularized and specific meta-tags allow the database to produce valid TEI files while also enabling complex and specific analysis of the database’s contents. Consistent meta-tagging, in concert with a public API, is what separates the ACL database from other digital corpora like the EEBO database or Project Gutenberg.
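The uniqueness requirement described earlier depends on exactly this kind of input regularization. The sketch below assumes, per the requirements above, that a sonnet is identified by its title, source description, and author last name, and it normalizes case and whitespace before comparing so trivial input variations do not create duplicates; all names are illustrative, and in practice the check would live in a unique index in MySQL rather than in application code.

    # A sketch of the duplicate check; field and function names are assumed.
    def record_key(sonnet: dict) -> tuple:
        """Normalize the fields that identify an already-added sonnet."""
        normalize = lambda s: " ".join(s.lower().split())
        return (normalize(sonnet["title"]),
                normalize(sonnet["sourceDesc"]),
                normalize(sonnet["lastName"]))

    existing_keys = set()  # stands in for a unique index in MySQL

    def try_add(sonnet: dict) -> bool:
        key = record_key(sonnet)
        if key in existing_keys:
            return False   # rejected: title, source, and last name match
        existing_keys.add(key)
        return True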

The ACL database provides an augmented interface to enable augmented criticism; the EEBO and Gutenberg databases provide standard interfaces to enable standard criticism. The elusive ‘definitive’ criticism the digital humanities aims to produce requires the former, but only the latter exists today. This presents an existential problem to any pursuit in the digital humanities: before any valid discussion around digital methods can begin, before any augmented tools can be built, such projects must first define and regularize the data set these methods and tools will utilize. In other words, definitive criticism must operate on a standard data set shared and validated across a broad spectrum of academic fields. Moreover, analysis of such a data set must use standardized tools and metrics designed by the same academics in their various analyses. In the pursuit of definitive criticism, an academic must produce repeatable results another academic can verify. Therefore, before any real work in the realm of digital humanities can begin, a shared data set must exist; and the only way such a data set can exist is if it is built and maintained by a community of academics. Before one can claim ‘all {x} equals {y},’ one must first collect all the {x} and ensure the resulting {y} is valid. Collecting all the {x} for a simple question like the one posed by English 523 (what is a sonnet?) presents a near-impossible problem. However, if this type of critical project is undertaken the way an open-source software project is managed, it becomes a collaboration of many people working separately toward similar goals. Perhaps the work of an academic digitizing and tagging the works of James Joyce would provide further data for another project on modernist literature; perhaps someone digitizing and tagging the works of unknown female poets would unintentionally provide profound insight for another scholar working on the feminist implications of the Renaissance. The point is simple: definitive criticism is only possible if the data set under analysis contains anything and everything remotely applicable to its claims. And the only way one could ever hope to create such a data set is through the massive, shared, open-source effort of many academics working together on different projects. In other words, we need many people following the same guidelines on the same database across many and varied projects before we can form anything resembling a definitive claim. Consequently, a comprehensive data set must always precede a definitive claim.

The technology to power a massive literary database like the one described above is the same technology used in large companies to power accounting, messaging, and other business-oriented tasks. The digital humanities does not need a new type of database technology or a new file standard for encoding and sharing data; any attempt to build such a technology would inevitably reproduce another technology already deployed in another field. For this reason, I chose Spring Boot[ix] (part of the Spring Framework) to power the ACL database and website. While many other technologies could handle the ACL’s specific needs, I chose Spring Boot because of my own familiarity with it, its wide corporate and open-source adoption, the availability of accurate documentation, and the availability of developers experienced in its use. Furthermore, the Spring project is open source and can be used without any special licensing or branding requirements. Spring enabled me to produce a working site quickly without the need for thousands of lines of boilerplate code[x] or abstract security considerations; Spring’s security and database modules are proven to work, whereas it would have taken me hundreds of hours to create and test code with similar functionality, and the resulting code would have been sub-par at best. The data is stored in a MySQL database using the InnoDB[xi] storage engine to ensure high-speed returns of the most commonly queried data. Search is currently handled by the Apache Lucene[xii] search engine, but I plan to migrate to Elasticsearch[xiii] (a more powerful search server built on Lucene) soon; Lucene alone does not provide the customization available in Elasticsearch, and the complex nature of the database requires a custom solution. Moreover, combining Spring and Elasticsearch enables easy expansion of the types, forms, and genres contained in the database without the need to start from the beginning and throw out existing code. For example, I could add another poetic form to the ACL database with fewer than 400 lines of code; an experienced developer could add a new poetic form to the database in less than a single day’s work. This is to say, a project in the digital humanities does not need to reinvent the wheel. Using existing frameworks allows a project to grow beyond a single academic and/or institution: a truly open-source project is developed in a way that allows developers unfamiliar with the project to quickly ‘get on board’ and contribute. For a large, multi-user project to succeed, its architecture must follow standard development practices and use widely available frameworks. When an open-source project is too esoteric for someone unfamiliar with it to quickly understand and contribute code, it is only open source in name; no one will want to contribute if they first need to spend many hours learning a project-specific nomenclature or design specification they cannot use anywhere else. Standards make everyone’s life easier, and there is no reason for the digital humanities to design a new standard when the current standard has already been shown to work. To ‘build a better mousetrap’ is not to ‘design a new mousetrap from scratch’: one should only change the elements specific to one’s use case and keep what already works in place.

In conclusion, the ACL sonnet database provides an example of how a large-scale text digitization project can use existing technology and standards to satisfy the needs of the digital humanities. It strives to improve upon existing human-centric online corpora by enabling programmatic access via a JSON-based RESTful API and standardized meta-tags. Rather than providing large blocks of raw text, the ACL database provides specific and regularized text with the important meta-tags intact. Moreover, the open-source nature of the code behind the database provides a template for other projects with similar aims. However, rather than building domain-specific databases, those in the digital humanities should focus on the creation of a collaborative general-purpose database of as many and varied texts as such a group can obtain. Any potentially definitive claim requires a vast corpus of supporting data, and it is nearly impossible for a single person or institution to gather, sort, tag, and regularize the volume of data such claims require. The digital humanities cannot thrive in the realm of individual critics making individual claims on a set of data judged applicable by an individual’s opinion; for the digital humanities to succeed, a massive collaborative effort must be undertaken to collect and regularize a massive body of literature. Without a comprehensive set of data to ground our arguments upon, we are simply extending our existing critical conceptions to include more data. For a truly definitive claim, our conceptions must include every piece of data with any potential to influence our claims. To build a truly definitive criticism, we must allow every piece of literature—irrespective of canon, creator, and critic—the same status. To create a definitive criticism, we must first remove the critic from the claim.

Works Cited

Jenstad, Janelle, et al., editors. Shakespeare’s Language in Digital Media. Routledge, 2018.

The TEI Consortium. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Edited by C.M. Sperberg-McQueen and Lou Burnard, 31 Jan. 2018, http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.

Ullyot, Michael, and Adam James Bradley. “Past Texts, Present Tools, and Future Critics: Toward Rhetorical Semantics.” Shakespeare’s Language in Digital Media, edited by Janelle Jenstad et al., Routledge, 2018, pp. 144–56.

Notes

[i] https://www.nltk.org/

[ii] http://www.gutenberg.org/

[iii] https://eebo.chadwyck.com/home

[iv] https://en.wikipedia.org/wiki/Representational_state_transfer

[v] https://en.wikipedia.org/wiki/JSON

[vi] A native format is one a computer program can use with no additional components (i.e. a file the program can read without installing any new software).

[vii] An endpoint is a URL (e.g. https://database.acriticismlab.org/sonnets/all) that performs an API function.

[viii] http://www.tei-c.org/guidelines/p5/

[ix] https://spring.io/projects/spring-boot

[x] Code that runs the ‘plumbing’ of a piece of software (e.g. HTTP handlers and database connectors).

[xi] https://en.wikipedia.org/wiki/InnoDB

[xii] https://lucene.apache.org/

[xiii] https://www.elastic.co/