Pirating Texts

Project Summary

In tracing the publication history, geographical spread, and content of “pirated” copies of Daniel Defoe’s Robinson Crusoe, Gabriel Guedes (Math, Global Cultural Studies), Lucian Li (Computer Science, History), and Orgil Batzaya (Math, Computer Science) explored the complications of working with a data set whose spelling and grammar changed drastically over three centuries, which posed new challenges for data cleanup. By questioning the effectiveness of “distant reading” techniques for comparing thousands of different editions of Robinson Crusoe, the students learned to weigh the appropriateness of computational methods such as doc2vec and topic modeling. Through these methods, the students began to ask: at what point does one start seeing patterns that are invisible at the human scale of reading (one book at a time)? While the project did not definitively answer these questions, it did open paths for further inquiry.

The team published their results at: https://orgilbatzaya.github.io/pirating-texts-site/

Themes and Categories

Year: 2018

Contact: Paul Bendich, Mathematics, bendich@math.duke.edu

Disciplines Involved: English, Literature, History, Geography, Visual & Media Studies

Project Lead: Charlotte Sussman

Project Manager: Grant Glass

This project aimed to develop better methods for humanities-based research by combining the open-ended nature of humanities projects with the methodological rigor of fields like statistics and computer science. Lucian Li noticed the potential for finding new ways to link these methods to the humanities: “The open-endedness gave us tremendous freedom to determine our modes of analysis and which parts of the data we would use.” Orgil Batzaya found it compelling to draw links between data insights and historical facts: “We looked at distributions of the concentration of publication in different countries and it was fun trying to link historical periods to peaks and troughs in publication.” Some of these links became strikingly clear, according to Gabe Guedes: “As for the final outcome, I was surprised to be able to see such a strong correlation between historical events and publication volume, to the point where you had very noticeable peaks when countries made substantial imperial forays.”
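The kind of publication-volume comparison the students describe can be illustrated with a minimal sketch. It is not the team's actual code: the records below are invented for illustration, and the function names (decade, counts_by_country_decade, peak_decade) are hypothetical. It only shows one simple way to count editions per country and decade and to find the decade where a country's count peaks.

```python
from collections import Counter

# Hypothetical edition records as (country, publication year).
# These values are invented for illustration and are NOT project data.
editions = [
    ("England", 1719), ("England", 1722), ("France", 1721),
    ("England", 1785), ("England", 1787), ("England", 1789),
    ("Germany", 1786), ("France", 1788),
]


def decade(year):
    """Round a year down to the start of its decade (1787 -> 1780)."""
    return year - year % 10


def counts_by_country_decade(records):
    """Count editions for each (country, decade) pair."""
    return Counter((country, decade(year)) for country, year in records)


def peak_decade(counts, country):
    """Return the decade with the most editions for a country.

    Ties go to the earliest decade; returns None if the country
    has no records.
    """
    per_decade = {d: n for (c, d), n in counts.items() if c == country}
    if not per_decade:
        return None
    return min(per_decade, key=lambda d: (-per_decade[d], d))


counts = counts_by_country_decade(editions)
print(peak_decade(counts, "England"))
```

A peak found this way is only a starting point: linking it to a historical event, as the students did, still requires reading the history behind the numbers.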

The team was directed and mentored by Grant Glass, a graduate student in the English Department at UNC-CH. Grant’s own research focuses on the question, what is a text? This project allowed Grant to begin to form the data structure for creating a new edition of Robinson Crusoe by understanding how thousands of copies are related to one another. The experience and insights took Grant by surprise: “I did not think that there was as much variance between the copies as there was. This new understanding of the text will help me describe how reading publics, publishers, and editors shape the text long after the author is gone.”

Related Projects

Understanding how to generate, analyze, and work with datasets in the humanities is often difficult without learning how to code. In humanities-centered courses, we often privilege close reading or qualitative analysis over other methods of knowing, but by learning some new quantitative techniques we better prepare students to tackle new forms of reading. This class will work with data from the HathiTrust to develop ideas about how large groups and different discourse communities thought of queens of antiquity like Cleopatra and Dido.

Social and environmental contexts are increasingly recognized as factors that impact patients' health outcomes. This team will have the opportunity to collaborate directly with clinicians and work with medical data in a real-world setting. They will examine the association between social determinants and risk prediction for hospital admissions, and assess whether social determinants bias that risk in a systematic way. Applied methods will include machine learning, risk prediction, and assessment of bias. This Data+ project is sponsored by the Forge, Duke's center for actionable data science.

Project Leads: Shelly Rusincovitch, Ricardo Henao, Azalea Kim

Project Manager: Austin Talbot

Producing oil and gas in the North Sea, off the coast of the United Kingdom, requires a lease to extract resources from beneath the ocean floor, and companies bid for those rights. This team will consult with professionals at ExxonMobil to understand why these leases are acquired and who benefits. This requires historical bid data to investigate what leads to an increase in the number of (a) leases acquired and (b) companies participating in auctions. The goal of this team is to create a well-structured dataset based on company bid history from the U.K. Oil and Gas Authority. These data will come from many different file structures and formats (tabular, pdf, etc.). The team will curate them to create a single, tabular database of U.K. bid history and work programs.

Project Lead: Kyle Bradbury

Project Manager: Artem Streltsov