Pirating Texts

Project Summary

In tracing the publication history, geographical spread, and content of “pirated” copies of Daniel Defoe’s Robinson Crusoe, Gabriel Guedes (Math, Global Cultural Studies), Lucian Li (Computer Science, History), and Orgil Batzaya (Math, Computer Science) explored the complications of working with a data set whose spelling and grammar changed drastically over the last three centuries, which posed new challenges for data cleanup. By asking how effective “distant reading” techniques are for comparing thousands of different editions of Robinson Crusoe, the students learned to weigh the appropriateness of computational methods such as doc2vec and topic modeling. Through these methods, the students began to ask: at what point does one start seeing patterns that are invisible at a human scale of reading (one book at a time)? While the project did not definitively answer these questions, it did provide paths for further inquiry.

The team published their results at: https://orgilbatzaya.github.io/pirating-texts-site/

Themes and Categories

Year: 2018

Contact: Paul Bendich, Mathematics, bendich@math.duke.edu

Disciplines Involved: English, Literature, History, Geography, Visual & Media Studies

Project Lead: Charlotte Sussman

Project Manager: Grant Glass

This project aimed to explore better methods for humanities-based research by combining the open-ended nature of humanities projects with the methodological rigor of fields like statistics and computer science. Lucian Li noticed the potential for finding new ways to link these methods to the humanities: “The open-endedness gave us tremendous freedom to determine our modes of analysis and which parts of the data we would use.” Orgil Batzaya found drawing links between data insights and historical facts compelling: “We looked at distributions of the concentration of publication in different countries and it was fun trying to link historical periods to peaks and troughs in publication.” Some of these links became strikingly clear, according to Gabe Guedes: “As for the final outcome, I was surprised to be able to see such a strong correlation between historical events and publication volume, to the point where you had very noticeable peaks when countries made substantial imperial forays.”

The team was directed and mentored by Grant Glass, a graduate student in the English Department at UNC-CH. Grant’s own research focuses on the question, what is a text? This project allowed Grant to begin to form the data structure for creating a new edition of Robinson Crusoe by understanding how thousands of copies are related to one another. The experience and insights took Grant by surprise: “I did not think that there was as much variance between the copies as there was. This new understanding of the text will help me describe how reading publics, publishers, and editors shape the text long after the author is gone.”

Related Projects

This team is part of an ongoing project dedicated to exploring how states and local communities responded to the causes of the 2007-09 Global Financial Crisis. Led by faculty from the Global Financial Markets Center at Duke Law, the Data+ team will analyze multiple states’ mortgage enforcement databases to gain a better understanding of how state regulators were, or were not, enforcing existing state law pertaining to mortgages leading up to the crisis. Our website has an example of what this will look like, as last year we analyzed North Carolina’s mortgage enforcement actions and displayed them by topic.

Project Lead: Lee Reiners

Nationally, there is a disproportionate number of children of color (African American & Latino) in the child welfare system. Durham County is no different. However, this problem has not yet been reviewed through the lens of data to formulate or implement possible solutions. Durham County Department of Social Services Child & Family Services would like to evaluate systems to identify where and how disproportionality and disparity are occurring. Is it occurring at the entry point of reporting child abuse and neglect? Is it occurring at the case decision? Is our reunification time different for African American children? Or does it take longer for a child of color to achieve permanence through adoption? Organizing the data to show us our “hot spots” would facilitate further discussion and focus on solutions to an age-old systemic problem.

Faculty Lead: Greg Herschlag

Project Lead: Jovetta L Whitfield

Student teams will develop a benchmark dataset and explore its efficacy in an in-house competition, where they will put innovative techniques such as machine learning to the test through a series of challenges. A team of students will develop benchmark data pertaining to network performance in the presence of intentional and non-intentional degradation, ranging from sensor failure and additive noise to adversarial interference. The students will analyze the baseline performance of the network and measure the performance of the degraded network with and without techniques that improve robustness. Students will have the opportunity to present findings to scientists & engineers from the Air Force Research Laboratory.

Faculty leads: Robert Calderbank, Vahid Tarokh, Ali Pezeshki

Client leads: Dr. Lauren Huie, Dr. Elizabeth Bentley, Dr. Zola Donovan, Dr. Ashley Prater-Bennette, Dr. Erin Trip

Project Manager: Suya Wu