In tracing the publication history, geographical spread, and content of “pirated” copies of Daniel Defoe’s Robinson Crusoe, Gabriel Guedes (Math, Global Cultural Studies), Lucian Li (Computer Science, History), and Orgil Batzaya (Math, Computer Science) explored the complications of looking at a data set that saw drastic changes over the last three centuries in terms of spelling and grammar, which offered new challenges to data cleanup. By asking questions of the effectiveness of “distant reading” techniques for comparing thousands of different editions of Robinson Crusoe, the students learned how to think about the appropriateness of myriad computational methods like doc2vec and topic modeling. Through these methods, the students started to ask, at what point does one start seeing patterns that were invisible at a human scale of reading (reading one book at a time)? While the project did not definitively answer these questions, it did provide paths for further inquiry.
The team published their results at: https://orgilbatzaya.github.io/pirating-texts-site/