Durham, NC — July 7, 2025 — As artificial intelligence tools grow more powerful, they’re also growing hungrier—for data. A team of Data+ students at Duke University is stepping in to better understand how AI companies gather that data from the internet, and what it means for the future of online content.
Through Duke’s Data+ summer research program, undergraduates Adonias Ketema, Hamza Ayfan, James Xiao, and Eva Aggarwal, along with graduate student Zini Yang, are working on a project called “Building Honeypots to Track AI Web Scrapers.” Their goal is to uncover how and when AI bots (automated programs that scan websites for information) collect data from the web.
To do this, the team is creating “honeypots”—fake web pages designed to attract and track these bots. By analyzing how and when the bots interact with these pages, the students can learn more about how quickly scraped content ends up in AI models, and whether those models can tell the difference between real and fake information.
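For readers curious what tracking bot visits to a honeypot can look like in practice, here is a minimal sketch in Python. It is not the team's code. It assumes the web server writes access logs in the common Combined Log Format, that honeypot pages live under a hypothetical `/honeypot/` path, and that crawlers identify themselves through user-agent tokens such as `GPTBot`, `ClaudeBot`, or `CCBot`. The sample log lines and addresses are invented for illustration, and because user-agent strings can be spoofed, a real analysis would likely need additional signals.

```python
import re
from collections import Counter

# Hypothetical values for illustration only; not taken from the project.
HONEYPOT_PREFIX = "/honeypot/"
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot")

# Combined Log Format:
# host ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def honeypot_hits(lines):
    """Return one record per request that touched a honeypot page."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        if not match.group("path").startswith(HONEYPOT_PREFIX):
            continue  # ordinary page, not a honeypot
        agent = match.group("agent")
        # First known crawler token found in the user-agent, or None.
        crawler = next(
            (t for t in AI_CRAWLER_TOKENS if t.lower() in agent.lower()), None
        )
        hits.append(
            {
                "time": match.group("time"),
                "path": match.group("path"),
                "crawler": crawler,
            }
        )
    return hits


def count_by_crawler(hits):
    """Tally honeypot visits per crawler; unknown agents are grouped together."""
    return Counter(hit["crawler"] or "unidentified" for hit in hits)


# Invented sample data using reserved documentation IP ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [07/Jul/2025:10:15:32 -0400] "GET /honeypot/page-1.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [07/Jul/2025:10:16:01 -0400] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '192.0.2.9 - - [07/Jul/2025:10:17:45 -0400] "GET /honeypot/page-2.html HTTP/1.1" 200 498 "-" "CCBot/2.0"',
]

hits = honeypot_hits(SAMPLE_LOG)
print(count_by_crawler(hits))
```

The timestamp kept with each record is what would let an analyst ask when a bot visited, while the per-crawler tally addresses which bots showed up.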
“Web scraping is happening at an enormous scale, and it’s often invisible to the average person,” said Dr. Emily Wenger, the project’s lead researcher. “We’re trying to bring transparency to that process and understand its impact.”
The project is based in Duke’s Argus Lab and uses real web traffic data from Duke’s own websites. It also explores whether AI models can filter out low-quality or misleading content, or whether they are being trained on it unknowingly.
This work is part of a growing effort to make AI development more ethical and accountable, and to protect the integrity of online information.
Learn more about this project and others at our +Programs Poster Session on July 25 from 2:00 to 3:00 p.m. at the Gross Hall Energy Hub at Duke University! For a parking pass, please email ariel.dawn@duke.edu.
For more information, visit bigdata.duke.edu.
Media Contact:
Ariel Dawn
Rhodes iiD Communications Specialist
ariel.dawn@duke.edu
919-684-9312