Learning to Search More Deeply with Data+
Winston Henderson, Duke Alumni Association Board Vice President and founder of Sankofa, Inc., wants to solve a problem that affects many internet users: search results that do not account well for race, gender, or other traits of minority users. For example, a Google search on women’s hair will show mostly hairstyles for white women, while results for hairstyles for African Americans, Hispanics, or Asians will be much further down the result list. Sankofa, Inc., Mr. Henderson’s new company, works to address these cultural blind spots in technology with new tech solutions.
When Winston Henderson heard how the Information Initiative’s summer Data+ program addresses real-world problems by marshalling, analyzing, and visualizing data, he saw an opportunity to have a team of bright and motivated Data+ students tackle this question and explore possible solutions. Data+ offered a culturally diverse team and a setting suited to a problem that combines marketing, social science, and math, without confining the outcome too narrowly to the scope of the original problem. Specifically, could machine learning remedy some of the cultural blind spots found in web search tools? Going further, could search engines be operationalized to provide carefully curated results based on who the searcher is, creating a more useful, efficient experience for minority users?
During the summer of 2016, Sankofa, Inc. decided to sponsor the Learning to Search More Deeply Data+ project team in the Information Initiative at Duke to figure out how current search rankings work and whether the team could find ways to provide more relevant results. Professor Sayan Mukherjee from the Mathematics Department was the project lead, and Michael Lindon, a Ph.D. candidate in Statistical Science, was the project mentor for the group. Through a connection Winston Henderson has with North Carolina Central University’s FabLab, the project partnered with North Carolina Central University (NCCU). Two NCCU students, Samuel Watson (Physics) and Jarret Weathersby (Physics), joined the Data+ program to work on the project with Duke students Jennifer Du (Computer Science) and Weiyao Wang (Math, Computer Science). During the 10-week program, this diverse team of Data+ students decided to focus on one area in the vast web of Google rankings: the algorithms used for Google Image Search.
The Learning to Search More Deeply team decided that, in order to understand Google’s search algorithms, they would first need to reverse-engineer some of them to see how web features are ranked and weighted. To do this, the group used a permutation probability model (the Plackett-Luce model) to learn how Google ranks images. Their machine learning algorithm learned the feature weights by treating a ranking as sampling without replacement and calculating the probability that an image appears at a certain rank.
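As a rough illustration of the idea, the sketch below scores a ranking under a Plackett-Luce model and fits feature weights by plain gradient ascent. It is not the team's code: the linear scoring function, the function names, the step size, and the three-image data set are all assumptions made for the example. At each rank, the probability of the chosen item is its exponentiated score divided by the sum over the items not yet placed, which is what sampling without replacement means here.

```python
import math


def item_scores(weights, features_in_rank_order):
    """Score each item as the dot product of the weights with its features."""
    return [sum(w * x for w, x in zip(weights, feats))
            for feats in features_in_rank_order]


def log_likelihood(weights, features_in_rank_order):
    """Log-probability of the observed ranking under a Plackett-Luce model.

    features_in_rank_order[k] holds the features of the item at rank k
    (0 is the top result).
    """
    scores = item_scores(weights, features_in_rank_order)
    total = 0.0
    for k in range(len(scores)):
        remaining = scores[k:]  # items not yet placed in the ranking
        m = max(remaining)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        total += scores[k] - log_norm
    return total


def fit_weights(features_in_rank_order, steps=200, learning_rate=0.1):
    """Estimate feature weights by gradient ascent on the log-likelihood."""
    n_features = len(features_in_rank_order[0])
    weights = [0.0] * n_features
    for _ in range(steps):
        scores = item_scores(weights, features_in_rank_order)
        gradient = [0.0] * n_features
        for k in range(len(scores)):
            m = max(scores[k:])
            exps = [math.exp(s - m) for s in scores[k:]]
            norm = sum(exps)
            for f in range(n_features):
                # Observed feature minus its expectation over remaining items
                expected = sum(
                    e * feats[f]
                    for e, feats in zip(exps, features_in_rank_order[k:])
                ) / norm
                gradient[f] += features_in_rank_order[k][f] - expected
        weights = [w + learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Hypothetical data: three images in observed rank order, two made-up features
observed = [[3.0, 0.2], [2.0, 0.9], [1.0, 0.5]]
learned = fit_weights(observed)
```

With all weights at zero, every ordering of three items is equally likely, so the log-likelihood equals minus the log of 3 factorial. A positive learned weight on the first feature indicates that larger values of it go with higher ranks in this toy data.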
Then, the team incorporated new data into the ranking algorithms. Since Google indexes less than 10 percent of all tweets, the group decided to incorporate Twitter into the rankings as a way to include public opinion on a subject. The group used the R package twitteR to scrape tweets from Twitter and the Python API Indico to perform sentiment analysis on them, which gives a percentage score of how positive a text is. Using this percentage score, the group was able to re-rank the top 80 shampoo brands provided by Ranker.com based on the mean sentiment score of each brand, weighting tweets more heavily in the mean. The goal was to create a seeded search method geared toward specific communities that combines web-scraped data, Twitter sentiment analysis, and research on minority-related sites.
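The re-ranking step can be sketched as a weighted mean, shown below under stated assumptions. The sentiment scores are taken as given numbers between 0 and 1 rather than fetched from any sentiment service, and the brand names, the split into tweet and non-tweet sources, and the tweet weight of 2.0 are all hypothetical; the article does not give the team's actual weighting.

```python
def weighted_mean_sentiment(tweet_scores, other_scores, tweet_weight=2.0):
    """Mean sentiment in which each tweet counts tweet_weight times as much
    as a score from another source. Returns None if there are no scores."""
    total_weight = tweet_weight * len(tweet_scores) + len(other_scores)
    if total_weight == 0:
        return None
    weighted_sum = tweet_weight * sum(tweet_scores) + sum(other_scores)
    return weighted_sum / total_weight


def rerank(brands, tweet_weight=2.0):
    """Return (brand, score) pairs sorted from most to least positive."""
    scored = []
    for name, sources in brands.items():
        score = weighted_mean_sentiment(
            sources["tweets"], sources["other"], tweet_weight
        )
        if score is not None:  # skip brands with no sentiment data
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sentiment scores (0 = negative, 1 = positive) for made-up brands
brands = {
    "Brand A": {"tweets": [0.9, 0.8], "other": [0.4]},
    "Brand B": {"tweets": [0.3, 0.5], "other": [0.9]},
    "Brand C": {"tweets": [0.6], "other": [0.6, 0.7]},
}
ranking = rerank(brands)
```

In this toy data, Brand A moves to the top because its tweets are strongly positive, even though its single non-tweet score is the lowest of the three brands.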
In conducting this research over a 10-week project period this summer, the Learning to Search More Deeply Data+ project team only dipped their toes in the sea of search ranking data and the ways that search ranking can be improved for minority users in the future. However, their new algorithms showed that results change considerably depending on sentiment analysis and on where the underlying sources come from. In addition, the team discovered large differences among minority populations (African American, Asian, Hispanic, etc.) that were not easily captured by the algorithms currently in use for search. Sankofa, Inc. was very pleased with the initial findings of this project and the potential to expand the technology being developed at Sankofa, Inc. to solve problems of cultural relevance in tech.
Sayan Mukherjee shared that one of the things he most enjoyed about this project was the question of how to get media to people better, and not just minorities. “Understanding how well big data is doing at understanding smaller subgroups in the population will improve search for everyone in the long-term,” Mukherjee explains. One of the biggest challenges of the project was understanding the client’s goals, translating them into a data problem, and creating a common vocabulary for the data the team was using. The biggest lesson from the project was in seeing how drawing information from different sources (like Twitter feeds) changed the information provided in a search. Team member Weiyao Wang has continued his machine learning research with an independent project this semester with Professor Mukherjee.
Paul Bendich, Associate Director for Curricular Engagement for the Information Initiative and Data+ program administrator, said that the group had to work on a very complicated social problem and had to use very difficult math and statistics for a solution. Grouping the Learning to Search More Deeply team with other Data+ project teams like the Geometry and Topology for Data and Eye Movement and Food Choice teams that were also doing serious math and statistics calculations fostered the exchange of skills and knowledge between project teams, which is one of the strengths of the Data+ program.
One question that arises from creating technology to improve search for individual users is that of privacy. To provide more relevant results to a user, search algorithms will pull more information about that user. While acknowledging there are definitely sensitivities and concerns, “every element of privacy is also an element of convenience and service,” explains Bendich. Winston Henderson shared that from a business perspective, the population deserves and could have a better product than what is currently out there, and the population also deserves tools that work for everyone, not just the “perceived majority internet user.” Sankofa, Inc. is working to provide those tech solutions. We look forward to hearing more about how Sankofa, Inc. translates this project into marketable tools for improving search for all internet users in the future!
To learn more about sponsoring your own Data+ project, please visit http://bigdata.duke.edu/data, or contact Ashlee Valente (Ashlee.firstname.lastname@example.org) or Paul Bendich (email@example.com) for more information.