Text Analysis of Political Speech

Project Summary

The data that students encounter in statistics courses are often limited to numeric, tabular data. However, text analysis is an exciting area of data science and statistics. This expedition introduces students to the idea of treating text as data frames of words and demonstrates how to perform basic analyses on bodies of text using R. Tweets from four candidates in the 2020 Democratic primary serve as the data, and the techniques demonstrated include comparisons of word frequencies, log-odds ratios for word usage, and pairwise word correlations.

Themes and Categories

Graduate Students: Becky Tang & Graham Tierney, Department of Statistical Science

Faculty: Maria Tackett (Sta 199); Fan Li (Sta 440)

Courses: Sta 199: Introduction to Data Science and Sta 440: Case Studies in the Practice of Statistics


Sta 199

During this Data Expedition, students were introduced to the concept of text as data. We used tweets from four candidates in the 2020 Democratic primary (Joe Biden, Kamala Harris, Bernie Sanders, and Elizabeth Warren) as a working dataset to demonstrate several text analysis tools. Our goal was to familiarize students with this area of data science in the hope that they would employ text analysis in their final projects. Students learned what sorts of questions one can ask of text data, such as:

  • What words/phrases does Author Y use the most?
  • What words are most distinctive of Author Y when compared to Author Z?
  • What is the overall sentiment associated with Document X?

To answer these kinds of questions, we introduced the following concepts:

  • Transforming text into a format suitable for statistical analyses
  • Processing and cleaning text data
  • Sentiment analysis
  • Comparing documents by word frequencies
  • Comparing authors by log-odds ratios of word usage
  • Examining pairwise correlations for bi-grams
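
The first steps in this list (transforming text into an analyzable format, cleaning it, and counting words) can be sketched in a few lines. The expedition itself used R; the Python below is only a language-agnostic illustration. The stop-word list, the function names, and the example texts are all assumptions made up for this sketch, not the course materials or real tweets.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real analyses use a fuller lexicon.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "for", "we", "is", "in", "our"}

def tokenize(text):
    """Lowercase the text, drop URLs, and split it into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z']+", text)

def tidy_tokens(documents):
    """Turn (author, text) pairs into one (author, word) row per token."""
    return [
        (author, word)
        for author, text in documents
        for word in tokenize(text)
        if word not in STOP_WORDS
    ]

# Made-up example texts standing in for tweets.
docs = [
    ("author_y", "We need health care for all. Health care is a right!"),
    ("author_z", "The economy needs jobs and the middle class needs a raise."),
]

rows = tidy_tokens(docs)   # one row per (author, word) occurrence
freqs = Counter(rows)      # how often each author uses each word

# Most frequent words for one author.
top_y = [(w, n) for (a, w), n in freqs.most_common() if a == "author_y"]
print(top_y[:3])
```

Storing one word per row keeps the author label attached to every token, so grouping, counting, and joining against a sentiment lexicon become ordinary table operations.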

We focused on comparisons in word usage and sentiment between Biden and Warren, and along the way we gave students examples of effective visualizations for conveying the findings. Students then spent the remainder of the class applying these methods to a second dataset consisting of the State of the Union addresses from Presidents Obama and Trump. Overall, the expedition exposed students to the challenges associated with text data as well as the interesting insights that can be gleaned through statistical text analyses. As a result of this expedition, some students in the course employed text analysis in their final projects.

Sta 440 

We first taught this data expedition to Sta 199 students earlier in the Fall 2019 semester. Because students in Sta 440 have a stronger statistical background, we incorporated more advanced topics for this expedition. Students were introduced to the idea of text as data, and we provided two datasets to demonstrate several text analysis tools. The first dataset consists of tweets from four candidates in the 2020 Democratic primary: Joe Biden, Kamala Harris, Bernie Sanders, and Elizabeth Warren. With this dataset, students were introduced to sentiment analysis and to comparing documents and authors by word frequencies, sentiment, and log-odds ratios of word usage.
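
As a rough illustration of the log-odds comparison mentioned above, the sketch below computes a smoothed log-odds ratio for every word used by either of two authors. It is written in Python rather than the R used in class, and the additive smoothing (`alpha`), the function name, and the word counts are assumptions chosen for illustration; the expedition may have used a different smoothing scheme.

```python
import math
from collections import Counter

def log_odds_ratios(counts_a, counts_b, alpha=1.0):
    """Log-odds ratio of each word's usage by author A relative to author B.

    Probabilities are smoothed by adding `alpha` to every count so that a
    word used by only one author does not produce an infinite ratio.
    """
    vocab = set(counts_a) | set(counts_b)
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words")
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    ratios = {}
    for word in vocab:
        p_a = (counts_a.get(word, 0) + alpha) / total_a
        p_b = (counts_b.get(word, 0) + alpha) / total_b
        # odds = p / (1 - p); the ratio is taken on the log scale.
        ratios[word] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return ratios

# Made-up word counts for two hypothetical authors.
author_a = Counter({"health": 4, "care": 3, "jobs": 1})
author_b = Counter({"jobs": 5, "economy": 2, "care": 1})

ratios = log_odds_ratios(author_a, author_b)
for word, value in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{word:8s} {value:+.2f}")
```

Positive values mark words that are more characteristic of the first author and negative values mark words more characteristic of the second, which makes the ratios easy to plot as a diverging bar chart.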

The expedition then shifted focus to the more complicated text analysis techniques of classification and topic modeling, demonstrated using transcripts of State of the Union addresses beginning with President Nixon’s 1970 address. In particular, the classification task of interest was recovering the speaker's political party from the paragraphs of a speech. We first employed logistic regression to predict political party, using sentiment analysis as a dimension reduction tool. Then, moving away from sentiment, we introduced a bag-of-words generative model for classification using Dirichlet, Multinomial, and Dirichlet-Multinomial distributions. Students saw that these methods were able to correctly classify Republican presidents from the text, but sometimes misclassified Democratic presidents. However, classification requires labeled data, labels are not always available, and classification does not tell us about the content of a document. Thus, we finished the expedition by introducing topic models using Latent Dirichlet Allocation (LDA).
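
One way a bag-of-words generative classifier of this kind can work is sketched below: each class gets a Multinomial word distribution with a symmetric Dirichlet prior, and a new paragraph is assigned to the class with the highest Dirichlet-Multinomial posterior predictive probability. This is a minimal Python sketch under stated assumptions (symmetric prior `alpha`, equal class priors, unseen words ignored), not necessarily the model specification used in class. The labels `party_a` and `party_b` and the training tokens are invented for illustration.

```python
import math
from collections import Counter

def fit_class_counts(labeled_docs):
    """Sum word counts per class from (label, tokens) pairs."""
    counts = {}
    for label, tokens in labeled_docs:
        counts.setdefault(label, Counter()).update(tokens)
    return counts

def log_predictive(tokens, class_counts, vocab, alpha=1.0):
    """Log Dirichlet-Multinomial posterior predictive of a document.

    The multinomial coefficient is omitted because it is the same for
    every class and so does not affect the comparison.
    """
    doc = Counter(t for t in tokens if t in vocab)  # ignore unseen words
    n = sum(doc.values())
    post_total = sum(class_counts.values()) + alpha * len(vocab)
    score = math.lgamma(post_total) - math.lgamma(post_total + n)
    for word, x in doc.items():
        a = class_counts.get(word, 0) + alpha  # posterior Dirichlet parameter
        score += math.lgamma(a + x) - math.lgamma(a)
    return score

def classify(tokens, counts, alpha=1.0):
    """Pick the class with the highest predictive score (equal class priors)."""
    vocab = set().union(*counts.values())
    return max(counts, key=lambda c: log_predictive(tokens, counts[c], vocab, alpha))

# Made-up training paragraphs, already tokenized.
train = [
    ("party_a", ["tax", "cuts", "border", "tax"]),
    ("party_b", ["health", "care", "climate", "health"]),
]
counts = fit_class_counts(train)
print(classify(["tax", "border"], counts))
```

Because the word distribution is integrated out rather than estimated, the Dirichlet prior acts as built-in smoothing: a word never seen in a class lowers that class's score without driving it to zero.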

This data expedition gave students an opportunity to see data analysis in action, and they were encouraged to perform text analysis for their final projects.


The Twitter data were scraped using the rtweet package in R. We obtained the 1,200 most recent tweets at the time of the data expedition from Joe Biden, Kamala Harris, Bernie Sanders, and Elizabeth Warren. In addition to the author and text, other variables include each tweet’s unique identifier, the timestamp of each tweet, the type of device that posted the tweet, and how many retweets and favorites a tweet garnered (as of October 2, 2019, the time of scraping). However, the analysis mainly focuses on the author and text. A benefit of this dataset is that each tweet is a complete body of text, allowing for easier identification of the sentiment in a tweet as compared to analyzing an entire book or speech. These data also demonstrate the challenge of working with short text, as tweets are constrained to 280 characters.

The State of the Union address data used during the in-class exercise were obtained from The American Presidency Project (https://www.presidency.ucsb.edu/). There are four columns in the dataset: president, year, a sentence from the address, and the line number of that sentence.

The following R packages were pre-loaded for the students: tidytext, tidyverse, stringr, scales, textdata, wordcloud, reshape2, lubridate, widyr, tidyr, igraph, and ggraph.
