Text Analysis of Political Speech

Project Summary

The data that students see in their statistics courses are often constrained to numeric and tabular data. However, there is an exciting field of data science and statistics known as text analysis. This expedition introduces students to the concept of treating text as data frames of words, and demonstrates how to perform basic analyses on bodies of text using R. Tweets of four Democratic candidates for the 2020 Primary are used as data, and demonstrated text analysis techniques in the expedition include comparisons of word frequencies, log-odds ratios for word usage, and pairwise word correlations.

Themes and Categories

Graduate Students: Becky Tang & Graham Tierney, Department of Statistical Science

Faculty: Maria Tackett (Sta 199); Fan Li (Sta 440)

Course: Sta 199: Introduction to Data Science and Sta 440: Case Studies in the Practice of Statistics


Sta 199

During this Data Expedition, students were introduced to the concept of text as data. We used tweets from four Democratic 2020 Primary candidates–Joe Biden, Kamala Harris, Bernie Sanders, and Elizabeth Warren–as a working dataset to demonstrate several text analysis tools. Our goal was to familiarize students with this area of data science with the hope that students would employ text analysis in their final projects. Students learned what sorts of questions one can ask of text data, such as:

  • What words/phrases does Author Y use the most?
  • What words are most distinctive of Author Y when compared to Author Z?
  • What is the overall sentiment associated with Document X?

In order to answer these flavors of questions, we introduced the following concepts:

  • Transforming text into a format suitable for statistical analyses
  • Processing and cleaning text data
  • Sentiment analysis
  • Comparing documents by word frequencies
  • Comparing authors by log-odds ratios of word usage
  • Examining pairwise correlations for bi-grams

We focused on comparisons in word usage and sentiment between Biden and Warren, and along the way we also gave students examples of effective visualizations to convey the findings. Students then spent the last bit of class applying these methods to a second dataset comprised of the State of the Union addresses from Presidents Obama and Trump. Overall, the expedition exposed students to the challenges associated with text data as well as the interesting insights that can be gleaned through statistical text analyses. As a result of this expedition, some students in the course employed text analysis in their final projects.

Sta 440 

We first taught this data expedition to Sta 199 students earlier in the Fall 2019 semester. However, students in Sta 440 have a stronger statistical background so we incorporated more advanced topics for this expedition. Students were introduced to the idea of text as data, and we provided two datasets to demonstrate several text analysis tools. The first dataset is comprised of tweets from four Democratic 2020 Primary candidates–Joe Biden, Kamala Harris, Bernie Sanders, and Elizabeth Warren. With this dataset, students were introduced to concepts such as sentiment analysis and comparing documents and text by word frequencies, sentiment, and log-odds of word usage.

The expedition then shifted focus to more complicated text analysis techniques of classification and topic modeling, demonstrated using transcripts from State of the Union addresses beginning with President Nixon’s 1970 address. In particular, the classification task of interest was recovering the political party from the paragraphs of a speech. We employed logistic regression using sentiment analysis as a dimension reduction tool to predict political party. Then moving away from sentiment, we introduced a bag-of-words generative model for classification using Dirichlet, Multinomial, and Dirichlet-Multinomial distributions. Students saw that these methods were able to correctly classify Republican presidents from the text, but sometimes misclassified Democratic presidents. However, classification requires labeled data and it is not always the case that labels are provided, nor does classification tell us about the content of a document. Thus, we finished the expedition by introducing topic models using the Latent Dirichlet Allocation (LDA).

This data expedition gave students an opportunity to see data analysis in action, and were encouraged them to perform text analysis for their final project.


The Twitter data were scraped using the rtweet() package in R. We obtained the 1,200 most recent tweets at the time of the data expedition from Joe Biden, Kamala Harris, Bernie Sanders, and Elizabeth Warren. In addition to the author and text, other variables include each tweet’s unique identifier, the timestamp of each tweet, the type of device that posted tweet, and how many retweets and favorites a tweet garnered (as of October 2, 2019, the time of scraping). However, the analysis mainly focuses on the author and text. A benefit of this dataset is that each tweet is a complete body of text, allowing for easier identification of the sentiment in a tweet as compared to analyzing an entire book or speech. These data also demonstrates the challenge of working with short text, as tweets are constrained to 280 characters.

The State of the Union address data used during the in-class exercise were obtained from The American Presidency Project (https://www.presidency.ucsb.edu/). There are four columns in the dataset: prsident, year, a sentence from the address, and the line number of that sentence.

The following R packages were pre-loaded for the students: tidytext, tidyverse, stringr, scales, textdata, wordcloud, reshape2, lubridate, widyr, tidyr, igraph, and ggraph.

Related Projects

This data expedition focused on the mechanisms animals use to orient using environmental stimuli, the methods that scientists use to test hypotheses about orientation, and the statistical methods used with circular orientation data. Students collected their own data set during the class period, performed hypothesis testing on their data using circular statistics in R, and aggregated their data to formally test the hypothesis that isopods orient with light using an RShiny online application.

This exercise served as a capstone to a series of four class sessions on orientation and navigation, where students read primary scientific literature that used circular statistics in their methods. This data exercise was used to give students the opportunity to collect their own data, discover why linear statistics wouldn’t be sufficient to analyze them, and then implement their own analysis. The goal of this course was to give students a better understanding of circular statistics, with hands-on application in forming and testing a hypothesis.

In this two-day, virtual data expedition project, students were introduced to the APIM in the context of stress proliferation, linked lives, the spousal relationship, and mental and physical health outcomes.

Stress proliferation is a concept within the stress process paradigm that explains how one person’s stressors can influence others (Thoits 2010). Combining this with the life course principle of linked lives explains that because people are embedded in social networks, stress not only can impact the individual but can also proliferate to people close to them (Elder Jr, Shanahan and Jennings 2015). For example, one spouse’s chronic health condition may lead to stress-provoking strain in the marital relationship, eventually spilling over to affect the other spouse’s mental health. Additionally, because partners share an environment, experiences, and resources (e.g., money and information), as well as exert social control over each other, they can monitor and influence each other’s health and health behaviors. This often leads to health concordance within couples; in other words, because individuals within the couple influence each other’s health and well-being, their health tends to become more similar or more alike (Kiecolt-Glaser and Wilson 2017, Polenick, Renn and Birditt 2018). Thus, a spouse’s current health condition may influence their partner’s future health and spouses may contemporaneously exhibit similar health conditions or behaviors.

However, how spouses influence each other may be patterned by the gender of the spouse with the health condition or exhibiting the health behaviors. Recent evidence suggests that a wife’s health condition may have little influence on her husband’s future health conditions, but that a husband’s health condition will most likely influence his wife’s future health (Kiecolt-Glaser and Wilson 2017).

Fluid mechanics is the study of how fluids (e.g., air, water) move and the forces on them. Scientists and engineers have developed mathematical equations to model the motions of fluid and inertial particles. However, these equations are often computationally expensive, meaning they take a long time for the computer to solve.


To reduce the computation time, we can use machine learning techniques to develop statistical models of fluid behavior. Statistical models do not actually represent the physics of fluids; rather, they learn trends and relationships from the results of previous simulation experiments. Statistical models allow us to leverage the findings of long, expensive simulations to obtain results in a fraction of the time. 


In this project, we provide students with the results of direct numerical simulations (DNS), which took many weeks for the computer to solve. We ask students to use machine learning techniques to develop statistical models of the results of the DNS.