A Machine Learning Approach to Characterizing Clusters in Turbulent Flow

Project Summary

Fluid mechanics is the study of how fluids (e.g., air, water) move and the forces on them. Scientists and engineers have developed mathematical equations to model the motions of fluid and inertial particles. However, these equations are often computationally expensive, meaning they take a long time for the computer to solve.


To reduce the computation time, we can use machine learning techniques to develop statistical models of fluid behavior. Statistical models do not actually represent the physics of fluids; rather, they learn trends and relationships from the results of previous simulation experiments. Statistical models allow us to leverage the findings of long, expensive simulations to obtain results in a fraction of the time. 


In this project, we provide students with the results of direct numerical simulations (DNS), which took many weeks for the computer to solve. We ask students to use machine learning techniques to develop statistical models of the results of the DNS.

Contact: Reza Momenifar (mohammadreza.momenifar@duke.edu) or Jonathan Holt (jonathan.holt@duke.edu)

Graduate Students: Reza Momenifar and Jonathan Holt, Department of Civil & Environmental Engineering

Faculty: Simon Mak, Department of Statistical Science

Course: "Machine Learning and Data Mining" (STA 325)


Statisticians and machine learning specialists are often asked to analyze data from obscure sources. No matter the source of data, analysts must be comfortable applying their skills to solve the client’s problem. This Data Expeditions course prepares students for the real world by asking students to analyze data from a field with which they have little experience: turbulent flow in fluids. Furthermore, this Data Expeditions course challenges students to interpret their results in order to gain an understanding of the behavior of fluids in turbulent flow. 

Students are first given a one-hour lecture introducing the data. We explain that fluid dynamics is a classic field of physics concerned with the motion of fluids. We discuss turbulence, a phenomenon most people know from airplane travel. Next, we explain why it is computationally expensive to model turbulence using direct numerical simulation (DNS). It would be much faster, we tell the students, if we had a statistical model of our data. We illustrate the concept of particle clustering in turbulence and explain how we used Voronoi tessellation analysis to identify clusters. We provide students with our dataset and ask them to return in three weeks with a proposed statistical model. Specifically, we ask students to model the first four moments of particle cluster size given three parameters: the Reynolds number, Stokes number, and Froude number.

During a two-hour follow-up session, students present their solutions to the TAs and professor. In addition to the presentation, students write a report on their findings. 

Guiding Questions

  1. What is a fluid? How does a fluid behave?
  2. What is turbulence? Where do we see turbulence in our everyday lives? How do particles move in turbulent flow, and why do they form clusters?
  3. What are the important properties of fluid and particles in particle-laden flows? 
  4. Why is direct numerical simulation important for the study of turbulence and particle dispersion in turbulent flow?
  5. Why is direct numerical simulation expensive?
  6. How can machine learning reduce computation time?
  7. How can machine learning provide insight into the behavior of particle motion?

The Dataset

The data were collected by Reza Momenifar as part of his doctoral thesis to investigate the properties of particle clusters in turbulent flow. The dataset was extracted from many three-dimensional numerical simulations performed in Reza’s Theoretical and Computational Fluid Dynamic Group. The simulations model the distribution of particles under idealized turbulence in a cubic box. Across the simulations, three independent control parameters representing the properties of the turbulent flow and the particles were varied. The particles’ positions and other dynamic properties of the flow fields (e.g., velocity) were stored. Voronoi tessellation analysis was then performed on the stored particle positions to identify particle clusters. Each simulation's clusters are summarized by the first four moments of the cluster size distribution.
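The pipeline described above (particle positions, then Voronoi cells, then clusters, then moments) can be illustrated with a deliberately simplified sketch. It is not the procedure used to build the dataset: it works in one periodic dimension rather than a 3D box, and the function names, the threshold rule (cells smaller than the mean cell count as clustered), the choice of raw rather than central moments, and the particle positions are all assumptions made for illustration.

```python
def voronoi_cell_lengths(positions, box_length):
    """1D periodic Voronoi cells: each cell spans the midpoints to both neighbours.

    Assumes at least two particles with positions in [0, box_length).
    """
    xs = sorted(positions)
    n = len(xs)
    lengths = []
    for i in range(n):
        left_gap = (xs[i] - xs[i - 1]) % box_length        # wraps around for i == 0
        right_gap = (xs[(i + 1) % n] - xs[i]) % box_length  # wraps around for i == n - 1
        lengths.append(0.5 * (left_gap + right_gap))
    return lengths


def cluster_sizes(lengths, threshold):
    """Group consecutive cells smaller than `threshold` into clusters.

    A cluster's size is the total length of its cells. For brevity this
    sketch does not merge a run that wraps across the periodic boundary.
    """
    sizes, current, in_run = [], 0.0, False
    for length in lengths:
        if length < threshold:
            current += length
            in_run = True
        elif in_run:
            sizes.append(current)
            current, in_run = 0.0, False
    if in_run:
        sizes.append(current)
    return sizes


def raw_moments(values, order=4):
    """Return the raw sample moments E[X^k] for k = 1..order."""
    if not values:
        raise ValueError("need at least one value")
    n = len(values)
    return [sum(v ** k for v in values) / n for k in range(1, order + 1)]


# Hypothetical particle positions in a unit-length periodic box.
positions = [0.05, 0.10, 0.12, 0.50, 0.55, 0.90]
lengths = voronoi_cell_lengths(positions, box_length=1.0)
threshold = sum(lengths) / len(lengths)  # assumed rule: smaller than the mean cell
sizes = cluster_sizes(lengths, threshold)
if sizes:
    print(raw_moments(sizes))
```

Small Voronoi cells correspond to regions of high local particle concentration, which is why a cell-size threshold can serve as a clustering criterion. A 3D version would need a computational geometry library to build the cells and compute their volumes.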

In this analysis the predictor variables are the fluid and particle properties (Reynolds number, Stokes number, Froude number). The response variables are the first four moments of the cluster size distribution. The students receive a dataset with 120 observations (rows).

In-Class Exercises

Reza and Jon first presented a lecture introducing the concept of turbulence and how turbulence manifests in everyday phenomena. The lecture began with background on the study of fluid dynamics. They then introduced direct numerical simulation (DNS) and explained why DNS is computationally expensive. Afterwards, they explained Voronoi tessellation analysis and its applications, particularly in identifying particle clusters. Finally, Reza described how he generated the dataset that students would use for their assignment.

After the lecture, Reza and Jon gave students their assignment. Students were told that they had three weeks to develop four statistical models, one for each of the four moments, using the sample dataset.

The course (STA 325) had weekly lab sessions, which provided a natural venue for students to ask questions about the assignment. The main issue that students had was scaling the variables. The ranges of some of the predictor and response variables were quite large; therefore, students had to think critically about how to scale these variables appropriately. Regarding the models themselves, students were quite comfortable using the R programming language to develop different types of models, including linear, generalized additive, and tree-based models. The students were particularly well suited to the assignment because they had just learned about these types of models in their regular course instruction.
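One common way to handle variables that span several orders of magnitude is to model them on a logarithmic scale. The sketch below is only an illustration of that idea, not what any student group actually did: students worked in R, whereas this uses plain Python, and the function names and the synthetic power-law data are assumptions introduced here.

```python
import math


def log10_scale(values):
    """Apply a base-10 log transform; all values must be strictly positive."""
    return [math.log10(v) for v in values]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic, made-up data following y = 3 * x^2, spanning four decades in x.
x_raw = [10.0, 100.0, 1000.0, 10000.0]
y_raw = [3.0 * x ** 2 for x in x_raw]

# On the log scale the power law becomes a straight line:
# log10(y) = log10(3) + 2 * log10(x).
intercept, slope = fit_line(log10_scale(x_raw), log10_scale(y_raw))
print(intercept, slope)


def predict(x_new):
    """Predict on the original scale by undoing the log transform."""
    return 10 ** (intercept + slope * math.log10(x_new))
```

Fitting on the log scale keeps the largest observations from dominating the fit, and predictions are mapped back to the original scale by exponentiating. The same transform can be applied before fitting generalized additive or tree-based models.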

Students presented their results during a two-hour presentation session. In addition to presenting their findings, students were asked to use their models to make predictions on a test set (data that includes the predictor variables only). Students submitted their predictions to the TAs, who determined which group had the most accurate models. Model performance was taken into account when assigning student grades on the assignment.
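The accuracy metric used to compare groups is not stated here. As a hedged illustration, the sketch below assumes root-mean-square error (RMSE) against held-out responses; the function names, group labels, and numbers are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rank_groups(predictions_by_group, actual):
    """Return (group, rmse) pairs sorted from most to least accurate."""
    scores = {group: rmse(preds, actual) for group, preds in predictions_by_group.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up held-out responses and submissions for one moment.
held_out = [0.10, 0.25, 0.40]
submissions = {
    "group_a": [0.12, 0.20, 0.45],
    "group_b": [0.10, 0.26, 0.41],
}
for group, score in rank_groups(submissions, held_out):
    print(group, score)
```

Because the four moments can differ greatly in magnitude, a comparison like this would typically be run separately for each moment, or on a log scale, so that one response does not dominate the ranking.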

After the final presentations, students reflected on their experience in the Data Expeditions project. Several students noted that the project felt like a real-life client engagement, similar to what they might experience at a consulting firm. Other students noted that they were able to directly apply the material learned in class to a novel dataset.

Below are images from student submissions:

[Figure: Moment I graphs]

[Figure: Interaction between Re and Fr graphs]


Source of the Data

Momenifar, M., Bragg, A. D. 2019. Local analysis of the clustering, velocities and accelerations of particles settling in turbulence. arXiv e-prints, arXiv:1908.00341.


assignment (PDF)

data-test (CSV)

data-train (CSV)

presentation_slides (PDF)

proposal (PDF)

rubric (DOC)
