Data Science Project Match
Matching students with data science research opportunities with Yale faculty.
Tuesday, August 30, 2022 3:00PM to 4:00PM @ Dunham Lab, Room 220
Yale livestream: https://yale.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=3b1ed239-a30e-4ef8-9569-aef500d479f9
Introduction:
Daniel A. Spielman, Sterling Professor of Computer Science; Professor of Statistics and Data Science, and of Mathematics
Project Presentations:
Hyojung Seo, Assistant Professor of Psychiatry and Neuroscience, Yale School of Medicine
hyojung.seo@yale.edu
https://medicine.yale.edu/profile/hyojung_seo
Understanding how the brain generates intelligent behavior via complex neural networks
Cognitive behavior is generated by coordinated activity across networks of neurons, but it remains poorly understood how complex spatiotemporal features of network activity mediate diverse elements of cognition. Exploiting recent advances in neurotechnology and computer science, we are interested in exploring new tools to analyze and model the neural activity underlying cognition. First, several statistical methods have been proposed to decompose and analyze high-dimensional population neural activity recorded simultaneously from many neurons. The first project aims to explore and assess these proof-of-concept methods within the dynamical systems approach by applying them to neural data recorded from diverse brain areas and under different behavioral contexts. The second project aims to use artificial neural networks and deep reinforcement learning to model how cognition, such as theory of mind, can emerge from a neural network interacting with other agents in simple games. Finally, as we plan to collect neural data from large-scale networks, we would like to explore network analysis methods to model and understand how distinctive cognitive and motivational functions are generated by the collective and interactive dynamics of the networks' constituent nodes, and how specific perturbations of the network dynamics could lead to the symptoms of psychiatric illnesses.
Forrest Crawford, Associate Professor, Biostatistics, Statistics & Data Science, Operations, EEB
forrest.crawford@yale.edu
http://www.crawfordlab.io
Dynamics of the January 6, 2021 insurrection at the US Capitol
On January 6, 2021, President Donald J. Trump led the “Stop the Steal” rally at the Ellipse outside the White House in Washington DC. During and immediately after the President’s speech, the crowd moved toward the US Capitol and breached the building’s security perimeter, and a riot ensued. Members of the crowd broke into the Capitol and attempted to disrupt the counting of electoral votes from the 2020 election. Hundreds of participants and police were injured, and at least 5 deaths occurred as a result of the riot; 884 individuals have been charged with crimes for their role in the events of January 6. The purpose of this project is to study the network and aggregate movement dynamics of participants during the Capitol riot using mobile device location data. Specifically, the analytic goals include understanding the flow dynamics of participants from the rally toward the Capitol and into secure areas, crowd density estimates within the riot zone, locations and times where participants breached the Capitol building, and proximity network patterns among subgroups of riot participants.
Claire Bowern, Professor of Linguistics
claire.bowern@yale.edu
https://campuspress.yale.edu/clairebowern/ or http://www.pamanyungan.net
Neural Network Classifier for Voynich Plant Illustrations
The Voynich Manuscript (MS 408) is a 15th-century cipher manuscript in Yale’s Beinecke Library. Two of its five sections include illustrations of plants and astrological diagrams. While there have been attempts to link the botanical illustrations to known plant species, the search space is large. This project builds on a prototype that matches Voynich illustrations with images from other medieval manuscripts. The prototype scrapes selected manuscript archives (e.g. the British Library and Bodleian Library collections), processes illustrations, and trains a neural network to classify the images and create a database of possible Voynich matches. I am looking to work with one or two students to scale up the prototype to full development. Experience with Python and neural network classification models is needed.
Amin Karbasi, Associate Professor of Electrical Engineering, Computer Science, and Statistics and Data Science
amin.karbasi@yale.edu
http://iid.yale.edu
Gaming the Learning
If you like games (e.g., chess) and statistical learning, then this might be of interest to you. Consider the task of learning an unknown concept from a given concept class; to what extent does interacting with a domain expert accelerate the learning process? It turns out the answer is hidden in better understanding the game between an adversary (that tries to deceive) and a learner.
Margaret S. Clark, John M. Musser Professor of Psychology, Head, Trumbull College Dean of Academic Affairs
margaret.clark@yale.edu
https://clarkrelationshiplab.yale.edu
Emotional dynamics in close relationships
People experience and express (or suppress) emotions primarily in the context of their close relationships with friends, family and romantic partners, and they constantly monitor their partners’ emotions within these same relationships. We have two data sets relevant to these processes which remain to be explored. First, we have longitudinal data from 108 couples (216 individuals), including some personality measures, self-reports of their tendencies to experience and express a variety of positive and negative emotions, and their perceptions of their partner’s experiences and expressions of the same emotions, as well as two 5-day daily diary studies in which they reported how they felt upon giving and receiving benefits from one another. We have a second data set from just over 200 couples (400 individuals) in which the same personality measures and self-report measures were collected and, in addition, couples engaged in four taped discussions of a positive and a negative event that occurred for each of them. These tapes have been coded for verbal and non-verbal expressions of a variety of positive and negative emotions (by objective observers, by the expressor of the emotion, and by that person’s partner). Do people project their own emotions onto what they see in their partners? Do people’s self-reports of their general tendencies to experience and express emotions match what they themselves report feeling in the moment, what their partners report, and what objective observers report? How do personality factors relate to emotion expression?
Is being romantically partnered linked to better mental health for people of all sexual identities?
Two well-established findings are: #1 that gay/lesbian and bisexual individuals, on average, suffer from greater depression and anxiety than do heterosexual individuals, and #2 that adults who are partnered, on average, experience less depression and anxiety than do those who are not partnered. However, studies supporting the latter finding have been done with exclusively (or primarily) heterosexual samples. We have collected a data set including partnered (for at least one year) and not-partnered (for at least one year) heterosexual, gay, lesbian and bisexual individuals. They all filled out measures of anxiety, depression, life satisfaction, relationship satisfaction and discrimination (experienced within the last year and over the course of their lifetimes). The data set can be used to explore not just whether partnering is associated with the same benefits for members of all sexual orientations but, if so, why, and if not, why not.
Luke Sanford, Assistant Professor of Environmental Policy and Governance, Yale School of the Environment
luke.sanford@yale.edu
https://lcsanford.github.io/
Satellite imagery and machine learning for causal impact evaluation
This project develops machine learning methods to measure economic development or environmental damage from satellite imagery. We show that many outcome variables, as measured with existing remote sensing/machine learning methods, can generate bias when used in causal impact evaluation. When standard machine learning methods minimize loss, they produce estimates which are on average unbiased across the training data. However, this unbiasedness is not likely to hold across important subsets of the data, including the range of the true values of the outcome variable, or across important independent variables. We propose two strategies. First, we use adversarial debiasing algorithms–originally developed to ensure that machine learning methods do not encode racial or other demographic biases–to generate suitable measures. Second, we use an active learning labeling method to reduce bias in existing methods while reducing the total amount of labeling researchers have to conduct.
We are looking for students who have a background in both statistics and machine learning. Any experience in the areas of adversarial methods, active learning, computer vision, or spatial data analysis is a plus!
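The point about estimates being unbiased on average but not within subsets can be illustrated with a toy simulation. The sketch below is not the project's method: it assumes a made-up setting in which a single noisy feature stands in for satellite imagery and an ordinary least-squares line stands in for the machine learning model. All names (fit_ols, mean_error_by_half) and the simulated data are hypothetical.

```python
import random
import statistics


def fit_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_error_by_half(n=5000, noise_sd=1.0, seed=0):
    """Simulate a true outcome and a noisy proxy, fit a loss-minimizing
    model, and report the mean prediction error overall and within the
    lower and upper halves of the TRUE outcome."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(n)]           # true outcome
    feature = [y + rng.gauss(0.0, noise_sd) for y in truth]   # noisy proxy
    intercept, slope = fit_ols(feature, truth)
    errors = [(intercept + slope * x) - y for x, y in zip(feature, truth)]
    median = statistics.median(truth)
    low = [e for e, y in zip(errors, truth) if y <= median]
    high = [e for e, y in zip(errors, truth) if y > median]
    return statistics.fmean(errors), statistics.fmean(low), statistics.fmean(high)


overall, low_half, high_half = mean_error_by_half()
print(f"mean error, all units:        {overall:+.3f}")
print(f"mean error, low true values:  {low_half:+.3f}")
print(f"mean error, high true values: {high_half:+.3f}")
```

In this toy setting the overall mean error is essentially zero, while units with low true values are overestimated and units with high true values are underestimated. If a treatment shifts the true outcome, an error pattern of this kind would be expected to carry over into the estimated treatment effect.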
Meg Urry, Israel Munson Professor of Physics, Director, Yale Center for Astronomy and Astrophysics
Presented by Aritra Ghosh, senior graduate student in Prof. Urry’s group
meg.urry@yale.edu
aritra.ghosh@yale.edu
https://urrylab.yale.edu
Assessing the Shapes of Galaxies & AGN using Machine Learning
Have you ever wondered how many galaxies are out there in this universe? While the real answer is infinite, in our broad neighborhood of the universe, we estimate that number to be 100 billion! New telescopes, including the James Webb Space Telescope (JWST) and the upcoming Rubin Survey, have allowed a significant expansion of the distance (and number of galaxies) we can image. The sheer volume of this imaging dataset makes it very difficult to analyze using traditional astronomical tools. Our lab has developed two flagship algorithms – GaMorNet and GaMPEN – to help determine the shapes and sizes of ~10 million galaxies, spanning multiple surveys and redshifts (distance from Earth). We have also adapted another generative network, PSFGAN, to apply the above algorithms to Active Galactic Nuclei [AGN; galaxies where very massive black holes at their center release lots of energy in the form of electromagnetic radiation]. Since galaxies and AGN with different shapes evolve differently over time, assessing the shapes of these objects allows us to infer how galaxies and black holes evolve and how their evolution is correlated.
Project options: a) developing a machine learning tool that can detect merging galaxies – an important subclass which represents 15-30% of all galaxies. Merging galaxies are astronomically interesting because merging is one of the mechanisms by which galaxies evolve and change shape, and mergers have also been shown to affect the rate of star formation and AGN activity. Due to their distorted or unusual shapes, merging galaxies often confuse our shape-determining algorithms. Thus we would like to develop an ML framework that can flag mergers and subsequently determine the shapes of the merging galaxies; b) improving the uncertainty quantification of GaMPEN – we are interested in using deep ensembles / stochastic weight averaging / simulation-based inference to verify whether any of these can produce better-calibrated uncertainties than our current approach; and c) improving the auto-cropping feature of GaMPEN – GaMPEN includes a Spatial Transformer Network to automatically crop input galaxies to an optimal size.
David van Dijk, Assistant Professor of Medicine and Computer Science
david.vandijk@yale.edu
www.vandijklab.org
Graph-neural networks for brain dynamics and spatial genomics
Graph-neural networks (or geometric deep learning) are revolutionizing machine learning and data science. They combine ideas from graph theory, geometry, topology, and deep learning to learn powerful non-linear models on graphical data. In the Van Dijk Lab we are developing several new types of graph neural networks, based on ideas from integral equations and self-attention models, and apply these to diverse biomedical applications. In one application, we are using graph neural networks to model spatiotemporal brain activity data, such as whole-cortex calcium imaging and fMRI recordings. In a second application, we are using graph neural networks to model spatial transcriptomic data – a new technology for the measurement of high-dimensional gene expression at the single-cell level with spatial resolution. Using our algorithm, we infer cell-cell interactions in measurements from kidney cancer and brain tissue of multiple sclerosis patients. In these projects, there is the opportunity to focus more on the algorithmic side or on the application, and you will work closely with postdocs and grad students in the lab.
Y. Richard Yang, Professor of Computer Science, Director of Undergraduate Studies (on leave Fall 2022), Department of Computer Science
yry@cs.yale.edu
https://csl.yale.edu
Building Network Infrastructures for Data-Intensive Sciences, with Data-Science Techniques
The project concerns the analysis, design and implementation of the control planes of the network infrastructures supporting data-intensive sciences. Increasingly, the workflows of many branches of science generate and process large amounts of data. Supported mainly by public funding, these sciences do not have resources comparable to commercial hyper-scalers such as Google and Facebook to design bespoke systems, and hence their designs are often outdated. The objective of this project is to produce a modern design and implement it in flagship systems, in particular with the CERN Rucio, FTS, and data infrastructure teams, which run the largest data-intensive science network. To guide the design, the project will collect and analyze the infrastructure data at CERN, using data science techniques. Closing the loop, the project seeks to feed the design back to the global Internet to influence the Internet architecture, just as the Web/HTTP design from the CERN community made a fundamental impact on the Internet architecture.
Rohan Khera, Assistant Professor of Medicine, Cardiovascular Medicine, Yale School of Medicine, Assistant Professor of Biostatistics, Health Informatics, Yale School of Public Health, Clinical Director, Data Analytics Center, Yale-New Haven Hospital Center for Outcomes Research and Evaluation, Lead Investigator, Cardiovascular Data Science (CarDS) Lab, Yale University
rohan.khera@yale.edu
https://www.cards-lab.org
CarDS Lab: Leveraging Informatics and Artificial Intelligence for Data-Driven Innovation in Cardiovascular Care
The Cardiovascular Data Science (CarDS) Lab at Yale University focuses on leveraging large data streams in the Electronic Health Record (EHR) and randomized clinical trials. These data streams, which span both structured and unstructured elements, are used to identify novel disease populations, to build AI-based strategies for the detection of undiagnosed diseases, and to optimize both efficiency and quality of care. These applications allow unique training in (a) developing state-of-the-art data architectures optimized for managing large and dynamic real-world data streams, and (b) building custom applications of AI to clinical trials, time series data, natural language, cardiovascular images, and signal-based data. Current trainees include post-doctoral fellows, graduate students, clinical fellows/residents, and undergraduates, and many of them have publications in leading scientific journals and are named inventors on intellectual property generated as a part of their work. The experience is ideal for students looking to learn both informatics and data science in healthcare, and planning a future career in academic healthcare and/or the health technology and data science industry. Helpful skills include programming in Python or similar languages and an interest in application-driven learning. The lab has resources to support scholarly development, including domain-track meetings, journal clubs, and tutorials/articles curated to support learning.