The series is envisioned as a vital contribution to the intellectual, cultural, and scholarly environment at The University of Texas at Austin for students, faculty, and the wider community. Each talk is free of charge and open to the public. For more information, contact Stephanie Tomlinson at sat[@]austin[dot]utexas[dot]edu.

#### FALL 2018 SEMINAR SERIES

**August 31, 2018 – Katherine Heller**

(Department of Statistical Science, Duke University)

"Machine Learning for Health Care" **CBA 4.328, 2:00 to 3:00 PM**

**September 21, 2018 – Su Chen**

(Department of Statistics and Data Sciences, UT Austin)

"Fast Bayesian Variable Selection: Solo Spike and Slab" **CBA 4.328, 2:00 to 3:00 PM**

**September 28, 2018 – Lucas Janson**

(Department of Statistics, Harvard University)

"Using Knockoffs to find important variables with statistical guarantees" **CBA 4.328, 2:00 to 3:00 PM**

**October 5, 2018 – Carlos Pagani Zanini**

(Department of Statistics and Data Sciences, UT Austin)

"A Bayesian Random Partition Model for Sequential Refinement and Coagulation" **CBA 4.328, 2:00 to 3:00 PM**

**October 12, 2018 – Mingzhang Yin**

(Department of Statistics and Data Sciences, UT Austin)

"ARM: Augment-REINFORCE-merge gradient for discrete latent variables" **CBA 4.328, 2:00 to 3:00 PM**

**October 19, 2018 – Debdeep Pati**

(Department of Statistics, Texas A&M University)

"Constrained Gaussian processes and the proton puzzle problem" **CBA 4.328, 2:00 to 3:00 PM**

**October 26, 2018 – Yuguo Chen**

(Department of Statistics, University of Illinois at Urbana-Champaign)

"Statistical Inference on Dynamic Networks" **CBA 4.328, 2:00 to 3:00 PM**

**November 2, 2018 – Evan Ott**

(Department of Statistics and Data Sciences, UT Austin)

"Bayesian Deep Learning: Extending Probabilistic Backpropagation and Transfer Learning" **CBA 4.328, 2:00 to 3:00 PM**

**November 9, 2018 – Mengjie Wang**

(Department of Statistics and Data Sciences, UT Austin)

"A Data Dependent Posterior Density Generative Model" **CBA 4.328, 2:00 to 3:00 PM**

**November 16, 2018 – David Yeager**

(Department of Psychology, UT Austin)

"Heterogeneous Effects of a Scalable Growth-Mindset Intervention on Adolescents’ Educational Trajectories" **CBA 4.328, 2:00 to 3:00 PM**

**November 29, 2018 – Sonia Petrone**

(Università Bocconi, Department of Decision Sciences)

"Quasi-Bayes properties of a sequential procedure for mixtures" **GDC 4.302, 3:00 to 4:00 PM**

**November 30, 2018 – Choudur Lakshminarayan**

(Statistics and Data Sciences, UT Austin)

"Data Compression, and Statistical Pattern Recognition in Healthcare Applications" **CBA 4.328, 2:00 to 3:00 PM**

**December 7, 2018 – Cory Zigler**

(Dell Medical School, Statistics and Data Sciences, UT Austin)

"Bipartite Causal Inference with Interference: Estimating Health Impacts of Power Plant Regulations" **CBA 4.328, 2:00 to 3:00 PM**

**Katherine Heller** (Department of Statistical Science, Duke University)

**Title:** Machine Learning for Health Care

**Abstract:** We will present multiple ways in which healthcare data is acquired and machine learning methods are currently being introduced into clinical settings. This will include:

* Predicting disease, including sepsis, and identifying the best treatment decisions for sepsis patients from electronic health record (EHR) data, using Gaussian processes and deep learning methods

* Predicting surgical complications and transfer learning methods for combining databases

* Using mobile apps and integrated sensors for improving the granularity of recorded health data for chronic conditions

Current work in these areas will be presented and the future of machine learning contributions to the field will be discussed.

**Su Chen** (Department of Statistics and Data Sciences, UT Austin)

**Title:** Fast Bayesian Variable Selection: Solo Spike and Slab

**Abstract:** We present a method for fast Bayesian variable selection in the normal linear regression model with high dimensional data. A novel approach is adopted in which an explicit posterior probability for including a covariate is obtained. The method is sequential but not order dependent: each covariate is dealt with one at a time, and a spike and slab prior is assigned only to the coefficient under investigation. We adopt the well-known spike and slab Gaussian priors with a sample size dependent variance, which achieves strong selection consistency for marginal posterior probabilities even when the number of covariates grows almost exponentially with sample size. Numerical illustrations are presented where it is shown that the new approach provides essentially equivalent results to the standard spike and slab priors, i.e. the same marginal posterior probabilities, which are estimated via Gibbs sampling. Hence, we obtain the same results via the direct calculation of p probabilities, compared to a stochastic search over a space of 2^p elements. Our procedure only requires p probabilities to be calculated, which can be done exactly, hence parallel computation when p is large is feasible.

**Lucas Janson** (Department of Statistics, Harvard University)

**Title:** Using Knockoffs to find important variables with statistical guarantees

**Abstract:** Many contemporary large-scale applications, from genomics to advertising, involve linking a response of interest to a large set of potential explanatory variables in a nonlinear fashion, such as when the response is binary. Although this modeling problem has been extensively studied, it remains unclear how to effectively select important variables while controlling the fraction of false discoveries, even in high-dimensional logistic regression, not to mention general high-dimensional nonlinear models. To address such a practical problem, we propose a new framework of model-X knockoffs, which reads from a different perspective the knockoff procedure (Barber and Candès, 2015) originally designed for controlling the false discovery rate in low-dimensional linear models. Model-X knockoffs can deal with arbitrary (and unknown) conditional models and any dimensions, including when the number of explanatory variables p exceeds the sample size n. Our approach requires the design matrix be random (independent and identically distributed rows) with a known distribution for the explanatory variables, although we show preliminary evidence that our procedure is robust to unknown/estimated distributions. As we require no knowledge/assumptions about the conditional distribution of the response, we effectively shift the burden of knowledge from the response to the explanatory variables, in contrast to the canonical model-based approach which assumes a parametric model for the response but very little about the explanatory variables. To our knowledge, no other procedure solves the controlled variable selection problem in such generality, but in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. We also apply our procedure to data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.

**Note:** Although model-X knockoffs is a frequentist procedure, it has two aspects that may appeal to Bayesian researchers as well: (1) it can be used as a wrapper around Bayesian variable importance/selection methods, retaining high power when the Bayesian method is successful while always guaranteeing false discovery rate control (even if the prior and/or model are wrong, or the Bayesian computation doesn't converge); (2) while model-X knockoffs provides a general framework, applying it requires generating the knockoff variables, which is in general a challenging conditional sampling problem that I believe many tools from Bayesian computation can be brought to bear on.
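The abstract does not spell out the final selection step. As a rough sketch only, assuming per-variable statistics `w` have already been computed from the original and knockoff variables (large positive values suggesting a real signal), the knockoff+ threshold of Barber and Candès (2015) could be computed as below. The function names and the toy statistics are hypothetical and are not taken from the talk.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t with (1 + #{w_j <= -t}) / max(1, #{w_j >= t}) <= q.

    Returns float('inf') if no candidate threshold meets the target level q,
    in which case no variable is selected.
    """
    # Only the nonzero magnitudes of the statistics need to be tried.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select_variables(w, q):
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Toy statistics, made up purely for illustration.
w_toy = [5.1, 4.2, -0.3, 3.9, 0.2, 3.5, -0.1, 2.8, 0.4, 2.5]
threshold = knockoff_plus_threshold(w_toy, q=0.2)
selected = select_variables(w_toy, q=0.2)
```

The hard part of the method, generating valid knockoff variables, is deliberately left out of this sketch; it covers only the thresholding that turns the statistics into a selected set.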

**Carlos Pagani Zanini** (Department of Statistics and Data Sciences, UT Austin)

**Title:** A Bayesian Random Partition Model for Sequential Refinement and Coagulation

**Abstract:** We analyze time-course protein activation data to track the changes in the clustering of protein expression over time after the proteins are exposed to drugs such as protein inhibitors. Protein expression is expected to change over time in response to the drug intervention in different ways due to biological pathways. We therefore allow for proteins to cluster differently at different time points. As the effect of the drug wears off, the protein expression may revert to the level before the drug treatment. In addition, different drugs, doses, and cell lines may have different effects in altering the protein expression. To model and understand this process we develop random partition models to identify the refinement and coagulation of protein clusters over time. We demonstrate the approach using a time-course reverse phase protein array (RPPA) dataset consisting of protein expression measurements under three different drugs, each with different dose levels, and with different cell lines. The developed model can be applied in general to time-course data where clustering of the experimental units is expected to change over time in a sequence of refinement and coagulation.

**Mingzhang Yin** (Department of Statistics and Data Sciences, UT Austin)

**Title:** ARM: Augment-REINFORCE-merge gradient for discrete latent variables

**Abstract:** To backpropagate the gradients through stochastic binary layers, we propose the augment-REINFORCE-merge (ARM) estimator that is unbiased and has low variance. Exploiting data augmentation, REINFORCE, and reparameterization, the ARM estimator achieves adaptive variance reduction for Monte Carlo integration by merging two expectations via common random numbers. The variance-reduction mechanism of the ARM estimator can also be attributed to antithetic sampling in an augmented space. Experimental results show the ARM estimator provides state-of-the-art performance in multiple tasks in variational auto-encoding and maximum likelihood inference, for discrete latent variable models with one or multiple stochastic binary layers.
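For readers who want to see the estimator in action, here is a minimal sketch, assuming the commonly stated form of ARM for z_i ~ Bernoulli(sigmoid(phi_i)): the gradient of E[f(z)] with respect to phi is E_u[(f(1[u > sigmoid(-phi)]) - f(1[u < sigmoid(phi)])) (u - 1/2)] with u uniform on (0, 1). The toy objective, the helper names, and the sample size are illustrative choices, not material from the talk; the exact gradient is computed by enumeration only as a sanity check.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def arm_gradient(f, phi, num_samples, rng):
    """Monte Carlo ARM estimate of d/dphi E[f(z)], z_i ~ Bernoulli(sigmoid(phi_i))."""
    dim = len(phi)
    upper = [sigmoid(-x) for x in phi]  # threshold for the first binary vector
    lower = [sigmoid(x) for x in phi]   # threshold for the second binary vector
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.random() for _ in range(dim)]
        # Two binary vectors driven by the same uniforms (common random numbers).
        z_a = [1 if u[i] > upper[i] else 0 for i in range(dim)]
        z_b = [1 if u[i] < lower[i] else 0 for i in range(dim)]
        diff = f(z_a) - f(z_b)
        for i in range(dim):
            grad[i] += diff * (u[i] - 0.5)
    return [g / num_samples for g in grad]


def exact_gradient(f, phi):
    """Exact gradient by enumerating all 2^dim binary vectors (small dim only)."""
    dim = len(phi)
    p = [sigmoid(x) for x in phi]
    grad = [0.0] * dim
    for bits in range(2 ** dim):
        z = [(bits >> i) & 1 for i in range(dim)]
        prob = 1.0
        for i in range(dim):
            prob *= p[i] if z[i] else 1.0 - p[i]
        for i in range(dim):
            # d log P(z) / d phi_i = z_i - p_i under the logit parameterization.
            grad[i] += prob * f(z) * (z[i] - p[i])
    return grad


def toy_objective(z):
    """Hypothetical objective used only for this illustration."""
    return (z[0] - 0.4) ** 2 + 2.0 * z[0] * z[1]


phi = [0.3, -0.8]
estimate = arm_gradient(toy_objective, phi, 100_000, random.Random(0))
exact = exact_gradient(toy_objective, phi)
```

With enough samples, `estimate` should sit close to `exact`, consistent with the unbiasedness claimed in the abstract.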

**Debdeep Pati** (Department of Statistics, Texas A&M University)

**Title:** Constrained Gaussian processes and the proton puzzle problem

**Abstract:** The proton radius puzzle is an unanswered problem in physics relating to the size of the proton. Historically the proton radius was measured via two independent methods, which converged to a value of about 0.8768 femtometers. This value was challenged by a 2010 experiment utilizing a third method, called the muonic Lamb shift experiment, which produced a radius about 5% smaller than this. The discrepancy is explained in the current literature either by changing the laws of physics or by suspecting that the original data collected from the electron scattering experiment were erroneous. Although new datasets with high precision measurements confirm that the radius might actually be closer to 0.84 fm, the discrepancy stemming from the original dataset remains unresolved, and is a topic of ongoing research.

We approach this problem from a nonparametric Bayesian function estimation perspective, with physical constraints explicitly accounted for in the estimation procedure. Our analysis of the electron-form factor measurements versus potential transfer values data confirms the value obtained from the new datasets (0.84 fm) as the radius. Incorporating the physical constraints substantially reduces the uncertainty, and the 95% credible intervals obtained from our method do not contain the previous value of 0.8768 fm.

**Yuguo Chen** (Department of Statistics, University of Illinois at Urbana-Champaign)

**Title:** Statistical Inference on Dynamic Networks

**Abstract:** Dynamic networks are used in a variety of fields to represent the structure and evolution of the relationships between entities. We present a model which embeds longitudinal network data as trajectories in a latent Euclidean space. A Markov chain Monte Carlo algorithm is proposed to estimate the model parameters and latent positions of the nodes in the network. The model parameters provide insight into the structure of the network, and the visualization provided from the model gives insight into the network dynamics. We apply the latent space model to simulated data as well as real data sets to demonstrate its performance.

**Evan Ott** (Department of Statistics and Data Sciences, University of Texas at Austin)

**Title:** Bayesian Deep Learning: Extending Probabilistic Backpropagation and Transfer Learning

**Abstract:** Deep neural network models are capable of expert-level performance in real world problems, but fail to capture uncertainty in the model parameters. By applying Bayesian inference to neural networks, we quantify the uncertainty in our parameter estimates and can use that uncertainty when making predictions. However, methods like MCMC require untenable computational resources, and methods like Variational Inference underestimate the posterior uncertainty. Probabilistic Backpropagation (PBP) (Hernández-Lobato and Adams, 2015) is a Bayesian neural network method that relies on moment-matching to form an approximate posterior using Assumed Density Filtering. In this seminar, I’ll discuss the PBP method, along with some planned improvements, extensions, and applications.

**Mengjie Wang** (Department of Statistics and Data Sciences, University of Texas at Austin)

**Title:** A Data Dependent Posterior Density Generative Model

**Abstract:** Data dependent posterior distributions (Berlitzer, Annals of Statistics) discusses the idea of a direct construction of a posterior using the data and observing a number of desirable coverage rules.

Our observation is that nonparametric posteriors are typically centered at an MLE density estimate. Hence our data dependent posterior is centered at an MLE, with a Gaussian process used to generate random densities around it. To get the right coverage, we match the mean and variance of the distance of these posterior densities from the MLE with those of bootstrapped MLEs from the MLE, which mimics the distribution of the distance of the MLE density estimate from the true density.

The math is made simple when we use the Fisher information distance.

**David Yeager** (Department of Psychology, University of Texas at Austin)

**Title:** Heterogeneous Effects of a Scalable Growth-Mindset Intervention on Adolescents’ Educational Trajectories

**Abstract:** The *National Study of Learning Mindsets* was a longitudinal, double-blind, randomized trial conducted in a representative sample of U.S. public high schools. The study delivered a short *growth mindset* intervention (an intervention teaching that intellectual abilities are not fixed but can be developed) to an entire class of 9th grade students with the goal of understanding where the intervention redirected the educational trajectories of lower-achieving students. That is, the study prioritized *treatment effect heterogeneity*. Three main findings emerged. First, this short, universal, preventative psychological intervention had modest but consequential effects on outcomes such as grades in core classes over the academic year (the primary outcome) and rates of taking advanced math courses the next year (an exploratory outcome). These effects compared favorably in size to many of the most rigorously-evaluated comprehensive adolescent interventions in the literature, but came at a much lower per-person cost, and in a population-generalizable sample. Second, the study identified a school-level factor (behavioral norms regarding challenging schoolwork) that moderated treatment effects on the primary outcome of grades; effects were stronger when the peer norms aligned with its message. Third, because the study took a number of steps to reduce false discoveries, including independent data collection, pre-registration of analyses, and implementation of a conservative, flexible Bayesian model, it provided an example for how to examine treatment effect heterogeneity in a way that is both reproducible and generalizable. This example could be followed in trials conducted in medicine, public health, policy analysis, and other disciplines, to better understand when and where interventions improve social well-being.

**Sonia Petrone** (Università Bocconi, Department of Decision Sciences)

**Title:** Quasi-Bayes properties of a sequential procedure for mixtures

**Abstract:** Mixture models have wide application in many areas. However, when data arrive sequentially, fast computations remain a challenge. M. Newton et al. (see Newton and Zhang, 1999, Biometrika) proposed a recursive rule as a fast approximation of the Bayesian estimate of the mixing distribution in Dirichlet process mixture models. A special case is the quasi-Bayes sequential procedure proposed by Smith and Makov (JRSS B, 1978) for unsupervised sequential learning and classification with finite mixtures. Convergence results have proven the validity of Newton's recursive scheme as a consistent frequentist estimator. However, the original motivation, namely to what extent it provides an approximation of a Bayesian procedure, remains open.

In this work we address this question. Using the notions of asymptotic exchangeability and conditionally identically distributed sequences, we show that the recursive algorithm does provide an asymptotic approximation of a Bayesian procedure obtained under exchangeability. Beyond the important case of mixture models, our study suggests a rigorous framework to formalize the idea that, with today's abundance of data and pressure for fast computations, a slightly misspecified but computationally more tractable model may provide an attractive compromise.

This is joint work with Sandra Fortini.
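The recursive rule mentioned in the abstract is easy to sketch. The version below is an illustration under stated assumptions, not the speaker's code: the mixing distribution is restricted to a fixed grid, the kernel is Normal(theta, 1), and the step size is 1/(n + 1), one common choice. Each observation moves the current estimate a small step toward its one-observation posterior update. The function names and the simulated data are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def recursive_mixing_estimate(data, grid, initial_weights):
    """Single pass of the recursive update over a fixed grid of locations."""
    weights = list(initial_weights)
    for n, x in enumerate(data, start=1):
        step = 1.0 / (n + 1)  # assumed decreasing step-size sequence
        likelihood = [normal_pdf(x, theta) for theta in grid]
        norm = sum(l * w for l, w in zip(likelihood, weights))
        # Posterior over grid points given x, using the current estimate as prior.
        posterior = [l * w / norm for l, w in zip(likelihood, weights)]
        # Convex combination of the old estimate and the one-step posterior.
        weights = [(1.0 - step) * w + step * p
                   for w, p in zip(weights, posterior)]
    return weights


rng = random.Random(1)
# Simulated data: equal mixture of N(-2, 1) and N(2, 1).
data = [rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 1.0)
        for _ in range(2000)]
grid = [-5.0 + 0.25 * i for i in range(41)]
weights = recursive_mixing_estimate(data, grid, [1.0 / len(grid)] * len(grid))
```

Each update costs one pass over the grid, which is what makes the rule attractive when data arrive sequentially. The result depends on the order of the observations, so it is not itself an exchangeable Bayesian posterior.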

**Choudur Lakshminarayan** (Department of Statistics and Data Sciences, University of Texas at Austin)

**Title:** Data Compression, and Statistical Pattern Recognition in Healthcare Applications

**Abstract:** Sensors are used in healthcare, energy, retail and other industries. Most notably in healthcare, software applications leverage embedded sensors in mobile devices for continuous monitoring. Applications exist for heart rate, respiration, skin resistance, motion, location, and fetal monitoring. Mobile devices in continuous monitoring can provide data for planning rapid responses in hospitals, preventing unnecessary admissions, reducing costs, and improving healthcare. As healthcare services transition from hospitals to telemedicine, many problems arise. For example, when vibration-based sensors collect bio-signals such as heart function, vibrations from the heart can mix with noise sources from the surrounding environment and other organs, making it difficult to detect salient features in heart rhythms. This talk addresses the problem of signal compression and disaggregation using statistical pattern recognition. We apply statistical methods such as independent component analysis for signal disaggregation. Furthermore, we will share some statistical methods for heart arrhythmia detection and feature selection with data collected from electrocardiograms (ECG) and how they can be used for real-time intervention.

**About the speaker:** Choudur K. Lakshminarayan is an Engineering Fellow at Teradata Labs and, concurrently, an Adjunct Assistant Professor in the Department of Statistics and Data Sciences at the University of Texas at Austin. He specializes in the areas of Mathematical Statistics, Applied Mathematics, Machine Learning, and algorithms with applications in sensors and sensing, energy, and Digital Marketing. He is widely published in peer-reviewed international conferences and journals, and his name appears as an inventor on over 50 patents, granted, published, or pending. He has conducted workshops in data mining and analytics in India, Hong Kong, China, the Middle East, Europe and the USA. He regularly speaks at international conferences, symposia, and universities, and serves on the program committees of international conferences and as a referee for many journals. He teaches a course on Big Data at the annual summer statistics institute at the University of Texas at Austin. He has served as a consultant to government and private industry in the US and India. He holds a PhD in mathematical sciences, and lives in Austin, Texas.

**Cory Zigler** (Dell Medical School, Statistics and Data Sciences, University of Texas at Austin)

**Title:** Bipartite Causal Inference with Interference: Estimating Health Impacts of Power Plant Regulations

**Abstract:** A fundamental feature of evaluating causal health effects of air quality regulations is that air pollution moves through space, rendering health outcomes at a particular population location dependent upon regulatory actions taken at multiple, possibly distant, pollution sources. Motivated by studies of the public-health impacts of power plant regulations in the U.S., this talk introduces the novel setting of bipartite causal inference with interference, which arises when 1) treatments are defined on observational units that are distinct from those at which outcomes are measured and 2) there is interference between units in the sense that outcomes for some units depend on the treatments assigned to many other units. Interference in this setting arises due to complex exposure patterns dictated by physical-chemical atmospheric processes of pollution transport, with intervention effects framed as propagating across a bipartite network of power plants and residential zip codes. New causal estimands are introduced for the bipartite setting, along with an estimation approach based on generalized propensity scores for treatments on a network. The new methods are deployed to estimate how emission-reduction technologies implemented at coal-fired power plants causally affect health outcomes among Medicare beneficiaries in the U.S.