The series is envisioned as a vital contribution to the intellectual, cultural, and scholarly environment at The University of Texas at Austin for students, faculty, and the wider community. Each talk is free of charge and open to the public. For more information, contact Stephanie Tomlinson at sat[@]austin[dot]utexas[dot]edu.

#### FALL 2018 SEMINAR SERIES

**August 31, 2018 – Katherine Heller**

(Department of Statistical Science, Duke University)

"Machine Learning for Health Care" **CBA 4.328, 2:00 to 3:00 PM**

**September 21, 2018 – Su Chen**

(Department of Statistics and Data Sciences, UT Austin)

"Fast Bayesian Variable Selection: Solo Spike and Slab" **CBA 4.328, 2:00 to 3:00 PM**

**September 28, 2018 – Lucas Janson**

(Department of Statistics, Harvard University)

"Using Knockoffs to find important variables with statistical guarantees" **CBA 4.328, 2:00 to 3:00 PM**

**October 5, 2018 – Carlos Pagani Zanini**

(Department of Statistics and Data Sciences, UT Austin)

"A Bayesian Random Partition Model for Sequential Refinement and Coagulation" **CBA 4.328, 2:00 to 3:00 PM**

**October 12, 2018 – Mingzhang Yin**

(Department of Statistics and Data Sciences, UT Austin)

"ARM: Augment-REINFORCE-merge gradient for discrete latent variables" **CBA 4.328, 2:00 to 3:00 PM**

**October 19, 2018 – Debdeep Pati**

(Department of Statistics, Texas A&M University)

"Constrained Gaussian processes and the proton puzzle problem" **CBA 4.328, 2:00 to 3:00 PM**

**October 26, 2018 – Yuguo Chen**

(Department of Statistics, University of Illinois at Urbana-Champaign)

"Statistical Inference on Dynamic Networks" **CBA 4.328, 2:00 to 3:00 PM**

**November 2, 2018 – Evan Ott**

(Department of Statistics and Data Sciences, UT Austin)

"TBD" **CBA 4.328, 2:00 to 3:00 PM**

**November 9, 2018 – Mengjie Wang**

(Department of Statistics and Data Sciences, UT Austin)

"TBD" **CBA 4.328, 2:00 to 3:00 PM**

**November 16, 2018 – David Yeager**

(Department of Psychology, UT Austin)

"TBD" **CBA 4.328, 2:00 to 3:00 PM**

**November 30, 2018 – Choudur Lakshminarayan**

(Statistics and Data Sciences, UT Austin)

"TBD" **CBA 4.328, 2:00 to 3:00 PM**

**December 7, 2018 – Cory Zigler**

(Dell Medical School, Statistics and Data Sciences, UT Austin)

"TBD" **CBA 4.328, 2:00 to 3:00 PM**

**Katherine Heller** (Department of Statistical Science, Duke University)

**Title:** Machine Learning for Health Care

**Abstract:** We will present multiple ways in which healthcare data is acquired and machine learning methods are currently being introduced into clinical settings. This will include:

* Modelling the prediction of disease, including sepsis, and ways in which the best treatment decisions for sepsis patients can be made from electronic health record (EHR) data using Gaussian processes and deep learning methods

* Predicting surgical complications and transfer learning methods for combining databases

* Using mobile apps and integrated sensors for improving the granularity of recorded health data for chronic conditions

Current work in these areas will be presented and the future of machine learning contributions to the field will be discussed.

**Su Chen** (Department of Statistics and Data Sciences, UT Austin)

**Title:** Fast Bayesian Variable Selection: Solo Spike and Slab

**Abstract:** We present a method for fast Bayesian variable selection in the normal linear regression model with high-dimensional data. A novel approach is adopted in which an explicit posterior probability for including a covariate is obtained. The method is sequential but not order dependent: each covariate is handled one at a time, and a spike and slab prior is assigned only to the coefficient under investigation. We adopt the well-known spike and slab Gaussian priors with a sample-size-dependent variance, which achieves strong selection consistency for marginal posterior probabilities even when the number of covariates grows almost exponentially with sample size. Numerical illustrations show that the new approach provides essentially the same marginal posterior probabilities as the standard spike and slab priors, for which they are estimated via Gibbs sampling. Hence, we obtain the same results by directly calculating p probabilities rather than by a stochastic search over a space of 2^p elements. Because each of the p probabilities can be calculated exactly, parallel computation is feasible when p is large.

**Lucas Janson** (Department of Statistics, Harvard University)

**Title:** Using Knockoffs to find important variables with statistical guarantees

**Abstract:** Many contemporary large-scale applications, from genomics to advertising, involve linking a response of interest to a large set of potential explanatory variables in a nonlinear fashion, such as when the response is binary. Although this modeling problem has been extensively studied, it remains unclear how to effectively select important variables while controlling the fraction of false discoveries, even in high-dimensional logistic regression, not to mention general high-dimensional nonlinear models. To address such a practical problem, we propose a new framework of model-X knockoffs, which reads from a different perspective the knockoff procedure (Barber and Candès, 2015) originally designed for controlling the false discovery rate in low-dimensional linear models. Model-X knockoffs can deal with arbitrary (and unknown) conditional models and any dimensions, including when the number of explanatory variables p exceeds the sample size n. Our approach requires the design matrix be random (independent and identically distributed rows) with a known distribution for the explanatory variables, although we show preliminary evidence that our procedure is robust to unknown/estimated distributions. As we require no knowledge/assumptions about the conditional distribution of the response, we effectively shift the burden of knowledge from the response to the explanatory variables, in contrast to the canonical model-based approach which assumes a parametric model for the response but very little about the explanatory variables. To our knowledge, no other procedure solves the controlled variable selection problem in such generality, but in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. We also apply our procedure to data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.

**Note:** Although model-X knockoffs is a frequentist procedure, it has two aspects that may appeal to Bayesian researchers as well: (1) it can be used as a wrapper around Bayesian variable importance/selection methods, retaining high power when the Bayesian method is successful while always guaranteeing false discovery rate control (even if the prior and/or model are wrong, or the Bayesian computation doesn't converge); (2) while model-X knockoffs provides a general framework, applying it requires generating the knockoff variables, which is in general a challenging conditional sampling problem that I believe many tools from Bayesian computation can be brought to bear on.

**Carlos Pagani Zanini** (Department of Statistics and Data Sciences, UT Austin)

**Title:** A Bayesian Random Partition Model for Sequential Refinement and Coagulation

**Abstract:** We analyze time-course protein activation data to track the changes in the clustering of protein expression over time after the proteins are exposed to drugs such as protein inhibitors. Protein expression is expected to change over time in response to the drug intervention in different ways due to biological pathways. We therefore allow for proteins to cluster differently at different time points. As the effect of the drug wears off, the protein expression may revert to the level before the drug treatment. In addition, different drugs, doses, and cell lines may have different effects in altering the protein expression. To model and understand this process, we develop random partition models to identify the refinement and coagulation of protein clusters over time. We demonstrate the approach using a time-course reverse phase protein array (RPPA) dataset consisting of protein expression measurements under three different drugs, each with different dose levels, and with different cell lines. The developed model can be applied in general to time-course data where clustering of the experimental units is expected to change over time in a sequence of refinement and coagulation.

**Mingzhang Yin** (Department of Statistics and Data Sciences, UT Austin)

**Title:** ARM: Augment-REINFORCE-merge gradient for discrete latent variables

**Abstract:** To backpropagate the gradients through stochastic binary layers, we propose the augment-REINFORCE-merge (ARM) estimator that is unbiased and has low variance. Exploiting data augmentation, REINFORCE, and reparameterization, the ARM estimator achieves adaptive variance reduction for Monte Carlo integration by merging two expectations via common random numbers. The variance-reduction mechanism of the ARM estimator can also be attributed to antithetic sampling in an augmented space. Experimental results show the ARM estimator provides state-of-the-art performance in multiple tasks in variational auto-encoding and maximum likelihood inference, for discrete latent variable models with one or multiple stochastic binary layers.
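For readers who want a concrete picture of the estimator described in this abstract, the following is a minimal sketch for a single binary variable, not the speaker's implementation. It assumes the univariate ARM identity as it is commonly stated: for z ~ Bernoulli(sigmoid(phi)), the gradient of E[f(z)] with respect to phi equals the expectation over u ~ Uniform(0, 1) of (f(1[u > sigmoid(-phi)]) - f(1[u < sigmoid(phi)])) (u - 1/2). The objective `f`, the logit value, and all function names are hypothetical choices made for illustration only.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def f(z):
    """Hypothetical objective on a binary variable (illustration only)."""
    return (z - 0.45) ** 2


def exact_gradient(phi):
    """Closed-form d/dphi of E[f(z)] for z ~ Bernoulli(sigmoid(phi))."""
    p = sigmoid(phi)
    return p * (1.0 - p) * (f(1) - f(0))


def arm_gradient(phi, num_samples, rng):
    """Monte Carlo estimate of the same gradient via the ARM identity.

    Each uniform draw u is reused for both evaluations of f, which is the
    "common random numbers" merge mentioned in the abstract.
    """
    upper = sigmoid(-phi)  # threshold for the first binary draw
    lower = sigmoid(phi)   # threshold for the second binary draw
    total = 0.0
    for _ in range(num_samples):
        u = rng.random()
        z_first = 1 if u > upper else 0
        z_second = 1 if u < lower else 0
        total += (f(z_first) - f(z_second)) * (u - 0.5)
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    phi = 0.3  # assumed logit, chosen arbitrarily
    print("exact gradient:", exact_gradient(phi))
    print("ARM estimate  :", arm_gradient(phi, 100_000, rng))
```

In this toy setting the exact gradient is available in closed form, so the two printed numbers can be compared directly. With stochastic binary layers no closed form exists, which is the setting the talk addresses.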

**Debdeep Pati** (Department of Statistics, Texas A&M University)

**Title:** Constrained Gaussian processes and the proton puzzle problem

**Abstract:** The proton radius puzzle is an unanswered problem in physics relating to the size of the proton. Historically, the proton radius was measured via two independent methods, which converged to a value of about 0.8768 femtometers (fm). This value was challenged by a 2010 experiment using a third method, the muonic Lamb shift experiment, which produced a radius about 5% smaller. The discrepancy is explained in the current literature either by changing the laws of physics or by suspecting that the original data collected from the electron scattering experiment were erroneous. Although new datasets with high-precision measurements confirm that the radius might actually be closer to 0.84 fm, the discrepancy stemming from the original dataset remains unresolved and is a topic of ongoing research.

We approach this problem from a nonparametric Bayesian function estimation perspective, with physical constraints explicitly accounted for in the estimation procedure. Our analysis of the electron-form factor measurements versus potential transfer values data confirms the value obtained from the new datasets (0.84 fm) as the radius. Incorporating the physical constraints substantially reduces the uncertainty, and the 95% credible intervals obtained from our method do not contain the previous value of 0.8768 fm.

**Yuguo Chen** (Department of Statistics, University of Illinois at Urbana-Champaign)

**Title:** Statistical Inference on Dynamic Networks

**Abstract:** Dynamic networks are used in a variety of fields to represent the structure and evolution of the relationships between entities. We present a model which embeds longitudinal network data as trajectories in a latent Euclidean space. A Markov chain Monte Carlo algorithm is proposed to estimate the model parameters and latent positions of the nodes in the network. The model parameters provide insight into the structure of the network, and the visualization provided from the model gives insight into the network dynamics. We apply the latent space model to simulated data as well as real data sets to demonstrate its performance.