The Department of Statistics and Data Sciences is pleased to announce the line-up for the 2017 Spring SDS Seminar Series. In its 7th year, the lecture series provides participants with the opportunity to hear from leading scholars and experts who work in different applied areas, including business, biology, machine learning, computer vision, economics, and public health.

The series is envisioned as a vital contribution to the intellectual, cultural, and scholarly environment at The University of Texas at Austin for students, faculty, and the wider community. Each talk is free of charge and open to the public. For more information, contact Rachel Poole.

#### Spring Seminar

**January 17, 2017 – Kristin Linn**

(Perelman School of Medicine PennSIVE Lab, University of Pennsylvania)

“Confounding in Imaging-based Predictive Modeling” **GDC 2.210, 2:00 to 3:00 PM**

**January 20, 2017 – Amy Willis**

(Department of Statistical Science, Cornell University)

“Confidence sets for phylogenetic trees” **CLA 1.106, 2:00 to 3:00 PM**

**January 24, 2017 – Abhra Sarkar**

(Department of Statistical Science, Duke University)

“Novel Statistical Frameworks for Analysis of Structured Sequential Data” **GDC 2.210, 2:00 to 3:00 PM**

**January 27, 2017 – Rajarshi Mukherjee**

(Department of Statistics, Stanford University)

“Sparse Signal Detection with Binary Outcomes” **CLA 1.106, 2:00 to 3:00 PM**

**January 31, 2017 – Alexander Franks**

(eScience Institute, University of Washington)

“Bayesian Covariance Estimation with Applications in High-throughput Biology” **GDC 2.210, 2:00 to 3:00 PM**

**February 3, 2017 – Lorin Crawford**

(Department of Statistical Science, Duke University)

“Bayesian Approximate Kernel Regression with Variable Selection” **CLA 1.106, 2:00 to 3:00 PM**

**February 10, 2017 – Pierpaolo De Blasi**

(University of Torino and Collegio Carlo Alberto)

“Birth-and-death Polya urns and stationary random partitions” **CLA 1.106, 2:00 to 3:00 PM**

**February 14, 2017 – Matteo Ruggiero**

(University of Torino and Collegio Carlo Alberto)

“Conjugacy properties of time-evolving Dirichlet and gamma random measures” **MEZ 2.124, 2:00 to 3:00 PM**

**March 10, 2017 – Guy Cole**

(Department of Statistics and Data Sciences, The University of Texas at Austin)

“Stochastic Blockmodels with Edge Information” **CLA 1.106, 2:00 to 3:00 PM**

**March 24, 2017 – David Van Dyk**

(Department of Mathematics, Imperial College London)

“Quantifying Discovery in Astro/Particle Physics: Frequentist and Bayesian Perspectives” **CLA 1.106, 2:00 to 3:00 PM**

**March 31, 2017 – Xi Chen**

(Leonard N. Stern School of Business, NYU Stern)

“Statistical Inference for Model Parameters with Stochastic Gradient Descent” **CLA 1.106, 2:00 to 3:00 PM**

**April 3, 2017 – Robert Tibshirani**

(Department of Statistics, Stanford University)

“Some Progress and Challenges in Biomedical Data Science” **WEL 2.122, 3:30 to 4:30 PM**

**April 7, 2017 – Natesh Pillai**

(Department of Statistics, Harvard University)

“*TBD*” **CLA 1.106, 2:00 to 3:00 PM**

**April 14, 2017 – Peter Hoff**

(Department of Statistical Science, Duke University)

“Adaptive FAB confidence intervals with constant coverage” **CLA 1.106, 2:00 to 3:00 PM**

**April 28, 2017 – Nicholas Polson**

(School of Business, The University of Chicago Booth)

“Deep Learning Predictors for Traffic Flows” **UTC 1.116, 1:00 to 2:00 PM**

**May 5, 2017 – Juhee Lee**

(Jack Baskin School of Engineering, UC Santa Cruz)

“*TBD*” **CLA 1.106, 2:00 to 3:00 PM**

**Kristin Linn** (Perelman School of Medicine PennSIVE Lab, University of Pennsylvania)

**Title:** "Confounding in Imaging-based Predictive Modeling"

**Abstract:** The multivariate pattern analysis (MVPA) of neuroimaging data typically consists of one or more statistical learning models applied within a broader image analysis pipeline. The goal of MVPA is often to learn about patterns of variation encoded in magnetic resonance images (MRI) of the brain that are associated with brain disease incidence, progression, and response to therapy. Every model choice that is made during image processing and analysis can have implications with respect to the results of neuroimaging studies. Here, attention is given to two important steps within the MVPA framework: 1) the standardization of features prior to training a supervised learning model, and 2) the training of learning models in the presence of confounding. Specific examples focus on the use of the support vector machine, as it is a common model choice for MVPA, but the general concepts apply to a large set of models employed in the field. We propose novel methods that lead to improved classifier performance and interpretability, and we illustrate the methods on real neuroimaging data from a study of Alzheimer’s disease.

**Amy Willis** (Department of Statistical Science, Cornell University)

**Title:** "Confidence sets for phylogenetic trees"

**Abstract:** Phylogenetic trees represent evolutionary histories and have many important applications in biology, anthropology and criminology. The branching structure of the tree encodes the order of evolutionary divergence, and the branch lengths denote the time between divergence events. The target of interest in phylogenetic tree inference is high-dimensional, but the real challenge is that both the discrete (tree topology) and continuous (branch lengths) components need to be estimated. While decomposing inference on the topology and branch lengths has been historically popular, the mathematical and algorithmic developments of the last 15 years have provided a new framework for holistically treating uncertainty in tree inference. I will discuss how we can leverage these developments to construct a confidence set for the Fréchet mean of a distribution with support on the space of phylogenetic trees. The sets have good coverage and are efficient to compute. I will conclude by applying the procedure to revisit an HIV forensics investigation, and to assess our confidence in the geographical origins of the Zika virus.

**Abhra Sarkar** (Department of Statistical Science, Duke University)

**Title:** "Novel Statistical Frameworks for Analysis of Structured Sequential Data"

**Abstract:** We are developing a broad array of novel statistical frameworks for analyzing complex sequential categorical data sets. Our research is primarily motivated by a collaboration with neuroscientists trying to understand the neurological, genetic and evolutionary basis of human communication using bird and rodent models. The data sets comprise structured sequences of syllables or ‘songs’ produced by animals from different genotypes under different experimental conditions. The primary goals are to elucidate the roles of different genotypes and experimental conditions on animal vocalization behaviors and also to learn complex serial dependency structures and systematic patterns in the vocalizations. We are developing novel statistical methods based on first and higher order Markovian dynamics that help answer these important scientific queries. The methods have appealing theoretical properties and practical advantages and are of very broad utility, with applications not limited to analysis of animal vocalization experiments. Our research also paves the way to advanced automated methods for many other sophisticated dynamical systems that can accommodate more general data types.

**Rajarshi Mukherjee** (Department of Statistics, Stanford University)

**Title:** "Sparse Signal Detection with Binary Outcomes"

**Abstract:** In this talk, I will discuss some examples of sparse signal detection problems in the context of binary outcomes. These will be motivated by examples from next generation sequencing association studies, understanding heterogeneities in large scale networks, and exploring opinion distributions over networks. Moreover, these examples will serve as templates to explore interesting phase transitions present in such studies. In particular, these phase transitions will be aimed at revealing a difference between studies with possibly dependent binary outcomes and Gaussian outcomes. The theoretical developments will be further complemented with numerical results.

**Alexander Franks** (eScience Institute, University of Washington)

**Title:** "Bayesian Covariance Estimation with Applications in High-throughput Biology"

**Abstract:** Understanding the function of biological molecules requires statistical methods for assessing covariability across multiple dimensions as well as accounting for complex measurement error and missing data. In this talk, I will discuss two models for covariance estimation which have applications in molecular biology. In the first part of the talk, I will describe a model-based method for evaluating heterogeneity among several p x p covariance matrices in the large p, small n setting and will illustrate the utility of the method for exploratory analyses of high-dimensional multivariate gene expression data. In the second half of the talk, I will describe the role of covariance estimation in quantifying how cells regulate protein levels. Specifically, estimates of the correlation between steady-state levels of mRNA and protein are used to assess the degree to which protein levels are determined by post-transcriptional processes. Differences in cell preparation, measurement technology and protocol, as well as the pervasiveness of missing data complicate the accurate estimation of this correlation. To address these issues, I fit a Bayesian hierarchical model to a compendium of 58 data sets from multiple labs to infer a structured covariance matrix of measurements. I contextualize and contrast our results to conclusions drawn in previous studies.

**Lorin Crawford** (Department of Statistical Science, Duke University)

**Title:** "Bayesian Approximate Kernel Regression with Variable Selection"

**Abstract:** Nonlinear kernel regression models are often used in statistics and machine learning due to greater accuracy than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propose a novel framework that provides an analog of the effect size of each explanatory variable for Bayesian kernel regression models when the kernel is shift-invariant, for example the Gaussian kernel. We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that (1) captures nonlinear structure and (2) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as the analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. By adapting some classical results in compressive sensing, we state conditions under which BAKR can recover a sparse set of effect sizes, achieving simultaneous variable selection and regression. We illustrate the utility of BAKR by examining, in some detail, two important problems in statistical genetics: genomic selection (predicting phenotype from genotype) and association mapping (inference of significant variables or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings. We will also outline how our proposed framework was used to reveal a MYC-driven transcriptional program in *BRAF*-mutant melanomas that have become resistant to MAP kinase (MAPK) inhibitors.

**Pierpaolo De Blasi** (University of Torino and Collegio Carlo Alberto)

**Title:** "Birth-and-death Polya urns and stationary random partitions"

**Abstract:** We introduce a class of birth-and-death Polya urns, which allow for both sampling and removal of observations governed by an auxiliary inhomogeneous Bernoulli process, and investigate the asymptotic behaviour of the induced allelic partitions. By exploiting some embedded models, we show that the asymptotic regimes exhibit a phase transition from partitions with almost surely infinitely many blocks and independent counts, to stationary partitions with a random number of blocks. The first regime corresponds to limits of Ewens-type partitions and includes a result of Arratia, Barbour and Tavaré (1992) as a special case. We identify the invariant and reversible measure in the second regime, which preserves asymptotically the dependence between counts, and is shown to be a mixture of Ewens sampling formulas, with a tilted Negative Binomial mixing distribution on the sample size.

**Matteo Ruggiero** (University of Torino and Collegio Carlo Alberto)

**Title:** "Conjugacy properties of time-evolving Dirichlet and gamma random measures"

**Abstract:** We extend classic characterisations of posterior distributions under Dirichlet process and gamma random measures priors to a dynamic framework. We consider the problem of learning, from indirect observations, two families of time-dependent processes of interest in Bayesian nonparametrics: the first is a dependent Dirichlet process driven by a Fleming–Viot model, and the data are random samples from the process state at discrete times; the second is a collection of dependent gamma random measures driven by a Dawson–Watanabe model, and the data are collected according to a Poisson point process with intensity given by the process state at discrete times. Both driving processes are diffusions taking values in the space of discrete measures whose support varies with time, and are stationary and reversible with respect to Dirichlet and gamma priors respectively. A common methodology is developed to obtain in closed form the time-marginal posteriors given past and present data. These are shown to belong to classes of finite mixtures of Dirichlet processes and gamma random measures for the two models respectively, yielding conjugacy of these classes to the type of data we consider. We provide explicit results on the parameters of the mixture components and on the mixing weights, which are time-varying and drive the mixtures towards the respective priors in the absence of further data. Explicit algorithms are provided to recursively compute the parameters of the mixtures. Our results are based on the projective properties of the signals and on certain duality properties of their projections.

**Guy Cole** (Department of Statistics and Data Sciences, The University of Texas at Austin)

**Title:** "Stochastic Blockmodels with Edge Information"

**Abstract:** Stochastic blockmodels (SBMs) assume that nodes belong to communities, where each pair of communities is associated with a parameter for the interactions between nodes in each community. Traditional SBMs assume binary- or integer-valued interactions, while this talk will focus on models developed for SBMs with more complex interactions, e.g. combined message counts and bag-of-words topic models. The goals of these models are to improve interaction prediction, community identification, and interaction attribution.

**David Van Dyk** (Department of Mathematics, Imperial College London)

**Title:** "Quantifying Discovery in Astro/Particle Physics: Frequentist and Bayesian Perspectives"

**Abstract:** The question of how best to compare models and select among them has bedeviled statisticians, particularly Bayesian statisticians, for decades. The difficulties with interpreting p-values are well known among Bayesians and non-Bayesians alike. Unfortunately, the strong dependence on the choice of prior distribution of the most prominent fully Bayesian alternative, the Bayes Factor, has limited its popularity in practice. In this talk, we explore a class of non-standard model comparison problems that are important in astrophysics and high-energy physics. The search for the Higgs boson, for example, involved quantifying evidence for a narrow component added to a diffuse background distribution. The added component corresponds to the Higgs mass distribution, accounting for instrumental effects, and cannot be negative. Thus, not only is the null distribution on the boundary of the parameter space, but the location of the added component is unidentifiable under the null. Because many researchers have a strong preference for frequency-based statistical methods, they employ a sequence of likelihood ratio tests on a grid of possible null values of the unidentifiable location parameter. We compare a Bonferroni and a Markov bound on the resulting p-value, both of which are popular methods for correcting for the multiple testing inherent in this procedure. We then suggest a Bayesian strategy that employs a prior distribution on the location parameter and show how this prior automatically corrects for the multiple testing. The Bayesian procedure is significantly more conservative in that it avoids the well-known tilt of p-values toward the alternative when testing a precise null hypothesis. Finally, we discuss the circumstance under which the dependence of the Bayes Factor on the prior can be interpreted as a natural correction for multiple testing.

**Xi Chen** (Leonard N. Stern School of Business, NYU Stern)

**Title:** "Statistical Inference for Model Parameters with Stochastic Gradient Descent"

**Abstract:** In this talk, we investigate the problem of statistical inference of the true model parameters based on stochastic gradient descent (SGD). To this end, we propose two consistent estimators of the asymptotic covariance of the average iterate from SGD: (1) an intuitive plug-in estimator and (2) a computationally more efficient batch-means estimator, which only uses the iterates from SGD. As the SGD process forms a time-inhomogeneous Markov chain, our batch-means estimator with carefully chosen increasing batch sizes generalizes the classical batch-means estimator designed for time-homogeneous Markov chains. Both proposed estimators allow us to construct asymptotically exact confidence intervals and hypothesis tests. We further discuss an extension to conducting inference based on SGD for high-dimensional linear regression.

**Robert Tibshirani** (Department of Statistics, Stanford University)

**Title:** "Some Progress and Challenges in Biomedical Data Science"

**Abstract:** I will present some new developments and challenges in working with big data in public health. Examples will include (a) an application of the lasso method for high dimensional supervised learning, applied to cancer diagnosis via mass spectrometry, (b) predicting the number of units of platelets that will be needed by a hospital, and (c) "Patients like me": estimating heterogeneous treatment effects for personalized treatment recommendations.

**Natesh Pillai** (Department of Statistics, Harvard University)

**Title:** "TBD"

**Abstract:** TBD

**Peter Hoff** (Department of Statistical Science, Duke University)

**Title:** "Adaptive FAB confidence intervals with constant coverage"

**Abstract:** Confidence intervals for the means of multiple normal populations are often based on a hierarchical normal model. While commonly used interval procedures based on such a model have the nominal coverage rate on average across a population of groups, their actual coverage rate for a given group will be above or below the nominal rate, depending on the value of the group mean. In this talk I present confidence interval procedures that have constant frequentist coverage rates and that make use of information about across-group heterogeneity, resulting in constant-coverage intervals that are narrower than standard t-intervals on average across groups. These intervals are obtained by inverting Bayes-optimal frequentist tests, and so are "frequentist, assisted by Bayes" (FAB). I present some asymptotic optimality results and some extensions to other multiparameter models, such as linear regression.

**Nicholas Polson** (School of Business, The University of Chicago Booth)

**Title:** "Deep Learning Predictors for Traffic Flows"

**Abstract:** We develop deep learning predictors for modeling traffic flows. The challenge in modeling traffic flows arises from sharp nonlinearities due to transitions from free flow to breakdown and to congestion. Our methodology constructs a deep learning architecture to capture nonlinear spatio-temporal flow effects. We show how traffic flow data from road sensors can be predicted using deep learning. We illustrate our methodology on traffic data from Chicago's Interstate I-55 and we forecast traffic flows during two special events, a Chicago Bears football game and a snowstorm. Both examples lead to a sharp traffic flow regime change, which can occur very suddenly, and we show how deep learning tackles short term traffic forecasting in an efficient manner. Finally, we discuss directions for future research.

**Juhee Lee** (Jack Baskin School of Engineering, UC Santa Cruz)

**Title:** "TBD"

**Abstract:** TBD