
The series is envisioned as a vital contribution to the intellectual, cultural, and scholarly environment at The University of Texas at Austin for students, faculty, and the wider community. Each talk is free of charge and open to the public. For more information, contact Stephanie Tomlinson at sat[@]austin[dot]utexas[dot]edu.


January 23, 2019 – Andee Kaplan
(Department of Statistical Science, Duke University)
"Life After Record Linkage: Tackling the Downstream Task with Error Propagation"
PAR 301, 2:00 to 3:00 PM

January 25, 2019 – Eric Jonas
(UC Berkeley Center for Computational Imaging & RISELab)
"Exploiting computational scale for richer model-based inference"
GDC 4.302, 2:00 to 3:00 PM

January 28, 2019 – Maricela Cruz
(Department of Statistics, University of California, Irvine)
"Interrupted Time Series Models for Analyzing Complex Healthcare Interventions” 
BUR 136, 2:00 to 3:00 PM

January 30, 2019 – Prasad Patil
(Department of Data Sciences, Dana-Farber Cancer Institute, Department of Biostatistics, Harvard T.H. Chan School of Public Health)
"Replicability of genomic signatures and scientific results” 
PAR 301, 2:00 to 3:00 PM

February 1, 2019 – Rachel Nethery
(Department of Biostatistics at the Harvard T.H. Chan School of Public Health)
"Bayesian and Machine Learning Approaches to Estimate Causal Effects in Environmental Health Applications"
GDC 4.302, 2:00 to 3:00 PM

February 4, 2019 – Antonio Linero
(Department of Statistics at Florida State University)
"Theory and Practice for Bayesian Regression Tree Ensembles..” 
BUR 136, 2:00 to 3:00 PM

February 22, 2019 – Lorenzo Trippa 
(Department of Biostatistics and Computational Biology, Harvard University)
"Bayesian Designs for Glioblastoma Clinical Trials."
BUR 136, 2:00 to 3:00 PM

March 1, 2019 – Raquel Prado
(Department of Statistics, University of California Santa Cruz)
"Bayesian models for complex-valued fMRI"
BUR 136, 2:00 to 3:00 PM

March 8, 2019 – Veera Baladandayuthapani
(School of Public Health, University of Michigan)
"Bayesian Models for Richly Structured Data in Biomedicine"
BUR 136, 2:00 to 3:00 PM

March 11, 2019 – Roger Peng
(Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health)
"Estimating the health impacts of modifying air pollution mixtures"
BUR 136, 1:00 to 2:30 PM

March 29, 2019 – Matteo Vestrucci
(Department of Statistics, University of Texas at Austin)
"Cognitive disease progression models for clinical trials in autosomal-dominant Alzheimer's disease"
BUR 136, 2:00 to 3:00 PM

April 5, 2019 – Paul Rathouz
(Director of the Biomedical Data Science Hub, University of Texas at Austin, Dell Medical School)
"Semiparametric Generalized Linear Models: Small, Large, and Biased Samples"
BUR 136, 2:00 to 3:00 PM

April 26, 2019 – Junming Yin
(Department of Management Information Systems, University of Arizona)
"Towards Better Learning from Crowd Labeling"
BUR 136, 2:30 to 3:30 PM (**Note different time**)

May 3, 2019 – Surya Tokdar
(Department of Statistical Science, Duke University)
"Quantile Regression for Correlated Response"
BUR 136, 2:00 to 3:00 PM

May 7, 2019 – Subhajit Dutta
(Department of Mathematics & Statistics, IIT Kanpur)
"On Perfect Classification and Clustering for Gaussian Processes"
GDC 7.514, 2:00 to 3:00 PM

May 10, 2019 – Georgia Papadogeorgou
(Department of Statistical Science, Duke University)
"Unmeasured spatial confounding in air pollution studies"
BUR 136, 2:00 to 3:00 PM

Andee Kaplan (Department of Statistical Science, Duke University)

Title: Life After Record Linkage: Tackling the Downstream Task with Error Propagation

Abstract: Record linkage (entity resolution or de-duplication) is the process of merging noisy databases to remove duplicate entities that often lack a unique identifier. Linking data from multiple databases increases both the size and scope of a dataset, enabling post-processing tasks such as linear regression or capture-recapture to be performed. 

Any inferential or predictive task performed after linkage can be considered the "downstream task." While recent advances have improved the flexibility and accuracy of record linkage, the downstream task remains limited by the propagation of errors through this two-step process. In this talk, I present a generalized framework, called prototyping, for creating a representative dataset post-record linkage for the downstream task. Given the information about the representative records, I explore two downstream tasks: linear regression and binary classification via logistic regression. In addition, I discuss how error propagation occurs in both of these settings. I provide thorough empirical studies for the proposed methodology, and conclude with a discussion of practical insights into my work.
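As a toy illustration of the error propagation discussed above (this is an illustrative simulation, not the prototyping method of the talk), one can mimic linkage errors by occasionally pairing a record's covariate with the wrong entity's outcome and observe how the downstream regression slope attenuates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(0, 0.5, n)  # true slope is 2

# Slope estimated from correctly linked records.
slope_true = np.polyfit(x, y, 1)[0]

# Mimic linkage errors: with probability 0.2, a record's covariate is
# taken from a different (wrongly linked) entity.
err = rng.random(n) < 0.2
x_linked = x.copy()
x_linked[err] = rng.permutation(x)[err]

# Linkage errors attenuate the estimated slope toward zero.
slope_linked = np.polyfit(x_linked, y, 1)[0]
print(slope_true, slope_linked)
```

With roughly 20% of links wrong, the estimated slope shrinks toward 80% of its true value, a simple instance of why errors from the linkage step must be accounted for downstream.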

Eric Jonas (UC Berkeley Center for Computational Imaging & RISELab)

Title: Exploiting computational scale for richer model-based inference

Abstract: Understanding the deluge of scientific data acquired from next-generation technologies, from astronomy to neuroscience, requires advances in translating our existing knowledge to useful models. Here I show how our recent advances in scalable computing, from “serverless” cloud offerings to deep function approximation, can let us capture and exploit this prior knowledge. Examples include models derived from human intuition (for neural connectomics), carefully-engineered physical systems (for imaging through scattering media), and even direct simulation (for superresolution microscopy). By expanding the space of models we can work with, we can avoid common data science pitfalls while making computing at scale accessible to the entire scientific community.

Maricela Cruz (Department of Statistics, University of California, Irvine)

Title: Interrupted Time Series Models for Analyzing Complex Healthcare Interventions

Abstract: Now more than ever, patients, providers, resources and contexts of care interact in dynamic ways to produce various measurable health outcomes that often do not align with expectations. This complexity and interdependency make it difficult to assess the true impact of interventions designed to improve patient healthcare outcomes, in terms of both research design and statistical analysis. Healthcare intervention data can be modeled as interrupted time series (ITS): sequences of measurements for an outcome collected at multiple time points before and after an intervention. There are, however, limitations to the current statistical methodology for analyzing ITS data. Namely, contemporary methods restrict the interruption’s effect to a predetermined time point or remove data for which the effects of the intervention may not be realized. In addition, commonly used methods often neglect plausible differences in temporal dependence and volatility and restrict analyses to a single hospital unit.

In this talk, I will discuss novel statistical methods developed for evaluating the effect of interventions on health outcomes. I will present the ‘robust-ITS’ model, able to estimate (rather than merely assume) the lagged effect of an intervention on a health outcome. I will illustrate components of robust-ITS that allow researchers to determine whether health outcomes are more predictable (as measured by stronger temporal dependence and smaller variability), and thus more desirable, after an intervention. I will then introduce the ‘Robust Multiple ITS’ model, an extension to allow for the incorporation of multi-unit ITS data, as well as a supremum Wald test that allows one to formally test for the existence of a change point across unit specific mean functions. In total, this methodology accommodates crucial intricacies of interventions under real-world circumstances and overcomes many of the omissions and limitations of current approaches. I illustrate the methods by analyzing patient centered data from a hospital that implemented and evaluated a new care delivery model in multiple units.
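The single-series segmented-regression baseline that models like robust-ITS extend can be sketched with a minimal simulation (this assumes a known change point; robust-ITS itself estimates the lagged effect and temporal dependence, which this sketch omits):

```python
import numpy as np

# Simulated monthly outcome: level 10, slope 0.1; an intervention at
# t = 24 adds a level change of -2 and a slope change of -0.05.
rng = np.random.default_rng(0)
t = np.arange(48)
post = (t >= 24).astype(float)
t_post = post * (t - 24)
y = 10 + 0.1 * t - 2.0 * post - 0.05 * t_post + rng.normal(0, 0.3, t.size)

# Classic segmented-regression ITS design matrix:
# intercept, time, post-intervention indicator, time since intervention.
X = np.column_stack([np.ones_like(t, dtype=float), t, post, t_post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [10, 0.1, -2, -0.05]
```

The last two coefficients are the level and slope changes attributed to the intervention; restricting the effect to the predetermined time t = 24 is exactly the limitation the abstract describes.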

Prasad Patil (Department of Data Sciences, Dana-Farber Cancer Institute, and Department of Biostatistics, Harvard T.H. Chan School of Public Health)

Title: Replicability of genomic signatures and scientific results

Abstract: There has been an increased emphasis on replicability (the belief that a result can be obtained again or confirmed in a new sample) in the conduct of scientific research. I will describe a series of projects (1) assessing replicable behavior in the training of genomic signatures (predictors that use gene expression measurements as input) and (2) communicating the replicability of scientific study results.

In genomic signature development, technical choices and cohort composition can change a signature’s predictions for the same patient. I will show empirical and preliminary theoretical results for training ensemble predictors using multiple studies’ worth of data. In this “intermediate-data” setting, out-of-study predictive generalizability can be improved by leveraging inter-study heterogeneity in features and their associations with the outcome. Closely related to transfer learning, distributed/federated learning, and covariate shift, this problem arises out of site-specific differences in patient recruitment, patient characteristics, and measurement technology.

Discussions on the reproducibility and replicability of scientific study results have gone forward without consensus on definitions for these terms, or on expectations for how many studies should replicate and how often. I will offer a re-analysis of the conclusions of a major replication effort and an R package designed to visually compare and communicate replication attempts.

Rachel Nethery (Department of Biostatistics, Harvard T.H. Chan School of Public Health)

Title: Bayesian and Machine Learning Approaches to Estimate Causal Effects in Environmental Health Applications

Abstract: I will discuss two projects on Bayesian and machine learning methods for causal inference applied to investigations of (1) the health impacts of exposure to natural gas infrastructure and (2) the causes of cancer clusters. In the first project, we seek to estimate the average causal effect of exposure to natural gas compressor stations on cancer mortality in the US. Because the data exhibit propensity score non-overlap (i.e., regions of poor support), estimation of population average causal effects requires reliance on model specifications. All existing methods to address non-overlap (e.g., trimming) change the estimand, which can diminish the study's impact. We make two contributions on this topic. We first propose a data-driven definition of the overlap and non-overlap regions. Next, we develop a Bayesian machine learning method to estimate population average causal effects in the presence of non-overlap, which delegates the tasks of estimating causal effects in the overlap and non-overlap regions to two distinct models, suited to the degree of data support in each region.

In the second project, we propose a causal inference framework for cancer cluster investigations. These investigations arise when a community reports high cancer rates, often suspecting a relationship between the cancer and a hazardous exposure. Departments of health typically perform a standardized incidence ratio (SIR) analysis in response. This approach has several well-documented limitations. Assuming that a potentially hazardous exposure in the community is identified a priori, we introduce an estimand called the causal SIR (cSIR): the expected cancer incidence in the exposed population divided by the expected cancer incidence for the same population under the counterfactual scenario of no exposure. To estimate the cSIR we must (1) identify unexposed populations similar to the exposed one to inform estimation of the counterfactual and (2) resolve the spatial over-aggregation of publicly available cancer incidence data for these unexposed populations. We address the first challenge with matching and the second by developing a Bayesian model that borrows information from other sources to impute cancer incidence at the desired level of spatial aggregation.
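The estimand described above can be sketched in potential-outcomes notation (notation assumed here, not necessarily the speaker's):

```latex
\mathrm{cSIR} = \frac{E[Y(1)]}{E[Y(0)]}
```

where \(Y(1)\) is the cancer incidence in the exposed population under exposure and \(Y(0)\) is the incidence for the same population under the counterfactual of no exposure; the standard SIR instead compares observed incidence to that of a general reference population.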

Antonio Linero (Department of Statistics at Florida State University)

Title: Theory and Practice for Bayesian Regression Tree Ensembles

Abstract: Ensembles of decision trees have become a standard component of the data analyst's toolkit; commonly used algorithms include random forests and boosted decision trees. In this talk, we investigate the properties of regression tree ensembles from a Bayesian standpoint. We focus on the interplay between theory and practice to study the properties of ensembles and obtain insights into (a) why decision tree ensembles are successful in practice and (b) where they might be improved. We provide validation for the long-held hypothesis that BART ensembles perform well due to their ability to detect low-order interactions, a property which describes many real-world settings. Further, we identify two areas in which BART ensembles can be expected to be suboptimal: under sparsity, and when the underlying regression function exhibits higher-order smoothness. We give theoretical support for these insights by establishing posterior contraction at near-optimal rates adaptively across a large family of function spaces, and provide empirical support by applying our methodology to benchmark datasets. We conclude by presenting extensions of our methodology which account for other interesting structures beyond sparsity and smoothness, and discuss how the insights we obtain can be extended to non-Bayesian decision tree ensembling methods.
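For readers unfamiliar with BART, the sum-of-trees model underlying the talk can be sketched as follows (standard notation from the BART literature; the speaker's exact prior specification may differ):

```latex
y_i = f(x_i) + \varepsilon_i, \qquad
f(x) = \sum_{j=1}^{m} g(x;\, T_j, M_j), \qquad
\varepsilon_i \sim N(0, \sigma^2)
```

where each \(g(x; T_j, M_j)\) is a regression tree with structure \(T_j\) and leaf parameters \(M_j\), and a regularization prior keeps each tree's contribution small so the ensemble behaves like a sum of many weak learners.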

Lorenzo Trippa (Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard University)

Title: Bayesian Designs for Glioblastoma Clinical Trials

Abstract: There have been few treatment advances for patients with glioblastoma (GBM) despite increasing scientific understanding of the disease. While factors such as intrinsic tumor biology and drug delivery are challenges to developing efficacious therapies, it is unclear whether the current clinical trial landscape is optimally evaluating new therapies and biomarkers. We queried ClinicalTrials.gov for interventional clinical trials for patients with GBM initiated between January 2005 and December 2016 and abstracted data regarding phase, status, start and end dates, testing locations, endpoints, experimental interventions, sample size, clinical presentation/indication, and design to better understand the clinical trials landscape. Only approximately 8%–11% of patients with newly diagnosed GBM enroll on clinical trials, with a similar estimate for all patients with GBM. Trial duration was similar across phases, with median time to completion between 3 and 4 years. While 93% of clinical trials were in phases I–II, 26% of the overall clinical trial patient population was enrolled on phase III studies. We use the results of this meta-analysis to discuss the pros and cons of trial designs in GBM, including platform designs and the use of external controls. Platform designs allow new experimental arms to be added to the clinical study as they become available, while external controls leverage external data that were not generated from the clinical study itself.

Raquel Prado (Department of Statistics, University of California Santa Cruz)

Title: Bayesian models for complex-valued fMRI 

Abstract: Detecting which voxels/regions are activated by an external stimulus is one of the main goals in the analysis of functional magnetic resonance imaging (fMRI) data. Voxel time series in fMRI are complex-valued signals consisting of magnitude and phase components; however, most studies discard the phase and use only the magnitude data. We present a Bayesian variable selection approach for detecting activation at the voxel level from complex-valued fMRI (f(c)MRI) recorded during task experiments. We show that this approach leads to fast and improved detection of activation when compared to alternative magnitude-only approaches. We discuss and illustrate modeling extensions that incorporate additional spatial structure via kernel convolution for more flexible analysis of f(c)MRI. The complex-valued spatial model encourages voxels to be activated in clusters, which is appropriate in applied settings, as the execution of complex cognitive tasks, and therefore brain activation, usually involves populations of neurons spanning many voxels rather than isolated voxels. Finally, we present models that can handle multi-subject data and allow us to infer connectivity at the region-specific level in addition to voxel-specific activation. Model performance is evaluated through extensive and physically realistic simulation studies and in the analysis of human f(c)MRI data.
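As background, a complex-valued voxel time series is typically written in magnitude-phase form (notation assumed here):

```latex
y_v(t) = \rho_v(t)\, e^{i \phi_v(t)} + \varepsilon_v(t)
```

where \(\rho_v(t)\) is the magnitude, \(\phi_v(t)\) the phase, and \(\varepsilon_v(t)\) complex-valued noise; magnitude-only analyses model \(|y_v(t)|\) and discard the phase information \(\phi_v(t)\) entirely.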

Veera Baladandayuthapani (School of Public Health, University of Michigan)

Title: Bayesian Models for Richly Structured Data in Biomedicine

Abstract: Modern scientific endeavors generate high-throughput, multi-modal datasets of different sizes, formats, and structures at a single subject-level. In the context of biomedicine, such data include multi-platform genomics, proteomics and imaging; and each of these distinct data types provides a different, partly independent and complementary, high-resolution view of various biological processes. Modeling and inference in such studies is challenging, not only due to high dimensionality, but also due to presence of rich structured dependencies such as serial, functional, tree, and shape-based correlations.  In this talk I will cover some regression and clustering frameworks for modeling data, where the observations (statistical atoms) lie on non-standard spaces such as densities, trees and shapes.  Using coherent data-based projections (basis functions and metric spaces), we will show how to build probabilistic frameworks that can extract maximal information from such data for inference. These approaches will be illustrated using several biomedical case examples especially in oncology.


Roger Peng (Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health)

Title: Estimating the health impacts of modifying air pollution mixtures


Abstract: The traditional statistical approach to studying the health impacts of complex environmental mixtures has been to select individual components and adjust for the presence of other components. This approach allowed for the straightforward translation of research on health effects to interventions or policies. If pollutant X is harmful, then we should reduce exposure to pollutant X. The geometry of the multi-pollutant exposure space removes the natural directionality of the single pollutant approach, breaking the simplicity that had previously connected health effects research and potential interventions. We propose a statistical approach for estimating the health impacts of air pollution mixtures by introducing the concept of a composition-altering contrast, which is any comparison, intervention, policy, or natural experiment that can be used to observe a mixture’s composition in multiple states. Composition-altering contrasts allow us to link changes in mixture composition to observed and interpretable changes in the environment and assess the health effects of mixtures associated with those changes in composition. We apply this approach to data on wildfire particulate matter and hospitalization in the western United States.

Matteo Vestrucci (Department of Statistics, University of Texas at Austin)

Title: Cognitive disease progression models for clinical trials in autosomal-dominant Alzheimer's disease

Abstract: Clinical trial outcomes for Alzheimer's disease are typically analyzed using the mixed model for repeated measures (MMRM) or similar models that compare an efficacy scale change from baseline between treatment arms, with or without participants' disease stage as a covariate. The MMRM focuses on a single fixed follow-up duration regardless of the exposure for each participant. In contrast, a semiparametric cognitive disease progression model (DPM) has been developed for autosomal-dominant Alzheimer's disease based on the Dominantly Inherited Alzheimer Network (DIAN) observational study. The DPM offers two improvements over the MMRM: it aligns and compares participants by disease stage, and it incorporates extended follow-up data from participants with different follow-up durations, using all data until the last participant visit. Variations of this model will be introduced, together with some developments aimed at predicting the onset of the disease in the general population.

Paul Rathouz (Director of the Biomedical Data Science Hub, Dell Medical School, University of Texas at Austin)

Title: Semiparametric Generalized Linear Models: Small, Large, and Biased Samples

Abstract: Rathouz and Gao (2009) proposed a novel class of generalized linear models indexed by a linear predictor and a link function for the mean of (Y|X). In this class, the distribution of (Y|X) is left unspecified and estimated from the data via exponential tilting of a reference distribution, yielding a response model that is a member of the natural exponential family. Originally, asymptotic results were developed for a response distribution with finite support under the framework of regular maximum likelihood estimation. Allowing support to be either finite or infinite (as will arise with continuous Y), in this talk we present more recent results on inference under small sample sizes, on asymptotics under infinite support, and on scalable computational methods as n -> infinity. We also show how, with very easy-to-implement modifications, the model can accommodate biased samples arising from extensions of case-control designs to continuous response distributions.
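A sketch of the model class (written here in a generic exponential-tilting form; the paper's exact notation may differ):

```latex
f(y \mid x) \propto f_0(y)\, \exp\{\theta(x)\, y\}, \qquad
g\bigl(\mu(x)\bigr) = x^\top \beta
```

where \(f_0\) is the unspecified reference distribution estimated from the data, \(\theta(x)\) is the tilting parameter chosen so that the tilted distribution has mean \(\mu(x)\), and \(g\) is the user-specified link; only \(\beta\) and \(f_0\) need to be estimated.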


Keywords: Baseline distribution; Canonical link; Density-ratio model; Exponential tilting; Linear exponential family; Natural exponential family; Quasi-likelihood; Case-control; Outcome-dependent sample.

Junming Yin (Department of Management Information Systems, University of Arizona)

Title: Towards Better Learning from Crowd Labeling

Abstract: Microtask crowdsourcing has emerged as a cost-effective approach to obtaining large-scale labeled data. Crowdsourcing platforms such as MTurk provide an online marketplace where task requesters can submit a batch of microtasks for a crowd of workers to complete for a small monetary compensation. However, as the information collected from a crowd can be prone to errors, significant effort is required to infer the ground truth labels from noisy annotations supplied by a crowd of workers with heterogeneous and unknown labeling accuracy. Moreover, it would be very beneficial to identify and then possibly filter out low-reliability workers to foster the creation of a healthy and sustainable crowdsourcing ecosystem. Much of the existing literature on crowd labeling has focused on the single-label (i.e., binary and multi-class) setting, while in various applications it is common that each item to be annotated can be assigned to multiple categories simultaneously.

In this work, we consider the problem of learning from crowd labeling in the general multi-label setting. We propose a new Bayesian hierarchical model for the underlying annotation process of crowd workers, and introduce a mixture of Bernoulli distributions to capture the unknown label dependency up to and including higher-order interactions. An efficient variational inference procedure is then developed to jointly infer ground truth labels, worker quality, and label dependency. Results based on extensive numerical experiments and a real-world experiment on MTurk show that our proposed approach achieves a significant improvement over state-of-the-art methods. Our study clearly highlights that effective modeling of both worker quality and label dependency is crucial to the success of a multi-label crowdsourcing application.
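One standard way to write a mixture of Bernoulli products over an L-dimensional binary label vector \(y\) (the talk's exact parameterization may differ):

```latex
p(y) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{L} \mu_{kj}^{\,y_j} \,(1 - \mu_{kj})^{1 - y_j}
```

with mixture weights \(\pi_k\) and component-specific label probabilities \(\mu_{kj}\); although each component factorizes over labels, the mixture as a whole can capture dependence among labels, including higher-order interactions.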



Subhajit Dutta (Department of Mathematics & Statistics, IIT Kanpur)


Title: On Perfect Classification and Clustering for Gaussian Processes

Abstract: According to the Hajek-Feldman property, two Gaussian distributions are either equivalent or mutually singular in the infinite-dimensional case. Motivated by the singularity of a class of Gaussian measures, we first state a result based on the classic Mahalanobis distance and give an outline of the proof. Using this basic result, a joint transformation is proposed and its theoretical properties are investigated. In a classification problem, this transformation induces complete separation among the competing classes, and a simple component-wise classifier leads to 'perfect classification' in such scenarios. In the second part of this talk, we shall discuss the problem of identifying groups in a mixture of Gaussian processes (clustering) by using a new transformation involving Mahalanobis distances. It is curious to note that the proposed method is useless in homoscedastic cases; however, it yields 'perfect clustering' for groups having differences in their covariance operators.

(Joint work with Prof. Juan A. Cuesta-Albertos)


Surya Tokdar (Department of Statistical Science, Duke University)

Title: Quantile Regression for Correlated Response


Abstract: Quantile regression (QR, Koenker and Bassett, 1978) is widely recognized as a fundamental statistical tool for analyzing complex predictor-response relationships, with a growing list of applications in ecology, economics, education, public health, climatology, and so on. In QR, one replaces the standard regression equation of the mean with a similar equation for a quantile at a given quantile level of interest. But the real strength of QR lies in its potential to analyze any quantile level of interest, and perhaps more importantly, contrasting many such analyses against each other with fascinating consequences. 
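For reference, the classical Koenker-Bassett estimator at quantile level \(\tau\) minimizes the check loss (standard notation):

```latex
\hat\beta(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\bigl(y_i - x_i^\top \beta\bigr),
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
```

so that \(x^\top \hat\beta(\tau)\) estimates the \(\tau\)-th conditional quantile of \(Y\) given \(X = x\); varying \(\tau\) across \((0, 1)\) yields the family of analyses contrasted in the talk.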

In spite of the popularity of QR, it is only recently that an analysis framework has been developed (Yang and Tokdar, JASA 2017) which transforms Koenker and Bassett's four-decade-old idea into a model-based inference and prediction technique in its full generality. In doing so, the new joint estimation framework has opened doors to many important advancements of the QR analysis technique to address additional data complications. In this talk I will present such recent developments, specifically focusing on the issue of additional dependence between observation units when observations are spatially indexed and likely spatially correlated.

Georgia Papadogeorgou (Department of Statistical Science, Duke University)

Title: Unmeasured spatial confounding in air pollution studies

Abstract: Unmeasured confounding is a threat to the validity of all causal inference studies. In air pollution, unmeasured confounders are often expected to have a spatial structure: nearby locations are expected to be more similar with respect to the unmeasured variables. In this talk, I will discuss how spatial information can be incorporated in the analysis to adjust for unmeasured spatial confounding. In the first part of the talk I will discuss binary treatments, where the propensity score is augmented to incorporate information on units’ spatial proximity. Matching on the distance adjusted propensity scores encourages matching of treated to control units that are similar in terms of propensity scores and are located near each other. In the second part of the talk, I will discuss estimating causal effects of continuous exposures using outcome regression tools popular within the spatial literature. Based on a set of assumptions relating the exposure and outcome of interest to the unmeasured variables, the causal exposure response is identifiable using observed data. An affine estimator is proposed, and a regularized restricted maximum likelihood approach is used for estimating the model components. Both approaches are used to evaluate the effect of power plant emissions and weather conditions on ambient pollution concentrations, and to investigate the potential threat from unmeasured spatial confounders in this context.