


SSI Course Spotlight: Scalable Machine Learning: Methods and Tools

The Department of Statistics and Data Sciences at The University of Texas at Austin is hosting the 13th annual UT Summer Statistics Institute (SSI) May 26–29, 2020! Learn more and register.


What is the course? 

This course introduces participants to common machine learning methods and tools used in practice and shows how to run them at scale on large data sets. It covers common methods in data transformation, unsupervised learning, supervised learning, and deep learning, including principal component analysis, multidimensional scaling, k-means clustering, Gaussian mixture models, regression, support vector machines, naive Bayes classification, decision trees, and random forests. Several deep neural network architectures, including autoencoders, convolutional neural networks, and recurrent neural networks, will also be introduced.

Throughout the course, participants will learn existing packages and tools that support these methods, such as scikit-learn, pandas, TensorFlow, and Keras. To help scale the computations, the course will introduce the Spark programming framework and Spark MLlib, along with Spark's interfaces to other programming languages, including Python. To scale deep learning workflows across multiple nodes, the course will introduce the Horovod package.

The purpose of this course is to give participants broad, applicable knowledge of machine learning and of how to use open source tools to carry out these analyses in practice. The class will focus on the pros and cons of different methods and on how existing tools can be used, rather than on how to implement a particular method from scratch. The course will consist of lectures, example code and demonstrations, and in-class discussion and practice.
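To give a flavor of using existing tools rather than implementing methods from scratch, here is a minimal sketch that chains three of the methods listed above (standardization, principal component analysis, and k-means clustering) with scikit-learn. It is not taken from the course materials: the synthetic data, the helper names `make_two_groups` and `cluster_points`, and all parameter values are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_two_groups(seed=0):
    """Build a toy data set (illustrative only): two well-separated groups
    of 50 points each in 5 dimensions."""
    rng = np.random.default_rng(seed)
    group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
    group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
    return np.vstack([group_a, group_b])


def cluster_points(X, n_components=2, n_clusters=2, seed=0):
    """Standardize the features, project them onto the leading principal
    components, then assign each point to a k-means cluster."""
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
    )
    return pipeline.fit_predict(X)


X = make_two_groups()
labels = cluster_points(X)
print(np.bincount(labels))  # number of points assigned to each cluster
```

Because every step is an off-the-shelf component, swapping one method for another (for example, a Gaussian mixture model in place of k-means) means changing a single line of the pipeline, which is the kind of comparison between methods the course description emphasizes.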

Who can take this course? 

This course is intended for people who need to carry out machine learning and deep learning analyses in practice but have no formal training in computer science. By the end of the course, participants are expected to have an overall understanding of the common methods covered and to be able to choose appropriate tools for their own analysis problems. This course will benefit people who are interested in large-scale data and machine learning in practice. Participants should have a general understanding of statistics and linear algebra and should know how to execute programs from the command line. They should also have a working knowledge of a programming language and basic knowledge of Python.

What are the requirements to enroll? 

A personal laptop is required, with Java 1.8, Python, and Jupyter Notebook installed. This summer all courses will be offered online. Each day will include a one-hour synchronous session in the morning, asynchronous time for participants to work through exercises and/or comprehension checks, another one-hour synchronous session in the afternoon, and, if applicable, follow-up exercises to complete before the next morning session.

Who is teaching the course? 

Weijia Xu, a Research Engineer and a Scientist Associate Manager, is teaching this course at 11 AM and 3 PM. Dr. Xu is a research scientist and the Manager of the Data Mining and Statistics group at the Texas Advanced Computing Center (TACC). He is a computational scientist by training and an experienced data scientist. His main research interest is enabling data-driven discoveries by developing new computational methods and applications that facilitate the data-to-knowledge transfer process. His projects have been funded by various federal and state agencies, including NIH, NSF, TxDOT, and USDA, and have resulted in over 40 peer-reviewed publications. In recent years, he has been active in the field of Big Data and has served on program committees for several workshops and conferences in this area, most recently as co-chair of the 2016 IEEE Conference on Big Data. Dr. Xu leads the group that supports large-scale data-driven analysis and machine learning workflows using computing resources at TACC. This support includes providing platforms and software, collaborating with users, and offering training workshops and tutorials on topics such as R, Python, Hadoop, and Spark for machine learning and deep learning.
