The accumulation of data continues to outstrip the graduate training needed to mine it meaningfully. The problem is compounded because holistic training in biomedical big data analysis requires PhD-level expertise in not one but three core research areas: (1) biology, (2) statistics, and (3) computer science. Yet most traditional PhD training programs require students to choose just one of these areas as their focus. A growing number of biomedical PhD students recognize the need to develop data analysis and computational biology skills, while a growing number of computer science and statistics PhD students realize that their marketability could expand substantially if they knew how to apply their skills to outstanding problems in health.

The purpose of this pre-doctoral training program at The University of Texas at Austin is for the trainee to become an expert in one of the following areas: (1) Statistics (STAT); (2) Computer Science (CS); (3) Computational Science, Engineering, and Mathematics (CSEM); or (4) Biology, via a PhD in (a) Neuroscience [NS], (b) Ecology, Evolution, and Behavior [EEB], (c) Cell and Molecular Biology [CMB], or (d) Biomedical Engineering [BME], while also obtaining essential training in all three core areas (statistics, computer science, and biology).

Training

Training for the program involves three formal components:

  1. core courses (3)
  2. research lab rotations (2)
  3. seminar/workshop course

Core Courses

Trainees with sufficient skills in computer science, statistics, and biology take the following three courses in the second year of their PhD programs.

BIO 382K: Computational and Statistical Biology (new course): An introduction to modern biology for students with quantitative backgrounds. The course includes a survey of modern biology and also introduces modern statistical approaches, bioinformatics analyses, and computational approaches fundamental to "big data" and other life sciences in contexts students are likely to encounter in their own research. [Syllabus]

CSE 380: Tools and Techniques of Computational Sciences: Graduate-level introduction to the practical use of high-performance computing hardware and software engineering principles for scientific technical computing. Topics include computer architectures, operating systems, programming languages, data structures, interoperability, and software development, management, and performance. [Syllabus]

SDS 385: Statistical Models for Big Data (new course): This course will cover big data modeling approaches including linear models, graphical models, matrix and tensor factorizations, clustering, and latent factor models. Algorithms explored will include sketching, fast n-body problems, random projections and hashing, large-scale online learning, and parallel learning. [Syllabus]

Research Rotations

Each student will complete (at least) two lab rotations, one "quantitative" and one "biomedical," designed to provide direct mentoring, the experience of working on a research team, and experience with real problems in big data. Each rotation lasts a semester, and both typically take place during the fall semester of year 3 of the PhD. For each rotation, the student will register for a 3-credit course.

Weekly seminar course

The goal of the one-credit (per semester) seminar/workshop is to teach students how to engage in research and communicate effectively in both written and oral formats.

Program Timeline

First Year:
During the first year of their PhD program, students will take the standard curriculum for their respective programs. In addition, students who have been identified by this time will sit in on the seminar/workshop course; they will then take it for one credit per semester during the third through fifth semesters (for a total of 3 credits). Biology PhD students will typically pick their primary dissertation advisor during this year.

Second Year:
Trainees will take three program core courses and enroll in the seminar/workshop course in the second year of their PhD programs. Statistics and computation PhD students will typically pick their dissertation advisor during this year.

Third Year:
The first semester of the third year will comprise two rotations and any additional coursework. Trainees will also enroll in the seminar/workshop course for both semesters. The second semester will be spent developing and/or finalizing a dissertation topic. All third-years will submit F31 applications, assuming their dissertation work is fundable by an institute that participates in the F31 mechanism.

Teaching Experience:
The trainee will serve as a Teaching Assistant (TA) for at least one semester. This will typically occur in the second semester of year three, the first semester of year four, or the first year of the PhD program.

Post-Training Program Period:
Fourth year and beyond: The student will focus on dissertation research, supported by a member of the program faculty or an NIH F31.

Benefits of Program

  • Two years of prestigious fellowship funding
  • Training in statistics, computer science, and biology via three formal courses, a seminar course, and research rotations
  • Funding for travel to NIH BD2K consortium meeting in Bethesda and to a national Big Data science meeting (while supported by the training grant) 
  • Excellent and enhanced job prospects upon graduation

Contact

For additional information about the Biomedical Big Data Training program, please email Vicki Keller at vicki.keller[@]cns[dot]utexas[dot]edu or James Scott at james.scott[@]mccombs[dot]utexas[dot]edu.