SDS Graduate Students Put Skills to Work through Summer Internships
Many Ph.D. and M.S. in Statistics students find summer internships a valuable way to practice and extend the skills they learned in their coursework. Kelly Kang, Chandra Shekhar, and Mingzhang Yin describe how they spent the summer of 2019.
Kelly Kang (Ph.D. in Statistics): Microsoft
"I did my internship at Microsoft Window Defender ATP team this summer. The job of my team is to help Windows end point users to block malwares. The rough idea is that we have a lot of blocking rules sitting on cloud, whenever an end point user is trying to download something, we are going to collect some information about that file/software, like file name, installation path, etc., and then send this information to the cloud, and the cloud will then return a recommendation on whether it is legitimate or not.
My internship project was to help my team optimize our cloud classifiers using Automated Machine Learning (AutoML) techniques developed by Microsoft Research. AutoML is a recommender system: in the same way we recommend movies to users, it recommends machine learning pipelines for datasets. Each pipeline covers data pre-processing, feature selection, hyperparameter tuning, and algorithm selection. I helped my team build an AutoML pipeline on Azure using Python, and another on ML.NET using C#."
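As a rough illustration of what an AutoML system automates, the scikit-learn sketch below searches jointly over feature selection settings, algorithm choice, and hyperparameters for a single pipeline. The dataset and search space are hypothetical stand-ins; this is a minimal sketch of the idea, not Microsoft's AutoML.

```python
# Minimal sketch of what an AutoML system automates: a joint search over
# pre-processing, feature selection, model choice, and hyperparameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder; swapped by the grid
])

# Each grid entry is one candidate pipeline family, so a single search covers
# both algorithm selection and per-algorithm hyperparameter tuning.
search_space = [
    {"select__k": [10, 20],
     "clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1.0, 10.0]},
    {"select__k": [10, 20],
     "clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [100, 300]},
]

search = GridSearchCV(pipeline, search_space, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Listing the `clf` step inside the grid is what lets one cross-validated search compare different algorithms alongside their own hyperparameter ranges.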
Chandra Shekhar (M.S. in Statistics, Ph.D. in Petroleum Engineering): Schlumberger
"I was part of the Sonic Team which collects data through the tool called Sonic Scanner. The tool is triggered at each depth of a well, the waveforms are generated which travel through the formation and received by the receivers. Then, Slowness Time Coherence (STC) processing of these waveforms is done. After the processing, we get multiple coherence peaks at each depth which are then assigned to either compressional slowness group, shear slowness group or to the 'other' group. So the main objective was: for each depth, classify peaks to 3 groups — shear, compressional or the other group which comprises of the peaks which did not belong to either compressional or shear.
I built an XGBoost classifier, which is an ensemble of many weak learners (gradient-boosted decision trees). Grid search with K-fold cross-validation was used to tune the hyperparameters. We had nine cases, each with its own training set for fitting the model and its own test set for evaluating it. Test accuracy was as high as 99%. Even in the cases where overall accuracy was lower, accuracy on the compressional group remained high. This matters because compressional slowness is a primary deliverable for the Sonic team."
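Below is a minimal sketch of the setup described above, assuming synthetic three-class data in place of the proprietary STC peak features; the parameter grid and split are illustrative, not the ones used at Schlumberger.

```python
# Sketch of the described setup: an XGBoost classifier tuned by grid search
# with K-fold cross-validation. Synthetic three-class data stands in for the
# STC peak features (compressional / shear / other).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=12, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}
search = GridSearchCV(
    XGBClassifier(objective="multi:softprob"),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

Since `GridSearchCV` refits the best configuration on the full training split by default, `search.score` evaluates the tuned model on the held-out test set directly.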
Mingzhang Yin (Ph.D. in Statistics): Google Brain
"The project I worked on is studying generalization and memorization problem in meta-learning. Meta-learning is a family of algorithms that can utilize the knowledge that is learned in past tasks to help learn a new task quickly. In the meta-training and meta-testing, each task is divided into context and target data sets. The goal is to get good performance/prediction on the target set with as few context points as possible. However, neural networks have a strong memorization ability by which it can 'memorize' the mapping from a large amount of data to their labels. Imagine an automated medical prescription system that suggests medication prescriptions to doctors based on patient symptoms and the patient's previous record of prescription responses (i.e., medical history) for adaptation. In the meta-learning framework, each patient represents a separate task. A standard meta-learning system can memorize the information of the training patients, leading it to ignore the medical history and only utilize the symptoms combined with the memorized information. As a result, it may issue highly accurate prescriptions on the meta-training set, but fail to adapt to new patients effectively.
To overcome the memorization problem in non-mutually-exclusive tasks, we propose information-theoretic regularizations of the meta-learning objective. By both recognizing the challenge of memorization and developing a general, lightweight approach to solving it, this work is an important step toward making meta-learning algorithms applicable to, and effective on, any problem domain."
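To make the episodic structure Yin describes concrete, here is a minimal MAML-style meta-learning sketch in PyTorch, where each task contributes a context set for inner-loop adaptation and a target set for the outer-loop update. The sine-regression tasks, network size, and single inner gradient step are illustrative assumptions, not the system studied in the project.

```python
# Minimal MAML-style sketch of the context/target episode structure.
# Tasks are hypothetical 1-D sine regression problems.
import torch

def sample_task(n_context=5, n_target=20):
    # One task = one sine wave with a random amplitude and phase.
    amp = torch.rand(1) * 4.0 + 0.5
    phase = torch.rand(1) * 3.14
    x = torch.rand(n_context + n_target, 1) * 10.0 - 5.0
    y = amp * torch.sin(x + phase)
    return (x[:n_context], y[:n_context]), (x[n_context:], y[n_context:])

net = torch.nn.Sequential(
    torch.nn.Linear(1, 40), torch.nn.ReLU(), torch.nn.Linear(40, 1))
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for step in range(1000):
    (x_ctx, y_ctx), (x_tgt, y_tgt) = sample_task()
    # Inner loop: adapt to the few context points with one gradient step,
    # keeping the graph so the outer loop can differentiate through it.
    params = list(net.parameters())
    inner_loss = loss_fn(net(x_ctx), y_ctx)
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    adapted = [p - 0.01 * g for p, g in zip(params, grads)]
    # Evaluate the adapted parameters on the target set (manual forward pass).
    hidden = torch.relu(x_tgt @ adapted[0].t() + adapted[1])
    target_loss = loss_fn(hidden @ adapted[2].t() + adapted[3], y_tgt)
    # Outer loop: update the shared initialization so adaptation generalizes.
    meta_opt.zero_grad()
    target_loss.backward()
    meta_opt.step()
    if step % 200 == 0:
        print(f"step {step}: target loss {target_loss.item():.3f}")
```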
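And a hedged sketch of the general regularization pattern the closing paragraph points to: treat some meta-parameters as stochastic and add a KL penalty to the outer-loop objective, limiting how much task information they can store. The Gaussian form, the shapes, and the coefficient are assumptions for illustration, not the paper's exact objective.

```python
# Hedged sketch of an information-theoretic regularizer: penalize the
# information stored in stochastic meta-parameters with a KL term.
# Shapes, the standard-normal prior, and beta are illustrative assumptions.
import torch

def gaussian_kl(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over parameters.
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0)

# Variational parameters for a stochastic weight vector (hypothetical size).
mu = torch.zeros(64, requires_grad=True)
log_var = torch.zeros(64, requires_grad=True)
beta = 1e-3  # regularization strength (assumed hyperparameter)

# Reparameterized sample of the stochastic weights.
w = mu + (0.5 * log_var).exp() * torch.randn(64)
target_loss = w.pow(2).mean()  # stand-in for the meta-learning target loss
total_loss = target_loss + beta * gaussian_kl(mu, log_var)
total_loss.backward()  # gradients reach mu and log_var through both terms
```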