
Fall 2019 Colloquium: Graduate Portfolio in Applied Statistical Modeling






Pey Mun Shum Tuesday, Dec. 10 GDC 7.514 "Two-Factor Mixed Model of Potential Factors Causing Change Orders for Highway Projects"
Hongjuan Zhang Tuesday, Dec. 10 11:30-12pm GDC 7.514 "Predictive Modeling of Customer Demand in Semiconductor Industry"
Stephanos Politis Tuesday, Dec. 10 2-2:30pm GDC 7.514 "A Framework for Pavement Condition Assessment Using Statistical Learning"
Chinelo Orji Wednesday, Dec. 11 2-2:30pm GDC 7.402 "Identifying Patient and Clinical Factors that Impact Overall Survival among Metastatic Colorectal Cancer Patients who received Chemotherapy"
Shivam Agrawal Wednesday, Dec. 11 2:30-3pm GDC 7.402  "Data-Driven Modeling for Reservoir Simulation"

Pey Mun Shum

Title: "Two-Factor Mixed Model of Potential Factors Causing Change Orders for Highway Projects"

Abstract: Public construction projects, especially the vast majority of highway and street projects in the United States, are procured by state or local governments through a competitive bidding process. In a competitively bid public project, it is common practice to award the contract to the lowest qualified bidder. Public owners are therefore concerned about the performance of the contractor awarded the lowest bid, measured as the change in final construction cost from the awarded price. A large deviation from the awarded contract price is undesirable and puts the owners under scrutiny. Accordingly, several potential factors that could contribute to change orders on infrastructure projects are investigated using data obtained from the Texas Department of Transportation. A two-factor mixed model is established to investigate whether project type and project location have any impact on the percentage change order of a project. The results show that neither project type nor project location affects change orders. Change orders depend on project complexity and scale, which are not reflected in the project type: a low-complexity but large-scale project, such as seal coat or overlay, can carry risks as high as a complicated project such as bridge replacement. Hence, the percentage of change order is independent of project type. As for project location, all the projects in the analysis are located within Texas, so they have similar access to resources and similar availability of contractors to execute the projects; consequently, project location has no effect on the percentage of change order. This study also shows that the number of bidders is negatively correlated with the percentage of change order. However, only 4% of the variability in change order is attributable to the number of bidders, which shows that the effect of competitive bidding is not significant. In summary, project type, project location and the number of bidders do not correlate with change orders. This report contributes to the existing body of knowledge about the factors leading to change orders, despite some limitations of the study.


Hongjuan Zhang

Title:  "Predictive Modeling of Customer Demand in Semiconductor Industry"

Abstract: Customer demand forecasting plays an important role in the supply chain of the semiconductor industry. Knowledge of how customer demand will fluctuate enables the company to keep the right amount of inventory or to schedule production in advance. Given the increasing uncertainties of the global environment and long lead-times for capacity planning and production, semiconductor companies must effectively forecast future customer demand as a basis for related resource allocation and manufacturing decisions.

This paper builds several predictive models to forecast customer demand for semiconductor products with different characteristics, and compares the effectiveness of these models in terms of root mean squared error (RMSE) over different forecasting horizons. The models used are the simple moving average (SMA), simple exponential smoothing (SES), Holt's method, the Holt-Winters seasonal method, the autoregressive integrated moving average model (ARIMA) and the vector autoregressive moving average model (ARMAV). Six categories of products with different demand trends are studied.
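As a rough illustration of how two of the simpler models above could be compared by RMSE, here is a minimal sketch. The demand values, the smoothing parameter `alpha=0.3`, and the helper names `sma_forecasts`, `ses_forecasts` and `rmse` are assumptions made up for this example; they are not taken from the paper.

```python
import math

def sma_forecasts(series, window):
    """One-step-ahead forecasts: mean of the previous `window` observations."""
    return [sum(series[t - window:t]) / window for t in range(window, len(series))]

def ses_forecasts(series, alpha):
    """One-step-ahead simple exponential smoothing forecasts for series[1:]."""
    level = series[0]  # assumption: initialise the level with the first observation
    forecasts = []
    for actual in series[1:]:
        forecasts.append(level)  # forecast is made before `actual` is seen
        level = alpha * actual + (1 - alpha) * level
    return forecasts

def rmse(actuals, forecasts):
    """Root mean squared error between paired actuals and forecasts."""
    squared = [(a - f) ** 2 for a, f in zip(actuals, forecasts)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up monthly demand, for illustration only.
demand = [100, 104, 98, 102, 101, 99, 103, 100]
window = 3
sma = sma_forecasts(demand, window)
ses = ses_forecasts(demand, alpha=0.3)

# Score both models on the same months (those the 3-month SMA can forecast).
sma_rmse = rmse(demand[window:], sma)
ses_rmse = rmse(demand[window:], ses[window - 1:])
print(f"SMA3 RMSE: {sma_rmse:.2f}, SES RMSE: {ses_rmse:.2f}")
```

Scoring both models on the same set of months keeps the comparison fair; a longer forecasting horizon would need multi-step forecasts rather than the one-step-ahead values used here.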

Based on this research, for the data set used in this project, the best-performing model differs by product category. For products with a declining trend, Holt's damped trend method performs best; for products with an increasing trend, the Holt-Winters seasonal model performs best. The ARIMA model is best for products with seasonality and spiky demand. The SMA model with an order of 3 months (SMA3) is best for products with steady demand. The ARMAV_POS model is best for products with sporadic demand. Across all products, the ARIMA model was most frequently the best-performing forecasting model, followed by the SMA3 model.


Stephanos Politis

Title: "A Framework for Pavement Condition Assessment Using Statistical Learning"

Abstract: Pavement condition monitoring is fundamental for the efficient allocation of resources in transportation asset management. However, data collection involves laborious and costly procedures. This study investigates the use of remote sensing data for network-level pavement condition assessment to offer a more cost-effective alternative. Based on an extensive literature review, a statistical learning framework was established to train models that predict the pavement condition of different road segments. The framework exploits the inherent information of multispectral images by generating spectral-related attributes. To identify pavement sampling areas, an automated procedure using image segmentation replaces manual surface digitizing. Unlike previous research, different statistical learning models were explored to approximate the mapping function from spectral information to pavement conditions. A preliminary case study was conducted with data provided by the City of Dallas and multispectral images acquired from the Texas Natural Resources Information System. The mean-shift segmentation algorithm was used to locate noise-introducing areas on the pavement surface. Different statistical learning models were trained using data from the case study and compared. The developed models were employed to predict the road surface condition class of a test set not included in the training procedure. Further research is needed on the different constituent steps to increase prediction accuracy.


Chinelo Orji

Title: "Identifying Patient and Clinical Factors that Impact Overall Survival among Metastatic Colorectal Cancer Patients who received Chemotherapy"


Background: Among cancers that affect both men and women, colorectal cancer is the third most common cancer and the second leading cause of cancer death in the United States. The economic burden of colorectal cancer in the United States is in the range of $5.5-6.5 billion annually and is projected to increase. About 35% of colorectal cancer patients present with metastatic disease at the time of diagnosis. Despite treatment advancements in cancer care, not all cancer patients experience the best possible outcomes. The aim of this study is to identify survival rates and factors that influence survival outcomes in this population.

Methods: This retrospective cohort study used data from the electronic health records of metastatic colorectal cancer patients who received chemotherapy between July 2013 and Dec 2014 at a multi-center oncology practice network. Data were collected in May 2019. Overall survival was defined as the time from the first day of chemotherapy to death or last clinic visit. Kaplan-Meier methods were used to estimate and plot the survival curves. Log-rank tests were used to compare survival among the different subgroups. Adjusted Cox proportional hazards regression was used to estimate the hazards of death among the subgroups.
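For readers unfamiliar with the Kaplan-Meier method mentioned above, the following minimal sketch shows the product-limit calculation on a handful of made-up follow-up times. The function name `kaplan_meier` and the data are assumptions for illustration only and do not reflect the study's data or software.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time.

    durations: follow-up time for each patient
    events: 1 if death was observed, 0 if the patient was censored
            (e.g. follow-up ended at the last clinic visit)
    """
    survival = 1.0
    curve = []
    event_times = sorted({d for d, e in zip(durations, events) if e == 1})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Deaths observed exactly at time t.
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Made-up follow-up times in months; not data from the study.
months = [5, 8, 8, 12, 15, 20]
died = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(months, died)
for t, s in km_curve:
    print(f"month {t}: S(t) = {s:.3f}")
```

Censored patients never count as deaths but remain in the at-risk set until their last visit, which is what separates this estimate from a simple proportion of survivors.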

Results: A total of 2545 observations (corresponding to 2131 patients) were included in the analyses. The mean age was 62 years and 58.8% were male. The median survival times for colorectal, colon and rectal cancers were 21.5, 20.6 and 23.9 months, respectively. The 1-, 3- and 5-year survival rates were 68.5%, 31% and 17.3%, respectively. The log-rank tests showed significant differences in survival by age, sex, disease (colon vs rectal), febrile neutropenia (FN) risk, line of therapy and duration of treatment. The adjusted Cox proportional hazards model showed that female gender was associated with improved survival (Hazard Ratio [HR]: 0.85; 95% CI: X.XX,X.XX), while increasing age and increasing line of therapy were associated with unfavorable survival (HR range: 1.34 to 3.06 and 1.76 to 3.95, respectively).

Conclusion: Survival rates seem to be improving for metastatic colorectal cancer. Age, gender and line of therapy were important predictors of colorectal cancer survival in metastatic patients. These findings may be used to better understand patient prognosis and overall survival and for better clinical management of metastases.


Shivam Agrawal

Title: "Data-Driven Modeling for Reservoir Simulation"


Objectives/Scope: An oil and gas reservoir may have uncertainty associated with different formation parameters such as porosity, permeability, etc. These uncertainties affect the pressures throughout the reservoir, which in turn affect the hydrocarbon production. Multiple reservoir simulations are run using CMG software to obtain pressure responses that account for the uncertainty in the formation parameters. Each of these runs may take a few hours since the simulator solves complex non-linear PDEs in the background. The objective of this research is to reduce the total number of CMG simulations by training a machine learning model to learn from CMG responses for a few realizations and predict on the remaining realizations.

Methods/Procedures: Training and cross-validation datasets will be obtained by running reservoir simulations using CMG software. Each reservoir simulation will comprise different distributions of formation parameters. Python scripts will be written to read in the grid data, well location data, and well rate data for each realization. Analytical solutions to simplified PDEs will be used to obtain the pressure drop due to a single well. Since these solutions are linear, spatial superposition will be used to obtain the combined pressure drops from multiple wells. This will allow basic trends in pressure to be captured. The remaining, more complex trends will then be captured using a regression model. The regression function is expected to be multivariate and highly non-linear. Tree-based ensemble methods such as XGB and neural networks with multiple dense layers will be implemented and tested for computational cost and accuracy.
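The superposition step can be sketched as follows. The abstract does not specify the analytical single-well solution, so `single_well_drop` below is a hypothetical stand-in (a steady-state radial form that is linear in the well rate); the well positions, rates, and the `simulated_drop` value are likewise made up for illustration.

```python
import math

def single_well_drop(rate, distance, coeff=1.0, outer_radius=1000.0, well_radius=0.1):
    """Hypothetical steady-state radial pressure drop caused by one well.

    Assumption: stands in for the simplified analytical solution.
    It is linear in `rate`, which is what superposition relies on.
    """
    r = min(max(distance, well_radius), outer_radius)  # clamp to the valid radius range
    return coeff * rate * math.log(outer_radius / r)

def superposed_drop(point, wells):
    """Sum the single-well pressure drops at `point` over all wells."""
    x, y = point
    return sum(
        single_well_drop(rate, math.hypot(x - wx, y - wy))
        for wx, wy, rate in wells
    )

# Made-up wells as (x, y, rate); not taken from any reservoir model.
wells = [(0.0, 0.0, 50.0), (300.0, 400.0, 80.0)]
point = (100.0, 0.0)
baseline = superposed_drop(point, wells)

# The regression model would then be trained on the residual between the
# simulator's pressure drop and this analytical baseline.
simulated_drop = 400.0  # placeholder value, not real simulator output
residual = simulated_drop - baseline
print(f"baseline: {baseline:.2f}, residual for regression: {residual:.2f}")
```

Splitting the target this way leaves the regression model with only the part of the response that the simplified physics cannot explain.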

Novelty: Reservoir simulations have traditionally been modeled using physics-based formulations. If the formation parameters come from similar distributions, there is an opportunity to accelerate these problems by using machine learning. This research will be an essential first step towards exploring that opportunity.