Copula Modeling for Categorical/Ordinal Data

One of my research interests in recent years is copula modeling and theory. In particular, I am working with my collaborators to develop new theory and methodology based on the so-called Checkerboard Copula Regression (CCR), a model-free approach for high-dimensional contingency tables with an ordinal response variable. Several of my recent honors thesis students in Statistics have investigated new ways of applying this methodology and its associated measure in different applications.

Kevin Jin (Class of 2023, summa cum laude in Statistics, Post-Baccalaureate Summer Research Fellowship)

Post-Amherst First Stop: PhD Program in Statistics at University of Michigan.

Thesis Title: “Visualizing Simpson’s Paradox in High-Dimensional Contingency Tables using Checkerboard Copula Regression” Abstract: Simpson’s paradox is a statistical phenomenon in which the association we conclude before considering an important underlying variable reverses or becomes neutral once that variable is taken into account. As we live in an increasingly complex world, most (if not all) data analysis will involve multivariate analysis, which will include interactions of many different variables. For high-dimensional categorical data, there is currently no widely used visualization tool that can help us gain greater insights into data that may exhibit Simpson’s paradox. This paper focuses on using BECCR (Bilinear Extension Checkerboard Copula Regression) Prediction Bubble Plots as a visualization tool to help capture the interactions of multiple categorical variables in data. To build up to this point, we review how and why Simpson’s paradox occurs and why it is important to consider it when doing data analysis, and we develop the copula theory and principles needed to understand checkerboard copulas. We apply BECCR Prediction Bubble Plots to simulated data sets as well as a real-world NIHS data application to determine how these plots could be used to gain greater insights into Simpson’s paradox.
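As a minimal illustration of the reversal described in the abstract above (not taken from the thesis), the Python sketch below uses entirely hypothetical counts for two treatments, A and B, in two strata. The names `counts`, `stratum_rates`, and `pooled_rates` are made up for this example. Treatment A has the higher success rate within each stratum, yet the lower rate once the strata are pooled.

```python
# Hypothetical (successes, trials) counts, invented purely for illustration.
counts = {
    "mild":   {"A": (9, 10),   "B": (80, 100)},
    "severe": {"A": (30, 100), "B": (2, 10)},
}


def stratum_rates(table):
    """Success rate of each treatment within each stratum."""
    return {
        stratum: {arm: k / n for arm, (k, n) in arms.items()}
        for stratum, arms in table.items()
    }


def pooled_rates(table):
    """Success rate of each treatment after collapsing over the strata."""
    totals = {}
    for arms in table.values():
        for arm, (k, n) in arms.items():
            sum_k, sum_n = totals.get(arm, (0, 0))
            totals[arm] = (sum_k + k, sum_n + n)
    return {arm: k / n for arm, (k, n) in totals.items()}


within = stratum_rates(counts)
pooled = pooled_rates(counts)

for stratum, rates in within.items():
    print(f"{stratum:>6}: A = {rates['A']:.3f}, B = {rates['B']:.3f}")
print(f"pooled: A = {pooled['A']:.3f}, B = {pooled['B']:.3f}")
```

The reversal arises here because treatment A is given mostly to the severe stratum, where success is rare for both treatments, so the stratum variable is associated with both the treatment and the outcome.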

Kenny Chen (Class of 2022, summa cum laude in Statistics, Post-Baccalaureate Summer Research Fellowship)

Post-Amherst First Stop: Master's Program in Biostatistics at Harvard University.

Thesis Title: “Model-Free Dependence Measure for High-Dimensional Contingency Tables via the Checkerboard Copula and its Potential as a Goodness of Fit Measure” Abstract: Categorical data analysis with ordinal responses is important in fields such as the social sciences, and taking into consideration the intrinsic ordering of ordinal variables can give more powerful inferences. One step in categorical analysis is exploring the various dependence structures among the variables for exploratory modeling. A dependence structure of particular interest is regression dependence, for which many model-based approaches have been constructed. However, there are comparatively fewer model-free approaches to examining dependence structures in categorical data, and most of these do not focus on regression dependence. To address this, Wei & Kim (2021) proposed a new model-free measure based on the checkerboard copula and demonstrated its ability to identify and quantify the regression dependence in multivariate categorical data with an ordinal response variable and categorical (nominal or ordinal) explanatory variables in an exploratory manner. This thesis explores their novel measure and the methodology behind it. In addition, we extend their work by proposing a model-based estimator of their measure. We conduct simulation studies to evaluate the performance of the model-free and model-based estimators. Initial results demonstrated that model-based estimates of the measure from well-fitted models were comparable to the model-free estimates, suggesting further exploration into the possibility of using the model-free estimator as a goodness-of-fit measure.