Analysis and Diagnostics of Categorical Variables with Multiple Outcomes
Surveys often contain qualitative variables for which respondents may select any number of the outcome categories. For instance, for the question "What type of contraceptive have you used?" with the possible responses oral, condom, lubricated condom, spermicide, and diaphragm, respondents would be instructed to select as many of the J = 5 outcomes as apply. Such data are known as multiple responses, and the outcomes are referred to as items. This thesis discusses several approaches to analysing such data. For stratified multiple response data, we consider three ways of defining the common odds ratio, a summary measure of the conditional association between a row variable and the multiple response variable, given a stratification variable. For each stratum, we define the odds ratio in terms of 1 item and 2 rows, 2 items and 2 rows, or 2 items and 1 row. We then consider two approaches to estimating the common odds ratio, together with the associated (co)variance estimators, for these types of odds ratios. The model-based approach treats the J items as a J-dimensional binary response and fits logit models directly to the marginal distribution of each item using the generalised estimating equation (GEE) method (Liang and Zeger 1986). The non-model-based approach uses Mantel-Haenszel (MH) type estimators. The model-based (or marginal model) approach remains applicable when there are more than two explanatory variables. Preisser and Qaqish (1996) proposed regression diagnostics for GEE. Another model fitting approach is the homogeneous linear predictor (HLP) model based on maximum likelihood (ML), introduced by Lang (2005). We investigate deletion diagnostics such as Cook's distance and DBETA for multiple response data using HLP models (Lang 2005), which have not been considered before, and propose a simple "delete=replace" method as an alternative approach to deletion. The methods are compared with the GEE approach.
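As an illustration of the non-model-based approach, the following is a minimal sketch of the classical Mantel-Haenszel estimator of a common odds ratio for the simplest case of 1 item and 2 rows, where each stratum reduces to a 2 x 2 table. The function name, the cell layout (a, b, c, d) and the counts are hypothetical and chosen for illustration only; the estimators for the 2-item odds ratios and the (co)variance estimators discussed in the thesis are not reproduced here.

```python
from typing import Sequence, Tuple

# One stratum as a 2 x 2 table (hypothetical layout, for illustration):
#   a = row 1, item selected      b = row 1, item not selected
#   c = row 2, item selected      d = row 2, item not selected
Table = Tuple[int, int, int, int]


def mantel_haenszel_or(tables: Sequence[Table]) -> float:
    """Classical Mantel-Haenszel estimate of the common odds ratio.

    Computes sum_k(a_k * d_k / n_k) / sum_k(b_k * c_k / n_k),
    where n_k is the total count in stratum k.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("estimate undefined: sum of b*c/n is zero")
    return numerator / denominator


# Made-up counts for two strata.
strata = [(10, 20, 5, 40), (8, 12, 6, 30)]
estimate = mantel_haenszel_or(strata)
print(f"MH common odds ratio: {estimate:.4f}")
```

Weighting each stratum's cross-products by 1/n_k keeps the estimator defined even when individual strata are sparse, which is one reason MH-type estimators are attractive for stratified data.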
We also discuss the modelling of a repeated multiple response variable, that is, a categorical variable for which subjects can select any number of categories on repeated occasions. Multiple responses have been considered in the literature by various authors; however, repeated multiple responses have not been considered yet. The approaches include the marginal model approach using the GEE and HLP methods, and generalised linear mixed models (GLMM). For the GEE method, we also consider possible correlation structures and propose a groupwise correlation estimation method, which yields more efficient parameter estimates when the correlation structure does differ between groups, as a simulation study confirms.
Ordered categorical variables occur in many applications and can be seen as a special case of multiple responses. The proportional odds model, which uses logits of cumulative probabilities, is currently the most popular model for such data. We consider two diagnostic approaches that focus on the mis-specification of a covariate. The binary approach treats the proportional odds model as J-1 logistic regression models and applies the cumulative residual process introduced by Arbogast and Lin (2005) for logistic regression. The multivariate approach views the proportional odds model as a member of the class of multivariate generalised linear models (MGLM), where the response variable is a vector of indicator responses.
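For reference, the proportional odds model can be written as follows. The notation is an assumption made here for illustration (ordinal response Y with categories 1, ..., J, covariate vector x, cut-points \alpha_j and common slope \beta), and the sign in front of \beta varies between texts.

```latex
\[
  \operatorname{logit} P(Y \le j \mid x)
  = \log \frac{P(Y \le j \mid x)}{1 - P(Y \le j \mid x)}
  = \alpha_j + \beta^{\top} x ,
  \qquad j = 1, \dots, J-1 ,
\]
\[
  \alpha_1 \le \alpha_2 \le \dots \le \alpha_{J-1} .
\]
% Binary view: for each j, the indicator Z_j = I(Y \le j) follows a
% logistic regression with intercept \alpha_j and the same slope \beta.
% Multivariate view: the response is the indicator vector
% (I(Y = 1), \dots, I(Y = J-1))^{\top}.
```

In this form each of the J-1 cumulative splits is an ordinary logistic regression sharing the slope \beta, which is what allows diagnostics developed for logistic regression to be applied split by split.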