Clustering analysis, which identifies unknown heterogeneous subgroups of a population (or a sample), has become increasingly popular with the rise of machine learning techniques. Although many software packages perform clustering analysis, few conduct clustering analysis within a structural equation modeling framework. The package,

Latent class analysis (LCA) [

As described in the comparison between the mixture-modeling approach and cluster analysis procedures such as

Although Steinley et al. [

Although there are many statistical packages, including R, for latent class analysis and cluster analysis, the method utilizing both fuzzy clustering analysis and GSCA for LCA is a new approach, and thus there is no comparable R package for fuzzy clusterwise GSCA. Instead, in this section, we introduce three well-known packages for LCA as a mixture-modeling approach: Mplus, poLCA in R, and the SAS procedure LCA, as competitors. There are many other software packages available, but an exhaustive survey of packages is beyond our scope. We focus on the key features of these three packages, which validates the necessity and coverage of our new package, gscaLCA, in R. Note that most statistical programs for LCA include the capabilities of fitting multi-group LCA, imposing measurement invariance across groups, and implementing latent class regression (LCR). In addition, binary and multinomial logistic regression options for predicting latent class membership, as well as the ability to take sampling weights and clusters into account, are also available.

Mplus is the most common software package for fitting structural equation models and provides a variety of modeling tools. For example, within the mixture modeling framework, both latent class analysis and latent profile analysis are available using the option TYPE=MIXTURE in the ANALYSIS command of Mplus. Latent class analysis is for categorical observed variables, whereas latent profile analysis is used for continuous observed variables. Mplus also provides various estimation methods utilizing the maximum likelihood (ML) method [

Among several R packages for fitting LCA, “poLCA” is one of the most common. By using the expectation-maximization and Newton-Raphson algorithms, poLCA finds maximum likelihood estimates of the LCA model parameters [

Proc LCA was developed for SAS for Windows. Lanza et al. [

As mentioned, neither the mixture-modeling approach nor cluster analysis dominates in identifying the heterogeneity of a population, because each approach has the aforementioned advantages and disadvantages, and their estimation procedures differ. Rather, the choice of statistical model is a matter of the researcher's discretion [

Fuzziness is well understood as soft clustering, in which each object belongs to every cluster with a certain degree of membership probability, whereas

Based on the description of fuzzy clustering [

the i-th object in class

With a termination criterion such that

(Step 1) Initialize

(Step 2) Compute

(Step 3) If
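The three steps above follow the standard fuzzy c-means iteration: initialize the membership matrix, alternately update centers and memberships, and stop when the memberships stabilize. A minimal, self-contained Python sketch is given below for illustration only; the function and variable names are ours, and this is not the package's implementation.

```python
import random

def fuzzy_c_means(points, k, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Generic fuzzy c-means sketch: returns an n-by-k membership matrix."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Step 1: initialize memberships randomly so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        u.append([v / s for v in row])
    for _ in range(max_iter):
        # Compute cluster centers as membership-weighted means.
        centers = []
        for j in range(k):
            w = [u[i][j] ** m for i in range(n)]
            tw = sum(w)
            centers.append([sum(w[i] * points[i][t] for i in range(n)) / tw
                            for t in range(d)])
        # Step 2: update memberships from distances to the centers.
        new_u = []
        for i in range(n):
            dists = [max(sum((points[i][t] - c[t]) ** 2 for t in range(d)) ** 0.5,
                         1e-12) for c in centers]
            row = []
            for j in range(k):
                denom = sum((dists[j] / dists[l]) ** (2.0 / (m - 1))
                            for l in range(k))
                row.append(1.0 / denom)
            new_u.append(row)
        # Step 3: terminate when the largest membership change is below tol.
        diff = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(k))
        u = new_u
        if diff < tol:
            break
    return u
```

The fuzzifier m controls how soft the memberships are; m close to 1 approaches a hard k-means partition.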

Generalized structured component analysis (GSCA) [

GSCA consists of three sub-models: a measurement model describing the observed indicators from each latent construct, a structural model defining the associations among latent constructs, and a weighted relation model defining the latent constructs. While typical SEM models include a measurement model and a structural model assuming the normality of the latent constructs [

While minimizing the residual term

In a similar fashion to Hwang et al. [

(Step 1) Identify clusters and estimate membership probabilities,

(Step 2) Estimate the parameters of the GSCA model by applying the optimal scaling,

(Step 3) Update class membership probabilities

(Step 4) Steps 2 and 3 are carried out iteratively until both
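The alternating scheme of Steps 1 through 4 can be illustrated with a toy analogue in Python, where a membership-weighted simple regression stands in for the GSCA estimation of Step 2. All names here are illustrative, and this sketch is not the package's estimator.

```python
def fuzzy_clusterwise_regression(x, y, u, m=2.0, tol=1e-6, max_iter=200):
    """Toy analogue of Steps 1-4: u is an initial n-by-k membership matrix
    (Step 1); each iteration fits a weighted regression per class (Step 2)
    and updates memberships from squared residuals (Step 3), repeating
    until the memberships stabilize (Step 4)."""
    n, k = len(x), len(u[0])
    coef = [(0.0, 0.0)] * k
    for _ in range(max_iter):
        # Step 2: membership-weighted least squares per class.
        coef = []
        for j in range(k):
            w = [u[i][j] ** m for i in range(n)]
            sw = sum(w)
            sx = sum(w[i] * x[i] for i in range(n))
            sy = sum(w[i] * y[i] for i in range(n))
            sxx = sum(w[i] * x[i] * x[i] for i in range(n))
            sxy = sum(w[i] * x[i] * y[i] for i in range(n))
            den = sw * sxx - sx * sx
            b = (sw * sxy - sx * sy) / den if abs(den) > 1e-12 else 0.0
            a = (sy - b * sx) / sw
            coef.append((a, b))
        # Step 3: update memberships from squared residuals to each model.
        new_u = []
        for i in range(n):
            d = [max((y[i] - a - b * x[i]) ** 2, 1e-12) for (a, b) in coef]
            new_u.append([1.0 / sum((d[j] / d[l]) ** (1.0 / (m - 1))
                                    for l in range(k)) for j in range(k)])
        # Step 4: stop when memberships no longer change.
        diff = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(k))
        u = new_u
        if diff < tol:
            break
    return u, coef
```

Observations well described by one class's model receive residuals near zero for that class, so their memberships sharpen toward 1 over iterations.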

In addition to the fuzzy clusterwise GSCA by Hwang et al. [

In addition to

Both FPI and NCE range between 0 and 1 and apply the criterion of “smaller is better”, which helps researchers decide the number of clusters in the process of fitting LCA [
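Under the standard definitions from the fuzzy clustering literature (the package's exact formulas may differ in detail), both indices can be computed directly from the membership matrix. The Python sketch below is illustrative only.

```python
import math

def fpi(u):
    """Fuzziness performance index: 0 for a crisp partition, 1 for a
    fully fuzzy one, based on the partition coefficient."""
    n, k = len(u), len(u[0])
    pc = sum(v * v for row in u for v in row) / n  # partition coefficient
    return 1.0 - (k * pc - 1.0) / (k - 1.0)

def nce(u):
    """Normalized classification entropy: 0 for a crisp partition,
    1 when all memberships are uniform."""
    n, k = len(u), len(u[0])
    pe = -sum(v * math.log(v) for row in u for v in row if v > 0) / n
    return pe / math.log(k)
```

A crisp membership matrix (each row a one-hot vector) yields 0 on both indices, while uniform memberships yield 1, matching the “smaller is better” reading.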

By adding covariates into latent class modeling, we are able to investigate how the covariates predict the latent class membership of individuals [

More specifically, the first step of the three-step approach is executing fuzzy clusterwise GSCA, which is explained in the previous section. For the second step, two types of partitioning are available. The partitioning methods are associated with class assignments: either a hard partitioning for mutually exclusive assignment or a soft partitioning based on membership probabilities [

With the assignment in the second step, the third step fits either a multinomial or a binomial logistic regression. In the case of hard partitioning, the procedure is straightforward: a regression model is fitted with the assigned class of each individual as the dependent variable and the covariates as independent variables. The model can be expressed as

For the binomial regression, the assignment is re-coded into dummy variables. By using each dummy variable as the dependent variable, a binomial regression can be fitted for each focal class separately. The dummy variables can be denoted as
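The hard partitioning and dummy coding described above can be sketched as follows (Python for illustration; the function names are ours, not the package's):

```python
def hard_partition(membership):
    """Hard partitioning: mutually exclusive assignment of each observation
    to the class with the largest membership probability."""
    return [row.index(max(row)) for row in membership]

def dummy_code(assignment, num_class):
    """Re-code a hard assignment into one dummy variable per focal class,
    so a separate binomial regression can be fitted for each class."""
    return [[1 if a == k else 0 for a in assignment] for k in range(num_class)]
```

Under soft partitioning, the membership probabilities themselves are carried into the third step instead of the mutually exclusive assignment.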

The package,

Data are the main input to the function gscaLCA, and they should be formatted as a data frame containing indicator variables and covariates. The function gscaLCA requires the indicator variables to be discrete or categorical; it does not, however, require the categorical variables to be integer rather than character, or vice versa. When an indicator variable is continuous, the function still runs, treating the variable as categorical; thus, caution is needed. There is an option to treat a continuous indicator as continuous. For the covariates, on the other hand, both discrete and continuous variables are allowed. When a covariate is a categorical numeric variable, it is required to define this variable as a factor. Missing data should be coded as NA in

The package

These data provide 5 items from the 2,560 survey responses of U.S. teachers from the Teaching and Learning International Survey (TALIS) 2018 [

The AddHealth data consist of 5,144 participants and their responses to five items about substance use: Smoking, Alcohol, Other Types of Illegal Drug (Drug), Marijuana, and Cocaine. The responses to the five variables are dichotomous, either “Yes” or “No”, and all other missing codes are treated as systematic missingness. The AddHealth data additionally include a randomly generated ID variable and two demographic variables, education level and gender, which can be used as covariates. Education level consists of eight levels, from not graduating high school to beyond a master's degree. Gender has two levels, male and female. These data were obtained from the website (

To estimate LCA based on fuzzy clusterwise GSCA with the gscaLCA algorithm, the default function of gscaLCA can be called with the following arguments:

The command gscaLCA requires seven main options to fit the fuzzy clusterwise GSCA, which are specified as follows:

dat: Dataset to be used to fit a model of gscaLCA.

varnames: A character vector. The names of columns to be used in the function gscaLCA.

ID.var: A character element. The name of the ID variable. If the ID variable is not specified, the function gscaLCA will search for an ID variable in the given data. If the dataset does not include any ID variable, IDs for the observations will be generated automatically as a numeric variable. The default is NULL.

num.class: An integer element. The number of classes to be identified. When num.class is smaller than 2, gscaLCA terminates with an error message. The default is 2.

num.factor: Either EACH or ALLin1. EACH specifies that each indicator is assumed to have its own phantom latent variable. ALLin1 indicates that all variables are assumed to be explained by a single common latent variable. The default is EACH. This specification defines the relationship between indicators and latent variables, which is used in the GSCA algorithm. These two options can be expressed with a diagram, presented in

Boot.num: An integer element. The number of bootstrap replications. The standard errors of parameters are computed from the bootstrap within the gscaLCA algorithm. The default is 20.

multiple.Core: A logical element. TRUE enables the use of multiple cores for the bootstrap. The default is FALSE.

When a model of the fuzzy clusterwise GSCA with covariates is fitted, three additional arguments are required:

covnames: A character vector. The names of the columns of the dataset that indicate the covariates in the fitted model.

cov.model: A numeric vector indicating whether each covariate is used in specifying the three sub-models of GSCA. The indicator is 1 if the covariate is involved in GSCA and 0 otherwise. Involving covariates in the GSCA model means that the covariates are used in defining the relations between indicators and latent variables.

multinomial.ref: A character element. The options MAX, MIN, FIRST, and LAST are available for setting a reference group. The default is MAX, which indicates that the class with the highest prevalence is used as the reference class in fitting a multinomial regression. Conversely, MIN indicates that the class with the lowest prevalence is used as the reference class. FIRST and LAST indicate that the first class and the last class are used as the reference class, respectively.
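The four reference-class rules can be expressed as a small selection function over the estimated class prevalences. The Python sketch below is hypothetical (the function name pick_reference is ours); the package applies these rules internally.

```python
def pick_reference(prevalence, rule="MAX"):
    """Choose the reference class index from class prevalences (proportions),
    mirroring the documented options MAX, MIN, FIRST, and LAST."""
    if rule == "MAX":
        return prevalence.index(max(prevalence))   # most prevalent class
    if rule == "MIN":
        return prevalence.index(min(prevalence))   # least prevalent class
    if rule == "FIRST":
        return 0                                   # first class
    if rule == "LAST":
        return len(prevalence) - 1                 # last class
    raise ValueError("rule must be one of MAX, MIN, FIRST, LAST")
```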

Among these arguments, dat, Boot.num, and multiple.Core are generic arguments that utilize standard R functionality. On the other hand, ID.var, num.class, num.factor, covnames, cov.model, and multinomial.ref are unique functionalities associated with the

The function gscaLCA returns an object containing many elements. We have selected the key elements here and recommend that researchers check all other components of the returned object:

N: The number of observations used after applying listwise deletion when missing values exist.

N.origin: The number of observations before applying listwise deletion. This number is the same as the number of observations in the input dataset.

LEVELs: The observed categories for each indicator.

all.levels.equal: An indicator of whether all indicators used for the analysis have the same answer categories. If it is FALSE, the program does not create a graph automatically.

num.class: The number of classes used for the analysis.

Boot.num: The number of bootstrap replications specified by the user.

Boot.num.im: The number of bootstrap replications actually implemented. Not all bootstrap iterations are needed to estimate the standard errors.

model.fit: The model fit indices. FIT, AFIT, FPI, and NCE are provided with the standard error and 95% credible interval lower and upper bounds.

LCprevalence: The latent class prevalence. The percentage of each class, the number of observations in each class, the standard error, and the 95% credible interval lower and upper bounds are provided.

RespProb: The item response probabilities for all variables used are reported as elements of a list. Each element consists of a table containing the probabilities with respect to the possible categories of each variable. The standard error and 95% credible interval with lower and upper bounds are also reported.

it.in: The number of iterations of the inner loop, which is used for the estimation of the GSCA model.

it.out: The number of iterations of the outer loop, which is used to update the membership probabilities of subjects.

membership: A data frame of the posterior probability for each subject with the predicted class membership.

plot: Graphs of item response probabilities within each category. For example, with two categories, the graphs are stored as p1 and p2 in the list plot. When the number of categories differs across indicators, the graphs are not provided.

A.mat: The estimated factor loading matrix of the GSCA model.

B.mat: The estimated path coefficient matrix of the GSCA model.

W.mat: The estimated weighted relation matrix of the GSCA model.

used.dat: The dataset used for the analysis. When the input data include missing values, the data used are those remaining after listwise deletion.
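How the elements N, N.origin, membership, and LCprevalence relate can be illustrated with a short sketch (Python for illustration; None stands in for R's NA, and the function names are ours, not the package's internals):

```python
def listwise_delete(rows):
    """Drop any row containing a missing value; len(rows) corresponds to
    N.origin and len of the result to N."""
    return [r for r in rows if all(v is not None for v in r)]

def hard_membership(posterior):
    """Predicted class per subject: the class with the highest posterior
    probability, as reported alongside the membership data frame."""
    return [row.index(max(row)) for row in posterior]

def class_prevalence(posterior):
    """Percentage of subjects falling in each class under hard assignment,
    analogous to the percentages in LCprevalence."""
    assigned = hard_membership(posterior)
    k = len(posterior[0])
    return [100.0 * assigned.count(j) / len(posterior) for j in range(k)]
```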

When the covariates are involved in the analysis, the function gscaLCA returns eight additional elements:

cov_results.multi.hard: This is the main result of the multinomial regression with the hard partitioning.

cov_results_raw.multi.hard: This is the result of the multinomial regression with the hard partitioning, which is directly from the function nnet::multinom.

cov_results.bin.hard: This is the main result of the binomial regression with the hard partitioning.

cov_results_raw.bin.hard: This is the result of the binomial regression with the hard partitioning, which is directly from the function stats::glm.

cov_results.multi.soft: This is the main result of the multinomial regression with the soft partitioning.

cov_results_raw.multi.soft: This is the result of the multinomial regression with the soft partitioning, which is directly from the function nnet::multinom.

cov_results.bin.soft: This is the main result of the binomial regression with the soft partitioning.

cov_results_raw.bin.soft: This is the result of the binomial regression with the soft partitioning, which is directly from the function stats::glm.

In addition to the main function gscaLCA, the package gscaLCA provides a function, gscaLCR, which implements the second and third steps of the fuzzy clusterwise GSCA with covariates algorithm. As mentioned, fuzzy clusterwise GSCA with covariates includes three steps. Even if users first fit the model without covariates using the function gscaLCA, steps 2 and 3 can afterwards be executed via the function gscaLCR.

R> gscaLCR(results.obj, covnames, multinomial.ref = "MAX")

The function gscaLCR requires three arguments:

results.obj: The result object of gscaLCA.

covnames: A character vector of covariate names. The covariate variables have to be in the data that were used to fit the gscaLCA model.

multinomial.ref: A character element. Options of MAX, MIN, FIRST, and LAST are available for setting a reference group. The default is MAX.

The output of gscaLCR is the same as that of gscaLCA.

To demonstrate the usage of the package

For the TALIS data, we used the three-class model (num.class = 3) with a single factor (num.factor = "ALLin1"). This is the optimal model for these data, found through model comparison. Regarding the number of classes, three are expected, as the variable names and descriptions, covering motivation, pedagogy, and satisfaction, suggest (see

The results report that 2,365 observations were used for the analysis, excluding 195 incomplete responses. FIT and AFIT were 0.5046 and 0.5033, respectively. With a single factor, FIT and AFIT are typically lower than with a larger number of factors. The indices evaluating the classification were relatively large (FPI = 0.8623 and NCE = 0.8742), but they are better than when the option num.factor is EACH for these data. The estimated latent class prevalences are 34.59%, 26.22%, and 39.20%. The conditional item response probabilities for each category per variable are also presented in a table. When the standard errors and 95% credible intervals of the model fit, the prevalences, and the conditional response probabilities are required, we can print the corresponding objects with the following commands. These standard errors were estimated by the bootstrap.
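The bootstrap standard errors and interval bounds reported above follow the usual nonparametric bootstrap logic, sketched generically below (Python for illustration; the package's resampling details are internal to gscaLCA, and the function name here is ours):

```python
import random
import statistics

def bootstrap_se_ci(data, statistic, B=200, alpha=0.05, seed=1):
    """Nonparametric bootstrap sketch: resample with replacement, recompute
    the statistic, and report its standard error and percentile interval."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(B):
        sample = [data[rng.randrange(n)] for _ in range(n)]  # resample rows
        reps.append(statistic(sample))
    reps.sort()
    se = statistics.stdev(reps)                    # bootstrap standard error
    lo = reps[int((alpha / 2) * B)]                # lower percentile bound
    hi = reps[min(int((1 - alpha / 2) * B), B - 1)]  # upper percentile bound
    return se, (lo, hi)
```

In the package, the statistic recomputed per replication is the full set of model parameters (fit indices, prevalences, and response probabilities), rather than a single scalar as in this sketch.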

These response probabilities are used to define the latent classes. To grasp the patterns of the probabilities, a visual representation of profiles based on the probabilities is more helpful than the numeric quantities in the output above. Once the gscaLCA function is executed, the graph is created automatically. When the command summary(T3) is executed, the resulting plot is displayed for all answer categories. When plots for individual answer categories are required, they can also be printed using the command T3$plot.

In this example, we demonstrate how to fit gscaLCA with covariates by using the AddHealth data. We used the three-class model (num.class = 3) with num.factor = “EACH”, which was found in the previous research, Park et al. [

The results report that 5,065 observations were used for the analysis after applying listwise deletion. They also show that the model fit indices with the AddHealth data are acceptable: FIT and AFIT were 0.9993 and 0.9993, respectively, both close to 1. The indices evaluating the classification were relatively low (FPI = 0.4658 and NCE = 0.5094). The estimated latent class prevalences are 54.30%, 20.23%, and 25.46%. These estimated prevalences change depending on whether covariates are included in the GSCA model. The conditional item response probabilities are also presented for each category per variable. As in the TALIS example, the results provide the standard errors and 95% credible intervals via the following commands:

When the response categories are binary, the graphs provided by this package show the probability patterns of each category. In the AddHealth data, because the responses are binary (Yes or No), the graph shown thus corresponds to the “Yes” responses,

From the plot shown in

To examine the effects of covariates on the membership probabilities, the function summary can be used as demonstrated before. Other options in partitioning and fitting regression (multinomial.soft, binomial.hard, and binomial.soft) are available as well in the function summary. In addition, the following command also provides the results of regressions:

As with the example of A3_1, other options are available for hard and soft partitioning and for multinomial and binomial regression.

Latent class analysis and clustering analysis, including fuzzy clustering, are statistical tools for identifying the (dis-)similarity of data distributions, as a model-based mixture approach and a data-driven classification approach, respectively. We developed the package,

The

The R package,