Longitudinal data arise frequently in economics, the social sciences, epidemiology, and biological research. The observations on each subject are measured repeatedly over time and thus are intrinsically correlated[1]. It is crucial to address the within-subject covariation structure correctly, since ignoring it may lead to inefficient estimators of the mean parameters. Moreover, the covariation structure itself may be of scientific interest[2]. However, covariance and precision matrices are subject to positive definiteness constraints and involve a large number of parameters, and the correlation matrix must, in addition to these two issues, satisfy the constraint that its diagonal elements are 1’s. A frequently used strategy is therefore to estimate the correlation matrix indirectly, i.e., to first estimate the covariance or precision matrix and then calculate the corresponding correlation matrix; this strategy motivates the decomposition method proposed in this paper.
It is then vital to decide whether to decompose and model the covariance matrix or the precision matrix. In fact, the covariance or precision matrix may possess a typical structure, and matrix inversion does not preserve such structure (e.g., sparsity or a regression relationship); hence, decomposing and modeling the inverse of the structured matrix may result in a loss of efficiency. For the precision matrix, Pourahmadi[3] introduced the modified Cholesky decomposition (MCD), which leads to an unconstrained parameterization of the precision matrix that automatically guarantees its positive definiteness. The entries in MCD can be interpreted as generalized autoregressive parameters and innovation variances, which can be estimated by fitting a class of unconstrained joint mean-covariance models via maximum likelihood estimation. For the covariance matrix, Zhang and Leng[4] analyzed within-subject covariation by decomposing the covariance matrix itself rather than its inverse, using the moving average Cholesky decomposition (MACD). The entries in this decomposition are moving average parameters and innovation variances. Under MACD, the corresponding correlation expression depends on the innovation variances and thus is not necessarily robust with respect to their model misspecification. In contrast, Chen and Dunson[5] proposed an alternative Cholesky decomposition (ACD), which directly models the covariance matrix but in such a way that the estimate of the correlation matrix does not depend on the quality of modeling and estimating the innovation variances; that is, estimation of the correlation matrix is robust with respect to model misspecification of the innovation variances, the components shared by both MACD and ACD. ACD is closely related to the moving average model of “standardized” measurements on a longitudinal subject[6,7]. By directly considering the correlation matrix, Zhang et al.[8] developed the hyperspherical parameterization of the Cholesky factor (HPC), which is very appealing since the resulting parameters are unconstrained on their support and directly interpretable in terms of the correlations. HPC can lead to a more parsimonious model, however, at the cost of intensive computation. To the best of our knowledge, when the precision matrix has a typical structure, robust estimation of the correlation matrix against model misspecification of the innovation variances has been much less investigated.
Moreover, longitudinal data may suffer from outliers, and several heavy-tailed joint modeling approaches have been investigated. Lin and Wang[9] proposed a joint mean and scale covariance model based on the multivariate t distribution and MCD. Maadooliat et al.[7] further investigated the joint model based on the t distribution and ACD. However, this kind of joint t-regression approach is computationally intensive since the joint regression parameters must be estimated together with the degrees of freedom. As an alternative, Guney et al.[10] adopted a joint model with the multivariate Laplace distribution and MCD. The Laplace distribution is heavy-tailed and robust to outliers. It has no degrees-of-freedom parameter, so its computation is more convenient than that of the t distribution. However, this joint modeling approach does not lead to the aforementioned robust estimation of the correlation matrix.
In this paper, motivated by ACD, we propose an alternative modified Cholesky decomposition (AMCD) of the precision matrix for longitudinal data. Specifically, the inverse of the diagonal matrix of innovation standard deviations is placed outside the two triangular matrices, which alone determine the correlations, thus yielding the aforementioned robust estimation of the correlation matrix. We then establish a joint mean-covariance model based on the multivariate normal distribution and AMCD, develop the quasi-Fisher scoring algorithm, and show that the maximum likelihood estimators are consistent and asymptotically normally distributed. Furthermore, we present a doubly robust joint modeling approach with the multivariate Laplace distribution and AMCD, together with a quasi-Newton algorithm for maximum likelihood estimation. Simulation studies, a cattle data analysis, and a sleep dose-response data analysis indicate that the proposed AMCD method performs well.
The rest of this paper is organized as follows. In Section 2, we review MACD, ACD and MCD, propose AMCD, and discuss their close relationships. In Section 3, we establish the joint normal mean-covariance model with AMCD and give the iterative algorithm and asymptotic properties. In Section 4, we investigate the joint model with the multivariate Laplace distribution and AMCD, and develop the quasi-Newton algorithm for maximum likelihood estimation. In Section 5, we carry out the simulation studies. In Section 6, real data analyses are conducted. In Section 7, the paper is summarized. All the calculations and proofs are in the Appendix.
2. Existing and proposed Cholesky-type decompositions of the covariation structure
Assume that {y}_{ij} is the measurement at the j th time point {t}_{ij} for the i th subject ( i=\text{1},\dots ,n, j=\text{1},\dots ,{m}_{i} ). Let {\boldsymbol{y}}_{i}={\left({y}_{i1},\dots ,{y}_{i{m}_{i}}\right)}^{\text{T}} , and {\boldsymbol{t}}_{i}={\left({t}_{i1},\dots ,{t}_{i{m}_{i}}\right)}^{\text{T}} . Moreover, suppose that {\boldsymbol{y}}_{i} follows a normal distribution with E\left({\boldsymbol{y}}_{i}\right)={\boldsymbol{\mu }}_{i}= {\left({\mu }_{i1},\dots ,{\mu }_{i{m}_{i}}\right)}^{\text{T}} and {\rm var}\left({\boldsymbol{y}}_{i}\right)={\boldsymbol{\varSigma }}_{i} . Then let {\boldsymbol{y}}_{i}={\boldsymbol{\mu }}_{i}+{\boldsymbol{r}}_{i}, where {\boldsymbol{r}}_{i}={\left({r}_{i1},\dots ,{r}_{i{m}_{i}}\right)}^{\text{T}} is the normal random error, with E\left({\boldsymbol{r}}_{i}\right)={\boldsymbol{0}} and {\rm var}\left({\boldsymbol{r}}_{i}\right)={\boldsymbol{\varSigma }}_{i} .
In some cases, the covariance matrix {\boldsymbol{\varSigma }}_{i} (or its inverse, i.e., the precision matrix) is of scientific interest and possesses a typical structure. Then, a specific Cholesky-type decomposition can be established to adapt to the corresponding typical structure. Furthermore, the correlation matrix might be of particular interest; therefore, it is worthwhile to develop a robust estimation procedure based on the specific decomposition process, of course, with some suitable modifications. Along this line, several existing Cholesky-type decompositions are discussed, the proposed AMCD is introduced, and their close connections are investigated.
2.1 MACD of the covariance matrix
If the covariance matrix itself possesses a typical structure, it is reasonable to decompose the symmetric and positive definite covariance matrix as {\boldsymbol{\varSigma }}_{i}={\boldsymbol{C}}_{i}{\boldsymbol{C}}_{i}^{\text{T}} based on the standard Cholesky decomposition. Here {\boldsymbol{C}}_{i} is a unique lower triangular matrix with positive diagonal elements. Define \text{diag}\left({\boldsymbol{C}}_{i}\right)={\left({c}_{i11},\dots ,{c}_{i{m}_{i}{m}_{i}}\right)}^{\text{T}} , and {\boldsymbol{\varLambda }}_{i}=\text{diag}\left({c}_{i11},\dots ,{c}_{i{m}_{i}{m}_{i}}\right) . For simplification, let {d}_{ij}={c}_{ijj},j=1,\dots ,{m}_{i} ; then, {\boldsymbol{\varLambda }}_{i}=\text{diag}\left({d}_{i1},\dots ,{d}_{i{m}_{i}}\right) . In Ref. [4], the standard Cholesky decomposition was transformed into MACD; that is, the matrix {\boldsymbol{\varLambda }}_{i} is located inside two triangular matrices:
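(Reconstruction of the omitted display from the definitions above; the original typesetting may differ.)
{\boldsymbol{\varSigma }}_{i}={\boldsymbol{C}}_{i}{\boldsymbol{C}}_{i}^{\text{T}}={\boldsymbol{L}}_{i}{\boldsymbol{\varLambda }}_{i}^{2}{\boldsymbol{L}}_{i}^{\text{T}},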
where {\boldsymbol{L}}_{i}={\boldsymbol{C}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1} results in a standardized {\boldsymbol{C}}_{i} , namely dividing each column of {\boldsymbol{C}}_{i} by its diagonal elements. Let {\boldsymbol{L}}_{i}={\left({l}_{ijk}\right)}_{{m}_{i}\times {m}_{i}} and {\boldsymbol{D}}_{i}={\boldsymbol{\varLambda }}_{i}^{2}=\text{diag}\left({d}_{i1}^{2},\dots ,{d}_{i{m}_{i}}^{2}\right) . Denote {\boldsymbol{\varepsilon }}_{i}={\left({\varepsilon }_{i1},\dots ,{\varepsilon }_{i{m}_{i}}\right)}^{\text{T}}={\boldsymbol{L}}_{i}^{-1}{\boldsymbol{r}}_{i} ; then, \text{var}\left({\boldsymbol{\varepsilon }}_{i}\right)={\boldsymbol{\varLambda }}_{i}^{2} and {\boldsymbol{r}}_{i}={\boldsymbol{L}}_{i}{\boldsymbol{\varepsilon }}_{i} . Note that {\boldsymbol{L}}_{i} is a lower triangular matrix with diagonals being 1’s; then, a moving average representation for the residual {r}_{ij} can be expressed as follows:
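(Reconstruction of the omitted display, from {\boldsymbol{r}}_{i}={\boldsymbol{L}}_{i}{\boldsymbol{\varepsilon }}_{i} and the unit diagonal of {\boldsymbol{L}}_{i} ; the original typesetting may differ.)
{r}_{ij}=\displaystyle \sum _{t=1}^{j-1}{l}_{ijt}{\varepsilon }_{it}+{\varepsilon }_{ij},\quad j=1,\dots ,{m}_{i},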
where \displaystyle \sum _{t=1}^{0}{l}_{ijt}{\varepsilon }_{it} is denoted as 0. Then, from (2), for any 1\leqslant j,k\leqslant {m}_{i} , with j\wedge k=\mathrm{min}\{j,k\} , it follows that
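(Reconstruction of the omitted expression, derived directly from the representation above; the original display may be typeset differently.)
\text{corr}\left({r}_{ij},{r}_{ik}\right)=\dfrac{\displaystyle \sum _{t=1}^{j\wedge k}{l}_{ijt}{l}_{ikt}{d}_{it}^{2}}{\sqrt{\displaystyle \sum _{t=1}^{j}{l}_{ijt}^{2}{d}_{it}^{2}}\sqrt{\displaystyle \sum _{t=1}^{k}{l}_{ikt}^{2}{d}_{it}^{2}}},\qquad \text{with}\ {l}_{ijj}=1,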
which is determined by both the {\boldsymbol{L}}_{i} and {\boldsymbol{\varLambda }}_{i} matrices, and thus might not be robust against misspecification of the model for the innovation variances {d}_{ij}^{2},j=1,\dots ,{m}_{i} .
2.2 ACD of the covariance matrix
In contrast, the standard Cholesky decomposition can be transformed into ACD[7]; that is, the matrix {\boldsymbol{\varLambda }}_{i} is located outside the two triangular matrices:
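(Reconstruction of the omitted display; the original typesetting may differ.)
{\boldsymbol{\varSigma }}_{i}={\boldsymbol{\varLambda }}_{i}{\boldsymbol{A}}_{i}{\boldsymbol{A}}_{i}^{\text{T}}{\boldsymbol{\varLambda }}_{i},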
where {\boldsymbol{A}}_{i}={\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{C}}_{i} leads to a standardized {\boldsymbol{C}}_{i} , namely dividing each row of {\boldsymbol{C}}_{i} by its diagonal elements. Denote {\boldsymbol{A}}_{i}={\left({a}_{ijk}\right)}_{{m}_{i}\times {m}_{i}} . It is obvious that {\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i} has covariance matrix {\boldsymbol{A}}_{i}{\boldsymbol{A}}_{i}^{\text{T}} , within which {\boldsymbol{\varLambda }}_{i} disappears. Specifically, let {\boldsymbol{\varepsilon}}_{i}={\left({\varepsilon}_{i1},\dots ,{\varepsilon}_{i{m}_{i}}\right)}^{\text{T}}={\left({\boldsymbol{\varLambda }}_{i}{\boldsymbol{A}}_{i}\right)}^{-1}{\boldsymbol{r}}_{i} ; then, \text{var}\left({\boldsymbol{\varepsilon}}_{i}\right)={\boldsymbol{I}}_{{m}_{i}} and {\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i}={\boldsymbol{A}}_{i}{\boldsymbol{\varepsilon}}_{i} . In fact, {\boldsymbol{A}}_{i} is also a lower triangular matrix with diagonals being 1's; then, the “standardized” residual {r}_{ij}/{d}_{ij} can be represented as
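(Reconstruction of the omitted display, from {\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i}={\boldsymbol{A}}_{i}{\boldsymbol{\varepsilon}}_{i} and the unit diagonal of {\boldsymbol{A}}_{i} ; the original typesetting may differ.)
{r}_{ij}/{d}_{ij}=\displaystyle \sum _{t=1}^{j-1}{a}_{ijt}{\varepsilon}_{it}+{\varepsilon}_{ij},\quad j=1,\dots ,{m}_{i},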
which is determined by the {\boldsymbol{A}}_{i} matrix alone. Modeling a correlation matrix via ACD is highly appealing since it is robust with respect to misspecification of the model for {d}_{ij}^{2} , j=1,\dots ,{m}_{i} .
2.3 MCD of the precision matrix
However, the precision matrix might possess a typical structure and should be decomposed appropriately. Based on the standard Cholesky decomposition, {\boldsymbol{\varSigma }}_{i}^{-1}={\boldsymbol{C}}_{i}^{-\text{T}}{\boldsymbol{C}}_{i}^{-1}={\boldsymbol{B}}_{i}^{\text{T}}{\boldsymbol{B}}_{i} , where {\boldsymbol{B}}_{i}={\boldsymbol{C}}_{i}^{-1} is a unique lower triangular matrix with positive diagonals. Similar to (1),
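(Reconstruction of the omitted display, presumably the expression labeled (5); {\boldsymbol{F}}_{i} is defined immediately below, and the original typesetting may differ.)
{\boldsymbol{\varSigma }}_{i}^{-1}={\boldsymbol{F}}_{i}^{\text{T}}{\boldsymbol{\varLambda }}_{i}^{-2}{\boldsymbol{F}}_{i},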
where {\boldsymbol{F}}_{i}^{\text{T}}={\boldsymbol{B}}_{i}^{\text{T}}{\boldsymbol{\varLambda }}_{i} leads to a standardized {\boldsymbol{B}}_{i}^{\text{T}} , namely dividing each column of {\boldsymbol{B}}_{i}^{\text{T}} by its diagonal elements {d}_{ij}^{-1} . In fact, (5) can be transformed into {\boldsymbol{F}}_{i}{\boldsymbol{\varSigma }}_{i}{\boldsymbol{F}}_{i}^{\text{T}}={\boldsymbol{\varLambda }}_{i}^{2} , i.e., MCD[6,9]. Notice that {\boldsymbol{F}}_{i} can be expressed as a lower triangular matrix, with the diagonal entries being 1’s and the \left(j,k\right) th nonzero entries being -{f}_{ijk},k < j . Then let {\boldsymbol{\varepsilon }}_{i}={\boldsymbol{F}}_{i}{\boldsymbol{r}}_{i} , and \text{var}\left({\boldsymbol{\varepsilon }}_{i}\right)={\boldsymbol{\varLambda }}_{i}^{2} , from which the autoregressive representation for the residuals {r}_{ij} can be expressed as follows:
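(Reconstruction of the omitted display, from {\boldsymbol{\varepsilon }}_{i}={\boldsymbol{F}}_{i}{\boldsymbol{r}}_{i} and the structure of {\boldsymbol{F}}_{i} ; the original typesetting may differ.)
{r}_{ij}=\displaystyle \sum _{t=1}^{j-1}{f}_{ijt}{r}_{it}+{\varepsilon }_{ij},\quad j=1,\dots ,{m}_{i}.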
Then, by simple calculation, the correlation between {r}_{ij} and {r}_{ik} is determined by the {\boldsymbol{F}}_{i} and {\boldsymbol{\varLambda }}_{i} matrices, and thus is not robust against misspecification of the model for innovation variances.
2.4 AMCD of the precision matrix
In contrast, the proposed AMCD places the matrix {\boldsymbol{\varLambda }}_{i}^{-1} outside the two triangular matrices:
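(Reconstruction of the omitted display, presumably the expression labeled (6); {\boldsymbol{T}}_{i} is defined immediately below, and the original typesetting may differ.)
{\boldsymbol{\varSigma }}_{i}^{-1}={\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{T}}_{i}^{\text{T}}{\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1},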
where {\boldsymbol{T}}_{i}^{\text{T}}={\boldsymbol{\varLambda }}_{i}{\boldsymbol{B}}_{i}^{\text{T}} results in a standardized {\boldsymbol{B}}_{i}^{\text{T}} , namely dividing each row of {\boldsymbol{B}}_{i}^{\text{T}} by its diagonal elements {d}_{ij}^{-1} . In fact, (6) is equivalent to {\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{\varSigma }}_{i}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{T}}_{i}^{\text{T}}={\boldsymbol{I}}_{{m}_{i}} . Apparently, the covariance matrix of {\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i} is {\boldsymbol{T}}_{i}^{-1}{\boldsymbol{T}}_{i}^{-\text{T}} , where {\boldsymbol{\varLambda }}_{i} disappears. Let {\boldsymbol{\varepsilon}}_{i}={\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i} ; then, \text{var}\left({\boldsymbol{\varepsilon}}_{i}\right)={\boldsymbol{I}}_{{m}_{i}} . Note that
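(Reconstruction of the omitted representation, writing the \left(j,t\right) th entry of {\boldsymbol{T}}_{i} as -{\phi }_{ijt} for t < j , by analogy with {\boldsymbol{F}}_{i} ; the original typesetting may differ.)
{r}_{ij}/{d}_{ij}=\displaystyle \sum _{t=1}^{j-1}{\phi }_{ijt}\left({r}_{it}/{d}_{it}\right)+{\varepsilon}_{ij},\quad j=1,\dots ,{m}_{i},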
where the generalized autoregressive parameters {\phi }_{ijt} are unconstrained, the “standardized” factors {d}_{ij} are restricted to be positive, the innovation variances satisfy \text{var}\left({\epsilon}_{ij}\right)=1 , and the innovation covariances satisfy \text{cov}\left({\epsilon}_{ij},{\epsilon}_{ik}\right)=0,j\ne k , implying that the innovations are uncorrelated in general and, more specifically, independent under normality. Note that the {d}_{ij}^{2} ’s are the innovation variances in MCD. For ease of comparison, they are also referred to as innovation variances, rather than squared “standardized” factors, in AMCD.
Obviously, \text{corr}\left({r}_{ij},{r}_{ik}\right) is determined by the {\boldsymbol{T}}_{i} matrix alone, and thus is robust against misspecification of the model for {d}_{ij}^{2} , j=1,\dots ,{m}_{i} . This advantage makes the proposed AMCD attractive for modeling the correlation matrix. Moreover, it is more suitable to adopt AMCD than ACD when the precision matrix, rather than the covariance matrix, possesses a typical structure.
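For completeness, a worked expression (our derivation, not a display from the original): since \text{var}\left({\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i}\right)={\boldsymbol{T}}_{i}^{-1}{\boldsymbol{T}}_{i}^{-\text{T}} , it follows that
\text{corr}\left({r}_{ij},{r}_{ik}\right)=\dfrac{{\left({\boldsymbol{T}}_{i}^{-1}{\boldsymbol{T}}_{i}^{-\text{T}}\right)}_{jk}}{\sqrt{{\left({\boldsymbol{T}}_{i}^{-1}{\boldsymbol{T}}_{i}^{-\text{T}}\right)}_{jj}{\left({\boldsymbol{T}}_{i}^{-1}{\boldsymbol{T}}_{i}^{-\text{T}}\right)}_{kk}}},
which is free of {\boldsymbol{\varLambda }}_{i} .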
3. Normal joint modeling approach
3.1 Joint mean-covariance model
Based on the proposed AMCD in (6), the estimation of {\boldsymbol{\varSigma }}_{i}^{-1} is equivalent to that of {\boldsymbol{T}}_{i} and {\boldsymbol{\varLambda }}_{i} . Define {\text{log}}\left({\boldsymbol{\varLambda }}_{i}^{2}\right)= {\text{diag}}\left\{{\text{log}}\left({d}_{i1}^{2}\right),\dots ,{\text{log}}\left({d}_{i{m}_{i}}^{2}\right)\right\} . Following the general approach in Ref. [11], the unconstrained nonredundant entries of \text{log}\left({\boldsymbol{\varLambda }}_{i}^{2}\right) and {\boldsymbol{T}}_{i} can be modeled together with {\boldsymbol{\mu }}_{i} by generalized linear regression models
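(Reconstruction of the omitted regression models, consistent with the covariates and parameters described below; the original typesetting may differ.)
g\left({\mu }_{ij}\right)={\boldsymbol{x}}_{ij}^{\text{T}}{\boldsymbol{\beta}} ,\qquad \text{log}\left({d}_{ij}^{2}\right)={\boldsymbol{z}}_{ij}^{\text{T}}{\boldsymbol{\lambda}} ,\qquad {\phi }_{ijk}={\boldsymbol{w}}_{ijk}^{\text{T}}{\boldsymbol{\gamma}} ,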
where g\left(\cdot \right) is assumed to be a monotone and differentiable known link function; {\boldsymbol{x}}_{ij} , {\boldsymbol{z}}_{ij} and {\boldsymbol{w}}_{ijk} are the vectors of covariates; and {\boldsymbol{\beta}} ,{\boldsymbol{\lambda}} and {\boldsymbol{\gamma}} are the p\times 1 , s\times 1 and q\times 1 vectors, respectively, of the corresponding associated parameters. The covariates {\boldsymbol{x}}_{ij} , {\boldsymbol{z}}_{ij} and {\boldsymbol{w}}_{ijk} may consist of the baseline covariates, polynomials in time (for {\boldsymbol{x}}_{ij} and {\boldsymbol{z}}_{ij} ) or time lag (for {\boldsymbol{w}}_{ijk} ), and even their interactions. For instance, when the entries of {\boldsymbol{\mu }}_{i} , \text{log}\left({\boldsymbol{\varLambda }}_{i}^{2}\right) and {\boldsymbol{T}}_{i} are modeled by polynomials in time and time lag, the covariates may be constructed as
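(A plausible reconstruction of the omitted covariate construction, consistent with the dimensions p , s and q and with the polynomial models used later; the original typesetting may differ.)
{\boldsymbol{x}}_{ij}={\left(1,{t}_{ij},\dots ,{t}_{ij}^{p-1}\right)}^{\text{T}},\qquad {\boldsymbol{z}}_{ij}={\left(1,{t}_{ij},\dots ,{t}_{ij}^{s-1}\right)}^{\text{T}},\qquad {\boldsymbol{w}}_{ijk}={\left(1,{t}_{ij}-{t}_{ik},\dots ,{\left({t}_{ij}-{t}_{ik}\right)}^{q-1}\right)}^{\text{T}},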
assuming that the correlation between {r}_{ij} and {r}_{ik} relies only on the time lag between {t}_{ij} and {t}_{ik} \left(1\leqslant k < j\leqslant {m}_{i}\right) .
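To make the construction concrete, the following minimal numerical sketch (in Python, with the hypothetical helper name amcd_precision; it is an illustration of the decomposition, not the authors' implementation) assembles {\boldsymbol{\varSigma }}_{i}^{-1}={\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{T}}_{i}^{\text{T}}{\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1} for the special case of polynomial covariates in time and time lag.

```python
import numpy as np

def amcd_precision(t, lam, gam):
    """Assemble Sigma_i^{-1} = Lambda_i^{-1} T_i^T T_i Lambda_i^{-1} (AMCD).

    t   : array of the m_i measurement times for subject i
    lam : coefficients of the polynomial-in-time model for log(d_ij^2)
    gam : coefficients of the polynomial-in-time-lag model for phi_ijk
    """
    t = np.asarray(t, dtype=float)
    m = len(t)
    # innovation variances: log(d_ij^2) = z_ij^T lam, z_ij = (1, t_ij, t_ij^2, ...)
    Z = np.vander(t, N=len(lam), increasing=True)
    d = np.sqrt(np.exp(Z @ lam))
    Lam_inv = np.diag(1.0 / d)
    # T_i: unit lower triangular with (j, k) entry -phi_ijk, phi_ijk = w_ijk^T gam
    T = np.eye(m)
    for j in range(1, m):
        for k in range(j):
            w = (t[j] - t[k]) ** np.arange(len(gam))   # (1, lag, lag^2, ...)
            T[j, k] = -(w @ gam)
    return Lam_inv @ T.T @ T @ Lam_inv
```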
3.2 Maximum likelihood estimation
According to the proposed AMCD, inverting (6) yields
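(Reconstruction of the omitted display, presumably the expression labeled (8), obtained by inverting the AMCD factorization; the original typesetting may differ.)
{\boldsymbol{\varSigma }}_{i}={\boldsymbol{\varLambda }}_{i}{\boldsymbol{T}}_{i}^{-1}{\boldsymbol{T}}_{i}^{-\text{T}}{\boldsymbol{\varLambda }}_{i}.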
Let {\boldsymbol{\theta}} ={\left({\boldsymbol{\beta }}^{\text{T}},{\boldsymbol{\lambda }}^{\text{T}},{\boldsymbol{\gamma }}^{\text{T}}\right)}^{\text{T}} ; then, from (6) and (8), twice the negative log-likelihood function of the multivariate normal distribution, except for a constant, can be written as
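(Reconstruction of the omitted objective, using \text{log}\left|{\boldsymbol{\varSigma }}_{i}\right|=\displaystyle \sum _{j=1}^{{m}_{i}}\text{log}\left({d}_{ij}^{2}\right) since \left|{\boldsymbol{T}}_{i}\right|=1 ; the original typesetting may differ.)
l\left({\boldsymbol{\theta}} \right)=\displaystyle \sum _{i=1}^{n}\left\{\displaystyle \sum _{j=1}^{{m}_{i}}\text{log}\left({d}_{ij}^{2}\right)+{\boldsymbol{r}}_{i}^{\text{T}}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{T}}_{i}^{\text{T}}{\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i}\right\},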
where {\boldsymbol{r}}_{i}={\boldsymbol{y}}_{i}-{\boldsymbol{\mu }}_{i} . In this case, the joint mean-covariance model can be referred to as the normal joint modeling model (NJMM). By taking partial derivatives of l\left({\boldsymbol{\theta}} \right) with respect to {\boldsymbol{\beta}} , {\boldsymbol{\lambda}} and {\boldsymbol{\gamma}} , respectively, the maximum likelihood estimating equations for these parameters become
Here \dfrac{\partial {\boldsymbol{\mu }}_{i}^{\text{T}}}{\partial {\boldsymbol{\beta}} } is the p\times {m}_{i} matrix with j th column \dfrac{\partial {\mu }_{ij}}{\partial {\boldsymbol{\beta}} }={\dot{g}}^{-1}\left({\boldsymbol{x}}_{ij}^{\text{T}}{\boldsymbol{\beta}} \right){\boldsymbol{x}}_{ij} , {\dot{g}}^{-1}\left(\cdot \right) is the derivative of the inverse function {g}^{-1}\left(\cdot \right) , and we denote \mu \left(\cdot \right)={g}^{-1}\left(\cdot \right) ; {\boldsymbol{Z}}_{i}={\left({\boldsymbol{z}}_{i1}^{\text{T}},\dots ,{\boldsymbol{z}}_{i{m}_{i}}^{\text{T}}\right)}^{\text{T}} , {\boldsymbol{h}}_{i}=\text{diag}\left({\boldsymbol{T}}_{i}^{\text{T}}{\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i}{\boldsymbol{r}}_{i}^{\text{T}}{\boldsymbol{\varLambda }}_{i}^{-1}\right) , and {\boldsymbol{1}}_{{m}_{i}} is a {m}_{i}\times 1 vector of 1’s; \dfrac{\partial {\boldsymbol{T}}_{i}^{-1}}{\partial {\gamma }_{j}}=-{\boldsymbol{T}}_{i}^{-1}\dfrac{\partial {\boldsymbol{T}}_{i}}{\partial {\gamma }_{j}}{\boldsymbol{T}}_{i}^{-1} with
for j=1,\dots ,q , and {\boldsymbol{w}}_{ijk}={\left({w}_{ijk,1},\dots ,{w}_{ijk,q}\right)}^{\text{T}} .
Since the solutions satisfy equations (9), the parameters {\boldsymbol{\beta}} , {\boldsymbol{\lambda}} and {\boldsymbol{\gamma}} can be solved for iteratively, each with the others kept fixed. More specifically, the numerical solutions for these parameters can be calculated by the quasi-Fisher scoring algorithm.
First, the expectations of the Hessian matrices are listed below and discussed explicitly in Appendix A.1. That is,
with {\boldsymbol{A}}\circ {\boldsymbol{B}} denoting the Hadamard product of matrix {\boldsymbol{A}} and {\boldsymbol{B}} , {\boldsymbol{v}}_{ijk}=-\displaystyle \sum _{t=k+1}^{j-1}\dfrac{\partial {a}_{itk}}{\partial {\boldsymbol{\gamma}} }{\phi }_{ijt} , and {a}_{itk} being the \left(t,k\right) th element of {\boldsymbol{T}}_{i}^{-1} .
Then the iterative algorithm is as follows.
Step 1. Initialize the starting values {\boldsymbol{\beta }}^{\left(0\right)} , {\boldsymbol{\lambda }}^{\left(0\right)} and {\boldsymbol{\gamma }}^{\left(0\right)} . Set k=0 .
Step 2. Compute {\boldsymbol{\varSigma }}_{i} with given {\boldsymbol{\lambda }}^{\left(k\right)} and {\boldsymbol{\gamma }}^{\left(k\right)} . Update {\boldsymbol{\beta}} via
Step 4. Set k\leftarrow k+1 . Repeat steps 2 and 3 until a prespecified convergence criterion is satisfied.
Note that {\boldsymbol{\lambda}} and {\boldsymbol\gamma} are updated together, which reflects their asymptotic dependence, as seen from Theorem 1 in Section 3.3. The algorithm is only guaranteed to converge to a local optimum, which relies critically on the starting values. It is natural to choose the starting value of {\boldsymbol{\beta}} as the least squares estimator in equation (10) with the {\boldsymbol{\varSigma }}_{i} ’s set as identity matrices. Then, the starting value of {\boldsymbol{\gamma}} can be determined assuming {\boldsymbol{T}}_{i}={\boldsymbol{I}}_{{m}_{i}} , and the starting value of {\boldsymbol{\lambda}} can be given as the least squares estimator based on the residuals. The starting estimates of {\boldsymbol{\beta}} and {\boldsymbol{\lambda}} are clearly \sqrt{n} -consistent. Moreover, the negative log-likelihood function is asymptotically convex in a small neighbourhood of the true parameters, according to the theoretical results in Theorem 1 in Section 3.3 and the proofs in Appendix A.2. Hence, asymptotically, the final estimates from the iterative algorithm are guaranteed to be the global optima and are more efficient than the starting values. For the simulation studies and real data analyses, convergence was usually achieved within several iterations.
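As a concrete illustration of the alternating scheme, the following Python sketch (hypothetical helper names; it assumes an identity link for the mean and reuses the amcd_precision helper sketched in Section 3.1) alternates a generalized least squares update for {\boldsymbol{\beta}} with an update of \left({\boldsymbol{\lambda }},{\boldsymbol{\gamma }}\right) ; a generic BFGS minimization of the objective stands in for the paper's quasi-Fisher scoring step.

```python
import numpy as np
from scipy.optimize import minimize

def fit_njmm(y_list, X_list, t_list, s, q, n_outer=50, tol=1e-6):
    """Alternating estimation sketch for the AMCD-based NJMM (identity link)."""
    p = X_list[0].shape[1]
    # starting values: OLS for beta (Sigma_i = I), lambda = gamma = 0 (T_i = I, d_ij^2 = 1)
    beta = np.linalg.lstsq(np.vstack(X_list), np.concatenate(y_list), rcond=None)[0]
    lam, gam = np.zeros(s), np.zeros(q)

    def neg2loglik(cov_par, beta):
        lam, gam = cov_par[:s], cov_par[s:]
        val = 0.0
        for y, X, t in zip(y_list, X_list, t_list):
            r = y - X @ beta
            P = amcd_precision(t, lam, gam)            # Sigma_i^{-1}
            val += -np.linalg.slogdet(P)[1] + r @ P @ r
        return val

    for _ in range(n_outer):
        theta_old = np.concatenate([beta, lam, gam])
        # beta step: generalized least squares with the current Sigma_i^{-1}
        A, b = np.zeros((p, p)), np.zeros(p)
        for y, X, t in zip(y_list, X_list, t_list):
            P = amcd_precision(t, lam, gam)
            A += X.T @ P @ X
            b += X.T @ P @ y
        beta = np.linalg.solve(A, b)
        # (lambda, gamma) step: numerical minimization of the -2 log-likelihood
        res = minimize(neg2loglik, np.concatenate([lam, gam]), args=(beta,), method="BFGS")
        lam, gam = res.x[:s], res.x[s:]
        if np.max(np.abs(np.concatenate([beta, lam, gam]) - theta_old)) < tol:
            break
    return beta, lam, gam
```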
3.3 Asymptotic properties
In this section, we establish the strong consistency and asymptotic normality of the maximum likelihood estimators. The following regularity conditions are imposed for the theoretical analysis.
(C1) The dimensions p , s and q of the covariates {\boldsymbol{x}}_{ij} , {\boldsymbol{z}}_{ij} and {\boldsymbol{w}}_{ijk} are fixed, n\to \text{∞} , and \underset{1\leqslant i\leqslant n}{\mathrm{max}}\;{m}_{i} is bounded.
(C2) The parameter space \bf{\varTheta } of {\boldsymbol{\theta}} is a compact subset of {\boldsymbol{R}}^{p+s+q} , and the true parameter {\boldsymbol{\theta }}_{0}={\left({\boldsymbol{\beta }}_{0}^{\text{T}},{\boldsymbol{\lambda }}_{0}^{\text{T}},{\boldsymbol{\gamma }}_{0}^{\text{T}}\right)}^{\text{T}} lies in the interior of \bf{\varTheta } .
(C3) When n\to \mathrm{\infty } , {\boldsymbol{I}}\left({\boldsymbol{\theta }}_{0}\right)/n converges to a positive definite matrix {\boldsymbol{\mathcal{I}}}\left({\boldsymbol{\theta }}_{0}\right) .
Condition (C1) is standard for practical longitudinal data analysis. Condition (C2) is natural in the theoretical study of the maximum likelihood estimation. Condition (C3) is conventional in regression analysis for modeling unbalanced longitudinal data.
Theorem 1. If n\to \mathrm{\infty } and regularity conditions (C1)–(C3) hold, then (a) the maximum likelihood estimator \widehat{\boldsymbol{\theta }}={\left({\widehat{\boldsymbol{\beta }}}^{T},{\widehat{\boldsymbol{\lambda }}}^{T},{\widehat{\boldsymbol{\gamma }}}^{T}\right)}^{\text{T}} is strongly consistent for the true value {\boldsymbol{\theta }}_{0}={\left({\boldsymbol{\beta }}_{0}^{\text{T}},{\boldsymbol{\lambda }}_{0}^{\text{T}},{\boldsymbol{\gamma }}_{0}^{\text{T}}\right)}^{\text{T}} , and (b) \widehat{\boldsymbol{\theta }}={\left({\widehat{\boldsymbol{\beta }}}^{T},{\widehat{\boldsymbol{\lambda }}}^{T},{\widehat{\boldsymbol{\gamma }}}^{T}\right)}^{\text{T}} is asymptotically normal, that is, \sqrt{n}\left(\widehat{\boldsymbol{\theta }}-{\boldsymbol{\theta }}_{0}\right)\xrightarrow{\text{d}} N\left\{{\boldsymbol{0}},{\boldsymbol{{\mathcal{I}}}}^{-1}\left({\boldsymbol{\theta }}_{0}\right)\right\} .
It can be easily seen that the Fisher information matrix {\boldsymbol{\mathcal{I}}}\left({\boldsymbol{\theta }}_{0}\right) is a block matrix, more specifically,
from which \widehat{\boldsymbol{\beta }} is asymptotically independent of \widehat{\boldsymbol{\lambda} } and \widehat{\boldsymbol{\gamma }} , whereas \widehat{\boldsymbol{\lambda}} and \widehat{\boldsymbol{\gamma }} are not asymptotically independent. Since \widehat{\boldsymbol{\theta }}={\left({\widehat{\boldsymbol{\beta }}}^{T},{\widehat{\boldsymbol{\lambda }}}^{T},{\widehat{\boldsymbol{\gamma }}}^{T}\right)}^{\text{T}} is a consistent estimator for {\boldsymbol{\theta }}_{0} , the asymptotic covariance matrix {\boldsymbol{\mathcal{I}}}^{-1} can be consistently estimated by the inverse of a matrix with block components
4. Laplace joint modeling approach
In this section, we investigate the joint modeling approach based on the multivariate Laplace distribution, which is useful when the response variable has heavier tails. However, there are several kinds of multivariate Laplace distributions, which can be respectively regarded as a particular multivariate Linnik distribution[12], a special case of the multivariate power exponential (PE) distribution[13,14], a Gaussian scale mixture[15], and so on[16]. Among them, we adopt the special case of the multivariate PE distribution.
If {\boldsymbol{y}}_{i} follows an {m}_{i} -dimensional PE distribution, i=1,\dots ,n, then its density function is
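(Reconstruction of the omitted density (12), following the multivariate power exponential family of Refs. [13,14]; the original typesetting may differ.)
f\left({\boldsymbol{y}}_{i}\right)={k}_{i}{\left|{\boldsymbol{\varSigma }}_{i}\right|}^{-\frac{1}{2}}\text{exp}\left\{-\dfrac{1}{2}{\left[{\left({\boldsymbol{y}}_{i}-{\boldsymbol{\mu }}_{i}\right)}^{\text{T}}{\boldsymbol{\varSigma }}_{i}^{-1}\left({\boldsymbol{y}}_{i}-{\boldsymbol{\mu }}_{i}\right)\right]}^{\upsilon }\right\},\qquad {k}_{i}=\dfrac{{m}_{i}\varGamma \left({m}_{i}/2\right)}{{\pi }^{{m}_{i}/2}\varGamma \left(1+{m}_{i}/\left(2\upsilon \right)\right){2}^{1+{m}_{i}/\left(2\upsilon \right)}},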
with the location vector {\boldsymbol{\mu }}_{i}\in {\boldsymbol{R}}^{{m}_{i}} , the positive definite dispersion matrix {\boldsymbol{\varSigma }}_{i} and the shape parameter \upsilon > 0 . When \upsilon =1 , (12) corresponds to a multivariate normal distribution. When \upsilon =0.5 , (12) is the density function of a multivariate Laplace distribution, i.e.,
then E( {\boldsymbol{y}}_{i} )= {\boldsymbol{\mu }}_{i} and var( {\boldsymbol{y}}_{i} )=4( {m}_{i} +1) {\boldsymbol{\varSigma }}_{i} . Based on the multivariate Laplace distribution (13), the proposed AMCD of {\boldsymbol{\varSigma }}_{i}^{-1} (6) and the joint model in Section 3.1, we can obtain the Laplace joint modeling model (LJMM).
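For the simulations in Section 5, draws from (13) can be generated through the standard stochastic representation of elliptical power exponential laws, in which the radial part for \upsilon =0.5 is Gamma-distributed. The sketch below (in Python; rmvlaplace is a hypothetical helper name, not from the original paper) is a minimal illustration under this assumption; its radial choice reproduces E\left({\boldsymbol{y}}_{i}\right)={\boldsymbol{\mu }}_{i} and \text{var}\left({\boldsymbol{y}}_{i}\right)=4\left({m}_{i}+1\right){\boldsymbol{\varSigma }}_{i} .

```python
import numpy as np

def rmvlaplace(mu, Sigma, rng):
    """Draw one vector from the PE-type multivariate Laplace (upsilon = 0.5).

    Uses the elliptical representation y = mu + r * (L @ u), where u is uniform
    on the unit sphere, L L^T = Sigma, and r ~ Gamma(shape=m, scale=2).
    """
    mu = np.asarray(mu, dtype=float)
    m = len(mu)
    z = rng.standard_normal(m)
    u = z / np.linalg.norm(z)              # uniform direction on the sphere
    r = rng.gamma(shape=m, scale=2.0)      # radial part for upsilon = 0.5
    L = np.linalg.cholesky(Sigma)
    return mu + r * (L @ u)

# example: one draw with a 4-dimensional identity dispersion matrix
rng = np.random.default_rng(0)
y = rmvlaplace(np.zeros(4), np.eye(4), rng)
```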
4.2 The quasi-Newton algorithm for maximum likelihood estimation
Twice the negative log-likelihood function of the multivariate Laplace distribution, except for a constant, can be written as
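(Reconstruction of the omitted objective from (6) and (13), up to an additive constant; the original typesetting may differ.)
{l}_{L}\left({\boldsymbol{\theta}} \right)=\displaystyle \sum _{i=1}^{n}\left\{\displaystyle \sum _{j=1}^{{m}_{i}}\text{log}\left({d}_{ij}^{2}\right)+{\left({\boldsymbol{r}}_{i}^{\text{T}}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{T}}_{i}^{\text{T}}{\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i}\right)}^{\frac{1}{2}}\right\},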
Let {\Delta }_{i}={\boldsymbol{r}}_{i}^{\text{T}}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{T}}_{i}^{\text{T}}{\boldsymbol{T}}_{i}{\boldsymbol{\varLambda }}_{i}^{-1}{\boldsymbol{r}}_{i} . By taking partial derivatives of {l}_{L}\left({\boldsymbol{\theta}} \right) with respect to {\boldsymbol{\beta}} , {\boldsymbol{\lambda}} and {\boldsymbol{\gamma}} , respectively, the maximum likelihood estimating equations are
where {\Delta }_{i}^{-\frac{1}{2}} can be regarded as a weight, providing robustness against outliers in the data.
We then estimate {\boldsymbol{\theta}} by minimizing expression (14) via the quasi-Newton algorithm. More specifically, the parameters in {\boldsymbol{\theta}} can be split into {\boldsymbol{\theta }}_{1}={\boldsymbol{\beta}} and {\boldsymbol{\theta }}_{23}={\left({\boldsymbol{\lambda }}^{\text{T}},{\boldsymbol{\gamma }}^{\text{T}}\right)}^{\text{T}} , and solved sequentially with the other parameters held fixed. The detailed algorithm is as follows.
Step 1. Initialize the parameters {\boldsymbol{\beta }}^{\left(0\right)} , {\boldsymbol{\lambda }}^{\left(0\right)} and {\boldsymbol{\gamma }}^{\left(0\right)} as in Section 3.2, and initialize the inverse Hessian matrices {\boldsymbol{H}}_{1}^{\left(0\right)} for {\boldsymbol{\theta }}_{1} and {\boldsymbol{H}}_{23}^{\left(0\right)} for {\boldsymbol{\theta }}_{23} as identity matrices. Set k=0 .
Step 2. For {\boldsymbol{\theta }}_{1}={\boldsymbol{\beta}} , compute the score function
Step 6. For {\boldsymbol{\theta }}_{23}={\left({\boldsymbol{\lambda }}^{\text{T}},{\boldsymbol{\gamma }}^{\text{T}}\right)}^{\text{T}} , compute the score function
Step 9. Update the inverse Hessian matrix via the BFGS formula as in Step 5, with the subscript “1” replaced by “23”.
Step 10. Set k=k+1 and repeat Steps 2 to 9 until a prespecified convergence criterion is met.
Note that the BFGS algorithm is implemented in this section since it has been demonstrated to be among the best quasi-Newton algorithms for solving unconstrained smooth optimization problems and performs very well here. Alternative quasi-Newton algorithms also deserve investigation in the future. For more details, see Refs. [17, 18].
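For reference, Steps 5 and 9 use the standard BFGS update of the inverse Hessian approximation; the sketch below (in Python, generic notation, not code from the original paper) shows its form, with s the parameter step and y the change in the gradient between iterations.

```python
import numpy as np

def bfgs_inverse_hessian_update(H, s, y):
    """Standard BFGS update of an inverse Hessian approximation H."""
    rho = 1.0 / (y @ s)                    # curvature condition: y @ s should be > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```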
5. Simulation studies
In this section, we investigate the finite-sample performance of the proposed AMCD, compare the robustness of AMCD, MCD and ACD for modeling correlation matrices based on various typical covariation structures, and compare the modeling capacities of NJMM and LJMM with AMCD based on different distributions.
Study 1. One thousand replications of the datasets were generated from the following NJMM:
Here, n=50, 100 , or 200, {m}_{i}-1\sim binomial(6,0.8), and {t}_{ij} ’s were generated from the uniform distribution in the unit interval. The covariates {\boldsymbol{x}}_{ij}={\left({x}_{ij1},{x}_{ij2},{x}_{ij3}\right)}^{\text{T}} were generated from the multivariate normal distribution with mean zero, marginal variance 1 and correlation 0.5. In addition, we take {\boldsymbol{z}}_{ij}={\boldsymbol{x}}_{ij} and {\boldsymbol{w}}_{ijk}=\left\{1,{t}_{ij}-{t}_{ik},{\left({t}_{ij}-{t}_{ik}\right)}^{2}\right\} . In Table 1, “Bias” denotes the average bias of the parameter estimates, and “SD” represents the sample standard deviation of the estimates for the parameters, which can be regarded as the true standard deviation of the resulting estimates. “SE” denotes the sample average of the estimated standard errors by the formula, and “Std” represents the standard deviation of these standard errors. Table 1 indicates that the proposed AMCD approach results in unbiased parameter estimates, and the standard error formula works well, especially when n is large.
Table 1. Simulation results for Study 1. All the results are multiplied by a factor of {10}^{2} .
Study 2. We compare the performance of the proposed AMCD with those of MCD and ACD based on different datasets generated from various types of covariation structures, in terms of the estimation robustness of the correlation matrices against model misspecification of the innovation variances. To assess the estimation accuracy of the correlation matrices, we define the entropy loss function {\Delta }_{\text{1}}\left({\boldsymbol{R}}_{i},{\widehat{\boldsymbol{R}}}_{i}\right)=\text{trace} \left({\boldsymbol{R}}_{i}^{-1}{\widehat{\boldsymbol{R}}}_{i}\right)-\text{log}\left|{\boldsymbol{R}}_{i}^{-1}{\widehat{\boldsymbol{R}}}_{i}\right|-{m}_{i} and the quadratic loss function {\Delta}_{\text{2}}\left({\boldsymbol{R}}_{i},{\widehat{\boldsymbol{R}}}_{i}\right)= \text{trace} {\left({\boldsymbol{R}}_{i}^{-1}{\widehat{\boldsymbol{R}}}_{i}-{\boldsymbol{I}}_{{m}_{i}}\right)}^{2} , where {\boldsymbol{R}}_{i} is the true correlation matrix and {\widehat{\boldsymbol{R}}}_{i} is an estimated correlation matrix. Both loss functions are 0 when {\boldsymbol{R}}_{i}={\widehat{\boldsymbol{R}}}_{i} and positive when {\boldsymbol{R}}_{i}\ne {\widehat{\boldsymbol{R}}}_{i} ; a smaller loss implies better estimates of {\boldsymbol{R}}_{i} . Furthermore, the loss functions averaged over all subjects are defined by {L}_{l}=\dfrac{1}{n}\displaystyle \sum _{i=1}^{n}{\Delta}_{l}\left({\boldsymbol{R}}_{i},{\widehat{\boldsymbol{R}}}_{i}\right),l=\mathrm{1,2} .
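These two criteria are straightforward to compute; a minimal sketch in Python follows (the entropy loss is implemented in its standard form \text{trace}\left({\boldsymbol{R}}_{i}^{-1}{\widehat{\boldsymbol{R}}}_{i}\right)-\text{log}\left|{\boldsymbol{R}}_{i}^{-1}{\widehat{\boldsymbol{R}}}_{i}\right|-{m}_{i} ).

```python
import numpy as np

def entropy_loss(R, R_hat):
    """Delta_1(R, R_hat) = trace(R^{-1} R_hat) - log|R^{-1} R_hat| - m."""
    m = R.shape[0]
    M = np.linalg.solve(R, R_hat)          # R^{-1} R_hat
    return np.trace(M) - np.linalg.slogdet(M)[1] - m

def quadratic_loss(R, R_hat):
    """Delta_2(R, R_hat) = trace{(R^{-1} R_hat - I)^2}."""
    m = R.shape[0]
    M = np.linalg.solve(R, R_hat) - np.eye(m)
    return np.trace(M @ M)
```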
In the following, we generated 1000 replications of the datasets with n=100 from the joint mean-covariance model similar to that in Study 1, except that {r}_{ij} follows respectively the proposed AMCD, MCD and ACD. Then the estimates and losses of the correlation matrices were computed respectively based on the true model \mathrm{log}\left({d}_{ij}^{2}\right)={\mathrm{\lambda }}_{0}+{z}_{ij1}{\mathrm{\lambda }}_{1}+{z}_{ij2}{\mathrm{\lambda }}_{2}+{z}_{ij3}{\mathrm{\lambda }}_{3} and the misspecified model \mathrm{log}\left({d}_{ij}^{2}\right)={\mathrm{\lambda }}_{0}+{z}_{ij1}{\mathrm{\lambda }}_{1} for the innovation variances. For comparison, the losses obtained from fitting the true and misspecified models for the innovation variances are exhibited in Tables 2 and 3, respectively, where ENL and QUL represent respectively the average of the entropy loss {L}_{1} and quadratic loss {L}_{2} , with empirical standard errors in parentheses.
Table 2. Simulation results for Study 2. Fitting the true model for log( {d}_{ij}^{2} ).
In terms of the average losses in Table 2, it is vital to know the true covariation structure; that is, when the true covariation structure follows AMCD, the losses of estimating the correlation matrices increase when the covariation structure is incorrectly decomposed based on MCD or ACD, and vice versa. By comparing Table 2 with Table 3, we can analyze the influence of model misspecification for the innovation variances on estimating the correlation matrices. Specifically, by comparing the left two AMCD columns of Table 2 with those of Table 3, ENL and QUL both vary little, no matter whether the true covariation structure is AMCD, MCD or ACD; the losses in the right two ACD columns of Tables 2 and 3 also exhibit a similar pattern; however, the losses in the middle two MCD columns of Tables 2 and 3 vary significantly. Thus, the estimation of the correlation matrices is robust with respect to the model misspecification for the innovation variances when fitting the AMCD or ACD structure, but is nevertheless not robust when fitting the MCD structure.
Study 3. We generated 1000 replications of the datasets with n=100 from several distributions, where the joint model was the same as that in Study 1. Specifically, {\boldsymbol{y}}_{i} follows respectively the multivariate normal distribution (Scenario 1), the multivariate Laplace distribution (Scenario 2) and the multivariate PE distribution with \upsilon =0.7 (Scenario 3), and we fitted NJMM and LJMM under each scenario to compare their modeling capacities.
To assess the estimation accuracy of the parameters, the Bias, SD and mean squared errors (MSE) were calculated, and the simulation results are displayed in Tables 4–6. From Table 4, NJMM yields smaller MSE than LJMM for all parameters when the data follow the multivariate normal distribution (Scenario 1). Moreover, the biases of NJMM and LJMM are generally small, except that the bias for {\boldsymbol{\lambda }}_{0} when fitting LJMM is rather large under Scenario 1, which might be due to the influence of distribution misspecification and the difficulty of estimating the constant term in the innovation variance model. From Table 5, a similar discussion can be carried out for Scenario 2, and LJMM outperforms NJMM when the data follow the multivariate Laplace distribution. Table 6 shows that LJMM has slightly smaller MSE than NJMM when the data follow the multivariate PE distribution with \upsilon =0.7 (Scenario 3), which lies between the multivariate normal distribution and the multivariate Laplace distribution.
Table 4. Simulation results for Study 3. Fitting NJMM and LJMM based on the data generated from the multivariate normal distribution (Scenario 1).
Table 6. Simulation results for Study 3. Fitting NJMM and LJMM based on the data generated from the multivariate PE distribution with \upsilon =0.7 (Scenario 3).
6. Real data analysis
6.1 The cattle data
We applied the proposed AMCD approach to the balanced cattle data, which was initially introduced in Ref. [19]. This dataset consists of two treatment groups, A and B, and the weight of each animal was measured 11 times over a period of 133 d. Notably, the 30 animals in group A were analyzed in Refs. [11, 20] and [8]. Pan and Mackenzie[20] found that it is reasonable to adopt three polynomials for modeling jointly the mean, the log-innovation variances and the autoregressive coefficients based on MCD, and the optimal triplet of the polynomial orders is determined as (8,3,4) in terms of the Bayesian information criterion (BIC).
From expression (8) based on AMCD, we can calculate the corresponding innovation variances {d}_{ij}^{2} and generalized autoregressive parameters {\phi }_{ijk}\left(j > k\right) of the sample covariance matrix. Fig. 1 displays \text{log}\left({d}_{ij}^{2}\right) versus time and {\phi }_{ijk} versus time lag, both of which indicate trends that might be properly characterized by polynomials. We then adopt three polynomials
Figure 1. Cattle data: sample regressograms and fitted curves for (a) log-innovation variances and (b) autoregressive coefficients. Solid lines, curves fitted by the proposed AMCD method; dashed lines, 95% pointwise confidence intervals using the bootstrapping method.
where p , s and q are the three polynomial orders, and {\widehat{l}}_{\text{max}} is the corresponding maximum log-likelihood. The model with the smallest BIC is then judged to be the optimal model, which captures the dynamics more precisely and parsimoniously. Due to the asymptotic orthogonality of the mean parameter and the covariation parameters, these two kinds of parameters can be searched separately, which is more computationally efficient. Note that p within the optimal model is 8 in both Refs. [20] and [8], which motivates us to first fix p=\text{8} and search for s and q . The optimal \left(s,q\right) is determined to be (3,3) by comparing several candidate models, and the major searching process is displayed in Table 7. Then, we fix \left(s,q\right)=\left(\text{3},\text{3}\right) , search for p , and determine the optimal order for AMCD as \left(p,s,q\right)=\left(\text{8,3,3}\right) . Fig. 1 shows the optimal AMCD-based fitted curves for the \text{log}\left({d}_{ij}^{2}\right) ’s and {\phi }_{ijk} ’s and their corresponding \text{95} \% pointwise confidence intervals obtained using the bootstrapping method; these curves capture the pattern well for the log-innovation variances and rather well for the generalized autoregressive parameters.
Table 7. Cattle data: comparison of various models based on the proposed AMCD approach.
The optimal order based on AMCD is slightly less than that based on MCD, but identical to that based on ACD, which is also determined by using a similar searching process. To compare the performances of the aforementioned three decompositions, we regard the sample correlation matrix as the true matrix, and still assess the estimation accuracy of the correlation matrix by the entropy loss {L}_{1} and quadratic loss {L}_{2} . The \left({L}_{1},{L}_{2}\right) values for the optimal models based on AMCD, ACD and MCD are (3.0866,35.6300), (2.8660,30.8312), and (1.9237,12.9392), respectively. Obviously, the MCD-based model yields more accurate estimates of the correlation matrix, and the AMCD and ACD-based models have similar performances. This indicates that the MCD structure might be more suitable for this dataset than AMCD and ACD, in terms of the estimation accuracy of the correlation matrix. However, when we keep p and q fixed, and take the polynomial order for the innovation variances as s=\text{1} rather than the optimal order s=\text{3} , the \left({L}_{1},{L}_{2}\right) ’s are respectively (3.0743, 31.8634), (2.9381, 30.0376), and (2.8819, 29.2350). The estimation accuracy of the correlation matrix for AMCD and ACD-based models varies slightly, again implying robustness with respect to the model misspecification of the innovation variances; while the losses for the MCD-based model increase significantly which might lead to the opposite conclusion.
6.2 The sleep dose-response data
We apply NJMM and LJMM based on the proposed AMCD to the sleep dose-response data, which was initially investigated by Ref. [21] and subsequently analyzed in Ref. [22] using Bayesian inference for the joint model based on the t distribution and MCD. The days of sleep deprivation and the corresponding average reaction times were recorded for each of the 18 participants in the 3-h group. Fig. 2a shows the trajectories of the average reaction times over an equally spaced 10-d period, together with the mean profile plot and the corresponding \pm 1 standard deviations across the period. There seem to be sudden jumps and drops in the trajectories of the first and sixth subjects, which may indicate obvious outliers that cannot be properly handled using NJMM.
Figure 2. Sleep dose-response data: (a) trajectories of average reaction time; (b) sample regressograms for log-innovation variances; (c) sample regressograms for autoregressive coefficients.
The mean response varies linearly over time, and thus can be modeled by a 1st-degree polynomial in time. Fig. 2b and c show the sample regressograms for log-innovation variances and autoregressive coefficients, implying higher-order polynomial functions in time or time lag. Hence, it is suitable to adopt three polynomials expressed as (16) with p=1 and s,q to be determined. We fitted the sleep dose-response data using AMCD-based NJMM and LJMM for various degrees of the Poly( 1,s,q ) models. The numbers of parameters, {\widehat{l}}_{\text{max}} and BIC( 1,s,q ) values for the searched Poly( 1,s,q )’s are listed in Table 8, where the BIC values are calculated via formula (17). According to the BIC values, Poly(1,3,4) and Poly(1,3,2) are the best for NJMM and LJMM, respectively, and the data are fitted significantly better by LJMM. Table 9 displays the maximum likelihood estimates and standard errors when fitting respectively NJMM and LJMM with the above two best polynomials. Note that NJMM and LJMM have similar parameter estimates. However, the standard errors of \widehat{\boldsymbol{\lambda } } and \widehat{\boldsymbol{\gamma } } for LJMM are smaller than those of NJMM with Poly(1,3,2); the standard errors of \widehat{\boldsymbol{\lambda } } for LJMM are slightly larger than those of NJMM with Poly(1,3,4), and the standard errors of \widehat{\boldsymbol{\gamma }} for LJMM are still smaller in this case. This indicates that the parameter estimates of LJMM generally possess lower variability for these data.
Table 8. Sleep dose-response data: comparison of various Poly( 1,{s,q} ) choices between NJMM and LJMM.
In longitudinal data analysis, we face various kinds of covariation structures and need to choose the most suitable decomposition among the candidates. If the covariance matrix possesses a typical structure, we might apply the MACD method; otherwise, we might adopt the MCD method for the precision matrix. Furthermore, when we are interested in robust estimation of the correlation matrix against model misspecification of the innovation variances, we can apply the ACD method to the covariance matrix. However, a decomposition of the precision matrix targeting this robustness has been much less investigated. In this paper, we have proposed AMCD of the precision matrix and established its role in providing a robust estimator of the correlation matrix. In addition, we have investigated the AMCD-based LJMM, which may achieve both the aforementioned robust estimation and robustness to outliers in the data.
Recently, an autoregressive moving average Cholesky decomposition (ARMACD) was proposed in Ref. [23]; this decomposition is more general than both MACD and MCD. Therefore, it is worthwhile to develop a robust estimation of the correlation matrix on the basis of ARMACD in the future. Furthermore, these decompositions correspond to various time series models, so various robust time series approaches could also be generalized for robust estimation of the correlation, covariance or precision matrix. In addition, longitudinal data may suffer from missingness, such as informative dropouts[24]. We would then need to specify a suitable dropout mechanism, establish the complete-data log-likelihood, and implement an EM algorithm to calculate the maximum likelihood estimates. Moreover, heterogeneity also deserves investigation, not only for the mean parameters but also for the covariation structures. The finite mixture model[25] based on NJMM or LJMM with AMCD may perform well, which is left for future research.
Conflict of Interest
The authors declare that they have no conflict of interest.