The Metropolis-Hastings algorithm for simulating the individual parameters
We consider a joint model for the observations $y_i$ and the individual parameters $\psi_i$ of individual $i$, defined by $\qcyipsii$, the conditional distribution of the observations of individual $i$ (see Modeling the observations), and $\ppsii(\psi_i)$, the distribution of the individual parameters of individual $i$ (see Modeling the individual parameters).
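In generic density notation (used here in place of the document's macros), this joint model factorizes as
\(
p(y_i , \psi_i) = p(y_i | \psi_i) \, p(\psi_i) .
\)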
Our goal here is to generate values from the conditional distribution $\qcpsiiyi$ of the individual parameters given the observations. This distribution usually cannot be computed in closed form when the model is not a linear Gaussian one. However, we will see that the Metropolis-Hastings (MH) algorithm allows us to draw a sequence $(\psi_i^{(\imh)}, \imh=1,2,\ldots)$ that converges in distribution to the target distribution $\qcpsiiyi$.
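In the same generic notation, the target conditional distribution follows from Bayes' rule,
\(
p(\psi_i | y_i) = \frac{p(y_i | \psi_i)\, p(\psi_i)}{p(y_i)} \;\propto\; p(y_i | \psi_i)\, p(\psi_i) ,
\)
where the normalizing constant $p(y_i) = \int p(y_i | \psi_i)\, p(\psi_i)\, d\psi_i$ is precisely the integral that has no closed form in general.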
We will consider a very general model for the individual parameters:
\(
\psi_i = M(\beta,c_i,\eta_i) , \quad\quad (1)
\)
where $\beta$ is a vector of fixed effects, $c_i$ a vector of (observed) individual covariates and $\eta_i$ a vector of random effects whose probability distribution is denoted $\qetai$.
The MH algorithm is used to simulate a sequence of random effects $(\eta_i^{(\imh)}, \imh=1,2,\ldots)$ with the target distribution being the conditional distribution $\qcetaiyi$ of the random effects $\eta_i$. We will then obtain the sequence $(\psi_i^{(\imh)}, \imh=1,2,\ldots)$ using (1).
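Since $\psi_i$ is a deterministic function of $\eta_i$ through (1), this target can be written (again in generic notation, up to a normalizing constant) as the likelihood of the observations evaluated at $M(\beta,c_i,\eta_i)$ times the prior density of $\eta_i$:
\(
p(\eta_i | y_i) \;\propto\; p\bigl(y_i \,|\, M(\beta,c_i,\eta_i)\bigr)\, p(\eta_i) .
\)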
The MH algorithm is iterative and requires an initial value $\eta_i^{(0)}$. Then, at iteration $\imh$, we
- Draw a new value $\ceta_i^{(\imh)}$ from some proposal distribution $q_{\imh}(\cdot \, ; \eta_i^{(\imh-1)})$
- Compute $\cpsi_i^{(\imh)} = M(\beta,c_i,\ceta_i^{(\imh)})$
- Accept this new value, that is, set $\eta_i^{(\imh)}=\ceta_i^{(\imh)}$, with the acceptance probability sketched below; otherwise keep $\eta_i^{(\imh)}=\eta_i^{(\imh-1)}$
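A sketch of this acceptance probability, written in the generic $p(\cdot)$ notation used above (note that the normalizing constant $p(y_i)$ cancels in the ratio, so only the product of likelihood and prior is needed):
\(
\alpha^{(\imh)} = \min\left(1 \; , \; \frac{p\bigl(\ceta_i^{(\imh)} \,|\, y_i\bigr)\; q_{\imh}\bigl(\eta_i^{(\imh-1)} \, ; \ceta_i^{(\imh)}\bigr)}{p\bigl(\eta_i^{(\imh-1)} \,|\, y_i\bigr)\; q_{\imh}\bigl(\ceta_i^{(\imh)} \, ; \eta_i^{(\imh-1)}\bigr)}\right) .
\)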
Several proposal distributions are used in $\monolix$:
- $q^{(1)}=\qetai$ is the marginal distribution of $\eta_i$, that is, the normal distribution ${\cal N}(0,\Omega)$. Since this proposal coincides with the prior of $\eta_i$, the acceptance probability reduces to a ratio of conditional densities of the observations (see the sketch after this list).
- $q^{(2,\ieta)}$, for $\ieta=1,2,\ldots, d$, is the unidimensional Gaussian random walk for component $\ieta$ of $\eta_i$: the current value of this component is perturbed by a noise term $\xi_{i,\ieta}^{(\imh)} \sim {\cal N}(0, \upsilon_\ieta^{(\imh)})$. The variance $\upsilon_\ieta^{(\imh)}$ of this random walk is calibrated in order to reach an optimal acceptance rate $\alpha^\star$ ($\monolix$ uses $\alpha^\star = 0.3$ by default). This transition kernel is symmetrical, so the proposal densities cancel in the acceptance probability (see the sketch after this list).
- $q^{(3,\Meta)}$, for $\Meta \subset \{1,2,\ldots,d\}$, is the multidimensional Gaussian random walk for the vector $\eta_\Meta = (\eta_\ieta , \ieta\in \Meta)$: this vector is perturbed by a Gaussian vector $\xi_{i,\Meta}^{(\imh)}=(\xi_{i,\ieta}^{(\imh)}, \ieta\in \Meta)$ with diagonal variance matrix $\Upsilon_\Meta^{(\imh)}$. Here as well, the variance $\Upsilon_\Meta^{(\imh)}$ of this random walk is adjusted in order to reach the optimal acceptance rate $\alpha^\star$. Different subsets $\Meta$ are chosen at each iteration.
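As a sketch, in the generic $p(\cdot)$ notation used above (not the document's own macros), the two random walks propose
\(
\ceta_{i,\ieta}^{(\imh)} = \eta_{i,\ieta}^{(\imh-1)} + \xi_{i,\ieta}^{(\imh)} ,
\qquad
\ceta_{i,\Meta}^{(\imh)} = \eta_{i,\Meta}^{(\imh-1)} + \xi_{i,\Meta}^{(\imh)} ,
\)
the remaining components being left unchanged. For $q^{(1)}$, the acceptance probability then reduces to
\(
\min\left(1 \; , \; \frac{p\bigl(y_i \,|\, \cpsi_i^{(\imh)}\bigr)}{p\bigl(y_i \,|\, \psi_i^{(\imh-1)}\bigr)}\right) ,
\)
while for the symmetric kernels $q^{(2,\ieta)}$ and $q^{(3,\Meta)}$ it reduces to
\(
\min\left(1 \; , \; \frac{p\bigl(\ceta_i^{(\imh)} \,|\, y_i\bigr)}{p\bigl(\eta_i^{(\imh-1)} \,|\, y_i\bigr)}\right) .
\)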
The MH algorithm then consists in successively using these different proposals for each individual $i=1,2,\ldots , N$.
The variances $(\upsilon_\ieta,\, \ieta=1,2,\ldots, d)$ of the proposals $q^{(2,\ieta)}$ are updated adaptively at each iteration $\imh$, using a constant $0<\delta<1$ and the empirical acceptance rate $\overline{\alpha}_\ieta^{(\imh)}$ observed at iteration $\imh$; a standard update rule is sketched below.
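A sketch of a standard adaptation rule of this type, together with the empirical acceptance rate computed over the $N$ individuals (the exact rule implemented in $\monolix$ may differ):
\(
\upsilon_\ieta^{(\imh)} = \upsilon_\ieta^{(\imh-1)} \left( 1 + \delta \left( \overline{\alpha}_\ieta^{(\imh-1)} - \alpha^\star \right) \right) ,
\qquad
\overline{\alpha}_\ieta^{(\imh)} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\bigl\{ \eta_{i,\ieta}^{(\imh)} = \ceta_{i,\ieta}^{(\imh)} \bigr\} .
\)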
Indeed, too small (resp. too large) a variance for the random walk leads to too large (resp. too small) an acceptance probability. The strategy described here therefore adaptively corrects the variance so as to reach a given acceptance probability $\alpha^\star$. A small value of $\delta$ smooths out the sequence $(\upsilon_\ieta^{(\imh)})$; $\monolix$ uses $\delta = 0.4$ as a default value.
We can use the same strategy for updating the diagonal variance matrices $\Upsilon_\Meta$, $\Meta \subset \{1,2,\ldots,d\}$, of the kernels $q^{(3,\Meta)}$.
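To make the whole procedure concrete, here is a minimal Python sketch (not $\monolix$'s actual implementation) of this scheme for a single individual. The functions `loglik(psi, y_i)` (conditional log-density of the observations) and `M(beta, c_i, eta)` (structural model (1)) are assumed to be supplied by the user, $\psi_i$ is assumed to have the same length as $\eta_i$, the multidimensional kernel $q^{(3,\Meta)}$ is omitted, and the variances are adapted here from per-draw acceptance indicators rather than from the empirical rate over the $N$ individuals.

```python
import numpy as np

def mh_individual(y_i, c_i, beta, Omega, loglik, M, K=500,
                  alpha_star=0.3, delta=0.4, seed=0):
    """Adaptive Metropolis-Hastings sketch for the random effects eta_i of one
    individual.  The target is p(eta_i | y_i), proportional to
    p(y_i | M(beta, c_i, eta_i)) * N(eta_i; 0, Omega).  Each iteration cycles an
    independent proposal from the prior (q1) and unidimensional Gaussian random
    walks (q2), adapting the random-walk variances toward alpha_star."""
    rng = np.random.default_rng(seed)
    d = Omega.shape[0]
    chol = np.linalg.cholesky(Omega)        # for sampling from N(0, Omega)
    Omega_inv = np.linalg.inv(Omega)

    def log_target(eta):
        # log p(y_i | psi_i) + log-prior of eta_i, up to an additive constant
        return loglik(M(beta, c_i, eta), y_i) - 0.5 * eta @ Omega_inv @ eta

    eta = np.zeros(d)                        # eta_i^(0)
    lt = log_target(eta)
    ups = np.diag(Omega).copy()              # random-walk variances upsilon_l
    psis = np.empty((K, d))                  # assumes psi_i has length d too

    for k in range(K):
        # --- kernel q1: independent proposal from the prior N(0, Omega).
        #     The prior cancels, so acceptance uses the likelihood ratio only.
        eta_new = chol @ rng.standard_normal(d)
        ll_new = loglik(M(beta, c_i, eta_new), y_i)
        ll_old = loglik(M(beta, c_i, eta), y_i)
        if np.log(rng.uniform()) < ll_new - ll_old:
            eta, lt = eta_new, log_target(eta_new)

        # --- kernel q2: one symmetric random walk per component of eta_i.
        #     Acceptance uses the ratio of (unnormalized) conditionals.
        for l in range(d):
            eta_new = eta.copy()
            eta_new[l] += np.sqrt(ups[l]) * rng.standard_normal()
            lt_new = log_target(eta_new)
            accepted = np.log(rng.uniform()) < lt_new - lt
            if accepted:
                eta, lt = eta_new, lt_new
            # crude per-draw adaptation toward the target acceptance rate
            ups[l] *= 1.0 + delta * (float(accepted) - alpha_star)

        psis[k] = M(beta, c_i, eta)          # store psi_i^(k)

    return psis
```

In practice the same kernels would be cycled for every individual $i=1,\ldots,N$, and the draws would typically be stored only after a burn-in period.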
The simulated sequence $(\psi_i^{(\imh)}, \imh=1,2,\ldots)$ can then be used to estimate empirically the conditional distribution $\qcpsiiyi$ and the conditional mean $\esp{F(\psi_i)| y_i}$ of any function $F$ such that $\esp{F^2(\psi_i)| y_i}<+\infty$. The accuracy of the estimation obviously depends on the length $K$ of the sequence $(\psi_i^{(\imh)})$ used for the estimation, since the variance of the estimator decreases only as $1/K$. Estimation is also improved if the sequence starts from a point that is representative of the equilibrium distribution (burn-in is one way of finding a good starting point) or known to have reasonably high probability (for instance, the conditional mode or the last value obtained with SAEM).
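For instance, the conditional mean is estimated by the empirical average of the draws (possibly after discarding a burn-in period):
\(
\esp{F(\psi_i)\,|\, y_i} \;\approx\; \frac{1}{K}\sum_{\imh=1}^{K} F\bigl(\psi_i^{(\imh)}\bigr) .
\)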