Estimation of the log-likelihood
The observed log-likelihood ${\llike}(\theta;\by)=\log({\like}(\theta;\by))$ can be estimated without any approximation of the model using a Monte-Carlo approach.

Since
{{Equation1
|equation=<math> {\like}(\theta;\by) = \prod_{i=1}^{N} \pyi(y_i;\theta) , </math> }}
we will estimate the log-pdf $\log(\pyi(y_i;\theta))$ for each individual and derive an estimate of the log-likelihood as the sum of these individual log-likelihoods. We will explain here how to estimate the log-pdf $\log(\pyi(y_i;\theta))$ for any individual $i$.
Using the $\phi$-representation of the model, notice first that $\pyi(y_i;\theta)$ can be decomposed as follows:
{{Equation1
|equation=<math>\begin{eqnarray}
\pyi(y_i;\theta) &=&
\displaystyle{ \int \pyipsii(y_i,\phi_i;\theta)\,d\phi_i }\\
&=& \displaystyle{ \int \pcyiphii(y_i {{!}} \phi_i;\theta)\pphii(\phi_i;\theta)\,d\phi_i }\\
&=& \esps{\qphii}{\pcyiphii(y_i {{!}} \phi_i;\theta)}.
\end{eqnarray}</math> }}
$\pyi(y_i;\theta)$ is expressed as a mean. It can therefore be approximated by an empirical mean using a Monte-Carlo procedure:
<blockquote>
1. draw $M$ independent realizations $\phi_i^{(1)}$, $\phi_i^{(2)}$, ..., $\phi_i^{(M)}$ of the normal distribution $\qphii(\, \cdot \, ; \theta)$,
</blockquote>
<blockquote>
2. estimate $ \pyi(y_i;\theta)$ with
{{Equation1
|equation=<math>
\hat{p}_{i,M} = \displaystyle{ \frac{1}{M} } \sum_{m=1}^{M}\pcyiphii(y_i {{!}} \phi_i^{(m)};\theta)
</math> }}
</blockquote>
By construction, this estimator is unbiased:
{{Equation1
|equation=<math>\begin{eqnarray}
\esp{\hat{p}_{i,M}} &=& \esps{\qphii}{\pcyiphii(y_i {{!}} \phi_i^{(m)};\theta)} \\
&=& \pyi(y_i;\theta)
\end{eqnarray}</math> }}

Furthermore, it is consistent since its variance decreases as $1/M$:

{{Equation1
|equation=<math>
\var{\hat{p}_{i,M} }= \displaystyle{\frac{1}{M} }\vars{\qphii}{\pcyiphii(y_i {{!}} \phi_i^{(m)};\theta)}.
</math> }}
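As an illustration only, here is a minimal sketch (in Python) of this plain Monte-Carlo estimator for a single individual, assuming a hypothetical mono-exponential model with log-parameters and a constant error model; the function names, data and parameter values are all invented for the example and are not part of the $\monolix$ implementation.

<pre>
import numpy as np

rng = np.random.default_rng(0)

def structural_model(phi, t):
    # hypothetical structural model: mono-exponential decay, phi = (log A, log k)
    A, k = np.exp(phi)
    return A * np.exp(-k * t)

def log_p_y_given_phi(y, t, phi, sigma):
    # log p(y_i | phi_i ; theta) for a constant error model y_ij = f(t_ij, phi_i) + sigma * e_ij
    res = (y - structural_model(phi, t)) / sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * res**2)

def naive_mc_log_p(y, t, mu_pop, omega, sigma, M=10_000):
    # 1. draw M realizations of phi_i from its population distribution
    phis = rng.normal(mu_pop, omega, size=(M, len(mu_pop)))
    # 2. average the conditional densities p(y_i | phi_i^(m)); the average is
    #    computed on the log scale (log-sum-exp) to avoid numerical underflow
    log_cond = np.array([log_p_y_given_phi(y, t, phi, sigma) for phi in phis])
    return np.logaddexp.reduce(log_cond) - np.log(M)   # estimate of log p(y_i; theta)

# toy data for a single individual (all values invented)
t_obs = np.linspace(0.5, 10.0, 8)
phi_true = np.array([np.log(10.0), np.log(0.3)])
y_obs = structural_model(phi_true, t_obs) + 0.5 * rng.standard_normal(t_obs.size)

print(naive_mc_log_p(y_obs, t_obs,
                     mu_pop=np.array([np.log(9.0), np.log(0.35)]),
                     omega=np.array([0.3, 0.3]), sigma=0.5))
</pre>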
Nevertheless, we will now see that it is possible to improve the statistical properties of this estimator.
For any distribution $\tqphii$ absolutely continuous with respect to the marginal distribution $\qphii$, we can write
{{Equation1
|equation=<math>\begin{eqnarray}
\pyi(y_i;\theta) &=& \displaystyle{ \int \pyiphii(y_i,\phi_i;\theta)\,d\phi_i }\\
&=& \displaystyle{ \int \pcyiphii(y_i {{!}} \phi_i;\theta)\frac{\pphii(\phi_i;\theta)}{\tpphii(\phi_i;\theta)}\tpphii(\phi_i;\theta)\,d\phi_i }\\
&=& \esps{\tqphii}{\pcyiphii(y_i {{!}} \phi_i;\theta)\displaystyle{\frac{\pphii(\phi_i;\theta)}{\tpphii(\phi_i;\theta)} } }.
\end{eqnarray}</math> }}
We can now approximate $\pyi(y_i;\theta)$ via an ''Importance Sampling'' integration method using $\tqphii$ as a proposal distribution:
<blockquote>
1. draw $M$ independent realizations $\phi_i^{(1)}$, $\phi_i^{(2)}$, ..., $\phi_i^{(M)}$ of the proposal distribution $\tqphii(\, \cdot \, ; \theta)$,

2. estimate $ \pyi(y_i;\theta)$ with
{{Equation1
|equation=<math>
\hat{p}_{i,M} = \displaystyle{ \frac{1}{M} }\sum_{m=1}^{M}\pcyiphii(y_i {{!}} \phi_i^{(m)};\theta) \displaystyle{\frac{\pphii(\phi_i^{(m)};\theta)}{\tpphii(\phi_i^{(m)};\theta)} }
</math> }}
</blockquote>
This new estimator remains unbiased, and its variance still decreases as $1/M$:
{{Equation1
|equation=<math>
\var{\hat{p}_{i,M} }= \displaystyle{\frac{1}{M} }\vars{\tqphii}{\pcyiphii(y_i {{!}} \phi_i^{(m)};\theta)\displaystyle{ \frac{\pphii(\phi_i^{(m)};\theta)}{\tpphii(\phi_i^{(m)};\theta)} } }.
</math> }}
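Continuing the sketch above (and reusing its <code>log_p_y_given_phi</code>, <code>rng</code> and toy data), the importance-sampling version could look as follows; the Gaussian proposal parametrized by <code>mu_prop</code> and <code>sd_prop</code> is an arbitrary choice made for the illustration, not the proposal used by $\monolix$.

<pre>
import numpy as np
from scipy.stats import norm

def is_mc_log_p(y, t, mu_pop, omega, sigma, mu_prop, sd_prop, M=10_000):
    # 1. draw M realizations of phi_i from the proposal distribution
    phis = rng.normal(mu_prop, sd_prop, size=(M, len(mu_pop)))
    # 2. weighted average of p(y_i | phi_i^(m)), with weights p(phi_i^(m)) / proposal(phi_i^(m))
    log_cond = np.array([log_p_y_given_phi(y, t, phi, sigma) for phi in phis])
    log_w = (norm.logpdf(phis, mu_pop, omega).sum(axis=1)        # log p(phi_i^(m); theta)
             - norm.logpdf(phis, mu_prop, sd_prop).sum(axis=1))  # - log of the proposal density
    return np.logaddexp.reduce(log_cond + log_w) - np.log(M)     # estimate of log p(y_i; theta)
</pre>

The closer the proposal is to the conditional distribution $\qcphiiyi$, the less variable the weighted terms are, which is precisely the point developed below.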
There exist infinitely many possible proposal distributions $\tpphii$ which all provide the same rate of convergence $1/M$.
The trick is to reduce the variance of the estimator by selecting a proposal $\tpphii$ for which the variance of the weighted terms $\pcyiphii(y_i | \phi_i;\theta)\pphii(\phi_i;\theta)/\tpphii(\phi_i;\theta)$ is as small as possible.
Imagine that we use the conditional distribution $\qcphiiyi$ as the proposal. Then, for any $m=1,2,\ldots,M$,
{{Equation1
|equation=<math>\begin{eqnarray}
\pcyiphii(y_i {{!}} \phi_i^{(m)};\theta) \displaystyle{ \frac{\pphii(\phi_i^{(m)};\theta)}{\tpphii(\phi_i^{(m)};\theta)} }
&=& \pcyiphii(y_i {{!}} \phi_i^{(m)};\theta) \displaystyle{ \frac{\pphii(\phi_i^{(m)};\theta)}{\pcphiiyi(\phi_i^{(m)} {{!}} y_i;\theta)} } \\
&=& \pyi(y_i;\theta)
\end{eqnarray}</math> }}
which means that $\hat{p}_{i,M}=\pyi(y_i;\theta)$! Such an estimator is optimal since its variance is zero: a single realization of $\qcphiiyi$ suffices to compute $\pyi(y_i;\theta)$ exactly. The problem is that it is not possible to generate the $\phi_i^{(m)}$'s with this conditional distribution, since that would require computing its normalizing constant ... which is precisely $\pyi(y_i;\theta)$.
Nevertheless, this conditional distribution can be estimated using the Metropolis-Hastings algorithm described in the [[The Metropolis-Hastings algorithm for simulating the individual parameters|The Metropolis-Hastings algorithm]] section, and a practical proposal "close" to the optimal proposal $\qcphiiyi$ can be derived. We can then expect a very accurate estimation with a relatively small Monte-Carlo size $M$.
In $\monolix$, the mean and variance of the conditional distribution $\qcphiiyi$ are estimated by M-H for each individual $i$. The $\phi_i^{(m)}$'s are then drawn from a non-central $t$ distribution:
{{Equation1
|equation=<math>\phi_i^{(m)} = \mu_i + \sigma_i \times T_{i,m} </math> }}
where $\mu_i$ and $\sigma^2_i$ are estimates of $\esp{\phi_i|y_i;\theta}$ and $\var{\phi_i|y_i;\theta}$, and where $(T_{i,m})$ is a sequence of ''i.i.d.'' random variables that follow a Student $t$-distribution with $\nu$ degrees of freedom.
$\monolix$ uses the default value $\nu=5$. It is also possible to automatically test several degrees of freedom in $\{2, 5, 10, 20\}$ and to select the one that provides the smallest empirical variance for $\widehat{ {\llike} }_M(\theta;\by) = \sum_{i=1}^{N}\log(\hat{p}_{i,M})$.
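Again as a hedged sketch reusing the helpers above: <code>mu_i</code> and <code>sd_i</code> stand for previously obtained estimates of the conditional mean and standard deviation of $\phi_i$ (e.g., from the M-H algorithm), the location-scale $t$ proposal follows the formula above, and the automatic choice of $\nu$ is only indicated in a comment; this is an illustration, not the actual $\monolix$ code.

<pre>
import numpy as np
from scipy.stats import norm, t as student_t

def t_proposal_log_p(y, t_obs, mu_pop, omega, sigma, mu_i, sd_i, nu=5, M=5_000):
    # phi_i^(m) = mu_i + sd_i * T_{i,m}, with T_{i,m} i.i.d. Student-t with nu d.f.
    phis = mu_i + sd_i * rng.standard_t(nu, size=(M, len(mu_i)))
    log_cond = np.array([log_p_y_given_phi(y, t_obs, phi, sigma) for phi in phis])
    # log density of the location-scale t proposal, component by component
    log_prop = (student_t.logpdf((phis - mu_i) / sd_i, df=nu).sum(axis=1)
                - np.sum(np.log(sd_i)))
    log_w = norm.logpdf(phis, mu_pop, omega).sum(axis=1) - log_prop
    return np.logaddexp.reduce(log_cond + log_w) - np.log(M)

# To mimic the automatic choice of nu, one could repeat the estimation of
# sum_i log p_hat_{i,M} for nu in {2, 5, 10, 20} and keep the value that gives
# the smallest empirical variance over replications.
</pre>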
{{Remark
|title=Remark
|text= The estimator $\widehat{ {\like} }_M(\theta;\by)$ of the likelihood is unbiased, but $\widehat{ {\llike} }_M(\theta;\by)=\log(\widehat{ {\like} }_M(\theta;\by))$ is a biased estimator of the log-likelihood ${\llike}(\theta;\by)$. Indeed, by Jensen's inequality,
{{Equation1
|equation=<math> \esp{\log(\widehat{ {\like} }_M(\theta;\by))} \leq \log \esp{\widehat{ {\like} }_M(\theta;\by)} = \log ({\like}(\theta;\by)). </math> }}
But ''i)'' the bias decreases as $M$ increases, and ''ii)'' the bias is reduced when $\widehat{ {\like} }_M(\theta;\by)$ is close to ${\like}(\theta;\by)$. It is therefore highly recommended to use a proposal as close as possible to the conditional distribution $\qcphiiyi$, which means estimating this conditional distribution before estimating the log-likelihood.
}}
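To connect with the standard errors quoted in the example below, here is a hedged sketch of one way an empirical s.e. of the estimated deviance could be obtained, namely by replicating the whole Monte-Carlo computation; <code>estimate_log_p_i</code> stands for any of the per-individual estimators sketched above, and this replication scheme is an assumption made for illustration, not necessarily what $\monolix$ does.

<pre>
import numpy as np

def estimated_deviance(individuals, estimate_log_p_i, R=10):
    # replicate the full Monte-Carlo estimation R times to get a mean deviance
    # and an empirical standard error
    deviances = np.array([
        -2.0 * sum(estimate_log_p_i(ind) for ind in individuals)
        for _ in range(R)
    ])
    return deviances.mean(), deviances.std(ddof=1) / np.sqrt(R)
</pre>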
{{Example
|title=Example
|text=
A mixture model is used for this example. We do not aim to give the details of the model or the formulas used for implementing the algorithm; we use this example to highlight the importance of the choice of the proposal distribution for estimating the likelihood in complex models.

The three graphics below display the estimation of the deviance ($-2\times {\llike}(\theta;\by)$) by Importance Sampling using different proposal distributions. The estimated deviance is displayed as a function of the Monte-Carlo size.

In this first example, the conditional distributions of the individual parameters $\phi_i$ were estimated using the M-H algorithm, and non-central $t$ distributions with 5 $d.f.$ were used as proposal distributions. There is no bias and an accurate estimation is obtained with a small Monte-Carlo size. The estimated deviance is 14386.8 (s.e. = 0.7). Here, only 100 iterations of the M-H algorithm were enough to correctly estimate the conditional means and variances of the $\phi_i$.

{{ImageWithCaption|image=ll2.png|size=700px|caption= }}

In this second example, only the last 30 iterations of SAEM were used for estimating the conditional means and variances of the $\phi_i$. Then, non-central $t$ distributions with 5 $d.f.$ were used as proposal distributions. There is a bias which decreases very slowly with the Monte-Carlo size and which is not negligible even with $10^5$ simulations: the estimated deviance is 14397.0 (s.e. = 2.7).

{{ImageWithCaption|image=ll1.png|size=700px|caption= }}

In this third example, normal distributions were used as proposals. The parameters of these distributions were the means and variances of the conditional distributions of the $\phi_i$ estimated using the M-H algorithm. Results are similar to those obtained with a $t$ distribution (estimated deviance = 14386.9, s.e. = 1), but more simulations are required to eliminate the bias.
{{ImageWithCaption|image=ll3.png|size=700px|caption= }}
}}
For continuous data models, an alternative to the Importance Sampling approach is to use a linearization of the model, following what was proposed in the [[Estimation of the observed Fisher information matrix#Estimation of the F.I.M. using a linearization of the model|Estimation of the F.I.M. using a linearization of the model]] section for approximating the observed Fisher Information Matrix. Indeed, the marginal distribution of a continuous vector of observations $y_i$ can be approximated by a normal distribution. It is then straightforward to derive the associated likelihood. All the calculations are detailed in the [[#|????]] section.
This method can be much faster than Importance Sampling. It should be used by the modeler for model selection purposes during the first runs, when the goal is to identify significant differences between models. Importance Sampling should be used when a more precise evaluation of the log-likelihood is required.
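As a rough sketch of the idea (the actual calculations are model-specific and detailed in the referenced section): once a linearization yields a normal approximation $y_i \approx {\cal N}(\mu_i,\Gamma_i)$, each individual contribution to the log-likelihood is simply a multivariate normal log-density; the arrays <code>mu_i</code> and <code>Gamma_i</code> below are assumed to come from such a linearization.

<pre>
from scipy.stats import multivariate_normal

def loglik_linearized(y_i, mu_i, Gamma_i):
    # log p(y_i; theta) under the normal approximation obtained by linearizing the model
    return multivariate_normal.logpdf(y_i, mean=mu_i, cov=Gamma_i)

# total log-likelihood: sum of the individual contributions
# LL = sum(loglik_linearized(y[i], mu[i], Gamma[i]) for i in range(N))
</pre>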
</div>