Estimation of the observed Fisher information matrix
<div style="font-size:12pt; font-family:garamond">
==Estimation of the observed F.I.M. using a stochastic approximation==
The ''observed'' Fisher information matrix (F.I.M.) is a function of $\theta$ defined as

{{EquationWithRef
|equation=<div id="eq_fim1"><math>\begin{eqnarray}
I(\theta) &=& -\DDt{\log ({\like}(\theta;\by))} \\
&=& -\DDt{\log (\py(\by;\theta))} .
\end{eqnarray}</math></div>
|reference=(1) }}

Because the likelihood is quite complex, $I(\theta)$ usually has no closed-form expression. It is however possible to estimate it using a stochastic approximation procedure based on <balloon title="Kuhn05: put here the reference!!!" style="color:#177245">Louis' formula</balloon>:

{{Equation1
|equation=<math>\DDt{\log (\py(\by;\theta))} = \esp{\DDt{\log (\pmacro(\by,\bpsi;\theta))} {{!}} \by ;\theta} + \cov{\Dt{\log (\pmacro(\by,\bpsi;\theta))} {{!}} \by ; \theta},
</math> }}

where

{{Equation1
|equation=<math>\begin{eqnarray}
\cov{\Dt{\log (\pmacro(\by,\bpsi;\theta))} {{!}} \by ; \theta} &=&
\esp{ \left(\Dt{\log (\pmacro(\by,\bpsi;\theta))} \right)\left(\Dt{\log (\pmacro(\by,\bpsi;\theta))}\right)^{\transpose} {{!}} \by ; \theta} \\
&& - \esp{\Dt{\log (\pmacro(\by,\bpsi;\theta))} {{!}} \by ; \theta}\esp{\Dt{\log (\pmacro(\by,\bpsi;\theta))} {{!}} \by ; \theta}^{\transpose} .
\end{eqnarray}</math> }}

Thus, $\DDt{\log (\py(\by;\theta))}$ is expressed as a combination of conditional expectations. Each of these conditional expectations can be estimated by Monte Carlo, or equivalently approximated using a stochastic approximation algorithm.

We can then draw a sequence $(\psi_i^{(k)})$ using a Metropolis-Hastings algorithm and estimate the observed F.I.M. online.
At iteration $k$ of the algorithm:


* '''Simulation step''': for $i=1,2,\ldots,N$, draw $\psi_i^{(k)}$ from $m$ iterations of the Metropolis-Hastings algorithm described in [[The Metropolis-Hastings algorithm for simulating the individual parameters| The Metropolis-Hastings algorithm section]] with $\pmacro(\psi_i |y_i ;{\theta})$ as the limit distribution.

* '''Stochastic approximation''': update $D_k$, $G_k$ and $\Delta_k$ according to the following recurrence relations:

{{Equation1
|equation=<math>\begin{eqnarray}
\Delta_k & = & \Delta_{k-1} + \gamma_k \left(\Dt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))} - \Delta_{k-1} \right) \\
D_k & = & D_{k-1} + \gamma_k \left(\DDt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))} - D_{k-1} \right)\\
G_k & = & G_{k-1} + \gamma_k \left((\Dt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))})(\Dt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))})^\transpose -G_{k-1} \right),
\end{eqnarray}</math> }}

where $(\gamma_k)$ is a decreasing sequence of positive numbers such that $\gamma_1=1$, $\sum_{k=1}^{\infty} \gamma_k = \infty$ and $\sum_{k=1}^{\infty} \gamma_k^2 < \infty$.


* '''Estimation step''': update the estimate $H_k$ of the F.I.M. according to

{{Equation1
|equation=<math>H_k = D_k + G_k - \Delta_k \Delta_k^{\transpose}. </math> }}


Implementing this algorithm therefore requires computation of the first and second derivatives of

{{Equation1
|equation=<math>\log (\pmacro(\by,\bpsi;\theta))=\sum_{i=1}^{N} \log (\pmacro(y_i,\psi_i;\theta)).</math> }}

Assume first that the joint distribution of $\by$ and $\bpsi$ decomposes as

{{EquationWithRef
|equation=<div id="eq:fim_dec1"><math>
\pypsi(\by,\bpsi;\theta) = \pcypsi(\by {{!}} \bpsi)\ppsi(\bpsi;\theta).
</math></div>
|reference=(2) }}

This assumption means that for any $i=1,2,\ldots,N$, all of the components of $\psi_i$ are random and there exists a sufficient statistic ${\cal S}(\bpsi)$ for the estimation of $\theta$. It is then sufficient to compute the first and second derivatives of $\log (\pmacro(\bpsi;\theta))$ in order to estimate the F.I.M. This can be done relatively simply in closed form when the individual parameters are normally distributed (or a transformation $h$ of them is).

If some component of $\psi_i$ has no variability, [[#eq:fim_dec1|(2)]] no longer holds, but we can decompose $\theta$ into $(\theta_y,\theta_\psi)$ such that

{{Equation1
|equation=<math>
\pyipsii(y_i,\psi_i;\theta) = \pcyipsii(y_i {{!}} \psi_i ; \theta_y)\ppsii(\psi_i;\theta_\psi).
</math> }}

We then need to compute the first and second derivatives of $\log(\pcyipsii(y_i |\psi_i ; \theta_y))$ and $\log(\ppsii(\psi_i;\theta_\psi))$. Derivatives of $\log(\pcyipsii(y_i |\psi_i ; \theta_y))$ that do not have a closed form expression can be obtained using central differences.
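
As an illustration, such a central difference approximation of the first derivatives might look like the following sketch (Python/NumPy; ''log_pcond'' is a hypothetical function returning $\log(\pcyipsii(y_i |\psi_i ; \theta_y))$ for a given value of $\theta_y$):

<pre>
import numpy as np

def central_diff_gradient(fun, theta, nu=1e-4):
    """Approximate the gradient of the scalar function `fun` at `theta`
    with central differences of step size `nu`."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = nu
        grad[j] = (fun(theta + e) - fun(theta - e)) / (2 * nu)
    return grad

# Hypothetical usage: log_pcond(theta_y) returns log p(y_i | psi_i ; theta_y)
# grad_i = central_diff_gradient(log_pcond, theta_y)
</pre>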


{{Remarks
|title=Remarks
|text=
1. Using $\gamma_k=1/k$ for $k \geq 1$ means that each term is approximated with an empirical mean obtained from $(\bpsi^{(k)}, k \geq 1)$. For instance,

{{EquationWithRef
|equation=<div id="eq:fim_Delta1"><math>\begin{eqnarray}
\Delta_k
&=& \Delta_{k-1} + \displaystyle{ \frac{1}{k} } \left(\Dt{\log (\pmacro(\by,\bpsi^{(k)};\theta))} - \Delta_{k-1} \right)
\end{eqnarray}</math></div>
|reference=(3) }}
{{EquationWithRef
|equation=<div id="eq:fim_Delta2"><math>\begin{eqnarray}
&=& \displaystyle{ \frac{1}{k} }\sum_{j=1}^{k} \Dt{\log (\pmacro(\by,\bpsi^{(j)};\theta))} .
\end{eqnarray}</math></div>
|reference=(4) }}

: [[#eq:fim_Delta1|(3)]] (resp. [[#eq:fim_Delta2|(4)]]) defines $\Delta_k$ using an online (resp. offline) algorithm.
Writing $\Delta_k$ as in [[#eq:fim_Delta1|(3)]] instead of [[#eq:fim_Delta2|(4)]] avoids having to store all simulated sequences $(\bpsi^{(j)}, 1\leq j \leq k)$ when computing $\Delta_k$ (see the short sketch below).


2. This approach is used for computing the F.I.M. $I(\hat{\theta})$ in practice, where $\hat{\theta}$ is the maximum likelihood estimate of $\theta$. The only difference with the [[The Metropolis-Hastings algorithm for simulating the individual parameters|Metropolis-Hastings]] used for SAEM is that the population parameter $\theta$ is not updated and remains fixed at $\hat{\theta}$.
}}
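
To see that the online recursion [[#eq:fim_Delta1|(3)]] and the offline average [[#eq:fim_Delta2|(4)]] coincide when $\gamma_k=1/k$, here is a small numerical check (a sketch in Python/NumPy, using arbitrary simulated values in place of the gradient terms):

<pre>
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 3))   # stand-ins for the gradients at iterations k = 1..100

# Online update (3): Delta_k = Delta_{k-1} + (1/k) * (grad_k - Delta_{k-1})
delta = np.zeros(3)
for k, g in enumerate(grads, start=1):
    delta += (g - delta) / k

# Offline average (4)
delta_offline = grads.mean(axis=0)

assert np.allclose(delta, delta_offline)
</pre>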


{{OutlineText
|text=In summary, for a given estimate $\hat{\theta}$ of the population parameter $\theta$, a stochastic approximation algorithm for estimating the observed Fisher information matrix $I(\hat{\theta})$ consists of:


1. For $i=1,2,\ldots,N$, run a [[The Metropolis-Hastings algorithm for simulating the individual parameters|Metropolis-Hastings algorithm]] to draw a sequence $\psi_i^{(k)}$ with limit distribution $\pmacro(\psi_i {{!}}y_i ;\hat{\theta})$.


2. At iteration $k$ of the Metropolis-Hastings algorithm, compute the first and second derivatives of $\log(\pypsi(\by,\bpsi^{(k)};\hat{\theta}))$.


3. Update $\Delta_k$, $G_k$, $D_k$ and compute an estimate $H_k$ of the F.I.M.
}}
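
The three steps above might be organized as in the following sketch (Python/NumPy). The functions ''draw_psi_mh'', ''grad_logp'' and ''hess_logp'' are hypothetical placeholders for, respectively, one Metropolis-Hastings pass over all individuals and the first and second derivatives of $\log (\pmacro(\by,\bpsi;\theta))$ with respect to $\theta$:

<pre>
import numpy as np

def estimate_fim(theta_hat, psi_init, draw_psi_mh, grad_logp, hess_logp, K=1000):
    """Stochastic approximation of the observed F.I.M. at theta_hat.

    draw_psi_mh(psi, theta) -> new draw of all individual parameters (one MH pass)
    grad_logp(psi, theta)   -> first derivative of log p(y, psi; theta)  (length-p vector)
    hess_logp(psi, theta)   -> second derivative of log p(y, psi; theta) (p x p matrix)
    """
    p = grad_logp(psi_init, theta_hat).size
    Delta, D, G = np.zeros(p), np.zeros((p, p)), np.zeros((p, p))
    psi = psi_init
    for k in range(1, K + 1):
        gamma = 1.0 / k                      # step size, gamma_1 = 1
        psi = draw_psi_mh(psi, theta_hat)    # 1. simulation step
        g = grad_logp(psi, theta_hat)        # 2. derivatives of log p(y, psi; theta)
        Delta += gamma * (g - Delta)         # 3. stochastic approximation step
        D += gamma * (hess_logp(psi, theta_hat) - D)
        G += gamma * (np.outer(g, g) - G)
    H = D + G - np.outer(Delta, Delta)       # estimate of the second derivative of log p(y; theta_hat)
    return -H                                # estimate of I(theta_hat)
</pre>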


{{Example
|title=Example 1
|text=Consider the model

{{Equation1
|equation=<math>\begin{eqnarray}
y_i {{!}} \psi_i &\sim& \pcyipsii(y_i {{!}} \psi_i) \\
h(\psi_i) &\sim_{i.i.d}& {\cal N}( h(\psi_{\rm pop}) , \Omega),
\end{eqnarray}</math> }}

where $\Omega = {\rm diag}(\omega_1^2,\omega_2^2,\ldots,\omega_d^2)$ is a diagonal matrix and $h(\psi_i)=(h_1(\psi_{i,1}), h_2(\psi_{i,2}), \ldots , h_d(\psi_{i,d}) )^{\transpose}$.
The vector of population parameters is $\theta = (\psi_{\rm pop} , \Omega)=(\psi_{ {\rm pop},1},\ldots,\psi_{ {\rm pop},d},\omega_1^2,\ldots,\omega_d^2)$.

Here,

{{Equation1
|equation=<math>
\log (\pyipsii(y_i,\psi_i;\theta)) = \log (\pcyipsii(y_i {{!}} \psi_i)) + \log (\ppsii(\psi_i;\theta)).
</math> }}

Then,

{{Equation1
|equation=<math>\begin{eqnarray}
\Dt{\log (\pyipsii(y_i,\psi_i;\theta))} &=& \Dt{\log (\ppsii(\psi_i;\theta))} \\
\DDt{\log (\pyipsii(y_i,\psi_i;\theta))} &=& \DDt{\log (\ppsii(\psi_i;\theta))} .
\end{eqnarray}</math> }}

More precisely,
{{Equation1
|equation=<math>\begin{eqnarray}
\log (\ppsii(\psi_i;\theta)) &=& -\displaystyle{\frac{d}{2} }\log(2\pi) + \sum_{\iparam=1}^d \log(h_\iparam^{\prime}(\psi_{i,\iparam}))
-\displaystyle{ \frac{1}{2} } \sum_{\iparam=1}^d \log(\omega_\iparam^2)
-\sum_{\iparam=1}^d \displaystyle{ \frac{1}{2\, \omega_\iparam^2} }( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )^2 \\
\partial \log (\ppsii(\psi_i;\theta))/\partial \psi_{ {\rm pop},\iparam} &=&
\displaystyle{\frac{1}{\omega_\iparam^2} }h_\iparam^{\prime}(\psi_{ {\rm pop},\iparam})( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) ) \\
\partial \log (\ppsii(\psi_i;\theta))/\partial \omega^2_{\iparam} &=&
-\displaystyle{ \frac{1}{2\omega_\iparam^2} }
+\displaystyle{\frac{1}{2\, \omega_\iparam^4} }( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )^2 \\
\partial^2 \log (\ppsii(\psi_i;\theta))/\partial \psi_{ {\rm pop},\iparam} \partial \psi_{ {\rm pop},\jparam} &=&
\left\{
\begin{array}{ll}
\left( h_\iparam^{\prime\prime}(\psi_{ {\rm pop},\iparam})( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )- h_\iparam^{\prime \, 2}(\psi_{ {\rm pop},\iparam}) \right)/\omega_\iparam^2 & {\rm if \quad \iparam=\jparam} \\
0 & {\rm otherwise}
\end{array}
\right.
\\
\partial^2 \log (\ppsii(\psi_i;\theta))/\partial \omega^2_{\iparam} \partial \omega^2_{\jparam} &=& \left\{
\begin{array}{ll}
1/(2\omega_\iparam^4) -
( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )^2/\omega_\iparam^6 & {\rm if \quad \iparam=\jparam} \\
0 & {\rm otherwise}
\end{array}
\right.
\\
\partial^2 \log (\ppsii(\psi_i;\theta))/\partial \psi_{ {\rm pop},\iparam} \partial \omega^2_{\jparam} &=& \left\{
\begin{array}{ll}
-h_\iparam^{\prime}(\psi_{ {\rm pop},\iparam})( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )/\omega_\iparam^4 & {\rm if \quad \iparam=\jparam} \\
0 & {\rm otherwise.}
\end{array}
\right.
\end{eqnarray}</math> }}
}}
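
As a concrete illustration of Example 1, the following sketch evaluates the first derivatives above for log-normally distributed individual parameters, i.e. $h_\iparam = \log$ (Python/NumPy; the numerical values of ''psi_i'', ''psi_pop'' and ''omega2'' are arbitrary):

<pre>
import numpy as np

# Example 1 with h_l = log (log-normally distributed individual parameters)
psi_i   = np.array([1.2, 0.8])      # individual parameters (arbitrary values)
psi_pop = np.array([1.0, 1.0])      # population parameters psi_pop,l
omega2  = np.array([0.1, 0.2])      # variances omega_l^2

h       = np.log                    # transformation h
h_prime = lambda x: 1.0 / x         # derivative h'

resid = h(psi_i) - h(psi_pop)       # h_l(psi_i,l) - h_l(psi_pop,l)

# d log p(psi_i; theta) / d psi_pop,l
d_psi_pop = h_prime(psi_pop) * resid / omega2

# d log p(psi_i; theta) / d omega_l^2
d_omega2 = -1.0 / (2 * omega2) + resid**2 / (2 * omega2**2)
</pre>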



{{Example
|title=Example 2
|text= We consider the same model for continuous data, assuming a constant error model and that the variance $a^2$ of the residual error has no variability:

{{Equation1
|equation=<math>\begin{eqnarray}
y_{ij} {{!}} \psi_i &\sim& {\cal N}(f(t_{ij}, \psi_i) \ , \ a^2), \ \ 1 \leq j \leq n_i \\
h(\psi_i) &\sim_{i.i.d}& {\cal N}( h(\psi_{\rm pop}) , \Omega).
\end{eqnarray}</math> }}

Here, $\theta_y=a^2$, $\theta_\psi=(\psi_{\rm pop},\Omega)$ and

{{Equation1
|equation=<math>
\log(\pyipsii(y_i,\psi_i;\theta)) = \log(\pcyipsii(y_i {{!}} \psi_i ; a^2)) + \log(\ppsii(\psi_i;\psi_{\rm pop},\Omega)),
</math> }}

where

{{Equation1
|equation=<math>
\log(\pcyipsii(y_i {{!}} \psi_i ; a^2))
=-\displaystyle{\frac{n_i}{2} }\log(2\pi)- \displaystyle{\frac{n_i}{2} }\log(a^2) - \displaystyle{\frac{1}{2a^2} }\sum_{j=1}^{n_i}(y_{ij} - f(t_{ij}, \psi_i))^2 .
</math> }}

Derivatives of $\log(\pcyipsii(y_i {{!}} \psi_i ; a^2))$ with respect to $a^2$ are straightforward to compute. Derivatives of $\log(\ppsii(\psi_i;\psi_{\rm pop},\Omega))$ with respect to $\psi_{\rm pop}$ and $\Omega$ remain unchanged.
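
For instance, differentiating $\log(\pcyipsii(y_i {{!}} \psi_i ; a^2))$ above with respect to $a^2$ gives

{{Equation1
|equation=<math>\begin{eqnarray}
\partial \log(\pcyipsii(y_i {{!}} \psi_i ; a^2))/\partial a^2 &=& -\displaystyle{\frac{n_i}{2a^2} } + \displaystyle{\frac{1}{2a^4} }\sum_{j=1}^{n_i}(y_{ij} - f(t_{ij}, \psi_i))^2 \\
\partial^2 \log(\pcyipsii(y_i {{!}} \psi_i ; a^2))/\partial (a^2)^2 &=& \displaystyle{\frac{n_i}{2a^4} } - \displaystyle{\frac{1}{a^6} }\sum_{j=1}^{n_i}(y_{ij} - f(t_{ij}, \psi_i))^2 .
\end{eqnarray}</math> }}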
}}




{{Example
|title=Example 3
|text= Consider again the same model for continuous data, assuming now that a subset $\xi$ of the parameters of the structural model has no variability:

{{Equation1
|equation=<math>\begin{eqnarray}
y_{ij} {{!}} \psi_i &\sim& {\cal N}(f(t_{ij}, \psi_i,\xi) \ , \ a^2), \ \ 1 \leq j \leq n_i \\
h(\psi_i) &\sim_{i.i.d}& {\cal N}( h(\psi_{\rm pop}) , \Omega).
\end{eqnarray}</math> }}

As before, let $\psi$ denote the subset of individual parameters with variability. Here, $\theta_y=(\xi,a^2)$, $\theta_\psi=(\psi_{\rm pop},\Omega)$, and

{{Equation1
|equation=<math>
\log(\pcyipsii(y_i {{!}} \psi_i ; \xi,a^2))
=-\displaystyle{\frac{n_i}{2} }\log(2\pi)- \displaystyle{\frac{n_i}{2} }\log(a^2) - \displaystyle{\frac{1}{2 a^2} }\sum_{j=1}^{n_i}(y_{ij} - f(t_{ij}, \psi_i,\xi))^2 .
</math> }}

Derivatives of $\log(\pcyipsii(y_i {{!}} \psi_i ; \xi, a^2))$ with respect to $\xi$ require the derivative of $f$ with respect to $\xi$. These derivatives usually have no closed-form expression; one possibility is to approximate them numerically using finite differences, as in the sketch below.
}}
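
A central difference approximation of $\partial f/\partial \xi$, and hence of the derivative of the conditional log-density with respect to $\xi$, might look like the following sketch (Python/NumPy; ''f'' is assumed to be a function returning the vector of predictions $f(t_{ij}, \psi_i, \xi)$, $1 \leq j \leq n_i$):

<pre>
import numpy as np

def dlogp_dxi(y_i, t_i, psi_i, xi, a2, f, nu=1e-5):
    """Gradient of log p(y_i | psi_i ; xi, a^2) with respect to xi,
    using central differences to approximate df/dxi."""
    xi = np.asarray(xi, dtype=float)
    res = y_i - f(t_i, psi_i, xi)          # residuals y_ij - f(t_ij, psi_i, xi)
    grad = np.zeros_like(xi)
    for m in range(xi.size):
        e = np.zeros_like(xi)
        e[m] = nu
        df_dxi_m = (f(t_i, psi_i, xi + e) - f(t_i, psi_i, xi - e)) / (2 * nu)
        grad[m] = np.sum(res * df_dxi_m) / a2
    return grad
</pre>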


<br>
== Estimation using linearization of the model ==

Consider here a model for continuous data that uses a $\phi$-parametrization for the individual parameters:

{{Equation1
|equation=<math>\begin{eqnarray}
y_{ij} &= & f(t_{ij} , \phi_i) + g(t_{ij} , \phi_i)\teps_{ij} \\
\phi_i &=& \phi_{\rm pop} + \eta_i .
\end{eqnarray}</math> }}

Let $\hphi_i$ be some predicted value of $\phi_i$, such as the estimated mean or estimated mode of the conditional distribution $\pmacro(\phi_i |y_i ; \hat{\theta})$.

We can then choose to linearize the model for the observations $(y_{ij}, 1\leq j \leq n_i)$ of individual $i$ around the vector of predicted individual parameters. Let $\Dphi{f(t , \phi)}$ be the row vector of derivatives of $f(t , \phi)$ with respect to $\phi$. Then,

{{Equation1
|equation=<math>\begin{eqnarray}
y_{ij} &\simeq& f(t_{ij} , \hphi_i) + \Dphi{f(t_{ij} , \hphi_i)} \, (\phi_i - \hphi_i) + g(t_{ij} , \hphi_i)\teps_{ij} \\
&\simeq& f(t_{ij} , \hphi_i) + \Dphi{f(t_{ij} , \hphi_i)} \, (\phi_{\rm pop} - \hphi_i)
+ \Dphi{f(t_{ij} , \hphi_i)} \, \eta_i + g(t_{ij} , \hphi_i)\teps_{ij} .
\end{eqnarray}</math> }}

Then, we can approximate the marginal distribution of the vector $y_i$ as a normal distribution:

{{EquationWithRef
|equation=<div id="eq:fim_approx"><math>
y_{i} \approx {\cal N}\left(f(t_{i} , \hphi_i) + \Dphi{f(t_{i} , \hphi_i)} \, (\phi_{\rm pop} - \hphi_i) ,
\Dphi{f(t_{i} , \hphi_i)} \Omega \Dphi{f(t_{i} , \hphi_i)}^{\transpose} + g(t_{i} , \hphi_i)\Sigma_{n_i} g(t_{i} , \hphi_i)^{\transpose} \right),
</math></div>
|reference=(5) }}

where $\Sigma_{n_i}$ is the variance-covariance matrix of $\teps_{i,1},\ldots,\teps_{i,n_i}$. If the $\teps_{ij}$ are i.i.d., then
$\Sigma_{n_i}$ is the identity matrix.

We can equivalently use the original $\psi$-parametrization and the fact that $\phi_i=h(\psi_i)$. Then,

{{Equation1
|equation=<math>
\Dphi{f(t_{i} , \hphi_i)} = \Dpsi{f(t_{i} , \hpsi_i)} J_h(\hpsi_i)^{\transpose} , </math> }}

where $J_h$ is the Jacobian of $h$.

We can then approximate the observed log-likelihood ${\llike}(\theta) = \log(\like(\theta;\by))=\sum_{i=1}^N \log(\pyi(y_i;\theta))$ using this normal approximation. We can also derive the F.I.M. by computing the matrix of second-order partial derivatives of ${\llike}(\theta)$.
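
As an illustration, the contribution of individual $i$ to ${\llike}(\theta)$ under approximation [[#eq:fim_approx|(5)]] might be computed as in the following sketch (Python/NumPy). The arrays ''f_i'', ''dF_i'' and ''g_i'' are assumed to be precomputed at $\hphi_i$, and the residual term is implemented with $g$ entering as a diagonal matrix:

<pre>
import numpy as np

def loglik_contribution(y_i, f_i, dF_i, g_i, phi_hat_i, phi_pop, Omega, Sigma_ni):
    """Approximate log p(y_i; theta) for one individual using the linearization (5).

    f_i  : vector f(t_ij, phi_hat_i), length n_i
    dF_i : matrix of derivatives of f(t_ij, phi_hat_i) w.r.t. phi, shape (n_i, dim(phi))
    g_i  : vector g(t_ij, phi_hat_i), length n_i (used as a diagonal matrix, an assumption)
    """
    mean = f_i + dF_i @ (phi_pop - phi_hat_i)
    cov = dF_i @ Omega @ dF_i.T + np.diag(g_i) @ Sigma_ni @ np.diag(g_i)
    diff = y_i - mean
    _, logdet = np.linalg.slogdet(cov)
    n_i = y_i.size
    return -0.5 * (n_i * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))
</pre>

Summing these contributions over $i=1,\ldots,N$ gives the approximate ${\llike}(\theta)$, which can then be differentiated numerically as described below.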

Except for very simple models, computing these second-order partial derivatives in closed form is not straightforward. In such cases, finite differences can be used to approximate them numerically. We can use for instance a central difference approximation of the second derivatives of $\llike(\theta)$. To this end, let $\nu>0$. For $j=1,2,\ldots, m$, let $\nu^{(j)}=(\nu^{(j)}_{k}, 1\leq k \leq m)$ be the $m$-vector such that

{{Equation1
|equation=<math>
\nu^{(j)}_{k} = \left\{
\begin{array}{ll}
\nu & {\rm if \quad j= k} \\
0 & {\rm otherwise.}
\end{array}
\right.
</math> }}

Then, for $\nu$ small enough,

{{Equation1
|equation=<math>\begin{eqnarray}
\partial_{\theta_j}{ {\llike}(\theta)} &\approx& \displaystyle{ \frac{ {\llike}(\theta+\nu^{(j)})- {\llike}(\theta-\nu^{(j)})}{2\nu} }
\end{eqnarray}</math> }}
{{EquationWithRef
|equation=<div id="eq:fim_diff"><math>\begin{eqnarray}
\partial^2_{\theta_j,\theta_k}{ {\llike}(\theta)} &\approx& \displaystyle{\frac{ {\llike}(\theta+\nu^{(j)}+\nu^{(k)})- {\llike}(\theta+\nu^{(j)}-\nu^{(k)})
-{\llike}(\theta-\nu^{(j)}+\nu^{(k)})+{\llike}(\theta-\nu^{(j)}-\nu^{(k)})}{4\nu^2} } .
\end{eqnarray}</math></div>
|reference=(6) }}
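
A direct implementation of [[#eq:fim_diff|(6)]] might look like this sketch (Python/NumPy; ''llike'' is a hypothetical function returning the approximate observed log-likelihood for a given $\theta$):

<pre>
import numpy as np

def hessian_central_diff(llike, theta, nu=1e-4):
    """Approximate the matrix of second derivatives of llike at theta using (6)."""
    theta = np.asarray(theta, dtype=float)
    m = theta.size
    H = np.zeros((m, m))
    for j in range(m):
        ej = np.zeros(m); ej[j] = nu
        for k in range(m):
            ek = np.zeros(m); ek[k] = nu
            H[j, k] = (llike(theta + ej + ek) - llike(theta + ej - ek)
                       - llike(theta - ej + ek) + llike(theta - ej - ek)) / (4 * nu**2)
    return H

# The observed F.I.M. is then approximated by -hessian_central_diff(llike, theta_hat).
</pre>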


{{OutlineText
|text=In summary, for a given estimate $\hat{\theta}$ of the population parameter $\theta$, the algorithm for approximating the Fisher information matrix $I(\hat{\theta})$ using a linear approximation of the model consists of:


1. For $i=1,2,\ldots,N$, obtain some estimate $\hpsi_i$ of the individual parameters $\psi_i$ (for example, by averaging the values $\psi_i^{(k)}$ drawn during the final iterations of the SAEM algorithm).


2. For $i=1,2,\ldots,N$, compute $\hphi_i=h(\hpsi_i)$, the mean and the variance of the normal distribution defined in [[#eq:fim_approx|(5)]], and ${\llike}(\theta)$ using this normal approximation.


3. Use [[#eq:fim_diff|(6)]] to approximate the matrix of second-order derivatives of $\llike(\theta)$.
}}

</div>