Estimation of the observed Fisher information matrix

Estimation using stochastic approximation

The observed Fisher information matrix (F.I.M.) is a function of $\theta$ defined as

$\begin{eqnarray} I(\theta) &=& -\DDt{\log ({\like}(\theta;\by))} \\ &=& -\DDt{\log (\py(\by;\theta))} . \end{eqnarray}$

(1)

Due to the likelihood being quite complex, $I(\theta)$ usually has no closed form expression. It is however possible to estimate it using a stochastic approximation procedure based on Louis' formula:

$\DDt{\log (\pmacro(\by,\bpsi;\theta))} = \esp{\DDt{\log (\pmacro(\by,\bpsi;\theta))} | \by ;\theta} + \cov{\Dt{\log (\pmacro(\by,\bpsi;\theta))} | \by ; \theta}, $

where

$\begin{eqnarray} \cov{\Dt{\log (\pmacro(\by,\bpsi;\theta))} | \by ; \theta} &=& \esp{ \left(\Dt{\log (\pmacro(\by,\bpsi;\theta))} \right)\left(\Dt{\log (\pmacro(\by,\bpsi;\theta))}\right)^{\transpose} | \by ; \theta} \\ && - \esp{\Dt{\log (\pmacro(\by,\bpsi;\theta))} | \by ; \theta}\esp{\Dt{\log (\pmacro(\by,\bpsi;\theta))} | \by ; \theta}^{\transpose} . \end{eqnarray}$

Thus, $\DDt{\log (\pmacro(\by,\bpsi;\theta))}$ is defined as a combination of conditional expectations. Each of these conditional expectations can be estimated by Monte Carlo, or equivalently approximated using a stochastic approximation algorithm.

We can then draw a sequence $(\psi_i^{(k)})$ using a Metropolis-Hasting algorithm and estimate the observed F.I.M. online. At iteration $k$ of the algorithm:

Simulation step: for $i=1,2,\ldots,N$, draw $\psi_i^{(k)}$ from $m$ iterations of the Metropolis-Hastings algorithm described in The Metropolis-Hastings algorithm section with $\pmacro(\psi_i |y_i ;{\theta})$ as the limit distribution.

Stochastic approximation: update $D_k$, $G_k$ and $\Delta_k$ according to the following recurrence relations:

$\begin{eqnarray} \Delta_k & = & \Delta_{k-1} + \gamma_k \left(\Dt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))} - \Delta_{k-1} \right) \\ D_k & = & D_{k-1} + \gamma_k \left(\DDt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))} - D_{k-1} \right)\\ G_k & = & G_{k-1} + \gamma_k \left((\Dt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))})(\Dt{\log (\pmacro(\by,\bpsi^{(k)};{\theta}))})^\transpose -G_{k-1} \right), \end{eqnarray}$

where $(\gamma_k)$ is a decreasing sequence of positive numbers such that $\gamma_1=1$, $ \sum_{k=1}^{\infty} \gamma_k = \infty$, and $\sum_{k=1}^{\infty} \gamma_k^2 < \infty$.

Estimation step: update the estimate $H_k$ of the F.I.M. according to

$H_k = D_k + G_k - \Delta_k \Delta_k^{\transpose}. $

Implementing this algorithm therefore requires computation of the first and second derivatives of

$\log (\pmacro(\by,\bpsi;\theta))=\sum_{i=1}^{N} \log (\pmacro(y_i,\psi_i;\theta)).$

Assume first that the joint distribution of $\by$ and $\bpsi$ decomposes as

$ \pypsi(\by,\bpsi;\theta) = \pcypsi(\by | \bpsi)\ppsi(\bpsi;\theta). $

(2)

This assumption means that for any $i=1,2,\ldots,N$, all of the components of $\psi_i$ are random and there exists a sufficient statistic ${\cal S}(\bpsi)$ for the estimation of $\theta$. It is then sufficient to compute the first and second derivatives of $\log (\pmacro(\bpsi;\theta))$ in order to estimate the F.I.M. This can be done relatively simply in closed form when the individual parameters are normally distributed (or a transformation $h$ of them is).

If some component of $\psi_i$ has no variability, (2) no longer holds, but we can decompose $\theta$ into $(\theta_y,\theta_\psi)$ such that

$ \pyipsii(y_i,\psi_i;\theta) = \pcyipsii(y_i | \psi_i ; \theta_y)\ppsii(\psi_i;\theta_\psi). $

We then need to compute the first and second derivatives of $\log(\pcyipsii(y_i |\psi_i ; \theta_y))$ and $\log(\ppsii(\psi_i;\theta_\psi))$. Derivatives of $\log(\pcyipsii(y_i |\psi_i ; \theta_y))$ that do not have a closed form expression can be obtained using central differences.

Remarks

1. Using $\gamma_k=1/k$ for $k \geq 1$ means that each term is approximated with an empirical mean obtained from $(\bpsi^{(k)}, k \geq 1)$. For instance,

$\begin{eqnarray} \Delta_k &=& \Delta_{k-1} + \displaystyle{ \frac{1}{k} } \left(\Dt{\log (\pmacro(\by,\bpsi^{(k)};\theta))} - \Delta_{k-1} \right) \end{eqnarray}$

(3)

$\begin{eqnarray} &=& \displaystyle{ \frac{1}{k} }\sum_{j=1}^{k} \Dt{\log (\pmacro(\by,\bpsi^{(j)};\theta))} . \end{eqnarray}$

(4)

(3) (resp. (4)) defines $\Delta_k$ using an online (resp. offline) algorithm. Writing $\Delta_k$ as in (3) instead of (4) avoids having to store all simulated sequences $(\bpsi^{(j)}, 1\leq j \leq k)$ when computing $\Delta_k$.

2. This approach is used for computing the F.I.M. $I(\hat{\theta})$ in practice, where $\hat{\theta}$ is the maximum likelihood estimate of $\theta$. The only difference with the Metropolis-Hastings used for SAEM is that the population parameter $\theta$ is not updated and remains fixed at $\hat{\theta}$.

In summary, for a given estimate $\hat{\theta}$ of the population parameter $\theta$, a stochastic approximation algorithm for estimating the observed Fisher Information Matrix $I(\hat{\theta)}$ consists of:

1. For $i=1,2,\ldots,N$, run a Metropolis-Hastings algorithm to draw a sequence $\psi_i^{(k)}$ with limit distribution $\pmacro(\psi_i |y_i ;\hat{\theta})$.

2. At iteration $k$ of the Metropolis-Hastings algorithm, compute the first and second derivatives of $\pypsi(\by,\bpsi^{(k)};\hat{\theta})$.

3.Update $\Delta_k$, $G_k$, $D_k$ and compute an estimate $H_k$ of the F.I.M.

Example 1

Consider the model

$\begin{eqnarray} y_i | \psi_i &\sim& \pcyipsii(y_i | \psi_i) \\ h(\psi_i) &\sim_{i.i.d}& {\cal N}( h(\psi_{\rm pop}) , \Omega), \end{eqnarray}$

where $\Omega = {\rm diag}(\omega_1^2,\omega_2^2,\ldots,\omega_d^2)$ is a diagonal matrix and $h(\psi_i)=(h_1(\psi_{i,1}), h_2(\psi_{i,2}), \ldots , h_d(\psi_{i,d}) )^{\transpose}$. The vector of population parameters is $\theta = (\psi_{\rm pop} , \Omega)=(\psi_{ {\rm pop},1},\ldots,\psi_{ {\rm pop},d},\omega_1^2,\ldots,\omega_d^2)$.

Here,

$ \log (\pyipsii(y_i,\psi_i;\theta)) = \log (\pcyipsii(y_i | \psi_i)) + \log (\ppsii(\psi_i;\theta)). $

Then,

$\begin{eqnarray} \Dt{\log (\pyipsii(y_i,\psi_i;\theta))} &=& \Dt{\log (\ppsii(\psi_i;\theta))} \\ \DDt{\log (\pyipsii(y_i,\psi_i;\theta))} &=& \DDt{\log (\ppsii(\psi_i;\theta))} . \end{eqnarray}$

More precisely,

\(\begin{eqnarray} \log (\ppsii(\psi_i;\theta)) &=& -\displaystyle{\frac{d}{2} }\log(2\pi) + \sum_{\iparam=1}^d \log(h_\iparam^{\prime}(\psi_{i,\iparam})) -\displaystyle{ \frac{1}{2} } \sum_{\iparam=1}^d \log(\omega_\iparam^2) -\sum_{\iparam=1}^d \displaystyle{ \frac{1}{2\, \omega_\iparam^2} }( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )^2 \\ \partial \log (\ppsii(\psi_i;\theta))/\partial \psi_{ {\rm pop},\iparam} &=& \displaystyle{\frac{1}{\omega_\iparam^2} }h_\iparam^{\prime}(\psi_{ {\rm pop},\iparam})( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) ) \\ \partial \log (\ppsii(\psi_i;\theta))/\partial \omega^2_{\iparam} &=& -\displaystyle{ \frac{1}{2\omega_\iparam^2} } +\displaystyle{\frac{1}{2\, \omega_\iparam^4} }( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )^2 \\ \partial^2 \log (\ppsii(\psi_i;\theta))/\partial \psi_{ {\rm pop},\iparam} \partial \psi_{ {\rm pop},\jparam} &=& \left\{ \begin{array}{ll} \left( h_\iparam^{\prime\prime}(\psi_{ {\rm pop},\iparam})( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )- h_\iparam^{\prime \, 2}(\psi_{ {\rm pop},\iparam}) \right)/\omega_\iparam^2 & {\rm if \quad } \iparam=\jparam \\ 0 & {\rm otherwise} \end{array} \right. \\ \partial^2 \log (\ppsii(\psi_i;\theta))/\partial \omega^2_{\iparam} \partial \omega^2_{\jparam} &=& \left\{ \begin{array}{ll} 1/(2\omega_\iparam^4) - ( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )^2/\omega_\iparam^6 & {\rm if \quad} \iparam=\jparam \\ 0 & {\rm otherwise} \end{array} \right. \\ \partial^2 \log (\ppsii(\psi_i;\theta))/\partial \psi_{ {\rm pop},\iparam} \partial \omega^2_{\jparam} &=& \left\{ \begin{array}{ll} -h_\iparam^{\prime}(\psi_{ {\rm pop},\iparam})( h_\iparam(\psi_{i,\iparam}) - h_\iparam(\psi_{ {\rm pop},\iparam}) )/\omega_\iparam^4 & {\rm if \quad} \iparam=\jparam \\ 0 & {\rm otherwise.} \end{array} \right. \end{eqnarray}\)

Example 2

We consider the same model for continuous data, assuming a constant error model and that the variance $a^2$ of the residual error has no variability:

$\begin{eqnarray} y_{ij} | \psi_i &\sim& {\cal N}(f(t_{ij}, \psi_i) \ , \ a^2), \ \ 1 \leq j \leq n_i \\ h(\psi_i) &\sim_{i.i.d}& {\cal N}( h(\psi_{\rm pop}) , \Omega). \end{eqnarray}$

Here, $\theta_y=a^2$, $\theta_\psi=(\psi_{\rm pop},\Omega)$ and

$ \log(\pyipsii(y_i,\psi_i;\theta)) = \log(\pcyipsii(y_i | \psi_i ; a^2)) + \log(\ppsii(\psi_i;\psi_{\rm pop},\Omega)), $

where

$ \log(\pcyipsii(y_i | \psi_i ; a^2)) =-\displaystyle{\frac{n_i}{2} }\log(2\pi)- \displaystyle{\frac{n_i}{2} }\log(a^2) - \displaystyle{\frac{1}{2a^2} }\sum_{j=1}^{n_i}(y_{ij} - f(t_{ij}, \psi_i))^2 . $

Derivatives of $\log(\pcyipsii(y_i | \psi_i ; a^2))$ with respect to $a^2$ are straightforward to compute. Derivatives of $\log(\ppsii(\psi_i;\psi_{\rm pop},\Omega))$ with respect to $\psi_{\rm pop}$ and $\Omega$ remain unchanged.

Example 3

Consider again the same model for continuous data, assuming now that a subset $\xi$ of the parameters of the structural model has no variability:

$\begin{eqnarray} y_{ij} | \psi_i &\sim& {\cal N}(f(t_{ij}, \psi_i,\xi) \ , \ a^2), \ \ 1 \leq j \leq n_i \\ h(\psi_i) &\sim_{i.i.d}& {\cal N}( h(\psi_{\rm pop}) , \Omega). \end{eqnarray}$

Let $\psi$ remain as the subset of individual parameters with variability. Here, $\theta_y=(\xi,a^2)$, $\theta_\psi=(\psi_{\rm pop},\Omega)$, and

$ \log(\pcyipsii(y_i | \psi_i ; \xi,a^2)) =-\displaystyle{\frac{n_i}{2} }\log(2\pi)- \displaystyle{\frac{n_i}{2} }\log(a^2) - \displaystyle{\frac{1}{2 a^2} }\sum_{j=1}^{n_i}(y_{ij} - f(t_{ij}, \psi_i,\xi))^2 . $

Derivatives of $\log(\pcyipsii(y_i | \psi_i ; \xi, a^2))$ with respect to $\xi$ require computation of the derivative of $f$ with respect to $\xi$. These derivatives are usually not calculable. One possibility is to numerically approximate them using finite differences.

Estimation using linearization of the model

Consider here a model for continuous data that uses a $\phi$-parametrization for the individual parameters:

$\begin{eqnarray} y_{ij} &= & f(t_{ij} , \phi_i) + g(t_{ij} , \phi_i)\teps_{ij} \\ \phi_i &=& \phi_{\rm pop} + \eta_i . \end{eqnarray}$

Let $\hphi_i$ be some predicted value of $\phi_i$, such as for instance the estimated mean or estimated mode of the conditional distribution $\pmacro(\phi_i |y_i ; \hat{\theta})$.

We can then choose to linearize the model for the observations $(y_{ij}, 1\leq j \leq n_i)$ of individual $i$ around the vector of predicted individual parameters. Let $\Dphi{f(t , \phi)}$ be the row vector of derivatives of $f(t , \phi)$ with respect to $\phi$. Then,

$\begin{eqnarray} y_{ij} &\simeq& f(t_{ij} , \hphi_i) + \Dphi{f(t_{ij} , \hphi_i)} \, (\phi_i - \hphi_i) + g(t_{ij} , \hphi_i)\teps_{ij} \\ &\simeq& f(t_{ij} , \hphi_i) + \Dphi{f(t_{ij} , \hphi_i)} \, (\phi_{\rm pop} - \hphi_i) + \Dphi{f(t_{ij} , \hphi_i)} \, \eta_i + g(t_{ij} , \hphi_i)\teps_{ij} . \end{eqnarray}$

Then, we can approximate the marginal distribution of the vector $y_i$ as a normal distribution:

REDIRECTION Modèle:EquationWithRef

where $\Sigma_{n_i}$ is the variance-covariance matrix of $\teps_{i,1},\ldots,\teps_{i,n_i}$. If the $\teps_{ij}$ are i.i.d., then $\Sigma_{n_i}$ is the identity matrix.

We can equivalently use the original $\psi$-parametrization and the fact that $\phi_i=h(\psi_i)$. Then,

$ \Dphi{f(t_{i} , \hphi_i)} = \Dpsi{f(t_{i} , \hpsi_i)} J_h(\hpsi_i)^{\transpose} , $

where $J_h$ is the Jacobian of $h$.

We then can approximate the observed log-likelihood ${\llike}(\theta) = \log(\like(\theta;\by))=\sum_{i=1}^N \log(\pyi(y_i;\theta))$ using this normal approximation. We can also derive the F.I.M. by computing the matrix of second-order partial derivatives of ${\llike}(\theta)$.

Except for very simple models, computing these second-order partial derivatives in closed form is not straightforward. In such cases, finite differences can be used for numerically approximating them. We can use for instance a central difference approximation of the second derivative of $\llike(\theta)$. To this end, let $\nu>0$. For $j=1,2,\ldots, m$, let $\nu^{(j)}=(\nu^{(j)}_{k}, 1\leq k \leq m)$ be the $m$-vector such that

$ \nu^{(j)}_{k} = \left\{ \begin{array}{ll} \nu & {\rm if \quad j= k} \\ 0 & {\rm otherwise.} \end{array} \right. $

Then, for $\nu$ small enough,

$\begin{eqnarray} \partial_{\theta_j}{ {\llike}(\theta)} &\approx& \displaystyle{ \frac{ {\llike}(\theta+\nu^{(j)})- {\llike}(\theta-\nu^{(j)})}{2\nu} } \\ \end{eqnarray}$

$\begin{eqnarray} \partial^2_{\theta_j,\theta_k}{ {\llike}(\theta)} &\approx& \displaystyle{\frac{ {\llike}(\theta+\nu^{(j)}+\nu^{(k)})- {\llike}(\theta+\nu^{(j)}-\nu^{(k)}) -{\llike}(\theta-\nu^{(j)}+\nu^{(k)})+{\llike}(\theta-\nu^{(j)}-\nu^{(k)})}{4\nu^2} } . \end{eqnarray}$

(6)

In summary, for a given estimate $\hat{\theta}$ of the population parameter $\theta$, the algorithm for approximating the Fisher Information Matrix $I(\hat{\theta)}$ using a linear approximation of the model consists of:

1. For $i=1,2,\ldots,N$, obtain some estimate $(\hpsi_i)$ of the individual parameters $(\psi_i)$ (we can average for example the final terms of the sequence $(\psi_i^{(k)})$ drawn during the final iterations of the SAEM algorithm).

2. For $i=1,2,\ldots,N$, compute $\hphi_i=h(\hpsi_i)$, the mean and the variance of the normal distribution defined in (5), and ${\llike}(\theta)$ using this normal approximation.

3. Use (6) to approximate the matrix of second-order derivatives of ${\llike}(\theta)$.


The Metropolis-Hastings algorithm for simulating the individual parameters	Estimation of the log-likelihood

Estimation of the observed Fisher information matrix

Estimation using stochastic approximation

Estimation using linearization of the model

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

WikiPopix

Introduction

Models

Tasks & Tools

Methods

Download files

Tools