Estimation
Contents
Introduction
In the modeling context, we usually assume that we have data that includes observations $\by$, measurement times $\bt$ and possibly additional regression variables $\bx$. There may also be individual covariates $\bc$, and in pharmacological applications the dose regimen $\bu$. For clarity, in the following notation we will omit the design variables $\bt$, $\bx$ and $\bu$, and the covariates $\bc$.
Here, we find ourselves in the classical framework of incomplete data models. Indeed, only $\by = (y_{ij})$ is observed in the joint model $\pypsi(\by,\bpsi;\theta)$.
Estimation tasks are common ones seen in statistics:
 Estimate the population parameter $\theta$ using the available observations and possibly a priori information that is available.
 Evaluate the precision of the proposed estimates.
 Reconstruct missing data, here being the individual parameters $\bpsi=(\psi_i, 1\leq i \leq N)$.
 Estimate the loglikelihood for a given model, i.e., for a given joint distribution $\qypsi$ and value of $\theta$.
Maximum likelihood estimation of the population parameters
Definitions
Maximum likelihood estimation consists of maximizing with respect to $\theta$ the observed likelihood defined by:
Maximum likelihood estimation of the population parameter $\theta$ requires:
 A model, i.e., a joint distribution $\qypsi$. Depending on the software used, the model can be implemented using a script or a graphical user interface. $\monolix$ is extremely flexible and allows us to combine both. It is possible for instance to code the structural model using $\mlxtran$ and use the GUI for implementing the statistical model. Whatever the options selected, the complete model can always be saved as a text file.
 Inputs $\by$, $\bc$, $\bu$ and $\bt$. All of these variables tend to be stored in a unique data file (see the Data Exploration Section).
 An algorithm which allows us to maximize $\int \pypsi(\by,\bpsi ;\theta) \, d \bpsi$ with respect to $\theta$. Each software package has its own algorithms implemented. It is not our goal here to rate and compare the various algorithms and implementations. We will use exclusively the SAEM algorithm as described in The SAEM algorithm and implemented in $\monolix$ as we are entirely satisfied by both its theoretical and practical qualities:
 The algorithms implemented in $\monolix$ including SAEM and its extensions (mixture models, hidden Markov models, SDEbased model, censored data, etc.) have been published in statistical journals. Furthermore, convergence of SAEM has been rigorously proved.
 The SAEM implementation in $\monolix$ is extremely efficient for a wide variety of complex models.
 The SAEM implementation in $\monolix$ was done by the same group that proposed the algorithm and studied in detail its theoretical and practical properties.
Example
Let us again look at the model used in the Model Visualization Section. For the case of a unique dose $D$ given at time $t=0$, the structural model is written:
where $Cc$ is the concentration in the central compartment and $h$ the hazard function for the event of interest (hemorrhaging). Supposing a constant error model for the concentration, the model for the observations can be easily implemented using $\mlxtran$.
Here, amtDose is a reserved keyword for the last administered dose.
The model's parameters are the absorption rate constant $ka$, the volume of distribution $V$, the clearance $Cl$, the baseline hazard $h_0$ and the coefficient $\gamma$. The statistical model for the individual parameters can be defined in the $\monolix$ project file (left) and/or the $\monolix$ GUI (right):
Once the model is implemented, tasks such as maximum likelihood estimation can be performed using the SAEM algorithm. Certain settings in SAEM must be provided by the user. Even though SAEM is quite insensitive to the initial parameter values, it is possible to perform a preliminary sensitivity analysis in order to select "good" initial values.
Looking for good initial values for SAEM

Then, when we run SAEM, it converges easily and quickly to the MLE:
Parameter estimation can therefore be seen as estimating the reference values and variance of the random effects.
In addition to these numbers, it is important to be able to graphically represent these distributions in order to see them and therefore understand them better. In effect, the interpretation of certain parameters is not always simple. Of course, we know what a normal distribution represents and in particular its mean, median and mode, which are equal (see the distribution of $Cl$ below for instance). These measures of central tendency can be different among themselves for other asymmetric distributions such as the lognormal (see the distribution of $ka$).
Interpreting dispersion terms like $\omega_{ka}$ and $\omega_{V}$ is not obvious either when the parameter distributions are not normal. In such cases, quartiles or quantiles of order 5% and 95% (for example) may be useful for quantitively describing the variability of these parameters.
Estimation of the population distributions of the individual parameters of the model

Bayesian estimation of the population parameters
The Bayesian approach considers $\theta$ as a random vector with a prior distribution $\qth$. We can then define the posterior distribution of $\theta$:
We can estimate this conditional distribution and derive any statistics (posterior mean, standard deviation, percentiles, etc.) or derive the socalled Maximum a Posteriori (MAP) estimate of $\theta$:
The MAP estimate therefore maximizes a penalized version of the observed likelihood. In other words, maximum a posteriori estimation reduces to penalized maximum likelihood estimation. Suppose for instance that $\theta$ is a scalar parameter and the prior is a normal distribution with mean $\theta_0$ and variance $\gamma^2$. Then, the MAP estimate minimizes
The MAP estimate is a tradeoff between the MLE which maximizes ${\llike}(\theta ; \by)$ and $\theta_0$ which minimizes $(\theta  \theta_0)^2$. The weight given to the prior directly depends on the variance of the prior distribution: the smaller $\gamma^2$ is, the closer to $\theta_0$ the MAP is. The limiting distribution considers that $\gamma^2=0$: this prior means here that $\theta$ is fixed as $\theta_0$ and no longer needs to be estimated.
Both the Bayesian and frequentist approaches have their supporters and detractors. But rather than being dogmatic and blindly following the same rulebook every time, we need to be pragmatic and ask the right methodological questions when confronted with a new problem.
We have to remember that Bayesian methods have been extremely successful, in particular for numerical calculations. For instance, (Bayesian) MCMC methods allow us to estimate more or less any conditional distribution coming from any hierarchical model, whereas frequentist approaches such as maximum likelihood estimation can be much more difficult to implement.
All things said, the problem comes down to knowing whether the data contains sufficient information to answer a given question, and whether some other information may be available to help answer it. This is the essence of the art of modeling: finding the right compromise between the confidence we have in the data and prior knowledge of the problem. Each problem is different and requires a specific approach. For instance, if all the patients in a pharmacokinetic trial have essentially the same weight, it is pointless to estimate a relationship between weight and the model's PK parameters using the trial data. In this case, the modeler would be better served trying to use prior information based on physiological criteria rather than just a statistical model.
Therefore, we can use information available to us, of course! Why not? But this information needs to be pertinent. Systematically using a prior for the parameters is not always meaningful. Can we reasonable suppose that we have access to such information? For continuous data for example, what does putting a prior on the residual error model's parameters mean in reality? A reasoned statistical approach consists of only including prior information for certain parameters (those for which we have real prior information) and having confidence in the data for the others.
$\monolix$ allows this hybrid approach which reconciles the Bayesian and frequentist approaches. A given parameter can be:
 a fixed constant if we have absolute confidence in its value or the data does not allow it to be estimated, essentially due to identifiability constraints.
 estimated by maximum likelihood, either because we have great confidence in the data or have no information on the parameter.
 estimated by introducing a prior and calculating the MAP estimate.
 estimated by introducing a prior and then estimating the posterior distribution.
We put aside dealing with the fixed components of $\theta$ in the following. Here are some possible situations:
 Combined maximum likelihood and maximum a posteriori estimation: decompose $\theta$ into $(\theta_E,\theta_{M})$ where $\theta_E$ are the components of $\theta$ to be estimated with MLE and $\theta_{M}$ those with a prior distribution whose posterior distribution is to be maximized. Then, $(\hat{\theta}_E , \hat{\theta}_{M} )$ below maximizes the penalized likelihood of $(\theta_E,\theta_{M})$:
 Combined maximum likelihood and posterior distribution estimation: here, decompose $\theta$ into $(\theta_E,\theta_{R})$ where $\theta_E$ are the components of $\theta$ to be estimated with MLE and $\theta_{R}$ those with a prior distribution whose posterior distribution is to be estimated. We propose the following strategy for estimating $\theta_E$ and $\theta_{R}$:
 Compute the maximum likelihood of $\theta_E$:
 Estimate the conditional distribution $\pmacro(\theta_{R}  \by ;\hat{\theta}_E)$.
where ${\llike} (\theta_E , \theta_{M}; \by) \ \ \eqdef \ \ \log\left(\py(\by  \theta_{M}; \theta_E)\right).$
It is then straightforward to extend this approach to more complex situations where some components of $\theta$ are estimated with MLE, others using MAP estimation and others still by estimating their conditional distributions.
Estimation of the Fisher information matrix
The variance of the estimator $\thmle$ and thus confidence intervals can be derived from the observed Fisher information matrix (F.I.M.), which itself is calculated using the observed likelihood (i.e., the pdf of the observations $\by$):
\(
\ofim(\thmle ; \by) \ \ \eqdef \ \  \displaystyle{ \frac{\partial^2}{\partial \theta^2} }\log({\like}(\thmle ; \by)) .
\)

(1) 
Then, the variancecovariance matrix of the maximum likelihood estimator $\thmle$ can be estimated by the inverse of the observed F.I.M. Standard errors (s.e.) for each component of $\thmle$ are their standard deviations, i.e., the squareroot of the diagonal elements of this covariance matrix. $\monolix$ also displays the (estimated) relative standard errors (r.s.e.), i.e., the (estimated) standard error divided by the value of the estimated parameter.
The F.I.M. can be used for detecting overparametrization of the structural model. In effect, if the model is poorly identifiable, certain estimators will be quite correlated and the F.I.M. will therefore be poorly conditioned and difficult to inverse. Suppose for example that we want to fit a two compartment PK model to the same data as before. The output is shown below. The large values for the relative standard errors for the intercompartmental clearance $Q$ and the volume of the peripheral compartment $V_2$ mean that the data does not allow us to estimate well these two parameters.
The Fisher information criteria is also widely used in optimal experimental design. Indeed, minimizing the variance of the estimator corresponds to maximizing the information. Then, estimators and designs can be evaluated by looking at certain summary statistics of the covariance matrix (like the determinant or trace for instance).
Estimation of the individual parameters
Once $\theta$ has been estimated, the conditional distribution $\pmacro(\psi_i  y_i ; \hat{\theta})$ of the individual parameters $\psi_i$ can be estimated for each individual $i$ using the MetropolisHastings algorithm. For each $i$, this algorithm generates a sequence $(\psi_i^{k}, k \geq 1)$ which converges in distribution to the conditional distribution $\pmacro(\psi_i  y_i ; \hat{\theta})$ and that can be used for estimating any summary statistic of this distribution (mean, standard deviation, quantiles, etc.).
The mode of this conditional distribution can be estimated using this sequence or by maximizing $\pmacro(\psi_i  y_i ; \hat{\theta})$ using numerical methods.
The choice of using the conditional mean or the conditional mode is arbitrary. By default, $\monolix$ uses the conditional mode, taking the philosophy that the "most likely" values of the individual parameters are the most suited for computing the "most likely" predictions.
Predicted concentrations for 6 individuals using the estimated conditional modes of the individual PK parameters

Estimation of the observed loglikelihood
Once $\theta$ has been estimated, the observed loglikelihood of $\hat{\theta}$ is defined as
The observed loglikelihood cannot be computed in closed form for nonlinear mixed effects models, but can be estimated using the methods described in the Estimation of the loglikelihood Section. The estimated loglikelihood can then be used for performing likelihood ratio tests and for computing information criteria such as AIC and BIC (see the Model evaluation Section).
Bibliography
Lixoft  Monolix 4.2
 http://www.lixoft.eu/products/monolix/productmonolixoverview ,2012
 BibtexAuthor : Lixoft
Title : Monolix 4.2
In : http://www.lixoft.eu/products/monolix/productmonolixoverview 
Address :
Date : 2012
Comets, E., Lavenu, A., Lavielle, M.  saemix: Stochastic Approximation Expectation Maximization (SAEM) algorithm. R package version 0.96.1
 http://cran.rproject.org/web/packages/saemix/index.html ,2013
 BibtexAuthor : Comets, E., Lavenu, A., Lavielle, M.
Title : saemix: Stochastic Approximation Expectation Maximization (SAEM) algorithm. R package version 0.96.1
In : http://cran.rproject.org/web/packages/saemix/index.html 
Address :
Date : 2013
The MathWorks  nlmefitsa: fit nonlinear mixedeffects model with stochastic EM algorithm. Matlab R2013a function
 http://www.mathworks.fr/fr/help/stats/nlmefitsa.html ,2013
 BibtexAuthor : The MathWorks
Title : nlmefitsa: fit nonlinear mixedeffects model with stochastic EM algorithm. Matlab R2013a function
In : http://www.mathworks.fr/fr/help/stats/nlmefitsa.html 
Address :
Date : 2013
Beal, S.L., Sheiner, L.B., Boeckmann, A., Bauer, R.J.  NONMEM users guides
 San Francisco, NONMEM Project Group, University of California ,1992
 BibtexAuthor : Beal, S.L., Sheiner, L.B., Boeckmann, A., Bauer, R.J.
Title : NONMEM users guides
In : San Francisco, NONMEM Project Group, University of California 
Address :
Date : 1992
Pinheiro, J.C., Bates, D.M.  Mixed effects models in S and SPLUS
 Springer Verlag,2000
 BibtexAuthor : Pinheiro, J.C., Bates, D.M.
Title : Mixed effects models in S and SPLUS
In : 
Address :
Date : 2000
Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D.  the R Core team (2009) nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.196
 R Foundation for Statistical Computing, Vienna ,2010
 BibtexAuthor : Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D.
Title : the R Core team (2009) nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.196
In : R Foundation for Statistical Computing, Vienna 
Address :
Date : 2010
Spiegelhalter, D., Thomas, A., Best, N., Lunn, D.  WinBUGS user manual
 Cambridge: MRC Biostatistics Unit ,2003
 BibtexAuthor : Spiegelhalter, D., Thomas, A., Best, N., Lunn, D.
Title : WinBUGS user manual
In : Cambridge: MRC Biostatistics Unit 
Address :
Date : 2003
SPSS  Linear mixedeffects modeling in SPSS. An introduction to the MIXED procedure
SAS  The NLMMIXED procedure, SAS/STAT 9.2 User's Guide