Popix wiki — user contributions by Marc.

Overview (revision of 2013-06-04)

<div>
$
\def\simulix{\mathsf{simulix} }
$

The desire to model a biological or physical phenomenon often arises when we are able to record observations arising from that phenomenon. Nothing is more natural, therefore, than to begin this introduction by looking at some observed data.

{{ExampleWithImage
|text= This first plot displays the viral load of four patients with hepatitis C who started a treatment at time $t=0$.
|image = NEWintro1.png
}}

{{ExampleWithImage
|text= This second example involves weight data for rats measured over 14 weeks, in a sub-chronic toxicity study related to genetically modified corn.
|image = NEWintro2.png}}

{{ExampleWithImage
|text= In this third example, the data are fluorescence intensities measured over time in a cellular biology experiment.
|image = NEWintro3.png}}

{{ExampleWithImage
|text= Note that repeated measurements are not necessarily functions of time. For example, we may be interested in corn production as a function of fertilizer quantity.
|image = NEWintro4.png}}

Even though these examples come from quite different domains, in each case the data consists of repeated measurements on several individuals from a population. What we will call a "population approach" is therefore relevant for characterizing and modeling such data.
The modeling goal is thus twofold: first, to characterize the biological or physical phenomena observed for each individual, and second, to characterize the variability seen between individuals.

In the rat example, the model needs to combine a growth model that describes how a rat's weight increases with time with a statistical model that describes why these kinetics vary from one rat to another. The goal is to end up with a "typical" curve for the population (in red) and to be able to explain the variability of the individuals' curves (in green) around this population curve.

::[[File:NEWintro5.png]]

The model will explain some of this variability through individual covariates such as sex or diet (rats 1 and 3 are male while rats 2 and 4 are female), but some of the variability will remain unexplained and will be treated as random. Integrating into the same model some effects considered fixed and others considered random leads naturally to the use of mixed-effects models.

An alternative yet equivalent approach considers the model as hierarchical: each curve is described by a single model, and the variability between individual models is described by a population model. For parametric models, this means that the observations for a given individual are described by an observation model that depends on a vector of individual parameters: this is the classic individual approach. The population approach is then a direct extension of [[The individual approach|the individual approach]]: we add a component to the model that describes the variability of the individual parameters within the population.

A model can thus be seen as a [[What is a model? A joint probability distribution!
| joint probability distribution]], which can easily be extended to the case where other variables in the model are also considered random: covariates, population parameters, the design, etc. The hierarchical structure of the model leads to a natural decomposition of the joint distribution into a product of conditional and marginal distributions.

Models for [[Modeling the individual parameters|individual parameters]] and models for [[Modeling the observations|observations]] are described in the [[Introduction_%26_notation|Models]] chapter. In particular, models for [[Continuous data models|continuous observations]], [[Model for categorical data|categorical data]], [[Models for count data|count data]] and [[Models for time-to-event data|survival data]] are presented and illustrated with various examples. Extensions to [[Mixture models|mixture models]], [[Hidden Markov models|hidden Markov models]] and [[Stochastic differential equations based models|stochastic differential equation based models]] are also presented.

The Tasks &amp; Tools chapter presents practical examples of using these models: [[Visualization|exploration and visualization]], [[Estimation|estimation]], [[Model evaluation#Model diagnostics|model diagnostics]], [[Model evaluation#Model selection|model selection]] and [[Simulation|simulation]]. All the proposed approaches and methods are rigorously detailed in the [[Introduction and notation|Methods]] chapter.

The main purpose of a model is to be used. Mathematical modeling and statistics are useful tools for many disciplines (biology, agronomy, environmental studies, pharmacology, etc.), but it is important that these tools be used properly.
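The hierarchical model just described lends itself to a simple numerical sketch. The following Python snippet is purely illustrative (the structural model, parameter names and values are hypothetical and not taken from the rat study): individual parameters are first drawn from a population distribution, and observations are then drawn conditionally on each individual's parameters, mirroring the decomposition of the joint distribution into conditional and marginal parts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population model: the individual growth rate k varies
# log-normally between individuals (all values here are hypothetical).
pop_log_k = np.log(0.1)   # log of the "typical" rate in the population
omega = 0.3               # between-individual s.d. of log(k)
sigma = 0.5               # residual (within-individual) error s.d.

n_individuals = 4
times = np.linspace(0, 14, 8)   # e.g. measurements spread over 14 weeks

# 1) Population model: draw the individual parameters psi_i = k_i.
k = np.exp(rng.normal(pop_log_k, omega, size=n_individuals))

# 2) Observation model: draw y_ij conditionally on psi_i
#    (structural model plus residual noise).
def structural_model(k_i, t):
    return 10.0 * np.exp(k_i * t)   # a simple exponential growth curve

y = np.array([structural_model(k_i, times) + rng.normal(0.0, sigma, times.size)
              for k_i in k])

print(y.shape)   # (4, 8): repeated measurements on several individuals
```

Simulating in these two stages is exactly the factorization $\pmacro(y, \psi) = \pmacro(y \mid \psi)\,\pmacro(\psi)$ put into code.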
The various software packages used in this wiki have been developed with this in mind: they serve the modeler well while fully complying with a coherent mathematical formalism and using well-known, theoretically justified methods.

Tools for model exploration ($\mlxplore$), modeling ($\monolix$) and simulation ($\simulix$) use the same model coding language, $\mlxtran$. This allows us to define a complete workflow using the same model implementation, i.e., to run several different tasks based on the same model.

$\mlxtran$ is extremely flexible and well adapted to implementing complex mixed-effects models. With $\mlxtran$ we can easily write ODE-based models, implement [[Introduction_to_PK_modeling_using_MLXPlore_-_Part_I|pharmacokinetic models]] with complex administration schedules, include inter-individual variability in parameters, define statistical models for covariates, etc. Another crucial property of $\mlxtran$ is that it rigorously adopts the model representation formalism proposed in $\wikipopix$. In other words, the model implementation is fully consistent with its mathematical representation.

$\mlxplore$ provides a clear graphical interface that allows us to visualize not only the structural model but also the statistical model, which is of fundamental importance in the population approach. We can visualize, for instance, the impact of covariates and of inter-individual variability of model parameters on predictions. This makes $\mlxplore$ an ideal tool for teaching, or for discovering what a [[Introduction_to_PK_modeling_using_MLXPlore_-_Part_I|pharmacokinetic model]] is, for example.

The algorithms implemented in $\monolix$ (stochastic approximation of EM, MCMC, simulated annealing, importance sampling, etc.) are extremely efficient for a wide variety of complex models. Furthermore, convergence of SAEM and its extensions (mixture models, hidden Markov models, SDE-based models, censored data, etc.)
has been rigorously proved and published in statistical journals.

$\simulix$ is a model computation engine that enables us to simulate an $\mlxtran$ model from within various environments. $\simulix$ is now available for the Matlab and R platforms, allowing any user to combine the flexibility of R and Matlab scripts with the power of $\mlxtran$ in order to easily encode complex models and simulate data.

For these reasons, $\wikipopix$ and these tools can be used with confidence for training and teaching, all the more so because $\mlxplore$, $\monolix$ and $\simulix$ are free for academic research and education purposes.

{{Next
|link=The individual approach }}</div>

The SAEM algorithm for estimating population parameters (revision of 2013-06-04 by Marc)

== Introduction ==

The SAEM (Stochastic Approximation of EM) algorithm is a stochastic algorithm for computing the maximum likelihood estimator (MLE) in the quite general setting of incomplete data models. SAEM has been shown to be a very powerful tool for nonlinear mixed-effects models (NLMEM), known to accurately estimate population parameters and to have good theoretical properties. In fact, it converges to the MLE under very general hypotheses.

SAEM was first implemented in the $\monolix$ software. It has also been implemented in NONMEM, in the {{Verbatim|R}} package {{Verbatim|saemix}}, and in the Matlab Statistics Toolbox as the function {{Verbatim|nlmefitsa.m}}.

Here, we consider a model that includes observations $\by=(y_i , 1\leq i \leq N)$, unobserved individual parameters $\bpsi=(\psi_i , 1\leq i \leq N)$ and a vector of parameters $\theta$.
By definition, the maximum likelihood estimator of $\theta$ maximizes<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; {\like}(\theta ; \by) = \py(\by ; \theta) = \displaystyle{ \int \pypsi(\by,\bpsi ; \theta) \, d \bpsi}.<br /> &lt;/math&gt; }}<br /> <br /> <br /> SAEM is an iterative algorithm that essentially consists of constructing $N$ Markov chains $(\psi_1^{(k)})$, ..., $(\psi_N^{(k)})$ that converge to the conditional distributions $\pmacro(\psi_1|y_1),\ldots , \pmacro(\psi_N|y_N)$, using at each step the complete data $(\by,\bpsi^{(k)})$ to calculate a new parameter vector $\theta_k$. We will present a general description of the algorithm highlighting the connection with the EM algorithm, and show by way of a simple example how to implement SAEM and use it in practice.<br /> <br /> We will also give some extensions of the base algorithm that improve its convergence properties. For instance, it is possible to stabilize the algorithm's convergence by using several Markov chains per individual. Also, a simulated annealing version of SAEM allows us to improve the chances of converging to the global maximum of the likelihood rather than to local maxima.<br /> <br /> <br /> &lt;br&gt;<br /> ==The EM algorithm==<br /> <br /> <br /> We first remark that if the individual parameters $\bpsi=(\psi_i)$ are observed, estimation is not thwarted by any particular problem because an estimator could be found by directly maximizing the joint distribution $\pypsi(\by,\bpsi ; \theta)$.<br /> <br /> However, since the $\psi_i$ are not observed, the EM algorithm replaces $\bpsi$ by its conditional expectation. 
Then, given some initial value $\theta_0$, iteration $k$ updates ${\theta}_{k-1}$ to ${\theta}_{k}$ with the following two steps:<br /> <br /> <br /> * $\textbf{E-step:}$ evaluate the quantity<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta)=\esp{\log \pmacro(\by,\bpsi;\theta){{!}} \by;\theta_{k-1} } .&lt;/math&gt; }}<br /> <br /> <br /> * $\textbf{M-step:}$ update the estimation of $\theta$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_{k} = \argmax{\theta} \, Q_k(\theta) .<br /> &lt;/math&gt; }}<br /> <br /> <br /> It can be proved that each EM iteration increases the likelihood of the observations and that the EM sequence $(\theta_k)$ converges to a<br /> stationary point of the observed likelihood under mild regularity conditions.<br /> <br /> Unfortunately, in the framework of nonlinear mixed-effects models, there is no explicit expression for the E-step since the relationship between observations $\by$ and individual parameters $\bpsi$ is nonlinear. However, even though this expectation cannot be computed in closed form, it can be approximated by simulation. 
For instance,<br /> <br /> <br /> * The Monte Carlo EM (MCEM) algorithm replaces the E-step by a Monte Carlo approximation based on a large number of independent simulations of the non-observed individual parameters $\bpsi$.<br /> <br /> * The SAEM algorithm replaces the E-step by a stochastic approximation based on a single simulation of $\bpsi$.<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ==The SAEM algorithm==<br /> <br /> At iteration $k$ of SAEM:<br /> <br /> <br /> * $\textbf{Simulation step}$: for $i=1,2,\ldots, N$, draw $\psi_i^{(k)}$ from the conditional distribution $\pmacro(\psi_i |y_i ;\theta_{k-1})$.<br /> <br /> <br /> * $\textbf{Stochastic approximation}$: update $Q_k(\theta)$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta) = Q_{k-1}(\theta) + \gamma_k ( \log \pmacro(\by,\bpsi^{(k)};\theta) - Q_{k-1}(\theta) ),<br /> &lt;/math&gt; }}<br /> <br /> where $(\gamma_k)$ is a decreasing sequence of positive numbers such that $\gamma_1=1$, $\sum_{k=1}^{\infty} \gamma_k = \infty$ and $\sum_{k=1}^{\infty} \gamma_k^2 &lt; \infty$.<br /> <br /> <br /> * $\textbf{Maximization step}$: update $\theta_{k-1}$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_{k} = \argmax{\theta} \, Q_k(\theta) .&lt;/math&gt; }}<br /> <br /> <br /> {{Remarks <br /> |title=Remarks<br /> |text= &amp;#32;<br /> * Setting $\gamma_k=1$ for all $k$ means that there is no memory in the stochastic approximation:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta) = \log \pmacro(\by,\bpsi^{(k)};\theta) . 
&lt;/math&gt; }}<br /> <br /> : This algorithm, known as Stochastic EM (SEM), thus consists of successively simulating $\bpsi^{(k)}$ with the conditional distribution $\pmacro(\bpsi^{(k)} {{!}} \by;\theta_{k-1})$, then computing $\theta_k$ by maximizing the joint distribution $\pmacro(\by,\bpsi^{(k)};\theta)$.<br /> <br /> <br /> * When the number $N$ of subjects is small, convergence of SAEM can be improved by running $L$ Markov chains for each individual instead of one. The simulation step at iteration $k$ then requires us to draw $L$ sequences $\psi_i^{(k,1)} ,\ldots , \psi_i^{(k,L)}$ for each individual $i$ and to combine stochastic approximation and Monte Carlo in the approximation step:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta) = Q_{k-1}(\theta) + \gamma_k \left( \frac{1}{L}\sum_{\ell=1}^{L} \log \pmacro(\by,\bpsi^{(k,\ell)};\theta) - Q_{k-1}(\theta) \right) .<br /> &lt;/math&gt; }}<br /> <br /> : By default, $\monolix$ selects $L$ so that $N\times L \geq 50$.<br /> }}<br /> <br /> <br /> Implementation of SAEM is simplified when the complete model $\pmacro(\by,\bpsi;\theta)$ belongs to a regular (curved) exponential family:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \pmacro(\by,\bpsi ;\theta) = \exp\left\{ - \zeta(\theta) + \langle \tilde{S}(\by,\bpsi) , \varphi(\theta) \rangle \right\} , &lt;/math&gt; }}<br /> <br /> where $\tilde{S}(\by,\bpsi)$ is a sufficient statistic of the complete model (i.e., whose value contains all the information needed to compute any estimate of $\theta$) which takes its values in an open subset ${\cal S}$ of $\Rset^m$. 
Then, there exists a function $\tilde{\theta}$ such that for any $s\in {\cal S}$,<br /> <br /> {{EquationWithRef<br /> |equation=&lt;div id=&quot;eq:saem_stat&quot;&gt;&lt;math&gt;<br /> \tilde{\theta}(s) = \argmax{\theta} \left\{ - \zeta(\theta) + \langle s , \varphi(\theta) \rangle \right\} .<br /> &lt;/math&gt;&lt;/div&gt;<br /> |reference=(1) }}<br /> <br /> The approximation step of SAEM simplifies to a general Robbins-Monro-type scheme for approximating this conditional expectation:<br /> <br /> <br /> * $\textbf{Stochastic approximation}$: update $s_k$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> s_k = s_{k-1} + \gamma_k ( \tilde{S}(\by,\bpsi^{(k)}) - s_{k-1} ) . &lt;/math&gt; }}<br /> <br /> <br /> Note that the E-step of EM simplifies to computing $s_k=\esp{\tilde{S}(\by,\bpsi) | \by ; \theta_{k-1}}$.<br /> <br /> Then, both EM and SAEM algorithms use [[#eq:saem_stat|(1)]] for the M-step: $\theta_k = \tilde{\theta}(s_k)$.<br /> <br /> Precise results for convergence of SAEM were obtained in the [[Estimation of the observed Fisher information matrix#Estimation using linearization of the model|Estimation of the F.I.M. using a linearization of the model]] chapter in the case where $\pmacro(\by,\bpsi;\theta)$ belongs to a regular curved exponential family. This first version of [[The SAEM algorithm for estimating population parameters|SAEM]] and these first results assume that the individual parameters are simulated exactly under the conditional distribution at each iteration. Unfortunately, for most nonlinear models or non-Gaussian models, the unobserved data cannot be simulated exactly under this conditional distribution. 
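Putting the pieces together, the sufficient-statistic form of SAEM can be sketched as a short generic skeleton. This is an illustrative sketch, not the $\monolix$ implementation: the names {{Verbatim|draw_psi}}, {{Verbatim|suff_stat}} and {{Verbatim|theta_of_s}} are placeholders to be supplied by the model, and {{Verbatim|draw_psi}} stands for whatever sampler of $\pmacro(\psi|y;\theta)$ is available (exact when possible, otherwise an MCMC kernel as discussed next). The step-size schedule assumes two phases, as recommended later in this chapter.

```python
import numpy as np

def saem(y, draw_psi, suff_stat, theta_of_s, s0, theta0, K1=50, K2=50):
    """Generic SAEM skeleton in sufficient-statistic form (illustrative sketch).

    draw_psi(y, theta): draws the individual parameters from (an approximation of)
        the conditional distribution p(psi | y; theta).
    suff_stat(y, psi):  the sufficient statistic S_tilde(y, psi).
    theta_of_s(s):      the M-step map theta_tilde of (1).
    """
    s, theta = s0, theta0
    for k in range(1, K1 + K2 + 1):
        gamma = 1.0 if k <= K1 else 1.0 / (k - K1)   # two-phase step-size schedule
        psi = draw_psi(y, theta)                     # simulation step
        s = s + gamma * (suff_stat(y, psi) - s)      # stochastic approximation
        theta = theta_of_s(s)                        # maximization step
    return theta
```

Only the three model-specific callables change from one model to another; the iteration itself is the same three lines.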
A well-known alternative consists in using the Metropolis-Hastings algorithm: introduce a transition kernel whose unique invariant distribution is the conditional distribution we want to simulate.<br /> <br /> In other words, the procedure consists of replacing the Simulation step of SAEM at iteration $k$ by $m$ iterations of the<br /> Metropolis-Hastings (MH) algorithm described in [[The Metropolis-Hastings algorithm for simulating the individual parameters|The Metropolis-Hastings algorithm]] section. It was shown in the [[Estimation of the observed Fisher information matrix#Estimation using linearization of the model|Estimation of the F.I.M. using a linearization of the model]] section that [[The SAEM algorithm for estimating population parameters|SAEM]] still converges under general conditions when coupled with a Markov chain Monte Carlo procedure.<br /> <br /> <br /> {{Remarks<br /> |title= Remark<br /> |text= Convergence of the Markov chains $(\psi_i^{(k)})$ is not necessary at each SAEM iteration. It suffices to run a few MH iterations with various transition kernels before updating $\theta_{k-1}$. In $\monolix$ by default, three transition kernels are used twice each, successively, in each SAEM iteration.<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> <br /> == Implementing SAEM ==<br /> <br /> Implementation of SAEM can be difficult to describe for complex statistical models such as mixture models, models with inter-occasion variability, etc. We therefore limit ourselves to some basic models in order to illustrate how SAEM can be implemented.<br /> <br /> &lt;br&gt;<br /> ===SAEM for general hierarchical models===<br /> <br /> Consider first a very general model for any type (continuous, categorical, survival, etc.) 
of data $(y_i)$:<br /> <br /> {{Equation1<br /> |equation= &lt;math&gt;\begin{eqnarray} y_i {{!}} \psi_i &amp;\sim&amp; \pcyipsii(y_i {{!}} \psi_i) \\<br /> h(\psi_i) &amp;\sim&amp; {\cal N}( \mu , \Omega),<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $h(\psi_i)=(h_1(\psi_{i,1}), h_2(\psi_{i,2}), \ldots , h_d(\psi_{i,d}) )^\transpose$ is a $d$-vector of (transformed) individual parameters, $\mu$ a $d$-vector of fixed effects and $\Omega$ a $d\times d$ variance-covariance matrix.<br /> <br /> We assume here that $\Omega$ is positive-definite. Then, a sufficient statistic for the complete model $\pmacro(\by,\bpsi;\theta)$ is<br /> $\tilde{S}(\bpsi) = (\tilde{S}_1(\bpsi),\tilde{S}_2(\bpsi))$, where<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \tilde{S}_1(\bpsi) &amp;= &amp; \sum_{i=1}^N h(\psi_i) \\<br /> \tilde{S}_2(\bpsi) &amp;= &amp; \sum_{i=1}^N h(\psi_i) h(\psi_i)^\transpose .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> At iteration $k$ of SAEM, we have:<br /> <br /> <br /> * $\textbf{Simulation step}$: for $i=1,2,\ldots, N$, draw $\psi_i^{(k)}$ from $m$ iterations of the MH algorithm described in [[The Metropolis-Hastings algorithm for simulating the individual parameters|The Metropolis-Hastings algorithm]] with $\pmacro(\psi_i |y_i ;\mu_{k-1},\Omega_{k-1})$ as limiting distribution.<br /> <br /> * $\textbf{Stochastic approximation}$: update $s_k=(s_{k,1},s_{k,2})$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> s_{k,1} &amp;=&amp; s_{k-1,1} + \gamma_k \left( \sum_{i=1}^N h(\psi_i^{(k)}) - s_{k-1,1} \right) \\<br /> s_{k,2} &amp;=&amp; s_{k-1,2} + \gamma_k \left( \sum_{i=1}^N h(\psi_i^{(k)})h(\psi_i^{(k)})^\transpose - s_{k-1,2} \right) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> * $\textbf{Maximization step}$: update $(\mu_{k-1},\Omega_{k-1})$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \mu_{k} &amp;=&amp; 
\frac{1}{N} s_{k,1} \\<br /> \Omega_k &amp;=&amp; \frac{1}{N}\left( s_{k,2} - \frac{1}{N}\, s_{k,1}s_{k,1}^\transpose \right) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> What is remarkable is that it suffices to be able to calculate $\pcyipsii(y_i | \psi_i)$ for all $\psi_i$ and $y_i$ in order to be able to run SAEM. In effect, this allows the simulation step to be run using MH since the acceptance probabilities can be calculated.<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ===SAEM for continuous data models===<br /> Consider now a continuous data model in which the residual error variance is constant:<br /> <br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> y_{ij} &amp;=&amp; f(t_{ij},\phi_i) + a \teps_{ij} \\<br /> h(\psi_i) &amp;\sim&amp; {\cal N}( \mu , \Omega) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> Here, the individual parameters are $\psi_i=(\phi_i,a)$. The variance-covariance matrix for $\psi_i$ is not positive-definite in this case because $a$ has no variability. If we suppose that the variance matrix $\Omega$ is positive-definite, then writing $\theta=(\mu,\Omega,a)$, a natural decomposition of the model is:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\pmacro(\by,\bpsi;\theta) = \pmacro(\by {{!}} \bpsi;a)\pmacro(\bpsi;\mu,\Omega) .<br /> &lt;/math&gt; }}<br /> <br /> The previous statistic $\tilde{S}(\bpsi) = (\tilde{S}_1(\bpsi),\tilde{S}_2(\bpsi))$ is not sufficient for estimating $a$. Indeed, we need an additional component which is a function of both $\by$ and $\bpsi$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \tilde{S}_3(\by, \bpsi) =\sum_{i=1}^N \sum_{j=1}^{n_i}(y_{ij} - f(t_{ij},\phi_i))^2. 
&lt;/math&gt; }}<br /> <br /> Then,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \begin{eqnarray}<br /> s_{k,3} &amp;=&amp; s_{k-1,3} + \gamma_k ( \tilde{S}_3(\by, \bpsi^{(k)}) - s_{k-1,3} ) \\<br /> a_k^2 &amp;=&amp; \displaystyle{ \frac{1}{\sum_{i=1}^N n_i} s_{k,3} }\ .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The choice of step-size $(\gamma_k)$ is extremely important for ensuring convergence of SAEM. The sequence $(\gamma_k)$ used in $\monolix$ decreases like $k^{-\alpha}$. We recommend using $\alpha=0$ (that is, $\gamma_k=1$) during the first $K_1$ iterations, in order to converge quickly to a neighborhood of a maximum of the likelihood, and $\alpha=1$ during the next $K_2$ iterations.<br /> Indeed, the initial guess $\theta_0$ may be far from the maximum likelihood value we are looking for, and the first iterations with $\gamma_k=1$ allow SAEM to converge quickly to a neighborhood of this value. Following this, smaller step-sizes ensure the<br /> almost sure convergence of the algorithm to the maximum likelihood estimator.<br /> <br /> <br /> <br /> {{Example<br /> |title=Example<br /> |text= Consider a simple model for continuous data:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> y_{ij} &amp;\sim&amp; {\cal N}(A_i\,e^{-k_i \, t_{ij} } , a^2) \\<br /> \log(A_i)&amp;\sim&amp;{\cal N}(\log(A_{\rm pop}) , \omega_A^2) \\<br /> \log(k_i)&amp;\sim&amp;{\cal N}(\log(k_{\rm pop}) , \omega_k^2) ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $A_{\rm pop}=6$, $k_{\rm pop}=0.25$, $\omega_A=0.3$, $\omega_k=0.3$ and $a=0.2$.<br /> Let us look at the effect of different settings for $(\gamma_k)$ (and $L$) on the estimation of the population parameters of the model with SAEM.<br /> <br /> <br /> 1. For all $k$, $\gamma_k = 1$: the sequence $(\theta_{k})$ converges very quickly to a neighborhood of the &quot;solution&quot;. 
The sequence $(\theta_{k})$ is a homogeneous Markov chain that converges in distribution but does not converge almost surely. <br /> <br /> [[File:saem1.png|link=]]<br /> <br /> <br /> 2. For all $k$, $\gamma_k = 1/k$: the sequence $(\theta_{k})$ converges almost surely to the maximum likelihood estimate of $\theta$, but very slowly. <br /> <br /> [[File:saem2.png|link=]]<br /> <br /> <br /> 3. $\gamma_k = 1$ for $k=1,\ldots,40$, $\gamma_k = 1/(k-40)$ for $k \geq 41$: the sequence $(\theta_{k})$ converges almost surely to the maximum likelihood estimate of $\theta$, and quickly.<br /> <br /> [[File:saem3.png|link=]]<br /> <br /> <br /> 4. $L=10$, $\gamma_k = 1$, $k \geq 1$: the sequence $(\theta_{k})$ is a homogeneous Markov chain that converges in distribution, as in Example 1, but the variance is reduced by a factor of $10$ (the standard deviation by $\sqrt{10}$); in this case, SAEM behaves like EM. <br /> <br /> [[File:saem4.png|link=]]<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ==A simple example to understand why SAEM converges in practice==<br /> <br /> <br /> Let us look at a very simple Gaussian model, with only one observation per individual:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \psi_i &amp;\sim&amp; {\cal N}(\theta,\omega^2) , \ \ \ 1 \leq i \leq N \\<br /> y_i &amp;\sim&amp; {\cal N}(\psi_i,\sigma^2).<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> We will furthermore assume that both $\omega^2$ and $\sigma^2$ are known.<br /> <br /> Here, the maximum likelihood estimator $\hat{\theta}$ of $\theta$ is easy to compute since $y_i \sim_{i.i.d.} {\cal N}(\theta,\omega^2+\sigma^2)$. We find that<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \hat{\theta} = \displaystyle{\frac{1}{N} }\sum_{i=1}^{N} y_i .<br /> &lt;/math&gt;}}<br /> <br /> We now propose to compute $\hat{\theta}$ using SAEM instead. 
The simulation step is straightforward since the conditional distribution of $\psi_i$ is a normal distribution:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> \psi_i {{!}} y_i \sim {\cal N}(a \theta + (1-a)y_i , \gamma^2) ,<br /> &lt;/math&gt; }}<br /> <br /> where<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> a &amp;= &amp; \displaystyle{ \frac{1}{\omega^2} } \left(\displaystyle{ \frac{1}{\sigma^2} }+ \displaystyle{\frac{1}{\omega^2} }\right)^{-1} \\<br /> \gamma^2 &amp;= &amp;\left(\displaystyle{ \frac{1}{\sigma^2} }+ \displaystyle{\frac{1}{\omega^2} }\right)^{-1}.<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The maximization step is also straightforward. Indeed, a sufficient statistic for estimating $\theta$ is<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; {\cal S}(\bpsi) = \sum_{i=1}^{N} \psi_i. &lt;/math&gt; }}<br /> <br /> Then,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \tilde{\theta}({\cal S(\bpsi)} ) &amp;=&amp; \argmax{\theta} \pmacro(y_1,\ldots,y_N,\psi_1,\ldots,\psi_N;\theta) \\<br /> &amp;=&amp; \argmax{\theta} \pmacro(\psi_1,\ldots,\psi_N;\theta) \\<br /> &amp;=&amp; \frac{ {\cal S}(\bpsi)}{N}.<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Let us first look at the behavior of SAEM when $\gamma_k=1$. At iteration $k$,<br /> <br /> <br /> * Simulation step: $\psi_i^{(k)} \sim {\cal N}( a \theta_{k-1} + (1-a)y_i , \gamma^2).$<br /> <br /> * Maximization step: $\theta_k = \displaystyle{ \frac{ {\cal S}(\bpsi^{(k)})}{N} } = \displaystyle{ \frac{1}{N} }\sum_{i=1}^{N} \psi_i^{(k)}$.<br /> <br /> <br /> It can be shown that:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_k - \hat{\theta} = a(\theta_{k-1} - \hat{\theta}) + e_k ,<br /> &lt;/math&gt; }}<br /> <br /> where $e_k \sim {\cal N}(0, \gamma^2 /N)$. 
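These two steps are easy to run numerically. Here is a minimal Python sketch of this toy model with $\gamma_k=1$; the numerical values of $\theta$, $\omega$, $\sigma$ and $N$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative values): psi_i ~ N(theta, omega^2), y_i ~ N(psi_i, sigma^2)
theta_star, omega, sigma, N = 1.0, 1.0, 0.5, 100
y = rng.normal(rng.normal(theta_star, omega, N), sigma)
theta_hat = y.mean()  # the MLE, since y_i ~ N(theta, omega^2 + sigma^2)

# Constants of the conditional distribution psi_i | y_i given above
gamma2 = 1.0 / (1.0 / sigma**2 + 1.0 / omega**2)
a = gamma2 / omega**2

theta_k = -2.0  # deliberately poor initial guess
for k in range(50):
    psi_k = rng.normal(a * theta_k + (1 - a) * y, np.sqrt(gamma2))  # simulation step
    theta_k = psi_k.mean()                                          # maximization step, gamma_k = 1
```

After a few iterations $\theta_k$ reaches a neighborhood of $\hat{\theta}$ and then keeps fluctuating around it without settling.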
Then, the sequence $(\theta_k)$ is an autoregressive process of order 1 (AR(1)) which converges in distribution to a normal distribution when $k\to \infty$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\theta_k \limite{}{\cal D} {\cal N}\left(\hat{\theta} , \displaystyle{ \frac{\gamma^2}{N(1-a^2)} }\right) .<br /> &lt;/math&gt; }}<br /> <br /> <br /> {{ImageWithCaption|image=saemb1.png|caption=10 sequences $(\theta_k)$ obtained with different initial values and $\gamma_k=1$ for $1\leq k \leq 50$ }} <br /> <br /> <br /> Now, let us see what happens instead when $\gamma_k$ decreases like $1/k$. At iteration $k$,<br /> <br /> <br /> * Simulation step: $\psi_i^{(k)} \sim {\cal N}( a \theta_{k-1} + (1-a)y_i , \gamma^2)$<br /> <br /> * Maximization step:<br /> <br /> {{Equation1<br /> |equation= &lt;math&gt;\theta_k = \theta_{k-1} + \displaystyle{ \frac{1}{k} }\left( \displaystyle{ \frac{1}{N} }\sum_{i=1}^{N} \psi_i^{(k)} -\theta_{k-1} \right). <br /> &lt;/math&gt; }}<br /> <br /> <br /> : Here, we can show that:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_k - \hat{\theta} = \displaystyle{ \frac{k-1+a}{k} }(\theta_{k-1} - \hat{\theta}) + \displaystyle{\frac{e_k}{k} }, <br /> &lt;/math&gt; }}<br /> <br /> : where $e_k \sim {\cal N}(0, \gamma^2 /N)$. 
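The same sketch can be run with the decreasing step size $\gamma_k = 1/k$; only the update of $\theta_k$ changes (the numerical values are again illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy model as above (illustrative values)
theta_star, omega, sigma, N = 1.0, 1.0, 0.5, 100
y = rng.normal(rng.normal(theta_star, omega, N), sigma)
theta_hat = y.mean()  # the MLE

gamma2 = 1.0 / (1.0 / sigma**2 + 1.0 / omega**2)
a = gamma2 / omega**2

theta_k = -2.0  # same poor initial guess
for k in range(1, 501):
    psi_k = rng.normal(a * theta_k + (1 - a) * y, np.sqrt(gamma2))  # simulation step
    theta_k = theta_k + (psi_k.mean() - theta_k) / k                # maximization step, gamma_k = 1/k
```

In this run $\theta_k$ settles down instead of fluctuating, but it needs many iterations to absorb a poor initial guess.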
Then, the sequence $(\theta_k)$ converges almost surely to $\hat{\theta}$.<br /> <br /> <br /> {{ImageWithCaption|image=saemb2.png|caption=10 sequences $(\theta_k)$ obtained with different initial values and $\gamma_k=1/k$ for $1\leq k \leq 50$ }}<br /> <br /> <br /> Thus, we see that by combining the two strategies, the sequence $(\theta_k)$ is a Markov chain that fluctuates randomly around $\hat{\theta}$ during the first $K_1$ iterations, then converges almost surely to $\hat{\theta}$ during the next $K_2$ iterations.<br /> <br /> <br /> {{ImageWithCaption|image=saemb3.png|caption=10 sequences $(\theta_k)$ obtained with different initial values, $\gamma_k=1$ for $1\leq k \leq 20$ and $\gamma_k=1/(k-20)$ for $21\leq k \leq 50$ }}<br /> <br /> <br /> {{ShowVideo|image=saem5b.png|video=http://popix.lixoft.net/images/2/20/saem.mp4|caption=The SAEM algorithm in practice. }}<br /> <br /> &lt;!-- {{ImageWithCaptionL|image=saem5.png|size=750px|caption= The SAEM algorithm in practice. (a) the observations and the initialization $p_0(\psi_i)$, (b) the initialization $p_0(\psi_i)$ and the conditional distributions of the observations $p(y_i{{!}}\psi_i)$, (c) the conditional distributions $p_0(\psi_i{{!}}y_i)$ and the simulated individual parameters $(\psi_i^{(1)})$, (d) the updated distribution $p_1(\psi_i)$. }} --&gt;<br /> <br /> ==A simulated annealing version of SAEM==<br /> <br /> <br /> Convergence of SAEM can strongly depend on the initial guess when the likelihood ${\like}$ has several local maxima. 
A simulated annealing version of SAEM can improve convergence of the algorithm toward the global maximum of ${\like}$.<br /> <br /> To detail this, we can first rewrite the joint pdf of $(\by,\bpsi)$ as follows:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \pmacro(\by,\bpsi;\theta) = C(\theta)\, \exp \left\{-U(\by,\bpsi;\theta)\right\} ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $C(\theta)$ is a normalizing constant that only depends on $\theta$. Then, for any &quot;temperature&quot; $T\geq0$, we consider the complete model<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\pmacro_T(\by,\bpsi;\theta) = C_T(\theta)\, \exp \left\{-\displaystyle{\frac{1}{T} }U(\by,\bpsi;\theta) \right\} ,<br /> &lt;/math&gt; }}<br /> <br /> where $C_T(\theta)$ is still a normalizing constant.<br /> <br /> We then introduce a decreasing temperature sequence $(T_k, 1\leq k \leq K)$ and use the SAEM algorithm on the complete model $\pmacro_{T_k}(\by,\bpsi;\theta)$ at iteration $k$ (the usual version of SAEM uses $T_k=1$ at each iteration). 
The sequence $(T_k)$ is chosen to have large positive values during the first iterations, then decrease with an exponential rate to 1: $T_k = \max(1, \tau \ T_{k-1})$.<br /> <br /> Consider for example the following model for continuous data:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> y_{ij} &amp;\sim&amp; {\cal N}(f(t_{ij};\psi_i) , a^2) \\<br /> h(\psi_i) &amp;\sim&amp; {\cal N}(\mu , \Omega) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Here, $\theta = (\mu,\Omega,a^2)$ and<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> \pmacro(\by,\bpsi;\theta) = C(\theta)\, \exp \left\{- \displaystyle{ \frac{1}{2 a^2} }\sum_{i=1}^N \sum_{j=1}^{n_i} (y_{ij} - f(t_{ij};\psi_i))^2 - \displaystyle{ \frac{1}{2} } \sum_{i=1}^N (h(\psi_i)-\mu)^\transpose \Omega^{-1} (h(\psi_i)-\mu) \right\},<br /> &lt;/math&gt; }}<br /> <br /> where $C(\theta)$ is a normalizing constant that only depends on $a$ and $\Omega$.<br /> <br /> <br /> We see that $\pmacro_T(\by,\bpsi;\theta)$ will also be a normal distribution whose residual error variance $a^2$ is replaced by $T a^2$ and variance matrix $\Omega$ for the random effects by $T\Omega$.<br /> In other words, a model with a &quot;large temperature&quot; is a model with large variances.<br /> <br /> The algorithm therefore consists in choosing large initial variances $\Omega_0$ and $a^2_0$ (that include the initial temperature $T_0$ implicitly) and setting $a^2_k = \max(\tau \ a^2_{k-1} , \hat{a}^2(\by,\bpsi^{(k)}))$ and $\Omega_k = \max(\tau \ \Omega_{k-1} , \hat{\Omega}(\bpsi^{(k)}))$ during the first iterations. 
Here, $0\leq\tau\leq 1$.<br /> <br /> These large values of the variance make the conditional distributions $\pmacro_T(\psi_i | y_i;\theta)$ less concentrated around their modes, and thus allow the sequence $(\theta_k)$ to &quot;escape&quot; from local maxima of the likelihood during the first iterations of SAEM and converge to a neighborhood of the global maximum of ${\like}$.<br /> After these initial iterations, the usual SAEM algorithm is used to estimate these variances at each iteration.<br /> <br /> <br /> {{Remarks<br /> |title= Remark<br /> |text= We can use two different coefficients $\tau_1$ and $\tau_2$ for $\Omega$ and $a^2$ in $\monolix$. It is possible, for example, to choose $\tau_1&lt;1$ and $\tau_2&gt;1$, with large initial inter-subject variances $\Omega_0$ and small initial residual variance $a^2_0$. In this case, SAEM tries to obtain the best possible fit during the first iterations, allowing for a large inter-subject variability. During the next iterations, this variability is reduced and the residual variance increases until reaching the best possible trade-off between the two criteria.<br /> }}<br /> <br /> <br /> {{Example<br /> |title=A PK example<br /> |text= <br /> <br /> Consider a simple one-compartment model for oral administration:<br /> <br /> {{EquationWithRef<br /> |equation=&lt;div id=&quot;eq:saem_sa&quot;&gt;&lt;math&gt;<br /> f(t;ka,V,ke) = \displaystyle{ \frac{D\, ka}{V(ka-ke)} }\left( e^{-ke \, t} - e^{-ka \, t} \right) .<br /> &lt;/math&gt;&lt;/div&gt;<br /> |reference=(2) }}<br /> <br /> We then simulate PK data from 80 patients using the following population PK parameters:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; ka_{\rm pop} = 1, \quad V_{\rm pop}=8, \quad ke_{\rm pop}=0.25 .&lt;/math&gt; }}<br /> <br /> We can see that the following parametrization gives the same prediction as the one given in [[#eq:saem_sa|(2)]]:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \tilde{ka} = ke, \quad \tilde{V}=V \times 
ke/ka, \quad \tilde{ke}=ka . &lt;/math&gt; }}<br /> <br /> We can then expect a (global) maximum around $(ka,V,ke) = (1, \ 8, \ 0.25)$ and a (local) maximum of the likelihood around $(ka,V,ke) = (0.25, \ 2, \ 1).$<br /> <br /> The figure below displays the convergence of SAEM without simulated annealing to a local maximum of the likelihood (deviance = $-2\,\log {\like} =816$). The initial values of the population parameters we chose were $(ka_0,V_0,ke_0) = (1,1,1)$.<br /> <br /> :{{ImageWithCaption_special|image=recuit1.png|caption=Convergence of SAEM to a local maximum of the likelihood}} <br /> <br /> Using the same initial guess, the simulated annealing version of SAEM converges to the global maximum of the likelihood (deviance = 734).<br /> <br /> :{{ImageWithCaption_special|image=recuit2.png|caption=Convergence of SAEM to the global maximum of the likelihood }}<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> == Bibliography ==<br /> <br /> <br /> &lt;bibtex&gt;<br /> @article{allassonniere2010construction,<br /> title={Construction of Bayesian deformable models via a stochastic approximation algorithm: a convergence study},<br /> author={Allassonnière, S. and Kuhn, E. and Trouvé, A.},<br /> journal={Bernoulli},<br /> volume={16},<br /> number={3},<br /> pages={641--678},<br /> year={2010},<br /> publisher={Bernoulli Society for Mathematical Statistics and Probability}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{delattre2012maximum,<br /> title={Maximum likelihood estimation in discrete mixed hidden Markov models using the SAEM algorithm},<br /> author={Delattre, M. 
and Lavielle, M.},<br /> journal={Computational Statistics &amp; Data Analysis},<br /> year={2012},<br /> volume={56},<br /> pages={2073-2085}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{delattre2013sde,<br /> title={Coupling the SAEM algorithm and the extended Kalman filter for maximum likelihood estimation in mixed-effects diffusion models},<br /> author={Delattre, M. and Lavielle, M.},<br /> journal={Statistics and its interfaces},<br /> year={2013},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{delyon1999convergence,<br /> title={Convergence of a stochastic approximation version of the EM algorithm},<br /> author={Delyon, B. and Lavielle, M. and Moulines, E.},<br /> journal={Annals of Statistics},<br /> pages={94-128},<br /> year={1999},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{dempster1977maximum,<br /> title={Maximum likelihood from incomplete data via the EM algorithm},<br /> author={Dempster, A.P. and Laird, N.M. and Rubin, D.B.},<br /> journal={Journal of the Royal Statistical Society. Series B (Methodological)},<br /> pages={1-38},<br /> year={1977},<br /> publisher={JSTOR}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{kuhn2004coupling,<br /> title={Coupling a stochastic approximation version of EM with an MCMC procedure},<br /> author={Kuhn, E. and Lavielle, M.},<br /> journal={ESAIM: Probability and Statistics},<br /> volume={8},<br /> pages={115-131},<br /> year={2004},<br /> publisher={EDP Sciences, 17 Avenue du Hoggar Les Ulis Cedex A BP 112 91944 France}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{lavielle2013improved,<br /> title={An improved SAEM algorithm for maximum likelihood estimation in mixtures of non linear mixed effects models},<br /> author={Lavielle, M. 
and Mbogning, C.},<br /> journal={Statistics and Computing},<br /> year={2013},<br /> publisher={Springer}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{mclachlan2007algorithm,<br /> title={The EM algorithm and extensions},<br /> author={McLachlan, G.J. and Krishnan, T.},<br /> volume={382},<br /> year={2007},<br /> publisher={Wiley-Interscience}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{samson2006extension,<br /> title={Extension of the SAEM algorithm to left-censored data in nonlinear mixed-effects model: Application to HIV dynamics model},<br /> author={Samson, A. and Lavielle, M. and Mentr&amp;eacute;, F.},<br /> journal={Computational statistics &amp; data analysis},<br /> volume={51},<br /> number={3},<br /> pages={1562-1574},<br /> year={2006},<br /> publisher={Elsevier}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{wei1990monte,<br /> title={A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms},<br /> author={Wei, G. 
and Tanner, M.},<br /> journal={Journal of the American Statistical Association},<br /> volume={85},<br /> number={411},<br /> pages={699-704},<br /> year={1990},<br /> publisher={Taylor &amp; Francis}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{wu1983convergence,<br /> title={On the convergence properties of the EM algorithm},<br /> author={Wu, C.F.},<br /> journal={The Annals of Statistics},<br /> volume={11},<br /> number={1},<br /> pages={95-103},<br /> year={1983},<br /> publisher={Institute of Mathematical Statistics}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> <br /> <br /> {{Back&amp;Next<br /> |linkBack=Introduction and notation<br /> |linkNext=The Metropolis-Hastings algorithm for simulating the individual parameters }}</div> Marc
By definition, the maximum likelihood estimator of $\theta$ maximizes<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; {\like}(\theta ; \by) = \py(\by ; \theta) = \displaystyle{ \int \pypsi(\by,\bpsi ; \theta) \, d \bpsi}.<br /> &lt;/math&gt; }}<br /> <br /> <br /> SAEM is an iterative algorithm that essentially consists of constructing $N$ Markov chains $(\psi_1^{(k)})$, ..., $(\psi_N^{(k)})$ that converge to the conditional distributions $\pmacro(\psi_1|y_1),\ldots , \pmacro(\psi_N|y_N)$, using at each step the complete data $(\by,\bpsi^{(k)})$ to calculate a new parameter vector $\theta_k$. We will present a general description of the algorithm highlighting the connection with the EM algorithm, and then show, by way of a simple example, how to implement SAEM and use it in practice.<br /> <br /> We will also give some extensions of the base algorithm that improve its convergence properties. For instance, it is possible to stabilize the algorithm's convergence by using several Markov chains per individual. Also, a simulated annealing version of SAEM allows us to improve the chances of converging to the global maximum of the likelihood rather than to local maxima.<br /> <br /> <br /> &lt;br&gt;<br /> ==The EM algorithm==<br /> <br /> <br /> We first remark that if the individual parameters $\bpsi=(\psi_i)$ are observed, estimation poses no particular problem, since an estimator can be found by directly maximizing the joint distribution $\pypsi(\by,\bpsi ; \theta)$.<br /> <br /> However, since the $\psi_i$ are not observed, the EM algorithm replaces $\bpsi$ by its conditional expectation.
Then, given some initial value $\theta_0$, iteration $k$ updates ${\theta}_{k-1}$ to ${\theta}_{k}$ with the following two steps:<br /> <br /> <br /> * $\textbf{E-step:}$ evaluate the quantity<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta)=\esp{\log \pmacro(\by,\bpsi;\theta){{!}} \by;\theta_{k-1} } .&lt;/math&gt; }}<br /> <br /> <br /> * $\textbf{M-step:}$ update the estimation of $\theta$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_{k} = \argmax{\theta} \, Q_k(\theta) .<br /> &lt;/math&gt; }}<br /> <br /> <br /> It can be proved that each EM iteration increases the likelihood of the observations and that the EM sequence $(\theta_k)$ converges to a<br /> stationary point of the observed likelihood under mild regularity conditions.<br /> <br /> Unfortunately, in the framework of nonlinear mixed-effects models, there is no explicit expression for the E-step since the relationship between observations $\by$ and individual parameters $\bpsi$ is nonlinear. However, even though this expectation cannot be computed in closed form, it can be approximated by simulation.
For instance,<br /> <br /> <br /> * The Monte Carlo EM (MCEM) algorithm replaces the E-step by a Monte Carlo approximation based on a large number of independent simulations of the non-observed individual parameters $\bpsi$.<br /> <br /> * The SAEM algorithm replaces the E-step by a stochastic approximation based on a single simulation of $\bpsi$.<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ==The SAEM algorithm==<br /> <br /> At iteration $k$ of SAEM:<br /> <br /> <br /> * $\textbf{Simulation step}$: for $i=1,2,\ldots, N$, draw $\psi_i^{(k)}$ from the conditional distribution $\pmacro(\psi_i |y_i ;\theta_{k-1})$.<br /> <br /> <br /> * $\textbf{Stochastic approximation}$: update $Q_k(\theta)$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta) = Q_{k-1}(\theta) + \gamma_k ( \log \pmacro(\by,\bpsi^{(k)};\theta) - Q_{k-1}(\theta) ),<br /> &lt;/math&gt; }}<br /> <br /> where $(\gamma_k)$ is a decreasing sequence of positive numbers such that $\gamma_1=1$, $\sum_{k=1}^{\infty} \gamma_k = \infty$ and $\sum_{k=1}^{\infty} \gamma_k^2 &lt; \infty$.<br /> <br /> <br /> * $\textbf{Maximization step}$: update $\theta_{k-1}$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_{k} = \argmax{\theta} \, Q_k(\theta) .&lt;/math&gt; }}<br /> <br /> <br /> {{Remarks <br /> |title=Remarks<br /> |text= &amp;#32;<br /> * Setting $\gamma_k=1$ for all $k$ means that there is no memory in the stochastic approximation:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta) = \log \pmacro(\by,\bpsi^{(k)};\theta) . 
&lt;/math&gt; }}<br /> <br /> : This algorithm, known as Stochastic EM (SEM), thus consists of successively simulating $\bpsi^{(k)}$ with the conditional distribution $\pmacro(\bpsi^{(k)} {{!}} \by;\theta_{k-1})$, then computing $\theta_k$ by maximizing the joint distribution $\pmacro(\by,\bpsi^{(k)};\theta)$.<br /> <br /> <br /> * When the number $N$ of subjects is small, convergence of SAEM can be improved by running $L$ Markov chains for each individual instead of one. The simulation step at iteration $k$ then requires us to draw $L$ sequences $\psi_i^{(k,1)} ,\ldots , \psi_i^{(k,L)}$ for each individual $i$ and to combine stochastic approximation and Monte Carlo in the approximation step:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; Q_k(\theta) = Q_{k-1}(\theta) + \gamma_k \left( \frac{1}{L}\sum_{\ell=1}^{L} \log \pmacro(\by,\bpsi^{(k,\ell)};\theta) - Q_{k-1}(\theta) \right) .<br /> &lt;/math&gt; }}<br /> <br /> : By default, $\monolix$ selects $L$ so that $N\times L \geq 50$.<br /> }}<br /> <br /> <br /> Implementation of SAEM is simplified when the complete model $\pmacro(\by,\bpsi;\theta)$ belongs to a regular (curved) exponential family:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \pmacro(\by,\bpsi ;\theta) = \exp\left\{ - \zeta(\theta) + \langle \tilde{S}(\by,\bpsi) , \varphi(\theta) \rangle \right\} , &lt;/math&gt; }}<br /> <br /> where $\tilde{S}(\by,\bpsi)$ is a sufficient statistic of the complete model (i.e., whose value contains all the information needed to compute any estimate of $\theta$) which takes its values in an open subset ${\cal S}$ of $\Rset^m$.
Then, there exists a function $\tilde{\theta}$ such that for any $s\in {\cal S}$,<br /> <br /> {{EquationWithRef<br /> |equation=&lt;div id=&quot;eq:saem_stat&quot;&gt;&lt;math&gt;<br /> \tilde{\theta}(s) = \argmax{\theta} \left\{ - \zeta(\theta) + \langle s , \varphi(\theta) \rangle \right\} .<br /> &lt;/math&gt;&lt;/div&gt;<br /> |reference=(1) }}<br /> <br /> The approximation step of SAEM simplifies to a general Robbins-Monro-type scheme for approximating this conditional expectation:<br /> <br /> <br /> * $\textbf{Stochastic approximation}$: update $s_k$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> s_k = s_{k-1} + \gamma_k ( \tilde{S}(\by,\bpsi^{(k)}) - s_{k-1} ) . &lt;/math&gt; }}<br /> <br /> <br /> Note that the E-step of EM simplifies to computing $s_k=\esp{\tilde{S}(\by,\bpsi) | \by ; \theta_{k-1}}$.<br /> <br /> Then, both EM and SAEM algorithms use [[#eq:saem_stat|(1)]] for the M-step: $\theta_k = \tilde{\theta}(s_k)$.<br /> <br /> Precise results for convergence of SAEM were obtained in the [[Estimation of the observed Fisher information matrix#Estimation using linearization of the model|Estimation of the F.I.M. using a linearization of the model]] chapter in the case where $\pmacro(\by,\bpsi;\theta)$ belongs to a regular curved exponential family. This first version of [[The SAEM algorithm for estimating population parameters|SAEM]] and these first results assume that the individual parameters are simulated exactly under the conditional distribution at each iteration. Unfortunately, for most nonlinear models or non-Gaussian models, the unobserved data cannot be simulated exactly under this conditional distribution. 
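The stochastic approximation step above is a standard Robbins-Monro averaging scheme. As a minimal numerical illustration (using a hypothetical scalar statistic, not tied to any particular model), the recursion $s_k = s_{k-1} + \gamma_k ( S_k - s_{k-1} )$ with $\gamma_k=1/k$ simply tracks the running mean of the simulated statistics and converges to their expectation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Robbins-Monro averaging: s_k = s_{k-1} + gamma_k * (S_k - s_{k-1}).
# Here S_k is a noisy draw of a hypothetical statistic with expectation 2.0.
target = 2.0
s = 0.0
for k in range(1, 5001):
    S_k = target + rng.normal(0.0, 1.0)  # one "simulation" of the statistic
    s += (1.0/k) * (S_k - s)             # with gamma_k = 1/k, s is the running mean
print(s)  # close to 2.0
```

With $\gamma_k = 1$ instead, $s$ would simply equal the latest draw $S_k$ at every iteration, which is the memoryless SEM behavior described above.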
A well-known alternative consists in using the Metropolis-Hastings algorithm: introduce a transition kernel whose unique invariant distribution is the conditional distribution we want to simulate.<br /> <br /> In other words, the procedure consists of replacing the Simulation step of SAEM at iteration $k$ by $m$ iterations of the<br /> Metropolis-Hastings (MH) algorithm described in the [[The Metropolis-Hastings algorithm for simulating the individual parameters|The Metropolis-Hastings algorithm]] section. It was shown in the [[Estimation of the observed Fisher information matrix#Estimation using linearization of the model|Estimation of the F.I.M. using a linearization of the model]] section that [[The SAEM algorithm for estimating population parameters|SAEM]] still converges under general conditions when coupled with a Markov chain Monte Carlo procedure.<br /> <br /> <br /> {{Remarks<br /> |title= Remark<br /> |text= Convergence of the Markov chains $(\psi_i^{(k)})$ is not necessary at each SAEM iteration. It suffices to run a few MH iterations with various transition kernels before updating $\theta_{k-1}$. In $\monolix$ by default, three transition kernels are used twice each, successively, in each SAEM iteration.<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> <br /> == Implementing SAEM ==<br /> <br /> Implementation of SAEM can be difficult to describe when looking at complex statistical models such as mixture models, models with inter-occasion variability, etc. We are therefore going to limit ourselves to looking at some basic models in order to illustrate how SAEM can be implemented.<br /> <br /> &lt;br&gt;<br /> ===SAEM for general hierarchical models===<br /> <br /> Consider first a very general model for any type (continuous, categorical, survival, etc.)
of data $(y_i)$:<br /> <br /> {{Equation1<br /> |equation= &lt;math&gt;\begin{eqnarray} y_i {{!}} \psi_i &amp;\sim&amp; \pcyipsii(y_i {{!}} \psi_i) \\<br /> h(\psi_i) &amp;\sim&amp; {\cal N}( \mu , \Omega),<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $h(\psi_i)=(h_1(\psi_{i,1}), h_2(\psi_{i,2}), \ldots , h_d(\psi_{i,d}) )^\transpose$ is a $d$-vector of (transformed) individual parameters, $\mu$ a $d$-vector of fixed effects and $\Omega$ a $d\times d$ variance-covariance matrix.<br /> <br /> We assume here that $\Omega$ is positive-definite. Then, a sufficient statistic for the complete model $\pmacro(\by,\bpsi;\theta)$ is<br /> $\tilde{S}(\bpsi) = (\tilde{S}_1(\bpsi),\tilde{S}_2(\bpsi))$, where<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \tilde{S}_1(\bpsi) &amp;= &amp; \sum_{i=1}^N h(\psi_i) \\<br /> \tilde{S}_2(\bpsi) &amp;= &amp; \sum_{i=1}^N h(\psi_i) h(\psi_i)^\transpose .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> At iteration $k$ of SAEM, we have:<br /> <br /> <br /> * $\textbf{Simulation step}$: for $i=1,2,\ldots, N$, draw $\psi_i^{(k)}$ from $m$ iterations of the MH algorithm described in [[The Metropolis-Hastings algorithm for simulating the individual parameters|The Metropolis-Hastings algorithm]] with $\pmacro(\psi_i |y_i ;\mu_{k-1},\Omega_{k-1})$ as limiting distribution.<br /> <br /> * $\textbf{Stochastic approximation}$: update $s_k=(s_{k,1},s_{k,2})$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> s_{k,1} &amp;=&amp; s_{k-1,1} + \gamma_k \left( \sum_{i=1}^N h(\psi_i^{(k)}) - s_{k-1,1} \right) \\<br /> s_{k,2} &amp;=&amp; s_{k-1,2} + \gamma_k \left( \sum_{i=1}^N h(\psi_i^{(k)})h(\psi_i^{(k)})^\transpose - s_{k-1,2} \right) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> * $\textbf{Maximization step}$: update $(\mu_{k-1},\Omega_{k-1})$ according to<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \mu_{k} &amp;=&amp; 
\frac{1}{N} s_{k,1} \\<br /> \Omega_k &amp;=&amp; \frac{1}{N} s_{k,2} - \mu_k \mu_k^\transpose .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> What is remarkable is that it suffices to be able to calculate $\pcyipsii(y_i | \psi_i)$ for all $\psi_i$ and $y_i$ in order to be able to run SAEM. In effect, this allows the simulation step to be run using MH since the acceptance probabilities can be calculated.<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ===SAEM for continuous data models===<br /> Consider now a continuous data model in which the residual error variance is constant:<br /> <br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> y_{ij} &amp;=&amp; f(t_{ij},\phi_i) + a \teps_{ij} \\<br /> h(\psi_i) &amp;\sim&amp; {\cal N}( \mu , \Omega) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> Here, the individual parameters are $\psi_i=(\phi_i,a)$. The variance-covariance matrix for $\psi_i$ is not positive-definite in this case because $a$ has no variability. If we suppose that the variance matrix $\Omega$ is positive-definite, then, writing $\theta=(\mu,\Omega,a)$, a natural decomposition of the model is:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\pmacro(\by,\bpsi;\theta) = \pmacro(\by {{!}} \bpsi;a)\pmacro(\bpsi;\mu,\Omega) .<br /> &lt;/math&gt; }}<br /> <br /> The previous statistic $\tilde{S}(\bpsi) = (\tilde{S}_1(\bpsi),\tilde{S}_2(\bpsi))$ is not sufficient for estimating $a$. Indeed, we need an additional component which is a function both of $\by$ and $\bpsi$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \tilde{S}_3(\by, \bpsi) =\sum_{i=1}^N \sum_{j=1}^{n_i}(y_{ij} - f(t_{ij},\phi_i))^2.
&lt;/math&gt; }}<br /> <br /> Then,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \begin{eqnarray}<br /> s_{k,3} &amp;=&amp; s_{k-1,3} + \gamma_k ( \tilde{S}_3(\by, \bpsi^{(k)}) - s_{k-1,3} ) \\<br /> a_k^2 &amp;=&amp; \displaystyle{ \frac{1}{\sum_{i=1}^N n_i} s_{k,3} }\ .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The choice of step-size $(\gamma_k)$ is extremely important for ensuring convergence of SAEM. The sequence $(\gamma_k)$ used in $\monolix$ decreases like $k^{-\alpha}$. We recommend using $\alpha=0$ (that is, $\gamma_k=1$) during the first $K_1$ iterations, in order to converge quickly to a neighborhood of a maximum of the likelihood, and $\alpha=1$ during the next $K_2$ iterations.<br /> Indeed, the initial guess $\theta_0$ may be far from the maximum likelihood value we are looking for, and the first iterations with $\gamma_k=1$ allow SAEM to converge quickly to a neighborhood of this value. Following this, smaller step-sizes ensure the<br /> almost sure convergence of the algorithm to the maximum likelihood estimator.<br /> <br /> <br /> <br /> {{Example<br /> |title=Example<br /> |text= Consider a simple model for continuous data:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> y_{ij} &amp;\sim&amp; {\cal N}(A_i\,e^{-k_i \, t_{ij} } , a^2) \\<br /> \log(A_i)&amp;\sim&amp;{\cal N}(\log(A_{\rm pop}) , \omega_A^2) \\<br /> \log(k_i)&amp;\sim&amp;{\cal N}(\log(k_{\rm pop}) , \omega_k^2) ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $A_{\rm pop}=6$, $k_{\rm pop}=0.25$, $\omega_A=0.3$, $\omega_k=0.3$ and $a=0.2$.<br /> Let us look at the effect of different settings for $(\gamma_k)$ (and $L$) for estimating the population parameters of the model with SAEM.<br /> <br /> <br /> 1. For all $k$, $\gamma_k = 1$: the sequence $(\theta_{k})$ converges very quickly to a neighborhood of the &quot;solution&quot;.
The sequence $(\theta_{k})$ is a homogeneous Markov chain that converges in distribution but does not converge almost surely. <br /> <br /> [[File:saem1.png|link=]]<br /> <br /> <br /> 2. For all $k$, $\gamma_k = 1/k$: the sequence $(\theta_{k})$ converges almost surely to the maximum likelihood estimate of $\theta$, but very slowly. <br /> <br /> [[File:saem2.png|link=]]<br /> <br /> <br /> 3. $\gamma_k = 1$ for $k=1,\ldots,40$ and $\gamma_k = 1/(k-40)$ for $k \geq 41$: the sequence $(\theta_{k})$ converges almost surely to the maximum likelihood estimate of $\theta$, and quickly.<br /> <br /> [[File:saem3.png|link=]]<br /> <br /> <br /> 4. $L=10$, $\gamma_k = 1$, $k \geq 1$: the sequence $(\theta_{k})$ is a homogeneous Markov chain that converges in distribution, as in Example 1, but the standard deviation of the fluctuations is reduced by a factor $\sqrt{10}$; in this case, SAEM behaves like EM. <br /> <br /> [[File:saem4.png|link=]]<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ==A simple example to understand why SAEM converges in practice==<br /> <br /> <br /> Let us look at a very simple Gaussian model, with only one observation per individual:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \psi_i &amp;\sim&amp; {\cal N}(\theta,\omega^2) , \ \ \ 1 \leq i \leq N \\<br /> y_i &amp;\sim&amp; {\cal N}(\psi_i,\sigma^2).<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> We will furthermore assume that both $\omega^2$ and $\sigma^2$ are known.<br /> <br /> Here, the maximum likelihood estimator $\hat{\theta}$ of $\theta$ is easy to compute since $y_i \sim_{i.i.d.} {\cal N}(\theta,\omega^2+\sigma^2)$. We find that<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \hat{\theta} = \displaystyle{\frac{1}{N} }\sum_{i=1}^{N} y_i .<br /> &lt;/math&gt;}}<br /> <br /> We now propose to compute $\hat{\theta}$ using SAEM instead.
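Since everything is explicit in this toy model, the whole experiment is easy to reproduce. Below is a minimal Python sketch (an illustration only, not the $\monolix$ implementation), using the exact conditional distribution of $\psi_i$ given $y_i$ and the two-phase step-size schedule recommended above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: psi_i ~ N(theta, omega^2), y_i | psi_i ~ N(psi_i, sigma^2),
# with omega and sigma known. The MLE is the empirical mean of the y_i.
N, theta_true, omega, sigma = 100, 3.0, 1.0, 0.5
y = rng.normal(theta_true, np.sqrt(omega**2 + sigma**2), size=N)
theta_hat = y.mean()

# Exact conditional distribution: psi_i | y_i ~ N(a*theta + (1-a)*y_i, g2),
# so no Metropolis-Hastings is needed in the simulation step.
prec = 1/sigma**2 + 1/omega**2
a = (1/omega**2) / prec
g2 = 1/prec

K1, K2 = 20, 200          # gamma_k = 1 first, then 1/(k - K1)
theta, s = 0.0, 0.0       # deliberately poor initial guess theta_0 = 0
for k in range(1, K1 + K2 + 1):
    gamma_k = 1.0 if k <= K1 else 1.0/(k - K1)
    # Simulation step: draw each psi_i from its conditional distribution
    psi = rng.normal(a*theta + (1 - a)*y, np.sqrt(g2))
    # Stochastic approximation of the sufficient statistic S(psi) = sum(psi)
    s += gamma_k*(psi.sum() - s)
    # Maximization step
    theta = s/N

print(theta, theta_hat)   # the two values are close
```

Keeping $\gamma_k=1$ throughout instead makes the printed estimate keep fluctuating around $\hat{\theta}$, reproducing the behavior of setting 1 above.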
The simulation step is straightforward since the conditional distribution of $\psi_i$ is a normal distribution:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> \psi_i {{!}} y_i \sim {\cal N}(a \theta + (1-a)y_i , \gamma^2) ,<br /> &lt;/math&gt; }}<br /> <br /> where<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> a &amp;= &amp; \displaystyle{ \frac{1}{\omega^2} } \left(\displaystyle{ \frac{1}{\sigma^2} }+ \displaystyle{\frac{1}{\omega^2} }\right)^{-1} \\<br /> \gamma^2 &amp;= &amp;\left(\displaystyle{ \frac{1}{\sigma^2} }+ \displaystyle{\frac{1}{\omega^2} }\right)^{-1}.<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The maximization step is also straightforward. Indeed, a sufficient statistic for estimating $\theta$ is<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; {\cal S}(\bpsi) = \sum_{i=1}^{N} \psi_i. &lt;/math&gt; }}<br /> <br /> Then,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \tilde{\theta}({\cal S(\bpsi)} ) &amp;=&amp; \argmax{\theta} \pmacro(y_1,\ldots,y_N,\psi_1,\ldots,\psi_N;\theta) \\<br /> &amp;=&amp; \argmax{\theta} \pmacro(\psi_1,\ldots,\psi_N;\theta) \\<br /> &amp;=&amp; \frac{ {\cal S}(\bpsi)}{N}.<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Let us first look at the behavior of SAEM when $\gamma_k=1$. At iteration $k$,<br /> <br /> <br /> * Simulation step: $\psi_i^{(k)} \sim {\cal N}( a \theta_{k-1} + (1-a)y_i , \gamma^2).$<br /> <br /> * Maximization step: $\theta_k = \displaystyle{ \frac{ {\cal S}(\bpsi^{(k)})}{N} } = \displaystyle{ \frac{1}{N} }\sum_{i=1}^{N} \psi_i^{(k)}$.<br /> <br /> <br /> It can be shown that:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_k - \hat{\theta} = a(\theta_{k-1} - \hat{\theta}) + e_k ,<br /> &lt;/math&gt; }}<br /> <br /> where $e_k \sim {\cal N}(0, \gamma^2 /N)$. 
Then, the sequence $(\theta_k)$ is an autoregressive process of order 1 (AR(1)) which converges in distribution to a normal distribution when $k\to \infty$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\theta_k \limite{}{\cal D} {\cal N}\left(\hat{\theta} , \displaystyle{ \frac{\gamma^2}{N(1-a^2)} }\right) .<br /> &lt;/math&gt; }}<br /> <br /> <br /> {{ImageWithCaption|image=saemb1.png|caption=10 sequences $(\theta_k)$ obtained with different initial values and $\gamma_k=1$ for $1\leq k \leq 50$ }} <br /> <br /> <br /> Now, let us see what happens instead when $\gamma_k$ decreases like $1/k$. At iteration $k$,<br /> <br /> <br /> * Simulation step: $\psi_i^{(k)} \sim {\cal N}( a \theta_{k-1} + (1-a)y_i , \gamma^2)$<br /> <br /> * Maximization step:<br /> <br /> {{Equation1<br /> |equation= &lt;math&gt;\theta_k = \theta_{k-1} + \displaystyle{ \frac{1}{k} }\left( \displaystyle{ \frac{1}{N} }\sum_{i=1}^{N} \psi_i^{(k)} -\theta_{k-1} \right). <br /> &lt;/math&gt; }}<br /> <br /> <br /> : Here, we can show that:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \theta_k - \hat{\theta} = \displaystyle{ \frac{k-a}{k} }(\theta_{k-1} - \hat{\theta}) + \displaystyle{\frac{e_k}{k} }, <br /> &lt;/math&gt; }}<br /> <br /> : where $e_k \sim {\cal N}(0, \gamma^2 /N)$. 
Then, the sequence $(\theta_k)$ converges almost surely to $\hat{\theta}$.<br /> <br /> <br /> {{ImageWithCaption|image=saemb2.png|caption=10 sequences $(\theta_k)$ obtained with different initial values and $\gamma_k=1/k$ for $1\leq k \leq 50$ }}<br /> <br /> <br /> Thus, we see that by combining the two strategies, the sequence $(\theta_k)$ is a Markov chain that quickly reaches a neighborhood of $\hat{\theta}$ and fluctuates randomly around it during the first $K_1$ iterations, then converges almost surely to $\hat{\theta}$ during the next $K_2$ iterations.<br /> <br /> <br /> {{ImageWithCaption|image=saemb3.png|caption=10 sequences $(\theta_k)$ obtained with different initial values, $\gamma_k=1$ for $1\leq k \leq 20$ and $\gamma_k=1/(k-20)$ for $21\leq k \leq 50$ }}<br /> <br /> <br /> {{ShowVideo|image=saem5b.png|video=http://popix.lixoft.net/images/2/20/saem.mp4|caption=The SAEM algorithm in practice. }}<br /> <br /> &lt;!-- {{ImageWithCaptionL|image=saem5.png|size=750px|caption= The SAEM algorithm in practice. (a) the observations and the initialization $p_0(\psi_i)$, (b) the initialization $p_0(\psi_i)$ and the conditional distributions of the observations $p(y_i{{!}}\psi_i)$, (c) the conditional distributions $p_0(\psi_i{{!}}y_i)$ and the simulated individual parameters $(\psi_i^{(1)})$, (d) the updated distribution $p_1(\psi_i)$. }} --&gt;<br /> <br /> ==A simulated annealing version of SAEM==<br /> <br /> <br /> Convergence of SAEM can strongly depend on the initial guess when the likelihood ${\like}$ has several local maxima.
A simulated annealing version of SAEM can improve convergence of the algorithm toward the global maximum of ${\like}$.<br /> <br /> To detail this, we can first rewrite the joint pdf of $(\by,\bpsi)$ as follows:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \pmacro(\by,\bpsi;\theta) = C(\theta)\, \exp \left\{-U(\by,\bpsi;\theta)\right\} ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $C(\theta)$ is a normalizing constant that only depends on $\theta$. Then, for any &quot;temperature&quot; $T\geq0$, we consider the complete model<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\pmacro_T(\by,\bpsi;\theta) = C_T(\theta)\, \exp \left\{-\displaystyle{\frac{1}{T} }U(\by,\bpsi;\theta) \right\} ,<br /> &lt;/math&gt; }}<br /> <br /> where $C_T(\theta)$ is still a normalizing constant.<br /> <br /> We then introduce a decreasing temperature sequence $(T_k, 1\leq k \leq K)$ and use the SAEM algorithm on the complete model $\pmacro_{T_k}(\by,\bpsi;\theta)$ at iteration $k$ (the usual version of SAEM uses $T_k=1$ at each iteration). 
The sequence $(T_k)$ is chosen to have large positive values during the first iterations, then to decrease exponentially to 1: $T_k = \max(1, \tau \ T_{k-1})$.<br /> <br /> Consider for example the following model for continuous data:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> y_{ij} &amp;\sim&amp; {\cal N}(f(t_{ij};\psi_i) , a^2) \\<br /> h(\psi_i) &amp;\sim&amp; {\cal N}(\mu , \Omega) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Here, $\theta = (\mu,\Omega,a^2)$ and<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> \pmacro(\by,\bpsi;\theta) = C(\theta)\, \exp \left\{- \displaystyle{ \frac{1}{2 a^2} }\sum_{i=1}^N \sum_{j=1}^{n_i} (y_{ij} - f(t_{ij};\psi_i))^2 - \displaystyle{ \frac{1}{2} } \sum_{i=1}^N (h(\psi_i)-\mu)^\transpose \Omega^{-1} (h(\psi_i)-\mu) \right\},<br /> &lt;/math&gt; }}<br /> <br /> where $C(\theta)$ is a normalizing constant that only depends on $a$ and $\Omega$.<br /> <br /> <br /> We see that $\pmacro_T(\by,\bpsi;\theta)$ will also be a normal distribution whose residual error variance $a^2$ is replaced by $T a^2$ and variance matrix $\Omega$ for the random effects by $T\Omega$.<br /> In other words, a model with a &quot;large temperature&quot; is a model with large variances.<br /> <br /> The algorithm therefore consists in choosing large initial variances $\Omega_0$ and $a^2_0$ (that include the initial temperature $T_0$ implicitly) and setting $a^2_k = \max(\tau \ a^2_{k-1} , \hat{a}^2(\by,\bpsi^{(k)}))$ and $\Omega_k = \max(\tau \ \Omega_{k-1} , \hat{\Omega}(\bpsi^{(k)}))$ during the first iterations.
Here, $0\leq\tau\leq 1$.<br /> <br /> These large values of the variance make the conditional distributions $\pmacro_T(\psi_i | y_i;\theta)$ less concentrated around their modes, and thus allow the sequence $(\theta_k)$ to &quot;escape&quot; from local maxima of the likelihood during the first iterations of SAEM and converge to a neighborhood of the global maximum of ${\like}$.<br /> After these initial iterations, the usual SAEM algorithm is used to estimate these variances at each iteration.<br /> <br /> <br /> {{Remarks<br /> |title= Remark<br /> |text= We can use two different coefficients $\tau_1$ and $\tau_2$ for $\Omega$ and $a^2$ in $\monolix$. It is possible, for example, to choose $\tau_1&lt;1$ and $\tau_2&gt;1$, with large initial inter-subject variances $\Omega_0$ and small initial residual variance $a^2_0$. In this case, SAEM tries to obtain the best possible fit during the first iterations, allowing for a large inter-subject variability. During the next iterations, this variability is reduced and the residual variance increases until reaching the best possible trade-off between the two criteria.<br /> }}<br /> <br /> <br /> {{Example<br /> |title=A PK example<br /> |text= <br /> <br /> Consider a simple one-compartment model for oral administration:<br /> <br /> {{EquationWithRef<br /> |equation=&lt;div id=&quot;eq:saem_sa&quot;&gt;&lt;math&gt;<br /> f(t;ka,V,ke) = \displaystyle{ \frac{D\, ka}{V(ka-ke)} }\left( e^{-ke \, t} - e^{-ka \, t} \right) .<br /> &lt;/math&gt;&lt;/div&gt;<br /> |reference=(2) }}<br /> <br /> We then simulate PK data from 80 patients using the following population PK parameters:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; ka_{\rm pop} = 1, \quad V_{\rm pop}=8, \quad ke_{\rm pop}=0.25 .&lt;/math&gt; }}<br /> <br /> We can see that the following parametrization gives the same prediction as the one given in [[#eq:saem_sa|(2)]]:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \tilde{ka} = ke, \quad \tilde{V}=V \times
ke/ka, \quad \tilde{ke}=ka . &lt;/math&gt; }}<br /> <br /> We can then expect a (global) maximum around $(ka,V,ke) = (1, \ 8, \ 0.25)$ and a (local) maximum of the likelihood around $(ka,V,ke) = (0.25, \ 2, \ 1).$<br /> <br /> The figure below displays the convergence of SAEM without simulated annealing to a local maximum of the likelihood (deviance = $-2\,\log {\like} =816$). The initial values of the population parameters we chose were $(ka_0,V_0,ke_0) = (1,1,1)$.<br /> <br /> :{{ImageWithCaption_special|image=recuit1.png|caption=Convergence of SAEM to a local maximum of the likelihood}} <br /> <br /> Using the same initial guess, the simulated annealing version of SAEM converges to the global maximum of the likelihood (deviance = 734).<br /> <br /> :{{ImageWithCaption_special|image=recuit2.png|caption=Convergence of SAEM to the global maximum of the likelihood }}<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> == Bibliography ==<br /> <br /> <br /> &lt;bibtex&gt;<br /> @article{allassonniere2010construction,<br /> title={Construction of Bayesian deformable models via a stochastic approximation algorithm: a convergence study},<br /> author={Allassonni{\`e}re, S. and Kuhn, E. and Trouv{\'e}, A.},<br /> journal={Bernoulli},<br /> volume={16},<br /> number={3},<br /> pages={641--678},<br /> year={2010},<br /> publisher={Bernoulli Society for Mathematical Statistics and Probability}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{delattre2012maximum,<br /> title={Maximum likelihood estimation in discrete mixed hidden Markov models using the SAEM algorithm},<br /> author={Delattre, M.
and Lavielle, M.},<br /> journal={Computational Statistics &amp; Data Analysis},<br /> year={2012},<br /> volume={56},<br /> pages={2073-2085}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{delattre2013sde,<br /> title={Coupling the SAEM algorithm and the extended Kalman filter for maximum likelihood estimation in mixed-effects diffusion models},<br /> author={Delattre, M. and Lavielle, M.},<br /> journal={Statistics and its interfaces},<br /> year={2013},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{delyon1999convergence,<br /> title={Convergence of a stochastic approximation version of the EM algorithm},<br /> author={Delyon, B. and Lavielle, M. and Moulines, E.},<br /> journal={Annals of Statistics},<br /> pages={94-128},<br /> year={1999},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{dempster1977maximum,<br /> title={Maximum likelihood from incomplete data via the EM algorithm},<br /> author={Dempster, A.P. and Laird, N.M. and Rubin, D.B.},<br /> journal={Journal of the Royal Statistical Society. Series B (Methodological)},<br /> pages={1-38},<br /> year={1977},<br /> publisher={JSTOR}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{kuhn2004coupling,<br /> title={Coupling a stochastic approximation version of EM with an MCMC procedure},<br /> author={Kuhn, E. and Lavielle, M.},<br /> journal={ESAIM: Probability and Statistics},<br /> volume={8},<br /> pages={115-131},<br /> year={2004},<br /> publisher={EDP Sciences, 17 Avenue du Hoggar Les Ulis Cedex A BP 112 91944 France}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{lavielle2013improved,<br /> title={An improved SAEM algorithm for maximum likelihood estimation in mixtures of non linear mixed effects models},<br /> author={Lavielle, M. 
and Mbogning, C.},<br /> journal={Statistics and Computing},<br /> year={2013},<br /> publisher={Springer}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{mclachlan2007algorithm,<br /> title={The EM algorithm and extensions},<br /> author={McLachlan, G.J. and Krishnan, T.},<br /> volume={382},<br /> year={2007},<br /> publisher={Wiley-Interscience}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{samson2006extension,<br /> title={Extension of the SAEM algorithm to left-censored data in nonlinear mixed-effects model: Application to HIV dynamics model},<br /> author={Samson, A. and Lavielle, M. and Mentr&amp;eacute;, F.},<br /> journal={Computational statistics &amp; data analysis},<br /> volume={51},<br /> number={3},<br /> pages={1562-1574},<br /> year={2006},<br /> publisher={Elsevier}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{wei1990monte,<br /> title={A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms},<br /> author={Wei, G. 
and Tanner, M.},<br /> journal={Journal of the American Statistical Association},<br /> volume={85},<br /> number={411},<br /> pages={699-704},<br /> year={1990},<br /> publisher={Taylor &amp; Francis}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{wu1983convergence,<br /> title={On the convergence properties of the EM algorithm},<br /> author={Wu, C.F.},<br /> journal={The Annals of Statistics},<br /> volume={11},<br /> number={1},<br /> pages={95-103},<br /> year={1983},<br /> publisher={Institute of Mathematical Statistics}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> <br /> <br /> {{Back&amp;Next<br /> |linkBack=Introduction and notation<br /> |linkNext=The Metropolis-Hastings algorithm for simulating the individual parameters }}</div> Marc https://wiki.inria.fr/wikis/popix/index.php?title=Model_evaluation&diff=7170 Model evaluation 2013-06-04T18:21:44Z <p>Marc: /* Bibliography */</p> <hr /> <div>== Introduction ==<br /> <br /> Defining the expression &quot;model evaluation&quot; is a harder task than it may appear at first glance. Intuitively, it would seem to suggest evaluating the performance of a model based on the observed data, the same data that was used to build the model. Fair enough, but what then do we mean by the &quot;performance&quot; of a model?<br /> <br /> Do we mean the ability of the model to characterize and explain the phenomena being studied, in which case the goal is to use the model to understand the phenomena? Or do we mean the model's predictive performance when the model is used to predict the phenomena's behavior, either in the future or under new experimental conditions?<br /> <br /> What it comes down to is this: do we want to use the model to understand or to predict? This is the key question to ask before even starting to think about what tools to use and tasks to execute.<br /> <br /> Here, we will be for the most part focused on the ability of a model to explain the phenomena and data.
Therefore, the first goal will be to check whether the data are in agreement with the model, and vice versa. In this process, model diagnostics can be used to eliminate model candidates that do not seem capable of reproducing the observed data.<br /> As is usual in statistics, the fact that a model has not been rejected does not mean that it is the &quot;true&quot; one. All that we can say is that the experimental data does not allow us to reject this model. It is merely one of perhaps many models that cannot be rejected. Indeed, we can usually find several models that get past this first diagnostic step and are therefore not rejected.<br /> <br /> What to do, then, when several possible models are retained? Well, we can try to select the &quot;best&quot; one (or best ones if no single model distinguishes itself from the rest). This means developing a model selection process which allows us to compare the models to each other. Compare models? But with what criteria?<br /> <br /> In a purely explanatory context, Occam's razor is a useful parsimony principle which states that among competing hypotheses, the one with the fewest assumptions should be selected. In a modeling context, this means that among valid competing models, the simplest one should be selected.<br /> <br /> Model diagnostic tools are for the most part graphical or visual: we &quot;see&quot; when something is not right between a chosen model and the data it is hypothesized to describe. Model selection tools, on the other hand, are analytical: we calculate the values of criteria that allow us to compare the models with each other. However, it is absolutely critical to keep in mind the limits of these tools. These are not decision-making tools. It is not a $p$-value or some information criterion that can automatically decide which model to choose. It is always the modeler who must have the last word!
This person uses the model diagnostic and selection tools in order to guide their decision, but in the end, it is they who must make the final decision. There is nothing more dangerous than rules applied without thinking and arbitrary cut-off values used blindly without reflection.<br /> <br /> Here, we will not look at model selection techniques based on the various models' predictive performances. One such approach consists of splitting the data into three sets: a ''learning set'' is used for fitting the model, a ''validation set'' for choosing between models and a ''test set'' to assess the quality of the predictions made by the chosen model.<br /> <br /> Very few model diagnostic and selection tools exist for mixed-effects models. One of the most complete is [http://xpose.sourceforge.net Xpose], an R-based model-building aid for population analysis using NONMEM that facilitates model diagnostics, candidate covariate identification and model comparison. Here we will use $\monolix$ and illustrate these techniques with the example used previously for model exploration and parameter estimation.<br /> <br /> &lt;br&gt;<br /> <br /> == Model diagnostics==<br /> <br /> &lt;br&gt;<br /> === Model diagnostics and statistical tests ===<br /> <br /> Suppose first that a model has been entirely defined by the modeler, and that its parameters have either been chosen or estimated.<br /> What we call the &quot;model&quot; is therefore a joint probability distribution along with some parameter values.<br /> <br /> <br /> Let ${\cal M}_0$ denote the model we wish to evaluate. We place ourselves in the framework of statistical testing, and would like to perform the following hypothesis test:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; H_0: \quad {\cal M}={\cal M}_0 \quad vs \quad H_1: \quad {\cal M}\neq{\cal M}_0. &lt;/math&gt;}}<br /> <br /> &quot;Passing&quot; the test does not mean that we accept $H_0$ but rather that we do not reject it.
We will use the same point of view for model diagnostics whereby we eliminate model candidates that do not seem capable of reproducing the observed data, i.e., models for which we conclude that ${\cal M}\neq{\cal M}_0$.<br /> <br /> <br /> {{Remarks<br /> |title=Remark<br /> |text=Running a statistical test only makes sense when we have doubts about whether to retain a model.<br /> If it is clear from the beginning that, for example, the structural model is totally misspecified (e.g., we use a linear function of time even though a curve is clearly visible in the data), any basic goodness-of-fit plot (individual fits, observation vs prediction, residuals, etc.) will detect this misspecification without any doubt and without the need to evaluate the probability of making a mistake.<br /> }}<br /> <br /> <br /> To put such a statistical test into practice, we usually construct a test statistic $T(\by)$ which is a function of the observations and for which we are able to calculate a distribution under the null hypothesis $H_0$.<br /> <br /> For a given significance level $\alpha$, we then define a rejection region $R_\alpha$ such that:<br /> <br /> {{Equation1<br /> |equation= &lt;math&gt;\probs{H_0}{T(\by) \in R_\alpha} = \alpha .
&lt;/math&gt; }}<br /> <br /> Thus, $\alpha$ is the probability of incorrectly rejecting the null hypothesis $H_0$.<br /> <br /> The difficulty in creating and using such tests comes from two main things:<br /> <br /> <br /> &lt;ul&gt;<br /> * We need to be capable of calculating the distribution of the test statistic $T(\by)$ under $H_0$ in order to carefully track the significance level, i.e., ensure that the probability of incorrectly rejecting $H_0$ is indeed $\alpha$.<br /> &lt;br&gt;<br /> <br /> * Being able to control the type I error $\alpha$ is of no interest if the test has low power, i.e., if the probability of correctly rejecting $H_0$ is low.<br /> &lt;/ul&gt; <br /> <br /> <br /> In the present context, the first point is clearly a problem. Due to the complexity of the models we are interested in, it is impossible to analytically calculate the distribution of a function of the observations, even for something as simple as the empirical mean $\overline{\by}=\sum_{i,j}y_{ij}/\sum_i{n_i}$.<br /> Using limit theorems to approximate such distributions is also more or less hopeless. Perhaps the most powerful, general and precise solution we have available to us is Monte Carlo simulation:<br /> <br /> <br /> &lt;ul&gt;<br /> * Generate independent values $\by^{(1)}, \by^{(2)}, \ldots , \by^{(L)}$ under the model ${\cal M}_0$ using the same design and covariates as in the original data. <br /> &lt;br&gt;<br /> <br /> * Calculate the $L$ statistics $T(\by^{(1)}), T(\by^{(2)}), \ldots , T(\by^{(L)})$.<br /> &lt;br&gt;<br /> <br /> * Estimate the distribution of $T(\by)$ under ${\cal M}_0$ with the empirical distribution of the $T(\by^{(\ell)})$.<br /> &lt;/ul&gt;<br /> <br /> <br /> The estimation error essentially depends on the number $L$ of simulated data sets. We must therefore choose $L$ large enough so that this error is negligible.<br /> <br /> This solves part of the problem. 
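For a scalar statistic, the Monte Carlo procedure above can be sketched in a few lines. This is only a minimal illustration, not $\monolix$ code: the toy model, the sample sizes and the choice of the empirical mean as test statistic are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dataset(rng, n_subjects=20, n_obs=5):
    """Simulate one data set under a toy model M0 (hypothetical values):
    eta_i ~ N(0, omega^2), y_ij = (mu + eta_i) + sigma * eps_ij."""
    mu, omega, sigma = 10.0, 2.0, 1.0
    eta = rng.normal(0.0, omega, size=n_subjects)
    eps = rng.normal(0.0, sigma, size=(n_subjects, n_obs))
    return (mu + eta)[:, None] + eps

def T(y):
    """Test statistic: here simply the empirical mean of all observations."""
    return y.mean()

# 1) generate L independent data sets under M0,
# 2) compute T for each of them,
# 3) use the empirical distribution of the T's to get a 90% prediction interval.
L = 1000
stats = np.array([T(simulate_dataset(rng)) for _ in range(L)])
lo, hi = np.quantile(stats, [0.05, 0.95])

y_obs = simulate_dataset(rng)  # stand-in for the observed data
print(f"T(y_obs) = {T(y_obs):.2f}, 90% prediction interval under M0: [{lo:.2f}, {hi:.2f}]")
```

An observed statistic falling outside the prediction interval then suggests rejecting ${\cal M}_0$, at least with respect to this particular statistic.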
But it remains hard to define a rejection region $R_\alpha$ if the test statistic is multidimensional. We can of course calculate relevant prediction intervals for each component of $T(\by)$ individually, but this does not really help us to define a true multidimensional rejection region.<br /> <br /> Consequently, diagnostic methods are essentially visual: we compare the observed statistic $T(\by)$ with the expected distribution under ${\cal M}_0$ by graphically displaying (for example) 90% or 95% prediction intervals for each component of $T(\by)$.<br /> <br /> The second point is also delicate because we need to decide what ${\cal M}\neq{\cal M}_0$ means, i.e., $H_0$ being false.<br /> Does it mean that the structural model is misspecified? Or the distribution of the random effects, the residual error model, the covariate model? There are so many ways in which a model can be misspecified that we cannot realistically expect to be able to create one unique statistic sufficiently powerful to detect all of these at once. We therefore prefer to construct several different test statistics, i.e., several graphical diagnostic tools, each good at dealing with one particular type of misspecification. It is then the combination of all these tools that will make up our test; we can reasonably hope that a misspecified model will not succeed in passing through this filter.<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ===Diagnostic plots using individual parameters===<br /> <br /> Diagnostic plots constructed using only the observations are useful for looking at the distribution $\qy$ of the observations,<br /> but do not help with testing hypotheses made on the non-observed individual parameters (about the distribution, covariate model, etc.).<br /> <br /> One possible solution is to estimate the individual parameters (using for example the conditional mode) and then use these estimates to create new diagnostic tools.
This strategy is only useful when the individual parameters have been estimated well.<br /> <br /> If instead the data does not contain enough information to estimate certain individual parameters well, the individual estimates are all shrunk towards the same (population) value; this is the mode (resp. mean) of the population distribution of the parameter if we use the conditional mode (resp. conditional mean). For a parameter $\psi_i$ which is a function of a random effect $\eta_i$, we can quantify this phenomenon by defining the so-called $\eta$-shrinkage as<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \eta\text{-shrinkage} = 1 - \displaystyle{\frac{\var{\hat{\eta}_i} }{\var{\eta_i} } }, &lt;/math&gt; }}<br /> <br /> where $\hat{\eta}_i$ is an estimate of $\eta_i$ (the conditional mode, conditional mean, etc.).<br /> <br /> This shrinkage phenomenon is simple to understand because the conditional distribution $\qcetaiyi$ of $\eta_i$ is defined by the product $\pmacro(y_i|\eta_i)\pmacro(\eta_i)$. Saying that the observations $y_i$ provide little information about $\eta_i$ means that the conditional distribution of $y_i$ has a reduced importance in the construction of $\qcetaiyi$. The mode (resp. mean) of $\qcetaiyi$ will therefore be close to 0 which is both the mode and mean of $\qetai$. The result is a high level of shrinkage (close to 1) whenever $\var{\hat{\eta}_i}\ll\var{\eta_i}$.<br /> <br /> Estimates of the $\psi_i$ are therefore biased because they do not correctly reflect the marginal distribution $\qpsii$ (in particular, their variance is much smaller). A particularly effective solution is to simulate the individual parameters $\psi_i$ with the conditional distribution $\qcpsiiyi$ rather than taking the mode.
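The shrinkage mechanism, and its disappearance when we simulate from the conditional distribution instead of taking the mode, can be illustrated in a toy linear Gaussian model where the conditional distribution is known in closed form; all numerical values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (hypothetical values): eta_i ~ N(0, omega^2),
# y_ij = eta_i + sigma * eps_ij, with only n = 2 observations per individual.
N, n, omega, sigma = 5000, 2, 2.0, 3.0
eta = rng.normal(0.0, omega, size=N)
y = eta[:, None] + rng.normal(0.0, sigma, size=(N, n))

# In this linear Gaussian case the conditional distribution of eta_i given y_i
# is Gaussian, so the conditional mode equals the conditional mean:
#   eta_hat_i = (n/sigma^2) / (n/sigma^2 + 1/omega^2) * ybar_i
ybar = y.mean(axis=1)
k = (n / sigma**2) / (n / sigma**2 + 1 / omega**2)
eta_hat = k * ybar

shrinkage = 1.0 - np.var(eta_hat) / np.var(eta)
print(f"eta-shrinkage of the conditional modes: {shrinkage:.2f}")

# Drawing from the conditional distribution instead removes the shrinkage:
# eta_i | y_i ~ N(eta_hat_i, v) with v = 1 / (n/sigma^2 + 1/omega^2).
v = 1.0 / (n / sigma**2 + 1 / omega**2)
eta_sim = rng.normal(eta_hat, np.sqrt(v))
print(f"eta-shrinkage of the simulated parameters: {1.0 - np.var(eta_sim) / np.var(eta):.2f}")
```

With so few observations per individual the conditional modes exhibit substantial shrinkage, while the variance of the simulated $\eta_i$ matches that of the population distribution.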
The resulting estimator is unbiased in the following sense:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \pmacro(\psi_i) &amp;=&amp; \displaystyle{ \int \pmacro(\psi_i {{!}} y_i )\pmacro( y_i ) d\, y_i }\\<br /> &amp;=&amp; \esps{y_i}{\pmacro(\psi_i {{!}} y_i )} .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> This relationship is a fundamental one when considering inverse problems, incomplete data models, mixed-effects models, etc. So what does it imply exactly? Well, if we randomly draw a vector $y_i$ of observations for an individual in a population and then generate a vector $\psi_i$ using the conditional distribution $\qcpsiiyi$, then the distribution of $\psi_i$ is the population distribution $\qpsii$. In other words, even if each $\psi_i$ is simulated using its own conditional distribution, the fact of pooling them allows us to look at them as if they were a sample from $\qpsii$, i.e., the marginal distribution $\qpsii$ is a mixture of conditional distributions $\qcpsiiyi$.<br /> <br /> The procedure is therefore as follows: we generate several values from each conditional distribution $\qcpsiiyi$ using the [[The Metropolis-Hastings algorithm for simulating the individual parameters|Metropolis-Hastings algorithm]], and use them in addition to the observations in order to build various diagnostic plots.<br /> <br /> <br /> &lt;br&gt;<br /> <br /> === An example ===<br /> <br /> We are going to use the same model used for model exploration in the [[Visualization]] section and parameter estimation in the [[Estimation#Maximum likelihood estimation of the population parameters | Maximum likelihood estimation of the population parameters]] chapter.<br /> <br /> <br /> The structural model that defines the concentration in the central compartment and the hazard function for the events (hemorrhaging) is<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> Cc(t) &amp;=&amp; \displaystyle{ \frac{D \, 
ka}{V(ka-Cl/V)} }\left(e^{-(Cl/V)\,t} - e^{-ka\,t} \right) \\<br /> h(t) &amp;=&amp; h_0 \, \exp(\gamma\, Cc(t)) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The statistical model assumes that $ka_i$ and $V_i$ are log-normally distributed, $Cl_i$ normal, $h0_i$ probit-normal and $\gamma_i$ logit-normal. No covariates are used in this model. Lastly, we suppose a constant residual error model. Now we are going to review several different diagnostic plots and look at the conclusions that can be made using them.<br /> <br /> <br /> 1) &lt;u&gt;Individual fits.&lt;/u&gt;<br /> <br /> In the continuous data model $y_{ij}=f(t_{ij};\psi_i) + g(t_{ij};\psi_i)\teps_{ij}$, estimation of the population parameters $\psi_{\rm pop}$ and individual parameters $\psi_{i}$ allows us to compute for each individual:<br /> <br /> &lt;ul&gt;<br /> * $f(t ; \hat{\psi}_{\rm pop})$, the predicted profile given by the estimated population model<br /> &lt;br&gt;<br /> <br /> * $f(t ; \hat{\psi}_{i})$, the predicted profile given by the estimated individual model.<br /> &lt;/ul&gt;<br /> <br /> <br /> {{ExampleWithImage<br /> |text=The figure plotting for each individual the two curves for the predicted concentration shows evidence of inter-individual variability in the kinetics, and furthermore does not allow us to reject the proposed PK model since the fits seem acceptable.<br /> |image=indfits1.png<br /> }}<br /> <br /> <br /> 2) &lt;u&gt;Observations vs predictions.&lt;/u&gt;<br /> <br /> The population and individual models also allow us to calculate for each individual predictions at the observation times<br /> $f(t_{ij} ; \hat{\psi}_{\rm pop})$ and $f(t_{ij} ; \hat{\psi}_{i})$.<br /> <br /> <br /> {{ExampleWithImage<br /> |text= The figure showing observations vs individual predictions reveals no obvious misspecification.
In particular it does not allow us to reject the constant residual error model.<br /> |image=obspred1.png<br /> }}<br /> <br /> <br /> 3) &lt;u&gt;Residuals.&lt;/u&gt;<br /> <br /> Several types of residuals can be defined: ''population weighted residuals'' $({\rm PWRES}_{ij})$, ''individual weighted residuals'' $({\rm IWRES}_{ij})$, ''normalised prediction distribution errors'' $({\rm NPDE}_{ij})$, etc. The first two are defined by:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> {\rm PWRES}_{ij} &amp;=&amp; \frac{y_{ij} - \hat{\mathbb{E} }(y_{ij})}{\hat{\rm std}(y_{ij})} \\ {\rm IWRES}_{ij} &amp;=&amp; \frac{y_{ij} - f(t_{ij} ; \hat{\psi}_{i})}{g(t_{ij} ; \hat{\psi}_{i})},<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $\hat{\mathbb{E}}(y_{ij})$ and $\hat{\rm std}(y_{ij})$ are the mean and standard deviation of $y_{ij}$ estimated by Monte Carlo.<br /> ${\rm NPDE}_{ij}$ is a nonparametric version of ${\rm PWRES}_{ij}$, based on a rank statistic. See [http://www.npde.biostat.fr npde] for more details.<br /> <br /> Statistics useful for summarizing the residuals are the 10%, 50% (median) and 90% quantiles. The procedure described earlier for estimating prediction intervals of these quantiles using Monte Carlo can again be used.<br /> <br /> <br /> {{ExampleWithImage<br /> |text= Both the individual residuals and the NPDEs suggest that the model is misspecified. Indeed, under ${\cal M}_0$ residuals are expected to behave as i.i.d. ${\cal N}(0,1)$ random variables, which is clearly not the case here. It is nevertheless difficult to identify the reasons for this misspecification using only these figures.<br /> |image=residual1.png<br /> }}<br /> <br /> <br /> {{ExampleWithImage<br /> |text= These plots show at each observation time the 10%, 50% and 90% quantiles of the IWRES and NPDE.<br /> The 90% prediction intervals are also displayed.
These plots are more informative than the original residual plots.<br /> We can now reasonably conclude that the behavior of the three quantiles is not the one expected under ${\cal M}_0$. In particular, a proportional component in the residual error model appears not to have been taken into account.<br /> |image=residual2.png<br /> }}<br /> <br /> <br /> 4) &lt;u&gt;The distributions of the individual parameters.&lt;/u&gt;<br /> <br /> The hypotheses we have made about the distributions of the individual parameters can be tested by visually comparing the pdf of the pre-selected distribution of each parameter with the empirical distribution of that parameter. We are going to see that using the estimated individual parameters does not allow us to construct a pertinent diagnostic plot, and that we must rather use parameters simulated with the conditional distribution $\qcpsiiyi$ for each individual.<br /> <br /> <br /> {{ExampleWithImage<br /> |text= The plots show for each model parameter the pdf obtained from the estimated parameters and the empirical distribution, shown as a histogram, of the estimated individual parameters (here the estimated parameters are the modes of the conditional distributions).<br /> |image=distparam0.png<br /> }}<br /> <br /> <br /> {{ExampleWithImage<br /> |text= Instead of histograms, $\monolix$ can also display the empirical distribution of a continuous variable using nonparametric density estimation. This is a better way to represent continuous distributions than a histogram.<br /> &lt;br&gt;<br /> It is also possible to display the $\eta$-shrinkage for each parameter. As expected, $\eta$-shrinkage is large for the parameters associated with the time-to-events process. 
That does not mean that the statistical models for $h_0$ and $\gamma$ are misspecified, but that the data does not allow us to correctly recover these individual parameters.<br /> |image=distparam1.png<br /> }}<br /> <br /> {{ExampleWithImage<br /> |text= The simulated individual parameters now allow us to construct a diagnostic tool for the distributions of the individual parameters. Only the distribution of the clearance $Cl$ would appear to be rejected. Though the model had supposed a normal distribution, the simulated parameters seem to suggest an asymmetric distribution. This diagnostic plot leads us to think about testing the hypothesis of a log-normal distribution for $Cl$.<br /> |image=distparam2.png<br /> }}<br /> <br /> <br /> 5) &lt;u&gt;Covariate model.&lt;/u&gt;<br /> <br /> The model assumes that for a given individual parameter $\psi_i$, there exists a function $h$ such that $h(\psi_i) = h(\psi_{\rm pop}) + \eta_i$, where the random effects are i.i.d. Gaussian variables. We can then graphically display the random effects simulated with the conditional distributions as a function of the various covariates in order to see whether this hypothesis is valid.<br /> <br /> <br /> {{ExampleWithImage<br /> |text= These plots clearly show that the simulated random effects for $V$ and $Cl$ are correlated with the weight and have different distributions depending on gender. The assumption that volume $V$ and clearance $Cl$ are independent of weight should be rejected. 
The statistical model also needs to take into account the fact that both predicted volume and clearance increase with weight.<br /> |image=Eval_covariate2.png <br /> }}<br /> <br /> <br /> 6) &lt;u&gt;The correlation model.&lt;/u&gt;<br /> <br /> The model assumes that for a given individual, the random effects associated with each individual parameter are independent.<br /> We can plot each pair of random effects simulated with the conditional distributions against each other to see if this hypothesis is valid.<br /> <br /> <br /> {{ExampleWithImage<br /> |text=The various point clouds show no correlation between random effects except for $(\eta_V,\eta_{Cl})$ and perhaps $(\eta_{ka},\eta_V)$.<br /> |image=correl1.png<br /> }}<br /> <br /> <br /> 7) &lt;u&gt;Visual predictive checks.&lt;/u&gt;<br /> <br /> A VPC is a diagnostic tool well suited to continuous data. It allows us to summarize in the same figure the structural and statistical models. The VPC shown uses the 10%, 50% and 90% quantiles of the observations after having grouped them into bins over successive intervals. Then, prediction intervals for these quantiles under ${\cal M}_0$ are estimated using Monte Carlo.<br /> <br /> <br /> {{ExampleWithImage<br /> |text=To make it easier to interpret VPCs, we represent in red the zones where the observed quantiles are outside the prediction intervals. Here, the structural model seems to be ok, but the statistical model exhibits some incoherencies. In particular, the three quantiles obtained using the observations appear much closer together than the model ${\cal M}_0$ would suggest.
This adds weight to the suggestion that a proportional component should be added to the error model.<br /> |image=vpc4.png<br /> }}<br /> <br /> <br /> 8) &lt;u&gt;Kaplan-Meier plots.&lt;/u&gt;<br /> <br /> Special diagnostic plots need to be defined for non-continuous observations.<br /> For survival (time-to-event) data, we can use Kaplan-Meier plots (for the first event) as a statistic, and/or the average cumulative number of events per individual (i.e., the mean number of observed events before time $t$). The prediction intervals for these statistics can be estimated by Monte Carlo.<br /> <br /> <br /> {{ExampleWithImage<br /> |text= The shapes of the curves seem correct. The model appears to slightly overestimate the survival function after the 15 hr mark, but it is difficult at this point to decide whether this comes from the time-to-event model itself, the statistical model or the model for the concentration.<br /> |image=km2.png<br /> }}<br /> <br /> <br /> {{OutlineText<br /> |text= In summary, this ensemble of diagnostic plots has suggested to us that we should suppose:<br /> <br /> <br /> * a log-normal distribution for $Cl$<br /> <br /> * a combined residual error model, e.g., $y=f+(a+b*f)\teps$<br /> <br /> * a statistical model for $Cl$ and $V$ which incorporates weight as a covariate, assuming for instance a linear relationship between $\log(V)$ and $\log({\rm weight})$ and a linear relationship between $\log(Cl)$ and $\log({\rm weight})$<br /> <br /> * a linear correlation between $\log(Cl)$ and $\log(V)$.<br /> }}<br /> <br /> <br /> This new model can be easily implemented in $\monolix$. Population parameters and individual parameters are then estimated anew and new diagnostic plots drawn.
A few of these are displayed below and clearly show that the new model is better than the previous one, and can likely be retained as a candidate model.<br /> <br /> ::[[File:diagnostic.png]]<br /> <br /> <br /> &lt;br&gt;<br /> <br /> == Model selection == <br /> <br /> &lt;br&gt;<br /> === Statistical tools for model selection ===<br /> <br /> <br /> Statistical tools for model selection include information criteria (AIC and BIC) and hypothesis tests such as the Wald test and likelihood ratio test (LRT).<br /> <br /> The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are defined by<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> AIC &amp;=&amp; - 2 {\llike}(\theta;\by) + 2 P \\<br /> BIC &amp;=&amp; - 2 {\llike}(\theta;\by) + \log(N) P ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $P$ is the total number of parameters to be estimated and $N$ the number of subjects. The models being compared using AIC or BIC need not be nested, unlike the case when models are being compared using an F-test or likelihood ratio test.<br /> <br /> <br /> {{Remarks<br /> |title=Remarks<br /> |text= &amp;#32;<br /> * Surprisingly, the formula for calculating the BIC differs from one software package to another. The reason is that the effective sample size is not clearly defined in the context of mixed-effects models. The question is whether we should use the number of subjects $N$ or the total number of observations $n_{\mathrm{tot} }$ in the penalty term.
The penalty using $n_{\mathrm{tot} }$ is implemented in the R package nlme &lt;!--[http://cran.r-project.org/web/packages/nlme/nlme.pdf nlme]--&gt; and the SPSS procedure MIXED &lt;!--[http://www.spss.ch/upload/1126184451_Linear%20Mixed%20Effects%20Modeling%20in%20SPSS.pdf MIXED]--&gt;, while $N$ is used in $\monolix$, in the R package saemix &lt;!--[http://cran.r-project.org/web/packages/saemix/saemix.pdf saemix]--&gt; and in SAS proc NLMIXED&lt;!--[http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#nlmixed_toc.htm NLMIXED]--&gt;.<br /> <br /> : An appropriate decomposition of the complete log-likelihood combined with the Laplace approximation can be used to derive the asymptotic BIC approximation. This leads to an optimal BIC penalty based on two terms proportional to $\log N$ and $\log n_{\mathrm{tot} }$ that adapts to the mixed-effects structure of the model. This new approach is not yet implemented in any software.<br /> <br /> <br /> * AIC and BIC are justified based on asymptotic criteria (the AIC heuristic uses Wilks' theorem and BIC uses Bayesian statistics), i.e., when the number of individuals increases and the model dimension stays fixed. In the alternative, non-asymptotic approach, the model size can increase freely. The form of the penalty can differ from one model to the next in this framework. It can be shown for example that for certain Gaussian models, the penalty term has the form $c_1P + c_2P\log(N/P)$. The problem then becomes to calibrate the coefficients $c_1$ and $c_2$ in order to obtain an optimal penalty, which is not necessarily a simple task, making it harder to use this approach in real applications.<br /> }}<br /> <br /> <br /> The observed log-likelihood ${\llike}(\theta;\by) = \log(\py(\by;\theta))$ cannot be computed in closed form for nonlinear mixed-effects models.
It can be estimated by Monte Carlo using the importance sampling algorithm described in the [[Estimation of the log-likelihood]] section.<br /> <br /> When comparing two nested models ${\cal M}_0$ and ${\cal M}_1$ with dimensions $P_0$ and $P_1$ (with $P_1&gt;P_0$), the ''likelihood ratio test'' (LRT) uses the test statistic<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> LRT = 2 ( {\llike}(\hthetag_1;\by) - {\llike}(\hthetag_0;\by) ) ,<br /> &lt;/math&gt; }}<br /> <br /> where $\hthetag_0$ and $\hthetag_1$ are the MLE of $\theta$ under ${\cal M}_0$ and ${\cal M}_1$.<br /> <br /> Depending on the hypotheses, the limit distribution of $LRT$ is either a $\chi ^2$ distribution or a mixture of a $\chi^2$ distribution and a Dirac $\delta$ distribution. For example:<br /> <br /> <br /> &lt;ul&gt;<br /> - Testing whether some fixed effects are null and assuming the same covariance structure for the random effects implies that<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; LRT \limite{N\to \infty}{} \chi^2(P_1-P_0) .<br /> &lt;/math&gt; }}<br /> <br /> <br /> - Testing whether some correlations in the covariance matrix $\IIV$ are null and assuming the same covariate model implies that<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> LRT \limite{N\to \infty}{} \chi^2(P_1-P_0) .<br /> &lt;/math&gt; }}<br /> <br /> <br /> - Testing whether the variance of one of the random effects is zero and assuming the same covariate model implies that<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> LRT \limite{N\to \infty}{} \displaystyle{ \frac{1}{2} }\chi^2(1) + \displaystyle{ \frac{1}{2} }\delta_0 .<br /> &lt;/math&gt; }}<br /> &lt;/ul&gt;<br /> <br /> <br /> Statistical tests, as is the case for BIC, help us decide whether the difference between two models is statistically significant.<br /> Suppose that we want to test whether a fixed model parameter $\beta$ is null:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> H_0: \quad 
\beta=0 \quad vs \quad H_1: \quad \beta\neq0 .<br /> &lt;/math&gt; }} <br /> <br /> We construct a test statistic $T$ for which the distribution under $H_0$ allows us to calculate a $p$-value, i.e., the probability that $T$ is at least as big as the value observed under $H_0$. A small $p$-value leads us to reject $H_0$ with high confidence. It is usual to use the arbitrary cutoff of 5% to make this decision: we frequently read statements such as &quot;a decrease in the objective function of at least 3.84 was required to identify a significant covariate&quot;.<br /> In the same way we could select models based on their BIC values under $H_0$ and $H_1$ by providing an arbitrary decision rule. It is sometimes suggested to choose $H_1$ if the difference $BIC_{H_1} -BIC_{H_0}$ is less than a certain arbitrary cutoff.<br /> <br /> These approaches seem to simplify the modeler's life because they provide decision rules that can be applied systematically without thinking and thus justify decisions. But a rule, whatever it is, should never stop us asking why we are applying it and whether it is applicable in the present case. Remember that even a very small difference will be statistically significant if the sample size is large enough. The question is thus not whether a difference is statistically significant, but whether it is physically or biologically significant. We must therefore look carefully at the size of an effect and its real impact, both for understanding the model and for understanding its impact on the model's predictive capacities.<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ===An example===<br /> <br /> We are going to continue using our joint model for concentration data and hemorrhaging events. The structural model is chosen to be the one used earlier for the diagnostic tests.
We would now like to compare several statistical models:<br /> <br /> &lt;blockquote&gt;<br /> ${\cal M}_1$: all the individual parameters are log-normally distributed, there is no correlation between individual parameters, the residual error model is a combined one ($y=f+(a+b*f)\teps$), both $\log(V)$ and $\log(Cl)$ are linear functions of $\log({\rm weight})$. <br /> &lt;/blockquote&gt;<br /> &lt;blockquote&gt;<br /> ${\cal M}_2$: model ${\cal M}_1$, assuming that $\log(V_i)$ and $\log(Cl_i)$ are linearly correlated. <br /> &lt;/blockquote&gt;<br /> &lt;blockquote&gt;<br /> ${\cal M}_3$: model ${\cal M}_2$, assuming that $\gamma_i=\gamma_{\rm pop}$ (i.e., $\omega_\gamma = 0$). <br /> &lt;/blockquote&gt;<br /> &lt;blockquote&gt;<br /> ${\cal M}_4$: model ${\cal M}_2$, assuming that $Cl$ is not a function of the weight. <br /> &lt;/blockquote&gt;<br /> <br /> <br /> The results for these four models are as follows:<br /> <br /> <br /> {| class=&quot;wikitable&quot; cellpadding=&quot;20&quot; cellspacing=&quot;20&quot; style=&quot;margin-left:20%; margin-right:20%; width:50%&quot;<br /> ! Model || $-2\times {\llike}$ || BIC <br /> |- <br /> | ${\cal M}_1$ || 1390 || 1451 <br /> |-<br /> | ${\cal M}_2$ || 1370 || 1436 <br /> |-<br /> | ${\cal M}_3$ || 1370 || 1432 <br /> |-<br /> | ${\cal M}_4$ || 1446 || 1477 <br /> |} <br /> <br /> <br /> Clearly models ${\cal M}_2$ and ${\cal M}_3$ are the best in terms of BIC.
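The comparison can be scripted directly from the table above (the $-2\log{\cal L}$ and BIC values are those reported there):

```python
# -2*log-likelihood and BIC values taken from the table above
results = {
    "M1": {"m2ll": 1390, "bic": 1451},
    "M2": {"m2ll": 1370, "bic": 1436},
    "M3": {"m2ll": 1370, "bic": 1432},
    "M4": {"m2ll": 1446, "bic": 1477},
}

# Rank the candidate models by BIC (smaller is better)
ranked = sorted(results, key=lambda m: results[m]["bic"])
best = ranked[0]
print("models ranked by BIC:", ranked)
print("BIC differences from the best model:",
      {m: results[m]["bic"] - results[best]["bic"] for m in ranked})
```

As the text stresses, such a ranking guides the modeler but does not replace their judgment.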
We can therefore envisage selecting a model that includes both a correlation between $\log(V)$ and $\log(Cl)$ and that supposes that the volume $V$ is a function of the weight.<br /> <br /> Rather than a purely statistical criterion (LRT or BIC), it is above all the estimated standard deviation of $\gamma$ ($\hat{\omega}_\gamma = 0.002$) under the model ${\cal M}_2$ that would lead us to conclude that the inter-individual variability of $\gamma$ is negligible.<br /> <br /> It is also important to try to evaluate the impact that a bad decision could have. Retaining $\hat{\omega}_\gamma = 0.002$ would have little impact on predictions because $\gamma_i$ is log-normally distributed, representing a variability of around 0.2%. Models ${\cal M}_2$ and ${\cal M}_3$ are thus practically identical and there is no particular advantage in selecting one over the other.<br /> <br /> <br /> &lt;br&gt;<br /> == Bibliography ==<br /> <br /> <br /> &lt;bibtex&gt;<br /> @article{akaike1983information,<br /> title={Information measures and model selection},<br /> author={Akaike, H.},<br /> journal={Bulletin of the International Statistical Institute},<br /> volume={50},<br /> number={1},<br /> pages={277-291},<br /> year={1983}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{arlot2010survey,<br /> title={A survey of cross-validation procedures for model selection},<br /> author={Arlot, S. and Celisse, A.},<br /> journal={Statistics Surveys},<br /> volume={4},<br /> pages={40 - 79},<br /> year={2010},<br /> publisher={The author, under a Creative Commons Attribution License}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{berger1996intrinsic,<br /> title={The intrinsic Bayes factor for model selection and prediction},<br /> author={Berger, J.O.
and Pericchi, L.R.},<br /> journal={Journal of the American Statistical Association},<br /> volume={91},<br /> number={433},<br /> pages={109-122},<br /> year={1996},<br /> publisher={Taylor &amp; Francis}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{bergstrand2011prediction,<br /> title={Prediction-corrected visual predictive checks for diagnosing nonlinear mixed-effects models},<br /> author={Bergstrand, M. and Hooker, A.C. and Wallin, J.E. and Karlsson, M.O.},<br /> journal={The AAPS journal},<br /> volume={13},<br /> number={2},<br /> pages={143-151},<br /> year={2011},<br /> publisher={Springer}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{burnham2002model,<br /> title={Model selection and multi-model inference: a practical information-theoretic approach},<br /> author={Burnham, K.P. and Anderson, D.R.},<br /> year={2002},<br /> publisher={Springer Verlag}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{comets2010model,<br /> title={Model evaluation in nonlinear mixed effect models, with applications to pharmacokinetics},<br /> author={Comets,E. and Brendel,K.},<br /> journal={Journal de la Soci&amp;eacute;t&amp;eacute; Fran&amp;ccedil;aise de Statistique},<br /> volume={151},<br /> number={1},<br /> pages={106-128},<br /> year={2010}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{comets2008computing,<br /> title={Computing normalised prediction distribution errors to evaluate nonlinear mixed-effect models: the npde add-on package for R},<br /> author={Comets, E. and Brendel, K. and Mentr&amp;eacute;, F.},<br /> journal={Computer methods and programs in biomedicine},<br /> volume={90},<br /> number={2},<br /> pages={154-166},<br /> year={2008},<br /> publisher={Elsevier}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{hausman1984specification,<br /> title={Specification tests for the multinomial logit model},<br /> author={Hausman, J. 
and McFadden, D.},<br /> journal={Econometrica: Journal of the Econometric Society},<br /> pages={1219-1240},<br /> year={1984},<br /> publisher={JSTOR}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @ARTICLE{Hooker2009Xpose,<br /> author = {Hooker, A. and Karlsson, M.O. and Jonsson, E.N.},<br /> title = {Model diagnostic using XPOSE},<br /> url = {http://xpose.sourceforge.net/generic\_chm/xpose.VPC.html},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @INPROCEEDINGS{KarlssonHolford2008Tutorial,<br /> author = {Karlsson, M.O. and Holford, N.},<br /> title = {A Tutorial on Visual Predictive Checks},<br /> booktitle = {PAGE 2008},<br /> year = {2008},<br /> owner = {kb},<br /> timestamp = {2011.04.18},<br /> url = {http://www.page-meeting.org/pdf\_assets/8694-Karlsson_Holford_VPC_Tutorial_hires.pdf}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{lavielle2011automatic,<br /> title={Automatic data binning for improved visual diagnosis of pharmacometric models},<br /> author={Lavielle, M. and Bleakley, K.},<br /> journal={Journal of pharmacokinetics and pharmacodynamics},<br /> volume={38},<br /> number={6},<br /> pages={861-871},<br /> year={2011},<br /> publisher={Springer}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @misc{massart2007concentration,<br /> title={Concentration Inequalities and Model Selection, Ecole d’Et&amp;eacute; de Probabilit&amp;eacute;s de Saint-Flour XXXIII-2003 Lecture Notes in Mathematics 1896},<br /> author={Massart, P.},<br /> year={2007},<br /> publisher={Springer-Verlag, Berlin}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{neyman1992problem,<br /> title={On the problem of the most efficient tests of statistical hypotheses},<br /> author={Neyman, J. 
and Pearson, E.S.},<br /> year={1992},<br /> publisher={Springer}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{schwarz1978estimating,<br /> title={Estimating the dimension of a model},<br /> author={Schwarz, G.},<br /> journal={The annals of statistics},<br /> volume={6},<br /> number={2},<br /> pages={461-464},<br /> year={1978},<br /> publisher={Institute of Mathematical Statistics}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{shimodaira1999multiple,<br /> title={Multiple comparisons of log-likelihoods with applications to phylogenetic inference},<br /> author={Shimodaira, H. and Hasegawa, M.},<br /> journal={Molecular biology and evolution},<br /> volume={16},<br /> pages={1114-1116},<br /> year={1999},<br /> publisher={UNIVERSITY OF CHICAGO PRESS}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{vaida2005conditional,<br /> title={Conditional Akaike information for mixed-effects models},<br /> author={Vaida, F. and Blanchard, S.},<br /> journal={Biometrika},<br /> volume={92},<br /> number={2},<br /> pages={351--370},<br /> year={2005},<br /> publisher={Biometrika Trust}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{wald1943tests,<br /> title={Tests of statistical hypotheses concerning several parameters when the number of observations is large},<br /> author={Wald, A.},<br /> journal={Transactions of the American Mathematical Society},<br /> volume={54},<br /> number={3},<br /> pages={426-482},<br /> year={1943},<br /> publisher={JSTOR}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{wilks1938large,<br /> title={The large-sample distribution of the likelihood ratio for testing composite hypotheses},<br /> author={Wilks, S.S.},<br /> journal={The Annals of Mathematical Statistics},<br /> volume={9},<br /> number={1},<br /> pages={60-62},<br /> year={1938},<br /> publisher={JSTOR}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{zhao2006model,<br /> 
title={On model selection consistency of Lasso},<br /> author={Zhao, P. and Yu, B.},<br /> journal={The Journal of Machine Learning Research},<br /> volume={7},<br /> pages={2541-2563},<br /> year={2006},<br /> publisher={JMLR.org}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> <br /> {{Back&amp;Next<br /> |linkBack=Visualization<br /> |linkNext=Simulation }}</div> Marc https://wiki.inria.fr/wikis/popix/index.php?title=Estimation&diff=7169 Estimation 2013-06-04T18:19:51Z <p>Marc: /* Bibliography */</p> <hr /> <div>== Introduction ==<br /> <br /> In the modeling context, we usually assume that we have data that includes observations $\by$, measurement times $\bt$ and possibly additional regression variables $\bx$. There may also be individual covariates $\bc$, and in pharmacological applications the dose regimen $\bu$. For clarity, in the following notation we will omit the design variables $\bt$, $\bx$ and $\bu$, and the covariates $\bc$.<br /> <br /> Here, we find ourselves in the classical framework of incomplete data models. Indeed, only $\by = (y_{ij})$ is observed in the joint model $\pypsi(\by,\bpsi;\theta)$.<br /> <br /> The estimation tasks are common ones in statistics:<br /> <br /> <br /> &lt;ol&gt;<br /> &lt;li&gt; Estimate the population parameter $\theta$ using the available observations and, possibly, a priori information.&lt;/li&gt;<br /> <br /> &lt;li&gt;Evaluate the precision of the proposed estimates.&lt;/li&gt;<br /> <br /> &lt;li&gt;Reconstruct missing data, here being the individual parameters $\bpsi=(\psi_i, 1\leq i \leq N)$.
&lt;/li&gt;<br /> <br /> &lt;li&gt;Estimate the log-likelihood for a given model, i.e., for a given joint distribution $\qypsi$ and value of $\theta$.&lt;/li&gt;<br /> &lt;/ol&gt;<br /> <br /> <br /> &lt;br&gt;<br /> <br /> == Maximum likelihood estimation of the population parameters ==<br /> <br /> &lt;br&gt;<br /> === Definitions ===<br /> <br /> <br /> ''Maximum likelihood estimation'' consists of maximizing with respect to $\theta$ the ''observed likelihood'' defined by:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \like(\theta ; \by) &amp;\eqdef&amp; \py(\by ; \theta) \\<br /> &amp;=&amp; \int \pypsi(\by,\bpsi ;\theta) \, d \bpsi .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Maximum likelihood estimation of the population parameter $\theta$ requires:<br /> <br /> &lt;blockquote&gt;<br /> * A model, i.e., a joint distribution $\qypsi$. Depending on the software used, the model can be implemented using a script or a graphical user interface. $\monolix$ is extremely flexible and allows us to combine both. It is possible for instance to code the structural model using $\mlxtran$ and use the GUI for implementing the statistical model. Whatever the options selected, the complete model can always be saved as a text file. &lt;br&gt;&lt;br&gt;<br /> * Inputs $\by$, $\bc$, $\bu$ and $\bt$. All of these variables are typically stored in a single data file (see the [[Visualization#Data exploration | Data Exploration ]] Section). &lt;br&gt;&lt;br&gt;<br /> * An algorithm which allows us to maximize $\int \pypsi(\by,\bpsi ;\theta) \, d \bpsi$ with respect to $\theta$. Each software package has its own algorithms implemented. It is not our goal here to rate and compare the various algorithms and implementations.
We will use exclusively the SAEM algorithm as described in [[The SAEM algorithm for estimating population parameters | The SAEM algorithm]] and implemented in $\monolix$ as we are entirely satisfied by both its theoretical and practical qualities: &lt;br&gt;&lt;br&gt;<br /> ** The algorithms implemented in $\monolix$ including SAEM and its extensions (mixture models, hidden Markov models, SDE-based models, censored data, etc.) have been published in statistical journals. Furthermore, convergence of SAEM has been rigorously proved.&lt;br&gt;&lt;br&gt;<br /> ** The SAEM implementation in $\monolix$ is extremely efficient for a wide variety of complex models.&lt;br&gt;&lt;br&gt;<br /> ** The SAEM implementation in $\monolix$ was done by the same group that proposed the algorithm and studied its theoretical and practical properties in detail.<br /> &lt;/blockquote&gt;<br /> <br /> <br /> {{Remarks<br /> |title=Remark<br /> |text= It is important to highlight the fact that for a parameter $\psi_i$ whose distribution is the transformation of a normal one (log-normal, logit-normal, etc.) the MLE $\hat{\psi}_{\rm pop}$ of the reference parameter $\psi_{\rm pop}$ is neither the mean nor the mode of the distribution. It is in fact the median.<br /> <br /> To show why this is the case, let $h$ be a nonlinear, twice continuously differentiable and strictly increasing function such that $h(\psi_i)$ is normally distributed.<br /> <br /> <br /> * First we show that it is not the mean. By definition, the MLE of $h(\psi_{\rm pop})$ is $h(\hat{\psi}_{\rm pop})$. Thus, the estimated distribution of $h(\psi_i)$ is the normal distribution with mean $h(\hat{\psi}_{\rm pop})$, but $\esp{h(\psi_i)} = h(\hat{\psi}_{\rm pop})$ implies that $\esp{\psi_i} \neq \hat{\psi}_{\rm pop}$ since $h$ is nonlinear. In other words, $\hat{\psi}_{\rm pop}$ is not the mean of the estimated distribution of $\psi_i$.<br /> <br /> <br /> * Next we show that it is not the mode.
Let $f$ be the pdf of $\psi_i$ and let $f_h$ be the pdf of $h(\psi_i)$. By the change-of-variables formula, for any $t\in \mathbb{R}$,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> f(t) = h^\prime(t)f_h(h(t)) . &lt;/math&gt; }}<br /> <br /> : Thus,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; <br /> f^\prime(t) = h^{\prime \prime}(t)f_h(h(t)) + h^{\prime 2}(t)f_h^\prime(h(t)) .<br /> &lt;/math&gt; }}<br /> <br /> : By definition of the mode, $f_h^\prime(h(\hat{\psi}_{\rm pop}))=0$. Since $h$ is nonlinear, $h^{\prime \prime}(\hat{\psi}_{\rm pop})\neq 0$ a.s. and thus $f^\prime(\hat{\psi}_{\rm pop})\neq 0$ a.s. In other words, $\hat{\psi}_{\rm pop}$ is not the mode of the estimated distribution of $\psi_i$.<br /> <br /> <br /> * Now we show that it is the median. Since $h$ is a strictly increasing function,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \probs{\hat{\psi}_{\rm pop} }{\psi_i \leq \hat{\psi}_{\rm pop} } &amp;=&amp; \probs{\hat{\psi}_{\rm pop} }{h(\psi_i) \leq h(\hat{\psi}_{\rm pop})} \\<br /> &amp;=&amp; 0.5 .<br /> \end{eqnarray}&lt;/math&gt; }} <br /> <br /> : In other words, $\hat{\psi}_{\rm pop}$ is the median of the estimated distribution of $\psi_i$.<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> <br /> === Example ===<br /> <br /> Let us again look at the model used in the [[Visualization#Model exploration | Model Visualization]] Section. For the case of a single dose $D$ given at time $t=0$, the structural model is written:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> ke&amp;=&amp;Cl/V \\<br /> Cc(t) &amp;=&amp; \displaystyle{\frac{D \, ka}{V(ka-ke)} }\left(e^{-ke\,t} - e^{-ka\,t} \right) \\<br /> h(t) &amp;=&amp; h_0 \, \exp(\gamma\, Cc(t)) ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $Cc$ is the concentration in the central compartment and $h$ the hazard function for the event of interest (hemorrhaging).
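As a quick sanity check, the structural model above can also be coded directly in a general-purpose language. The sketch below is purely illustrative (it is not part of the $\monolix$ workflow) and assumes $ka \neq ke$; the parameter values used to exercise it are arbitrary:

```python
import math

def concentration(t, D, ka, V, Cl):
    """One-compartment model with first-order absorption, single dose D at t=0."""
    ke = Cl / V
    return D * ka / (V * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

def hazard(t, D, ka, V, Cl, h0, gamma):
    """Hazard of the event, increasing exponentially with the concentration."""
    return h0 * math.exp(gamma * concentration(t, D, ka, V, Cl))
```

At $t=0$ the concentration is zero and the hazard equals the baseline $h_0$, as the equations require.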
Supposing a constant error model for the concentration, the model for the observations can be easily implemented using $\mlxtran$.<br /> <br /> <br /> {{MLXTran<br /> |name=joint1est_model.txt<br /> |text=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt; <br /> INPUT:<br /> parameter = {ka, V, Cl, h0, gamma}<br /> <br /> EQUATION:<br /> ke=Cl/V<br /> Cc = amtDose*ka/(V*(ka-ke))*(exp(-ke*t) - exp(-ka*t))<br /> h = h0*exp(gamma*Cc)<br /> <br /> OBSERVATION:<br /> Concentration = {type=continuous, prediction=Cc, errorModel=constant}<br /> Hemorrhaging = {type=event, hazard=h}<br /> <br /> OUTPUT:<br /> output = {Concentration, Hemorrhaging}<br /> &lt;/pre&gt; }}<br /> <br /> <br /> Here, {{Verbatim|amtDose}} is a reserved keyword for the last administered dose.<br /> <br /> The model's parameters are the absorption rate constant $ka$, the volume of distribution $V$, the clearance $Cl$, the baseline hazard $h_0$ and the coefficient $\gamma$. The statistical model for the individual parameters can be defined in the $\monolix$ project file (left) and/or the $\monolix$ GUI (right):<br /> <br /> <br /> {{ExampleWithCode&amp;Image<br /> |title=<br /> |text=<br /> |code={{MLXTranForTable<br /> |name=<br /> |text=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt; <br /> INDIVIDUAL:<br /> ka = {distribution=logNormal, iiv=yes}<br /> V = {distribution=logNormal, iiv=yes}<br /> Cl = {distribution=normal, iiv=yes}<br /> h0 = {distribution=probitNormal, iiv=yes}<br /> gamma = {distribution=logitNormal, iiv=yes}<br /> &lt;/pre&gt; }}<br /> |image=<br /> [[File:Vsaem1.png]]<br /> }}<br /> <br /> <br /> Once the model is implemented, tasks such as maximum likelihood estimation can be performed using the SAEM algorithm. Certain settings in SAEM must be provided by the user.
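To give a feeling for what SAEM does (simulate the individual parameters, then update the population parameter by stochastic approximation), here is a deliberately simple toy example in which the conditional distribution is Gaussian and can be sampled exactly. The model and all numerical values are assumptions for illustration only; this is in no way the $\monolix$ implementation:

```python
import math
import random

random.seed(1)

# Toy model (illustrative, unrelated to the PK example): y_i = psi_i + e_i
# with psi_i ~ N(theta, w2) and e_i ~ N(0, s2), where w2 and s2 are known.
# The marginal likelihood is tractable and the MLE of theta is exactly mean(y).
w2, s2, n = 0.25, 0.09, 200
y = [random.gauss(2.0, math.sqrt(w2)) + random.gauss(0.0, math.sqrt(s2))
     for _ in range(n)]

theta = 0.0   # deliberately poor initial value
stat = 0.0    # stochastic approximation of the sufficient statistic mean(psi)
prec = 1.0 / s2 + 1.0 / w2
for k in range(1, 2001):
    # Simulation step: draw each psi_i from its conditional distribution
    # (Gaussian here, so it can be sampled exactly; in general this is
    # where MCMC comes in).
    psis = [random.gauss((yi / s2 + theta / w2) / prec, math.sqrt(1.0 / prec))
            for yi in y]
    # Stochastic approximation + maximization steps: damped update of the
    # sufficient statistic, then update of theta.
    gamma_k = 1.0 if k <= 100 else 1.0 / (k - 100)
    stat += gamma_k * (sum(psis) / n - stat)
    theta = stat
```

With step sizes equal to 1 during an exploratory phase and decreasing afterwards, $\theta$ converges to the MLE, which for this toy model is exactly the empirical mean of the observations.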
Even though SAEM is quite insensitive to the initial parameter values,<br /> it is possible to perform a preliminary sensitivity analysis in order to select &quot;good&quot; initial values.<br /> <br /> <br /> {{ImageWithCaption|image=Vsaem2.png|caption=Looking for good initial values for SAEM}}<br /> <br /> <br /> <br /> Then, when we run SAEM, it converges easily and quickly to the MLE:<br /> <br /> <br /> {{JustCode<br /> |code=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt;Estimation of the population parameters<br /> <br /> parameter<br /> ka : 0.974<br /> V : 7.07<br /> Cl : 2.00<br /> h0 : 0.0102<br /> gamma : 0.485<br /> <br /> omega_ka : 0.668<br /> omega_V : 0.365<br /> omega_Cl : 0.588<br /> omega_h0 : 0.105<br /> omega_gamma : 0.0901<br /> <br /> a_1 : 0.345<br /> &lt;/pre&gt; }}<br /> <br /> <br /> Parameter estimation can therefore be seen as estimating the reference values and the variances of the random effects.<br /> <br /> In addition to these numbers, it is important to be able to represent these distributions graphically in order to see and therefore understand them better. Indeed, the interpretation of certain parameters is not always simple. Of course, we know what a normal distribution represents, and in particular its mean, median and mode, which are equal (see the distribution of $Cl$ below for instance). These measures of central tendency can differ for asymmetric distributions such as the log-normal (see the distribution of $ka$).<br /> <br /> Interpreting dispersion terms like $\omega_{ka}$ and $\omega_{V}$ is not obvious either when the parameter distributions are not normal.
In such cases, quartiles or quantiles of order 5% and 95% (for example) may be useful for quantitatively describing the variability of these parameters.<br /> <br /> <br /> {{Remarks <br /> |title=Remarks<br /> |text=<br /> For a parameter $\psi$ whose distribution is log-normal, we can approximate the coefficient of variation for $\psi$ by the standard deviation $\omega_{\psi}$ of the random effect $\eta$ if this is fairly small. Indeed, when $\omega_{\psi}$ is small,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \psi &amp;=&amp; \psi_{\rm pop} e^{\eta} \\<br /> &amp;\approx &amp; \psi_{\rm pop}(1+ \eta) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Thus<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \esp{\psi} &amp;\approx&amp; \psi_{\rm pop} \\<br /> \std{\psi} &amp;\approx &amp; \psi_{\rm pop}\omega_{\psi},<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> and<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> {\rm cv}(\psi) &amp;=&amp; \frac{\std{\psi} }{\esp{\psi} } \\<br /> &amp;\approx &amp; \omega_{\psi} .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Do not forget that this approximation is only valid when $\omega$ is small, and only for log-normal distributions. It does not carry over to any other distribution. Thus, when $\omega_{h0}=0.1$ for a probit-normal distribution or $\omega_{\gamma}=0.09$ for a logit-normal one, there is no immediate interpretation available.
Only by looking at the graphical display of the pdf or by calculating some quantiles of interest can we begin to get an idea of dispersion in the parameters $h0$ and $\gamma$.<br /> }}<br /> <br /> <br /> {{ImageWithCaption|image=saem3b.png|caption=Estimation of the population distributions of the individual parameters of the model }}<br /> <br /> <br /> <br /> &lt;br&gt;<br /> <br /> ==Bayesian estimation of the population parameters==<br /> <br /> The ''Bayesian approach'' considers $\theta$ as a random vector with a ''prior distribution'' $\qth$. We can then define the posterior distribution of $\theta$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \pcthy(\theta {{!}} \by ) &amp;=&amp; \displaystyle{ \frac{\pth( \theta )\pcyth(\by {{!}} \theta )}{\py(\by)} }\\<br /> &amp;=&amp; \displaystyle{ \frac{\pth( \theta ) \int \pypsith(\by,\bpsi {{!}}\theta) \, d \bpsi}{\py(\by)} }.<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> We can estimate this conditional distribution and derive any statistics (posterior mean, standard deviation, percentiles, etc.) or derive the so-called ''Maximum a Posteriori'' (MAP) estimate of $\theta$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \hat{\theta}^{\rm MAP} &amp;=&amp; \argmax{\theta} \pcthy(\theta {{!}} \by ) \\<br /> &amp;=&amp; \argmax{\theta} \left\{ {\llike}(\theta ; \by) + \log( \pth( \theta ) ) \right\} .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The MAP estimate therefore maximizes a penalized version of the observed likelihood. In other words, maximum a posteriori estimation reduces to penalized maximum likelihood estimation. Suppose for instance that $\theta$ is a scalar parameter and the prior is a normal distribution with mean $\theta_0$ and variance $\gamma^2$. 
Then, the MAP estimate is given by<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> \hat{\theta}^{\rm MAP} =\argmax{\theta} \left\{ {\llike} (\theta ; \by) - \displaystyle{ \frac{1}{2\gamma^2} }(\theta - \theta_0)^2 \right\} .<br /> &lt;/math&gt; }}<br /> <br /> The MAP estimate is a trade-off between the MLE, which maximizes ${\llike}(\theta ; \by)$, and $\theta_0$, which minimizes $(\theta - \theta_0)^2$. The weight given to the prior directly depends on the variance of the prior distribution: the smaller $\gamma^2$ is, the closer to $\theta_0$ the MAP is. In the limiting case $\gamma^2=0$, the prior means that $\theta$ is fixed at $\theta_0$ and no longer needs to be estimated.<br /> <br /> Both the Bayesian and frequentist approaches have their supporters and detractors. But rather than being dogmatic and blindly following the same rule-book every time, we need to be pragmatic and ask the right methodological questions when confronted with a new problem.<br /> <br /> We have to remember that Bayesian methods have been extremely successful, in particular for numerical calculations. For instance, (Bayesian) MCMC methods allow us to estimate more or less any conditional distribution coming from any hierarchical model, whereas frequentist approaches such as maximum likelihood estimation can be much more difficult to implement.<br /> <br /> All things said, the problem comes down to knowing whether the data contains sufficient information to answer a given question, and whether some other information may be available to help answer it. This is the essence of the art of modeling: finding the right compromise between the confidence we have in the data and prior knowledge of the problem. Each problem is different and requires a specific approach.
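For this scalar example, the trade-off can be made fully explicit in a conjugate toy setting (an assumption for illustration: $y_i \sim {\cal N}(\theta,s^2)$ with $s^2$ known), where the MAP estimate is a weighted average of the MLE and the prior mean $\theta_0$:

```python
# Toy conjugate setting (an assumption for illustration): y_i ~ N(theta, s2)
# with s2 known, and a normal prior N(theta0, g2) on theta. The MAP estimate
# is then a weighted average of the MLE (the sample mean) and theta0.
def map_estimate(y, s2, theta0, g2):
    n = len(y)
    ybar = sum(y) / n            # the MLE of theta
    if g2 == 0:                  # limiting case: theta is fixed at theta0
        return theta0
    w = (n / s2) / (n / s2 + 1 / g2)   # weight given to the data
    return w * ybar + (1 - w) * theta0
```

The smaller the prior variance (here {{Verbatim|g2}}) is, the more the estimate is pulled toward $\theta_0$; whether introducing such a prior is justified is precisely the kind of problem-specific question raised above.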
For instance, if all the patients in a pharmacokinetic trial have essentially the same weight, it is pointless to estimate a relationship between weight and the model's PK parameters using the trial data. In this case, the modeler would be better served trying to use prior information based on physiological criteria rather than just a statistical model.<br /> <br /> Therefore, we can use information available to us, of course! Why not? But this information needs to be pertinent. Systematically using a prior for the parameters is not always meaningful. Can we reasonably suppose that we have access to such information? For continuous data for example, what does putting a prior on the residual error model's parameters mean in reality? A reasoned statistical approach consists of only including prior information for certain parameters (those for which we have real prior information) and having confidence in the data for the others.<br /> <br /> $\monolix$ allows this hybrid approach which reconciles the Bayesian and frequentist approaches. A given parameter can be:<br /> <br /> <br /> &lt;ul&gt;<br /> * a fixed constant if we have absolute confidence in its value or the data does not allow it to be estimated, essentially due to identifiability constraints.<br /> &lt;br&gt;<br /> <br /> * estimated by maximum likelihood, either because we have great confidence in the data or have no information on the parameter.<br /> &lt;br&gt;<br /> <br /> * estimated by introducing a prior and calculating the MAP estimate.<br /> &lt;br&gt;<br /> <br /> * estimated by introducing a prior and then estimating the posterior distribution.<br /> &lt;/ul&gt;<br /> <br /> <br /> We put aside dealing with the fixed components of $\theta$ in the following.
Here are some possible situations:<br /> <br /> <br /> &lt;ol&gt;<br /> &lt;li&gt; ''Combined maximum likelihood and maximum a posteriori estimation'': decompose $\theta$ into $(\theta_E,\theta_{M})$ where $\theta_E$ are the components of $\theta$ to be estimated with MLE and $\theta_{M}$ those with a prior distribution whose posterior distribution is to be maximized. Then, $(\hat{\theta}_E , \hat{\theta}_{M} )$ below maximizes the penalized likelihood of $(\theta_E,\theta_{M})$: &lt;/li&gt;<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> (\hat{\theta}_E , \hat{\theta}_{M} ) &amp;=&amp; \argmax{\theta_E , \theta_{M} } \log(\py(\by , \theta_{M}; \theta_E)) \\<br /> &amp;=&amp; \argmax{\theta_E , \theta_{M} } \left\{ {\llike}(\theta_E , \theta_{M}; \by) + \log( \pth( \theta_M ) ) \right\} ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where ${\llike} (\theta_E , \theta_{M}; \by) \ \ \eqdef \ \ \log\left(\py(\by | \theta_{M}; \theta_E)\right).$<br /> <br /> <br /> &lt;li&gt; ''Combined maximum likelihood and posterior distribution estimation'': here, decompose $\theta$ into $(\theta_E,\theta_{R})$ where $\theta_E$ are the components of $\theta$ to be estimated with MLE and $\theta_{R}$ those with a prior distribution whose posterior distribution is to be estimated. We propose the following strategy for estimating $\theta_E$ and $\theta_{R}$: &lt;/li&gt;<br /> <br /> <br /> &lt;ol style=&quot;list-style-type:lower-roman&quot;&gt;<br /> &lt;li&gt; Compute the maximum likelihood of $\theta_E$: &lt;/li&gt;<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \hat{\theta}_E &amp;=&amp; \argmax{\theta_E} \log(\py(\by ; \theta_E)) \\<br /> &amp;=&amp; \argmax{\theta_E} \int \pmacro(\by , \theta_R ; \theta_E ) d \theta_R .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> &lt;li&gt; Estimate the conditional distribution $\pmacro(\theta_{R} | \by ;\hat{\theta}_E)$. 
&lt;/li&gt;<br /> &lt;/ol&gt;<br /> <br /> <br /> It is then straightforward to extend this approach to more complex situations where some components of $\theta$ are estimated with MLE, others using MAP estimation and others still by estimating their conditional distributions.<br /> &lt;/ol&gt;<br /> <br /> <br /> {{Example1<br /> |title1=Example<br /> |title2=A PK example<br /> |text=<br /> In this example we use only the pharmacokinetic data and aim to estimate the population parameter distributions of the PK parameters $ka$, $V$ and $Cl$. We assume log-normal distributions for these three parameters. All of the model's population parameters are estimated by maximum likelihood estimation except $ka_{\rm pop}$ for which a log-normal distribution is used as a prior:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \log(ka_{\rm pop}) \sim {\cal N}(\log(1.5), \gamma^2) . &lt;/math&gt; }}<br /> <br /> $\monolix$ allows us to compute the MAP estimate and to estimate the posterior distribution of $ka_{\rm pop}$ for various values of $\gamma$.<br /> <br /> <br /> &lt;div style=&quot;margin-left:17%; margin-right:17%; align:center&quot;&gt;<br /> {{{!}} class=&quot;wikitable&quot; align=&quot;center&quot; style=&quot;width:100%&quot;<br /> {{!}} $\gamma$ {{!}}{{!}} 0 {{!}}{{!}} 0.01 {{!}}{{!}} 0.025 {{!}}{{!}} 0.05 {{!}}{{!}} 0.1 {{!}}{{!}} 0.2 {{!}}{{!}} $+ \infty$ <br /> {{!}}-<br /> {{!}}$\hat{ka}_{\rm pop}^{\rm MAP}$ {{!}}{{!}} 1.5 {{!}}{{!}} 1.49 {{!}}{{!}} 1.47 {{!}}{{!}} 1.39 {{!}}{{!}} 1.22 {{!}}{{!}} 1.11 {{!}}{{!}} 1.05 <br /> {{!}}}&lt;/div&gt;<br /> <br /> {{ImageWithCaption|image=bayes1.png|caption=Prior and posterior distributions of $ka_{\rm pop}$ for different values of $\gamma$}}<br /> <br /> <br /> As expected, the posterior distribution converges to the prior distribution when the standard deviation $\gamma$ of the prior distribution decreases. 
Also, the mode of the posterior distribution converges to the maximum likelihood estimate of $ka_{\rm pop}$ when $\gamma$ increases.<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> == Estimation of the Fisher information matrix ==<br /> <br /> The variance of the estimator $\thmle$ and thus confidence intervals can be derived from the [[Estimation of the observed Fisher information matrix|observed Fisher information matrix (F.I.M.)]], which itself is calculated using the observed likelihood (i.e., the pdf of the observations $\by$):<br /> <br /> {{EquationWithRef<br /> |equation=&lt;div id=&quot;ofim_intro3&quot;&gt;&lt;math&gt;<br /> \ofim(\thmle ; \by) \ \ \eqdef \ \ - \displaystyle{ \frac{\partial^2}{\partial \theta^2} }\log({\like}(\thmle ; \by)) .<br /> &lt;/math&gt;&lt;/div&gt;<br /> |reference=(1) }}<br /> <br /> Then, the variance-covariance matrix of the maximum likelihood estimator $\thmle$ can be estimated by the inverse of the observed F.I.M. Standard errors (s.e.) for each component of $\thmle$ are their standard deviations, i.e., the square-root of the diagonal elements of this covariance matrix. $\monolix$ also displays the (estimated) relative standard errors (r.s.e.), i.e., the (estimated) standard error divided by the value of the estimated parameter.<br /> <br /> <br /> {{JustCode<br /> |code=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt;Estimation of the population parameters<br /> <br /> parameter s.e. (s.a.) r.s.e.(%)<br /> ka : 0.974 0.082 8<br /> V : 7.07 0.35 5<br /> Cl : 2 0.07 4<br /> h0 : 0.0102 0.0014 14<br /> gamma : 0.485 0.015 3<br /> <br /> omega_ka : 0.668 0.064 10<br /> omega_V : 0.365 0.037 10<br /> omega_Cl : 0.588 0.055 9<br /> omega_h0 : 0.105 0.032 30<br /> omega_gamma : 0.0901 0.044 49<br /> <br /> a_1 : 0.345 0.012 3<br /> &lt;/pre&gt; }}<br /> <br /> The F.I.M. can be used for detecting overparametrization of the structural model. 
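The passage from the observed F.I.M. to the standard errors displayed above can be sketched as follows. The $2\times 2$ matrix below is hypothetical, chosen only so that the resulting standard errors have the same order of magnitude as those reported for $ka$ and $V$:

```python
import math

# Hypothetical 2x2 observed F.I.M. for (ka, V); illustrative numbers only.
theta_hat = {"ka": 0.974, "V": 7.07}
fim = [[150.0, 12.0],
       [12.0, 8.5]]

# Covariance matrix of the MLE = inverse of the observed F.I.M.
det = fim[0][0] * fim[1][1] - fim[0][1] * fim[1][0]
cov = [[fim[1][1] / det, -fim[0][1] / det],
       [-fim[1][0] / det, fim[0][0] / det]]

# s.e. = square root of the diagonal; r.s.e. = s.e. / estimate, in percent.
se = {"ka": math.sqrt(cov[0][0]), "V": math.sqrt(cov[1][1])}
rse = {p: 100.0 * se[p] / theta_hat[p] for p in theta_hat}
```

When the F.I.M. is nearly singular, this inversion becomes unstable and the resulting standard errors blow up.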
Indeed, if the model is poorly identifiable, certain estimators will be strongly correlated and the F.I.M. will therefore be poorly conditioned and difficult to invert. Suppose for example that we want to fit a two-compartment PK model to the same data as before. The output is shown below. The large values of the relative standard errors for the inter-compartmental clearance $Q$ and the volume of the peripheral compartment $V_2$ mean that the data does not allow us to estimate these two parameters well.<br /> <br /> <br /> {{JustCode<br /> |code=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt;Estimation of the population parameters<br /> <br /> parameter s.e. (lin) r.s.e.(%)<br /> ka : 0.246 0.0081 3<br /> Cl : 1.9 0.075 4<br /> V1 : 1.71 0.14 8<br /> Q : 0.000171 0.024 1.43e+04<br /> V2 : 0.00673 3.1 4.62e+04<br /> <br /> omega_ka : 0.171 0.026 15<br /> omega_Cl : 0.293 0.026 9<br /> omega_V1 : 0.621 0.062 10<br /> omega_Q : 5.72 1.4e+03 2.41e+04<br /> omega_V2 : 4.61 1.8e+04 3.94e+05<br /> <br /> a : 0.136 0.0073 5<br /> &lt;/pre&gt; }}<br /> <br /> <br /> The Fisher information matrix is also widely used in optimal experimental design. Indeed, minimizing the variance of the estimator corresponds to maximizing the information. Estimators and designs can then be evaluated by looking at certain summary statistics of the covariance matrix (such as the determinant or trace).<br /> <br /> &lt;br&gt;<br /> == Estimation of the individual parameters ==<br /> <br /> Once $\theta$ has been estimated, the conditional distribution $\pmacro(\psi_i | y_i ; \hat{\theta})$ of the individual parameters $\psi_i$ can be estimated for each individual $i$ using the [[The Metropolis-Hastings algorithm for simulating the individual parameters| Metropolis-Hastings algorithm]].
For each $i$, this algorithm generates a sequence $(\psi_i^{k}, k \geq 1)$ which converges in distribution to the conditional distribution $\pmacro(\psi_i | y_i ; \hat{\theta})$ and that can be used for estimating any summary statistic of this distribution (mean, standard deviation, quantiles, etc.).<br /> <br /> The mode of this conditional distribution can be estimated using this sequence or by maximizing $\pmacro(\psi_i | y_i ; \hat{\theta})$ using numerical methods.<br /> <br /> The choice of using the conditional mean or the conditional mode is arbitrary. By default, $\monolix$ uses the conditional mode, taking the philosophy that the &quot;most likely&quot; values of the individual parameters are the most suited for computing the &quot;most likely&quot; predictions.<br /> <br /> <br /> {{ImageWithCaption|image=mode1.png|caption=Predicted concentrations for 6 individuals using the estimated conditional modes of the individual PK parameters}} <br /> <br /> &lt;br&gt;<br /> <br /> == Estimation of the observed log-likelihood ==<br /> <br /> <br /> Once $\theta$ has been estimated, the observed log-likelihood of $\hat{\theta}$ is defined as<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \begin{eqnarray}<br /> {\llike} (\hat{\theta};\by) &amp;=&amp; \log({\like}(\hat{\theta};\by)) \\<br /> &amp;\eqdef&amp; \log(\py(\by;\hat{\theta})) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The observed log-likelihood cannot be computed in closed form for nonlinear mixed effects models, but can be estimated using the methods described in the [[Estimation of the log-likelihood]] Section. 
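To give the flavor of such methods, here is the simplest one on a toy model where the integral is tractable: plain Monte Carlo integration over the individual parameters, i.e., importance sampling with the population distribution as proposal. The model and all values are assumptions for illustration; this is not the estimator used by $\monolix$:

```python
import math
import random

random.seed(0)

# Toy model (an assumption for illustration): y_i = psi_i + e_i with
# psi_i ~ N(theta, w^2) and e_i ~ N(0, s^2). The observed likelihood
# integrates psi out; below it is estimated by Monte Carlo over psi.
theta, w, s = 1.0, 0.5, 0.3
y = [0.8, 1.4, 1.1]

def norm_pdf(x, m, sd):
    return math.exp(-0.5 * ((x - m) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def loglik_mc(y, theta, w, s, M=20000):
    ll = 0.0
    for yi in y:
        # average of p(y_i | psi) over M draws psi ~ N(theta, w^2)
        mean_p = sum(norm_pdf(yi, random.gauss(theta, w), s)
                     for _ in range(M)) / M
        ll += math.log(mean_p)
    return ll

# For this linear Gaussian toy the integral is known: y_i ~ N(theta, w^2+s^2).
exact = sum(math.log(norm_pdf(yi, theta, math.sqrt(w**2 + s**2))) for yi in y)
```

In practice, better proposal distributions are needed to keep the Monte Carlo variance under control, which is what dedicated importance sampling estimators are designed for.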
The estimated log-likelihood can then be used for performing likelihood ratio tests and for computing information criteria such as AIC and BIC (see the [[Evaluation]] Section).<br /> <br /> <br /> &lt;br&gt;<br /> == Bibliography ==<br /> <br /> &lt;bibtex&gt;<br /> @article{Monolix,<br /> author = {Lixoft},<br /> title = {Monolix 4.2},<br /> year={2012},<br /> journal = {http://www.lixoft.eu/products/monolix/product-monolix-overview},<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @article{comets2011package,<br /> title={saemix: Stochastic Approximation Expectation Maximization {(SAEM)} algorithm. R package version 0.96.1},<br /> author={Comets, E. and Lavenu, A. and Lavielle, M.},<br /> journal = {http://cran.r-project.org/web/packages/saemix/index.html},<br /> year={2013}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @article{nlmefitsa,<br /> title={nlmefitsa: fit nonlinear mixed-effects model with stochastic {EM} algorithm. Matlab R2013a function},<br /> author={The MathWorks},<br /> journal = {http://www.mathworks.fr/fr/help/stats/nlmefitsa.html},<br /> year={2013}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @article{beal1992nonmem,<br /> title={NONMEM users guides},<br /> author={Beal, S.L. and Sheiner, L.B. and Boeckmann, A. and Bauer, R.J.},<br /> journal={San Francisco, NONMEM Project Group, University of California},<br /> year={1992}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @book{pinheiro2000mixed,<br /> title={Mixed effects models in S and S-PLUS},<br /> author={Pinheiro, J.C. and Bates, D.M.},<br /> year={2000},<br /> publisher={Springer Verlag}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @article{pinheiro2010r,<br /> title={the R Core team (2009) nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-96},<br /> author={Pinheiro, J. and Bates, D. and DebRoy, S.
and Sarkar, D.},<br /> journal={R Foundation for Statistical Computing, Vienna},<br /> year={2010}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @article{spiegelhalter2003winbugs,<br /> title={WinBUGS user manual},<br /> author={Spiegelhalter, D. and Thomas, A. and Best, N. and Lunn, D.},<br /> journal={Cambridge: MRC Biostatistics Unit},<br /> year={2003}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @Manual{docSPSS,<br /> title = {Linear mixed-effects modeling in {SPSS}. An introduction to the {MIXED} procedure},<br /> author = {SPSS},<br /> year = {2002},<br /> note={Technical Report}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> &lt;bibtex&gt;<br /> @Manual{docSAS,<br /> title = {The {NLMMIXED} procedure, {SAS/STAT} 9.2 User's Guide},<br /> chapter = {61},<br /> pages = {4337--4435},<br /> author = {SAS},<br /> year = {2008}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> <br /> {{Back&amp;Next<br /> |linkBack=Visualization<br /> |linkNext=Evaluation }}</div> Marc https://wiki.inria.fr/wikis/popix/index.php?title=Visualization&diff=7168 Visualization 2013-06-04T18:17:48Z <p>Marc: /* Bibliography */</p> <hr /> <div>== Introduction ==<br /> <br /> Before deciding to model data, it is very important to be able to visualize it. This is especially the case for longitudinal data when we want to see how an outcome varies with time or as a function of another outcome. We may also want to visualize how the individual covariates are distributed, visually detect if there are relationships between variables, visually compare data from different groups, etc. Development of such visual exploration tools poses no methodological problems. It is simple to write Matlab or R code for one's own needs.
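For instance, the basic data handling behind a so-called spaghetti plot (each individual's observations joined by line segments) takes only a few lines. The Python sketch below groups long-format records (id, time, y) into one trajectory per individual; the records are invented for illustration, and the actual plotting call (e.g. with matplotlib) is indicated in a comment:

```python
# Sketch: group long-format longitudinal records (one row per observation)
# into one trajectory per individual, as needed for a spaghetti plot.
# The records below are invented for illustration.
records = [
    # (id, time, y)
    (1, 0.0, 0.0), (1, 1.0, 2.1), (1, 2.0, 1.4),
    (2, 0.0, 0.0), (2, 1.0, 3.0), (2, 2.0, 2.2),
]

trajectories = {}
for ident, t, y in records:
    trajectories.setdefault(ident, []).append((t, y))

# Each individual's time-sorted points would then be drawn as one line, e.g.:
#   for pts in trajectories.values():
#       plt.plot(*zip(*sorted(pts)))
for ident, pts in sorted(trajectories.items()):
    print(ident, sorted(pts))
```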
To illustrate the data visualization part of this chapter, we have created a little Matlab toolbox called $\popixplore$ ({{filepath:popixplore 1.1.zip}}) which can be freely downloaded and used.<br /> <br /> It may also be useful to be able to visualize the model itself by undertaking a sensitivity analysis to look at how the structural model changes when we vary one or several parameters. This is important for truly understanding the structural model, i.e., what is behind the given mathematical equations. In the modeling context, we may also want to visually calibrate parameters in order to obtain predictions as close as possible to the observations. Developing such a tool is a difficult task because the tool needs to be able to easily input a model using some coding language, perform complex calculations, and provide a decent graphical interface (e.g., one that lets you easily modify the model parameters).<br /> <br /> Various model visualization tools exist, such as [http://www.berkeleymadonna.com/index.html Berkeley Madonna], which specializes in the analysis of dynamical systems and the resolution of ordinary differential equations. Here, we use [http://www.lixoft.eu/products/mlxplore/mlxplore-overview/ $\mlxplore$] for several reasons:<br /> <br /> <br /> &lt;ul&gt;<br /> * $\mlxplore$ uses the $\mlxtran$ language which is extremely flexible and well-adapted to implementing complex mixed-effects models. Indeed, with $\mlxtran$ we can implement pharmacokinetic models with complex administration schedules, include inter-individual variability in parameters, define a statistical model for the covariates, etc. Another extremely important aspect of $\mlxtran$ is that it rigorously adopts the model representation formalisms proposed in $\wikipopix$.
In other words, model implementation is completely in sync with its mathematical representation.<br /> &lt;br&gt;<br /> <br /> * $\mlxplore$ provides a clear graphical interface that of course allows us to visualize the structural model, but also the statistical model, which is of fundamental importance in the population approach. We can thus visualize the impact of covariates and inter-individual variability of model parameters on predictions.<br /> &lt;/ul&gt;<br /> <br /> <br /> &lt;br&gt;<br /> <br /> == Data exploration ==<br /> <br /> <br /> The following example involves 80 individuals who receive a single dose of an anticoagulant at time $t=0$. For each patient we then measure the plasma concentration of the drug at various times. This drug can cause undesirable side effects such as nose bleeds; if this happens, we also record the times at which it happens. The data is recorded in columns of a single text file {{Verbatim|pkrtte_data.csv}}. In this example, the columns are:<br /> <br /> <br /> &lt;ul&gt;<br /> '''id''' the ID number of the patient<br /> &lt;br&gt;&lt;br&gt;<br /> '''time''' dose administration and observation times<br /> &lt;br&gt;&lt;br&gt;<br /> '''amt''' the amount of drug administered<br /> &lt;br&gt;&lt;br&gt;<br /> '''y''' the observations (concentrations and events)<br /> &lt;br&gt;&lt;br&gt;<br /> '''ytype''' the type of observation: 1=concentration, 2=event<br /> &lt;br&gt;&lt;br&gt;<br /> '''weight''' a continuous individual covariate<br /> &lt;br&gt;&lt;br&gt;<br /> '''gender''' a categorical individual covariate (F or M)<br /> &lt;br&gt;&lt;br&gt;<br /> '''group''' four different groups receive different doses: A=40mg, B=60mg, C=80mg, D=100mg.<br /> &lt;/ul&gt;<br /> <br /> <br /> {{ImageWithCaption|image=exploredata0.png|caption=The datafile {{Verbatim|pkrtte_data.csv}} }} <br /> <br /> <br /> We can read this datafile with the function {{Verbatim|readdatapx}} and add additional information about the data:<br /> <br /> <br
/> {{MATLABcode<br /> |name=<br /> |code=<br /> &lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> datafile.name='pkrtte_data.csv';<br /> datafile.format='csv'; % can be &quot;csv&quot;, &quot;space&quot;, &quot;tab&quot; or &quot;;&quot;<br /> <br /> info.header = {'ID','TIME','AMT','Y','YTYPE','COV','CAT','CAT'};<br /> info.observation.name={'concentration','hemorrhaging'};<br /> info.observation.type={'continuous','event'};<br /> info.observation.unit={'mg/l',''};<br /> info.covariate.unit={'kg',''};<br /> info.time.unit='h';<br /> <br /> data=readdatapx(datafile,info);<br /> &lt;/pre&gt; }}<br /> <br /> <br /> How we graphically represent data depends on the type of data. Often for continuous data we use &quot;spaghetti plots&quot;, where all of the observations are given on the same plot, and those for each individual are joined up using line segments. Time-to-event data are usually represented using Kaplan-Meier plots, i.e., an estimate of the survival function for the first event. In the case of repeated events, we can instead represent the average cumulative number of events per individual.<br /> <br /> <br /> {{MATLABcode<br /> |name=<br /> |code=<br /> &lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> &gt;&gt;exploredatapx(data)<br /> &lt;/pre&gt; }}<br /> <br /> <br /> {{ImageWithCaption|image=exploredata1.png|caption=Graphical representation of the data. Left: concentrations, right: average cumulative number of events per individual}}<br /> <br /> <br /> When different groups receive different treatments, it can be useful to separately visualize the data from each group.
Here for instance we can separate the patients into groups depending on the initial dose given.<br /> <br /> <br /> {{ImageWithCaption|image=exploredata2.png|caption=Concentration profiles per dose group}}<br /> <br /> <br /> {| cellpadding=&quot;10&quot; cellspacing=&quot;0&quot;<br /> |style = &quot;width:50%&quot;| [[File:exploredata3a.png]] <br /> |style = &quot;width:50%&quot;| [[File:exploredata3b.png]]<br /> |-<br /> |colspan=&quot;2&quot; align=&quot;center&quot; style=&quot;text-align:center&quot;| ''Distribution of weight and gender per dose group'' <br /> |}<br /> <br /> <br /> <br /> {{Remarks<br /> |title=Remark<br /> |text=The data file {{Verbatim|pkrtte_data.csv}} and the Matlab script {{Verbatim|pkrtte_demo.m}} are available in the folder {{Verbatim|demos}} of $\popixplore$: {{filepath:popixplore 1.1.zip}}.<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> <br /> ==Model exploration==<br /> <br /> ===Exploring the structural model===<br /> <br /> Suppose that we now want to visualize the following joint model, which can be used for simultaneously modeling PK and time-to-event data:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> k&amp;=&amp;Cl/V \\<br /> \deriv{A_d} &amp;=&amp; - k_a \, A_d(t) \\<br /> \deriv{A_c} &amp;=&amp; k_a \, A_d(t) - k \, A_c(t) \\<br /> Cc(t) &amp;=&amp; {A_c(t)}/{V} \\<br /> h(t) &amp;=&amp; h_0 \, \exp(\gamma\, Cc(t)) .<br /> \end{eqnarray} &lt;/math&gt; }}<br /> <br /> Here, $A_d$ and $A_c$ are the amounts of drug in the depot and central compartments, $Cc$ the concentration in the central compartment and $h$ the hazard function for the event of interest (hemorrhaging for instance).
The parameters of the model are the absorption rate constant $ka$, the volume of distribution $V$, the clearance $Cl$, the baseline hazard $h_0$ and the coefficient $\gamma$.<br /> We assume that the drug can be administered both orally and intravenously, i.e., to either the depot or the central compartment.<br /> <br /> We first need to implement this model using $\mlxtran$:<br /> <br /> <br /> {{MLXTran<br /> |name=joint1_model.txt<br /> |text=<br /> &lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> [PREDICTION]<br /> input={ka, V, Cl, h0, gamma}<br /> <br /> PK:<br /> depot(type=1,target=Ad)<br /> depot(type=2,target=Ac)<br /> <br /> EQUATION:<br /> k = Cl/V<br /> ddt_Ad = -ka*Ad<br /> ddt_Ac = ka*Ad - k*Ac<br /> Cc = Ac/V<br /> h = h0*exp(gamma*Cc)<br /> &lt;/pre&gt;}}<br /> <br /> <br /> Here, an administration of type 1 (resp. 2) is an oral (resp. iv) administration.<br /> <br /> The tasks, i.e., how the model is to be used, are then coded as an $\mlxplore$ project:<br /> <br /> <br /> {{MLXPlore<br /> |name=joint1_project.txt<br /> |text=<br /> &lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> &lt;MODEL&gt;<br /> file='joint1_model.txt'<br /> <br /> &lt;DESIGN&gt;<br /> [ADMINISTRATION]<br /> adm1={time=0, amount=50,type=1}<br /> <br /> &lt;PARAMETER&gt;<br /> ka = 0.5<br /> V = 10<br /> Cl = 0.5<br /> h0 = 0.01<br /> gamma = 0.5<br /> <br /> &lt;OUTPUT&gt;<br /> list={Cc, h}<br /> grid=0:0.1:100<br /> &lt;/pre&gt; }}<br /> <br /> <br /> In this example, a single dose of 50 mg is administered orally ({{Verbatim|target{{-}}Ad}} when {{Verbatim|type{{-}}1}}) at time 0. We have asked $\mlxplore$ to display the predicted concentration $Cc$ and the hazard function $h$ between $t=0$ and $t=100$ every $0.1\,h$ for a given set of parameters.
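The curves that $\mlxplore$ displays for this design can also be reproduced outside the tool. The Python sketch below evaluates $Cc$ and $h$ on the grid 0:0.1:100 using the closed-form solution of the two linear ODEs for a single oral dose, with the parameter values of the project file above (an illustration only, not part of the $\mlxplore$ workflow):

```python
import math

# Parameter values from the mlxplore project above; single 50 mg oral dose.
ka, V, Cl, h0, gamma = 0.5, 10.0, 0.5, 0.01, 0.5
D = 50.0
ke = Cl / V

def Cc(t):
    """Concentration: closed-form solution of ddt_Ad = -ka*Ad,
    ddt_Ac = ka*Ad - ke*Ac with Ad(0) = D, Ac(0) = 0 and Cc = Ac/V."""
    return D * ka / (V * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

def h(t):
    """Hazard for the event of interest."""
    return h0 * math.exp(gamma * Cc(t))

grid = [0.1 * i for i in range(1001)]  # the output grid 0:0.1:100
cmax, tmax = max((Cc(t), t) for t in grid)
print(f"Cmax = {cmax:.2f} mg/l at t = {tmax:.1f} h, h(tmax) = {h(tmax):.4f} 1/h")
```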
We can then change the values of these parameters with the sliders to see their impact on the two functions.<br /> <br /> <br /> {{ImageWithCaption|image=exploremodel1.png|caption=Exploring the model using $\mlxplore$ }}<br /> <br /> <br /> We can easily modify the dose regimen without changing anything in the model itself. Suppose for instance that we now want to compare a treatment with repeated doses of 50mg every 24 hours and a treatment with repeated doses of 25mg every 12 hours. Only the section {{Verbatim|&lt;DESIGN&gt;}} needs to be modified:<br /> <br /> <br /> {{ExampleWithCode&amp;Image<br /> |title=<br /> |text=<br /> |code={{MLXPloreForTable<br /> |name=joint2_project.txt<br /> |text=<br /> &lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> <br /> &lt;DESIGN&gt;<br /> [ADMINISTRATION]<br /> adm1={time=0:24:144, amount=50,type=1}<br /> adm2={time=0:12:144, amount=25,type=1}<br /> &lt;/pre&gt; }}<br /> |image=[[File:exploremodel2.png]] }}<br /> <br /> <br /> We can combine different administrations (oral and intravenous for instance) into one global treatment:<br /> <br /> <br /> {{ExampleWithCode&amp;Image<br /> |title=<br /> |text=<br /> |code={{MLXPloreForTable<br /> |name=joint3_project.txt<br /> |text=<br /> &lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> &lt;DESIGN&gt;<br /> [ADMINISTRATION]<br /> adm1={time=0:24:144, amount=50,type=1}<br /> adm2={time=6:48:150, amount=25,type=2}<br /> <br /> [TREATMENT]<br /> trt1={adm1, adm2}<br /> &lt;/pre&gt; }}<br /> |image= [[File:exploremodel3.png]]<br /> }}<br /> <br /> ===Exploring the statistical model===<br /> <br /> One of the main advantages of $\mlxplore$ is its ability to graphically display the predicted distribution of the functions of interest $Cc$ and $h$ when certain parameters of the model are assumed to be random variables. Assume for instance that $V$, $Cl$ and $h_0$ are log-normally distributed.
To take this into account, we simply need to insert a section {{Verbatim|[INDIVIDUAL]}} into the model file:<br /> <br /> <br /> {{MLXTran<br /> |name=joint2_model.txt<br /> |text=&lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> [INDIVIDUAL]<br /> input={V_pop,Cl_pop,h0_pop,omega_V,omega_Cl,omega_h0}<br /> <br /> DEFINITION:<br /> V = {distribution=lognormal, reference=V_pop, sd=omega_V}<br /> Cl = {distribution=lognormal, reference=Cl_pop, sd=omega_Cl}<br /> h0 = {distribution=lognormal, reference=h0_pop, sd=omega_h0}<br /> <br /> [PREDICTION]<br /> input={ka, V, Cl, h0, gamma}<br /> .<br /> .<br /> .<br /> &lt;/pre&gt; }}<br /> <br /> <br /> The parameters of the model are now the population parameters $V_{\rm pop}$, $Cl_{\rm pop}$, $h0_{\rm pop}$, $\omega_V$, $\omega_{Cl}$ and $\omega_{h_0}$ and the parameters $k_a$ and $\gamma$, which have no inter-individual variability.<br /> <br /> <br /> {{MLXPlore<br /> |name=joint4_project.txt<br /> |text=&lt;pre style=&quot;background-color: #EFEFEF; border:none&quot;&gt;<br /> &lt;MODEL&gt;<br /> file='joint2_model.txt'<br /> <br /> &lt;DESIGN&gt;<br /> [ADMINISTRATION]<br /> adm1={time=0, amount=50,type=1}<br /> <br /> &lt;PARAMETER&gt;<br /> V_pop = 10<br /> Cl_pop = 0.5<br /> h0_pop=0.01<br /> omega_V = 0.2<br /> omega_Cl = 0.3<br /> omega_h0 = 0.2<br /> ka = 0.5<br /> gamma = 0.5<br /> <br /> &lt;OUTPUT&gt;<br /> list={Cc, h}<br /> grid=0:0.1:100<br /> &lt;/pre&gt; }}<br /> <br /> <br /> When some parameters of the model are random variables, $\mlxplore$ displays the median of the predicted distribution and several prediction intervals (the default is to use different shaded areas for the 10%, 20%, ..., 90% quantiles).<br /> <br /> <br /> {{ImageWithCaption|image=exploremodel4b.png|caption=Exploring the statistical model using $\mlxplore$}}<br /> <br /> <br /> It is possible to introduce covariates into the statistical model by considering for example that the volume depends on the weight,
and considering that these covariates are themselves random variables. This may be important if, for example, we want to visualize the amount of variation in concentration due to variation in weight, and the variation in concentration that remains unexplained and is attributed to the random effects.<br /> <br /> <br /> {{ImageWithCaption|image=exploremodel5.png|caption=Exploring the statistical model using $\mlxplore$ }}<br /> <br /> <br /> The $\mlxtran$ model files and the $\mlxplore$ scripts can be downloaded here: {{filepath:pk mlxplore.zip}}.<br /> <br /> <br /> &lt;br&gt;<br /> == Bibliography ==<br /> <br /> <br /> &lt;bibtex&gt;<br /> @ARTICLE{popixplore,<br /> author = {POPIX Inria team},<br /> title = {Popixplore 1.1},<br /> url = {https://wiki.inria.fr/wikis/popix/images/7/71/Popixplore_1.1.zip},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @ARTICLE{MLXplore,<br /> author = {Lixoft},<br /> title = {MLXPlore 1.0},<br /> url = {http://www.lixoft.eu/products/mlxplore/mlxplore-overview},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{macey2000berkeley,<br /> title={Berkeley Madonna user’s guide},<br /> author={Macey, R. and Oster, G. and Zahnley, T.},<br /> journal={Berkeley (CA): University of California},<br /> year={2000}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{chatterjee2009sensitivity,<br /> title={Sensitivity analysis in linear regression},<br /> author={Chatterjee, S. and Hadi, A. S.},<br /> volume={327},<br /> year={2009},<br /> publisher={Wiley}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{sensibilite2013,<br /> title={Analyse de sensibilité et exploration de modèles},<br /> author={Faivre, R. and Iooss, B. and Mah&amp;eacute;vas, S. and Makowski, D. and Monod, H.},<br /> year={2013},<br /> publisher={Editions Quae}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{saltelli2000sensitivity,<br /> title={Sensitivity analysis},<br /> author={Saltelli, A. and Chan, K.
and Scott, E. M. and others},<br /> volume={134},<br /> year={2000},<br /> publisher={Wiley New York}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{saltelli2008global,<br /> title={Global sensitivity analysis: the primer},<br /> author={Saltelli, A. and Ratto, M. and Andres, T. and Campolongo, F. and Cariboni, J. and Gatelli, D. and Saisana, M. and Tarantola, S.},<br /> year={2008},<br /> publisher={Wiley-Interscience}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @book{saltelli2004sensitivity,<br /> title={Sensitivity analysis in practice: a guide to assessing scientific models},<br /> author={Saltelli, A. and Tarantola, S. and Campolongo, F. and Ratto, M.},<br /> year={2004},<br /> publisher={Wiley}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> <br /> <br /> {{Next<br /> |link=Modeling}}</div> Marc https://wiki.inria.fr/wikis/popix/index.php?title=Estimation&diff=7167 Estimation 2013-06-04T17:30:20Z <p>Marc: </p> <hr /> <div>== Introduction ==<br /> <br /> In the modeling context, we usually assume that we have data that includes observations $\by$, measurement times $\bt$ and possibly additional regression variables $\bx$. There may also be individual covariates $\bc$, and in pharmacological applications the dose regimen $\bu$. For clarity, in the following notation we will omit the design variables $\bt$, $\bx$ and $\bu$, and the covariates $\bc$.<br /> <br /> Here, we find ourselves in the classical framework of incomplete data models. 
Indeed, only $\by = (y_{ij})$ is observed in the joint model $\pypsi(\by,\bpsi;\theta)$.<br /> <br /> The estimation tasks are classical ones in statistics:<br /> <br /> <br /> &lt;ol&gt;<br /> &lt;li&gt; Estimate the population parameter $\theta$ using the available observations and possibly a priori information.&lt;/li&gt;<br /> <br /> &lt;li&gt;Evaluate the precision of the proposed estimates.&lt;/li&gt;<br /> <br /> &lt;li&gt;Reconstruct the missing data, here the individual parameters $\bpsi=(\psi_i, 1\leq i \leq N)$. &lt;/li&gt;<br /> <br /> &lt;li&gt;Estimate the log-likelihood for a given model, i.e., for a given joint distribution $\qypsi$ and value of $\theta$.&lt;/li&gt;<br /> &lt;/ol&gt;<br /> <br /> <br /> &lt;br&gt;<br /> <br /> == Maximum likelihood estimation of the population parameters== <br /> <br /> &lt;br&gt;<br /> === Definitions ===<br /> <br /> <br /> ''Maximum likelihood estimation'' consists of maximizing with respect to $\theta$ the ''observed likelihood'' defined by:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \like(\theta ; \by) &amp;\eqdef&amp; \py(\by ; \theta) \\<br /> &amp;=&amp; \int \pypsi(\by,\bpsi ;\theta) \, d \bpsi .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Maximum likelihood estimation of the population parameter $\theta$ requires:<br /> <br /> &lt;blockquote&gt;<br /> * A model, i.e., a joint distribution $\qypsi$. Depending on the software used, the model can be implemented using a script or a graphical user interface. $\monolix$ is extremely flexible and allows us to combine both. It is possible for instance to code the structural model using $\mlxtran$ and use the GUI for implementing the statistical model. Whatever the options selected, the complete model can always be saved as a text file. &lt;br&gt;&lt;br&gt;<br /> * Inputs $\by$, $\bc$, $\bu$ and $\bt$.
All of these variables tend to be stored in a single data file (see the [[Visualization#Data exploration | Data Exploration ]] Section). &lt;br&gt;&lt;br&gt;<br /> * An algorithm which allows us to maximize $\int \pypsi(\by,\bpsi ;\theta) \, d \bpsi$ with respect to $\theta$. Each software package has its own algorithms implemented. It is not our goal here to rate and compare the various algorithms and implementations. We will use exclusively the SAEM algorithm as described in [[The SAEM algorithm for estimating population parameters | The SAEM algorithm]] and implemented in $\monolix$ as we are entirely satisfied by both its theoretical and practical qualities: &lt;br&gt;&lt;br&gt;<br /> ** The algorithms implemented in $\monolix$ including SAEM and its extensions (mixture models, hidden Markov models, SDE-based models, censored data, etc.) have been published in statistical journals. Furthermore, convergence of SAEM has been rigorously proved.&lt;br&gt;&lt;br&gt;<br /> ** The SAEM implementation in $\monolix$ is extremely efficient for a wide variety of complex models.&lt;br&gt;&lt;br&gt;<br /> ** The SAEM implementation in $\monolix$ was done by the same group that proposed the algorithm and studied in detail its theoretical and practical properties.<br /> &lt;/blockquote&gt;<br /> <br /> <br /> {{Remarks<br /> |title=Remark<br /> |text= It is important to highlight the fact that for a parameter $\psi_i$ whose distribution is the transformation of a normal one (log-normal, logit-normal, etc.) the MLE $\hat{\psi}_{\rm pop}$ of the reference parameter $\psi_{\rm pop}$ is neither the mean nor the mode of the distribution. It is in fact the median.<br /> <br /> To show why this is the case, let $h$ be a nonlinear, twice continuously differentiable and strictly increasing function such that $h(\psi_i)$ is normally distributed.<br /> <br /> <br /> * First we show that it is not the mean. By definition, the MLE of $h(\psi_{\rm pop})$ is $h(\hat{\psi}_{\rm pop})$.
Thus, the estimated distribution of $h(\psi_i)$ is the normal distribution with mean $h(\hat{\psi}_{\rm pop})$, but $\esp{h(\psi_i)} = h(\hat{\psi}_{\rm pop})$ implies that $\esp{\psi_i} \neq \hat{\psi}_{\rm pop}$ since $h$ is nonlinear. In other words, $\hat{\psi}_{\rm pop}$ is not the mean of the estimated distribution of $\psi_i$.<br /> <br /> <br /> * Next we show that it is not the mode. Let $f$ be the pdf of $\psi_i$ and let $f_h$ be the pdf of $h(\psi_i)$. By definition, for any $t$ in the support of $\psi_i$,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> f(t) = h^\prime(t)f_h(h(t)) . &lt;/math&gt; }}<br /> <br /> : Thus,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; <br /> f^\prime(t) = h^{\prime \prime}(t)f_h(h(t)) + h^{\prime 2}(t)f_h^\prime(h(t)) .<br /> &lt;/math&gt; }}<br /> <br /> : By definition of the mode, $f_h^\prime(h(\hat{\psi}_{\rm pop}))=0$. Since $h$ is nonlinear, $h^{\prime \prime}(\hat{\psi}_{\rm pop})\neq 0$ a.s. and $f^\prime(\hat{\psi}_{\rm pop})\neq 0$ a.s. In other words, $\hat{\psi}_{\rm pop}$ is not the mode of the estimated distribution of $\psi_i$.<br /> <br /> <br /> * Now we show that it is the median. Since $h$ is a strictly increasing function,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \probs{\hat{\psi}_{\rm pop} }{\psi_i \leq \hat{\psi}_{\rm pop} } &amp;=&amp; \probs{\hat{\psi}_{\rm pop} }{h(\psi_i) \leq h(\hat{\psi}_{\rm pop})} \\<br /> &amp;=&amp; 0.5 .<br /> \end{eqnarray}&lt;/math&gt; }} <br /> <br /> : In other words, $\hat{\psi}_{\rm pop}$ is the median of the estimated distribution of $\psi_i$.<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> <br /> === Example ===<br /> <br /> Let us again look at the model used in the [[Visualization#Model exploration | Model Visualization]] Section.
For the case of a single dose $D$ given at time $t=0$, the structural model is written:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> ke&amp;=&amp;Cl/V \\<br /> Cc(t) &amp;=&amp; \displaystyle{\frac{D \, ka}{V(ka-ke)} }\left(e^{-ke\,t} - e^{-ka\,t} \right) \\<br /> h(t) &amp;=&amp; h_0 \, \exp(\gamma\, Cc(t)) ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where $Cc$ is the concentration in the central compartment and $h$ the hazard function for the event of interest (hemorrhaging). Supposing a constant error model for the concentration, the model for the observations can easily be implemented using $\mlxtran$.<br /> <br /> <br /> {{MLXTran<br /> |name=joint1est_model.txt<br /> |text=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt; <br /> INPUT:<br /> parameter = {ka, V, Cl, h0, gamma}<br /> <br /> EQUATION:<br /> ke=Cl/V<br /> Cc = amtDose*ka/(V*(ka-ke))*(exp(-ke*t) - exp(-ka*t))<br /> h = h0*exp(gamma*Cc)<br /> <br /> OBSERVATION:<br /> Concentration = {type=continuous, prediction=Cc, errorModel=constant}<br /> Hemorrhaging = {type=event, hazard=h}<br /> <br /> OUTPUT:<br /> output = {Concentration, Hemorrhaging}<br /> &lt;/pre&gt; }}<br /> <br /> <br /> Here, {{Verbatim|amtDose}} is a reserved keyword for the last administered dose.<br /> <br /> The model's parameters are the absorption rate constant $ka$, the volume of distribution $V$, the clearance $Cl$, the baseline hazard $h_0$ and the coefficient $\gamma$.
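As a quick sanity check, this closed form can be compared with a direct numerical integration of the underlying ODEs $\dot{A}_d = -ka\,A_d$, $\dot{A}_c = ka\,A_d - ke\,A_c$. The Python sketch below does this with a simple Euler scheme and illustrative parameter values (not estimates from the data, and independent of the $\monolix$ workflow):

```python
import math

ka, V, Cl, D = 1.0, 7.0, 2.0, 50.0   # illustrative values, not estimates
ke = Cl / V

def cc_closed(t):
    """Closed-form concentration for a single dose D at t = 0."""
    return D * ka / (V * (ka - ke)) * (math.exp(-ke * t) - math.exp(-ka * t))

# Euler integration of the underlying ODE system
# ddt_Ad = -ka*Ad, ddt_Ac = ka*Ad - ke*Ac, with Ad(0) = D, Ac(0) = 0.
dt, t, Ad, Ac = 1e-4, 0.0, D, 0.0
while t < 10.0 - 1e-12:
    Ad, Ac = Ad + dt * (-ka * Ad), Ac + dt * (ka * Ad - ke * Ac)
    t += dt

gap = abs(Ac / V - cc_closed(10.0))
print(gap)  # only the Euler discretization error remains
```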
The statistical model for the individual parameters can be defined in the $\monolix$ project file (left) and/or the $\monolix$ GUI (right):<br /> <br /> <br /> {{ExampleWithCode&amp;Image<br /> |title=<br /> |text=<br /> |code={{MLXTranForTable<br /> |name=<br /> |text=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt; <br /> INDIVIDUAL:<br /> ka = {distribution=logNormal, iiv=yes}<br /> V = {distribution=logNormal, iiv=yes}<br /> Cl = {distribution=normal, iiv=yes}<br /> h0 = {distribution=probitNormal, iiv=yes}<br /> gamma = {distribution=logitNormal, iiv=yes}<br /> &lt;/pre&gt; }}<br /> |image=<br /> [[File:Vsaem1.png]]<br /> }}<br /> <br /> <br /> Once the model is implemented, tasks such as maximum likelihood estimation can be performed using the SAEM algorithm. Certain settings in SAEM must be provided by the user. Even though SAEM is quite insensitive to the initial parameter values, it is possible to perform a preliminary sensitivity analysis in order to select &quot;good&quot; initial values.<br /> <br /> <br /> {{ImageWithCaption|image=Vsaem2.png|caption=Looking for good initial values for SAEM}}<br /> <br /> <br /> <br /> Then, when we run SAEM, it converges easily and quickly to the MLE:<br /> <br /> <br /> {{JustCode<br /> |code=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt;Estimation of the population parameters<br /> <br /> parameter<br /> ka : 0.974<br /> V : 7.07<br /> Cl : 2.00<br /> h0 : 0.0102<br /> gamma : 0.485<br /> <br /> omega_ka : 0.668<br /> omega_V : 0.365<br /> omega_Cl : 0.588<br /> omega_h0 : 0.105<br /> omega_gamma : 0.0901<br /> <br /> a_1 : 0.345<br /> &lt;/pre&gt; }}<br /> <br /> <br /> Parameter estimation can therefore be seen as estimating the reference values and variance of the random effects.<br /> <br /> In addition to these numbers, it is important to be able to graphically represent these distributions in order to see them and therefore understand them better.
Indeed, the interpretation of certain parameters is not always simple. Of course, we know what a normal distribution represents and in particular its mean, median and mode, which are equal (see the distribution of $Cl$ below for instance). These measures of central tendency can differ for asymmetric distributions such as the log-normal (see the distribution of $ka$).<br /> <br /> Interpreting dispersion terms like $\omega_{ka}$ and $\omega_{V}$ is not obvious either when the parameter distributions are not normal. In such cases, quartiles or quantiles of order 5% and 95% (for example) may be useful for quantitatively describing the variability of these parameters.<br /> <br /> <br /> {{Remarks <br /> |title=Remarks<br /> |text=<br /> For a parameter $\psi$ whose distribution is log-normal, we can approximate the coefficient of variation for $\psi$ by the standard deviation $\omega_{\psi}$ of the random effect $\eta$ if this is fairly small. Indeed, when $\omega_{\psi}$ is small,<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \psi &amp;=&amp; \psi_{\rm pop} e^{\eta} \\<br /> &amp;\approx &amp; \psi_{\rm pop}(1+ \eta) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Thus<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \esp{\psi} &amp;\approx&amp; \psi_{\rm pop} \\<br /> \std{\psi} &amp;\approx &amp; \psi_{\rm pop}\omega_{\psi},<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> and<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> {\rm cv}(\psi) &amp;=&amp; \frac{\std{\psi} }{\esp{\psi} } \\<br /> &amp;\approx &amp; \omega_{\psi} .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> Do not forget that this approximation is only valid when $\omega$ is small and in the case of log-normal distributions. It does not carry over to any other distribution.
Thus, when $\omega_{h0}=0.1$ for a probit-normal distribution or $\omega_{\gamma}=0.09$ for a logit-normal one, there is no immediate interpretation available. Only by looking at the graphical display of the pdf or by calculating some quantiles of interest can we begin to get an idea of the dispersion of the parameters $h0$ and $\gamma$.<br /> }}<br /> <br /> <br /> {{ImageWithCaption|image=saem3b.png|caption=Estimation of the population distributions of the individual parameters of the model }}<br /> <br /> <br /> <br /> &lt;br&gt;<br /> <br /> ==Bayesian estimation of the population parameters==<br /> <br /> The ''Bayesian approach'' considers $\theta$ as a random vector with a ''prior distribution'' $\qth$. We can then define the posterior distribution of $\theta$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \pcthy(\theta {{!}} \by ) &amp;=&amp; \displaystyle{ \frac{\pth( \theta )\pcyth(\by {{!}} \theta )}{\py(\by)} }\\<br /> &amp;=&amp; \displaystyle{ \frac{\pth( \theta ) \int \pypsith(\by,\bpsi {{!}}\theta) \, d \bpsi}{\py(\by)} }.<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> We can estimate this conditional distribution and derive any statistics (posterior mean, standard deviation, percentiles, etc.) or compute the so-called ''Maximum a Posteriori'' (MAP) estimate of $\theta$:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \hat{\theta}^{\rm MAP} &amp;=&amp; \argmax{\theta} \pcthy(\theta {{!}} \by ) \\<br /> &amp;=&amp; \argmax{\theta} \left\{ {\llike}(\theta ; \by) + \log( \pth( \theta ) ) \right\} .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The MAP estimate therefore maximizes a penalized version of the observed likelihood. In other words, maximum a posteriori estimation reduces to penalized maximum likelihood estimation. Suppose for instance that $\theta$ is a scalar parameter and the prior is a normal distribution with mean $\theta_0$ and variance $\gamma^2$.
Then, the MAP estimate is given by<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;<br /> \hat{\theta}^{\rm MAP} =\argmax{\theta} \left\{ {\llike} (\theta ; \by) - \displaystyle{ \frac{1}{2\gamma^2} }(\theta - \theta_0)^2 \right\} .<br /> &lt;/math&gt; }}<br /> <br /> The MAP estimate is a trade-off between the MLE, which maximizes ${\llike}(\theta ; \by)$, and $\theta_0$, which minimizes $(\theta - \theta_0)^2$. The weight given to the prior depends directly on the variance of the prior distribution: the smaller $\gamma^2$ is, the closer to $\theta_0$ the MAP estimate is. In the limiting case $\gamma^2=0$, the prior reduces to a point mass: $\theta$ is fixed at $\theta_0$ and no longer needs to be estimated.<br /> <br /> Both the Bayesian and frequentist approaches have their supporters and detractors. But rather than being dogmatic and blindly following the same rule book every time, we need to be pragmatic and ask the right methodological questions when confronted with a new problem.<br /> <br /> We have to remember that Bayesian methods have been extremely successful, in particular for numerical calculations. For instance, (Bayesian) MCMC methods allow us to estimate more or less any conditional distribution coming from any hierarchical model, whereas frequentist approaches such as maximum likelihood estimation can be much more difficult to implement.<br /> <br /> All things considered, the problem comes down to knowing whether the data contains sufficient information to answer a given question, and whether some other information may be available to help answer it. This is the essence of the art of modeling: finding the right compromise between the confidence we have in the data and prior knowledge of the problem. Each problem is different and requires a specific approach. 
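The trade-off between the MLE and the prior mean described above is easy to illustrate in a toy conjugate setting. The sketch below is purely illustrative (the normal data model, the numbers and the function name are assumptions, not part of the text): for $y_i \sim {\cal N}(\theta, \sigma^2)$ with prior $\theta \sim {\cal N}(\theta_0, \gamma^2)$, the penalized criterion has a closed-form maximizer, a precision-weighted average of the sample mean and $\theta_0$.

```python
# Toy illustration (hypothetical setting): MAP shrinkage for the mean of a
# normal sample under a normal prior. The MAP estimate is a precision-weighted
# average of the sample mean (the MLE) and the prior mean theta0.

def map_estimate(y, sigma2, theta0, gamma2):
    """MAP estimate of theta for y_i ~ N(theta, sigma2), theta ~ N(theta0, gamma2)."""
    n = len(y)
    if gamma2 == 0.0:          # degenerate prior: theta is fixed at theta0
        return theta0
    w_data = n / sigma2        # precision contributed by the data
    w_prior = 1.0 / gamma2     # precision contributed by the prior
    y_bar = sum(y) / n         # the MLE
    return (w_data * y_bar + w_prior * theta0) / (w_data + w_prior)

y = [1.2, 0.8, 1.1, 1.4, 0.9]                     # fabricated observations
for gamma2 in (0.0, 0.01, 0.1, 1.0, 1e6):
    est = map_estimate(y, sigma2=0.25, theta0=2.0, gamma2=gamma2)
    print(f"gamma^2 = {gamma2:8g}  ->  MAP = {est:.3f}")
```

As $\gamma^2$ grows, the MAP estimate moves away from the prior mean $\theta_0 = 2$ and toward the sample mean, mirroring the limiting cases discussed above.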
For instance, if all the patients in a pharmacokinetic trial have essentially the same weight, it is pointless to estimate a relationship between weight and the model's PK parameters using the trial data. In this case, the modeler would be better served by prior information based on physiological criteria than by a purely statistical model.<br /> <br /> So by all means, let us use the information available to us! But this information needs to be pertinent. Systematically using a prior for the parameters is not always meaningful. Can we reasonably suppose that we have access to such information? For continuous data, for example, what does putting a prior on the residual error model's parameters mean in reality? A reasoned statistical approach consists of including prior information only for certain parameters (those for which we have real prior information) and having confidence in the data for the others.<br /> <br /> $\monolix$ allows this hybrid approach which reconciles the Bayesian and frequentist approaches. A given parameter can be:<br /> <br /> <br /> &lt;ul&gt;<br /> * a fixed constant, if we have absolute confidence in its value or if the data does not allow it to be estimated, essentially because of identifiability constraints.<br /> &lt;br&gt;<br /> <br /> * estimated by maximum likelihood, either because we have great confidence in the data or because we have no information on the parameter.<br /> &lt;br&gt;<br /> <br /> * estimated by introducing a prior and calculating the MAP estimate.<br /> &lt;br&gt;<br /> <br /> * estimated by introducing a prior and then estimating the posterior distribution.<br /> &lt;/ul&gt;<br /> <br /> <br /> In what follows, we put aside the fixed components of $\theta$. 
Here are some possible situations:<br /> <br /> <br /> &lt;ol&gt;<br /> &lt;li&gt; ''Combined maximum likelihood and maximum a posteriori estimation'': decompose $\theta$ into $(\theta_E,\theta_{M})$, where $\theta_E$ are the components of $\theta$ to be estimated with MLE and $\theta_{M}$ those with a prior distribution whose posterior distribution is to be maximized. Then, $(\hat{\theta}_E , \hat{\theta}_{M} )$ below maximizes the penalized likelihood of $(\theta_E,\theta_{M})$: &lt;/li&gt;<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> (\hat{\theta}_E , \hat{\theta}_{M} ) &amp;=&amp; \argmax{\theta_E , \theta_{M} } \log(\py(\by , \theta_{M}; \theta_E)) \\<br /> &amp;=&amp; \argmax{\theta_E , \theta_{M} } \left\{ {\llike}(\theta_E , \theta_{M}; \by) + \log( \pth( \theta_M ) ) \right\} ,<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> where ${\llike} (\theta_E , \theta_{M}; \by) \ \ \eqdef \ \ \log\left(\py(\by | \theta_{M}; \theta_E)\right).$<br /> <br /> <br /> &lt;li&gt; ''Combined maximum likelihood and posterior distribution estimation'': here, decompose $\theta$ into $(\theta_E,\theta_{R})$, where $\theta_E$ are the components of $\theta$ to be estimated with MLE and $\theta_{R}$ those with a prior distribution whose posterior distribution is to be estimated. We propose the following strategy for estimating $\theta_E$ and $\theta_{R}$: &lt;/li&gt;<br /> <br /> <br /> &lt;ol style=&quot;list-style-type:lower-roman&quot;&gt;<br /> &lt;li&gt; Compute the maximum likelihood estimate of $\theta_E$: &lt;/li&gt;<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt;\begin{eqnarray}<br /> \hat{\theta}_E &amp;=&amp; \argmax{\theta_E} \log(\py(\by ; \theta_E)) \\<br /> &amp;=&amp; \argmax{\theta_E} \log \left( \int \pmacro(\by , \theta_R ; \theta_E ) \, d \theta_R \right).<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> <br /> &lt;li&gt; Estimate the conditional distribution $\pmacro(\theta_{R} | \by ;\hat{\theta}_E)$. 
&lt;/li&gt;<br /> &lt;/ol&gt;<br /> <br /> <br /> It is then straightforward to extend this approach to more complex situations where some components of $\theta$ are estimated with MLE, others using MAP estimation, and others still by estimating their conditional distributions.<br /> &lt;/ol&gt;<br /> <br /> <br /> {{Example1<br /> |title1=Example<br /> |title2=A PK example<br /> |text=<br /> In this example we use only the pharmacokinetic data and aim to estimate the population distributions of the PK parameters $ka$, $V$ and $Cl$. We assume log-normal distributions for these three parameters. All of the model's population parameters are estimated by maximum likelihood except $ka_{\rm pop}$, for which a log-normal distribution is used as a prior:<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \log(ka_{\rm pop}) \sim {\cal N}(\log(1.5), \gamma^2) . &lt;/math&gt; }}<br /> <br /> $\monolix$ allows us to compute the MAP estimate and to estimate the posterior distribution of $ka_{\rm pop}$ for various values of $\gamma$.<br /> <br /> <br /> &lt;div style=&quot;margin-left:17%; margin-right:17%; align:center&quot;&gt;<br /> {{{!}} class=&quot;wikitable&quot; align=&quot;center&quot; style=&quot;width:100%&quot;<br /> {{!}} $\gamma$ {{!}}{{!}} 0 {{!}}{{!}} 0.01 {{!}}{{!}} 0.025 {{!}}{{!}} 0.05 {{!}}{{!}} 0.1 {{!}}{{!}} 0.2 {{!}}{{!}} $+ \infty$ <br /> {{!}}-<br /> {{!}}$\hat{ka}_{\rm pop}^{\rm MAP}$ {{!}}{{!}} 1.5 {{!}}{{!}} 1.49 {{!}}{{!}} 1.47 {{!}}{{!}} 1.39 {{!}}{{!}} 1.22 {{!}}{{!}} 1.11 {{!}}{{!}} 1.05 <br /> {{!}}}&lt;/div&gt;<br /> <br /> {{ImageWithCaption|image=bayes1.png|caption=Prior and posterior distributions of $ka_{\rm pop}$ for different values of $\gamma$}}<br /> <br /> <br /> As expected, the posterior distribution converges to the prior distribution when its standard deviation $\gamma$ decreases. 
Also, the mode of the posterior distribution converges to the maximum likelihood estimate of $ka_{\rm pop}$ when $\gamma$ increases.<br /> }}<br /> <br /> <br /> &lt;br&gt;<br /> == Estimation of the Fisher information matrix ==<br /> <br /> The variance of the estimator $\thmle$, and thus confidence intervals, can be derived from the [[Estimation of the observed Fisher information matrix|observed Fisher information matrix (F.I.M.)]], which itself is calculated using the observed likelihood (i.e., the pdf of the observations $\by$):<br /> <br /> {{EquationWithRef<br /> |equation=&lt;div id=&quot;ofim_intro3&quot;&gt;&lt;math&gt;<br /> \ofim(\thmle ; \by) \ \ \eqdef \ \ - \displaystyle{ \frac{\partial^2}{\partial \theta^2} }\log({\like}(\thmle ; \by)) .<br /> &lt;/math&gt;&lt;/div&gt;<br /> |reference=(1) }}<br /> <br /> Then, the variance-covariance matrix of the maximum likelihood estimator $\thmle$ can be estimated by the inverse of the observed F.I.M. The standard error (s.e.) of each component of $\thmle$ is its standard deviation, i.e., the square root of the corresponding diagonal element of this covariance matrix. $\monolix$ also displays the (estimated) relative standard errors (r.s.e.), i.e., the (estimated) standard error divided by the value of the estimated parameter.<br /> <br /> <br /> {{JustCode<br /> |code=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt;Estimation of the population parameters<br /> <br /> parameter s.e. (s.a.) r.s.e.(%)<br /> ka : 0.974 0.082 8<br /> V : 7.07 0.35 5<br /> Cl : 2 0.07 4<br /> h0 : 0.0102 0.0014 14<br /> gamma : 0.485 0.015 3<br /> <br /> omega_ka : 0.668 0.064 10<br /> omega_V : 0.365 0.037 10<br /> omega_Cl : 0.588 0.055 9<br /> omega_h0 : 0.105 0.032 30<br /> omega_gamma : 0.0901 0.044 49<br /> <br /> a_1 : 0.345 0.012 3<br /> &lt;/pre&gt; }}<br /> <br /> The F.I.M. can be used for detecting overparametrization of the structural model. 
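The computation of standard errors from the observed F.I.M. can be sketched as follows. This is a toy 2×2 example with made-up numbers (it is not Monolix's implementation, and the function name is an assumption): invert the F.I.M., take the square roots of the diagonal of the resulting covariance matrix, and divide by the estimates to get r.s.e. values.

```python
import math

# Toy illustration: standard errors (s.e.) and relative standard errors
# (r.s.e.) from a 2x2 observed Fisher information matrix with made-up values.

def se_and_rse(fim, estimates):
    """s.e. and r.s.e.(%) from a 2x2 observed FIM and the parameter estimates."""
    (a, b), (c, d) = fim
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("FIM is singular: model is likely overparametrized")
    # inverse of a 2x2 matrix = estimated variance-covariance matrix
    cov = [[d / det, -b / det], [-c / det, a / det]]
    se = [math.sqrt(cov[i][i]) for i in range(2)]
    rse = [100.0 * se[i] / estimates[i] for i in range(2)]
    return se, rse

fim = [[400.0, 30.0], [30.0, 25.0]]          # well-conditioned toy FIM
se, rse = se_and_rse(fim, estimates=[2.0, 7.0])
print("s.e.     :", [round(s, 4) for s in se])
print("r.s.e.(%):", [round(r, 1) for r in rse])
```

The singularity check mirrors the use of the F.I.M. for detecting overparametrization: a nearly singular (ill-conditioned) F.I.M. produces huge variances, and hence huge r.s.e. values, for the poorly identified parameters.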
Indeed, if the model is poorly identifiable, certain estimators will be strongly correlated, and the F.I.M. will therefore be ill-conditioned and difficult to invert. Suppose for example that we want to fit a two-compartment PK model to the same data as before. The output is shown below. The large relative standard errors for the inter-compartmental clearance $Q$ and the volume of the peripheral compartment $V_2$ mean that the data does not allow these two parameters to be estimated well.<br /> <br /> <br /> {{JustCode<br /> |code=&lt;pre style=&quot;background-color:#EFEFEF; border:none;&quot;&gt;Estimation of the population parameters<br /> <br /> parameter s.e. (lin) r.s.e.(%)<br /> ka : 0.246 0.0081 3<br /> Cl : 1.9 0.075 4<br /> V1 : 1.71 0.14 8<br /> Q : 0.000171 0.024 1.43e+04<br /> V2 : 0.00673 3.1 4.62e+04<br /> <br /> omega_ka : 0.171 0.026 15<br /> omega_Cl : 0.293 0.026 9<br /> omega_V1 : 0.621 0.062 10<br /> omega_Q : 5.72 1.4e+03 2.41e+04<br /> omega_V2 : 4.61 1.8e+04 3.94e+05<br /> <br /> a : 0.136 0.0073 5<br /> &lt;/pre&gt; }}<br /> <br /> <br /> The Fisher information matrix is also widely used in optimal experimental design: minimizing the variance of the estimator corresponds to maximizing the information. Estimators and designs can then be evaluated by looking at summary statistics of the covariance matrix (the determinant or the trace, for instance).<br /> <br /> &lt;br&gt;<br /> == Estimation of the individual parameters ==<br /> <br /> Once $\theta$ has been estimated, the conditional distribution $\pmacro(\psi_i | y_i ; \hat{\theta})$ of the individual parameters $\psi_i$ can be estimated for each individual $i$ using the [[The Metropolis-Hastings algorithm for simulating the individual parameters| Metropolis-Hastings algorithm]]. 
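A generic random-walk Metropolis-Hastings sampler can be sketched as follows. This is a minimal illustration targeting an arbitrary unnormalized density, not the actual sampler used for $\pmacro(\psi_i | y_i ; \hat{\theta})$; the target, step size, and function names are all assumptions.

```python
import math
import random

# Minimal random-walk Metropolis-Hastings sketch (illustrative only).
# The chain's stationary distribution is proportional to exp(log_density).
# Here the target is an unnormalized standard normal, so the chain's mean
# should settle near 0 and its variance near 1.

def metropolis_hastings(log_density, x0, n_iter, step=1.0, seed=0):
    rng = random.Random(seed)
    x = x0
    logp = log_density(x)
    chain = []
    for _ in range(n_iter):
        x_new = x + rng.gauss(0.0, step)        # symmetric random-walk proposal
        logp_new = log_density(x_new)
        # accept with probability min(1, p(x_new)/p(x))
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = x_new, logp_new
        chain.append(x)
    return chain

log_target = lambda x: -0.5 * x * x             # unnormalized N(0, 1)
chain = metropolis_hastings(log_target, x0=5.0, n_iter=20000)
burned = chain[2000:]                           # discard burn-in
mean = sum(burned) / len(burned)
var = sum((v - mean) ** 2 for v in burned) / len(burned)
print(f"mean ~ {mean:.2f}, variance ~ {var:.2f}")
```

Summary statistics of the retained draws (mean, standard deviation, quantiles) then serve as estimates of the corresponding statistics of the target distribution.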
For each $i$, this algorithm generates a sequence $(\psi_i^{k}, k \geq 1)$ which converges in distribution to the conditional distribution $\pmacro(\psi_i | y_i ; \hat{\theta})$, and which can be used to estimate any summary statistic of this distribution (mean, standard deviation, quantiles, etc.).<br /> <br /> The mode of this conditional distribution can be estimated using this sequence, or by maximizing $\pmacro(\psi_i | y_i ; \hat{\theta})$ with numerical methods.<br /> <br /> The choice between the conditional mean and the conditional mode is arbitrary. By default, $\monolix$ uses the conditional mode, taking the philosophy that the &quot;most likely&quot; values of the individual parameters are the most suited for computing the &quot;most likely&quot; predictions.<br /> <br /> <br /> {{ImageWithCaption|image=mode1.png|caption=Predicted concentrations for 6 individuals using the estimated conditional modes of the individual PK parameters}} <br /> <br /> &lt;br&gt;<br /> <br /> == Estimation of the observed log-likelihood ==<br /> <br /> <br /> Once $\theta$ has been estimated, the observed log-likelihood of $\hat{\theta}$ is defined as<br /> <br /> {{Equation1<br /> |equation=&lt;math&gt; \begin{eqnarray}<br /> {\llike} (\hat{\theta};\by) &amp;=&amp; \log({\like}(\hat{\theta};\by)) \\<br /> &amp;\eqdef&amp; \log(\py(\by;\hat{\theta})) .<br /> \end{eqnarray}&lt;/math&gt; }}<br /> <br /> The observed log-likelihood cannot be computed in closed form for nonlinear mixed effects models, but it can be estimated using the methods described in the [[Estimation of the log-likelihood]] Section. The estimated log-likelihood can then be used for performing likelihood ratio tests and for computing information criteria such as AIC and BIC (see the [[Evaluation]] Section).<br /> <br /> <br /> &lt;br&gt;<br /> == Bibliography ==<br /> <br /> <br /> &lt;bibtex&gt;<br /> @article{beal1992nonmem,<br /> title={NONMEM users guides},<br /> author={Beal, S.L. and Sheiner, L.B. 
and Boeckmann, A. and Bauer, R.J.},<br /> journal={San Francisco, NONMEM Project Group, University of California},<br /> year={1992}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{comets2011package,<br /> title={saemix: Stochastic Approximation Expectation Maximization (SAEM) algorithm. R package version 0.96.1},<br /> author={Comets, E. and Lavenu, A. and Lavielle, M.},<br /> url = {http://cran.r-project.org/web/packages/saemix/index.html},<br /> year={2013}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @ARTICLE{Monolix,<br /> author = {Lixoft},<br /> title = {Monolix 4.2},<br /> year = {2012},<br /> url = {http://www.lixoft.eu/products/monolix/product-monolix-overview},<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @misc{pinheiro1995lme,<br /> title={lme and nlme: Mixed-effects Methods and Classes for S and S-plus},<br /> author={Pinheiro, J. C. and Bates, D. M.},<br /> year={1995}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{pinheiro2010r,<br /> title={nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-96},<br /> author={Pinheiro, J. and Bates, D. and DebRoy, S. and Sarkar, D. and the R Core team},<br /> journal={R Foundation for Statistical Computing, Vienna},<br /> year={2010}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @inbook{docSAS,<br /> title = {The NLMIXED procedure, SAS/STAT 9.2 User's Guide},<br /> chapter = {61},<br /> pages = {4337--4435},<br /> author = {SAS},<br /> year = {2008}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{spiegelhalter2003winbugs,<br /> title={WinBUGS user manual},<br /> author={Spiegelhalter, D. and Thomas, A. and Best, N. and Lunn, D.},<br /> journal={Cambridge: MRC Biostatistics Unit},<br /> year={2003}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @Manual{docSPSS,<br /> title = {Linear mixed-effects modeling in SPSS. 
An introduction to the MIXED procedure},<br /> author = {SPSS},<br /> year = {2002},<br /> note={Technical Report}<br /> }<br /> &lt;/bibtex&gt;<br /> &lt;bibtex&gt;<br /> @article{nlmefitsa,<br /> title={nlmefitsa: fit nonlinear mixed-effects model with stochastic EM algorithm. Matlab R2013a function},<br /> author={The MathWorks},<br /> url = {http://www.mathworks.fr/fr/help/stats/nlmefitsa.html},<br /> year={2013}<br /> }<br /> &lt;/bibtex&gt;<br /> <br /> <br /> {{Back&amp;Next<br /> |linkBack=Visualization<br /> |linkNext=Evaluation }}</div> Marc