Likelihood of observed biomarker data#

Definitions#

The model assumes that a disease progresses according to a set of events \(E_1, \ldots, E_N\), where \(N\) is the number of biomarkers. Our goal is to estimate an ordering \(S \in \sigma(N)\) over the events, which is a permutation of size \(N\). The value of biomarker marker \(n\) for patient \(j\) is \(X_{nj} \in \mathbb{R}\), which is a real-valued number. Each person \(j\) has a corresponding Bernoulli random variable \(d_j \in \{0,1\}\), which denotes whether they have the disease or not (Note this is generally observed for these models, but I am including it for completeness). For person \(j\) (assumed to have the disease to simplify notation), \(k_n \in \{0, 1, \ldots, N\}\) denotes their current disease stage. Let \(\theta_n\) denote the parameters for the distribution of biomarker \(n\) when it is diseased and \(\phi_n\) be the corresponding parameters for when it is healthy.

Known \(k_j\)#

Let’s first deal with this equation:

\[p(X_{j} | S , z_j = 1, k_j) = \prod_{i=1}^{k_j}{p(X_{S(i)j} \mid \theta_n )} \prod_{i=k_j+1}^N{p(X_{S(i)j} \mid \phi_n)}\]

This equation compuates the likelihood of the observed biomarker data of a specific participant, given that we know the disease stage this patient is at (\(k_j\)).

  • \(S\) is an orded array of biomarkers that are affected by the disease, for example, \([b, a, d, c]\). This means that at biomarker \(b\) is affected in stage 1. At stage 2, biomarker \(b\) and \(a\) will be affected.

  • \(n\) indicates one biomarker.

  • \(k_j\) indicates the stage the patient is at, for example, \(k_j = 2\). This means that the disease has effected biomarker \(a\) and \(b\). Biomarker \(c\) and \(d\) have not been affected yet.

  • \(\theta_n\) is the parameters for the probability density function (PDF) of observed value of biomarker \(n\) when this biomarker has been affected by the disease. Let’s assume this distribution is a Gaussian distribution with means of \([45, 50, 55, 60]\) and a standard deviation of \(5\) for biomarker \(b\), \(a\), \(d\), and \(c\).

  • \(\phi_n\) is the parameters for the probability density function (PDF) of observed value of biomarker \(n\) when this biomarker has NOT been affected by the disease. Let’s assume this distribution is a Gaussian distribution with means of \([25, 30, 35, 40]\) and a standard deviation of \(3\) for biomarker \(b\), \(a\), \(d\), and \(c\).

  • \(X_j\) is an array representing the patient’s observed data for all biomarker. Assume the data is \([77, 45, 53, 90]\) for biomarker \(b\), \(a\), \(d\), and \(c\).

We assume that the patient is at stage \(2\) of this disease; hence \(k_j = 2\).

Next, we are going to calculate \(p(X_j|S, z_j = 1, k_j)\):

When \(i = 1\), we have \(S_{(i)} = n = b\) and \(X_{S_{(i)}} = X_b = 45\). So

\[p(X_{S_{(i)}} | \theta_n) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{X_b - \mu}{\sigma} \right)^2}\]

Because \(k_j = 2\), so biomarker \(b\) and \(a\) are affected. We should use the distribution of \(\theta_b\); therefore, we should plug in \(\mu = 45, \sigma = 5\) in the above equation.

We can do the same for \(i\) = 2, 3, and 4.

So

\[p(X_j | S, k_j = 2) = p (X_b | \theta_b) \times p (X_a | \theta_a) \times p (X_d | \phi_d) \times p (X_c | \phi_c)\]

The above is the likelihood of the given biomarker data when \(k_j = 2\).

Note that \(p (X_b | \theta_b)\) is probability density, a value of a probability density function at a specific point; so it is not a probability itself.

Multiplying multiple probability densities will give us a likelihood.

Unknown \(k_j\)#

\[P(X_{j} | z_j=1, S) = \sum_{k_j=0}^N{P(k_j) p(X_{j} \mid S, k_j)}\]

Suppose we have the same information above, except that we do not know at which disease stage the patient is, i.e., we do not know \(k_j\). We have the observed biomarker data: \(X_j = [77, 45, 53, 90]\). And I wonder: what is the likelihood of seeing this specific ovserved data?

We assume that all five stages (including \(k_j = 0\)) are equally likely.

We do not know \(k_j\), so the best option is to calculate the “average” likelihood of all the biomarker data.

Based on the equation in the first section, we can calculate the following:

\(L_1 = p(X_j | S, k_j = 1)\)

\(L_2 = p(X_j | S, k_j = 2)\)

\(L_3 = p(X_j | S, k_j = 3)\)

\(L_4 = p(X_j | S, k_j = 4)\)

Also note that we need to consider \(L_0\) because in the equation above, \(k_j\) starts from \(0\).

\[L_0 = p(X_j | S, k_j = 0) = p (X_b | \phi_b) \times p (X_a | \phi_a) \times p (X_d | \phi_d) \times p (X_c | \phi_c)\]
\[L_1 = p(X_j | S, k_j = 1) = p (X_b | \theta_b) \times p (X_a | \phi_a) \times p (X_d | \phi_d) \times p (X_c | \phi_c)\]
\[L_2 = p(X_j | S, k_j = 2) = p (X_b | \theta_b) \times p (X_a | \theta_a) \times p (X_d | \phi_d) \times p (X_c | \phi_c)\]
\[L_3 = p(X_j | S, k_j = 3) = p (X_b | \theta_b) \times p (X_a | \theta_a) \times p (X_d | \theta_d) \times p (X_c | \phi_c)\]
\[L_4 = p(X_j | S, k_j = 4) = p (X_b | \theta_b) \times p (X_a | \theta_a) \times p (X_d | \theta_d) \times p (X_c | \theta_c)\]

\(P(k_j)\) is the prior likelihood of being at stage \(k\). If we have a uniform prior on \(k_j\), then:

\(P(X_{j} | z_j=1, S) = \frac{1}{5} \left(L_0 + L_1 + L_2 + L_3 + L_4 \right)\)