# Likelihood of observed biomarker data

## Definitions

The model assumes that a disease progresses according to a set of events \(E_1, \ldots, E_N\), where \(N\) is the number of biomarkers. Our goal is to estimate an ordering \(S \in \sigma(N)\) over the events, i.e., a permutation of size \(N\). The value of biomarker \(n\) for patient \(j\) is a real number \(X_{nj} \in \mathbb{R}\). Each person \(j\) has a corresponding Bernoulli random variable \(z_j \in \{0,1\}\), which denotes whether they have the disease or not (note that this is generally observed for these models, but I am including it for completeness). For person \(j\) (assumed to have the disease to simplify notation), \(k_j \in \{0, 1, \ldots, N\}\) denotes their current disease stage. Let \(\theta_n\) denote the parameters of the distribution of biomarker \(n\) when it is affected by the disease, and let \(\phi_n\) denote the corresponding parameters when it is healthy.

## Known \(k_j\)

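Let’s first deal with this equation:

\[
p(X_j \mid S, z_j = 1, k_j) = \prod_{i=1}^{k_j} p\left(X_{S_{(i)}j} \mid \theta_{S_{(i)}}\right) \prod_{i=k_j+1}^{N} p\left(X_{S_{(i)}j} \mid \phi_{S_{(i)}}\right)
\]

This equation computes the likelihood of the observed biomarker data of a specific participant, given that we know the disease stage this patient is at (\(k_j\)). The first \(k_j\) biomarkers in the ordering \(S\) are evaluated under their affected distributions (\(\theta\)), and the remaining ones under their healthy distributions (\(\phi\)).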

\(S\) is an **ordered array** of biomarkers that are affected by the disease, for example, \([b, a, d, c]\). This means that biomarker \(b\) is affected at stage 1. At stage 2, biomarkers \(b\) and \(a\) are both affected.

\(n\) indicates one biomarker.

\(k_j\) indicates the stage the patient is at, for example, \(k_j = 2\). This means that the disease has affected biomarkers \(b\) and \(a\). Biomarkers \(d\) and \(c\) have not been affected yet.

\(\theta_n\) denotes the parameters of the probability density function (PDF) of the observed value of biomarker \(n\) when this biomarker has been affected by the disease. Let’s assume this distribution is Gaussian, with means of \([45, 50, 55, 60]\) for biomarkers \(b\), \(a\), \(d\), and \(c\) respectively, and a standard deviation of \(5\) for each.

\(\phi_n\) denotes the parameters of the probability density function (PDF) of the observed value of biomarker \(n\) when this biomarker has **NOT** been affected by the disease. Let’s assume this distribution is Gaussian, with means of \([25, 30, 35, 40]\) for biomarkers \(b\), \(a\), \(d\), and \(c\) respectively, and a standard deviation of \(3\) for each.

\(X_j\) is an array representing the patient’s observed data for all biomarkers. Assume the data is \([77, 45, 53, 90]\) for biomarkers \(b\), \(a\), \(d\), and \(c\).

We assume that the patient is at stage \(2\) of this disease; hence \(k_j = 2\).

Next, we are going to calculate \(p(X_j|S, z_j = 1, k_j)\):

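When \(i = 1\), we have \(S_{(i)} = n = b\) and, dropping the patient index \(j\) for brevity, \(X_{S_{(i)}} = X_b = 77\). So

\[
p(X_b \mid \theta_b) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(X_b - \mu)^2}{2\sigma^2}\right)
\]

Because \(k_j = 2\), biomarkers \(b\) and \(a\) are affected. We should therefore use the distribution with parameters \(\theta_b\), that is, plug \(\mu = 45\) and \(\sigma = 5\) into the above equation.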

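We can do the same for \(i = 2, 3,\) and \(4\), using \(\theta_a\) for biomarker \(a\) (affected) and \(\phi_d\) and \(\phi_c\) for biomarkers \(d\) and \(c\) (not yet affected).

So

\[
p(X_j \mid S, z_j = 1, k_j = 2) = p(X_b \mid \theta_b) \, p(X_a \mid \theta_a) \, p(X_d \mid \phi_d) \, p(X_c \mid \phi_c)
\]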

The above is **the likelihood of the given biomarker data when \(k_j = 2\)**.

Note that \(p(X_b | \theta_b)\) is a probability density, i.e., the value of a probability density function at a specific point, so it is not a probability itself.

Multiplying multiple probability densities will give us a likelihood.
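The calculation above can be sketched in a few lines of Python. This is only a minimal illustration using the example values from this section; the helper names `gaussian_pdf` and `likelihood_known_stage` are made up for this sketch and do not come from any particular library.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Example values from the text (names are illustrative assumptions).
S = ["b", "a", "d", "c"]  # ordering of biomarkers
theta = {"b": (45, 5), "a": (50, 5), "d": (55, 5), "c": (60, 5)}  # affected: (mean, sd)
phi = {"b": (25, 3), "a": (30, 3), "d": (35, 3), "c": (40, 3)}    # not affected: (mean, sd)
X_j = {"b": 77, "a": 45, "d": 53, "c": 90}                        # observed data


def likelihood_known_stage(X, S, k, theta, phi):
    """p(X_j | S, z_j = 1, k_j = k) as a product of per-biomarker densities."""
    likelihood = 1.0
    for i, n in enumerate(S, start=1):
        # The first k biomarkers in S are affected; the rest are still healthy.
        mean, sd = theta[n] if i <= k else phi[n]
        likelihood *= gaussian_pdf(X[n], mean, sd)
    return likelihood


L_2 = likelihood_known_stage(X_j, S, 2, theta, phi)
print(L_2)
```

The printed value is very small because it is a product of four densities, several of which are evaluated far from their means. For real data with many biomarkers, summing log-densities instead of multiplying densities is a common way to avoid numerical underflow.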

## Unknown \(k_j\)

Suppose we have the same information as above, except that we do not know which disease stage the patient is at, i.e., we do not know \(k_j\). We have the observed biomarker data: \(X_j = [77, 45, 53, 90]\). The question is: what is the likelihood of seeing this specific observed data?

We assume that all five stages (including \(k_j = 0\)) are equally likely.

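We do not know \(k_j\), so the best option is to calculate the “average” likelihood of the biomarker data across all possible stages, weighting each stage by its prior probability:

\[
P(X_j \mid z_j = 1, S) = \sum_{k_j = 0}^{N} P(k_j) \, p(X_j \mid S, z_j = 1, k_j)
\]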

Based on the equation in the first section, we can calculate the following:

\(L_1 = p(X_j | S, z_j = 1, k_j = 1)\)

\(L_2 = p(X_j | S, z_j = 1, k_j = 2)\)

\(L_3 = p(X_j | S, z_j = 1, k_j = 3)\)

\(L_4 = p(X_j | S, z_j = 1, k_j = 4)\)

Also note that we need to consider \(L_0 = p(X_j | S, z_j = 1, k_j = 0)\), the case where no biomarker has been affected yet, because in the equation above, \(k_j\) starts from \(0\).

\(P(k_j)\) is the prior probability of being at stage \(k_j\). If we have a uniform prior on \(k_j\), then:

\(P(X_{j} | z_j=1, S) = \frac{1}{5} \left(L_0 + L_1 + L_2 + L_3 + L_4 \right)\)
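As a rough sketch, the marginal likelihood under a uniform prior can be computed by evaluating \(L_0, \ldots, L_4\) and averaging them. The code below reuses the example values from this page; the names `stage_likelihood`, `prior`, and `marginal` are illustrative choices, not part of any established API.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Example values, listed in the order S = [b, a, d, c].
X_j = [77, 45, 53, 90]
theta_means, theta_sd = [45, 50, 55, 60], 5  # affected distributions
phi_means, phi_sd = [25, 30, 35, 40], 3      # healthy distributions
N = len(X_j)


def stage_likelihood(k):
    """L_k = p(X_j | S, z_j = 1, k_j = k)."""
    # Biomarkers at positions 1..k are affected, positions k+1..N are healthy.
    affected = [gaussian_pdf(x, m, theta_sd) for x, m in zip(X_j[:k], theta_means[:k])]
    healthy = [gaussian_pdf(x, m, phi_sd) for x, m in zip(X_j[k:], phi_means[k:])]
    return math.prod(affected + healthy)


# L_0, L_1, ..., L_N, one entry per possible stage.
stage_likelihoods = [stage_likelihood(k) for k in range(N + 1)]

# Uniform prior over the N + 1 stages (an assumption stated in the text).
prior = [1 / (N + 1)] * (N + 1)

# P(X_j | z_j = 1, S) = sum over k of P(k) * L_k
marginal = sum(p * L for p, L in zip(prior, stage_likelihoods))
print(marginal)
```

Swapping in a non-uniform prior only requires changing the `prior` list, as long as its entries sum to 1.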