Likelihood of observed biomarker data
Definitions
The model assumes that a disease progresses according to a set of events \(E_1, \ldots, E_N\), where \(N\) is the number of biomarkers. Our goal is to estimate an ordering \(S \in \sigma(N)\) over the events, where \(\sigma(N)\) is the set of permutations of size \(N\). The value of biomarker \(n\) for patient \(j\) is \(X_{nj} \in \mathbb{R}\). Each patient \(j\) has a corresponding Bernoulli random variable \(z_j \in \{0,1\}\), which denotes whether they have the disease (this is generally observed for these models, but I am including it for completeness). For patient \(j\) (assumed to have the disease to simplify notation), \(k_j \in \{0, 1, \ldots, N\}\) denotes their current disease stage. Let \(\theta_n\) denote the parameters of the distribution of biomarker \(n\) when it has been affected by the disease, and let \(\phi_n\) denote the corresponding parameters when it is healthy.
Known \(k_j\)
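Let’s first deal with this equation:
\[ p(X_j \mid S, z_j = 1, k_j) = \prod_{i=1}^{k_j} p\left(X_{S_{(i)}j} \mid \theta_{S_{(i)}}\right) \prod_{i=k_j+1}^{N} p\left(X_{S_{(i)}j} \mid \phi_{S_{(i)}}\right) \]
This equation computes the likelihood of the observed biomarker data of a specific patient, given that we know the disease stage this patient is at (\(k_j\)). The first product runs over the biomarkers already affected at stage \(k_j\); the second runs over those not yet affected.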
\(S\) is an ordered array of the biomarkers affected by the disease, for example, \([b, a, d, c]\). This means that biomarker \(b\) is affected at stage 1; at stage 2, biomarkers \(b\) and \(a\) are both affected.
\(n\) indicates one biomarker.
\(k_j\) indicates the stage the patient is at, for example, \(k_j = 2\). This means that the disease has affected biomarkers \(b\) and \(a\). Biomarkers \(d\) and \(c\) have not been affected yet.
\(\theta_n\) denotes the parameters of the probability density function (PDF) of the observed value of biomarker \(n\) when this biomarker has been affected by the disease. Let’s assume this distribution is Gaussian, with means of \([45, 50, 55, 60]\) for biomarkers \(b\), \(a\), \(d\), and \(c\), respectively, and a standard deviation of \(5\) for each.
\(\phi_n\) denotes the parameters of the PDF of the observed value of biomarker \(n\) when this biomarker has NOT been affected by the disease. Let’s assume this distribution is Gaussian, with means of \([25, 30, 35, 40]\) for biomarkers \(b\), \(a\), \(d\), and \(c\), respectively, and a standard deviation of \(3\) for each.
\(X_j\) is an array representing the patient’s observed data for all biomarkers. Assume the data is \([77, 45, 53, 90]\) for biomarkers \(b\), \(a\), \(d\), and \(c\), respectively.
We assume that the patient is at stage \(2\) of this disease; hence \(k_j = 2\).
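The relationship between the ordering \(S\) and the stage \(k_j\) can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the package's implementation; the function name `affected_biomarkers` is hypothetical.

```python
# Disease ordering S: the biomarker at position i (1-based) becomes affected at stage i.
S = ["b", "a", "d", "c"]


def affected_biomarkers(S, k_j):
    """Split S into (affected, unaffected) biomarkers for a patient at stage k_j.

    Hypothetical helper: stage 0 means no biomarker is affected,
    stage N = len(S) means all of them are.
    """
    if not 0 <= k_j <= len(S):
        raise ValueError("k_j must be between 0 and N")
    return S[:k_j], S[k_j:]


affected, unaffected = affected_biomarkers(S, 2)
print(affected)    # ['b', 'a']
print(unaffected)  # ['d', 'c']
```

At stage \(k_j = 2\) the first two entries of \(S\) use the \(\theta\) parameters and the remaining two use the \(\phi\) parameters.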
Next, we are going to calculate \(p(X_j|S, z_j = 1, k_j)\):
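When \(i = 1\), we have \(S_{(i)} = n = b\) and \(X_{S_{(i)}} = X_b = 77\) (the first entry of \(X_j\)). So
\[ p(X_b \mid \theta_b) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(X_b - \mu)^2}{2\sigma^2}\right) \]
Because \(k_j = 2\), biomarkers \(b\) and \(a\) are affected. We should therefore use the distribution with parameters \(\theta_b\), that is, plug \(\mu = 45, \sigma = 5\) into the above equation.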
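As a rough sketch of this single-biomarker step, the Gaussian density can be evaluated with the standard library alone. The helper name `gaussian_pdf` is an assumption made for illustration, and the observed value for \(b\) is taken from the \(X_j\) array defined earlier (77 for \(b\), 45 for \(a\)).

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Value of the Gaussian density with mean mu and std sigma at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# i = 1: biomarker b is affected at stage 2, so use theta_b = (mu=45, sigma=5).
p_b = gaussian_pdf(77, mu=45, sigma=5)

# i = 2: biomarker a is also affected, so use theta_a = (mu=50, sigma=5).
p_a = gaussian_pdf(45, mu=50, sigma=5)

print(p_b, p_a)
```

Both results are density values, so they can be smaller or larger than any particular probability and should only be compared with other densities.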
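We can do the same for \(i\) = 2, 3, and 4. Biomarker \(a\) (\(i = 2\)) is also affected, so it uses \(\theta_a\); biomarkers \(d\) and \(c\) (\(i = 3, 4\)) are not yet affected, so they use \(\phi_d\) and \(\phi_c\).
So
\[ p(X_j \mid S, z_j = 1, k_j = 2) = p(X_b \mid \theta_b) \, p(X_a \mid \theta_a) \, p(X_d \mid \phi_d) \, p(X_c \mid \phi_c) \]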
The above is the likelihood of the given biomarker data when \(k_j = 2\).
Note that \(p(X_b | \theta_b)\) is a probability density, that is, the value of a probability density function at a specific point; it is not a probability itself.
Multiplying these probability densities together gives us a likelihood.
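The whole calculation for a known stage can be sketched as follows, using the example values from this section. The dictionary layout and the function name `likelihood_known_stage` are assumptions made for illustration, not the package's API.

```python
import math

# Example values from this section, keyed by biomarker name.
S = ["b", "a", "d", "c"]                                          # disease ordering
theta = {"b": (45, 5), "a": (50, 5), "d": (55, 5), "c": (60, 5)}  # affected (mean, std)
phi = {"b": (25, 3), "a": (30, 3), "d": (35, 3), "c": (40, 3)}    # healthy (mean, std)
X_j = {"b": 77, "a": 45, "d": 53, "c": 90}                        # observed data


def gaussian_pdf(x, mu, sigma):
    """Value of the Gaussian density with mean mu and std sigma at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def likelihood_known_stage(X_j, S, k_j, theta, phi):
    """Likelihood of X_j given the ordering S and a known stage k_j."""
    likelihood = 1.0
    for i, n in enumerate(S, start=1):
        # Positions 1..k_j are affected (theta); the rest are still healthy (phi).
        mu, sigma = theta[n] if i <= k_j else phi[n]
        likelihood *= gaussian_pdf(X_j[n], mu, sigma)
    return likelihood


L_2 = likelihood_known_stage(X_j, S, 2, theta, phi)
print(L_2)
```

With many biomarkers a product of densities like this can underflow, so summing log-densities is a common alternative; the plain product is kept here to mirror the equation.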
Unknown \(k_j\)
Suppose we have the same information as above, except that we do not know which disease stage the patient is at, i.e., we do not know \(k_j\). We have the observed biomarker data: \(X_j = [77, 45, 53, 90]\). And I wonder: what is the likelihood of seeing this specific observed data?
We assume that all five stages (including \(k_j = 0\)) are equally likely.
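We do not know \(k_j\), so the best option is to calculate the “average” likelihood of the biomarker data over all possible stages, weighting each stage by its prior probability:
\[ P(X_j \mid z_j = 1, S) = \sum_{k_j = 0}^{N} P(k_j) \, p(X_j \mid S, z_j = 1, k_j) \]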
Based on the equation in the first section, we can calculate the following:
\(L_1 = p(X_j | S, k_j = 1)\)
\(L_2 = p(X_j | S, k_j = 2)\)
\(L_3 = p(X_j | S, k_j = 3)\)
\(L_4 = p(X_j | S, k_j = 4)\)
Also note that we need to consider \(L_0 = p(X_j | S, k_j = 0)\) because in the equation above, \(k_j\) starts from \(0\).
\(P(k_j)\) is the prior probability of being at stage \(k_j\). If we have a uniform prior on \(k_j\), then each of the five stages has prior probability \(\frac{1}{5}\), and:
\(P(X_{j} | z_j=1, S) = \frac{1}{5} \left(L_0 + L_1 + L_2 + L_3 + L_4 \right)\)
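The averaging over stages can be sketched like this, again with the example values from this page. The names `likelihood_known_stage` and `marginal_likelihood` and the optional `prior` argument are assumptions made for illustration.

```python
import math

S = ["b", "a", "d", "c"]                                          # disease ordering
theta = {"b": (45, 5), "a": (50, 5), "d": (55, 5), "c": (60, 5)}  # affected (mean, std)
phi = {"b": (25, 3), "a": (30, 3), "d": (35, 3), "c": (40, 3)}    # healthy (mean, std)
X_j = {"b": 77, "a": 45, "d": 53, "c": 90}                        # observed data


def gaussian_pdf(x, mu, sigma):
    """Value of the Gaussian density with mean mu and std sigma at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def likelihood_known_stage(X_j, S, k_j, theta, phi):
    """Likelihood of X_j given the ordering S and a known stage k_j."""
    likelihood = 1.0
    for i, n in enumerate(S, start=1):
        mu, sigma = theta[n] if i <= k_j else phi[n]
        likelihood *= gaussian_pdf(X_j[n], mu, sigma)
    return likelihood


def marginal_likelihood(X_j, S, theta, phi, prior=None):
    """Sum over stages k_j = 0..N of P(k_j) * p(X_j | S, k_j)."""
    n_stages = len(S) + 1  # stages 0, 1, ..., N
    if prior is None:
        # Uniform prior: every stage, including stage 0, is equally likely.
        prior = [1.0 / n_stages] * n_stages
    return sum(
        prior[k] * likelihood_known_stage(X_j, S, k, theta, phi)
        for k in range(n_stages)
    )


# L_0 .. L_4, one likelihood per candidate stage.
stage_likelihoods = [likelihood_known_stage(X_j, S, k, theta, phi) for k in range(5)]
print(stage_likelihoods)
print(marginal_likelihood(X_j, S, theta, phi))
```

Passing a non-uniform `prior` (a list of five probabilities that sum to 1) replaces the \(\frac{1}{5}\) weights with stage-specific ones.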