Estimating distribution parameters#

If I were to write a sampler for it, I’d first write one where I knew the correct ordering and the stage the participant’s disease was in and inferring the means of the Gaussians for both distributions based on that. We didn’t discuss priors on the Gaussians means and variances, but I would use conjugate ones in which case the update is exact.

Therefore, in this estimation, I assume that we know \(S\) (therefore \(S_n\)) and \(k_j\) and am estimating \(\theta\) and \(\phi\) based on participants’ data. We also know that for each biomarker, its value is drawn from a normal distribution, either from the healthy (\(\theta\)) or the diseased (\(\phi\)) distribution.

Conjugate priors#

Conjugacy occurs when the posterior distribution is in the same family of distribution as the prior distribution, but with new parameter values.

Why conjugacy is important? Because without it, one has to do the integration, which oftentimes is hard.

Three major conjugate families:

  • Beta-Binomial

  • Gamma-Poisson

  • Normal-Normal

In our example, we assume that the measurement data for each biomarker follows a normal distribution; however, we do not know the exact \(\mu\) and \(\sigma\). Our job is to estimate the two parameters for each biomarker based on the data we have.

According to An Introduction to Bayesian Thinking by Clyde et al. (2022), if the data comes from a normal distribution with unknown \(\mu\) and \(\sigma\), the conjugate prior for \(\mu\) has a normal distribution with mean \(m_0\) and variance \(\frac{\sigma^2}{n_0}\). The conjugate prior for \(\frac{1}{\sigma^2}\) has a Gamma distribution with shape \(\frac{v_0}{2}\) and rate \(\frac{v_0 s_0^{2}}{2}\) where

  • \(m_0\): prior estimate of \(\mu\).

  • \(n_0\): how strongly is the prior belief in \(m_0\) is held.

  • \(s_0^2\): prior estimate of \(\sigma^2\).

  • \(v_0\): prior degress of freedome, influencing the certainty of \(s_0^2\).

That is to say:

\[\mu | \sigma^2 \sim \mathcal{N}(m_0, \sigma^2/n_0)\]
\[1/\sigma^2 \sim Gamma\left(\frac{v_0}{2}, \frac{v_0 s_0^2}{2} \right)\]

Combined, we have:

\[(\mu, 1/\sigma^2) \sim NormalGamma(m_0, n_0, s_0^2, v_0)\]

The posterior also follows a Normal-Gamma distribution:

\[(\mu, 1/\sigma^2) | data \sim NormalGamma(m_n, n_n, s_n^2, v_n)\]

More specifically

\[1/\sigma^2 | data \sim Gamma(v_n/2, s_n^2 v_n/2)\]
\[\mu | data, \sigma^2 \sim \mathcal{N}(m_n, \sigma^2/n_n)\]

Based on the above two equations, we know that the mean of posterior mean is \(m_n\) and the mean of the posterior variance is \((s_n^2 v_n/2)/(v_n/2)\). This is beceause the expected value of \(Gamma(\alpha, \beta)\) is \(\frac{\alpha}{\beta}\).

where

  • \(m_n\): posterior mean, mode, and median for \(\mu\)

  • \(n_n\): posterior sample size

  • \(s_n^2\): posterior variance

  • \(v_n\): posterior degrees of freedome

The updating rules to get the new hyper-parameters:

\[m_n = \frac{n \bar{y} + n_0m_0}{n + n_0}\]
\[\frac{n}{n+n_0} \bar{y} + \frac{n_0}{n+n_0}m_0\]
\[n_n = n_0 + n\]
\[v_n = v_0 + n\]
\[s_n^2 = \frac{1}{v_n}\left[s^2(n-1) + s_0^2v_0 + \frac{n_0n}{n_n}(\bar{y}-m_0)^2\right]\]

where

  • \(n\): sample size

  • \(\bar{y}\): sample mean

  • \(s^2\): sample variance

Implementing estimation#

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import mode
import altair as alt
from sklearn.cluster import KMeans
def estimate_params_exact(m0, n0, s0_sq, v0, data):
    '''This is to estimate means and vars based on conjugate priors
    Inputs:
        - data: a vector of measurements 
        - m0: prior estimate of $\mu$.
        - n0: how strongly is the prior belief in $m_0$ is held.
        - s0_sq: prior estimate of $\sigma^2$.
        - v0: prior degress of freedome, influencing the certainty of $s_0^2$.
    
    Outputs:
        - mu estiate, std estimate
    '''
    # Data summary
    sample_mean = np.mean(data)
    sample_size = len(data)
    sample_var = np.var(data, ddof=1)  # ddof=1 for unbiased estimator

    # Update hyperparameters for the Normal-Inverse Gamma posterior
    updated_m0 = (n0 * m0 + sample_size * sample_mean) / (n0 + sample_size)
    updated_n0 = n0 + sample_size
    updated_v0 = v0 + sample_size 
    updated_s0_sq = (1 / updated_v0) * ((sample_size - 1) * sample_var + v0 * s0_sq + 
                    (n0 * sample_size / updated_n0) * (sample_mean - m0)**2)
    updated_alpha = updated_v0/2
    updated_beta = updated_v0*updated_s0_sq/2

    # Posterior estimates
    mu_posterior_mean = updated_m0
    sigma_squared_posterior_mean = updated_beta/updated_alpha

    mu_estimation = mu_posterior_mean
    std_estimation = np.sqrt(sigma_squared_posterior_mean)

    return mu_estimation, std_estimation
# truth data; used to validate the estimates
truth = pd.read_csv('data/means_stds.csv')
truth
biomarker theta_mean theta_std phi_mean phi_std
0 MMSE 28.0 0.666667 22.0 2.666667
1 ADAS -6.0 1.333333 -20.0 4.000000
2 AB 250.0 50.000000 150.0 16.666667
3 P-Tau -25.0 16.666667 -50.0 33.333333
4 HIP-FCI 5.0 1.666667 -5.0 6.666667
5 HIP-GMI 0.4 0.233333 0.3 0.333333
6 AVLT-Sum 40.0 15.000000 20.0 6.666667
7 PCC-FCI 12.0 4.000000 5.0 3.333333
8 FUS-GMI 0.6 0.066667 0.5 0.066667
9 FUS-FCI -10.0 3.333333 -20.0 6.000000
# observed data:
observed_data = pd.read_csv('data/participant_data.csv')
observed_data.head()
participant biomarker measurement k_j S_n affected_or_not
0 0 MMSE 28.812457 5 5 affected
1 1 MMSE 28.483480 8 5 affected
2 2 MMSE 27.483507 5 5 affected
3 3 MMSE 27.886646 9 5 affected
4 4 MMSE 27.412189 8 5 affected
biomarkers = np.array(truth.biomarker)
biomarkers
array(['MMSE', 'ADAS', 'AB', 'P-Tau', 'HIP-FCI', 'HIP-GMI', 'AVLT-Sum',
       'PCC-FCI', 'FUS-GMI', 'FUS-FCI'], dtype=object)
def get_estimated_means_stds_df(biomarkers, observed_data):
    '''This is to get the dataframe of estimated means and stds 
    Inputs:
        - biomarkers: biomarker names, a numpy array
        - observed_data: participants data 
    Outputs:
        - a pandas dataframe containing theta and phi means and stds
    '''
    # empty list of dictionaries to store the estimates 
    means_stds_estimate_dict_list = []
    
    for biomarker in biomarkers: 
        dic = {'biomarker': biomarker}  # Initialize dictionary outside the inner loop
        for i in ['affected', 'not_affected']:
            data_full = observed_data[(observed_data.biomarker == biomarker) & (
            observed_data.affected_or_not == i)]
            data = data_full.measurement
            mu_estimate, std_estimate = estimate_params_exact(
                m0 = 0, n0 = 1, s0_sq = 1, v0 = 1, data=data)
            if i == 'affected':
                dic['theta_mean'] = mu_estimate
                dic['theta_std'] = std_estimate
            else:
                dic['phi_mean'] = mu_estimate
                dic['phi_std'] = std_estimate
        print(f"biomarker {biomarker} done!")
        means_stds_estimate_dict_list.append(dic)
    estimate_means_stds_df = pd.DataFrame(means_stds_estimate_dict_list)
    return estimate_means_stds_df 
estimate_df = get_estimated_means_stds_df(biomarkers, observed_data)
estimate_df.to_csv("data/estimate_means_stds.csv", index = False)
estimate_df
biomarker MMSE done!
biomarker ADAS done!
biomarker AB done!
biomarker P-Tau done!
biomarker HIP-FCI done!
biomarker HIP-GMI done!
biomarker AVLT-Sum done!
biomarker PCC-FCI done!
biomarker FUS-GMI done!
biomarker FUS-FCI done!
biomarker theta_mean theta_std phi_mean phi_std
0 MMSE 27.617968 3.850069 22.199018 4.192632
1 ADAS -6.151077 1.533481 -19.439533 4.711673
2 AB 247.235481 52.713262 146.151967 29.119009
3 P-Tau -25.402886 17.776365 -51.237207 35.051918
4 HIP-FCI 4.917914 1.843385 -3.768449 6.875395
5 HIP-GMI 0.350846 0.314217 0.253432 0.341515
6 AVLT-Sum 36.621843 15.053672 19.293318 7.846781
7 PCC-FCI 11.660100 3.693987 4.778861 3.538650
8 FUS-GMI 0.573859 0.306157 0.485299 0.141073
9 FUS-FCI -8.350404 4.117264 -19.691945 6.087333
def obtain_theta_phi_params(biomarker, estimate_df, truth):
    '''This is to obtain both true and estimated theta and phi params for each biomarker '''
    biomarker_data_est = estimate_df[estimate_df.biomarker == biomarker].reset_index()
    biomarker_data = truth[truth.biomarker == biomarker].reset_index()
    # theta for affected
    theta_mean_est = biomarker_data_est.theta_mean[0]
    theta_std_est = biomarker_data_est.theta_std[0]

    theta_mean = biomarker_data.theta_mean[0]
    theta_std = biomarker_data.theta_std[0]

    # phi for not affected
    phi_mean_est = biomarker_data_est.phi_mean[0]
    phi_std_est = biomarker_data_est.phi_std[0]

    phi_mean = biomarker_data.phi_mean[0]
    phi_std = biomarker_data.phi_std[0]

    return theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est
charts = []
for n in biomarkers: 
    theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est = obtain_theta_phi_params(
    n, estimate_df, truth)
    mean1, std1 = theta_mean, theta_std
    mean2, std2 = theta_mean_est, theta_std_est

    # Generating points on the x axis
    x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_theta = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
    ).properties(
        title=f'{n}, Theta',
        width=200,
        height=200
        )

    mean1, std1 = phi_mean, phi_std
    mean2, std2 = phi_mean_est, phi_std_est

    # Generating points on the x axis
    x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_phi = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
    ).properties(
        title=f'{n}, Phi',
        width=200,
        height=200
        )
    
    # Concatenate theta and phi charts horizontally
    hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")

    # Append the concatenated chart to the list of charts
    charts.append(hconcat_chart)
# Concatenate all the charts vertically
final_chart = alt.vconcat(*charts).properties(title = "Comparing Theta and Phi Distributions Using Conjugate Priors")

# Display the final chart
final_chart.display()

Using K-means#

# We know whether this participant is healthy or not
observed_data['Diseased'] = observed_data.apply(lambda row: row.k_j > 0, axis = 1)
observed_data.head()
# select only known information
data = observed_data[['participant', 'biomarker', 'measurement', 'Diseased']]
data.head()
participant biomarker measurement Diseased
0 0 MMSE 28.812457 True
1 1 MMSE 28.483480 True
2 2 MMSE 27.483507 True
3 3 MMSE 27.886646 True
4 4 MMSE 27.412189 True
def get_theta_phi(data, biomarker, kmeans_setup):
    """To get theta and phi parametesr
    Input:
        - data: data we have right now, without access to S_n and kj
        - biomarker: a string of biomarker name
    Output:
        mean and std of theta and phi
    """
    # two empty clusters to strore measurements
    clusters = [[] for _ in range(2)]
    # dataframe for this biomarker
    biomarker_df = data[data['biomarker'] == biomarker].reset_index(drop=True)
    measurements = np.array(biomarker_df['measurement'])
    # reshape to satisfy sklearn requirements
    measurements = measurements.reshape(-1, 1)
    # dataframe for non-diseased participants
    healthy_df = biomarker_df[biomarker_df['Diseased'] == False]
    kmeans = kmeans_setup.fit(measurements)
    predictions = kmeans.predict(measurements)
    # to store measurements into their cluster
    for i, prediction in enumerate(predictions):
        clusters[prediction].append(measurements[i][0])
    # which cluster are healthy participants in
    healthy_predictions = kmeans.predict(measurements[healthy_df.index])
    # the mode of the above predictions will be the phi cluster index
    phi_cluster_idx = mode(healthy_predictions, keepdims=False).mode
    theta_cluster_idx = 1 - phi_cluster_idx
    theta_mean, theta_std = np.mean(clusters[theta_cluster_idx]), np.std(clusters[theta_cluster_idx])
    phi_mean, phi_std = np.mean(clusters[phi_cluster_idx]), np.std(clusters[phi_cluster_idx])
    return theta_mean, theta_std, phi_mean, phi_std
kmeans_setup = KMeans(n_clusters=2, random_state=0, n_init="auto")
# empty list of dictionaries to store the estimates 
means_stds_estimate_dict_list = []
for biomarker in biomarkers:
    dic = {'biomarker': biomarker}
    theta_mean, theta_std, phi_mean, phi_std = get_theta_phi(data, biomarker, kmeans_setup)
    dic['theta_mean'] = theta_mean
    dic['theta_std'] = theta_std
    dic['phi_mean'] = phi_mean
    dic['phi_std'] = phi_std
    means_stds_estimate_dict_list.append(dic)
estimate_df = pd.DataFrame(means_stds_estimate_dict_list)
estimate_df.to_csv("data/estimate_means_stds_kmeans.csv", index=False)
charts = []
for n in biomarkers: 
    theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est = obtain_theta_phi_params(
    n, estimate_df, truth)
    mean1, std1 = theta_mean, theta_std
    mean2, std2 = theta_mean_est, theta_std_est

    # Generating points on the x axis
    x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_theta = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
    ).properties(
        title=f'{n}, Theta',
        width=200,
        height=200
        )

    mean1, std1 = phi_mean, phi_std
    mean2, std2 = phi_mean_est, phi_std_est

    # Generating points on the x axis
    x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_phi = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
    ).properties(
        title=f'{n}, Phi',
        width=200,
        height=200
        )
    
    # Concatenate theta and phi charts horizontally
    hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")

    # Append the concatenated chart to the list of charts
    charts.append(hconcat_chart)
# Concatenate all the charts vertically
final_chart = alt.vconcat(*charts).properties(title = "Comparing Theta and Phi Distributions Using K-Means Clustering")

# Display the final chart
final_chart.display()