Estimating distribution parameters#
If I were to write a sampler for it, I’d first write one where I knew the correct ordering and the stage the participant’s disease was in and inferring the means of the Gaussians for both distributions based on that. We didn’t discuss priors on the Gaussians means and variances, but I would use conjugate ones in which case the update is exact.
Therefore, in this estimation, I assume that we know \(S\) (therefore \(S_n\)) and \(k_j\) and am estimating \(\theta\) and \(\phi\) based on participants’ data. We also know that for each biomarker, its value is drawn from a normal distribution, either from the healthy (\(\theta\)) or the diseased (\(\phi\)) distribution.
Conjugate priors#
Conjugacy occurs when the posterior distribution is in the same family of distribution as the prior distribution, but with new parameter values.
Why conjugacy is important? Because without it, one has to do the integration, which oftentimes is hard.
Three major conjugate families:
Beta-Binomial
Gamma-Poisson
Normal-Normal
In our example, we assume that the measurement data for each biomarker follows a normal distribution; however, we do not know the exact \(\mu\) and \(\sigma\). Our job is to estimate the two parameters for each biomarker based on the data we have.
According to An Introduction to Bayesian Thinking by Clyde et al. (2022), if the data comes from a normal distribution with unknown \(\mu\) and \(\sigma\), the conjugate prior for \(\mu\) has a normal distribution with mean \(m_0\) and variance \(\frac{\sigma^2}{n_0}\). The conjugate prior for \(\frac{1}{\sigma^2}\) has a Gamma distribution with shape \(\frac{v_0}{2}\) and rate \(\frac{v_0 s_0^{2}}{2}\) where
\(m_0\): prior estimate of \(\mu\).
\(n_0\): how strongly is the prior belief in \(m_0\) is held.
\(s_0^2\): prior estimate of \(\sigma^2\).
\(v_0\): prior degress of freedome, influencing the certainty of \(s_0^2\).
That is to say:
Combined, we have:
The posterior also follows a Normal-Gamma distribution:
More specifically
Based on the above two equations, we know that the mean of posterior mean is \(m_n\) and the mean of the posterior variance is \((s_n^2 v_n/2)/(v_n/2)\). This is beceause the expected value of \(Gamma(\alpha, \beta)\) is \(\frac{\alpha}{\beta}\).
where
\(m_n\): posterior mean, mode, and median for \(\mu\)
\(n_n\): posterior sample size
\(s_n^2\): posterior variance
\(v_n\): posterior degrees of freedome
The updating rules to get the new hyper-parameters:
where
\(n\): sample size
\(\bar{y}\): sample mean
\(s^2\): sample variance
Implementing estimation#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import mode
import altair as alt
from sklearn.cluster import KMeans
def estimate_params_exact(m0, n0, s0_sq, v0, data):
'''This is to estimate means and vars based on conjugate priors
Inputs:
- data: a vector of measurements
- m0: prior estimate of $\mu$.
- n0: how strongly is the prior belief in $m_0$ is held.
- s0_sq: prior estimate of $\sigma^2$.
- v0: prior degress of freedome, influencing the certainty of $s_0^2$.
Outputs:
- mu estiate, std estimate
'''
# Data summary
sample_mean = np.mean(data)
sample_size = len(data)
sample_var = np.var(data, ddof=1) # ddof=1 for unbiased estimator
# Update hyperparameters for the Normal-Inverse Gamma posterior
updated_m0 = (n0 * m0 + sample_size * sample_mean) / (n0 + sample_size)
updated_n0 = n0 + sample_size
updated_v0 = v0 + sample_size
updated_s0_sq = (1 / updated_v0) * ((sample_size - 1) * sample_var + v0 * s0_sq +
(n0 * sample_size / updated_n0) * (sample_mean - m0)**2)
updated_alpha = updated_v0/2
updated_beta = updated_v0*updated_s0_sq/2
# Posterior estimates
mu_posterior_mean = updated_m0
sigma_squared_posterior_mean = updated_beta/updated_alpha
mu_estimation = mu_posterior_mean
std_estimation = np.sqrt(sigma_squared_posterior_mean)
return mu_estimation, std_estimation
# truth data; used to validate the estimates
truth = pd.read_csv('data/means_stds.csv')
truth
biomarker | theta_mean | theta_std | phi_mean | phi_std | |
---|---|---|---|---|---|
0 | MMSE | 28.0 | 0.666667 | 22.0 | 2.666667 |
1 | ADAS | -6.0 | 1.333333 | -20.0 | 4.000000 |
2 | AB | 250.0 | 50.000000 | 150.0 | 16.666667 |
3 | P-Tau | -25.0 | 16.666667 | -50.0 | 33.333333 |
4 | HIP-FCI | 5.0 | 1.666667 | -5.0 | 6.666667 |
5 | HIP-GMI | 0.4 | 0.233333 | 0.3 | 0.333333 |
6 | AVLT-Sum | 40.0 | 15.000000 | 20.0 | 6.666667 |
7 | PCC-FCI | 12.0 | 4.000000 | 5.0 | 3.333333 |
8 | FUS-GMI | 0.6 | 0.066667 | 0.5 | 0.066667 |
9 | FUS-FCI | -10.0 | 3.333333 | -20.0 | 6.000000 |
# observed data:
observed_data = pd.read_csv('data/participant_data.csv')
observed_data.head()
participant | biomarker | measurement | k_j | S_n | affected_or_not | |
---|---|---|---|---|---|---|
0 | 0 | MMSE | 28.812457 | 5 | 5 | affected |
1 | 1 | MMSE | 28.483480 | 8 | 5 | affected |
2 | 2 | MMSE | 27.483507 | 5 | 5 | affected |
3 | 3 | MMSE | 27.886646 | 9 | 5 | affected |
4 | 4 | MMSE | 27.412189 | 8 | 5 | affected |
biomarkers = np.array(truth.biomarker)
biomarkers
array(['MMSE', 'ADAS', 'AB', 'P-Tau', 'HIP-FCI', 'HIP-GMI', 'AVLT-Sum',
'PCC-FCI', 'FUS-GMI', 'FUS-FCI'], dtype=object)
def get_estimated_means_stds_df(biomarkers, observed_data):
'''This is to get the dataframe of estimated means and stds
Inputs:
- biomarkers: biomarker names, a numpy array
- observed_data: participants data
Outputs:
- a pandas dataframe containing theta and phi means and stds
'''
# empty list of dictionaries to store the estimates
means_stds_estimate_dict_list = []
for biomarker in biomarkers:
dic = {'biomarker': biomarker} # Initialize dictionary outside the inner loop
for i in ['affected', 'not_affected']:
data_full = observed_data[(observed_data.biomarker == biomarker) & (
observed_data.affected_or_not == i)]
data = data_full.measurement
mu_estimate, std_estimate = estimate_params_exact(
m0 = 0, n0 = 1, s0_sq = 1, v0 = 1, data=data)
if i == 'affected':
dic['theta_mean'] = mu_estimate
dic['theta_std'] = std_estimate
else:
dic['phi_mean'] = mu_estimate
dic['phi_std'] = std_estimate
print(f"biomarker {biomarker} done!")
means_stds_estimate_dict_list.append(dic)
estimate_means_stds_df = pd.DataFrame(means_stds_estimate_dict_list)
return estimate_means_stds_df
estimate_df = get_estimated_means_stds_df(biomarkers, observed_data)
estimate_df.to_csv("data/estimate_means_stds.csv", index = False)
estimate_df
biomarker MMSE done!
biomarker ADAS done!
biomarker AB done!
biomarker P-Tau done!
biomarker HIP-FCI done!
biomarker HIP-GMI done!
biomarker AVLT-Sum done!
biomarker PCC-FCI done!
biomarker FUS-GMI done!
biomarker FUS-FCI done!
biomarker | theta_mean | theta_std | phi_mean | phi_std | |
---|---|---|---|---|---|
0 | MMSE | 27.617968 | 3.850069 | 22.199018 | 4.192632 |
1 | ADAS | -6.151077 | 1.533481 | -19.439533 | 4.711673 |
2 | AB | 247.235481 | 52.713262 | 146.151967 | 29.119009 |
3 | P-Tau | -25.402886 | 17.776365 | -51.237207 | 35.051918 |
4 | HIP-FCI | 4.917914 | 1.843385 | -3.768449 | 6.875395 |
5 | HIP-GMI | 0.350846 | 0.314217 | 0.253432 | 0.341515 |
6 | AVLT-Sum | 36.621843 | 15.053672 | 19.293318 | 7.846781 |
7 | PCC-FCI | 11.660100 | 3.693987 | 4.778861 | 3.538650 |
8 | FUS-GMI | 0.573859 | 0.306157 | 0.485299 | 0.141073 |
9 | FUS-FCI | -8.350404 | 4.117264 | -19.691945 | 6.087333 |
def obtain_theta_phi_params(biomarker, estimate_df, truth):
'''This is to obtain both true and estimated theta and phi params for each biomarker '''
biomarker_data_est = estimate_df[estimate_df.biomarker == biomarker].reset_index()
biomarker_data = truth[truth.biomarker == biomarker].reset_index()
# theta for affected
theta_mean_est = biomarker_data_est.theta_mean[0]
theta_std_est = biomarker_data_est.theta_std[0]
theta_mean = biomarker_data.theta_mean[0]
theta_std = biomarker_data.theta_std[0]
# phi for not affected
phi_mean_est = biomarker_data_est.phi_mean[0]
phi_std_est = biomarker_data_est.phi_std[0]
phi_mean = biomarker_data.phi_mean[0]
phi_std = biomarker_data.phi_std[0]
return theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est
charts = []
for n in biomarkers:
theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est = obtain_theta_phi_params(
n, estimate_df, truth)
mean1, std1 = theta_mean, theta_std
mean2, std2 = theta_mean_est, theta_std_est
# Generating points on the x axis
x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2),
max(mean1 + 3*std1, mean2 + 3*std2), 1000)
# Creating DataFrames for each distribution
df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})
# Combining the DataFrames
df3 = pd.concat([df1, df2])
# Altair plot
chart_theta = alt.Chart(df3).mark_line().encode(
x='x',
y='pdf',
color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
).properties(
title=f'{n}, Theta',
width=200,
height=200
)
mean1, std1 = phi_mean, phi_std
mean2, std2 = phi_mean_est, phi_std_est
# Generating points on the x axis
x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2),
max(mean1 + 3*std1, mean2 + 3*std2), 1000)
# Creating DataFrames for each distribution
df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})
# Combining the DataFrames
df3 = pd.concat([df1, df2])
# Altair plot
chart_phi = alt.Chart(df3).mark_line().encode(
x='x',
y='pdf',
color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
).properties(
title=f'{n}, Phi',
width=200,
height=200
)
# Concatenate theta and phi charts horizontally
hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")
# Append the concatenated chart to the list of charts
charts.append(hconcat_chart)
# Concatenate all the charts vertically
final_chart = alt.vconcat(*charts).properties(title = "Comparing Theta and Phi Distributions Using Conjugate Priors")
# Display the final chart
final_chart.display()
Using K-means#
# We know whether this participant is healthy or not
observed_data['Diseased'] = observed_data.apply(lambda row: row.k_j > 0, axis = 1)
observed_data.head()
# select only known information
data = observed_data[['participant', 'biomarker', 'measurement', 'Diseased']]
data.head()
participant | biomarker | measurement | Diseased | |
---|---|---|---|---|
0 | 0 | MMSE | 28.812457 | True |
1 | 1 | MMSE | 28.483480 | True |
2 | 2 | MMSE | 27.483507 | True |
3 | 3 | MMSE | 27.886646 | True |
4 | 4 | MMSE | 27.412189 | True |
def get_theta_phi(data, biomarker, kmeans_setup):
"""To get theta and phi parametesr
Input:
- data: data we have right now, without access to S_n and kj
- biomarker: a string of biomarker name
Output:
mean and std of theta and phi
"""
# two empty clusters to strore measurements
clusters = [[] for _ in range(2)]
# dataframe for this biomarker
biomarker_df = data[data['biomarker'] == biomarker].reset_index(drop=True)
measurements = np.array(biomarker_df['measurement'])
# reshape to satisfy sklearn requirements
measurements = measurements.reshape(-1, 1)
# dataframe for non-diseased participants
healthy_df = biomarker_df[biomarker_df['Diseased'] == False]
kmeans = kmeans_setup.fit(measurements)
predictions = kmeans.predict(measurements)
# to store measurements into their cluster
for i, prediction in enumerate(predictions):
clusters[prediction].append(measurements[i][0])
# which cluster are healthy participants in
healthy_predictions = kmeans.predict(measurements[healthy_df.index])
# the mode of the above predictions will be the phi cluster index
phi_cluster_idx = mode(healthy_predictions, keepdims=False).mode
theta_cluster_idx = 1 - phi_cluster_idx
theta_mean, theta_std = np.mean(clusters[theta_cluster_idx]), np.std(clusters[theta_cluster_idx])
phi_mean, phi_std = np.mean(clusters[phi_cluster_idx]), np.std(clusters[phi_cluster_idx])
return theta_mean, theta_std, phi_mean, phi_std
kmeans_setup = KMeans(n_clusters=2, random_state=0, n_init="auto")
# empty list of dictionaries to store the estimates
means_stds_estimate_dict_list = []
for biomarker in biomarkers:
dic = {'biomarker': biomarker}
theta_mean, theta_std, phi_mean, phi_std = get_theta_phi(data, biomarker, kmeans_setup)
dic['theta_mean'] = theta_mean
dic['theta_std'] = theta_std
dic['phi_mean'] = phi_mean
dic['phi_std'] = phi_std
means_stds_estimate_dict_list.append(dic)
estimate_df = pd.DataFrame(means_stds_estimate_dict_list)
estimate_df.to_csv("data/estimate_means_stds_kmeans.csv", index=False)
charts = []
for n in biomarkers:
theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est = obtain_theta_phi_params(
n, estimate_df, truth)
mean1, std1 = theta_mean, theta_std
mean2, std2 = theta_mean_est, theta_std_est
# Generating points on the x axis
x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2),
max(mean1 + 3*std1, mean2 + 3*std2), 1000)
# Creating DataFrames for each distribution
df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})
# Combining the DataFrames
df3 = pd.concat([df1, df2])
# Altair plot
chart_theta = alt.Chart(df3).mark_line().encode(
x='x',
y='pdf',
color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
).properties(
title=f'{n}, Theta',
width=200,
height=200
)
mean1, std1 = phi_mean, phi_std
mean2, std2 = phi_mean_est, phi_std_est
# Generating points on the x axis
x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2),
max(mean1 + 3*std1, mean2 + 3*std2), 1000)
# Creating DataFrames for each distribution
df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})
# Combining the DataFrames
df3 = pd.concat([df1, df2])
# Altair plot
chart_phi = alt.Chart(df3).mark_line().encode(
x='x',
y='pdf',
color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
).properties(
title=f'{n}, Phi',
width=200,
height=200
)
# Concatenate theta and phi charts horizontally
hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")
# Append the concatenated chart to the list of charts
charts.append(hconcat_chart)
# Concatenate all the charts vertically
final_chart = alt.vconcat(*charts).properties(title = "Comparing Theta and Phi Distributions Using K-Means Clustering")
# Display the final chart
final_chart.display()