Estimating distribution parameters

Estimating distribution parameters#

If I were to write a sampler for it, I’d first write one where I knew the correct ordering and the stage the participant’s disease was in and inferring the means of the Gaussians for both distributions based on that. We didn’t discuss priors on the Gaussians means and variances, but I would use conjugate ones in which case the update is exact.

Therefore, in this estimation, I assume that we know \(S\) (therefore \(S_n\)) and \(k_j\) and am estimating \(\theta\) and \(\phi\) based on participants’ data. We also know that for each biomarker, its value is drawn from a normal distribution, either from the healthy (\(\theta\)) or the diseased (\(\phi\)) distribution.

Conjugate priors#

Conjugacy occurs when the posterior distribution is in the same family of distribution as the prior distribution, but with new parameter values.

Why conjugacy is important? Because without it, one has to do the integration, which oftentimes is hard.

Three major conjugate families:

Beta-Binomial
Gamma-Poisson
Normal-Normal

In our example, we assume that the measurement data for each biomarker follows a normal distribution; however, we do not know the exact \(\mu\) and \(\sigma\). Our job is to estimate the two parameters for each biomarker based on the data we have.

According to An Introduction to Bayesian Thinking by Clyde et al. (2022), if the data comes from a normal distribution with unknown \(\mu\) and \(\sigma\), the conjugate prior for \(\mu\) has a normal distribution with mean \(m_0\) and variance \(\frac{\sigma^2}{n_0}\). The conjugate prior for \(\frac{1}{\sigma^2}\) has a Gamma distribution with shape \(\frac{v_0}{2}\) and rate \(\frac{v_0 s_0^{2}}{2}\) where

\(m_0\): prior estimate of \(\mu\).
\(n_0\): how strongly is the prior belief in \(m_0\) is held.
\(s_0^2\): prior estimate of \(\sigma^2\).
\(v_0\): prior degress of freedome, influencing the certainty of \(s_0^2\).

That is to say:

\[\mu | \sigma^2 \sim \mathcal{N}(m_0, \sigma^2/n_0)\]

\[1/\sigma^2 \sim Gamma\left(\frac{v_0}{2}, \frac{v_0 s_0^2}{2} \right)\]

Combined, we have:

\[(\mu, 1/\sigma^2) \sim NormalGamma(m_0, n_0, s_0^2, v_0)\]

The posterior also follows a Normal-Gamma distribution:

\[(\mu, 1/\sigma^2) | data \sim NormalGamma(m_n, n_n, s_n^2, v_n)\]

More specifically

\[1/\sigma^2 | data \sim Gamma(v_n/2, s_n^2 v_n/2)\]

\[\mu | data, \sigma^2 \sim \mathcal{N}(m_n, \sigma^2/n_n)\]

Based on the above two equations, we know that the mean of posterior mean is \(m_n\) and the mean of the posterior variance is \((s_n^2 v_n/2)/(v_n/2)\). This is beceause the expected value of \(Gamma(\alpha, \beta)\) is \(\frac{\alpha}{\beta}\).

where

\(m_n\): posterior mean, mode, and median for \(\mu\)
\(n_n\): posterior sample size
\(s_n^2\): posterior variance
\(v_n\): posterior degrees of freedome

The updating rules to get the new hyper-parameters:

\[m_n = \frac{n \bar{y} + n_0m_0}{n + n_0}\]

\[\frac{n}{n+n_0} \bar{y} + \frac{n_0}{n+n_0}m_0\]

\[n_n = n_0 + n\]

\[v_n = v_0 + n\]

\[s_n^2 = \frac{1}{v_n}\left[s^2(n-1) + s_0^2v_0 + \frac{n_0n}{n_n}(\bar{y}-m_0)^2\right]\]

where

\(n\): sample size
\(\bar{y}\): sample mean
\(s^2\): sample variance

Implementing estimation#

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import mode
import altair as alt
from sklearn.cluster import KMeans

def estimate_params_exact(m0, n0, s0_sq, v0, data):
    '''This is to estimate means and vars based on conjugate priors
    Inputs:
        - data: a vector of measurements 
        - m0: prior estimate of $\mu$.
        - n0: how strongly is the prior belief in $m_0$ is held.
        - s0_sq: prior estimate of $\sigma^2$.
        - v0: prior degress of freedome, influencing the certainty of $s_0^2$.
    
    Outputs:
        - mu estiate, std estimate
    '''
    # Data summary
    sample_mean = np.mean(data)
    sample_size = len(data)
    sample_var = np.var(data, ddof=1)  # ddof=1 for unbiased estimator

    # Update hyperparameters for the Normal-Inverse Gamma posterior
    updated_m0 = (n0 * m0 + sample_size * sample_mean) / (n0 + sample_size)
    updated_n0 = n0 + sample_size
    updated_v0 = v0 + sample_size 
    updated_s0_sq = (1 / updated_v0) * ((sample_size - 1) * sample_var + v0 * s0_sq + 
                    (n0 * sample_size / updated_n0) * (sample_mean - m0)**2)
    updated_alpha = updated_v0/2
    updated_beta = updated_v0*updated_s0_sq/2

    # Posterior estimates
    mu_posterior_mean = updated_m0
    sigma_squared_posterior_mean = updated_beta/updated_alpha

    mu_estimation = mu_posterior_mean
    std_estimation = np.sqrt(sigma_squared_posterior_mean)

    return mu_estimation, std_estimation

# truth data; used to validate the estimates
truth = pd.read_csv('data/means_stds.csv')
truth

	biomarker	theta_mean	theta_std	phi_mean	phi_std
0	MMSE	28.0	0.666667	22.0	2.666667
1	ADAS	-6.0	1.333333	-20.0	4.000000
2	AB	250.0	50.000000	150.0	16.666667
3	P-Tau	-25.0	16.666667	-50.0	33.333333
4	HIP-FCI	5.0	1.666667	-5.0	6.666667
5	HIP-GMI	0.4	0.233333	0.3	0.333333
6	AVLT-Sum	40.0	15.000000	20.0	6.666667
7	PCC-FCI	12.0	4.000000	5.0	3.333333
8	FUS-GMI	0.6	0.066667	0.5	0.066667
9	FUS-FCI	-10.0	3.333333	-20.0	6.000000

# observed data:
observed_data = pd.read_csv('data/participant_data.csv')
observed_data.head()

	participant	biomarker	measurement	k_j	S_n	affected_or_not
0	0	MMSE	28.812457	5	5	affected
1	1	MMSE	28.483480	8	5	affected
2	2	MMSE	27.483507	5	5	affected
3	3	MMSE	27.886646	9	5	affected
4	4	MMSE	27.412189	8	5	affected

biomarkers = np.array(truth.biomarker)
biomarkers

array(['MMSE', 'ADAS', 'AB', 'P-Tau', 'HIP-FCI', 'HIP-GMI', 'AVLT-Sum',
       'PCC-FCI', 'FUS-GMI', 'FUS-FCI'], dtype=object)

def get_estimated_means_stds_df(biomarkers, observed_data):
    '''This is to get the dataframe of estimated means and stds 
    Inputs:
        - biomarkers: biomarker names, a numpy array
        - observed_data: participants data 
    Outputs:
        - a pandas dataframe containing theta and phi means and stds
    '''
    # empty list of dictionaries to store the estimates 
    means_stds_estimate_dict_list = []
    
    for biomarker in biomarkers: 
        dic = {'biomarker': biomarker}  # Initialize dictionary outside the inner loop
        for i in ['affected', 'not_affected']:
            data_full = observed_data[(observed_data.biomarker == biomarker) & (
            observed_data.affected_or_not == i)]
            data = data_full.measurement
            mu_estimate, std_estimate = estimate_params_exact(
                m0 = 0, n0 = 1, s0_sq = 1, v0 = 1, data=data)
            if i == 'affected':
                dic['theta_mean'] = mu_estimate
                dic['theta_std'] = std_estimate
            else:
                dic['phi_mean'] = mu_estimate
                dic['phi_std'] = std_estimate
        print(f"biomarker {biomarker} done!")
        means_stds_estimate_dict_list.append(dic)
    estimate_means_stds_df = pd.DataFrame(means_stds_estimate_dict_list)
    return estimate_means_stds_df 

estimate_df = get_estimated_means_stds_df(biomarkers, observed_data)
estimate_df.to_csv("data/estimate_means_stds.csv", index = False)
estimate_df

biomarker MMSE done!
biomarker ADAS done!
biomarker AB done!
biomarker P-Tau done!
biomarker HIP-FCI done!
biomarker HIP-GMI done!
biomarker AVLT-Sum done!
biomarker PCC-FCI done!
biomarker FUS-GMI done!
biomarker FUS-FCI done!

	biomarker	theta_mean	theta_std	phi_mean	phi_std
0	MMSE	27.617968	3.850069	22.199018	4.192632
1	ADAS	-6.151077	1.533481	-19.439533	4.711673
2	AB	247.235481	52.713262	146.151967	29.119009
3	P-Tau	-25.402886	17.776365	-51.237207	35.051918
4	HIP-FCI	4.917914	1.843385	-3.768449	6.875395
5	HIP-GMI	0.350846	0.314217	0.253432	0.341515
6	AVLT-Sum	36.621843	15.053672	19.293318	7.846781
7	PCC-FCI	11.660100	3.693987	4.778861	3.538650
8	FUS-GMI	0.573859	0.306157	0.485299	0.141073
9	FUS-FCI	-8.350404	4.117264	-19.691945	6.087333

def obtain_theta_phi_params(biomarker, estimate_df, truth):
    '''This is to obtain both true and estimated theta and phi params for each biomarker '''
    biomarker_data_est = estimate_df[estimate_df.biomarker == biomarker].reset_index()
    biomarker_data = truth[truth.biomarker == biomarker].reset_index()
    # theta for affected
    theta_mean_est = biomarker_data_est.theta_mean[0]
    theta_std_est = biomarker_data_est.theta_std[0]

    theta_mean = biomarker_data.theta_mean[0]
    theta_std = biomarker_data.theta_std[0]

    # phi for not affected
    phi_mean_est = biomarker_data_est.phi_mean[0]
    phi_std_est = biomarker_data_est.phi_std[0]

    phi_mean = biomarker_data.phi_mean[0]
    phi_std = biomarker_data.phi_std[0]

    return theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est

charts = []
for n in biomarkers: 
    theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est = obtain_theta_phi_params(
    n, estimate_df, truth)
    mean1, std1 = theta_mean, theta_std
    mean2, std2 = theta_mean_est, theta_std_est

    # Generating points on the x axis
    x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_theta = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
    ).properties(
        title=f'{n}, Theta',
        width=200,
        height=200
        )

    mean1, std1 = phi_mean, phi_std
    mean2, std2 = phi_mean_est, phi_std_est

    # Generating points on the x axis
    x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_phi = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
    ).properties(
        title=f'{n}, Phi',
        width=200,
        height=200
        )
    
    # Concatenate theta and phi charts horizontally
    hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")

    # Append the concatenated chart to the list of charts
    charts.append(hconcat_chart)
# Concatenate all the charts vertically
final_chart = alt.vconcat(*charts).properties(title = "Comparing Theta and Phi Distributions Using Conjugate Priors")

# Display the final chart
final_chart.display()

Using K-means#

# We know whether this participant is healthy or not
observed_data['Diseased'] = observed_data.apply(lambda row: row.k_j > 0, axis = 1)
observed_data.head()
# select only known information
data = observed_data[['participant', 'biomarker', 'measurement', 'Diseased']]
data.head()

	participant	biomarker	measurement	Diseased
0	0	MMSE	28.812457	True
1	1	MMSE	28.483480	True
2	2	MMSE	27.483507	True
3	3	MMSE	27.886646	True
4	4	MMSE	27.412189	True

def get_theta_phi(data, biomarker, kmeans_setup):
    """To get theta and phi parametesr
    Input:
        - data: data we have right now, without access to S_n and kj
        - biomarker: a string of biomarker name
    Output:
        mean and std of theta and phi
    """
    # two empty clusters to strore measurements
    clusters = [[] for _ in range(2)]
    # dataframe for this biomarker
    biomarker_df = data[data['biomarker'] == biomarker].reset_index(drop=True)
    measurements = np.array(biomarker_df['measurement'])
    # reshape to satisfy sklearn requirements
    measurements = measurements.reshape(-1, 1)
    # dataframe for non-diseased participants
    healthy_df = biomarker_df[biomarker_df['Diseased'] == False]
    kmeans = kmeans_setup.fit(measurements)
    predictions = kmeans.predict(measurements)
    # to store measurements into their cluster
    for i, prediction in enumerate(predictions):
        clusters[prediction].append(measurements[i][0])
    # which cluster are healthy participants in
    healthy_predictions = kmeans.predict(measurements[healthy_df.index])
    # the mode of the above predictions will be the phi cluster index
    phi_cluster_idx = mode(healthy_predictions, keepdims=False).mode
    theta_cluster_idx = 1 - phi_cluster_idx
    theta_mean, theta_std = np.mean(clusters[theta_cluster_idx]), np.std(clusters[theta_cluster_idx])
    phi_mean, phi_std = np.mean(clusters[phi_cluster_idx]), np.std(clusters[phi_cluster_idx])
    return theta_mean, theta_std, phi_mean, phi_std

kmeans_setup = KMeans(n_clusters=2, random_state=0, n_init="auto")
# empty list of dictionaries to store the estimates 
means_stds_estimate_dict_list = []
for biomarker in biomarkers:
    dic = {'biomarker': biomarker}
    theta_mean, theta_std, phi_mean, phi_std = get_theta_phi(data, biomarker, kmeans_setup)
    dic['theta_mean'] = theta_mean
    dic['theta_std'] = theta_std
    dic['phi_mean'] = phi_mean
    dic['phi_std'] = phi_std
    means_stds_estimate_dict_list.append(dic)
estimate_df = pd.DataFrame(means_stds_estimate_dict_list)
estimate_df.to_csv("data/estimate_means_stds_kmeans.csv", index=False)

charts = []
for n in biomarkers: 
    theta_mean, theta_std, theta_mean_est, theta_std_est, phi_mean, phi_std, phi_mean_est, phi_std_est = obtain_theta_phi_params(
    n, estimate_df, truth)
    mean1, std1 = theta_mean, theta_std
    mean2, std2 = theta_mean_est, theta_std_est

    # Generating points on the x axis
    x_thetas = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_thetas, 'pdf': norm.pdf(x_thetas, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_theta = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Theta"))
    ).properties(
        title=f'{n}, Theta',
        width=200,
        height=200
        )

    mean1, std1 = phi_mean, phi_std
    mean2, std2 = phi_mean_est, phi_std_est

    # Generating points on the x axis
    x_phis = np.linspace(min(mean1 - 3*std1, mean2 - 3*std2), 
                    max(mean1 + 3*std1, mean2 + 3*std2), 1000)

    # Creating DataFrames for each distribution
    df1 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean1, std1), 'Distribution': 'Actual'})
    df2 = pd.DataFrame({'x': x_phis, 'pdf': norm.pdf(x_phis, mean2, std2), 'Distribution': 'Estimated'})

    # Combining the DataFrames
    df3 = pd.concat([df1, df2])

    # Altair plot
    chart_phi = alt.Chart(df3).mark_line().encode(
        x='x',
        y='pdf',
        color=alt.Color('Distribution:N', legend=alt.Legend(title="Phi"))
    ).properties(
        title=f'{n}, Phi',
        width=200,
        height=200
        )
    
    # Concatenate theta and phi charts horizontally
    hconcat_chart = alt.hconcat(chart_theta, chart_phi).resolve_scale(color="independent")

    # Append the concatenated chart to the list of charts
    charts.append(hconcat_chart)
# Concatenate all the charts vertically
final_chart = alt.vconcat(*charts).properties(title = "Comparing Theta and Phi Distributions Using K-Means Clustering")

# Display the final chart
final_chart.display()

Estimating distribution parameters

Contents

Estimating distribution parameters#

Conjugate priors#

Implementing estimation#

Using K-means#