Stroke prediction dataset (Source: Kaggle)
Goal: predict the occurrence of a stroke (the target/ground truth) from several features such as gender, age…smoking status
The features are both numerical (e.g. age) and non-numerical (e.g. work_type)
For a machine-learning model, non-numerical data has to be converted to numerical form
Even among the numerical data there are two types: integer-valued (e.g. hypertension) and real-valued (e.g. avg_glucose_level)
%config InlineBackend.figure_format = 'svg'
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
Import the data ('.csv', '.xlsx', or '.txt' format; here a '.csv' file)
filename='healthcare-dataset-stroke-data.csv'
data=pd.read_csv(filename,index_col=0)
np.random.seed(42)
data=data.sample(frac=1)
data=data[:500]
We keep only the first 500 rows of the shuffled dataset so that the tutorial runs quickly
Look at the first 10 rows (data points) in the table
data.head(10)
| id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 40041 | Male | 31.00 | 0 | 0 | No | Self-employed | Rural | 64.85 | 23.0 | Unknown | 0 |
| 55244 | Male | 40.00 | 0 | 0 | Yes | Self-employed | Rural | 65.29 | 28.3 | never smoked | 0 |
| 70992 | Female | 8.00 | 0 | 0 | No | children | Urban | 74.42 | 22.5 | Unknown | 0 |
| 38207 | Female | 79.00 | 1 | 0 | Yes | Self-employed | Rural | 76.64 | 19.5 | never smoked | 0 |
| 8541 | Female | 75.00 | 0 | 0 | Yes | Govt_job | Rural | 94.77 | 27.2 | never smoked | 0 |
| 2467 | Female | 79.00 | 1 | 0 | Yes | Self-employed | Rural | 92.43 | NaN | never smoked | 0 |
| 19828 | Female | 56.00 | 1 | 0 | Yes | Private | Rural | 97.37 | 34.1 | smokes | 0 |
| 621 | Male | 69.00 | 0 | 0 | Yes | Private | Rural | 101.52 | 26.8 | smokes | 0 |
| 54975 | Male | 7.00 | 0 | 0 | No | Self-employed | Rural | 64.06 | 18.9 | Unknown | 0 |
| 57485 | Female | 1.48 | 0 | 0 | No | children | Rural | 55.51 | 18.5 | Unknown | 0 |
Look at the data dimensions
data.shape
(5110, 11)
Look at column names in the data
data.columns
Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
'smoking_status', 'stroke'],
dtype='object')
Look at the distribution of missing values across the data columns. We address the issue of missing data in a preprocessing step discussed later
data.isna().sum()
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
For working with ML algorithms we need only numerical data. We therefore check whether any non-numerical data remain, so that we can convert them to numerical form before applying ML.
data.dtypes
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object
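Rather than reading the dtypes by eye, the object-typed columns can also be listed programmatically with `select_dtypes`. A minimal sketch on a toy frame that mimics the mixed-type columns of the stroke data:

```python
import pandas as pd

# Small frame mimicking the mixed-type columns of the stroke data
df = pd.DataFrame({
    "gender": ["Male", "Female"],   # object  -> still needs encoding
    "age": [31.0, 40.0],            # float64 -> already numeric
    "hypertension": [0, 1],         # int64   -> already numeric
})

# Columns that still need a numeric encoding
non_numeric = df.select_dtypes(include="object").columns.tolist()
print(non_numeric)
```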
Approaches to make raw, real-world data suitable for machine learning:
data['gender'].value_counts()
Female    2994
Male      2115
Other        1
Name: gender, dtype: int64
data['gender'][:10]
id
40041      Male
55244      Male
70992    Female
38207    Female
8541     Female
2467     Female
19828    Female
621        Male
54975      Male
57485    Female
Name: gender, dtype: object
cleanup_nums = {'gender': {"Female": 1, "Male": 0, "Other": -1}}
data.replace(cleanup_nums, inplace=True)
data['gender'][:10]
id
40041    0
55244    0
70992    1
38207    1
8541     1
2467     1
19828    1
621      0
54975    0
57485    1
Name: gender, dtype: int64
data['ever_married'].value_counts()
Yes    3353
No     1757
Name: ever_married, dtype: int64
cleanup_nums = {'ever_married': {"Yes": 1, "No": 0}}
data.replace(cleanup_nums, inplace=True)
data['work_type'].value_counts()
Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: work_type, dtype: int64
cleanup_nums = {'work_type': {"Private": 0, "Self-employed": 1, 'children':2, 'Govt_job':3, 'Never_worked':-1}}
data.replace(cleanup_nums, inplace=True)
data['Residence_type'].value_counts()
Urban    2596
Rural    2514
Name: Residence_type, dtype: int64
cleanup_nums = {'Residence_type': {"Urban": 0, "Rural": 1}}
data.replace(cleanup_nums, inplace=True)
data['smoking_status'].value_counts()
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64
cleanup_nums = {'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes':2}}
data.replace(cleanup_nums, inplace=True)
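A note on the integer codes above: they impose an artificial ordering on nominal categories such as work_type. One-hot encoding with `pd.get_dummies` is a common alternative that avoids this; a small sketch (not used in this tutorial):

```python
import pandas as pd

df = pd.DataFrame({"work_type": ["Private", "children", "Govt_job", "Private"]})

# One-hot encoding: one binary column per category, no implied ordering
dummies = pd.get_dummies(df["work_type"], prefix="work_type")
print(sorted(dummies.columns))
```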
It is good practice to record the non-numeric-to-numeric preprocessing transformation in the form of a dictionary. We can use this later during data visualization.
rev_dict = {'gender': {"Female": 1, "Male": 0, "Other": -1},
            'ever_married': {"Yes": 1, "No": 0},
            'work_type': {"Private": 0, "Self-employed": 1, 'children': 2, 'Govt_job': 3, 'Never_worked': -1},
            'Residence_type': {"Urban": 0, "Rural": 1},
            'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes': 2},
            'hypertension': {"Yes": 1, "No": 0},
            'heart_disease': {"Yes": 1, "No": 0},
            'stroke': {"Yes": 1, "No": 0}}
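Inverting one of these recorded mappings (as we do later for plot legends) is a one-liner; a minimal sketch using the gender mapping:

```python
# Forward mapping as recorded during preprocessing
gender_map = {"Female": 1, "Male": 0, "Other": -1}

# Invert it (label -> code becomes code -> label) for use in plot legends
inv_gender = {v: k for k, v in gender_map.items()}
print(inv_gender[1])
```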
Finally we can recheck whether there is any non-numeric feature left in the data:
data.dtypes
gender                 int64
age                  float64
hypertension           int64
heart_disease          int64
ever_married           int64
work_type              int64
Residence_type         int64
avg_glucose_level    float64
bmi                  float64
smoking_status         int64
stroke                 int64
dtype: object
There are various techniques to impute missing values, for example replacing them with the column mean or mode, or interpolating between the closest available data points. For reasons of time we do not explore the details in this tutorial.
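For illustration, the simplest of these, column-mean imputation, can be done with pandas alone (the tutorial itself uses the fdc helper instead); a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bmi": [23.0, 28.3, np.nan, 19.5]})

# Replace missing bmi values with the mean of the observed values
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())
print(df["bmi"].isna().sum())
```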
from fdc.missingValues import fix_missing_values
data = fix_missing_values(data, 4)
We can finally recheck if there is any missing value left in the data
print(data.isna().sum())
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
It is important to quantify the information content of features and to remove features with low information content
We now calculate the information content of categorical and continuous features separately.
nom_features=['gender','ever_married','work_type','Residence_type']
ord_features=['hypertension','heart_disease','stroke','smoking_status']
cont_features=['age','avg_glucose_level', 'bmi']
For categorical features we use the Shannon entropy.
from scipy.stats import entropy
cat_feature_entropies=[]
for feature in nom_features+ord_features:
    x=entropy(data[feature].value_counts(), base=2)/(len(data[feature].value_counts().index)-1)
    cat_feature_entropies.append(x)
    print(feature+' '+str(x))
cat_feature_entropies=np.array(cat_feature_entropies)
gender 0.4905211962124562
ever_married 0.9284419193163191
work_type 0.42189950924311775
Residence_type 0.9998142413320442
hypertension 0.4608789353279434
heart_disease 0.30320009106772317
stroke 0.28096875511730324
smoking_status 0.6355622137734004
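As a sanity check on the entropy call, scipy's `entropy` on a raw value-counts vector agrees with the textbook formula −Σ p·log₂ p (scipy normalizes the counts to probabilities internally); for a perfectly balanced binary feature both give 1 bit:

```python
import numpy as np
from scipy.stats import entropy

counts = np.array([50, 50])              # perfectly balanced binary feature
h = entropy(counts, base=2)              # scipy normalizes counts to probabilities

probs = counts / counts.sum()
h_manual = -sum(p * np.log2(p) for p in probs)
print(h, h_manual)
```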
For continuous data we first standardize the data (each feature mean is transformed to zero and each feature variance to one). Then we calculate the range (absolute difference between max and min) for every feature
cont_feature_spread=[]
for feature in cont_features:
    standardized=(data[feature]-data[feature].mean())/data[feature].std()
    x=abs(standardized.max()-standardized.min())
    cont_feature_spread.append(x)
    print(feature+' '+str(x))
cont_feature_spread=np.array(cont_feature_spread)
age 3.6227515072910723
avg_glucose_level 4.783634486327282
bmi 11.243565413481782
Finally we recheck the preprocessed data
data.head(10)
| id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 40041 | 0 | 31.00 | 0 | 0 | 0 | 1 | 1 | 64.85 | 23.000000 | -1 | 0 |
| 55244 | 0 | 40.00 | 0 | 0 | 1 | 1 | 1 | 65.29 | 28.300000 | 0 | 0 |
| 70992 | 1 | 8.00 | 0 | 0 | 0 | 2 | 0 | 74.42 | 22.500000 | -1 | 0 |
| 38207 | 1 | 79.00 | 1 | 0 | 1 | 1 | 1 | 76.64 | 19.500000 | 0 | 0 |
| 8541 | 1 | 75.00 | 0 | 0 | 1 | 3 | 1 | 94.77 | 27.200000 | 0 | 0 |
| 2467 | 1 | 79.00 | 1 | 0 | 1 | 1 | 1 | 92.43 | 27.233333 | 0 | 0 |
| 19828 | 1 | 56.00 | 1 | 0 | 1 | 0 | 1 | 97.37 | 34.100000 | 2 | 0 |
| 621 | 0 | 69.00 | 0 | 0 | 1 | 0 | 1 | 101.52 | 26.800000 | 2 | 0 |
| 54975 | 0 | 7.00 | 0 | 0 | 0 | 1 | 1 | 64.06 | 18.900000 | -1 | 0 |
| 57485 | 1 | 1.48 | 0 | 0 | 0 | 2 | 1 | 55.51 | 18.500000 | -1 | 0 |
data.shape
(5110, 11)
Parameter settings: eps=1.5, min_samples=20
There are four steps to using an ML model in scikit-learn: import the model class, instantiate it with hyperparameter settings, fit it to the data, and use it for prediction
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples = 20)
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)
Now we look at the distribution of clusters obtained by DBSCAN under the specific parameter settings
for i in range(len(values_dbscan)):
    print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 870
Cluster 1 counts: 339
Cluster 2 counts: 390
Cluster 3 counts: 431
Cluster 4 counts: 113
Cluster 5 counts: 50
Cluster 6 counts: 447
Cluster 7 counts: 645
Cluster 8 counts: 345
Cluster 9 counts: 63
Cluster 10 counts: 671
Cluster 11 counts: 134
Cluster 12 counts: 39
Cluster 13 counts: 399
Cluster 14 counts: 73
Cluster 15 counts: 35
Cluster 16 counts: 41
Cluster 17 counts: 25
Note that if we change the parameter settings, the model's predictions change
Parameter settings: eps=2, min_samples=10
dbscan = DBSCAN(eps=2, min_samples = 10)
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)
for i in range(len(values_dbscan)):
    print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 316
Cluster 1 counts: 360
Cluster 2 counts: 475
Cluster 3 counts: 463
Cluster 4 counts: 87
Cluster 5 counts: 780
Cluster 6 counts: 468
Cluster 7 counts: 42
Cluster 8 counts: 360
Cluster 9 counts: 499
Cluster 10 counts: 824
Cluster 11 counts: 28
Cluster 12 counts: 98
Cluster 13 counts: 61
Cluster 14 counts: 46
Cluster 15 counts: 79
Cluster 16 counts: 24
Cluster 17 counts: 20
Cluster 18 counts: 30
Cluster 19 counts: 24
Cluster 20 counts: 26
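The two settings can also be compared quantitatively, for example with a silhouette score computed on the non-noise points; a sketch on synthetic blobs (the data and parameter values here are illustrative, not from the tutorial):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
mask = labels != -1                       # drop DBSCAN noise points
score = silhouette_score(X[mask], labels[mask])
print(round(score, 3))                    # closer to 1 = tighter, better-separated clusters
```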
Parameter settings: n_clusters=5, affinity="euclidean", linkage="ward"
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=5,affinity="euclidean",linkage="ward")
clusters_agg=agg.fit_predict((data-data.mean())/data.std())
(values_agg,counts_agg) = np.unique(clusters_agg,return_counts=True)
for i in range(len(values_agg)):
    print('Cluster '+str(values_agg[i]+1)+' counts: '+str(counts_agg[i]))
Cluster 1 counts: 1564
Cluster 2 counts: 249
Cluster 3 counts: 2694
Cluster 4 counts: 374
Cluster 5 counts: 229
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
low_dim_PCA=pca.fit_transform((data-data.mean())/data.std())
result_PCA = pd.DataFrame(data=low_dim_PCA, columns=['PCA_1', 'PCA_2'])
result_PCA['DBSCAN_clusters']=clusters_dbscan+1
result_PCA['Agglomerative_clusters']=clusters_agg+1
result_PCA['stroke']=np.array(data['stroke'])
result_PCA.head(10)
|   | PCA_1 | PCA_2 | DBSCAN_clusters | Agglomerative_clusters | stroke |
|---|---|---|---|---|---|
| 0 | -2.048564 | 0.693222 | 1 | 1 | 0 |
| 1 | -0.287871 | -0.118575 | 2 | 3 | 0 |
| 2 | -2.702238 | 0.325357 | 3 | 1 | 0 |
| 3 | 1.198394 | 0.294427 | 4 | 4 | 0 |
| 4 | 0.307710 | 0.036912 | 5 | 3 | 0 |
| 5 | 1.646739 | 0.153579 | 4 | 4 | 0 |
| 6 | 2.240902 | -0.878169 | 4 | 4 | 0 |
| 7 | 1.316128 | -0.628434 | 2 | 3 | 0 |
| 8 | -2.802535 | 0.826701 | 1 | 1 | 0 |
| 9 | -3.167844 | 0.319876 | 6 | 1 | 0 |
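It is also worth checking how much of the total variance the two retained components actually explain, via `explained_variance_ratio_`; a sketch on stand-in data with the same number of standardized features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 11))            # stand-in for the 11 standardized features

pca = PCA(n_components=2).fit(X)
explained = pca.explained_variance_ratio_.sum()
print(round(explained, 3))                # fraction of total variance kept in 2-D
```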
import seaborn as sns
from matplotlib import pyplot as plt
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x="PCA_1", y="PCA_2",
data=result_PCA,
fit_reg=False,
legend=True,
hue='DBSCAN_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x="PCA_1", y="PCA_2",
data=result_PCA,
fit_reg=False,
legend=True,
hue='Agglomerative_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x="PCA_1", y="PCA_2",
data=result_PCA,
fit_reg=False,
legend=True,
hue='stroke', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
import umap.umap_ as umap
low_dim_UMAP=umap.UMAP(n_neighbors=5, min_dist=0.1, n_components=2, random_state=42).fit_transform((data-data.mean())/data.std())
result_UMAP = pd.DataFrame(data=low_dim_UMAP, columns=['UMAP_1', 'UMAP_2'])
result_UMAP['DBSCAN_clusters']=clusters_dbscan+1
result_UMAP['Agglomerative_clusters']=clusters_agg+1
result_UMAP['stroke']=np.array(data['stroke'])
result_UMAP.head(10)
|   | UMAP_1 | UMAP_2 | DBSCAN_clusters | Agglomerative_clusters | stroke |
|---|---|---|---|---|---|
| 0 | 3.452700 | 11.053363 | 1 | 1 | 0 |
| 1 | -5.089006 | 9.837020 | 2 | 3 | 0 |
| 2 | 6.440556 | -8.797561 | 3 | 1 | 0 |
| 3 | -10.304930 | 1.716699 | 4 | 4 | 0 |
| 4 | 1.955747 | -8.052000 | 5 | 3 | 0 |
| 5 | -10.170257 | 1.757546 | 4 | 4 | 0 |
| 6 | -10.213698 | 2.286936 | 4 | 4 | 0 |
| 7 | -4.746089 | 7.223530 | 2 | 3 | 0 |
| 8 | 3.480995 | 11.137195 | 1 | 1 | 0 |
| 9 | 5.691496 | -6.034472 | 6 | 1 | 0 |
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='DBSCAN_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='Agglomerative_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='stroke', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
smoking_status_dict=rev_dict['smoking_status']
smoking_status_dict=dict((v,k) for k,v in smoking_status_dict.items())
smoking_status_labels=[]
for i in range(len(data['smoking_status'])):
    smoking_status_labels.append(smoking_status_dict[np.array(data['smoking_status'])[i]])
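The element-wise loop above can be written more compactly with `Series.map`, which applies the inverted dictionary to every entry at once; a minimal sketch with hand-written codes:

```python
import pandas as pd

codes = pd.Series([0, -1, 2, 1])
code_to_label = {0: "never smoked", -1: "Unknown", 1: "formerly smoked", 2: "smokes"}

# Series.map applies the dictionary to every element in one vectorized call
labels = codes.map(code_to_label)
print(labels.tolist())
```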
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
result_UMAP['smoking_status']=np.array(smoking_status_labels)
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='smoking_status', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Residence_type_dict=rev_dict['Residence_type']
Residence_type_dict=dict((v,k) for k,v in Residence_type_dict.items())
Residence_type_labels=[]
for i in range(len(data['Residence_type'])):
    Residence_type_labels.append(Residence_type_dict[np.array(data['Residence_type'])[i]])
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
result_UMAP['Residence_type']=np.array(Residence_type_labels)
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='Residence_type', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Bej, S., Sarkar, J., Biswas, S. et al. Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach. Nutr. Diabetes 12, 27 (2022). https://doi.org/10.1038/s41387-022-00206-2
from scipy.spatial import distance
array=np.arange(11)-5
distance_euc=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.euclidean(array[i],array[j])
        d.append(x)
    distance_euc.append(d)
distance_euc=np.array(distance_euc)/np.max(distance_euc)
#distance_euc=np.divide(distance_euc.T,np.max(distance_euc,axis=1)).T
distance_can=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.canberra(array[i],array[j])
        d.append(x)
    distance_can.append(d)
distance_can=np.array(distance_can)/np.max(distance_can)
#distance_can=np.divide(distance_can.T,np.max(distance_can,axis=1)).T
distance_canx=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.canberra(1,np.abs(array[i]-array[j])+1)
        d.append(np.sqrt(x))
    distance_canx.append(d)
distance_canx=np.array(distance_canx)/np.max(distance_canx)
distance_ham=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.hamming(array[i],array[j])
        d.append(x)
    distance_ham.append(d)
distance_ham=np.array(distance_ham)/np.max(distance_ham)
#distance_ham=np.divide(distance_ham.T,np.max(distance_ham,axis=1)).T
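The nested Python loops used for all four matrices can also be replaced by scipy's `cdist`, which computes the full pairwise matrix in one vectorized call; a sketch for the Euclidean case:

```python
import numpy as np
from scipy.spatial.distance import cdist

array = np.arange(11) - 5
pts = array.reshape(-1, 1)                  # cdist expects 2-D input (n_points, n_dims)

dist = cdist(pts, pts, metric="euclidean")  # full 11x11 pairwise distance matrix
dist = dist / dist.max()                    # same global normalization as above
print(dist.shape)
```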
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_euc, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Euclidean', fontsize=20)
plt.show()
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_ham, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Hamming',fontsize=20)
plt.show()
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_can, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra',fontsize=20)
plt.show()
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_canx, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra modified',fontsize=20)
plt.show()
from fdc.fdc import canberra_modified, FDC, Clustering
fdc = FDC(clustering_cont=Clustering('euclidean'),
          clustering_ord=Clustering(canberra_modified),
          clustering_nom=Clustering('hamming', max_components=1),
          visual=False,
          use_pandas_output=True,
          with_2d_embedding=True)
fdc.selectFeatures(continueous=cont_features, nomial=nom_features, ordinal=ord_features)
entire_data_FDC_emb_five, entire_data_FDC_emb_two = fdc.normalize((data-data.mean())/data.std())
entire_data_FDC_emb_five.shape
(5110, 5)
entire_data_FDC_emb_two.shape
(5110, 2)
dbscan = DBSCAN(eps=1, min_samples = 10)
clusters_dbscan_FDC=dbscan.fit_predict(entire_data_FDC_emb_five)
(values_dbscan_FDC,counts_dbscan_FDC) = np.unique(clusters_dbscan_FDC,return_counts=True)
for i in range(len(values_dbscan_FDC)):
    print('Cluster '+str(values_dbscan_FDC[i]+1)+' counts: '+str(counts_dbscan_FDC[i]))
Cluster 0 counts: 32
Cluster 1 counts: 1414
Cluster 2 counts: 1732
Cluster 3 counts: 153
Cluster 4 counts: 58
Cluster 5 counts: 854
Cluster 6 counts: 787
Cluster 7 counts: 80
agg = AgglomerativeClustering(n_clusters=8,affinity="euclidean",linkage="ward")
clusters_agg_FDC=agg.fit_predict(entire_data_FDC_emb_five)
(values_agg_FDC,counts_agg_FDC) = np.unique(clusters_agg_FDC,return_counts=True)
for i in range(len(values_agg_FDC)):
    print('Cluster '+str(values_agg_FDC[i])+' counts: '+str(counts_agg_FDC[i]))
Cluster 0 counts: 582
Cluster 1 counts: 443
Cluster 2 counts: 637
Cluster 3 counts: 886
Cluster 4 counts: 881
Cluster 5 counts: 577
Cluster 6 counts: 664
Cluster 7 counts: 440
entire_data_FDC_emb_two['DBSCAN_clusters']=clusters_dbscan_FDC+1
entire_data_FDC_emb_two['Agglomerative_clusters']=clusters_agg_FDC+1
entire_data_FDC_emb_two['stroke']=np.array(data['stroke'])
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise", "lightgreen", 'burlywood','lightsteelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
xName = "UMAP_0" # "FDC_1"
yName = "UMAP_1" # "FDC_2"
sns.lmplot( x=xName, y=yName,
data=entire_data_FDC_emb_two,
fit_reg=False,
legend=True,
hue='DBSCAN_clusters', # color by cluster
scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise", "lightgreen", 'burlywood','lightsteelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x=xName, y=yName,
data=entire_data_FDC_emb_two,
fit_reg=False,
legend=True,
hue='Agglomerative_clusters', # color by cluster
scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
sns.lmplot( x=xName, y=yName,
data=entire_data_FDC_emb_two,
fit_reg=False,
legend=True,
hue='stroke', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
data['Clusters']=np.array(clusters_dbscan_FDC)
cluster_df_list=[]
for cluster in values_dbscan_FDC:
    cluster_df=data.loc[data['Clusters'] == cluster].drop(columns=['Clusters'])
    cluster_df.columns=list(data.columns)[:-1]
    cluster_df_list.append(cluster_df)
cluster_df_list=cluster_df_list[1:]  # drop the DBSCAN noise cluster (label -1)
data=data.drop(['Clusters'],axis=1)
import matplotlib as mpl
from scipy import stats
def feature_p_val(feature,cluster_df_list,var_type):
    %config InlineBackend.figure_format = 'svg'
    num_clusters=len(cluster_df_list)
    p_list_all_clusters=[]
    for i in range(num_clusters):
        p_list=[]
        for j in range(num_clusters):
            p=p_val(cluster_df_list[i],cluster_df_list[j],feature,var_type)
            p_list.append(p)
        p_list=np.array(p_list)
        p_list_all_clusters.append(p_list)
    return(np.array(p_list_all_clusters))
def p_val(clustera,clusterb,feature,var_type):
    if var_type=='cont':
        p=stats.ttest_ind(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
    else:
        p=ranksums(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
    return p
class MidpointNormalize(mpl.colors.Normalize):
    def __init__(self, vmin, vmax, midpoint=0, clip=False):
        self.midpoint = midpoint
        mpl.colors.Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        normalized_min = max(0, 1 / 2 * (1 - abs((self.midpoint - self.vmin) / (self.midpoint - self.vmax))))
        normalized_max = min(1, 1 / 2 * (1 + abs((self.vmax - self.midpoint) / (self.midpoint - self.vmin))))
        normalized_mid = 0.5
        x, y = [self.vmin, self.midpoint, self.vmax], [normalized_min, normalized_mid, normalized_max]
        return np.ma.masked_array(np.interp(value, x, y))
def p_map(feature,var_type):
    heatmap, ax = plt.subplots(figsize=(8, 8), dpi=600)
    norm = MidpointNormalize(vmin=0, vmax=1, midpoint=0.5)
    p_matrix = feature_p_val(feature,cluster_df_list,var_type)  # compute once, reuse below
    im = ax.imshow(p_matrix, cmap='coolwarm', norm=norm)
    ax.set_xticklabels(['','C1','C2','C3','C4','C5','C6','C7','C8','C9'])
    ax.set_yticklabels(['','C1','C2','C3','C4','C5','C6','C7','C8','C9'])
    for y in range(len(cluster_df_list)):
        for x in range(len(cluster_df_list)):
            plt.text(x, y, '%.2f' % p_matrix[y, x], horizontalalignment='center', verticalalignment='center', fontsize=8)
    cbar = heatmap.colorbar(im)
    cbar.ax.set_ylabel('p-value')
    plt.title(feature.upper(), fontsize=16)
    print('\n')
    plt.show()
    return
for feature in cont_features:
    #print(feature.upper())
    p_map(feature,'cont')
from scipy.stats import ranksums
for feature in ord_features+nom_features:
    #print(feature.upper())
    p_map(feature,'ord')
def vizx(feature_list, cluster_df_list, main_data, umap_data, cont_features):
    %config InlineBackend.figure_format = 'svg'
    vizlimit=15
    plt.rcParams["figure.figsize"] = (12,6)
    for feature in feature_list:
        print('Feature name:', feature.upper())
        print('\n')
        if len(main_data[feature].value_counts())<=vizlimit:
            cluster_counter=1
            for cluster in range(len(cluster_df_list)):
                print('Cluster '+str(cluster_counter)+' frequency distribution')
                if feature in list(rev_dict.keys()):
                    feat_keys=rev_dict[feature]
                    r=dict(zip(feat_keys.values(), feat_keys.keys()))
                    print(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
                else:
                    print(cluster_df_list[cluster][feature].value_counts())
                cluster_counter=cluster_counter+1
                print('\n')
            print('\n')
            print('\n')
        col=sns.color_palette("Set2")
        cluster_bar=[]
        for cluster in range(len(cluster_df_list)):
            if len(main_data[feature].value_counts())<=vizlimit:
                if feature in list(rev_dict.keys()):
                    y=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
                    x=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts().index)
                    cluster_bar.append([x,y])
                else:
                    y=np.array(cluster_df_list[cluster][feature].value_counts().sort_index())
                    x=np.array(cluster_df_list[cluster][feature].value_counts().sort_index().index)
                    cluster_bar.append([x,y])
        cluster_bar=np.array(cluster_bar)
        rows=3
        columns=3
        if len(main_data[feature].value_counts())<=vizlimit:
            figx, ax = plt.subplots(rows, columns)
            figx.set_size_inches(10.5, 28.5)
            cluster_in_subplot_axis_dict=np.array([[0,0],[0,1],[0,2],[1,0],[1,1],[1,2],[2,0],[1,1],[2,2]])
            c=0
            for i in range(rows):
                for j in range(columns):
                    if c>=len(cluster_df_list):
                        break
                    ax[i,j].bar(cluster_bar[c,0],cluster_bar[c,1],color=col)
                    ax[i,j].tick_params(axis='x', which='major', labelsize=8, rotation=90)
                    ax[i,j].set_title('Cluster: '+str(c+1))
                    c=c+1
        means=[]
        sds=[]
        cluster_labels=[]
        cluster_counter=1
        for cluster in range(len(cluster_df_list)):
            if feature in cont_features:
                print('Cluster '+str(cluster_counter)+' summary statistics')
                print('\n')
                cm=cluster_df_list[cluster][feature].mean()
                cs=cluster_df_list[cluster][feature].std()
                print('feature mean:', cm)
                print('feature standard deviation:', cs)
                print('feature median:', cluster_df_list[cluster][feature].median())
                print('\n')
                means.append(cm)
                sds.append(cs)
            cluster_labels.append('C'+str(cluster_counter))
            cluster_counter=cluster_counter+1
        means=np.array(means)
        sds=np.array(sds)
        cluster_labels=np.array(cluster_labels)
        print('\n')
        print('Distribution of feature across clusters')
        if feature in cont_features:
            fig, ax7 = plt.subplots()
            ax7.bar(cluster_labels,means,yerr=sds,color=sns.color_palette("Set3"))
            ax7.tick_params(axis='both', which='major', labelsize=10)
            plt.xlabel(feature, fontsize=15)
            plt.show()
            print('\n')
            print('\n')
        colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise", "lightgreen", 'burlywood','lightsteelblue']
        customPalette_set = sns.color_palette(colors_set)  # sns.set_palette() returns None, so keep the palette object
        if feature not in cont_features:
            print('Feature distribution in UMAP embedding')
            if feature in list(rev_dict.keys()):
                umap_data[feature]=np.array(main_data.replace({feature:r})[feature])
            else:
                umap_data[feature]=np.array(main_data[feature])
            sns.lmplot(x=xName, y=yName,
                       data=umap_data,
                       fit_reg=False,
                       legend=True,
                       hue=feature,               # color by category
                       scatter_kws={"s": 20}, palette=customPalette_set)
            plt.show()
            #plt.savefig('clusters_umap_fin.png', dpi=700, bbox_inches='tight')
            print('\n')
            print('\n')
vizx(['age','Residence_type','smoking_status']
, cluster_df_list
, data
, entire_data_FDC_emb_two
, cont_features
)
Feature name: AGE

Cluster 1 summary statistics
feature mean: 27.04998585572844
feature standard deviation: 23.264206803328854
feature median: 19.0

Cluster 2 summary statistics
feature mean: 45.47863741339492
feature standard deviation: 19.681933722805734
feature median: 45.0

Cluster 3 summary statistics
feature mean: 60.95424836601307
feature standard deviation: 14.745356034894618
feature median: 62.0

Cluster 4 summary statistics
feature mean: 53.60344827586207
feature standard deviation: 14.44765464505889
feature median: 55.0

Cluster 5 summary statistics
feature mean: 50.2423887587822
feature standard deviation: 17.944777395881175
feature median: 50.0

Cluster 6 summary statistics
feature mean: 52.98856416772554
feature standard deviation: 17.592391408434754
feature median: 54.0

Cluster 7 summary statistics
feature mean: 66.3125
feature standard deviation: 11.705882703727118
feature median: 69.5

Distribution of feature across clusters
Feature name: RESIDENCE_TYPE

Cluster 1 frequency distribution
Urban    717
Rural    697
Name: Residence_type, dtype: int64

Cluster 2 frequency distribution
Urban    867
Rural    865
Name: Residence_type, dtype: int64

Cluster 3 frequency distribution
Rural    84
Urban    69
Name: Residence_type, dtype: int64

Cluster 4 frequency distribution
Urban    30
Rural    28
Name: Residence_type, dtype: int64

Cluster 5 frequency distribution
Urban    452
Rural    402
Name: Residence_type, dtype: int64

Cluster 6 frequency distribution
Urban    404
Rural    383
Name: Residence_type, dtype: int64

Cluster 7 frequency distribution
Rural    43
Urban    37
Name: Residence_type, dtype: int64

Distribution of feature across clusters
Feature distribution in UMAP embedding
Feature name: SMOKING_STATUS

Cluster 1 frequency distribution
Unknown    1414
Name: smoking_status, dtype: int64

Cluster 2 frequency distribution
never smoked       1603
formerly smoked      55
Unknown              38
smokes               36
Name: smoking_status, dtype: int64

Cluster 3 frequency distribution
never smoked       135
formerly smoked     11
Unknown              4
smokes               3
Name: smoking_status, dtype: int64

Cluster 4 frequency distribution
smokes    58
Name: smoking_status, dtype: int64

Cluster 5 frequency distribution
smokes             674
never smoked        76
formerly smoked     65
Unknown             39
Name: smoking_status, dtype: int64

Cluster 6 frequency distribution
formerly smoked    750
Unknown             37
Name: smoking_status, dtype: int64

Cluster 7 frequency distribution
never smoked       73
formerly smoked     4
smokes              3
Name: smoking_status, dtype: int64

Distribution of feature across clusters
Feature distribution in UMAP embedding