Stroke prediction dataset (Source: Kaggle)
Goal: predicting the occurrence of a stroke (the target/ground truth) from several features such as gender, age, …, smoking status
The features are both numerical (e.g. age) and non-numerical (e.g. work_type)
For a machine-learning model, non-numerical data has to be converted to numerical form
Even among numerical data there are two data types: discrete (integer-valued, e.g. hypertension) and continuous (e.g. avg_glucose_level)
%config InlineBackend.figure_format = 'svg'
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
Import the data (pandas can read '.csv', '.xlsx', or '.txt' formats)
filename='healthcare-dataset-stroke-data.csv'
data=pd.read_csv(filename,index_col=0)
np.random.seed(42)
data=data.sample(frac=1)
data=data[:500]
We keep only the first 500 rows of the shuffled dataset so that the tutorial runs quickly
Look at the first 10 rows (data points) in the table
data.head(10)
| id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 40041 | Male | 31.00 | 0 | 0 | No | Self-employed | Rural | 64.85 | 23.0 | Unknown | 0 |
| 55244 | Male | 40.00 | 0 | 0 | Yes | Self-employed | Rural | 65.29 | 28.3 | never smoked | 0 |
| 70992 | Female | 8.00 | 0 | 0 | No | children | Urban | 74.42 | 22.5 | Unknown | 0 |
| 38207 | Female | 79.00 | 1 | 0 | Yes | Self-employed | Rural | 76.64 | 19.5 | never smoked | 0 |
| 8541 | Female | 75.00 | 0 | 0 | Yes | Govt_job | Rural | 94.77 | 27.2 | never smoked | 0 |
| 2467 | Female | 79.00 | 1 | 0 | Yes | Self-employed | Rural | 92.43 | NaN | never smoked | 0 |
| 19828 | Female | 56.00 | 1 | 0 | Yes | Private | Rural | 97.37 | 34.1 | smokes | 0 |
| 621 | Male | 69.00 | 0 | 0 | Yes | Private | Rural | 101.52 | 26.8 | smokes | 0 |
| 54975 | Male | 7.00 | 0 | 0 | No | Self-employed | Rural | 64.06 | 18.9 | Unknown | 0 |
| 57485 | Female | 1.48 | 0 | 0 | No | children | Rural | 55.51 | 18.5 | Unknown | 0 |
Look at the data dimensions
data.shape
(500, 11)
Look at column names in the data
data.columns
Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
'smoking_status', 'stroke'],
dtype='object')
Look at the distribution of missing values over the data columns. We will address the missing data in a 'preprocessing' step discussed later
data.isna().sum()
gender                0
age                   0
hypertension          0
heart_disease         0
ever_married          0
work_type             0
Residence_type        0
avg_glucose_level     0
bmi                  23
smoking_status        0
stroke                0
dtype: int64
ML algorithms require numerical data. This is why we check whether there is any non-numerical data, so that we can convert it to numerical form before applying ML.
data.dtypes
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object
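The `object`-typed columns above are the ones that still need encoding. As a side note (not part of the original notebook), pandas can list them directly with `select_dtypes`; the toy frame below only mimics the dtypes of the real data:

```python
import pandas as pd

# Toy frame mimicking the dtypes above (hypothetical values).
df = pd.DataFrame({
    "gender": ["Male", "Female"],   # object -> needs encoding
    "age": [31.0, 40.0],            # float64 -> already numeric
    "hypertension": [0, 1],         # int64 -> already numeric
})

# Columns that still need non-numeric-to-numeric conversion:
non_numeric = df.select_dtypes(include="object").columns.tolist()
print(non_numeric)  # ['gender']
```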
Approaches to make raw real-world data suitable for machine learning:
data['gender'].value_counts()
Female    293
Male      206
Other       1
Name: gender, dtype: int64
data['gender'][:10]
id
40041      Male
55244      Male
70992    Female
38207    Female
8541     Female
2467     Female
19828    Female
621        Male
54975      Male
57485    Female
Name: gender, dtype: object
cleanup_nums = {'gender': {"Female": 1, "Male": 0, "Other": -1}}
data.replace(cleanup_nums, inplace=True)
data['gender'][:10]
id
40041    0
55244    0
70992    1
38207    1
8541     1
2467     1
19828    1
621      0
54975    0
57485    1
Name: gender, dtype: int64
data['ever_married'].value_counts()
Yes    333
No     167
Name: ever_married, dtype: int64
cleanup_nums = {'ever_married': {"Yes": 1, "No": 0}}
data.replace(cleanup_nums, inplace=True)
data['work_type'].value_counts()
Private          296
Self-employed     81
children          68
Govt_job          55
Name: work_type, dtype: int64
cleanup_nums = {'work_type': {"Private": 0, "Self-employed": 1, 'children':2, 'Govt_job':3, 'Never_worked':-1}}
data.replace(cleanup_nums, inplace=True)
data['Residence_type'].value_counts()
Rural    256
Urban    244
Name: Residence_type, dtype: int64
cleanup_nums = {'Residence_type': {"Urban": 0, "Rural": 1}}
data.replace(cleanup_nums, inplace=True)
data['smoking_status'].value_counts()
never smoked       184
Unknown            139
smokes              93
formerly smoked     84
Name: smoking_status, dtype: int64
cleanup_nums = {'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes':2}}
data.replace(cleanup_nums, inplace=True)
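One caveat worth noting: mapping nominal categories such as work_type to integers imposes an artificial ordering (Private < Self-employed < children < …) that distance-based methods will treat as meaningful. An alternative (not used in this notebook) is one-hot encoding, sketched here on a hypothetical slice of the column:

```python
import pandas as pd

# Hypothetical slice of the work_type column.
df = pd.DataFrame({"work_type": ["Private", "children", "Govt_job", "Private"]})

# One-hot encoding: one binary column per category, no implied ordering.
onehot = pd.get_dummies(df, columns=["work_type"], prefix="work")
print(sorted(onehot.columns))  # ['work_Govt_job', 'work_Private', 'work_children']
```

The trade-off is a wider feature matrix, which is why compact integer codes are often preferred in exploratory tutorials like this one.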
It is good practice to record the non-numeric-to-numeric preprocessing transformation in the form of a dictionary. We can use this later during data visualization.
rev_dict={'gender': {"Female": 1, "Male": 0, "Other": -1},
          'ever_married': {"Yes": 1, "No": 0},
          'work_type': {"Private": 0, "Self-employed": 1, 'children': 2, 'Govt_job': 3, 'Never_worked': -1},
          'Residence_type': {"Urban": 0, "Rural": 1},
          'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes': 2},
          'hypertension': {"Yes": 1, "No": 0},
          'heart_disease': {"Yes": 1, "No": 0},
          'stroke': {"Yes": 1, "No": 0}}
Finally we can recheck whether there is any non-numeric feature left in the data:
data.dtypes
gender                 int64
age                  float64
hypertension           int64
heart_disease          int64
ever_married           int64
work_type              int64
Residence_type         int64
avg_glucose_level    float64
bmi                  float64
smoking_status         int64
stroke                 int64
dtype: object
There are various techniques to impute missing values: for example, filling in the column mean or mode, or interpolating among the closest available data points. Due to time constraints we do not explore the details in this tutorial.
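For reference, scikit-learn ships ready-made imputers that cover the standard strategies; a minimal mean-imputation sketch on hypothetical bmi values (`KNNImputer` would instead average the nearest neighbours, similar in spirit to the nearest-neighbour approach used in this notebook):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical bmi column with one missing entry.
bmi = np.array([[23.0], [28.3], [np.nan], [19.5]])

# Column-mean imputation: the NaN becomes the mean of the other values (23.6).
imp = SimpleImputer(strategy="mean")
filled = imp.fit_transform(bmi)
print(filled.ravel())
```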
dense_data_pool=list(data.isna().sum().index[data.isna().sum()<4])  # columns with fewer than 4 missing values
dense_data=data[dense_data_pool]
dense_data=dense_data.dropna()  # keep only rows that are complete in the dense columns
data=data.loc[np.array(dense_data.index)]  # restrict the full data to the same rows
from scipy.spatial import distance
# For every data point, rank all points by Euclidean distance in the dense columns.
# Row i of distance_matrix lists the data indices ordered from nearest to farthest,
# so column 0 is the point itself (distance 0).
distance_matrix=[]
for i in range(len(np.array(dense_data))):
    dist=[]
    for j in range(len(np.array(dense_data))):
        d=distance.euclidean(np.array(dense_data)[i],np.array(dense_data)[j])
        dist.append(d)
    neb_list=np.array(dense_data.index)[np.argsort(dist)]
    distance_matrix.append(neb_list)
distance_matrix=np.array(distance_matrix)
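The double Python loop above scales as O(n²) interpreted calls. As a side note (not part of the original notebook), `scipy.spatial.distance.cdist` computes the same pairwise matrix in one vectorized call; a sketch on random stand-in data:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # hypothetical stand-in for dense_data

D = cdist(X, X, metric="euclidean")      # full pairwise distance matrix
neighbour_order = np.argsort(D, axis=1)  # row-wise nearest-first ordering

# Each point is its own nearest neighbour (distance 0 on the diagonal).
print(neighbour_order[:, 0])  # [0 1 2 3 4]
```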
missing_value_list = [x for x in list(data.columns) if x not in dense_data_pool]
total_impute_master=[]
for f in range(len(missing_value_list)):
    missing_value_indices=data[data[missing_value_list[f]].isnull()].index.tolist()
    feature_impute_master=[]
    for index in missing_value_indices:
        index_in_dist_mat=np.where(distance_matrix[:,0]==index)[0][0]
        value_list=[]
        neb_index_counter=1
        while len(value_list)<6:  # collect the feature value from the six nearest neighbours
            neb_index=distance_matrix[index_in_dist_mat][neb_index_counter]
            neb_index_counter=neb_index_counter+1
            impute_value=data.loc[[neb_index]][missing_value_list[f]]
            if float(impute_value)!=float(impute_value):  # NaN check: NaN is the only value not equal to itself
                pass
            else:
                value_list.append(float(impute_value))
        feature_impute_master.append(np.array(value_list))
    total_impute_master.append(np.array(feature_impute_master))
total_impute_master=np.array(total_impute_master)
total_impute=[]
for i in range(len(total_impute_master)):
    feature_impute=[]
    for j in range(len(total_impute_master[i])):
        intcounter=0
        for k in range(len(total_impute_master[i][j])):
            if total_impute_master[i][j][k]-int(total_impute_master[i][j][k])==0:
                intcounter=intcounter+1
        if intcounter==len(total_impute_master[i][j]):
            # all neighbour values are integers (categorical feature): impute the mode
            imputed_value=np.bincount(total_impute_master[i][j].astype(int)).argmax()
        else:
            # otherwise (continuous feature): impute the mean of the neighbour values
            imputed_value=np.mean(total_impute_master[i][j])
        feature_impute.append(np.array(imputed_value))
    total_impute.append(np.array(feature_impute))
for f in range(len(missing_value_list)):
missing_value_indices=data[data[missing_value_list[f]].isnull()].index.tolist()
for i in range(len(missing_value_indices)):
data.at[missing_value_indices[i], missing_value_list[f]]=total_impute[f][i]
We can finally recheck if there is any missing value left in the data
data.isna().sum()
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
It is important to quantify the information content of the features and to remove features with low information content.
We now calculate the information content of categorical and continuous features separately.
nom_features=['gender','ever_married','work_type','Residence_type']
ord_features=['hypertension','heart_disease','stroke','smoking_status']
cont_features=['age','avg_glucose_level', 'bmi']
For categorical features we use the Shannon entropy.
from scipy.stats import entropy
cat_feature_entropies=[]
for feature in nom_features+ord_features:
    # Shannon entropy (base 2) of the category counts, normalized by the number of categories minus one
    x=entropy(data[feature].value_counts(), base=2)/(len(data[feature].value_counts().index)-1)
    cat_feature_entropies.append(x)
    print(feature+' '+str(x))
cat_feature_entropies=np.array(cat_feature_entropies)
gender 0.49840927538558183
ever_married 0.9189610585062931
work_type 0.5382958922815603
Residence_type 0.9995844639313985
hypertension 0.462623491762443
heart_disease 0.34312290713202037
stroke 0.31135737042928313
smoking_status 0.6426183335743486
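As a sanity check on the numbers above (toy counts, not from the dataset): a perfectly balanced binary feature carries the maximal 1 bit of Shannon entropy, while a heavily skewed one carries close to 0 bits, which is why near-constant features are candidates for removal:

```python
from scipy.stats import entropy

# entropy() normalizes raw counts to probabilities internally.
balanced = entropy([250, 250], base=2)  # 50/50 split -> 1.0 bit
skewed = entropy([495, 5], base=2)      # 99/1 split -> ~0.08 bits
print(balanced, round(skewed, 3))
```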
For continuous data we first standardize the data (each feature is transformed to zero mean and unit variance). Then we calculate the range (the absolute difference between maximum and minimum) for every feature.
cont_feature_spread=[]
for feature in cont_features:
    standardized=(data[feature]-data[feature].mean())/data[feature].std()  # zero mean, unit variance
    x=abs(standardized.max()-standardized.min())  # range of the standardized feature
    cont_feature_spread.append(x)
    print(feature+' '+str(x))
cont_feature_spread=np.array(cont_feature_spread)
age 3.661163007128385
avg_glucose_level 4.7189204035826195
bmi 7.815372101445886
Finally we recheck the preprocessed data
data.head(10)
| id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 40041 | 0 | 31.00 | 0 | 0 | 0 | 1 | 1 | 64.85 | 23.000000 | -1 | 0 |
| 55244 | 0 | 40.00 | 0 | 0 | 1 | 1 | 1 | 65.29 | 28.300000 | 0 | 0 |
| 70992 | 1 | 8.00 | 0 | 0 | 0 | 2 | 0 | 74.42 | 22.500000 | -1 | 0 |
| 38207 | 1 | 79.00 | 1 | 0 | 1 | 1 | 1 | 76.64 | 19.500000 | 0 | 0 |
| 8541 | 1 | 75.00 | 0 | 0 | 1 | 3 | 1 | 94.77 | 27.200000 | 0 | 0 |
| 2467 | 1 | 79.00 | 1 | 0 | 1 | 1 | 1 | 92.43 | 28.083333 | 0 | 0 |
| 19828 | 1 | 56.00 | 1 | 0 | 1 | 0 | 1 | 97.37 | 34.100000 | 2 | 0 |
| 621 | 0 | 69.00 | 0 | 0 | 1 | 0 | 1 | 101.52 | 26.800000 | 2 | 0 |
| 54975 | 0 | 7.00 | 0 | 0 | 0 | 1 | 1 | 64.06 | 18.900000 | -1 | 0 |
| 57485 | 1 | 1.48 | 0 | 0 | 0 | 2 | 1 | 55.51 | 18.500000 | -1 | 0 |
data.shape
(500, 11)
Parameter settings: eps=1.5, min_samples=20
There are four steps to using an ML model in scikit-learn: import the model class, instantiate it with parameter settings, fit it to the data, and use it for prediction
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples = 20)
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)
Now we look at the distribution of clusters obtained by DBSCAN under the specific parameter settings
for i in range(len(values_dbscan)):
print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 329
Cluster 1 counts: 34
Cluster 2 counts: 53
Cluster 3 counts: 37
Cluster 4 counts: 47
Note that if we change the parameter settings the model prediction changes
Parameter settings: eps=2, min_samples=10
dbscan = DBSCAN(eps=2, min_samples = 10)
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)
for i in range(len(values_dbscan)):
print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 133
Cluster 1 counts: 23
Cluster 2 counts: 85
Cluster 3 counts: 94
Cluster 4 counts: 67
Cluster 5 counts: 33
Cluster 6 counts: 65
Parameter settings: n_clusters=5, affinity="euclidean", linkage="ward"
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=5,affinity="euclidean",linkage="ward")
clusters_agg=agg.fit_predict((data-data.mean())/data.std())
(values_agg,counts_agg) = np.unique(clusters_agg,return_counts=True)
for i in range(len(values_agg)):
print('Cluster '+str(values_agg[i]+1)+' counts: '+str(counts_agg[i]))
Cluster 1 counts: 256
Cluster 2 counts: 153
Cluster 3 counts: 28
Cluster 4 counts: 26
Cluster 5 counts: 37
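DBSCAN and agglomerative clustering produce different partitions of the same data. A quantitative way to compare such partitions (not used in the original analysis) is the silhouette score, which is close to 1 when points sit tightly inside well-separated clusters; a sketch on hypothetical two-blob data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Toy 2-D data with two clearly separated groups (not the stroke data).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
score = silhouette_score(X, labels)  # near 1 for well-separated clusters
print(round(score, 3))
```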
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
low_dim_PCA=pca.fit_transform((data-data.mean())/data.std())
result_PCA = pd.DataFrame(data = low_dim_PCA ,
columns = ['PCA_1', 'PCA_2'])
result_PCA['DBSCAN_clusters']=clusters_dbscan+1
result_PCA['Agglomerative_clusters']=clusters_agg+1
result_PCA['stroke']=np.array(data['stroke'])
result_PCA.head(10)
| | PCA_1 | PCA_2 | DBSCAN_clusters | Agglomerative_clusters | stroke |
|---|---|---|---|---|---|
| 0 | -1.967599 | 0.224082 | 1 | 2 | 0 |
| 1 | -0.184169 | -0.562288 | 2 | 1 | 0 |
| 2 | -2.879908 | 0.704221 | 3 | 2 | 0 |
| 3 | 0.916206 | -0.483744 | 0 | 5 | 0 |
| 4 | 0.053280 | -0.214408 | 4 | 1 | 0 |
| 5 | 1.418248 | -0.577006 | 0 | 5 | 0 |
| 6 | 2.038984 | -1.518028 | 0 | 5 | 0 |
| 7 | 1.444600 | -1.048502 | 2 | 1 | 0 |
| 8 | -2.727780 | 0.335220 | 1 | 2 | 0 |
| 9 | -3.343449 | 0.073061 | 3 | 2 | 0 |
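Before trusting a 2-D PCA scatter, it is worth checking how much variance the two components actually retain; `explained_variance_ratio_` reports this (a check not shown in the original notebook), sketched here on hypothetical correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: two strongly correlated features plus independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

pca = PCA(n_components=2).fit(X)
# Fraction of total variance each component captures; a low sum warns
# that the 2-D scatter hides much of the structure.
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())
```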
import seaborn as sns
from matplotlib import pyplot as plt
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="PCA_1", y="PCA_2",
data=result_PCA,
fit_reg=False,
legend=True,
hue='DBSCAN_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="PCA_1", y="PCA_2",
data=result_PCA,
fit_reg=False,
legend=True,
hue='Agglomerative_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="PCA_1", y="PCA_2",
data=result_PCA,
fit_reg=False,
legend=True,
hue='stroke', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
import umap.umap_ as umap
low_dim_UMAP=umap.UMAP(n_neighbors=5, min_dist=0.1, n_components=2, random_state=42).fit_transform((data-data.mean())/data.std())
result_UMAP = pd.DataFrame(data = low_dim_UMAP ,
columns = ['UMAP_1', 'UMAP_2'])
result_UMAP['DBSCAN_clusters']=clusters_dbscan+1
result_UMAP['Agglomerative_clusters']=clusters_agg+1
result_UMAP['stroke']=np.array(data['stroke'])
result_UMAP.head(10)
| UMAP_1 | UMAP_2 | DBSCAN_clusters | Agglomerative_clusters | stroke | |
|---|---|---|---|---|---|
| 0 | 5.640380 | -2.627495 | 1 | 2 | 0 |
| 1 | 0.234862 | 4.020671 | 2 | 1 | 0 |
| 2 | 14.669820 | 18.462421 | 3 | 2 | 0 |
| 3 | 2.900954 | 14.992955 | 0 | 5 | 0 |
| 4 | 19.073860 | 7.344819 | 4 | 1 | 0 |
| 5 | 2.926433 | 14.942604 | 0 | 5 | 0 |
| 6 | 3.173262 | 14.292583 | 0 | 5 | 0 |
| 7 | -0.270748 | 4.891105 | 2 | 1 | 0 |
| 8 | 5.536045 | -2.657136 | 1 | 2 | 0 |
| 9 | 18.841158 | -3.674576 | 3 | 2 | 0 |
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='DBSCAN_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='Agglomerative_clusters', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='stroke', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
smoking_status_dict=rev_dict['smoking_status']
smoking_status_dict=dict((v,k) for k,v in smoking_status_dict.items())
smoking_status_labels=[]
for i in range(len(data['smoking_status'])):
smoking_status_labels.append(smoking_status_dict[np.array(data['smoking_status'])[i]])
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
result_UMAP['smoking_status']=np.array(smoking_status_labels)
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='smoking_status', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Residence_type_dict=rev_dict['Residence_type']
Residence_type_dict=dict((v,k) for k,v in Residence_type_dict.items())
Residence_type_labels=[]
for i in range(len(data['Residence_type'])):
Residence_type_labels.append(Residence_type_dict[np.array(data['Residence_type'])[i]])
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
result_UMAP['Residence_type']=np.array(Residence_type_labels)
sns.lmplot( x="UMAP_1", y="UMAP_2",
data=result_UMAP,
fit_reg=False,
legend=True,
hue='Residence_type', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Bej, S., Sarkar, J., Biswas, S. et al. Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach. Nutr. Diabetes 12, 27 (2022). https://doi.org/10.1038/s41387-022-00206-2
array=np.arange(11)-5
distance_euc=[]
for i in range(len(array)):
d=[]
for j in range(len(array)):
x=distance.euclidean(array[i],array[j])
d.append(x)
distance_euc.append(d)
distance_euc=np.array(distance_euc)/np.max(distance_euc)
#distance_euc=np.divide(distance_euc.T,np.max(distance_euc,axis=1)).T
distance_can=[]
for i in range(len(array)):
d=[]
for j in range(len(array)):
x=distance.canberra(array[i],array[j])
d.append(x)
distance_can.append(d)
distance_can=np.array(distance_can)/np.max(distance_can)
#distance_can=np.divide(distance_can.T,np.max(distance_can,axis=1)).T
distance_canx=[]
for i in range(len(array)):
d=[]
for j in range(len(array)):
x=distance.canberra(1,np.abs(array[i]-array[j])+1)
d.append(np.sqrt(x))
distance_canx.append(d)
distance_canx=np.array(distance_canx)/np.max(distance_canx)
distance_ham=[]
for i in range(len(array)):
d=[]
for j in range(len(array)):
x=distance.hamming(array[i],array[j])
d.append(x)
distance_ham.append(d)
distance_ham=np.array(distance_ham)/np.max(distance_ham)
#distance_ham=np.divide(distance_ham.T,np.max(distance_ham,axis=1)).T
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_euc, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Euclidean', fontsize=20)
plt.show()
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_ham, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Hamming',fontsize=20)
plt.show()
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_can, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra',fontsize=20)
plt.show()
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_canx, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra modified',fontsize=20)
plt.show()
def canberra_modified(a,b):
from scipy.spatial import distance
return np.sqrt(distance.canberra(1,np.abs(a-b)+1))
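A quick spot-check of how this modified Canberra distance behaves: with d = |a-b|, it evaluates to sqrt(d / (d + 2)), so it is 0 for identical values and saturates towards 1 as the gap grows. The sketch below re-derives these values; the scalar arguments are wrapped in one-element lists because recent SciPy versions require 1-D array inputs to the distance functions:

```python
import numpy as np
from scipy.spatial import distance

def canberra_modified_check(a, b):
    # canberra([1], [d+1]) = d / (d + 2) with d = |a - b|
    return np.sqrt(distance.canberra([1.0], [abs(a - b) + 1.0]))

print(canberra_modified_check(3, 3))             # 0.0
print(round(canberra_modified_check(0, 1), 4))   # sqrt(1/3) ~ 0.5774
print(round(canberra_modified_check(0, 100), 4)) # sqrt(100/102), close to 1
```

The saturation is the point: unlike the Euclidean distance, a single ordinal feature with a large gap cannot dominate the overall distance.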
def feature_clustering(UMAP_neb,min_dist_UMAP, metric, data, visual):
import umap.umap_ as umap
np.random.seed(42)
data_embedded = umap.UMAP(n_neighbors=UMAP_neb, min_dist=min_dist_UMAP, n_components=2, metric=metric, random_state=42).fit_transform(data)
data_embedded[:,0]=(data_embedded[:,0]- np.mean(data_embedded[:,0]))/np.std(data_embedded[:,0])
data_embedded[:,1]=(data_embedded[:,1]- np.mean(data_embedded[:,1]))/np.std(data_embedded[:,1])
result = pd.DataFrame(data = data_embedded ,
columns = ['UMAP_0', 'UMAP_1'])
if visual==1:
sns.lmplot( x="UMAP_0", y="UMAP_1",data=result,fit_reg=False,legend=False,scatter_kws={"s": 20},palette=customPalette_set1) # specify the point size
#plt.savefig('clusters_umap_all.png', dpi=700, bbox_inches='tight')
plt.show()
else:
pass
return result
def FDC(data,cont_list,nom_list,ord_list,cont_metric, ord_metric, nom_metric, drop_nominal, visual):
np.random.seed(42)
colors_set1 = ["lightcoral", "lightseagreen", "mediumorchid", "orange", "burlywood", "cornflowerblue", "plum", "yellowgreen"]
customPalette_set1 = sns.set_palette(sns.color_palette(colors_set1))
cont_df=data[cont_list]
nom_df=data[nom_list]
ord_df=data[ord_list]
    cont_emb=feature_clustering(30,0.1, cont_metric, cont_df, 0)  # reducing continuous features to 2D
    ord_emb=feature_clustering(30,0.1, ord_metric, ord_df, 0)  # reducing ordinal features to 2D
    nom_emb=feature_clustering(30,0.1, nom_metric, nom_df, 0)  # reducing nominal features to 2D
    if drop_nominal==1:
        result_concat=pd.concat([ord_emb, cont_emb, nom_emb.drop(['UMAP_1'],axis=1)],axis=1)  # concatenating all reduced dimensions into a 5D embedding (1D from nominal)
    else:
        result_concat=pd.concat([ord_emb, cont_emb, nom_emb],axis=1)
data_embedded = umap.UMAP(n_neighbors=30, min_dist=0.001, n_components=2, metric='euclidean', random_state=42).fit_transform(result_concat) #reducing 5D embeddings to 2D using UMAP
result_reduced = pd.DataFrame(data = data_embedded ,
columns = ['FDC_1', 'FDC_2'])
if visual==1:
sns.lmplot( x="FDC_1", y="FDC_2",data=result_reduced,fit_reg=False,legend=False,scatter_kws={"s": 20},palette=customPalette_set1) # specify the point size
plt.show()
#plt.savefig('clusters_umap_all.png', dpi=700, bbox_inches='tight')
else:
pass
return result_concat, result_reduced #returns both 5D and 2D embeddings
entire_data_FDC_emb_five,entire_data_FDC_emb_two=FDC((data-data.mean())/data.std(),cont_features,nom_features,ord_features,'euclidean',canberra_modified,'hamming',1,0)
entire_data_FDC_emb_five.shape
(500, 5)
entire_data_FDC_emb_two.shape
(500, 2)
dbscan = DBSCAN(eps=1, min_samples = 10)
clusters_dbscan_FDC=dbscan.fit_predict(entire_data_FDC_emb_five)
(values_dbscan_FDC,counts_dbscan_FDC) = np.unique(clusters_dbscan_FDC,return_counts=True)
for i in range(len(values_dbscan_FDC)):
print('Cluster '+str(values_dbscan_FDC[i]+1)+' counts: '+str(counts_dbscan_FDC[i]))
Cluster 0 counts: 66
Cluster 1 counts: 104
Cluster 2 counts: 135
Cluster 3 counts: 65
Cluster 4 counts: 33
Cluster 5 counts: 57
Cluster 6 counts: 11
Cluster 7 counts: 19
Cluster 8 counts: 10
agg = AgglomerativeClustering(n_clusters=8,affinity="euclidean",linkage="ward")
clusters_agg_FDC=agg.fit_predict(entire_data_FDC_emb_five)
(values_agg_FDC,counts_agg_FDC) = np.unique(clusters_agg_FDC,return_counts=True)
for i in range(len(values_agg_FDC)):
print('Cluster '+str(values_agg_FDC[i])+' counts: '+str(counts_agg_FDC[i]))
Cluster 0 counts: 45
Cluster 1 counts: 67
Cluster 2 counts: 53
Cluster 3 counts: 66
Cluster 4 counts: 75
Cluster 5 counts: 71
Cluster 6 counts: 69
Cluster 7 counts: 54
entire_data_FDC_emb_two['DBSCAN_clusters']=clusters_dbscan_FDC+1
entire_data_FDC_emb_two['Agglomerative_clusters']=clusters_agg_FDC+1
entire_data_FDC_emb_two['stroke']=np.array(data['stroke'])
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise", "lightgreen", 'burlywood','lightsteelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="FDC_1", y="FDC_2",
data=entire_data_FDC_emb_two,
fit_reg=False,
legend=True,
hue='DBSCAN_clusters', # color by cluster
scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise", "lightgreen", 'burlywood','lightsteelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="FDC_1", y="FDC_2",
data=entire_data_FDC_emb_two,
fit_reg=False,
legend=True,
hue='Agglomerative_clusters', # color by cluster
scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
sns.lmplot( x="FDC_1", y="FDC_2",
data=entire_data_FDC_emb_two,
fit_reg=False,
legend=True,
hue='stroke', # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
data['Clusters']=np.array(clusters_dbscan_FDC)
cluster_df_list=[]
for cluster in values_dbscan_FDC:
cluster_df=data.loc[data['Clusters'] == cluster].drop(columns=['Clusters'])
cluster_df.columns=list(data.columns)[:-1]
cluster_df_list.append(cluster_df)
cluster_df_list=cluster_df_list[1:]  # drop the first group, which is the DBSCAN noise cluster (label -1)
data=data.drop(['Clusters'],axis=1)
import matplotlib as mpl
from scipy import stats
def feature_p_val(feature,cluster_df_list,var_type):
%config InlineBackend.figure_format = 'svg'
num_clusters=len(cluster_df_list)
p_list_all_clusters=[]
for i in range(num_clusters):
p_list=[]
for j in range(num_clusters):
p=p_val(cluster_df_list[i],cluster_df_list[j],feature,var_type)
p_list.append(p)
p_list=np.array(p_list)
p_list_all_clusters.append(p_list)
return(np.array(p_list_all_clusters))
def p_val(clustera,clusterb,feature,var_type):
if var_type=='cont':
p=stats.ttest_ind(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
else:
p=ranksums(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
return p
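One caveat worth flagging (not addressed in the original analysis): with pairwise tests over many clusters, the number of comparisons grows quickly, inflating the chance of false positives. A simple, conservative remedy is the Bonferroni correction, sketched here on hypothetical p-values:

```python
import numpy as np

# Hypothetical p-values from pairwise cluster comparisons.
p_values = np.array([0.001, 0.04, 0.20, 0.03])

alpha = 0.05
# Bonferroni: a comparison counts as significant only below alpha / n_tests.
significant = p_values < alpha / len(p_values)
print(significant)  # [ True False False False]
```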
class MidpointNormalize(mpl.colors.Normalize):
def __init__(self, vmin, vmax, midpoint=0, clip=False):
self.midpoint = midpoint
mpl.colors.Normalize.__init__(self, vmin, vmax, clip)
def __call__(self, value, clip=None):
normalized_min = max(0, 1 / 2 * (1 - abs((self.midpoint - self.vmin) / (self.midpoint - self.vmax))))
normalized_max = min(1, 1 / 2 * (1 + abs((self.vmax - self.midpoint) / (self.midpoint - self.vmin))))
normalized_mid = 0.5
x, y = [self.vmin, self.midpoint, self.vmax], [normalized_min, normalized_mid, normalized_max]
return np.ma.masked_array(np.interp(value, x, y))
def p_map(feature,var_type):
    heatmap, ax = plt.subplots(figsize=(8, 8), dpi=600)
    norm = MidpointNormalize(vmin=0, vmax=1, midpoint=0.5)
    pmat = feature_p_val(feature,cluster_df_list,var_type)  # compute the p-value matrix once and reuse it below
    im = ax.imshow(pmat, cmap='coolwarm', norm=norm)
    ax.set_xticklabels(['','C1','C2','C3','C4','C5','C6','C7','C8','C9'])
    ax.set_yticklabels(['','C1','C2','C3','C4','C5','C6','C7','C8','C9'])
    for y in range(len(cluster_df_list)):
        for x in range(len(cluster_df_list)):
            plt.text(x , y , '%.2f' % pmat[y, x], horizontalalignment='center',verticalalignment='center',fontsize=8)
    cbar = heatmap.colorbar(im)
    cbar.ax.set_ylabel('p-value')
    plt.title(feature.upper(), fontsize=16)
    print('\n')
    plt.show()
    return
for feature in cont_features:
#print(feature.upper())
p_map(feature,'cont')
from scipy.stats import ranksums
for feature in ord_features+nom_features:
#print(feature.upper())
p_map(feature,'ord')
def vizx(feature_list, cluster_df_list, main_data,umap_data,cont_features):
%config InlineBackend.figure_format = 'svg'
vizlimit=15
plt.rcParams["figure.figsize"] = (12,6)
for feature in feature_list:
print('Feature name:', feature.upper())
print('\n')
if len(main_data[feature].value_counts())<=vizlimit:
cluster_counter=1
for cluster in range(len(cluster_df_list)):
                print('Cluster '+ str(cluster_counter)+ ' frequency distribution')
if feature in list(rev_dict.keys()):
feat_keys=rev_dict[feature]
r=dict(zip(feat_keys.values(), feat_keys.keys()))
print(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
else:
print(cluster_df_list[cluster][feature].value_counts())
cluster_counter=cluster_counter+1
print('\n')
print('\n')
print('\n')
col=sns.color_palette("Set2")
cluster_bar=[]
for cluster in range(len(cluster_df_list)):
if len(main_data[feature].value_counts())<=vizlimit:
if feature in list(rev_dict.keys()):
y=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
x=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts().index)
cluster_bar.append([x,y])
else:
y=np.array(cluster_df_list[cluster][feature].value_counts().sort_index())
x=np.array(cluster_df_list[cluster][feature].value_counts().sort_index().index)
cluster_bar.append([x,y])
cluster_bar=np.array(cluster_bar)
rows=3
columns=3
if len(main_data[feature].value_counts())<=vizlimit:
figx, ax = plt.subplots(rows, columns)
figx.set_size_inches(10.5, 28.5)
cluster_in_subplot_axis_dict=np.array([[0,0],[0,1],[0,2],[1,0],[1,1],[1,2],[2,0],[1,1],[2,2]])
c=0
for i in range(rows):
for j in range(columns):
if c >= len(cluster_df_list):
break
ax[i,j].bar(cluster_bar[c,0],cluster_bar[c,1],color=col)
ax[i,j].tick_params(axis='x', which='major', labelsize=8, rotation= 90)
ax[i,j].set_title('Cluster: '+str(c+1))
c=c+1
means=[]
sds=[]
cluster_labels=[]
cluster_counter=1
for cluster in range(len(cluster_df_list)):
if feature in cont_features:
print('Cluster '+ str(cluster_counter)+ ' summary statistics')
print('\n')
cm=cluster_df_list[cluster][feature].mean()
cs=cluster_df_list[cluster][feature].std()
print('feature mean:', cm)
print('feature standard deviation:', cs)
print('feature median:', cluster_df_list[cluster][feature].median())
print('\n')
means.append(cm)
sds.append(cs)
cluster_labels.append('C'+str(cluster_counter))
cluster_counter=cluster_counter+1
means=np.array(means)
sds=np.array(sds)
cluster_labels=np.array(cluster_labels)
print('\n')
print('Distribution of feature across clusters')
if feature in cont_features:
fig, ax7 = plt.subplots()
ax7.bar(cluster_labels,means,yerr=sds,color=sns.color_palette("Set3"))
ax7.tick_params(axis='both', which='major', labelsize=10)
plt.xlabel(feature, fontsize=15)
plt.show()
print('\n')
print('\n')
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise", "lightgreen", 'burlywood','lightsteelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
if feature not in cont_features:
print('Feature distribution in UMAP embedding')
if feature in list(rev_dict.keys()):
umap_data[feature]=np.array(main_data.replace({feature:r})[feature])
else:
umap_data[feature]=np.array(main_data[feature])
sns.lmplot( x="FDC_1", y="FDC_2",
data=umap_data,
fit_reg=False,
legend=True,
hue=feature, # color by cluster
scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
#plt.savefig('clusters_umap_fin.png', dpi=700, bbox_inches='tight')
print('\n')
print('\n')
vizx(['age','Residence_type','smoking_status'], cluster_df_list, data, entire_data_FDC_emb_two ,cont_features)
Feature name: AGE

Cluster 1 summary statistics
feature mean: 17.735384615384618
feature standard deviation: 17.664356570634503
feature median: 10.0

Cluster 2 summary statistics
feature mean: 42.4
feature standard deviation: 18.325090865364917
feature median: 40.0

Cluster 3 summary statistics
feature mean: 63.53846153846154
feature standard deviation: 12.43995192298402
feature median: 64.0

Cluster 4 summary statistics
feature mean: 49.878787878787875
feature standard deviation: 17.122276381510968
feature median: 53.0

Cluster 5 summary statistics
feature mean: 40.6140350877193
feature standard deviation: 14.848947035738778
feature median: 43.0

Cluster 6 summary statistics
feature mean: 68.63636363636364
feature standard deviation: 13.574039393435747
feature median: 70.0

Cluster 7 summary statistics
feature mean: 61.94736842105263
feature standard deviation: 8.585994307336456
feature median: 62.0

Cluster 8 summary statistics
feature mean: 66.2
feature standard deviation: 12.934020600296293
feature median: 69.5

Distribution of feature across clusters
Feature name: RESIDENCE_TYPE

Cluster 1 frequency distribution
Rural    55
Urban    49
Name: Residence_type, dtype: int64

Cluster 2 frequency distribution
Rural    77
Urban    58
Name: Residence_type, dtype: int64

Cluster 3 frequency distribution
Urban    36
Rural    29
Name: Residence_type, dtype: int64

Cluster 4 frequency distribution
Rural    33
Name: Residence_type, dtype: int64

Cluster 5 frequency distribution
Rural    30
Urban    27
Name: Residence_type, dtype: int64

Cluster 6 frequency distribution
Rural    11
Name: Residence_type, dtype: int64

Cluster 7 frequency distribution
Urban    19
Name: Residence_type, dtype: int64

Cluster 8 frequency distribution
Urban    10
Name: Residence_type, dtype: int64

Distribution of feature across clusters
Feature distribution in UMAP embedding
Feature name: SMOKING_STATUS

Cluster 1 frequency distribution
Unknown    104
Name: smoking_status, dtype: int64

Cluster 2 frequency distribution
never smoked    135
Name: smoking_status, dtype: int64

Cluster 3 frequency distribution
never smoked       27
smokes             17
formerly smoked    11
Unknown            10
Name: smoking_status, dtype: int64

Cluster 4 frequency distribution
formerly smoked    33
Name: smoking_status, dtype: int64

Cluster 5 frequency distribution
smokes    57
Name: smoking_status, dtype: int64

Cluster 6 frequency distribution
never smoked       6
Unknown            2
formerly smoked    2
smokes             1
Name: smoking_status, dtype: int64

Cluster 7 frequency distribution
formerly smoked    19
Name: smoking_status, dtype: int64

Cluster 8 frequency distribution
formerly smoked    5
never smoked       3
Unknown            1
smokes             1
Name: smoking_status, dtype: int64

Distribution of feature across clusters
Feature distribution in UMAP embedding