image-4.png

Outline of the tutorial ¶


  • Several aspects of tabular data and different data types
  • Converting raw data to refined data that can then be used for Machine Learning
  • A brief intuition behind the underlying philosophy of Machine Learning
  • Quantifying similarity/dissimilarity among data points and its inherent importance
  • Finding similar groups of patients in high-dimensional patient data
  • Visualizing similar groups of patients in lower dimensions
  • An unbiased (in the context of diverse feature types in patient data) patient stratification approach

The general structure of tabular data¶

Stroke prediction dataset (Source: Kaggle)


image-2.png


Goal: Predicting the event of a stroke (target/ground truth) from several features such as gender, age, …, smoking status

  • There are both numerical (age) and non-numerical (work_type) features

  • For a Machine Learning model, non-numerical data has to be converted to numerical form

  • Even among numerical data there are two data types:

    • Continuous (can take any real value in a range), e.g. avg_glucose_level
    • Categorical (takes only discrete values), e.g. heart_disease, ever_married, work_type
    • Further, there are two types of categorical variables:
      • Ordinal (has an intrinsic sense of order associated with it), e.g. heart_disease
      • Nominal (has no intrinsic sense of order associated with it), e.g. ever_married
  • Not all values are available in real-life data – e.g. BMI for patient id: 51676
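The ordinal/nominal distinction matters when encoding: ordinal values can be mapped to ordered integers, while nominal values with more than two levels are often one-hot encoded to avoid imposing a false order. A minimal sketch on a toy DataFrame (the values are illustrative, not taken from the stroke dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "heart_disease": ["No", "Yes", "No"],               # ordinal/binary: order is meaningful
    "work_type": ["Private", "Govt_job", "children"],   # nominal, more than two levels
})

# Ordered integer codes for an ordinal feature
df["heart_disease"] = df["heart_disease"].map({"No": 0, "Yes": 1})

# One-hot encoding avoids imposing a false order on a nominal feature
df = pd.get_dummies(df, columns=["work_type"])
print(df.columns.tolist())
```

Note that this tutorial encodes nominal features with plain integer labels instead, which is simpler but does introduce an artificial order.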

Importing the data and first look¶

Importing a python library¶

In [1]:
%config InlineBackend.figure_format = 'svg'
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
import pandas as pd
import numpy as np

Importing and looking at the data¶

Import the data in '.csv', '.xlsx' or '.txt' format

In [4]:
filename='healthcare-dataset-stroke-data.csv'
data=pd.read_csv(filename,index_col=0)
In [5]:
np.random.seed(42)
data=data.sample(frac=1)
In [6]:
data=data[:500]

We keep only the first 500 rows of the shuffled dataset so that the tutorial runs smoothly

Look at the first 10 rows (data points) in the table

In [7]:
data.head(10)
Out[7]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
40041 Male 31.00 0 0 No Self-employed Rural 64.85 23.0 Unknown 0
55244 Male 40.00 0 0 Yes Self-employed Rural 65.29 28.3 never smoked 0
70992 Female 8.00 0 0 No children Urban 74.42 22.5 Unknown 0
38207 Female 79.00 1 0 Yes Self-employed Rural 76.64 19.5 never smoked 0
8541 Female 75.00 0 0 Yes Govt_job Rural 94.77 27.2 never smoked 0
2467 Female 79.00 1 0 Yes Self-employed Rural 92.43 NaN never smoked 0
19828 Female 56.00 1 0 Yes Private Rural 97.37 34.1 smokes 0
621 Male 69.00 0 0 Yes Private Rural 101.52 26.8 smokes 0
54975 Male 7.00 0 0 No Self-employed Rural 64.06 18.9 Unknown 0
57485 Female 1.48 0 0 No children Rural 55.51 18.5 Unknown 0

Look at the data dimensions

In [8]:
data.shape
Out[8]:
(5110, 11)

Look at column names in the data

In [9]:
data.columns
Out[9]:
Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

Look at the distribution of unavailable values over the data columns. We will address the issue of unavailable data in a 'preprocessing' step discussed later

In [10]:
data.isna().sum()
Out[10]:
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

For working with ML algorithms we need only numerical data. This is why we check whether there is any non-numerical data, so that we can convert it to numerical form before applying ML.

In [11]:
data.dtypes
Out[11]:
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

Pre-processing: The boring yet essential step¶


Approaches to make crude real-world data suitable for machine learning:

  • Handling values that are not available (removing too many rows can cause serious data loss)
  • Feature selection (an umbrella term for several techniques with a crucial motivation)

image-4.png


  • Removing redundant features, e.g. if BMI is a feature, there is no need to consider height and weight separately, since BMI is calculated from height and weight.
  • Highly correlated features are also indicative of redundancy
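One simple way to flag potentially redundant features is a pairwise correlation matrix. A sketch on synthetic data (the column names and the 0.7 threshold are illustrative choices, not part of the tutorial's pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
height = rng.normal(170, 5, 200)   # cm
weight = rng.normal(70, 8, 200)    # kg
toy = pd.DataFrame({
    "height": height,
    "weight": weight,
    "bmi": weight / (height / 100) ** 2,  # derived from the other two
})

corr = toy.corr().abs()
# Report feature pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.7
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > threshold]
print(pairs)
```

Here `bmi` is a deterministic function of `height` and `weight`, so it correlates strongly with `weight` and one of the redundant features could be dropped.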

Replacing categorical data with numerical values¶

In [12]:
data['gender'].value_counts()
Out[12]:
Female    2994
Male      2115
Other        1
Name: gender, dtype: int64
In [13]:
data['gender'][:10]
Out[13]:
id
40041      Male
55244      Male
70992    Female
38207    Female
8541     Female
2467     Female
19828    Female
621        Male
54975      Male
57485    Female
Name: gender, dtype: object
In [14]:
cleanup_nums = {'gender': {"Female": 1, "Male": 0, "Other": -1}}
data.replace(cleanup_nums, inplace=True)
In [15]:
data['gender'][:10]
Out[15]:
id
40041    0
55244    0
70992    1
38207    1
8541     1
2467     1
19828    1
621      0
54975    0
57485    1
Name: gender, dtype: int64
In [16]:
data['ever_married'].value_counts()
Out[16]:
Yes    3353
No     1757
Name: ever_married, dtype: int64
In [17]:
cleanup_nums = {'ever_married': {"Yes": 1, "No": 0}}
data.replace(cleanup_nums, inplace=True)
In [18]:
data['work_type'].value_counts()
Out[18]:
Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: work_type, dtype: int64
In [19]:
cleanup_nums = {'work_type': {"Private": 0, "Self-employed": 1, 'children':2, 'Govt_job':3, 'Never_worked':-1}}
data.replace(cleanup_nums, inplace=True)
In [20]:
data['Residence_type'].value_counts()
Out[20]:
Urban    2596
Rural    2514
Name: Residence_type, dtype: int64
In [21]:
cleanup_nums = {'Residence_type': {"Urban": 0, "Rural": 1}}
data.replace(cleanup_nums, inplace=True)
In [22]:
data['smoking_status'].value_counts()
Out[22]:
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64
In [23]:
cleanup_nums = {'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes':2}}
data.replace(cleanup_nums, inplace=True)

It is good practice to record the non-numeric-to-numeric pre-processing transformation in the form of a dictionary. We can use this later during data visualization.

In [24]:
rev_dict={'gender': {"Female": 1, "Male": 0, "Other": -1},'ever_married': {"Yes": 1, "No": 0},'work_type': {"Private": 0, "Self-employed": 1, 'children':2, 'Govt_job':3, 'Never_worked':-1},
         'Residence_type': {"Urban": 0, "Rural": 1},'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes':2}, 'hypertension':{"Yes": 1, "No": 0},
         'heart_disease': {"Yes": 1, "No": 0}, 'stroke':{"Yes": 1, "No": 0}}

Finally we can recheck whether there is any non-numeric feature left in the data:

In [25]:
data.dtypes
Out[25]:
gender                 int64
age                  float64
hypertension           int64
heart_disease          int64
ever_married           int64
work_type              int64
Residence_type         int64
avg_glucose_level    float64
bmi                  float64
smoking_status         int64
stroke                 int64
dtype: object

Imputing missing values¶

There are various techniques to impute missing values, for example substituting the column mean or mode, or interpolating between the closest available data points. Due to time constraints we do not explore the details in this tutorial.
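The tutorial uses a helper from the `fdc` package below; for reference, mean and mode imputation are one-liners in pandas. A minimal sketch on toy data (the values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bmi": [23.0, np.nan, 28.3, 19.5],                      # continuous: impute with the mean
    "smoking_status": ["never", "never", None, "smokes"],   # categorical: impute with the mode
})

# Fill continuous gaps with the column mean, categorical gaps with the most frequent value
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].mean())
toy["smoking_status"] = toy["smoking_status"].fillna(toy["smoking_status"].mode()[0])

print(toy.isna().sum().sum())
```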

In [26]:
from fdc.missingValues import fix_missing_values
data = fix_missing_values(data, 4)

We can finally recheck if there is any missing value left in the data

In [27]:
print(data.isna().sum())
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

Information content in features¶


It is important to quantify the information content of features and to remove features with low information content


  • Features with higher variance have higher information content and should be retained during feature selection

image-4.png


  • Information content in categorical variables can also be measured by Shannon Entropy, which quantifies the diversity of the information present in a feature
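As a quick illustration of Shannon entropy as a diversity measure: a perfectly balanced binary feature attains the maximum of 1 bit, while a highly skewed one carries much less information. A sketch with `scipy.stats.entropy` (the counts are made up for illustration):

```python
from scipy.stats import entropy

balanced = [50, 50]   # counts: both categories equally frequent
skewed = [95, 5]      # counts: one category dominates

print(entropy(balanced, base=2))  # 1.0 bit, maximal for two categories
print(entropy(skewed, base=2))    # about 0.286 bits, much less diverse
```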

Quantifying information content of features¶

We now calculate the information content of categorical and continuous features separately.

In [28]:
nom_features=['gender','ever_married','work_type','Residence_type']
ord_features=['hypertension','heart_disease','stroke','smoking_status']
cont_features=['age','avg_glucose_level', 'bmi']

For categorical features we use the Shannon entropy, normalized here by the number of categories minus one so that features with different numbers of levels are comparable.

In [29]:
from scipy.stats import entropy
cat_feature_entropies=[]
for feature in nom_features+ord_features:
    x=entropy(data[feature].value_counts(), base=2)/(len(data[feature].value_counts().index)-1)
    cat_feature_entropies.append(x)
    print(feature+' '+str(x))
cat_feature_entropies=np.array(cat_feature_entropies)
gender 0.4905211962124562
ever_married 0.9284419193163191
work_type 0.42189950924311775
Residence_type 0.9998142413320442
hypertension 0.4608789353279434
heart_disease 0.30320009106772317
stroke 0.28096875511730324
smoking_status 0.6355622137734004

For continuous data we first standardize the data (each feature is transformed to zero mean and unit variance). Then we calculate the range (the absolute value of max − min) for every feature

In [30]:
cont_feature_spread=[]
for feature in cont_features:
    standardized=(data[feature]-data[feature].mean())/data[feature].std()
    x=abs(standardized.max()-standardized.min())
    cont_feature_spread.append(x)
    print(feature+'  '+str(x))
cont_feature_spread=np.array(cont_feature_spread)
age  3.6227515072910723
avg_glucose_level  4.783634486327282
bmi  11.243565413481782

Finally we recheck the preprocessed data

In [31]:
data.head(10)
Out[31]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
40041 0 31.00 0 0 0 1 1 64.85 23.000000 -1 0
55244 0 40.00 0 0 1 1 1 65.29 28.300000 0 0
70992 1 8.00 0 0 0 2 0 74.42 22.500000 -1 0
38207 1 79.00 1 0 1 1 1 76.64 19.500000 0 0
8541 1 75.00 0 0 1 3 1 94.77 27.200000 0 0
2467 1 79.00 1 0 1 1 1 92.43 27.233333 0 0
19828 1 56.00 1 0 1 0 1 97.37 34.100000 2 0
621 0 69.00 0 0 1 0 1 101.52 26.800000 2 0
54975 0 7.00 0 0 0 1 1 64.06 18.900000 -1 0
57485 1 1.48 0 0 0 2 1 55.51 18.500000 -1 0
In [32]:
data.shape
Out[32]:
(5110, 11)

ML in a nutshell¶

image.png

Similarity measures¶


Numerous similarity measures exist, each useful in different situations


image-4.png
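For instance, Euclidean and Hamming distance disagree about how different two vectors are: one weighs the magnitude of the difference, the other only whether entries differ at all. A small comparison with `scipy.spatial.distance` (the vectors are illustrative):

```python
from scipy.spatial import distance

a = [0, 1, 2, 3]
b = [0, 1, 2, 9]

# Euclidean: the magnitude of the difference matters
print(distance.euclidean(a, b))   # 6.0

# Hamming: only the fraction of positions that differ matters
print(distance.hamming(a, b))     # 0.25
```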

Clustering algorithms¶


  • The goal of clustering is to find mutually similar groups (based on some similarity measure) in the data
  • Each algorithm assigns a cluster to every data point


image-3.png

Finding clusters in the data¶

DBSCAN Clustering¶

Parameter settings: eps=1.5, min_samples=20

There are four steps to using an ML model in Scikit-learn

  • Import the model from the library
  • Specify the model parameters (the model's decisions depend on the choice of parameters)
  • Fit the model to the data
  • Use the fitted model for prediction
In [33]:
from sklearn.cluster import DBSCAN
In [34]:
dbscan = DBSCAN(eps=1.5, min_samples = 20)
In [35]:
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
In [36]:
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)

Now we look at the distribution of clusters obtained by DBSCAN with these parameter settings. Note that DBSCAN labels noise points as -1, so after the +1 shift they are printed as 'Cluster 0'

In [37]:
for i in range(len(values_dbscan)):
    print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 870
Cluster 1 counts: 339
Cluster 2 counts: 390
Cluster 3 counts: 431
Cluster 4 counts: 113
Cluster 5 counts: 50
Cluster 6 counts: 447
Cluster 7 counts: 645
Cluster 8 counts: 345
Cluster 9 counts: 63
Cluster 10 counts: 671
Cluster 11 counts: 134
Cluster 12 counts: 39
Cluster 13 counts: 399
Cluster 14 counts: 73
Cluster 15 counts: 35
Cluster 16 counts: 41
Cluster 17 counts: 25

Note that if we change the parameter settings, the model's predictions change

Parameter settings: eps=2, min_samples=10

In [38]:
dbscan = DBSCAN(eps=2, min_samples = 10)
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)
for i in range(len(values_dbscan)):
    print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 316
Cluster 1 counts: 360
Cluster 2 counts: 475
Cluster 3 counts: 463
Cluster 4 counts: 87
Cluster 5 counts: 780
Cluster 6 counts: 468
Cluster 7 counts: 42
Cluster 8 counts: 360
Cluster 9 counts: 499
Cluster 10 counts: 824
Cluster 11 counts: 28
Cluster 12 counts: 98
Cluster 13 counts: 61
Cluster 14 counts: 46
Cluster 15 counts: 79
Cluster 16 counts: 24
Cluster 17 counts: 20
Cluster 18 counts: 30
Cluster 19 counts: 24
Cluster 20 counts: 26

Agglomerative Clustering¶

Parameter settings: n_clusters=5, affinity="euclidean", linkage="ward"

In [39]:
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=5,affinity="euclidean",linkage="ward")
clusters_agg=agg.fit_predict((data-data.mean())/data.std())
(values_agg,counts_agg) = np.unique(clusters_agg,return_counts=True)
for i in range(len(values_agg)):
    print('Cluster '+str(values_agg[i]+1)+' counts: '+str(counts_agg[i]))
Cluster 1 counts: 1564
Cluster 2 counts: 249
Cluster 3 counts: 2694
Cluster 4 counts: 374
Cluster 5 counts: 229

Diversity in Clustering algorithms¶


image.png

Vizualizing high dimensional data using Dimension reduction¶


  • The strategy is to project high-dimensional data into lower dimensions while keeping the information content, in terms of similarity among data points, as intact as possible
  • Recall that the more spread the data has along some direction, the higher its information content: hence PCA

image-4.png
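The variance-as-information idea can be checked directly: scikit-learn's `explained_variance_ratio_` reports what fraction of the total variance each principal component retains. A sketch on synthetic data (the data-generating choices are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 3-D data with most of its spread concentrated along one direction
x = rng.normal(0, 5, 300)
points = np.column_stack([
    x,
    0.5 * x + rng.normal(0, 1, 300),  # correlated with x
    rng.normal(0, 1, 300),            # pure noise
])

pca = PCA(n_components=2)
pca.fit(points)
print(pca.explained_variance_ratio_)  # the first component dominates
```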

Principal Component Analysis (PCA)¶

In [40]:
from sklearn.decomposition import PCA
In [41]:
pca = PCA(n_components=2)
In [42]:
low_dim_PCA=pca.fit_transform((data-data.mean())/data.std())
In [43]:
result_PCA = pd.DataFrame(data = low_dim_PCA , 
        columns = ['PCA_1', 'PCA_2'])
In [44]:
result_PCA['DBSCAN_clusters']=clusters_dbscan+1
result_PCA['Agglomerative_clusters']=clusters_agg+1
result_PCA['stroke']=np.array(data['stroke'])
In [45]:
result_PCA.head(10)
Out[45]:
PCA_1 PCA_2 DBSCAN_clusters Agglomerative_clusters stroke
0 -2.048564 0.693222 1 1 0
1 -0.287871 -0.118575 2 3 0
2 -2.702238 0.325357 3 1 0
3 1.198394 0.294427 4 4 0
4 0.307710 0.036912 5 3 0
5 1.646739 0.153579 4 4 0
6 2.240902 -0.878169 4 4 0
7 1.316128 -0.628434 2 3 0
8 -2.802535 0.826701 1 1 0
9 -3.167844 0.319876 6 1 0
In [46]:
import seaborn as sns
from matplotlib import pyplot as plt
Visualizing clusters obtained by DBSCAN on the PCA embeddings¶
In [47]:
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="PCA_1", y="PCA_2",
  data=result_PCA, 
  fit_reg=False, 
  legend=True,
  hue='DBSCAN_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing clusters obtained by Agglomerative clustering on the PCA embeddings¶
In [48]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="PCA_1", y="PCA_2",
  data=result_PCA, 
  fit_reg=False, 
  legend=True,
  hue='Agglomerative_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing the distribution of the target variable on the PCA embedding¶
In [49]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="PCA_1", y="PCA_2",
  data=result_PCA, 
  fit_reg=False, 
  legend=True,
  hue='stroke', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()

Non-linear dimension reduction strategies through Manifold learning¶


image-2.png

Uniform Manifold Approximation and Projection (UMAP)¶

In [50]:
import umap.umap_ as umap
In [51]:
low_dim_UMAP=umap.UMAP(n_neighbors=5, min_dist=0.1, n_components=2, random_state=42).fit_transform((data-data.mean())/data.std())
In [52]:
result_UMAP = pd.DataFrame(data = low_dim_UMAP , 
        columns = ['UMAP_1', 'UMAP_2'])
In [53]:
result_UMAP['DBSCAN_clusters']=clusters_dbscan+1
result_UMAP['Agglomerative_clusters']=clusters_agg+1
result_UMAP['stroke']=np.array(data['stroke'])
In [54]:
result_UMAP.head(10)
Out[54]:
UMAP_1 UMAP_2 DBSCAN_clusters Agglomerative_clusters stroke
0 3.452700 11.053363 1 1 0
1 -5.089006 9.837020 2 3 0
2 6.440556 -8.797561 3 1 0
3 -10.304930 1.716699 4 4 0
4 1.955747 -8.052000 5 3 0
5 -10.170257 1.757546 4 4 0
6 -10.213698 2.286936 4 4 0
7 -4.746089 7.223530 2 3 0
8 3.480995 11.137195 1 1 0
9 5.691496 -6.034472 6 1 0
Visualizing clusters obtained by DBSCAN on the UMAP embeddings¶
In [55]:
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='DBSCAN_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing clusters obtained by Agglomerative clustering on the UMAP embeddings¶
In [56]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='Agglomerative_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing the distribution of the target variable on the UMAP embedding¶
In [57]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='stroke', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing the distribution of smoking status on the UMAP embedding¶
In [58]:
smoking_status_dict=rev_dict['smoking_status']
smoking_status_dict=dict((v,k) for k,v in smoking_status_dict.items())

smoking_status_labels=[]
for i in range(len(data['smoking_status'])):
    smoking_status_labels.append(smoking_status_dict[np.array(data['smoking_status'])[i]])
    
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
result_UMAP['smoking_status']=np.array(smoking_status_labels)

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='smoking_status', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
    
Visualizing the distribution of residence type on the UMAP embedding¶
In [59]:
Residence_type_dict=rev_dict['Residence_type']
Residence_type_dict=dict((v,k) for k,v in Residence_type_dict.items())

Residence_type_labels=[]
for i in range(len(data['Residence_type'])):
    Residence_type_labels.append(Residence_type_dict[np.array(data['Residence_type'])[i]])
    
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
result_UMAP['Residence_type']=np.array(Residence_type_labels)

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='Residence_type', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()

Feature-type Distributed Clustering (FDC)¶

image.png

Bej, S., Sarkar, J., Biswas, S. et al. Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach. Nutr. Diabetes 12, 27 (2022). https://doi.org/10.1038/s41387-022-00206-2

In [60]:
from scipy.spatial import distance

array=np.arange(11)-5

distance_euc=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.euclidean(array[i],array[j])
        d.append(x)
    distance_euc.append(d)
distance_euc=np.array(distance_euc)/np.max(distance_euc)
#distance_euc=np.divide(distance_euc.T,np.max(distance_euc,axis=1)).T

distance_can=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.canberra(array[i],array[j])
        d.append(x)
    distance_can.append(d)
distance_can=np.array(distance_can)/np.max(distance_can)
#distance_can=np.divide(distance_can.T,np.max(distance_can,axis=1)).T

distance_canx=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.canberra(1,np.abs(array[i]-array[j])+1)
        d.append(np.sqrt(x))
    distance_canx.append(d)
distance_canx=np.array(distance_canx)/np.max(distance_canx)

distance_ham=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.hamming(array[i],array[j])
        d.append(x)
    distance_ham.append(d)
distance_ham=np.array(distance_ham)/np.max(distance_ham)
#distance_ham=np.divide(distance_ham.T,np.max(distance_ham,axis=1)).T
In [61]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_euc, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Euclidean', fontsize=20)
plt.show()
In [62]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_ham, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Hamming',fontsize=20)
plt.show()
In [63]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_can, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra',fontsize=20)
plt.show()
In [64]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_canx, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra modified',fontsize=20)
plt.show()
In [70]:
from fdc.fdc import canberra_modified, FDC, Clustering

fdc = FDC(clustering_cont=Clustering('euclidean')
          , clustering_ord=Clustering(canberra_modified)
          , clustering_nom=Clustering('hamming', max_components=1)
          , visual=False
          , use_pandas_output=True
          , with_2d_embedding=True
          )

fdc.selectFeatures(continueous=cont_features, nomial=nom_features, ordinal=ord_features)

entire_data_FDC_emb_five, entire_data_FDC_emb_two = fdc.normalize((data-data.mean())/data.std())
Five-dimensional reduction of the original high-dimensional data using FDC¶
In [71]:
entire_data_FDC_emb_five.shape
Out[71]:
(5110, 5)
Two-dimensional reduction of the five-dimensional FDC embedding, created to visualize the results¶
In [72]:
entire_data_FDC_emb_two.shape
Out[72]:
(5110, 2)
Finding clusters in the reduced 5-D FDC-embedding using DBSCAN¶
In [73]:
dbscan = DBSCAN(eps=1, min_samples = 10)
clusters_dbscan_FDC=dbscan.fit_predict(entire_data_FDC_emb_five)
(values_dbscan_FDC,counts_dbscan_FDC) = np.unique(clusters_dbscan_FDC,return_counts=True)
for i in range(len(values_dbscan_FDC)):
    print('Cluster '+str(values_dbscan_FDC[i]+1)+' counts: '+str(counts_dbscan_FDC[i]))
Cluster 0 counts: 32
Cluster 1 counts: 1414
Cluster 2 counts: 1732
Cluster 3 counts: 153
Cluster 4 counts: 58
Cluster 5 counts: 854
Cluster 6 counts: 787
Cluster 7 counts: 80
Finding clusters in the reduced 5-D FDC-embedding using Agglomerative clustering¶
In [74]:
agg = AgglomerativeClustering(n_clusters=8,affinity="euclidean",linkage="ward")
clusters_agg_FDC=agg.fit_predict(entire_data_FDC_emb_five)
(values_agg_FDC,counts_agg_FDC) = np.unique(clusters_agg_FDC,return_counts=True)
for i in range(len(values_agg_FDC)):
    print('Cluster '+str(values_agg_FDC[i])+' counts: '+str(counts_agg_FDC[i]))
Cluster 0 counts: 582
Cluster 1 counts: 443
Cluster 2 counts: 637
Cluster 3 counts: 886
Cluster 4 counts: 881
Cluster 5 counts: 577
Cluster 6 counts: 664
Cluster 7 counts: 440
In [75]:
entire_data_FDC_emb_two['DBSCAN_clusters']=clusters_dbscan_FDC+1
entire_data_FDC_emb_two['Agglomerative_clusters']=clusters_agg_FDC+1
entire_data_FDC_emb_two['stroke']=np.array(data['stroke'])
Visualizing clusters obtained by DBSCAN on the FDC embeddings¶
In [78]:
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise",  "lightgreen",  'burlywood','lightsteelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

xName = "UMAP_0" # "FDC_1"
yName = "UMAP_1" # "FDC_2"

sns.lmplot( x=xName, y=yName,
  data=entire_data_FDC_emb_two, 
  fit_reg=False, 
  legend=True,
  hue='DBSCAN_clusters', # color by cluster
  scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
Visualizing clusters obtained by Agglomerative clustering on the FDC embeddings¶
In [79]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise",  "lightgreen",  'burlywood','lightsteelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x=xName, y=yName,
  data=entire_data_FDC_emb_two, 
  fit_reg=False, 
  legend=True,
  hue='Agglomerative_clusters', # color by cluster
  scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
Visualizing the distribution of the target variable on the FDC embedding¶
In [80]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x=xName, y=yName,
  data=entire_data_FDC_emb_two, 
  fit_reg=False, 
  legend=True,
  hue='stroke', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Clusters 3, 5, 6 and 8 obtained from DBSCAN clearly correspond to the target variable¶

Interpretation of the analysis using Hypothesis testing¶

Analysis of distribution of features over clusters¶
In [81]:
data['Clusters']=np.array(clusters_dbscan_FDC)
In [82]:
cluster_df_list=[]
for cluster in values_dbscan_FDC:
    cluster_df=data.loc[data['Clusters'] == cluster].drop(columns=['Clusters'])
    cluster_df.columns=list(data.columns)[:-1]
    cluster_df_list.append(cluster_df)
In [83]:
cluster_df_list=cluster_df_list[1:]
In [84]:
data=data.drop(['Clusters'],axis=1)
In [85]:
import matplotlib as mpl
from scipy import stats
from scipy.stats import ranksums  # used by p_val for categorical features


def feature_p_val(feature,cluster_df_list,var_type):
    %config InlineBackend.figure_format = 'svg'
    num_clusters=len(cluster_df_list)
    p_list_all_clusters=[]
    for i in range(num_clusters):
        p_list=[]
        for j in range(num_clusters):
            p=p_val(cluster_df_list[i],cluster_df_list[j],feature,var_type)
            p_list.append(p)
        p_list=np.array(p_list)
        p_list_all_clusters.append(p_list)  
    return(np.array(p_list_all_clusters))

def p_val(clustera,clusterb,feature,var_type):
    if var_type=='cont':
        p=stats.ttest_ind(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
    else:
        p=ranksums(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
    return p

class MidpointNormalize(mpl.colors.Normalize):
    def __init__(self, vmin, vmax, midpoint=0, clip=False):
        self.midpoint = midpoint
        mpl.colors.Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        normalized_min = max(0, 1 / 2 * (1 - abs((self.midpoint - self.vmin) / (self.midpoint - self.vmax))))
        normalized_max = min(1, 1 / 2 * (1 + abs((self.vmax - self.midpoint) / (self.midpoint - self.vmin))))
        normalized_mid = 0.5
        x, y = [self.vmin, self.midpoint, self.vmax], [normalized_min, normalized_mid, normalized_max]
        return np.ma.masked_array(np.interp(value, x, y))

def p_map(feature,var_type):
    heatmap, ax = plt.subplots(figsize=(8, 8), dpi=600)
    norm = MidpointNormalize(vmin=0, vmax=1, midpoint=0.5)
    im = ax.imshow(feature_p_val(feature,cluster_df_list,var_type), cmap='coolwarm', norm=norm)
    ax.set_xticklabels(['','C1','C2','C3','C4','C5','C6','C7','C8','C9'])
    ax.set_yticklabels(['','C1','C2','C3','C4','C5','C6','C7','C8','C9'])

    for y in range(len(cluster_df_list)):
        for x in range(len(cluster_df_list)):
            plt.text(x , y , '%.2f' % feature_p_val(feature,cluster_df_list, var_type)[y, x], horizontalalignment='center',verticalalignment='center',fontsize=8)

    cbar = heatmap.colorbar(im)
    cbar.ax.set_ylabel('p-value')
    plt.title(feature.upper(), fontsize=16)
    print('\n')
    plt.show()
    
    return
In [86]:
for feature in cont_features:
    #print(feature.upper())
    p_map(feature,'cont')

In [87]:
from scipy.stats import ranksums
for feature in ord_features+nom_features:
    #print(feature.upper())
    p_map(feature,'ord')

[Output: p-value heatmaps comparing clusters pairwise, one per ordinal and nominal feature]
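The diverging colour scale in the heatmaps above is produced by `MidpointNormalize`, which maps `vmin → midpoint → vmax` piecewise-linearly onto `0 → 0.5 → 1`, so the chosen midpoint always sits at the centre of the colormap. A self-contained check (the class is repeated from above; the 0.05 midpoint here is only illustrative — the heatmaps use 0.5):

```python
import matplotlib as mpl
import numpy as np

class MidpointNormalize(mpl.colors.Normalize):
    # Same class as defined earlier, repeated so this snippet runs on its own.
    def __init__(self, vmin, vmax, midpoint=0, clip=False):
        self.midpoint = midpoint
        mpl.colors.Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        normalized_min = max(0, 1 / 2 * (1 - abs((self.midpoint - self.vmin) / (self.midpoint - self.vmax))))
        normalized_max = min(1, 1 / 2 * (1 + abs((self.vmax - self.midpoint) / (self.midpoint - self.vmin))))
        x, y = [self.vmin, self.midpoint, self.vmax], [normalized_min, 0.5, normalized_max]
        return np.ma.masked_array(np.interp(value, x, y))

norm = MidpointNormalize(vmin=0, vmax=1, midpoint=0.05)
# p = 0.05 lands exactly at the colormap centre (0.5); the range below the
# midpoint is compressed and the range above it is stretched accordingly.
```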

Visualizing the distribution of significant features over clusters¶

In [92]:
def vizx(feature_list, cluster_df_list, main_data,umap_data,cont_features):
    %config InlineBackend.figure_format = 'svg'
    vizlimit=15
    plt.rcParams["figure.figsize"] = (12,6)
    
    
    
    for feature in feature_list:
        print('Feature name:', feature.upper())
        print('\n')
    
        if len(main_data[feature].value_counts())<=vizlimit:
            cluster_counter=1
            for cluster in range(len(cluster_df_list)):
                print('Cluster '+ str(cluster_counter)+ ' frequency distribution')
                if feature in list(rev_dict.keys()):
                    feat_keys=rev_dict[feature]
                    r=dict(zip(feat_keys.values(), feat_keys.keys()))
                    print(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
                else:
                    print(cluster_df_list[cluster][feature].value_counts())
                cluster_counter=cluster_counter+1
                print('\n')
        
        print('\n')
        print('\n')
        
        col=sns.color_palette("Set2")
        
        cluster_bar=[]     
        for cluster in range(len(cluster_df_list)):
            if len(main_data[feature].value_counts())<=vizlimit:
                if feature in list(rev_dict.keys()):
                    
                    
                    y=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
                    x=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts().index)
                    cluster_bar.append([x,y])
                    
                   
                else:
                   
                    y=np.array(cluster_df_list[cluster][feature].value_counts().sort_index())
                    x=np.array(cluster_df_list[cluster][feature].value_counts().sort_index().index)
                    cluster_bar.append([x,y])
            
                
                
        cluster_bar=np.array(cluster_bar)
        
        rows=3
        columns=3
    
        if len(main_data[feature].value_counts())<=vizlimit:
            figx, ax = plt.subplots(rows, columns)
            figx.set_size_inches(10.5, 28.5)
            c=0
            for i in range(rows):
                for j in range(columns):
                    if c>=len(cluster_df_list):
                        break
                    ax[i,j].bar(cluster_bar[c,0],cluster_bar[c,1],color=col)
                    ax[i,j].tick_params(axis='x', which='major', labelsize=8, rotation= 90)
                    ax[i,j].set_title('Cluster: '+str(c+1))
                    c=c+1
            
        means=[]
        sds=[]
        cluster_labels=[]
        cluster_counter=1
        for cluster in range(len(cluster_df_list)):
            if feature in cont_features:
                print('Cluster '+ str(cluster_counter)+ ' summary statistics')
                print('\n')
                cm=cluster_df_list[cluster][feature].mean()
                cs=cluster_df_list[cluster][feature].std()
                print('feature mean:', cm)
                print('feature standard deviation:', cs)
                print('feature median:', cluster_df_list[cluster][feature].median())
                print('\n')
                means.append(cm)
                sds.append(cs)
                cluster_labels.append('C'+str(cluster_counter))
            cluster_counter=cluster_counter+1
            
            
        means=np.array(means)
        sds=np.array(sds)
        cluster_labels=np.array(cluster_labels)
        
        print('\n')  
        
        print('Distribution of feature across clusters')
        if feature in cont_features:   
            fig, ax7 = plt.subplots()
            ax7.bar(cluster_labels,means,yerr=sds,color=sns.color_palette("Set3"))
            ax7.tick_params(axis='both', which='major', labelsize=10)
            plt.xlabel(feature, fontsize=15)        
            plt.show()
        
        print('\n')
        print('\n')
        
        colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise",  "lightgreen",  'burlywood','lightsteelblue']
        customPalette_set = sns.set_palette(sns.color_palette(colors_set))
        
        if feature not in cont_features:
            print('Feature distribution in UMAP embedding')
            if feature in list(rev_dict.keys()):
                feat_keys=rev_dict[feature]
                r=dict(zip(feat_keys.values(), feat_keys.keys()))  # code -> label mapping
                umap_data[feature]=np.array(main_data.replace({feature:r})[feature])
            else:
                umap_data[feature]=np.array(main_data[feature])
            sns.lmplot( x=xName, y=yName,
              data=umap_data, 
              fit_reg=False, 
              legend=True,
              hue=feature, # color by cluster
              scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
            plt.show()
            #plt.savefig('clusters_umap_fin.png', dpi=700, bbox_inches='tight')
        
        print('\n')
        print('\n')
        
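At its core, `vizx` prints per-cluster `value_counts` for categorical features and mean ± standard deviation for continuous ones. The same summaries can be computed more compactly with `groupby`; a sketch on a synthetic table (column names chosen to echo the stroke dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'cluster': rng.integers(0, 3, 300),                      # synthetic cluster labels
    'age': rng.normal(50, 15, 300),                          # continuous feature
    'Residence_type': rng.choice(['Urban', 'Rural'], 300),   # nominal feature
})

# Continuous feature: mean and standard deviation per cluster
age_summary = df.groupby('cluster')['age'].agg(['mean', 'std'])

# Categorical feature: frequency distribution per cluster
residence_counts = df.groupby('cluster')['Residence_type'].value_counts()
```

`vizx` keeps the explicit per-cluster loop instead so that each summary can be printed and plotted alongside its bar chart.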
In [93]:
vizx(['age','Residence_type','smoking_status']
     , cluster_df_list
     , data
     , entire_data_FDC_emb_two
     , cont_features
     )
Feature name: AGE






Cluster 1 summary statistics


feature mean: 27.04998585572844
feature standard deviation: 23.264206803328854
feature median: 19.0


Cluster 2 summary statistics


feature mean: 45.47863741339492
feature standard deviation: 19.681933722805734
feature median: 45.0


Cluster 3 summary statistics


feature mean: 60.95424836601307
feature standard deviation: 14.745356034894618
feature median: 62.0


Cluster 4 summary statistics


feature mean: 53.60344827586207
feature standard deviation: 14.44765464505889
feature median: 55.0


Cluster 5 summary statistics


feature mean: 50.2423887587822
feature standard deviation: 17.944777395881175
feature median: 50.0


Cluster 6 summary statistics


feature mean: 52.98856416772554
feature standard deviation: 17.592391408434754
feature median: 54.0


Cluster 7 summary statistics


feature mean: 66.3125
feature standard deviation: 11.705882703727118
feature median: 69.5




Distribution of feature across clusters
[Output: bar chart of mean AGE per cluster with standard-deviation error bars]







Feature name: RESIDENCE_TYPE


Cluster 1 frequency distribution
Urban    717
Rural    697
Name: Residence_type, dtype: int64


Cluster 2 frequency distribution
Urban    867
Rural    865
Name: Residence_type, dtype: int64


Cluster 3 frequency distribution
Rural    84
Urban    69
Name: Residence_type, dtype: int64


Cluster 4 frequency distribution
Urban    30
Rural    28
Name: Residence_type, dtype: int64


Cluster 5 frequency distribution
Urban    452
Rural    402
Name: Residence_type, dtype: int64


Cluster 6 frequency distribution
Urban    404
Rural    383
Name: Residence_type, dtype: int64


Cluster 7 frequency distribution
Rural    43
Urban    37
Name: Residence_type, dtype: int64








Distribution of feature across clusters




Feature distribution in UMAP embedding
[Output: per-cluster frequency bar charts and UMAP embedding coloured by RESIDENCE_TYPE]



Feature name: SMOKING_STATUS


Cluster 1 frequency distribution
Unknown    1414
Name: smoking_status, dtype: int64


Cluster 2 frequency distribution
never smoked       1603
formerly smoked      55
Unknown              38
smokes               36
Name: smoking_status, dtype: int64


Cluster 3 frequency distribution
never smoked       135
formerly smoked     11
Unknown              4
smokes               3
Name: smoking_status, dtype: int64


Cluster 4 frequency distribution
smokes    58
Name: smoking_status, dtype: int64


Cluster 5 frequency distribution
smokes             674
never smoked        76
formerly smoked     65
Unknown             39
Name: smoking_status, dtype: int64


Cluster 6 frequency distribution
formerly smoked    750
Unknown             37
Name: smoking_status, dtype: int64


Cluster 7 frequency distribution
never smoked       73
formerly smoked     4
smokes              3
Name: smoking_status, dtype: int64








Distribution of feature across clusters




Feature distribution in UMAP embedding
[Output: per-cluster frequency bar charts and UMAP embedding coloured by SMOKING_STATUS]



Take-home messages¶


  • The idea of machine learning is to represent real-world data, be it tabular, images, or text, in terms of numbers
  • These representations have to be designed so that the sets of numbers representing similar objects lie close to each other and those representing dissimilar objects lie far apart; Machine Learning algorithms are designed to achieve this basic goal
  • Tabular data used for clinical decision making with Machine Learning contains diverse feature types
  • It is crucial to analyze the patterns present in the data to achieve explainable decision making
  • Pre-processing is necessary to select relevant features, impute missing values, and convert the raw data into a refined numeric table that can then be used for Machine Learning
  • Clustering algorithms help us group similar data points in high-dimensional data
  • Since we cannot visualize high-dimensional data directly, dimension reduction algorithms are used to create lower-dimensional embeddings of the data, which in turn allow visualization
  • A basic strategy is to observe the data along the lower dimensions that capture the maximum information (highest variance); this is achieved by PCA
  • There are also non-linear strategies of dimension reduction, such as UMAP, that rely on non-linear measures of similarity
  • The Feature-type Distributed Clustering workflow can further improve clustering of tabular patient data by taking into account separate similarity measures for different variable types
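The steps above can be condensed into a minimal end-to-end sketch on synthetic data, using scikit-learn's default KMeans and PCA rather than the Feature-type Distributed Clustering workflow itself:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic 'patients': two well-separated groups in a 10-dimensional feature space.
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(5, 1, (100, 10))])

X_scaled = StandardScaler().fit_transform(X)   # pre-processing: refined numeric table
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)  # clustering
embedding = PCA(n_components=2).fit_transform(X_scaled)  # 2-D embedding for visualization
# 'embedding' can now be scatter-plotted, coloured by 'labels'.
```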

image.png