
Outline of the tutorial¶


  • Several aspects of tabular data and different data types
  • Converting raw data into refined data that can be used for Machine Learning
  • A brief intuition behind the underlying philosophy of Machine Learning
  • Quantifying similarity/dissimilarity among data points and its inherent importance
  • Finding similar groups of patients from high-dimensional patient data
  • Visualizing similar groups of patients in lower dimensions
  • An unbiased (in the context of diverse feature types in patient data) patient stratification approach

The general structure of tabular data¶

Stroke prediction dataset (Source: Kaggle)




Goal: Predicting the event of a stroke (target/ground truth) based on several features such as gender, age…smoking status

  • There are both numerical (age) and non-numerical (work_type) features

  • For a Machine Learning model, non-numerical data has to be converted to numerical form

  • Even among numerical data there are two data types:

    • Continuous (can take any real value in a range) e.g. avg_glucose_level
    • Categorical (takes only discrete values) e.g. heart_disease, ever_married, work_type
    • Further, there are two types of categorical variables:
      • Ordinal (has an intrinsic sense of order associated with it) e.g. heart_disease
      • Nominal (has no intrinsic sense of order associated with it) e.g. ever_married
  • Not all values are available in real-life data – e.g. BMI for patient id: 51676

Importing the data and first look¶

Importing Python libraries¶

In [1]:
%config InlineBackend.figure_format = 'svg'
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
import pandas as pd
import numpy as np

Importing and looking at the data¶

Import the data in '.csv' or '.xlsx' or '.txt' format

In [4]:
filename='healthcare-dataset-stroke-data.csv'
data=pd.read_csv(filename,index_col=0)
In [5]:
np.random.seed(42)
data=data.sample(frac=1)
In [6]:
data=data[:500]

We take only the first 500 rows of the shuffled dataset so that the tutorial runs smoothly

Look at the first 10 rows (data points) in the table

In [7]:
data.head(10)
Out[7]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
40041 Male 31.00 0 0 No Self-employed Rural 64.85 23.0 Unknown 0
55244 Male 40.00 0 0 Yes Self-employed Rural 65.29 28.3 never smoked 0
70992 Female 8.00 0 0 No children Urban 74.42 22.5 Unknown 0
38207 Female 79.00 1 0 Yes Self-employed Rural 76.64 19.5 never smoked 0
8541 Female 75.00 0 0 Yes Govt_job Rural 94.77 27.2 never smoked 0
2467 Female 79.00 1 0 Yes Self-employed Rural 92.43 NaN never smoked 0
19828 Female 56.00 1 0 Yes Private Rural 97.37 34.1 smokes 0
621 Male 69.00 0 0 Yes Private Rural 101.52 26.8 smokes 0
54975 Male 7.00 0 0 No Self-employed Rural 64.06 18.9 Unknown 0
57485 Female 1.48 0 0 No children Rural 55.51 18.5 Unknown 0

Look at the data dimensions

In [8]:
data.shape
Out[8]:
(500, 11)

Look at column names in the data

In [9]:
data.columns
Out[9]:
Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

Look at the distribution of missing values over the data columns. We will address the issue of missing data in a 'preprocessing' step discussed later

In [10]:
data.isna().sum()
Out[10]:
gender                0
age                   0
hypertension          0
heart_disease         0
ever_married          0
work_type             0
Residence_type        0
avg_glucose_level     0
bmi                  23
smoking_status        0
stroke                0
dtype: int64

ML algorithms work with numerical data only. This is why we check whether there is any non-numerical data, so that we can convert it to numerical form before applying ML.

In [11]:
data.dtypes
Out[11]:
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

Pre-processing: The boring yet essential step¶


Approaches to make raw real-world data suitable for machine learning:

  • Handling values that are not available (removing too many can cause serious data loss)
  • Feature selection (an umbrella term of several techniques with a crucial motivation)



  • Removing redundant features, e.g. if BMI is a feature, then there is no need to consider height and weight separately, since BMI is calculated from height and weight.
  • Highly correlated features are also indicative of redundancy
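The redundancy check described above can be made concrete with a correlation matrix. A minimal sketch on synthetic height/weight/BMI data (not the stroke dataset); the 0.6 threshold is an illustrative choice:

```python
import numpy as np
import pandas as pd

# Synthetic example: BMI is derived from height and weight, so we expect
# strong correlations between BMI and its ingredients
rng = np.random.default_rng(0)
height = rng.normal(1.7, 0.1, 200)   # metres
weight = rng.normal(70, 10, 200)     # kilograms
df = pd.DataFrame({'height': height, 'weight': weight,
                   'bmi': weight / height**2})

# Absolute pairwise Pearson correlations; large values flag redundancy
corr = df.corr().abs()
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.6]
print(redundant)
```

On the stroke data, the same check could be run on data.corr() once the categorical features have been encoded numerically.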

Replacing categorical data with numerical values¶

In [12]:
data['gender'].value_counts()
Out[12]:
Female    293
Male      206
Other       1
Name: gender, dtype: int64
In [13]:
data['gender'][:10]
Out[13]:
id
40041      Male
55244      Male
70992    Female
38207    Female
8541     Female
2467     Female
19828    Female
621        Male
54975      Male
57485    Female
Name: gender, dtype: object
In [14]:
cleanup_nums = {'gender': {"Female": 1, "Male": 0, "Other": -1}}
data.replace(cleanup_nums, inplace=True)
In [15]:
data['gender'][:10]
Out[15]:
id
40041    0
55244    0
70992    1
38207    1
8541     1
2467     1
19828    1
621      0
54975    0
57485    1
Name: gender, dtype: int64
In [16]:
data['ever_married'].value_counts()
Out[16]:
Yes    333
No     167
Name: ever_married, dtype: int64
In [17]:
cleanup_nums = {'ever_married': {"Yes": 1, "No": 0}}
data.replace(cleanup_nums, inplace=True)
In [18]:
data['work_type'].value_counts()
Out[18]:
Private          296
Self-employed     81
children          68
Govt_job          55
Name: work_type, dtype: int64
In [19]:
cleanup_nums = {'work_type': {"Private": 0, "Self-employed": 1, 'children':2, 'Govt_job':3, 'Never_worked':-1}}
data.replace(cleanup_nums, inplace=True)
In [20]:
data['Residence_type'].value_counts()
Out[20]:
Rural    256
Urban    244
Name: Residence_type, dtype: int64
In [21]:
cleanup_nums = {'Residence_type': {"Urban": 0, "Rural": 1}}
data.replace(cleanup_nums, inplace=True)
In [22]:
data['smoking_status'].value_counts()
Out[22]:
never smoked       184
Unknown            139
smokes              93
formerly smoked     84
Name: smoking_status, dtype: int64
In [23]:
cleanup_nums = {'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes':2}}
data.replace(cleanup_nums, inplace=True)

It is good practice to record the non-numeric-to-numeric transformations used in pre-processing in the form of a dictionary. We can use this later during data visualization.

In [24]:
rev_dict={'gender': {"Female": 1, "Male": 0, "Other": -1},'ever_married': {"Yes": 1, "No": 0},'work_type': {"Private": 0, "Self-employed": 1, 'children':2, 'Govt_job':3, 'Never_worked':-1},
         'Residence_type': {"Urban": 0, "Rural": 1},'smoking_status': {"never smoked": 0, "Unknown": -1, 'formerly smoked': 1, 'smokes':2}, 'hypertension':{"Yes": 1, "No": 0},
         'heart_disease': {"Yes": 1, "No": 0}, 'stroke':{"Yes": 1, "No": 0}}
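Since rev_dict maps labels to codes, plots that need readable labels can invert each inner mapping; a minimal sketch on the gender mapping:

```python
# Invert the label-to-code mapping recorded for one feature so that
# numeric codes can be translated back into readable labels
gender_map = {"Female": 1, "Male": 0, "Other": -1}
inv_gender_map = {code: label for label, code in gender_map.items()}
print(inv_gender_map[1])   # Female
```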

Finally we can recheck whether there is any non-numeric feature left in the data:

In [25]:
data.dtypes
Out[25]:
gender                 int64
age                  float64
hypertension           int64
heart_disease          int64
ever_married           int64
work_type              int64
Residence_type         int64
avg_glucose_level    float64
bmi                  float64
smoking_status         int64
stroke                 int64
dtype: object

Imputing missing values¶

There are various techniques to impute missing values, for example using the column mean or mode, or interpolating among the closest available data points. Due to time constraints, we do not explore the details in this tutorial. Below, each missing value is instead filled in from its nearest neighbours in the dense part of the data.
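For reference, the simpler column-wise alternatives mentioned above (mean for a continuous column, mode for a categorical one) take only a few lines in pandas. This is an illustrative sketch on a toy frame, not the neighbour-based scheme used in this tutorial:

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries (illustrative values only)
df = pd.DataFrame({'bmi': [23.0, np.nan, 28.3, 19.5],
                   'smoking_status': [0, 1, np.nan, 1]})

# Continuous column: fill with the column mean
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

# Categorical column: fill with the column mode (most frequent value)
df['smoking_status'] = df['smoking_status'].fillna(df['smoking_status'].mode()[0])
print(df)
```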

In [26]:
dense_data_pool=list(data.isna().sum().index[data.isna().sum()<4])
dense_data=data[dense_data_pool]
dense_data=dense_data.dropna()
data=data.loc[np.array(dense_data.index)]
In [27]:
from scipy.spatial import distance
# For every data point, list all point indices sorted by Euclidean distance
dense_arr=np.array(dense_data)
index_arr=np.array(dense_data.index)
distance_matrix=[]
for i in range(len(dense_arr)):
    dist=[distance.euclidean(dense_arr[i],dense_arr[j]) for j in range(len(dense_arr))]
    neb_list=index_arr[np.argsort(dist)]  # nearest point (itself) first
    distance_matrix.append(neb_list)
distance_matrix=np.array(distance_matrix)
In [28]:
missing_value_list = [x for x in list(data.columns) if x not in dense_data_pool]
In [29]:
total_impute_master=[]
for f in range(len(missing_value_list)):
    missing_value_indices=data[data[missing_value_list[f]].isnull()].index.tolist()
    feature_impute_master=[]
    for index in missing_value_indices:
        # locate the row of the sorted-neighbour table belonging to this point
        index_in_dist_mat=np.where(distance_matrix[:,0]==index)[0][0]
        value_list=[]
        neb_index_counter=1
        # collect the feature values of the 6 nearest neighbours that have one
        while len(value_list)<6:
            neb_index=distance_matrix[index_in_dist_mat][neb_index_counter]
            neb_index_counter=neb_index_counter+1
            impute_value=data.loc[[neb_index]][missing_value_list[f]]
            # NaN != NaN, so this skips neighbours whose value is also missing
            if float(impute_value)!=float(impute_value):
                pass
            else:
                value_list.append(float(impute_value))
        feature_impute_master.append(np.array(value_list))
    total_impute_master.append(np.array(feature_impute_master))
total_impute_master=np.array(total_impute_master)
In [30]:
total_impute=[]
for i in range(len(total_impute_master)):
    feature_impute=[]
    for j in range(len(total_impute_master[i])):
        # count how many of the neighbour values are integers
        intcounter=0
        for k in range(len(total_impute_master[i][j])):
            if total_impute_master[i][j][k]-int(total_impute_master[i][j][k])==0:
                intcounter=intcounter+1
        if intcounter==len(total_impute_master[i][j]):
            # all neighbour values are integers: treat the feature as
            # categorical and impute the most frequent value (mode)
            imputed_value=np.bincount(total_impute_master[i][j].astype(int)).argmax()
        else:
            # otherwise treat the feature as continuous and impute the mean
            imputed_value=np.mean(total_impute_master[i][j])
        feature_impute.append(np.array(imputed_value))
    total_impute.append(np.array(feature_impute))
In [31]:
for f in range(len(missing_value_list)):
    missing_value_indices=data[data[missing_value_list[f]].isnull()].index.tolist()
    for i in range(len(missing_value_indices)):
        data.at[missing_value_indices[i], missing_value_list[f]]=total_impute[f][i]

We can finally recheck if there is any missing value left in the data

In [32]:
data.isna().sum()
Out[32]:
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

Information content in features¶


It is important to quantify the information content in features and to remove features with low information content


  • Features with more variance have a high information content and should be retained during feature selection



  • Information content in categorical variables can also be measured by Shannon Entropy, which quantifies the diversity of the information present in a feature
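The idea behind Shannon entropy, H = -Σ p·log2(p), can be illustrated on synthetic counts. The helper below is a minimal sketch (the tutorial itself uses scipy.stats.entropy):

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())

# A balanced binary feature carries one full bit of information ...
print(shannon_entropy([250, 250]))    # 1.0
# ... while a highly skewed one carries much less
print(shannon_entropy([490, 10]))     # ≈ 0.141
```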

Quantifying information content of features¶

We now calculate the information content of categorical and continuous features separately.

In [33]:
nom_features=['gender','ever_married','work_type','Residence_type']
ord_features=['hypertension','heart_disease','stroke','smoking_status']
cont_features=['age','avg_glucose_level', 'bmi']

For categorical features we use the Shannon entropy, normalized here by the number of categories minus one.

In [34]:
from scipy.stats import entropy
cat_feature_entropies=[]
for feature in nom_features+ord_features:
    x=entropy(data[feature].value_counts(), base=2)/(len(data[feature].value_counts().index)-1)
    cat_feature_entropies.append(x)
    print(feature+' '+str(x))
cat_feature_entropies=np.array(cat_feature_entropies)
gender 0.49840927538558183
ever_married 0.9189610585062931
work_type 0.5382958922815603
Residence_type 0.9995844639313985
hypertension 0.462623491762443
heart_disease 0.34312290713202037
stroke 0.31135737042928313
smoking_status 0.6426183335743486

For continuous data we first standardize the data (each feature is transformed to mean zero and variance one). Then we calculate the range (absolute value of max-min) for every feature

In [35]:
cont_feature_spread=[]
for feature in cont_features:
    standardized=(data[feature]-data[feature].mean())/data[feature].std()
    x=abs(standardized.max()-standardized.min())
    cont_feature_spread.append(x)
    print(feature+'  '+str(x))
cont_feature_spread=np.array(cont_feature_spread)
age  3.661163007128385
avg_glucose_level  4.7189204035826195
bmi  7.815372101445886

Finally we recheck the preprocessed data

In [36]:
data.head(10)
Out[36]:
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
id
40041 0 31.00 0 0 0 1 1 64.85 23.000000 -1 0
55244 0 40.00 0 0 1 1 1 65.29 28.300000 0 0
70992 1 8.00 0 0 0 2 0 74.42 22.500000 -1 0
38207 1 79.00 1 0 1 1 1 76.64 19.500000 0 0
8541 1 75.00 0 0 1 3 1 94.77 27.200000 0 0
2467 1 79.00 1 0 1 1 1 92.43 28.083333 0 0
19828 1 56.00 1 0 1 0 1 97.37 34.100000 2 0
621 0 69.00 0 0 1 0 1 101.52 26.800000 2 0
54975 0 7.00 0 0 0 1 1 64.06 18.900000 -1 0
57485 1 1.48 0 0 0 2 1 55.51 18.500000 -1 0
In [37]:
data.shape
Out[37]:
(500, 11)

ML in a nutshell¶


Similarity measures¶


Numerous similarity measures exist and are useful in various situations


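A few common measures can be compared directly via scipy.spatial.distance; a small sketch on toy vectors (illustrative values only):

```python
import numpy as np
from scipy.spatial import distance

# Two toy patient vectors (illustrative values, not from the dataset)
u = np.array([1.0, 0.0, 2.5])
v = np.array([0.0, 0.0, 2.5])

# Different measures emphasise different notions of (dis)similarity
print(distance.euclidean(u, v))   # straight-line distance
print(distance.cityblock(u, v))   # sum of absolute differences
print(distance.hamming(u, v))     # fraction of coordinates that differ
print(distance.cosine(u, v))      # 1 - cosine of the angle between u and v
```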

Clustering algorithms¶


  • The goal of clustering is to find mutually similar (based on some similarity measure) groups in the data
  • Each algorithm assigns a cluster to every data point



Finding clusters in the data¶

DBSCAN Clustering¶

Parameter settings: eps=1.5, min_samples=20

There are four steps to using an ML model in scikit-learn

  • Import the model from the library
  • Specify model parameters (the decision making of the model depends on the choice of parameters)
  • Fit the model to the data
  • Use the fitted model for prediction
In [38]:
from sklearn.cluster import DBSCAN
In [39]:
dbscan = DBSCAN(eps=1.5, min_samples = 20)
In [40]:
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
In [41]:
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)

Now we look at the distribution of clusters obtained by DBSCAN under these parameter settings. Note that DBSCAN labels noise points as -1; since we print the label plus one, 'Cluster 0' below corresponds to the noise points.

In [42]:
for i in range(len(values_dbscan)):
    print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 329
Cluster 1 counts: 34
Cluster 2 counts: 53
Cluster 3 counts: 37
Cluster 4 counts: 47

Note that if we change the parameter settings, the model prediction changes

Parameter settings: eps=2, min_samples=10

In [43]:
dbscan = DBSCAN(eps=2, min_samples = 10)
clusters_dbscan=dbscan.fit_predict((data-data.mean())/data.std())
(values_dbscan,counts_dbscan) = np.unique(clusters_dbscan,return_counts=True)
for i in range(len(values_dbscan)):
    print('Cluster '+str(values_dbscan[i]+1)+' counts: '+str(counts_dbscan[i]))
Cluster 0 counts: 133
Cluster 1 counts: 23
Cluster 2 counts: 85
Cluster 3 counts: 94
Cluster 4 counts: 67
Cluster 5 counts: 33
Cluster 6 counts: 65

Agglomerative Clustering¶

Parameter settings: n_clusters=5, affinity="euclidean", linkage="ward"

In [44]:
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=5,affinity="euclidean",linkage="ward")
clusters_agg=agg.fit_predict((data-data.mean())/data.std())
(values_agg,counts_agg) = np.unique(clusters_agg,return_counts=True)
for i in range(len(values_agg)):
    print('Cluster '+str(values_agg[i]+1)+' counts: '+str(counts_agg[i]))
Cluster 1 counts: 256
Cluster 2 counts: 153
Cluster 3 counts: 28
Cluster 4 counts: 26
Cluster 5 counts: 37

Diversity in Clustering algorithms¶



Visualizing high-dimensional data using dimension reduction¶


  • The strategy is to build a projection of the high-dimensional data into lower dimensions, keeping the information content, in terms of similarity among data points, as intact as possible
  • Recall that the more spread the data has along some direction, the higher the information content: hence PCA

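The link between spread and information content can be seen in PCA's explained variance ratio; a minimal sketch on synthetic data (not the patient data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 5-dimensional data in which most variance lies along one axis
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
X[:, 0] *= 5.0                      # inflate the spread of one direction

pca = PCA(n_components=2)
pca.fit(X)
# Fraction of the total variance each principal component captures;
# a high leading value means the 2-D projection retains most information
print(pca.explained_variance_ratio_)
```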

Principal Component Analysis (PCA)¶

In [45]:
from sklearn.decomposition import PCA
In [46]:
pca = PCA(n_components=2)
In [47]:
low_dim_PCA=pca.fit_transform((data-data.mean())/data.std())
In [48]:
result_PCA = pd.DataFrame(data = low_dim_PCA , 
        columns = ['PCA_1', 'PCA_2'])
In [49]:
result_PCA['DBSCAN_clusters']=clusters_dbscan+1
result_PCA['Agglomerative_clusters']=clusters_agg+1
result_PCA['stroke']=np.array(data['stroke'])
In [50]:
result_PCA.head(10)
Out[50]:
PCA_1 PCA_2 DBSCAN_clusters Agglomerative_clusters stroke
0 -1.967599 0.224082 1 2 0
1 -0.184169 -0.562288 2 1 0
2 -2.879908 0.704221 3 2 0
3 0.916206 -0.483744 0 5 0
4 0.053280 -0.214408 4 1 0
5 1.418248 -0.577006 0 5 0
6 2.038984 -1.518028 0 5 0
7 1.444600 -1.048502 2 1 0
8 -2.727780 0.335220 1 2 0
9 -3.343449 0.073061 3 2 0
In [51]:
import seaborn as sns
from matplotlib import pyplot as plt
Visualizing clusters obtained by DBSCAN on the PCA embeddings¶
In [52]:
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="PCA_1", y="PCA_2",
  data=result_PCA, 
  fit_reg=False, 
  legend=True,
  hue='DBSCAN_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing clusters obtained by Agglomerative clustering on the PCA embeddings¶
In [53]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="PCA_1", y="PCA_2",
  data=result_PCA, 
  fit_reg=False, 
  legend=True,
  hue='Agglomerative_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing the distribution of the target variable on the PCA embedding¶
In [54]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="PCA_1", y="PCA_2",
  data=result_PCA, 
  fit_reg=False, 
  legend=True,
  hue='stroke', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()

Non-linear dimension reduction strategies through Manifold learning¶



Uniform Manifold Approximation and Projection (UMAP)¶

In [55]:
import umap.umap_ as umap
In [56]:
low_dim_UMAP=umap.UMAP(n_neighbors=5, min_dist=0.1, n_components=2, random_state=42).fit_transform((data-data.mean())/data.std())
In [57]:
result_UMAP = pd.DataFrame(data = low_dim_UMAP , 
        columns = ['UMAP_1', 'UMAP_2'])
In [58]:
result_UMAP['DBSCAN_clusters']=clusters_dbscan+1
result_UMAP['Agglomerative_clusters']=clusters_agg+1
result_UMAP['stroke']=np.array(data['stroke'])
In [59]:
result_UMAP.head(10)
Out[59]:
UMAP_1 UMAP_2 DBSCAN_clusters Agglomerative_clusters stroke
0 5.640380 -2.627495 1 2 0
1 0.234862 4.020671 2 1 0
2 14.669820 18.462421 3 2 0
3 2.900954 14.992955 0 5 0
4 19.073860 7.344819 4 1 0
5 2.926433 14.942604 0 5 0
6 3.173262 14.292583 0 5 0
7 -0.270748 4.891105 2 1 0
8 5.536045 -2.657136 1 2 0
9 18.841158 -3.674576 3 2 0
Visualizing clusters obtained by DBSCAN on the UMAP embeddings¶
In [60]:
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='DBSCAN_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing clusters obtained by Agglomerative clustering on the UMAP embeddings¶
In [61]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='Agglomerative_clusters', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing the distribution of the target variable on the UMAP embedding¶
In [62]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='stroke', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
Visualizing the distribution of smoking status on the UMAP embedding¶
In [63]:
smoking_status_dict=rev_dict['smoking_status']
smoking_status_dict=dict((v,k) for k,v in smoking_status_dict.items())

smoking_status_labels=[]
for i in range(len(data['smoking_status'])):
    smoking_status_labels.append(smoking_status_dict[np.array(data['smoking_status'])[i]])
    
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
result_UMAP['smoking_status']=np.array(smoking_status_labels)

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='smoking_status', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
    
Visualizing the distribution of residence type on the UMAP embedding¶
In [64]:
Residence_type_dict=rev_dict['Residence_type']
Residence_type_dict=dict((v,k) for k,v in Residence_type_dict.items())

Residence_type_labels=[]
for i in range(len(data['Residence_type'])):
    Residence_type_labels.append(Residence_type_dict[np.array(data['Residence_type'])[i]])
    
colors_set = ['lightgrey','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))
result_UMAP['Residence_type']=np.array(Residence_type_labels)

sns.lmplot( x="UMAP_1", y="UMAP_2",
  data=result_UMAP, 
  fit_reg=False, 
  legend=True,
  hue='Residence_type', # color by cluster
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()

Feature-type Distributed Clustering (FDC)¶


Bej, S., Sarkar, J., Biswas, S. et al. Identification and epidemiological characterization of Type-2 diabetes sub-population using an unsupervised machine learning approach. Nutr. Diabetes 12, 27 (2022). https://doi.org/10.1038/s41387-022-00206-2

In [65]:
array=np.arange(11)-5

distance_euc=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.euclidean(array[i],array[j])
        d.append(x)
    distance_euc.append(d)
distance_euc=np.array(distance_euc)/np.max(distance_euc)
#distance_euc=np.divide(distance_euc.T,np.max(distance_euc,axis=1)).T

distance_can=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.canberra(array[i],array[j])
        d.append(x)
    distance_can.append(d)
distance_can=np.array(distance_can)/np.max(distance_can)
#distance_can=np.divide(distance_can.T,np.max(distance_can,axis=1)).T

distance_canx=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.canberra(1,np.abs(array[i]-array[j])+1)
        d.append(np.sqrt(x))
    distance_canx.append(d)
distance_canx=np.array(distance_canx)/np.max(distance_canx)

distance_ham=[]
for i in range(len(array)):
    d=[]
    for j in range(len(array)):
        x=distance.hamming(array[i],array[j])
        d.append(x)
    distance_ham.append(d)
distance_ham=np.array(distance_ham)/np.max(distance_ham)
#distance_ham=np.divide(distance_ham.T,np.max(distance_ham,axis=1)).T
In [66]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_euc, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Euclidean', fontsize=20)
plt.show()
In [67]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_ham, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Hamming',fontsize=20)
plt.show()
In [68]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_can, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra',fontsize=20)
plt.show()
In [69]:
plt.subplots(figsize=(8, 6), dpi=600)
ax1 = sns.heatmap(distance_canx, cmap="YlGnBu", annot=True)
ax1.set_xticklabels(array)
ax1.set_yticklabels(array)
plt.title('Canberra modified',fontsize=20)
plt.show()
In [70]:
def canberra_modified(a,b):
    from scipy.spatial import distance
    return np.sqrt(distance.canberra(1,np.abs(a-b)+1))
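For scalar inputs the definition above reduces to a closed form: canberra(1, |a-b|+1) = |a-b| / (|a-b| + 2), so canberra_modified returns sqrt(d/(d+2)), which is 0 for identical values and approaches 1 as the gap grows. A small sketch of this equivalent closed form (the function name here is illustrative):

```python
import numpy as np

def canberra_modified_closed(a, b):
    # Equivalent closed form of canberra_modified for scalar inputs:
    # canberra(1, d+1) = |1-(d+1)| / (1 + (d+1)) = d / (d+2), with d = |a-b|
    d = np.abs(a - b)
    return np.sqrt(d / (d + 2))

print(canberra_modified_closed(3, 3))     # 0.0
print(canberra_modified_closed(0, 4))     # sqrt(4/6) ≈ 0.816
print(canberra_modified_closed(0, 1000))  # close to 1
```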
In [71]:
def feature_clustering(UMAP_neb,min_dist_UMAP, metric, data, visual):
    import umap.umap_ as umap
    np.random.seed(42)
    data_embedded = umap.UMAP(n_neighbors=UMAP_neb, min_dist=min_dist_UMAP, n_components=2, metric=metric, random_state=42).fit_transform(data)
    data_embedded[:,0]=(data_embedded[:,0]- np.mean(data_embedded[:,0]))/np.std(data_embedded[:,0])
    data_embedded[:,1]=(data_embedded[:,1]- np.mean(data_embedded[:,1]))/np.std(data_embedded[:,1])
    result = pd.DataFrame(data = data_embedded , 
        columns = ['UMAP_0', 'UMAP_1'])
    if visual==1:
        sns.lmplot( x="UMAP_0", y="UMAP_1",data=result,fit_reg=False,legend=False,scatter_kws={"s": 20},palette=customPalette_set1) # specify the point size
        #plt.savefig('clusters_umap_all.png', dpi=700, bbox_inches='tight')
        plt.show()
    else:
        pass
    return result



def FDC(data,cont_list,nom_list,ord_list,cont_metric, ord_metric, nom_metric, drop_nominal, visual):
    np.random.seed(42)
    colors_set1 = ["lightcoral", "lightseagreen", "mediumorchid", "orange", "burlywood", "cornflowerblue", "plum", "yellowgreen"]
    customPalette_set1 = sns.set_palette(sns.color_palette(colors_set1))
    cont_df=data[cont_list]
    nom_df=data[nom_list]
    ord_df=data[ord_list]
    cont_emb=feature_clustering(30,0.1, cont_metric, cont_df, 0) #Reducing continuous features into 2 dims
    ord_emb=feature_clustering(30,0.1, ord_metric, ord_df, 0) #Reducing ordinal features into 2 dims
    nom_emb=feature_clustering(30,0.1, nom_metric, nom_df, 0) #Reducing nominal features into 2 dims
    if drop_nominal==1:
        result_concat=pd.concat([ord_emb, cont_emb, nom_emb.drop(['UMAP_1'],axis=1)],axis=1) #concatenating all reduced dimensions to get 5D embeddings (1D from nominal)
    else:
        result_concat=pd.concat([ord_emb, cont_emb, nom_emb],axis=1)
    data_embedded = umap.UMAP(n_neighbors=30, min_dist=0.001, n_components=2, metric='euclidean', random_state=42).fit_transform(result_concat) #reducing 5D embeddings to 2D using UMAP
    result_reduced = pd.DataFrame(data = data_embedded , 
        columns = ['FDC_1', 'FDC_2'])
    
    if visual==1:
        sns.lmplot( x="FDC_1", y="FDC_2",data=result_reduced,fit_reg=False,legend=False,scatter_kws={"s": 20},palette=customPalette_set1) # specify the point size
        plt.show()
        #plt.savefig('clusters_umap_all.png', dpi=700, bbox_inches='tight')
    else:
        pass
    return result_concat, result_reduced #returns both 5D and 2D embeddings
In [72]:
entire_data_FDC_emb_five,entire_data_FDC_emb_two=FDC((data-data.mean())/data.std(),cont_features,nom_features,ord_features,'euclidean',canberra_modified,'hamming',1,0)
Five-dimensional reduction of the original high-dimensional data using FDC¶
In [73]:
entire_data_FDC_emb_five.shape
Out[73]:
(500, 5)
Two-dimensional reduction of the five-dimensional FDC embedding, created to visualize the results¶
In [74]:
entire_data_FDC_emb_two.shape
Out[74]:
(500, 2)
Finding clusters in the reduced 5-D FDC-embedding using DBSCAN¶
In [75]:
dbscan = DBSCAN(eps=1, min_samples = 10)
clusters_dbscan_FDC=dbscan.fit_predict(entire_data_FDC_emb_five)
(values_dbscan_FDC,counts_dbscan_FDC) = np.unique(clusters_dbscan_FDC,return_counts=True)
for i in range(len(values_dbscan_FDC)):
    print('Cluster '+str(values_dbscan_FDC[i]+1)+' counts: '+str(counts_dbscan_FDC[i]))
Cluster 0 counts: 66
Cluster 1 counts: 104
Cluster 2 counts: 135
Cluster 3 counts: 65
Cluster 4 counts: 33
Cluster 5 counts: 57
Cluster 6 counts: 11
Cluster 7 counts: 19
Cluster 8 counts: 10
Finding clusters in the reduced 5-D FDC-embedding using Agglomerative clustering¶
In [76]:
agg = AgglomerativeClustering(n_clusters=8,affinity="euclidean",linkage="ward")
clusters_agg_FDC=agg.fit_predict(entire_data_FDC_emb_five)
(values_agg_FDC,counts_agg_FDC) = np.unique(clusters_agg_FDC,return_counts=True)
for i in range(len(values_agg_FDC)):
    print('Cluster '+str(values_agg_FDC[i])+' counts: '+str(counts_agg_FDC[i]))
Cluster 0 counts: 45
Cluster 1 counts: 67
Cluster 2 counts: 53
Cluster 3 counts: 66
Cluster 4 counts: 75
Cluster 5 counts: 71
Cluster 6 counts: 69
Cluster 7 counts: 54
In [77]:
entire_data_FDC_emb_two['DBSCAN_clusters']=clusters_dbscan_FDC+1
entire_data_FDC_emb_two['Agglomerative_clusters']=clusters_agg_FDC+1
entire_data_FDC_emb_two['stroke']=np.array(data['stroke'])
Visualizing clusters obtained by DBSCAN on the FDC embeddings¶
In [78]:
colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise",  "lightgreen",  'burlywood','lightsteelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="FDC_1", y="FDC_2",
  data=entire_data_FDC_emb_two, 
  fit_reg=False, 
  legend=True,
  hue='DBSCAN_clusters', # color by cluster
  scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
Visualizing clusters obtained by Agglomerative clustering on the FDC embeddings¶
In [79]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise",  "lightgreen",  'burlywood','lightsteelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="FDC_1", y="FDC_2",
  data=entire_data_FDC_emb_two, 
  fit_reg=False, 
  legend=True,
  hue='Agglomerative_clusters', # color by cluster
  scatter_kws={"s": 20}, palette=customPalette_set) # specify the point size
plt.show()
[Figure: 2-D FDC embedding colored by Agglomerative cluster]
Visualizing the distribution of the target variable on the FDC embedding¶
In [80]:
colors_set = ['lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue']
customPalette_set = sns.set_palette(sns.color_palette(colors_set))

sns.lmplot( x="FDC_1", y="FDC_2",
  data=entire_data_FDC_emb_two, 
  fit_reg=False, 
  legend=True,
  hue='stroke', # color by stroke status
  scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
plt.show()
[Figure: 2-D FDC embedding colored by stroke status]
Clusters 3, 5, 6 and 8 obtained from DBSCAN clustering are the ones that correspond most closely to the target variable¶
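This visual correspondence can also be checked numerically with a contingency table of cluster labels against the target. A sketch on made-up labels (in the tutorial the inputs would be the shifted DBSCAN labels and `data['stroke']`):

```python
# Contingency table of cluster label vs. target on toy labels
import numpy as np
import pandas as pd

clusters_demo = np.array([1, 1, 2, 2, 3, 3, 3])
stroke_demo   = np.array([0, 0, 0, 1, 1, 1, 0])
ct = pd.crosstab(clusters_demo, stroke_demo,
                 rownames=['cluster'], colnames=['stroke'])
print(ct)  # rows: clusters, columns: stroke status counts
```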

Interpretation of the analysis using Hypothesis testing¶

Analysis of distribution of features over clusters¶
In [81]:
data['Clusters']=np.array(clusters_dbscan_FDC)
In [82]:
cluster_df_list=[]
for cluster in values_dbscan_FDC:
    cluster_df=data.loc[data['Clusters'] == cluster].drop(columns=['Clusters'])
    cluster_df.columns=list(data.columns)[:-1]
    cluster_df_list.append(cluster_df)
In [83]:
cluster_df_list=cluster_df_list[1:] # drop the DBSCAN noise cluster (label -1)
In [84]:
data=data.drop(['Clusters'],axis=1)
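The three cells above split the DataFrame cluster by cluster; the same result can be obtained more idiomatically with pandas `groupby`. A sketch on toy data (not the tutorial's DataFrame):

```python
# Splitting a DataFrame into one sub-frame per cluster via groupby
import pandas as pd

df = pd.DataFrame({'age': [10, 20, 30, 40],
                   'Clusters': [0, 0, 1, 1]})
cluster_dfs = [g.drop(columns=['Clusters'])
               for _, g in df.groupby('Clusters')]
print(len(cluster_dfs), list(cluster_dfs[0]['age']))
```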
In [85]:
import matplotlib as mpl
from scipy import stats
from scipy.stats import ranksums  # imported here so p_val works for all feature types


def feature_p_val(feature,cluster_df_list,var_type):
    # pairwise p-values of one feature between all pairs of clusters
    num_clusters=len(cluster_df_list)
    p_list_all_clusters=[]
    for i in range(num_clusters):
        p_list=[]
        for j in range(num_clusters):
            p=p_val(cluster_df_list[i],cluster_df_list[j],feature,var_type)
            p_list.append(p)
        p_list=np.array(p_list)
        p_list_all_clusters.append(p_list)  
    return(np.array(p_list_all_clusters))

def p_val(clustera,clusterb,feature,var_type):
    # t-test for continuous features, Wilcoxon rank-sum test otherwise
    if var_type=='cont':
        p=stats.ttest_ind(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
    else:
        p=ranksums(np.array(clustera[feature]), np.array(clusterb[feature])).pvalue
    return p

class MidpointNormalize(mpl.colors.Normalize):
    def __init__(self, vmin, vmax, midpoint=0, clip=False):
        self.midpoint = midpoint
        mpl.colors.Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        normalized_min = max(0, 1 / 2 * (1 - abs((self.midpoint - self.vmin) / (self.midpoint - self.vmax))))
        normalized_max = min(1, 1 / 2 * (1 + abs((self.vmax - self.midpoint) / (self.midpoint - self.vmin))))
        normalized_mid = 0.5
        x, y = [self.vmin, self.midpoint, self.vmax], [normalized_min, normalized_mid, normalized_max]
        return np.ma.masked_array(np.interp(value, x, y))

def p_map(feature,var_type):
    p_matrix = feature_p_val(feature,cluster_df_list,var_type)  # compute the matrix once
    n = len(cluster_df_list)
    heatmap, ax = plt.subplots(figsize=(8, 8), dpi=600)
    norm = MidpointNormalize(vmin=0, vmax=1, midpoint=0.5)
    im = ax.imshow(p_matrix, cmap='coolwarm', norm=norm)
    ax.set_xticks(range(n))
    ax.set_yticks(range(n))
    ax.set_xticklabels(['C'+str(k+1) for k in range(n)])
    ax.set_yticklabels(['C'+str(k+1) for k in range(n)])

    for y in range(n):
        for x in range(n):
            plt.text(x , y , '%.2f' % p_matrix[y, x], horizontalalignment='center',verticalalignment='center',fontsize=8)

    cbar = heatmap.colorbar(im)
    cbar.ax.set_ylabel('p-value')
    plt.title(feature.upper(), fontsize=16)
    print('\n')
    plt.show()
    
    return
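The two tests wrapped by `p_val` behave as expected on synthetic samples: identical distributions give a large p-value, shifted ones a tiny one. A self-contained sketch (toy samples, not the cluster data):

```python
# Sanity check of the two tests used in p_val on synthetic samples
import numpy as np
from scipy import stats
from scipy.stats import ranksums

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 200)   # same distribution as b
b = rng.normal(0.0, 1.0, 200)   # -> large p-value expected
c = rng.normal(2.0, 1.0, 200)   # shifted by 2 sigma -> tiny p-value

p_same = stats.ttest_ind(a, b).pvalue
p_diff = stats.ttest_ind(a, c).pvalue
p_rank = ranksums(a, c).pvalue  # non-parametric analogue
print(p_same, p_diff, p_rank)
```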
In [86]:
for feature in cont_features:
    #print(feature.upper())
    p_map(feature,'cont')

[Figures: pairwise p-value heatmaps, one per continuous feature]
In [87]:
from scipy.stats import ranksums
for feature in ord_features+nom_features:
    #print(feature.upper())
    p_map(feature,'ord')

[Figures: pairwise p-value heatmaps, one per ordinal and nominal feature]
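A caveat when reading these heatmaps: with eight clusters, each feature involves dozens of pairwise tests, so a few small p-values are expected by chance alone. The tutorial does not correct for this, but a multiple-testing adjustment is straightforward; a hedged sketch using statsmodels (an extra dependency, not used in the workflow above):

```python
# Bonferroni correction of a vector of pairwise p-values
# (illustrative p-values, not taken from the heatmaps above)
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.04, 0.20, 0.03, 0.0005])
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05,
                                    method='bonferroni')
print(reject)   # which null hypotheses survive correction
print(p_adj)    # p-values multiplied by the number of tests, capped at 1
```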

Visualizing the distribution of significant features over clusters¶

In [90]:
def vizx(feature_list, cluster_df_list, main_data,umap_data,cont_features):
    vizlimit=15
    plt.rcParams["figure.figsize"] = (12,6)
    
    
    
    for feature in feature_list:
        print('Feature name:', feature.upper())
        print('\n')
    
        if len(main_data[feature].value_counts())<=vizlimit:
            cluster_counter=1
            for cluster in range(len(cluster_df_list)):
                print('Cluster '+ str(cluster_counter)+ ' frequency distribution')
                if feature in list(rev_dict.keys()):
                    feat_keys=rev_dict[feature]
                    r=dict(zip(feat_keys.values(), feat_keys.keys()))
                    print(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
                else:
                    print(cluster_df_list[cluster][feature].value_counts())
                cluster_counter=cluster_counter+1
                print('\n')
        
        print('\n')
        print('\n')
        
        col=sns.color_palette("Set2")
        
        cluster_bar=[]     
        for cluster in range(len(cluster_df_list)):
            if len(main_data[feature].value_counts())<=vizlimit:
                if feature in list(rev_dict.keys()):
                    
                    
                    y=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts())
                    x=np.array(cluster_df_list[cluster].replace({feature:r})[feature].value_counts().index)
                    cluster_bar.append([x,y])
                    
                   
                else:
                   
                    y=np.array(cluster_df_list[cluster][feature].value_counts().sort_index())
                    x=np.array(cluster_df_list[cluster][feature].value_counts().sort_index().index)
                    cluster_bar.append([x,y])
            
                
                
        cluster_bar=np.array(cluster_bar)
        
        rows=3
        columns=3
    
        if len(main_data[feature].value_counts())<=vizlimit:
            figx, ax = plt.subplots(rows, columns)
            figx.set_size_inches(10.5, 28.5)
            c=0
            for i in range(rows):
                for j in range(columns):
                    if c >= len(cluster_df_list):
                        break
                    ax[i,j].bar(cluster_bar[c,0],cluster_bar[c,1],color=col)
                    ax[i,j].tick_params(axis='x', which='major', labelsize=8, rotation= 90)
                    ax[i,j].set_title('Cluster: '+str(c+1))
                    c=c+1
            
        means=[]
        sds=[]
        cluster_labels=[]
        cluster_counter=1
        for cluster in range(len(cluster_df_list)):
            if feature in cont_features:
                print('Cluster '+ str(cluster_counter)+ ' summary statistics')
                print('\n')
                cm=cluster_df_list[cluster][feature].mean()
                cs=cluster_df_list[cluster][feature].std()
                print('feature mean:', cm)
                print('feature standard deviation:', cs)
                print('feature median:', cluster_df_list[cluster][feature].median())
                print('\n')
                means.append(cm)
                sds.append(cs)
                cluster_labels.append('C'+str(cluster_counter))
            cluster_counter=cluster_counter+1
            
            
        means=np.array(means)
        sds=np.array(sds)
        cluster_labels=np.array(cluster_labels)
        
        print('\n')  
        
        print('Distribution of feature across clusters')
        if feature in cont_features:   
            fig, ax7 = plt.subplots()
            ax7.bar(cluster_labels,means,yerr=sds,color=sns.color_palette("Set3"))
            ax7.tick_params(axis='both', which='major', labelsize=10)
            plt.xlabel(feature, fontsize=15)        
            plt.show()
        
        print('\n')
        print('\n')
        
        colors_set = ['lightgray','lightcoral','cornflowerblue','orange','mediumorchid', 'lightseagreen','olive', 'chocolate','steelblue',"paleturquoise",  "lightgreen",  'burlywood','lightsteelblue']
        customPalette_set = sns.set_palette(sns.color_palette(colors_set))
        
        if feature not in cont_features:
            print('Feature distribution in UMAP embedding')
            if feature in list(rev_dict.keys()):
                umap_data[feature]=np.array(main_data.replace({feature:r})[feature])
            else:
                umap_data[feature]=np.array(main_data[feature])
            sns.lmplot( x="FDC_1", y="FDC_2",
              data=umap_data, 
              fit_reg=False, 
              legend=True,
              hue=feature, # color by cluster
              scatter_kws={"s": 20},palette=customPalette_set) # specify the point size
            plt.show()
            #plt.savefig('clusters_umap_fin.png', dpi=700, bbox_inches='tight')
        
        print('\n')
        print('\n')
        
In [91]:
vizx(['age','Residence_type','smoking_status'], cluster_df_list, data, entire_data_FDC_emb_two ,cont_features)
Feature name: AGE






Cluster 1 summary statistics


feature mean: 17.735384615384618
feature standard deviation: 17.664356570634503
feature median: 10.0


Cluster 2 summary statistics


feature mean: 42.4
feature standard deviation: 18.325090865364917
feature median: 40.0


Cluster 3 summary statistics


feature mean: 63.53846153846154
feature standard deviation: 12.43995192298402
feature median: 64.0


Cluster 4 summary statistics


feature mean: 49.878787878787875
feature standard deviation: 17.122276381510968
feature median: 53.0


Cluster 5 summary statistics


feature mean: 40.6140350877193
feature standard deviation: 14.848947035738778
feature median: 43.0


Cluster 6 summary statistics


feature mean: 68.63636363636364
feature standard deviation: 13.574039393435747
feature median: 70.0


Cluster 7 summary statistics


feature mean: 61.94736842105263
feature standard deviation: 8.585994307336456
feature median: 62.0


Cluster 8 summary statistics


feature mean: 66.2
feature standard deviation: 12.934020600296293
feature median: 69.5




Distribution of feature across clusters
[Figure: mean age per cluster with standard-deviation error bars]







Feature name: RESIDENCE_TYPE


Cluster 1 frequency distribution
Rural    55
Urban    49
Name: Residence_type, dtype: int64


Cluster 2 frequency distribution
Rural    77
Urban    58
Name: Residence_type, dtype: int64


Cluster 3 frequency distribution
Urban    36
Rural    29
Name: Residence_type, dtype: int64


Cluster 4 frequency distribution
Rural    33
Name: Residence_type, dtype: int64


Cluster 5 frequency distribution
Rural    30
Urban    27
Name: Residence_type, dtype: int64


Cluster 6 frequency distribution
Rural    11
Name: Residence_type, dtype: int64


Cluster 7 frequency distribution
Urban    19
Name: Residence_type, dtype: int64


Cluster 8 frequency distribution
Urban    10
Name: Residence_type, dtype: int64








Distribution of feature across clusters




Feature distribution in UMAP embedding
[Figures: per-cluster Residence_type frequency bars; Residence_type over the 2-D FDC embedding]



Feature name: SMOKING_STATUS


Cluster 1 frequency distribution
Unknown    104
Name: smoking_status, dtype: int64


Cluster 2 frequency distribution
never smoked    135
Name: smoking_status, dtype: int64


Cluster 3 frequency distribution
never smoked       27
smokes             17
formerly smoked    11
Unknown            10
Name: smoking_status, dtype: int64


Cluster 4 frequency distribution
formerly smoked    33
Name: smoking_status, dtype: int64


Cluster 5 frequency distribution
smokes    57
Name: smoking_status, dtype: int64


Cluster 6 frequency distribution
never smoked       6
Unknown            2
formerly smoked    2
smokes             1
Name: smoking_status, dtype: int64


Cluster 7 frequency distribution
formerly smoked    19
Name: smoking_status, dtype: int64


Cluster 8 frequency distribution
formerly smoked    5
never smoked       3
Unknown            1
smokes             1
Name: smoking_status, dtype: int64








Distribution of feature across clusters




Feature distribution in UMAP embedding
[Figures: per-cluster smoking_status frequency bars; smoking_status over the 2-D FDC embedding]



Take-home messages¶


  • The idea of machine learning is to represent real-world data, be it tabular, image, or text, in terms of numbers
  • These representations have to be designed so that the sets of numbers representing similar objects are close to each other and those representing dissimilar objects are far apart. Machine Learning algorithms are designed to achieve this basic goal
  • Tabular data used for clinical decision making with Machine Learning has diverse feature types
  • It is crucial to analyze the patterns present in the data to achieve explainable decision-making
  • Pre-processing is necessary to select relevant features, impute missing values and convert the raw data into a refined numeric table that is then used for Machine Learning
  • Clustering algorithms help us group similar data points in high-dimensional data
  • Since we cannot visualize high-dimensional data directly, dimension-reduction algorithms are used to create lower-dimensional embeddings of the data for visualization
  • A basic strategy is to observe the data along the lower dimensions that capture the maximum information (highest variance). This is achieved by PCA
  • There are non-linear strategies for dimension reduction that rely on non-linear measures of similarity, such as UMAP
  • The Feature-type Distributed Clustering (FDC) workflow can further improve clustering on tabular patient data by taking into account separate similarity measures for the different variable types
