Quick Reminder: Clustering

Last updated:

Traditional (non-hierarchical) clusters, such as K-Means:

Need to be given the number N of clusters up front, along with the initial cluster positions (centroids), which are usually picked at random or by a seeding scheme such as k-means++.
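As a rough illustration of why both inputs are needed, here is a minimal sketch of the standard K-Means loop (Lloyd's algorithm) in plain Python. The function name `kmeans`, the `max_iter` parameter, and the toy points are assumptions made for this example, not part of any library.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm. The caller supplies the initial centroids,
    so the number of clusters is fixed at len(centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Toy data (made up): two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

Passing a different number of starting centroids, or different starting positions, can change the result, which is why the initialization matters.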

Hierarchical clusters

No need to specify the number of clusters or their initial positions, but you need to specify the linkage type.

Can be agglomerative (bottom-up) or divisive (top-down).
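A minimal sketch of the agglomerative (bottom-up) variant, using `scipy.cluster.hierarchy.linkage`. The toy data and the cut-off distance of 3.0 are assumptions chosen for illustration. Only the linkage type is passed when building the tree; the number of flat clusters is determined afterwards by where the tree is cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (made up): two well-separated groups of three points.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])

# Build the full merge tree bottom-up; only the linkage type is specified.
Z = linkage(X, method="single")

# Each row of Z records one merge: (cluster a, cluster b, distance, new size).
print(Z)

# Flat clusters are obtained afterwards by cutting the tree at a chosen distance.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```

With n points the tree always contains n - 1 merges, so `Z` here has five rows.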

Linkage

Linkage is the rule used to measure dissimilarity (distance) between two clusters, based on the distances between their points.

  • single linkage: distance between two groups is the smallest distance between two points in these groups.

    • Prone to chaining: elements at opposite ends of a cluster may be much farther from each other than from elements of other clusters.
  • complete linkage: distance between two groups is the largest distance between two points in these groups.

    • Favours compact clusters with small diameters over long, straggly clusters.
    • Sensitive to outliers.
  • average linkage: distance between two groups is the average distance over all pairs of points, one taken from each group.

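  • ward linkage: distance between two groups is the increase in the total within-group sum of squared distances (each point to its group's centroid) that merging the two groups would cause.

    • Similar to K-means, in that it tries to keep within-cluster variance small.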
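The four definitions above can be sketched in plain Python as follows. The function names and the two toy groups are assumptions made for this example. The Ward function returns the raw increase in the within-group sum of squares; library implementations may report a rescaled version of this quantity.

```python
import math
from itertools import product

def centroid(group):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(group) for c in zip(*group))

def sse(group):
    """Sum of squared distances from each point to the group's centroid."""
    c = centroid(group)
    return sum(math.dist(p, c) ** 2 for p in group)

def single(a, b):
    # Smallest distance over all cross-group pairs.
    return min(math.dist(p, q) for p, q in product(a, b))

def complete(a, b):
    # Largest distance over all cross-group pairs.
    return max(math.dist(p, q) for p, q in product(a, b))

def average(a, b):
    # Mean distance over all cross-group pairs.
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def ward_increase(a, b):
    # Increase in total within-group SSE caused by merging a and b.
    return sse(a + b) - sse(a) - sse(b)

# Toy groups (made up for illustration).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 4.0)]

for name, fn in [("single", single), ("complete", complete),
                 ("average", average), ("ward", ward_increase)]:
    print(name, round(fn(a, b), 3))
```

For these two groups the closest cross-group pair is 3 apart and the farthest is 5 apart, so single linkage gives 3 and complete linkage gives 5, with the average in between (about 3.8). Merging them raises the within-group sum of squares from 2 + 8 to 20, a Ward increase of 10.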

Dendrograms

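A dendrogram is a tree diagram representing hierarchical clusters.

The y-axis indicates dissimilarity: the height at which two branches join is the distance between the clusters being merged.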

E.g., in the following picture:

  • the dissimilarity between android and all other concepts is a little over 1600.

  • the dissimilarity between php and javascript is around 1400.

  • the dissimilarity between c# and java is a little over 1200.

Figure: Sample dendrogram (single linkage) of programming concepts.
Created using scipy.cluster.hierarchy.dendrogram
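A minimal sketch of how such a dendrogram could be produced and read with `scipy.cluster.hierarchy.dendrogram`. The four labelled points are hypothetical toy data, not the data behind the figure above. `no_plot=True` computes the layout without drawing, so matplotlib is not required; dropping it draws the tree on the current matplotlib axes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data: four labelled 1-D points (not the data from the figure).
labels = ["a", "b", "c", "d"]
X = np.array([[0.0], [1.0], [5.0], [12.0]])

Z = linkage(X, method="single")

# Compute the dendrogram layout only; remove no_plot=True to draw it.
tree = dendrogram(Z, labels=labels, no_plot=True)

# Z[:, 2] holds the merge heights, i.e. the dissimilarities read off the y-axis.
for left, right, height, size in Z:
    print(f"merge {int(left)} + {int(right)} at height {height:.1f} ({int(size)} leaves)")

# Leaf labels in left-to-right plotting order.
print(tree["ivl"])
```

Here "a" and "b" join at height 1, "c" joins them at height 4 (its distance to the nearest member, "b"), and "d" joins last at height 7, which is how the values in the bullets above would be read from the picture.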

