A distance measure determines how similar two elements are, and it influences the shape of the clusters.
Common ways to measure distance include:
- Euclidean distance measure
- Squared Euclidean distance measure
- Manhattan distance measure
- Cosine distance measure
Euclidean Distance Measure
The most common way to measure distance is the straight-line distance between two points. Let’s say we have a point P and a point Q: the Euclidean distance is the length of the direct straight line between them.
The formula for the distance between two points P = (p_1, ..., p_n) and Q = (q_1, ..., q_n) is shown below:
$$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
When there are more than two dimensions, we square the difference between the points in each dimension, add up those squares, and then take the square root of the sum to get the actual distance between them.
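As a minimal sketch of the calculation described above, the following uses only the Python standard library. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the original article.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference in each dimension, add the squares, then take the root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical sample points: the differences are 3, 4 and 0.
p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)
print(euclidean_distance(p, q))  # sqrt(9 + 16 + 0) = 5.0
```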
Squared Euclidean Distance Measurement
This is identical to the Euclidean measure, except we don’t take the square root at the end. With p_i and q_i as the coordinates of P and Q in dimension i, the formula is shown below:
$$d^2(P, Q) = \sum_{i=1}^{n} (p_i - q_i)^2$$
When we only need to know which points are farther apart and which are closer together, comparing squared distances gives the same answer as comparing actual distances.
While the Euclidean method gives us the exact distance, the squared value is no longer that distance. Squaring doesn’t change which of two distances is smaller and which is larger, though, and removing the square root can make the computation faster.
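To illustrate why dropping the square root does not change comparisons, here is a small sketch that picks the nearest of several candidate points using both measures. The helper names (`squared_euclidean`, `nearest`) and the sample `centroids` are assumptions made for this example.

```python
import math

def squared_euclidean(p, q):
    """Sum of squared per-dimension differences (no square root)."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def nearest(point, candidates, dist):
    """Index of the candidate with the smallest distance to point."""
    return min(range(len(candidates)), key=lambda i: dist(point, candidates[i]))

# Hypothetical data: one point and three candidate cluster centers.
point = (2.0, 3.0)
centroids = [(0.0, 0.0), (3.0, 3.0), (10.0, 1.0)]

by_squared = nearest(point, centroids, squared_euclidean)
by_exact = nearest(point, centroids,
                   lambda p, q: math.sqrt(squared_euclidean(p, q)))
print(by_squared, by_exact)  # both measures select index 1
```

The squared distances here are 13, 1 and 68, so the ordering is the same as for their square roots.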
Manhattan Distance Measurement
This method is a simple sum of the horizontal and vertical components, that is, the distance between two points measured along axes at right angles.
With p_i and q_i as the coordinates of P and Q in dimension i, the formula is shown below:
$$d(P, Q) = \sum_{i=1}^{n} |p_i - q_i|$$
This method is different because you’re not looking at the direct line, and in certain cases, summing the individual per-axis distances will give you a better result.
Most of the time, you’ll go with the squared Euclidean method because it’s faster. But when using the Manhattan distance, you take the absolute value of the X difference and of the Y difference and add them together.
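The Manhattan calculation can be sketched in a few lines. The function name `manhattan_distance` and the sample points are illustrative assumptions.

```python
def manhattan_distance(p, q):
    """Sum of absolute per-axis differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical sample points: |1 - 4| = 3 along X and |2 - 6| = 4 along Y.
p = (1.0, 2.0)
q = (4.0, 6.0)
print(manhattan_distance(p, q))  # 3 + 4 = 7.0
```

For the same two points the straight-line Euclidean distance is 5.0, so the Manhattan value is larger because it follows the axes rather than the diagonal.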
Cosine Distance Measure
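The cosine distance measure is based on the angle between two vectors A and B: it is one minus the cosine of that angle, so only direction matters, not length. The formula is:
$$d(A, B) = 1 - \frac{A \cdot B}{\|A\| \, \|B\|}$$
As the angle between the two vectors widens, the cosine distance becomes greater. When the vectors are scaled to unit length, this method ranks pairs the same way as the Euclidean distance measure, and you can expect to get similar results with both of them.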
Note that the Manhattan measurement method can produce a very different result. You can end up with bias if your data is very skewed or if the two sets of values differ dramatically in scale.
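The sketch below shows one way to compute cosine distance and contrasts it with the Manhattan distance on vectors of very different size. It assumes the common definition of cosine distance as one minus cosine similarity; the function names and sample vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def manhattan_distance(a, b):
    """Sum of absolute per-axis differences, for comparison."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical vectors: same direction but different length, and a perpendicular one.
a = (1.0, 0.0)
same_direction = (5.0, 0.0)
perpendicular = (0.0, 2.0)

print(cosine_distance(a, same_direction))     # 0.0, because the angle is zero
print(cosine_distance(a, perpendicular))      # 1.0, because the vectors are at right angles
print(manhattan_distance(a, same_direction))  # 4.0, because the size difference counts
```

Cosine distance treats the two same-direction vectors as identical, while the Manhattan distance reports a gap driven purely by their difference in size.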
Let us now take a detailed look at the types of hierarchical clustering, starting with agglomerative clustering.