Distance Metrics in Machine Learning in Python, in Short
Distance Metrics
Distance metrics are used in both supervised and unsupervised learning, generally to calculate the similarity between data points.
Types of Distance Metrics in Machine Learning
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance
- Hamming Distance
- Cosine Distance
A few machine learning algorithms that use distance metrics:
- Clustering Algorithms (For example, K Means etc.)
- Classification Algorithms (For example, KNN Classification etc.)
Euclidean Distance
Euclidean Distance represents the straight-line (shortest) distance between two points.
Many machine learning algorithms, including K-Means, use this distance metric to measure the similarity between observations.
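A minimal sketch of the calculation, using two hypothetical points p1 and p2 and only the Python standard library:

```python
import math

# Hypothetical example points in 2D space
p1 = (1, 2)
p2 = (4, 6)

# Euclidean distance: sqrt((4-1)^2 + (6-2)^2) = sqrt(25) = 5.0
euclidean = math.dist(p1, p2)
print(euclidean)  # 5.0
```

math.dist (Python 3.8+) works for points of any dimension, as long as both points have the same number of coordinates.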
Manhattan Distance
Manhattan Distance is the sum of absolute differences between points across all the dimensions.
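The same hypothetical points can illustrate the sum of absolute differences:

```python
# Hypothetical example points in 2D space
p1 = (1, 2)
p2 = (4, 6)

# Manhattan distance: |1-4| + |2-6| = 3 + 4 = 7
manhattan = sum(abs(a - b) for a, b in zip(p1, p2))
print(manhattan)  # 7
```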
Minkowski Distance
Minkowski Distance is the generalized form of Euclidean and Manhattan Distance.
- If lambda = 1, it calculates Manhattan Distance
- If lambda = 2, it calculates Euclidean Distance
In the SciPy package, this order is controlled by the p parameter of the Minkowski distance metric:
- When the order(p) = 1, it will represent Manhattan Distance
- When the order(p) = 2, it will represent Euclidean Distance
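A pure-Python sketch of the formula makes the role of the order parameter concrete (the helper name minkowski_distance is my own; it mirrors the p parameter of scipy.spatial.distance.minkowski):

```python
def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length sequences.

    p=1 reduces to Manhattan distance, p=2 to Euclidean distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

# Hypothetical example points
p1 = [1, 2]
p2 = [4, 6]

print(minkowski_distance(p1, p2, p=1))  # 7.0 (Manhattan)
print(minkowski_distance(p1, p2, p=2))  # 5.0 (Euclidean)
```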
Hamming Distance
Hamming Distance measures how different two strings of the same length are.
The Hamming Distance between two strings of the same length is the number of positions at which the corresponding characters are different.
Let’s say we have two strings: “euclidean” and “manhattan”
Since these strings are the same length, we can calculate the Hamming Distance. We go character by character and compare the strings. The first characters (e and m) differ; the second characters (u and a) differ; and so on.
Look carefully: seven characters are different, while two characters (the last two, "an") match: "euclide-an" and "manhatt-an". Hence, the Hamming Distance here is 7.
Note that the larger the Hamming Distance between two strings, the more dissimilar those strings are (and vice versa).
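The character-by-character comparison above can be sketched in a few lines (the helper name hamming_distance is my own):

```python
def hamming_distance(s1, s2):
    """Number of positions at which the corresponding characters differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    # True counts as 1, False as 0, so summing the mismatches gives the distance
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("euclidean", "manhattan"))  # 7
```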
Cosine Distance / Cosine Similarity
Cosine similarity is used to determine the similarity between documents or vectors. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.
The relation between cosine similarity and cosine distance can be described as follows:
- Similarity decreases when the distance between two vectors increases
- Similarity increases when the distance between two vectors decreases
Cosine Distance = 1 − Cosine Similarity, where Cosine Similarity = cos(θ)
Case 1: When the angle between points P1 & P2 is 45 degrees, then
cosine_similarity = cos 45° ≈ 0.707
Case 2: When two points P1 & P2 are far from each other and the angle between points is 90 Degrees then
cosine_similarity = Cos 90 = 0
Case 3: When two points P1 & P2 are very near, lie in the same direction, and the angle between them is 0 degrees, then
cosine_similarity = Cos 0 = 1
Case 4: When points P1 & P2 lie opposite to each other and the angle between them is 180 degrees, then
cosine_similarity= Cos 180 = -1
Case 5: When the angle between points P1 & P2 is 270 Degrees then
cosine_similarity= Cos 270 = 0
Case 6: When the angle between points P1 & P2 is 360 Degrees then
cosine_similarity= Cos 360 = 1
Let's plug each of the angles discussed above into the formula (Cosine Distance = 1 − Cosine Similarity) and see the Cosine Distance between the two points.
Case 1: When cos 45°: Cosine_Distance = 1 − 0.707 ≈ 0.293
Case 2: When cos 90°: Cosine_Distance = 1 − 0 = 1
Case 3: When cos 0°: Cosine_Distance = 1 − 1 = 0
Case 4: When cos 180°: Cosine_Distance = 1 − (−1) = 2
Case 5: When cos 270°: Cosine_Distance = 1 − 0 = 1
Case 6: When cos 360°: Cosine_Distance = 1 − 1 = 0
We can clearly see that when the distance is small, the similarity is high (the points are near each other), and when the distance is large, the two points are dissimilar (far from each other).
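The cases above can be verified with a short pure-Python sketch (the helper names cosine_similarity and cosine_distance are my own, and the vectors are hypothetical examples):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) between vectors u and v: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    return 1 - cosine_similarity(u, v)

# Perpendicular vectors (90 degrees): similarity 0, distance 1
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
print(cosine_distance([1, 0], [0, 1]))    # 1.0

# Vectors pointing the same direction (0 degrees): distance ~0,
# even though the magnitudes differ
print(cosine_distance([2, 2], [4, 4]))    # ≈ 0.0 (up to floating-point error)
```

Note that cosine distance ignores magnitude: [2, 2] and [4, 4] have distance ≈ 0 because they point the same way, which is why this metric is popular for comparing documents of different lengths.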
GitHub Code:
If it was helpful, please give a thumbs up. Thank you, and please follow:
Medium: https://medium.com/@sandipanpaul
GitHub: https://github.com/sandipanpaul21