Method of Measuring Distance

vector norm http://taewan.kim/post/norm/#:~:text=Norm은 벡터의 길이,거리 혹은 Magnitude라고 합니다.&text=p는 Lorm의 차수를 의미합니다.

A norm is a way of measuring the magnitude (length) of a vector.

$$ \|\textbf{x}\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}\;\;\;(p>0) $$
- if $p = \infty$, $\|\textbf{x}\|_\infty = \max(|x_1|, \ldots, |x_d|)$
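
The norm definition can be sketched in a few lines of plain Python. This is only an illustration: the function name `p_norm` and the sample vector are assumptions made for this example, not something from the linked post.

```python
import math

def p_norm(x, p):
    """Return the L_p norm of a sequence of numbers x (p > 0, or math.inf)."""
    if p == math.inf:
        # L-infinity norm: the largest absolute component.
        return max(abs(v) for v in x)
    if p <= 0:
        raise ValueError("p must be positive")
    # General case: (sum of |x_i|^p) raised to the power 1/p.
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0]               # hypothetical example vector
l1 = p_norm(x, 1)             # |3| + |-4|
l2 = p_norm(x, 2)             # sqrt(3^2 + 4^2)
linf = p_norm(x, math.inf)    # max(|3|, |-4|)
print(l1, l2, linf)
```

For this vector, the L1 norm is 7, the L2 norm is 5, and the L∞ norm is 4.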
Distance
- sets as vectors → Cosine Distance = 1 - Cosine Similarity
- sets as points → Euclidean Distance (L2 norm) or Manhattan Distance (L1 norm)
- sets as sets → Jaccard Distance = 1 - Jaccard Similarity
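
A minimal sketch of the four distances listed above, using only the standard library. The function names and sample inputs are hypothetical, and the cosine distance assumes neither vector is all zeros.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """L2 norm of the difference a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """L1 norm of the difference a - b."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(s, t):
    """1 - |intersection| / |union|; two empty sets are treated as identical."""
    s, t = set(s), set(t)
    if not s and not t:
        return 0.0
    return 1.0 - len(s & t) / len(s | t)

a, b = [1.0, 0.0], [0.0, 1.0]                 # hypothetical sample vectors
print(cosine_distance(a, b))                  # orthogonal vectors
print(euclidean_distance(a, b))
print(manhattan_distance(a, b))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 2 shared items out of 4 total
```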
Difference between Euclidean Distance and Manhattan Distance

The difference depends on your data.

For high-dimensional vectors, you might find that Manhattan distance works better than Euclidean distance.

The reason for this is quite simple to explain. Consider the $\ell_\infty$ norm, that is, the Minkowski distance with exponent $p = \infty$. The distance is then the largest absolute difference between the two vectors in any single dimension. This doesn't make sense in many dimensions, as we would be ignoring most of the dimensions and measuring distance based on a single attribute.

Thus, reducing the exponent makes the other features play a bigger role in the distance calculation. The lower the exponent, the less a large difference in any single dimension dominates the result.
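
To make the effect of the exponent concrete, here is a small sketch that computes the Minkowski distance for several exponents on a pair of vectors where one dimension differs by much more than the others. The helper `minkowski_distance` and the sample vectors are assumptions chosen for illustration.

```python
import math

def minkowski_distance(a, b, p):
    """Minkowski distance with exponent p (p > 0, or math.inf)."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == math.inf:
        # Only the single largest per-dimension difference counts.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical vectors: one dimension differs by 10, the other three by 1.
a = [0.0, 0.0, 0.0, 0.0]
b = [10.0, 1.0, 1.0, 1.0]

for p in (1, 2, 4, math.inf):
    print(p, minkowski_distance(a, b, p))
```

With $p = 1$ the three small differences add 3 to the total of 13, while with $p = \infty$ the result is exactly 10 and they contribute nothing. The values for intermediate exponents fall between these two.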

Interestingly, distances with exponent $p < 1$ might work even better than Manhattan: $\left(\sum_{i} |x_i - y_i|^p\right)^{1/p}$ with $p < 1$. Curiously, these distances are not valid metrics because the triangle inequality does not hold, but they can be used anyway.
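
The failure of the triangle inequality for $p < 1$ can be checked on a small example. The sketch below uses three hypothetical 2-D points and $p = 0.5$; the function name `fractional_distance` is an assumption made for this illustration.

```python
def fractional_distance(a, b, p):
    """(sum_i |a_i - b_i|^p)^(1/p); not a true metric when p < 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Hypothetical points: going x -> y -> z moves along one axis at a time.
x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
p = 0.5

direct = fractional_distance(x, z, p)
via_y = fractional_distance(x, y, p) + fractional_distance(y, z, p)

# The triangle inequality would require direct <= via_y.
print(direct, via_y, direct <= via_y)
```

Here the direct distance from `x` to `z` comes out larger than the detour through `y`, which a valid metric never allows. With $p = 1$ the same three points satisfy the inequality with equality.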

Curse of Dimensionality

Screen Shot 2022-09-26 at 9.49.32 PM.png

Source: https://datapedia.tistory.com/15#:~:text=차원의 저주란%2C,성능이 저하되는 현상.&text=차원이 증가함에 따라,좋아지는 현상을 의미합니다.

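When the number of dimensions (variables) grows large relative to the number of observations, the distances between observations increase sharply, and the data becomes hard to represent.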
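
The growth in distance can be observed with a small simulation. This is a rough sketch under stated assumptions: points are drawn uniformly from the unit hypercube, the number of points is fixed at 50, and the helper name `mean_pairwise_distance` is made up for this example.

```python
import math
import random

def mean_pairwise_distance(n_points, dim, rng):
    """Average Euclidean distance between random points in the unit hypercube."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total, count = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.sqrt(sum((a - b) ** 2
                                   for a, b in zip(points[i], points[j])))
            count += 1
    return total / count

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for dim in (2, 10, 100, 1000):
    # Same number of observations each time; only the dimension grows.
    print(dim, round(mean_pairwise_distance(50, dim, rng), 3))
```

With the number of observations held fixed, the average distance between points keeps growing as dimensions are added, so the same 50 points cover the space ever more sparsely.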

Terminology

Centroid vs Clustroid

Screen Shot 2022-10-15 at 2.19.15 PM.png