t-SNE is not a clustering algorithm


It is tempting to put cool visualizations in a paper to attract attention. One such visualization method is t-SNE, which I think was popularized by the famous DQN paper in Nature. However, I believe it is misinterpreted most of the time.

If you run the following code snippet, which draws 5000 samples from a two-dimensional Gaussian distribution and fits t-SNE to them three times:

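import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# 5000 samples from a standard 2D Gaussian: a single blob with no clusters.
x = np.random.randn(5000, 2)

# Fit t-SNE three times on the same data. init="random" makes each run start
# from a different random layout (recent scikit-learn versions default to init="pca").
z1 = TSNE(init="random").fit_transform(x)
z2 = TSNE(init="random").fit_transform(x)
z3 = TSNE(init="random").fit_transform(x)

# Color each point by its angle around the origin, scaled to [0, 1], so that
# points in the same direction from the origin share a color in every panel.
fig, ax = plt.subplots(1, 4, figsize=(16, 4))
colors = (np.arctan2(x[:, 1], x[:, 0]) + np.pi) / (2*np.pi)
ax[0].scatter(x[:, 0], x[:, 1], cmap="jet", c=colors)    # original samples
ax[1].scatter(z1[:, 0], z1[:, 1], cmap="jet", c=colors)  # t-SNE run 1
ax[2].scatter(z2[:, 0], z2[:, 1], cmap="jet", c=colors)  # t-SNE run 2
ax[3].scatter(z3[:, 0], z3[:, 1], cmap="jet", c=colors)  # t-SNE run 3
plt.show()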

Three different t-SNE outputs when fit on a 2d Gaussian.

You will get something like the figure above. I colored the points by their angle around the origin so that points near each other in the original sample space (in our case, a 2D Gaussian) get similar colors. Each run produces different cluster-like groups, even though the groups are quite arbitrary. You can also see gaps in the embedding space, which might lead you to think that there is a gap in the manifold in the original sample space, too. That is not the case: the samples come from a single Gaussian.

Let's fit t-SNE to samples from the uniform distribution:

Three different t-SNE outputs when fit on a 2d uniform distribution
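
The code for the uniform case is not shown above. A minimal sketch of how it could be reproduced follows, assuming the same setup as the Gaussian example with the samples drawn from the unit square instead. The helper names `sample_uniform_square` and `embed_repeatedly`, the seeds, and the choice of `init="random"` are my assumptions for illustration, not necessarily what produced the figure.

```python
import numpy as np
from sklearn.manifold import TSNE


def sample_uniform_square(n_samples=5000, seed=0):
    """Draw points uniformly from the unit square [0, 1) x [0, 1)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=(n_samples, 2))


def embed_repeatedly(x, seeds=(0, 1, 2)):
    """Fit t-SNE once per seed and return the list of 2D embeddings.

    init="random" is an assumption made here so that every seed starts
    from a different layout.
    """
    return [TSNE(init="random", random_state=s).fit_transform(x) for s in seeds]


def plot_runs(x, embeddings):
    """Scatter the original samples next to each embedding, colored by x-coordinate."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, len(embeddings) + 1, figsize=(4 * (len(embeddings) + 1), 4))
    colors = x[:, 0]  # nearby points in the square get similar colors
    ax[0].scatter(x[:, 0], x[:, 1], c=colors, cmap="jet")
    for a, z in zip(ax[1:], embeddings):
        a.scatter(z[:, 0], z[:, 1], c=colors, cmap="jet")
    return fig
```

Calling `x = sample_uniform_square()` followed by `plot_runs(x, embed_repeatedly(x))` should give a four-panel figure comparable to the one described by the caption, although the exact shapes depend on the seeds and the scikit-learn version.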

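So, in the end, what you can say is that "samples that are nearby in the embedding space are also nearby in the original space", but not that "samples that are far from each other in the embedding space are also far from each other in the original space". Gaps and separate groups in a t-SNE plot are therefore not evidence of clusters in the data. t-SNE only preserves the local structure, not the global picture.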
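
One way to check this claim numerically, rather than by eye, is to score local and global agreement separately. The sketch below is my own illustration, not part of the original experiment: it uses scikit-learn's `trustworthiness` as the local score (how well neighborhoods in the embedding match neighborhoods in the original space) and, as an assumed stand-in for "the global picture", the Spearman rank correlation between pairwise distances before and after embedding. The function name `local_and_global_scores` and its parameters are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.manifold import trustworthiness


def local_and_global_scores(x, z, n_neighbors=10, max_points=1000, seed=0):
    """Return (local, global) agreement between data x and embedding z.

    local:  trustworthiness in [0, 1]; 1 means the nearest neighbors in the
            embedding are also nearest neighbors in the original space.
    global: Spearman correlation of all pairwise distances, computed on a
            random subset of at most max_points samples to keep it cheap.
    """
    local = trustworthiness(x, z, n_neighbors=n_neighbors)

    # Use the same random subset of rows for both spaces.
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    idx = rng.choice(n, size=min(n, max_points), replace=False)
    rho = spearmanr(pdist(x[idx]), pdist(z[idx]))[0]
    return local, rho
```

Passing the Gaussian samples `x` and one of the embeddings, for example `local_and_global_scores(x, z1)`, gives the two numbers for that run. If the argument above holds, I would expect the local score to stay high across runs while the global score is lower and varies from run to run, but the actual values depend on the data, the seed, and the t-SNE settings.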