Complete Feature Selection Techniques 4 - 3 Dimension Reduction
Summarizing the math intuition behind PCA, LDA, MDS, ISOMAP, T-SNE, and UMAP, and demonstrating each for feature dimension reduction
Complete Feature Selection Techniques
- Statistical Test & Analysis
- Correlation Analysis
- Dimension Reduction
- Model Driven
PCA and LDA
Both Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are linear transformation techniques.
A linear transformation can scale and rotate data points from the original axes (i.e., the original space) into new axes, and the data distribution changes accordingly during the scaling and rotation.
Generally speaking, PCA finds the transformation (scale and rotation) that maximizes the dataset variance, while LDA finds the transformation that maximizes class separability.
Linear Transformation and Orthogonal Projection
The following sets up the notation for linear transformation and orthogonal projection.
x is any data point vector in the training set X = (x₁, x₂, …, xₙ); X is d × n, and we assume X is mean-centered.
w is any vector in the orthogonal transformation matrix W = (w₁, w₂, …, w_d); W is d × d and ∥wᵢ∥ = 1.
The orthogonal projection of x onto w is then wᵀx, and this projection value is the new coordinate of x on the axis w.
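As a quick sanity check (a minimal sketch of my own, not from the original post), the projection value is just the dot product of x with a unit-length w:

import numpy as np

x = np.array([3.0, 4.0])                # a toy data point (any vector works)
w = np.array([1.0, 1.0]) / np.sqrt(2)   # a unit-length new axis, ∥w∥ = 1

projection_value = w @ x                # orthogonal projection of x onto w: wᵀx
print(projection_value)                 # ≈ 4.95, the coordinate of x on the new axis w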
PCA Transformation Objective
The PCA objective is to find the transformation that maximizes the variance of the whole dataset.
Based on the definitions of variance and orthogonal projection, for a given new axis w, the variance of the transformed training set is
Var(w) = (1/n) Σᵢ (wᵀxᵢ − wᵀx̄)²
Because the training set X is mean-centered, x̄ is 0, so Var(w) = (1/n) Σᵢ (wᵀxᵢ)² = wᵀCₙw.
The ultimate PCA objective function is therefore the following, where Cₙ is the covariance matrix of the (mean-centered) training set X:
max_w  wᵀCₙw   subject to   ∥w∥ = 1
Resolving the optimization with a Lagrange multiplier (under the condition ∥wᵢ∥ = 1), the maximization condition is
Cₙw = λw
So the PCA transformation matrix W consists of the eigenvectors of the (centered) training data covariance matrix.
Because the eigenvectors partition the total variance of the transformed dataset (each eigenvalue is the variance along its eigenvector), we can drop the eigenvectors with relatively low variance (i.e., the less informative directions), and this is how PCA is used for feature dimension reduction.
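The derivation can be verified with a short NumPy sketch (my own illustration, not part of the original post): center the data, build the covariance matrix, take its eigenvectors, and keep the directions with the largest eigenvalues.

import numpy as np
from sklearn import datasets

X = datasets.load_iris().data                # (150, 4) training set
X_c = X - X.mean(axis=0)                     # mean-center the data

C = np.cov(X_c, rowvar=False)                # 4 x 4 covariance matrix Cₙ
eig_vals, eig_vecs = np.linalg.eigh(C)       # eigh because C is symmetric
order = np.argsort(eig_vals)[::-1]           # sort directions by variance, largest first
W = eig_vecs[:, order[:2]]                   # keep the 2 highest-variance eigenvectors

X_pca = X_c @ W                              # projection onto the new axes (matches sklearn PCA up to sign)
print(eig_vals[order] / eig_vals.sum())      # fraction of total variance per component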
LDA Transformation Objective
The LDA objective is to find the transformation that maximizes class separability in the dataset.
To maximize class separability, LDA tries to find a transformation that
- separates the class centers (i.e., the class means) as far apart as possible
- keeps the spread of each class as small as possible, so there is less overlap between different classes
The separation of the class centers can be measured by the variance of the mean values of all classes.
Suppose the training set X has c classes, μ is the overall mean, and μᵢ is the mean of class i.
For a given new axis w, the class-center variance is
(1/c) Σᵢ (wᵀμᵢ − wᵀμ)²
Because the training set X is mean-centered, μ is 0, so this becomes (1/c) Σᵢ (wᵀμᵢ)².
In short form the class-center variance is wᵀS_Bw (up to a constant factor), where S_B = Σᵢ μᵢμᵢᵀ is the between-class scatter matrix, and LDA expects the transformation w to enlarge this value.
LDA also requires the transformation w to reduce the summed spread within each class.
Summed over all classes, the within-class spread along w is
Σᵢ Σ_{x∈Xᵢ} (wᵀx − wᵀμᵢ)²
In short form this is wᵀS_Ww, where S_W = Σᵢ Σ_{x∈Xᵢ} (x − μᵢ)(x − μᵢ)ᵀ is the within-class scatter matrix, and LDA expects the transformation w to reduce this value.
Ultimately, LDA puts the two requirements together and looks for the transformation w that maximizes
J(w) = (wᵀS_Bw) / (wᵀS_Ww)
Resolving this optimization, the maximization condition is
S_W⁻¹S_Bw = λw
So the LDA transformation matrix W again consists of eigenvectors (of S_W⁻¹S_B), and as with PCA we can drop the eigenvectors with relatively small eigenvalues to implement feature dimension reduction.
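A small NumPy/SciPy sketch of this result (my own illustration, using the common textbook scatter matrices, which weight S_B by class sizes and therefore differ from the simplified expressions above only by constant factors):

import numpy as np
from scipy.linalg import eigh
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
overall_mean = X.mean(axis=0)
d = X.shape[1]

# Between-class scatter S_B and within-class scatter S_W
S_B = np.zeros((d, d))
S_W = np.zeros((d, d))
for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * diff @ diff.T                  # class-center separation
    S_W += (X_c - mean_c).T @ (X_c - mean_c)         # per-class spread

# Generalized eigenvalue problem S_B w = λ S_W w (equivalent to S_W⁻¹S_B w = λw)
eig_vals, eig_vecs = eigh(S_B, S_W)
order = np.argsort(eig_vals)[::-1]
W = eig_vecs[:, order[:2]]                           # keep the 2 largest-eigenvalue directions
X_lda = X @ W                                        # comparable to sklearn's LDA up to sign and scale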
PCA and LDA Python Example
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
%matplotlib inline

iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)                           # keep the first 2 principal components
X_r = pca.fit(X).transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)    # keep the first 2 linear discriminants
X_r2 = lda.fit(X, y).transform(X)

colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0],
                X_r[y == i, 1],
                color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of IRIS dataset')

plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0],
                X_r2[y == i, 1],
                color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')
From the plots, LDA gives better class separability than PCA on this dataset.
MDS and ISOMAP
Multidimensional Scaling (MDS) and Isometric Feature Mapping (ISOMAP) are two very similar non-linear dimension reduction techniques.
The defining feature of MDS and ISOMAP is that, during dimension reduction, both aim to preserve the pairwise distances between all data points.
In other words, MDS and ISOMAP try to find a lower-dimensional space in which any two data points have the same distance as in the high-dimensional space.
The common process for MDS and ISOMAP is as below (a NumPy sketch of these steps follows the list):
Assume the training set X has N instances and each instance has D dimensions.
1. Construct the distance matrix D (N × N), where cell Dᵢⱼ is the pairwise distance between training data points i and j
2. Define a matrix Y of size N × K with K < D
3. Assume the training set X is mapped to Y; because the mapping keeps pairwise distances, ∥Yᵢ − Yⱼ∥² = Dᵢⱼ²
4. Define the inner product matrix G = YYᵀ and compute G from D by double centering (please check references 6, 7, 8)
5. Because G is a symmetric matrix, its eigendecomposition is G = VΛVᵀ
6. Because rank(G) = rank(YYᵀ) = rank(Y) = K, G has only K non-zero eigenvalues; remove the zero eigenvalues and their corresponding eigenvectors
7. From G = YYᵀ, the lower-dimensional (K < D) data points Y (N × K) are Y = V_KΛ_K^{1/2}, where V_K and Λ_K keep only the retained eigenvectors and eigenvalues
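The whole classical (Euclidean) MDS process above fits in a short NumPy sketch (my own illustration, assuming the double-centering formula G = −½JD²J from references 6, 7, 8):

import numpy as np
from sklearn import datasets
from scipy.spatial.distance import pdist, squareform

X = datasets.load_iris().data
N = X.shape[0]

# Step 1: pairwise Euclidean distance matrix D (N x N)
D = squareform(pdist(X))

# Step 4: inner product matrix G = YYᵀ via double centering: G = -1/2 * J D² J
J = np.eye(N) - np.ones((N, N)) / N
G = -0.5 * J @ (D ** 2) @ J

# Steps 5-7: eigendecompose G and keep the K largest eigenvalues
eig_vals, eig_vecs = np.linalg.eigh(G)
order = np.argsort(eig_vals)[::-1][:2]               # K = 2
Y = eig_vecs[:, order] * np.sqrt(eig_vals[order])    # Y = V_K Λ_K^(1/2)
# For Euclidean distances this coincides with PCA on the centered data (up to sign)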
Difference Between MDS and ISOMAP
The above process is common to both MDS and ISOMAP; the only difference is how they define and calculate the distance.
MDS can use Euclidean, Manhattan, cosine, or a custom distance. The closed-form solution above is based on Euclidean distance; for other distances we need to construct an objective function (e.g., the difference between the high-dimensional and low-dimensional distances, often called the stress) and use gradient descent to solve for the lower-dimensional mapping.
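As a rough sketch of that idea (my own simplification, not sklearn's SMACOF implementation), we can minimize the squared mismatch between high- and low-dimensional distances directly, here with a Manhattan distance matrix:

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def stress(y_flat, D_high, n, k):
    Y = y_flat.reshape(n, k)
    D_low = squareform(pdist(Y))              # pairwise distances in the low-D space
    return np.sum((D_high - D_low) ** 2)      # squared distance mismatch (stress)

rng = np.random.default_rng(0)
X_small = rng.normal(size=(30, 5))            # toy high-D data (any numeric matrix works)
D_high = squareform(pdist(X_small, metric='cityblock'))   # Manhattan distances

n, k = X_small.shape[0], 2
y0 = rng.normal(size=n * k)                   # random initial low-D layout
res = minimize(stress, y0, args=(D_high, n, k), method='L-BFGS-B')
Y_mds = res.x.reshape(n, k)                   # 2-D embedding that best preserves the Manhattan distances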
ISOMAP uses the geodesic distance: it builds a neighborhood graph, finds the shortest path between two data points through that graph, and uses the shortest-path length as the distance (please check references 9, 11).
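A short sketch of how the geodesic distance can be approximated (my own illustration using a k-nearest-neighbor graph; actual ISOMAP implementations differ in details):

import numpy as np
from sklearn import datasets
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components, shortest_path

X = datasets.load_iris().data

# Neighborhood graph whose edge weights are Euclidean distances to the 10 nearest neighbors
knn_graph = kneighbors_graph(X, n_neighbors=10, mode='distance')

n_comp, labels = connected_components(knn_graph, directed=False)
print(n_comp)                      # ISOMAP needs a single connected component; increase n_neighbors if > 1

# Geodesic distance = length of the shortest path through the graph (Dijkstra);
# disconnected pairs come out as inf
geodesic_D = shortest_path(knn_graph, method='D', directed=False)
print(geodesic_D.shape)            # (150, 150) pairwise geodesic distance matrix
# Feeding a fully finite geodesic_D into the classical MDS steps above gives the ISOMAP embedding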
MDS and ISOMAP Python Example
MDS vs PCA
from sklearn.manifold import MDS

mds = MDS(n_components=2, metric=True)    # metric MDS, 2-D embedding
mds_r = mds.fit_transform(X)

plt.figure()
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(mds_r[y == i, 0],
                mds_r[y == i, 1],
                color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('MDS of IRIS dataset')
ISOMAP
from sklearn.manifold import Isomap

fig, ax = plt.subplots(1, 3, figsize=(15, 5))
for idx, neighbor in enumerate([2, 20, 100]):
    isomap = Isomap(n_components=2, n_neighbors=neighbor)
    new_X_isomap = isomap.fit_transform(X)
    ax[idx].scatter(new_X_isomap[:, 0],
                    new_X_isomap[:, 1],
                    c=y)
    ax[idx].set_title("Isomap (n_neighbors=%d)" % neighbor)
plt.show()
T-SNE and UMAP
T-SNE and UMAP take a different perspective: they use probability distributions to describe the relative locations of data points, and they fit the low-dimensional distribution to the high-dimensional one, so that the local (and to some extent global) relative positions of data points are preserved during dimension reduction.
Normally we use T-SNE and UMAP only to visualize high-dimensional data in a low-dimensional plot.
General Idea and Process of T-SNE and UMAP
- For each high-dimensional data instance xᵢ, construct a conditional probability distribution p(xⱼ | xᵢ) over all other instances based on pairwise distance; closer instances get higher probability (a small sketch of this step follows the list)
- Let yᵢ be the mapping of xᵢ in the low-dimensional space, and define a corresponding low-dimensional conditional probability distribution q; ideally q(yⱼ | yᵢ) = p(xⱼ | xᵢ)
- Fit the low-dimensional distribution to the high-dimensional one: minimize a cost function that measures the difference between p and q with gradient descent to find the best low-dimensional coordinates y
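A tiny sketch of the first step only (my own illustration with one fixed σ, whereas real T-SNE tunes σᵢ per point to match the perplexity):

import numpy as np
from sklearn import datasets

X = datasets.load_digits().data
i = 0                                           # pick one data instance xᵢ
sigma = 20.0                                    # fixed bandwidth, for illustration only

# Squared Euclidean distances from xᵢ to every other instance
sq_dists = np.sum((X - X[i]) ** 2, axis=1)
sq_dists[i] = np.inf                            # exclude p(xᵢ | xᵢ)

# Conditional probabilities p(xⱼ | xᵢ): closer points get higher probability
p_j_given_i = np.exp(-sq_dists / (2 * sigma ** 2))
p_j_given_i /= p_j_given_i.sum()
print(p_j_given_i.sum(), p_j_given_i.max())     # sums to 1; nearest neighbors dominate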
Brief Definition Summary
T-SNE uses a Gaussian distribution to turn high-dimensional distances into probabilities and a Student's t-distribution on the low-dimensional side; the cost function is the KL divergence.
UMAP uses its own custom distributions for both the high- and low-dimensional sides and uses binary cross-entropy as the cost function.
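For reference, the standard T-SNE definitions (see references 12 and 18) can be written, in LaTeX notation, as:

% High-dimensional affinities: Gaussian kernel with per-point bandwidth sigma_i
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Low-dimensional affinities: Student t-distribution with one degree of freedom
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost function: KL divergence between the high-D and low-D distributions
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}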
In T-SNE, σᵢ is the per-point Gaussian bandwidth, chosen so that the resulting distribution matches the perplexity parameter, which effectively decides the neighborhood size when calculating the distance probabilities (UMAP's n_neighbors plays a similar role); please check references 14 and 17 for more details of T-SNE and UMAP parameters.
T-SNE & UMAP Python Example
# pip install umap-learn
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn import datasets

digits = datasets.load_digits()
target_ids = range(len(digits.target_names))

tsne = TSNE(n_components=2,
            perplexity=50,
            random_state=0)
tsne_r = tsne.fit_transform(digits.data)

umap_ = umap.UMAP(n_components=2,
                  n_neighbors=50,
                  random_state=42)
umap_r = umap_.fit_transform(digits.data)

# Preview a sample of the raw digit images
fig, ax_array = plt.subplots(20, 20)
axes = ax_array.flatten()
for i, ax in enumerate(axes):
    ax.imshow(digits.images[i], cmap='gray_r')
plt.setp(axes, xticks=[], yticks=[], frame_on=False)
plt.tight_layout(h_pad=0.5, w_pad=0.01)
plt.show()

plt.scatter(tsne_r[:, 0],
            tsne_r[:, 1],
            c=digits.target,
            cmap='Spectral',
            s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
plt.title('T-SNE projection of the Digits dataset')

plt.figure()
plt.scatter(umap_r[:, 0],
            umap_r[:, 1],
            c=digits.target,
            cmap='Spectral',
            s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the Digits dataset')
REFERENCES
1. MLCC 2015 Dimensionality Reduction and PCA
2. Machine Intelligence — Lecture 4 (LDA, t-SNE)
3. Implementing a Principal Component Analysis (PCA) – in Python, step by step
4. Linear discriminant analysis: A detailed tutorial
5. Linear Discriminant Analysis – Bit by Bit
6. Multidimensional Scaling (MDS)
7. Dimensionality Reduction: Multidimensional Scaling
8. 机器学习算法-MDS降维算法 (Machine learning algorithms: the MDS dimension reduction algorithm)
9. Isometric Feature Mapping (ISOmap)
10. 机器学习-isomap降维算法 (Machine learning: the isomap dimension reduction algorithm)
11. Dimensionality Reduction: ISOMAP & LLE
12. Dimensionality Reduction: t-SNE
13. Introduction to Factor Analysis in Python
14. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
15. UMAP Uniform Manifold Approximation and Projection for Dimension Reduction | SciPy 2018
16. scRNA-seq: Dimension reduction (PCA, tSNE, UMAP)
17. How Exactly UMAP Works
18. t-SNE完整笔记 (Complete t-SNE notes)