Big Data Algorithm Lesson 1: About Kernel

最近在旁听复旦大学一门大数据算法课，老师比较侧重于底层优化算法，讲的很细致很有意思，于是在知乎上整理了笔记，由于公式太难重新打了，那么就把链接放在下面吧：

Why Kernel

为什么映射到高维空间可以解决线性不可分的情况

如下图，这一组点在二维平面线性不可分，但是若将其映射到三维空间，就可以找到一个平面将其划分开来。
通俗来说，核函数的就是将这些点从低维空间映射到了高维空间，以达到线性可分的效果，但是其理论基础是什么呢？

$\bf{Definition \ 1: }$ If ${ x_0,…,x_N }, x_i \in R^{N+1}$ is affinity independent, they can be linearly seprable in n-dim space.

what is affinity independent:
If ${ (x_1-x_0),…,(x_N-x_0) }$ is linearly independent, then ${ x_0,…,x_N }$ is affinity independent.
what is linearly (strongly) seprable:
For two sets $C_1,C_2 \in R^N$,$\exists w \in R^N$, $s.t.inf_{x\in C_1}(w,x) > sup_{x\in C_2} (w,x)$

这个理论提供了核函数存在的意义，因为映射到高维空间在理论上来说是一个可行的、能够有效帮助解决线性不可分情况的操作。如果要通俗的来理解这个理论，比如对于一个dataset，有500个数据样本，那你必然能够找一个499维的空间，将这500个数据点分开来，而这个499维的空间的坐标轴，实际上就是我们通常所说的抽出来的特征。

为了证明这个理论，我们需要了解以下几个知识点：

convex set: (凸集)

For a set $C$, if $\lambda x+(1-\lambda)y \in C, \ for \ \forall x,y\in C, \lambda \in (0,1) $

convex combination: (凸组合)

$\lambda_1x_1+\cdots+\lambda_nx_n$ is called a convex combination if $\lambda_i \ge 0$ and $\sum \lambda_i = 1$

convex hull: (凸包)

conv(C): the convex hull of a set C is the intersection of all convex sets containing C.

一个集合的凸包指的就是找到的那个最小的集合使得它成为凸集，或者说是包含这个集合的最小的凸集。
conv(C) = {$x$|$x$ can be represented as the convex combination of points in $C$}

即，假设$C={x_0,…,x_m}$, 则其凸包$conv(C)={x|x=\sum \lambda_ix_i$, $\sum \lambda_i = 1$, $\lambda_i \ge 0} $

注：有限的点组成的convex hull一定是个多面体，不会有弧形的部分（因为要保证最小），这个多面体也能被称之为一个simplex-单纯形。

Seperation Theorem: (凸集分离定理)

可以说是线性规划最经典的理论了

假设有两个convex set $C_1,C_2$,$C_1\ne \emptyset $,$C_2\ne \emptyset$, 且两者闭包（closure）的交集等于空，表示为$cl \ C_1 \cap cl \ C_2 = \emptyset$,若$C_1$$C_2$中任意一个集合有界(bounded)，则$\exists w\in R^N, \ s.t. \ inf_{x\in C_1}\langle w , x \rangle \ > \ sup_{x\in C_2}\langle w , x \rangle$

了解了这些之后，我们就能开始着手证明Definition1。

Proof Of Definition 1：

假设有两个集合$S_1$和$S_2$，$S_1 \cup S_2\in {x_0,\cdots x_N}$, $S_1 \cap S_2 = \emptyset$，基本思路是证明这两个集合的凸包满足凸集分离定理的使用条件，然后就可以使用凸集分离定理。

取$C_1\doteq conv(S_1)$, $C_2\doteq conv(S_2)$

证明$C_1$, $C_2$是convex set

凸包一定都是凸集，所以这个本质上等同于证明凸包为什么是凸集，凸包是指多个多个凸集的交集，那么这个证明就转化为了证明多个凸集的交集还是凸集。

Proof:
假设有n个凸集${C_1,\cdots,C_n}$，对于任意两个点$x_1$, $x_2$，若它们在这n个凸集的交集中，则对于任意一个凸集$C_i$,都满足$x_1,x_2\in C_i$，则对于某点$x_3$，$x_3=\lambda_ix_1+(1-\lambda_i)x_2\in C_i, \forall \ \lambda_i\in(0,1), \forall i$，对任意的$i$和$x_1$, $x_2$都满足$x_3\in C_i$，那么$x_3$自然在 ${C_1,\cdots,C_n}$的交集中，那么能证明${C_1,\cdots,C_n}$的交集是凸集。

证明$cl \ C_1 \cap cl \ C_2 = \emptyset$

其思路就是证明$C_1 \cap C_2 = \emptyset$且 $cl \ C_1=C_1$

Proof:

$cl \ C_1=C_1$ ？
假设$\exists y \in C_1 \cap C_2$, 则可得出

$y=\sum_{i\in S_1}\alpha_ix_i=\sum_{j\in S_2}\beta_jx_j$, 且$\sum_i\alpha_i=1, \sum_j\beta_j=1$,联立方程得：

\[\left\{ \begin{aligned} \sum_{i\in S_1}\alpha_ix_i - \sum_{j\in S_2}\beta_jx_j & = 0 \\ \sum_i\alpha_i - \sum_j\beta_j & = 0 \\ \end{aligned} \right.\]

此方程可简化写为：

\[\left\{ \begin{aligned} \sum_{i\in S_1\cup S_2}a_ix_i & = 0 \\ \sum_{i\in S_1\cup S_2}a_i & = 0 \\ \end{aligned} \right.\]

因此，对任意一组${a_i}$,都要能有一组${x_i}$解，使得以上等式成立，接下来我们将以上等式改写以下形式：

\[\begin{bmatrix} x_0 & \cdots & x_N \\ 1 & \cdots & 1 \end{bmatrix} \cdot \begin{bmatrix} a_0\\ a_1 \\ \cdots \\ a_N\end{bmatrix}=0\]

可以发现，$\begin{vmatrix} x_0 & \cdots & x_N \ 1 & \cdots & 1 \end{vmatrix}=\begin{vmatrix} x_0 & x_1-x_0 & \cdots & x_N-x_0 \ 1 & 0 & \cdots & 0 \end{vmatrix}$，而存在条件${ x_0,…,x_N }, x_i \in R^{N+1}$近似独立，因此$\begin{bmatrix} x_1-x_0 & \cdots & x_N-x_0 \end{bmatrix}$的行列式不为零，可逆。

因此当且仅当$\begin{bmatrix} a_0 & a_1 & \cdots & a_N\end{bmatrix}=0$时原式满足，所以不存在这样的点$y$，与最初假设矛盾。

则$C_1 \cap C_2=\emptyset$证毕

综上两点，可证明$cl \ C_1 \cap cl \ C_2 = \emptyset$

因此对于任意的两个set $S_1$和$S_2$，$S_1 \cup S_2\in {x_0,\cdots x_N}$, $S_1 \cap S_2 = \emptyset$，$C_1 = conv(S_1)$和$C_2 = conv(S_2)$都满足使用凸集分离定理的条件，因此对任意$S_1,S_2$都有：$\exists w, \ s.t. \ inf_{x\in C_1}\langle w , x \rangle \ > \ sup_{x\in C_2}\langle w , x \rangle$，因此${ x_0,…,x_N }$在n维空间线性可分。

（同理，我们也可以证明出：n+2个点在n维中线性不可分）

我理解来说，这个理论就是使用kernel这种方法的理论支撑。

Kernel & Feature Map

这个地方仅提供我自己的理解，例如对于高斯核，$G(x_i, x_j)=\exp(-||x_i-x_j||^2)$，我们平时会直接使用这样的变换，但却不知道这样变换的原因所在，而其实从$(x_i,x_j)$到$G(x_i, x_j)$之间还有这样一个流程：

\[(x_i,x_j) \rightarrow K(x_i,x_j) = \langle \phi(x_i), \phi(x_j) \rangle_H\rightarrow G(x_i, x_j)\\\]

因为$(x_i,x_j)$可以被一个映射函数$\phi(\cdot)$映射到一个高维空间，在这个希尔伯特空间有一种内积的操作，内积可以表示$\phi(x_i),\phi(x_j)$两者的相似性，且这个值刚好等于低维空间下$G(x_i, x_j)$的值，因此，这样去做一个kernel操作是合理的。其中，这一整套过程被称为kernel method，而映射的这个过程被称作Feature Map。（因为我之前一直有一个误区，以为kernel指的就是从低维空间映射到高维空间的这么一个操作，其实不然，这个过程只是kernel method的一部分而已）

$\bf{Mercer’s \ theorem :}$ Mercer’s theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions.

对比老师在上课时给出的定义为：

If $G = G^T$ and $G \geq 0$, $\exists \phi(\cdot), \ s.t. \ G_{ij} = K(x_i,x_j) = \langle \phi(x_i) , \phi(x_j) \rangle_H$

其实老师上课讲到的这个定义就是mercer定理了，即只要G满足这两个条件，那么它就能找到这样的映射函数以及一种内积方式，那么就是合理的核了。

Reference：

https://www.zhihu.com/question/24627666
https://mp.weixin.qq.com/s?__biz=MzIxNDIwMTk2OQ==&mid=2649077019&idx=1&sn=e0c4a6c502e3668e1dc410f21e531cfd&scene=0#wechat_redirect
https://blog.csdn.net/wsj998689aa/article/details/47027365
https://blog.csdn.net/wsj998689aa/article/details/40398777
https://www.cnblogs.com/xingshansi/p/6767980.html
https://blog.csdn.net/zhazhiqiang/article/details/19496633
https://blog.csdn.net/cqy_chen/article/details/77932270

Big Data Algorithm Lesson 1: About Kernel

Why do we use kernel?

Why Kernel

Kernel & Feature Map

Reference：

CATALOG

FEATURED TAGS