Appendix: UML (Understanding Machine Learning)
The book is rather hard to read; the UML Reading Notes on Zhihu are a useful companion.
A Gentle Start
- Domain Set: \(\mathcal{X}\)
- Label Set: \(\mathcal{Y}\)
- Data Set: \(S = \{(x_1, y_1), \dots, (x_m, y_m)\}\), where \((x_i, y_i) \in \mathcal{X} \times \mathcal{Y}\) for all \(i \in [m]\)
- Hypothesis: \(h: \mathcal{X} \rightarrow \mathcal{Y}\), \(h \in \mathcal{H}\)
- Ground truth: \(f: \mathcal{X} \rightarrow \mathcal{Y}\) with \(f(x_i) = y_i\) for all \(i \in [m]\)
- Subset: \(A \subset \mathcal{X}\)
- Data Distribution: \(\mathcal{D}\)
- \(\mathcal{D}(A)\) denotes the probability that a point \(x \in \mathcal{X}\) falls in \(A\)
- Error rate: \(L_{\mathcal{D},f}(h)\stackrel{\mathrm{def}}{=}\mathbb{P}_{x\sim\mathcal{D}}[h(x)\neq f(x)]\stackrel{\mathrm{def}}{=}\mathcal{D}(\{x:h(x)\neq f(x)\})\)
- i.e., the error rate of \(h\) over the distribution \(\mathcal{D}\)
- Empirical Risk: \(L_S(h) = \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}\)
- i.e., the error rate of \(h\) on the dataset \(S\)
- Assumption 1.1 (The Realizability Assumption)
- There exists \(h^* \in \mathcal{H}\) such that \(L_{\mathcal{D},f}(h^*) = 0\), and hence \(L_S(h^*) = 0\)
- i.e., we assume the hypothesis space to be learned contains a perfect model that fits \(f\)
- Theorem 1.2
- For a finite \(\mathcal{H}\): if an i.i.d. sample \(S\) drawn from \(\mathcal{D}\) has size \(m \geq \frac{\log(|\mathcal{H}| / \delta)}{\epsilon}\),
- then with probability at least \(1-\delta\) we get \(L_{\mathcal{D},f}(h_S) \leq \epsilon\)
- i.e., a model learned on a large enough dataset performs well with high probability (a proof sketch follows this list)
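A sketch of the standard union-bound argument behind Theorem 1.2 (mirroring the book's proof): for any fixed \(h\) with \(L_{\mathcal{D},f}(h) > \epsilon\), the probability that \(h\) labels all \(m\) i.i.d. examples correctly is at most \((1-\epsilon)^m \leq e^{-\epsilon m}\), and a union bound over the at most \(|\mathcal{H}|\) such hypotheses gives

\[ \mathbb{P}\big[\exists\, h \in \mathcal{H} : L_{\mathcal{D},f}(h) > \epsilon \text{ and } L_S(h) = 0\big] \leq |\mathcal{H}|\, e^{-\epsilon m} \leq \delta \quad \text{whenever } m \geq \frac{\log(|\mathcal{H}| / \delta)}{\epsilon}. \]

Under the realizability assumption, ERM returns some \(h_S\) with \(L_S(h_S) = 0\), so with probability at least \(1-\delta\) it must satisfy \(L_{\mathcal{D},f}(h_S) \leq \epsilon\).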
PAC Learning
- Definition 2.1 (PAC Learnability)
- There exists a function \(m_{\mathcal{H}}: (0, 1)^2 \rightarrow \mathbb{N}\) such that whenever \(|S| = m \geq m_\mathcal{H}(\epsilon, \delta)\), with probability at least \(1-\delta\) the learned \(h\) satisfies \(L_{\mathcal{D},f}(h) \leq \epsilon\)
- If so, \(\mathcal{H}\) is said to be PAC learnable
- Corollary 2.2
- Every finite \(\mathcal{H}\) is PAC learnable, with
- \(m_\mathcal{H}(\epsilon, \delta) \leq \left\lceil \frac{\log(|\mathcal{H}| / \delta)}{\epsilon} \right\rceil\) (a worked example follows this list)
- Definition 2.3 (Agnostic PAC Learnability)
- Drop the realizability assumption (Assumption 1.1), but require:
- there exists a function \(m_{\mathcal{H}}: (0, 1)^2 \rightarrow \mathbb{N}\) such that whenever \(|S| = m \geq m_\mathcal{H}(\epsilon, \delta)\), with probability at least \(1-\delta\) the learned \(h\) satisfies \(L_{\mathcal{D}}(h)\leq\min_{h'\in\mathcal{H}}L_{\mathcal{D}}(h')+\epsilon\)
- If so, \(\mathcal{H}\) is said to be agnostic PAC learnable
- This is a more general definition: under realizability it reduces to Definition 2.1
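As a concrete instantiation of the Corollary 2.2 bound (the numbers here are mine, purely illustrative): with \(|\mathcal{H}| = 1000\), \(\epsilon = 0.1\), and \(\delta = 0.05\),

\[ m_\mathcal{H}(0.1, 0.05) \leq \left\lceil \frac{\log(1000 / 0.05)}{0.1} \right\rceil = \left\lceil \frac{\log 20000}{0.1} \right\rceil = \lceil 99.03\ldots \rceil = 100, \]

so about 100 i.i.d. examples suffice at this accuracy and confidence (the \(\log\) is natural, matching the \(e^{-\epsilon m}\) step in the sketch above).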
Uniform Convergence
- Definition 3.1 (\(\epsilon\)-Representative)
- A training set \(S\) is \(\epsilon\)-representative if and only if:
- \(\forall h \in \mathcal{H},\ |L_S(h) - L_\mathcal{D}(h)| \leq \epsilon\)
- Definition 3.3 (Uniform Convergence)
- \(\mathcal{H}\) has the uniform convergence property if and only if:
- there exists a function \(m_\mathcal{H}^{UC}: (0, 1)^2 \rightarrow \mathbb{N}\) such that for all \(\epsilon, \delta\) and every distribution \(\mathcal{D}\), whenever \(|S| = m \geq m_\mathcal{H}^{UC}(\epsilon, \delta)\),
- \(S\) is \(\epsilon\)-representative with probability at least \(1-\delta\)
- It follows that every finite hypothesis class \(\mathcal{H}\) is learnable in the agnostic PAC model (the inequality chain below makes this precise)
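The key step (as in the book): if \(S\) is \(\frac{\epsilon}{2}\)-representative, then for the ERM output \(h_S\) and any \(h \in \mathcal{H}\),

\[ L_\mathcal{D}(h_S) \leq L_S(h_S) + \frac{\epsilon}{2} \leq L_S(h) + \frac{\epsilon}{2} \leq L_\mathcal{D}(h) + \epsilon, \]

where the middle inequality holds because \(h_S\) minimizes the empirical risk. Hence \(L_\mathcal{D}(h_S) \leq \min_{h \in \mathcal{H}} L_\mathcal{D}(h) + \epsilon\); for finite \(\mathcal{H}\), Hoeffding's inequality plus a union bound give the uniform convergence property with \(m_\mathcal{H}^{UC}(\epsilon, \delta) \leq \left\lceil \frac{\log(2|\mathcal{H}| / \delta)}{2\epsilon^2} \right\rceil\), which yields agnostic PAC learnability.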
Bias-Complexity Tradeoff
- Theorem 5.1 (No-Free-Lunch)
- For every learner \(A\) and every \(m < \frac{|\mathcal{X}|}{2}\), there exist a distribution \(\mathcal{D}\) and an \(f\) with \(L_{\mathcal{D}}(f) = 0\) such that, over \(S \sim \mathcal{D}^m\), \(L_{\mathcal{D}}(A(S)) \geq \frac{1}{8}\) with probability at least \(\frac{1}{7}\)
- i.e., if the training set is too small, there is always a distribution that some hypothesis fits perfectly, yet on which the learner fails badly with substantial probability
- Corollary 5.2
- Let \(\mathcal{X}\) be an infinite domain set and \(\mathcal{H}\) be the set of all \(f:\mathcal{X} \rightarrow \{0, 1\}\). Then \(\mathcal{H}\) is not PAC learnable.
- That is, without prior knowledge that restricts the hypothesis space \(\mathcal{H}\) (so as to limit its size), \(\mathcal{H}\) is not PAC learnable
- Error decomposition
- \(L_{\mathcal{D}}(h_{S}) = \epsilon_{app} + \epsilon_{est}\), where \(\epsilon_{app} = \min_{h \in\mathcal{H}}L_{\mathcal{D}}(h)\) and \(\epsilon_{est}=L_{\mathcal{D}}(h_{S}) - \epsilon_{app}\)
- Approximation Error (\(\epsilon_{app}\))
- the minimum risk achievable within \(\mathcal{H}\); increasing the complexity of \(\mathcal{H}\) clearly helps reduce \(\epsilon_{app}\)
- Estimation Error (\(\epsilon_{est}\))
- the risk gap between the hypothesis \(h_{S}\) trained on \(S\) and the best hypothesis in \(\mathcal{H}\), stemming from the information gap between the dataset and the real world
- increasing the sample size \(m := |S|\) (or, relatively speaking, decreasing the complexity of \(\mathcal{H}\)) clearly helps reduce \(\epsilon_{est}\)
- this leads to underfitting vs. overfitting: one must trade off bias against complexity (the simulation after this list illustrates it)
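A minimal numerical illustration of the tradeoff (my own sketch, using least-squares polynomial regression rather than the book's 0-1 loss; the target function and constants are arbitrary): the polynomial degree plays the role of the complexity of \(\mathcal{H}\), so a low degree underfits (large \(\epsilon_{app}\)) while a high degree overfits the 20 training points (large \(\epsilon_{est}\)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target and noisy samples (not from the book).
def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test = 20, 1000
x_train = rng.uniform(0, 1, n_train)
y_train = target(x_train) + rng.normal(0, 0.2, n_train)
x_test = rng.uniform(0, 1, n_test)
y_test = target(x_test) + rng.normal(0, 0.2, n_test)

for degree in (1, 3, 12):
    # Fit a degree-`degree` polynomial by least squares on the training set.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Underfitting: both MSEs large; overfitting: train MSE small, test MSE large.
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```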
VC-Dimension
We saw earlier that "every finite \(\mathcal{H}\) is PAC learnable" (Corollary 2.2); what about infinite \(\mathcal{H}\)?
- Definition 6.2 (Restriction of \(\mathcal{H}\) to \(C\))
- Let \(\mathcal{H}\) be a class of functions from \(\mathcal{X}\) to \(\{0, 1\}\) and let \(C = \{c_1, \dots, c_m\} \subset \mathcal{X}\). The restriction of \(\mathcal{H}\) to \(C\) is the set of functions from \(C\) to \(\{0, 1\}\) that can be derived from \(\mathcal{H}\). That is,
\[ \mathcal{H}_C = \{(h(c_1), \dots, h(c_m)) : h \in \mathcal{H}\}, \]
- where we represent each function from \(C\) to \(\{0, 1\}\) as a vector in \(\{0, 1\}^{|C|}\).
- If the restriction of \(\mathcal{H}\) to \(C\) is the set of all functions from \(C\) to \(\{0, 1\}\), then we say that \(\mathcal{H}\) shatters the set \(C\). Formally:
- Definition 6.3 (Shattering)
- A hypothesis class \(\mathcal{H}\) shatters a finite set \(C \subset \mathcal{X}\) if the restriction of \(\mathcal{H}\) to \(C\) is the set of all functions from \(C\) to \(\{0, 1\}\). That is, \(|\mathcal{H}_C| = 2^{|C|}\).
- Definition 6.5 (VC-dimension)
- The VC-dimension of a hypothesis class \(\mathcal{H}\), denoted \(\text{VCdim}(\mathcal{H})\), is the maximal size of a set \(C \subset \mathcal{X}\) that can be shattered by \(\mathcal{H}\). If \(\mathcal{H}\) can shatter sets of arbitrarily large size, we say that \(\mathcal{H}\) has infinite VC-dimension. (A brute-force shattering check for a concrete class appears at the end of this section.)
- Theorem 6.6
- Let \(\mathcal{H}\) be a class of infinite VC-dimension. Then, \(\mathcal{H}\) is not PAC learnable.
- Conclusion: the following statements are equivalent
- \(\mathcal{H}\) is PAC learnable
- \(\mathcal{H}\) has finite VC-dimension
- \(\mathcal{H}\) has the uniform convergence property
- any ERM (Empirical Risk Minimization) rule is a successful learner
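To make shattering concrete, here is a small brute-force check (my own illustrative sketch, not from the book) for the class of threshold functions \(h_a(x) = \mathbb{1}[x \geq a]\) on \(\mathbb{R}\): only thresholds lying below, between, or above the points of \(C\) yield distinct labelings, so \(\mathcal{H}_C\) can be enumerated directly.

```python
# Brute-force shattering check for threshold functions h_a(x) = 1[x >= a].
def restriction(C):
    """H_C: the labelings of C realizable by some threshold, as tuples."""
    C = sorted(C)
    # One representative threshold below min(C), between each consecutive
    # pair of points, and above max(C) covers every distinct labeling.
    thresholds = [C[0] - 1] + [(a + b) / 2 for a, b in zip(C, C[1:])] + [C[-1] + 1]
    return {tuple(int(x >= a) for x in C) for a in thresholds}

def shatters(C):
    # C is shattered iff H_C contains all 2^|C| labelings (Definition 6.3).
    return len(restriction(C)) == 2 ** len(C)

print(shatters([0.5]))       # True: any single point is shattered
print(shatters([0.3, 0.7]))  # False: labeling (1, 0) would need 0.3 >= a > 0.7
# The largest shattered set has size 1, so VCdim = 1 here (Definition 6.5).
```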