
appendix UML

The book is fairly hard to read; see also the notes on Zhihu: UML Reading Notes.

A Gentle Start

  • Domain Set: \(\mathcal{X}\)
  • Label Set: \(\mathcal{Y}\)
  • Data Set: \(S = \{(x_{1}, y_{1}), \dots, (x_m, y_m)\}~(\forall~i \in [m],~(x_{i}, y_{i}) \in \mathcal{X}\times\mathcal{Y})\)
  • Hypothesis: \(h:\mathcal{X \rightarrow Y} \in \mathcal{H}\)
  • Ground truth: \(f:\mathcal{X\rightarrow Y}~(\forall~i~\in~[m],f(x_i) = y_{i})\)
  • Subset: \(A \subset \mathcal{X}\)
  • Data Distribution: D
    • \(D(A)\) denotes the probability that a point \(x \in \mathcal{X}\) drawn from \(D\) falls in \(A\)
  • Error rate: \(L_{\mathcal{D},f}(h)\stackrel{\mathrm{def}}{=}\mathbb{P}_{x\sim\mathcal{D}}[h(x)\neq f(x)]\stackrel{\mathrm{def}}{=}\mathcal{D}(\{x:h(x)\neq f(x)\})\)
    • the error rate of \(h\) with respect to the distribution \(D\)
  • Empirical Risk: \(L_s(h) = \frac{|\{i \in [m]:h(x_i)\neq y_{i}\}|}{m}\)

    • the error rate of \(h\) on the training set \(S\) (a minimal computation sketch appears at the end of this section)
  • Assumption 1.1 (The Realizability Assumption)

    • Assume there exists \(h^* \in \mathcal{H}\) such that \(L_{D, f}(h^*) = 0\), which implies \(L_S(h^*) = 0\) with probability 1 over the sampling of \(S\)
    • That is, the hypothesis class being searched is assumed to contain a perfect hypothesis that realizes \(f\)
  • Theorem 1.2
    • For a finite \(\mathcal{H}\), if the training set \(S\) is drawn i.i.d. from \(D\) and its size satisfies \(m\geq \frac{\log(|\mathcal{H}| / \delta)}{\epsilon}\)
    • then with probability at least \(1-\delta\) every ERM hypothesis \(h_S\) satisfies \(L_{D, f}(h_S) \leq \epsilon\)
    • In other words, a hypothesis learned from a sufficiently large training set performs well with high probability.
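
As referenced above, a minimal Python sketch of the empirical risk \(L_S(h)\); the threshold hypothesis `h` and the four-point sample are made up purely for illustration.

```python
import numpy as np

def empirical_risk(h, S_x, S_y):
    """L_S(h): the fraction of training examples that h mislabels."""
    predictions = np.array([h(x) for x in S_x])
    return np.mean(predictions != np.asarray(S_y))

# Hypothetical example: a 1-D threshold classifier on a four-point sample.
h = lambda x: int(x >= 0.5)
S_x = [0.1, 0.4, 0.6, 0.9]
S_y = [0, 1, 1, 1]
print(empirical_risk(h, S_x, S_y))  # 0.25: one of the four points is misclassified
```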

PAC learning

  • Definition 2.1 (PAC Learnability)
    • There exist a function \(m_{\mathcal{H}}: (0, 1)^2 \rightarrow \mathbb{N}\) and a learning algorithm such that, for every \(\epsilon, \delta\), every \(D\), and every \(f\) for which the realizability assumption holds, whenever \(|S|=m \geq m_\mathcal{H}(\epsilon, \delta)\) the algorithm returns an \(h\) satisfying \(L_{D, f}(h) \leq \epsilon\) with probability at least \(1-\delta\)
    • Then \(\mathcal{H}\) is said to be PAC learnable
  • Corollary 2.2
    • Every finite \(\mathcal{H}\) is PAC learnable, with sample complexity
    • \(m_\mathcal{H}(\epsilon, \delta) \leq \left\lceil \frac{\log(|\mathcal{H}| / \delta)}{\epsilon} \right\rceil\) (a numeric evaluation of this bound is sketched after this list)
  • Definition 2.3 (Agnostic PAC Learnability)
    • Drop the realizability assumption (Assumption 1.1); instead,
    • there exist a function \(m_{\mathcal{H}}: (0, 1)^2 \rightarrow \mathbb{N}\) and a learning algorithm such that, for every \(\epsilon, \delta\) and every distribution \(\mathcal{D}\) over \(\mathcal{X}\times\mathcal{Y}\), whenever \(|S|=m \geq m_\mathcal{H}(\epsilon, \delta)\) the algorithm returns an \(h\) satisfying \(L_{\mathcal{D}}(h)\leq\min_{h'\in\mathcal{H}}L_{\mathcal{D}}(h')+\epsilon\) with probability at least \(1-\delta\)
    • Then \(\mathcal{H}\) is said to be agnostic PAC learnable
    • A more general definition: under realizability it reduces to ordinary PAC learnability, since \(\min_{h'\in\mathcal{H}}L_{\mathcal{D}}(h')=0\)
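
As referenced above, a quick numeric check of the sample-size bound from Theorem 1.2 / Corollary 2.2 (realizable case); the particular values of \(|\mathcal{H}|\), \(\epsilon\), and \(\delta\) are arbitrary examples.

```python
import math

def sample_complexity_bound(h_size, eps, delta):
    # m_H(eps, delta) <= ceil( log(|H| / delta) / eps ), natural logarithm.
    return math.ceil(math.log(h_size / delta) / eps)

# e.g. |H| = 1000, eps = 0.01, delta = 0.05 -> 991 examples suffice.
print(sample_complexity_bound(1000, 0.01, 0.05))
```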

Uniform Convergence

  • Definition 3.1 (\(\epsilon\)-Representative)
    • A training set \(S\) is \(\epsilon\)-representative if and only if:
    • \(\forall~h~\in~\mathcal{H}, |L_S(h)-L_D(h)| \leq \epsilon\)
  • Definition 3.3 (Uniform Convergence)
    • \(\mathcal{H}\) has the uniform convergence property if and only if:
    • there exists a function \(m_\mathcal{H}^{UC}: (0, 1)^2 \rightarrow \mathbb{N}\) such that for every \(\epsilon, \delta, D\), whenever \(|S| = m \geq m_\mathcal{H}^{UC}(\epsilon, \delta)\) (with \(S\) drawn i.i.d. from \(D\)),
    • the training set \(S\) is \(\epsilon\)-representative with probability at least \(1-\delta\) (a small simulation sketch follows this list)
  • From this one can derive that every finite hypothesis class \(\mathcal{H}\) is learnable in the agnostic PAC model
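
As referenced above, a small simulation sketch of the \(\epsilon\)-representative notion under an assumed toy setup (uniform \(D\) over ten points, threshold hypotheses, a noiseless target at 5); it only estimates how often a random sample of size \(m\) happens to be \(\epsilon\)-representative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: X = {0,...,9}, D uniform, target f(x) = 1[x >= 5],
# H = threshold classifiers h_t(x) = 1[x >= t] for t in {0,...,10}.
X = np.arange(10)
thresholds = np.arange(11)

def true_loss(t):
    # L_D(h_t): exact, since D is uniform over a finite domain.
    return np.mean((X >= t).astype(int) != (X >= 5).astype(int))

def empirical_loss(t, xs):
    # L_S(h_t) on a sample xs, labels coming from the target f.
    return np.mean((xs >= t).astype(int) != (xs >= 5).astype(int))

def is_representative(xs, eps):
    # S is eps-representative iff |L_S(h) - L_D(h)| <= eps for every h in H.
    return all(abs(empirical_loss(t, xs) - true_loss(t)) <= eps for t in thresholds)

m, eps, trials = 100, 0.1, 1000
hits = sum(is_representative(rng.choice(X, size=m), eps) for _ in range(trials))
print(f"fraction of eps-representative samples: {hits / trials:.3f}")
```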

Bias-Complexity tradeoff

  • Theorem 5.1 (No-Free-Lunch)
    • For every learning algorithm \(A\) and every \(m < \frac{|\mathcal{X}|}{2}\), there exist a distribution \(D\) over \(\mathcal{X}\times\{0,1\}\) and an \(f\) with \(L_{D}(f) = 0\) such that, with probability at least \(\frac{1}{7}\) over \(S \sim D^m\), \(L_{D}(A(S))\geq \frac{1}{8}\)
    • That is, if the training set is too small, one can always find a distribution on which a perfect hypothesis exists, yet the learner returns a poor one with non-negligible probability
  • Corollary 5.2
    • Let \(\mathcal{X}\) be an infinite domain set and \(\mathcal{H}\) be the set of all \(f:\mathcal{X} \rightarrow \{0, 1\}\). Then \(\mathcal{H}\) is not PAC learnable.
    • In other words, without prior knowledge that restricts the hypothesis class \(\mathcal{H}\) (i.e., limits its size), \(\mathcal{H}\) is not PAC learnable
  • Error decomposition
    • \(L_{D}(h_{S}) = \epsilon_{app} + \epsilon_{est}\) where: \(\epsilon_{app} = \min_{h \in\mathcal{H}}L_{D}(h), \epsilon_{est}=L_{D}(h_{S}) - \epsilon_{app}\)
    • Approximation error (\(\epsilon_{app}\))
      • The minimum risk achievable within \(\mathcal{H}\); increasing the complexity of \(\mathcal{H}\) clearly helps reduce \(\epsilon_{app}\)
    • Estimation error (\(\epsilon_{est}\))
      • The gap in risk between the hypothesis \(h_{S}\) trained on \(S\) and the best hypothesis in \(\mathcal{H}\); it stems from the gap in information between the training set and the true distribution
      • Clearly, increasing the sample size \(m:=|S|\) (or, relatively speaking, lowering the complexity of \(\mathcal{H}\)) helps reduce \(\epsilon_{est}\)
  • This leads to the problems of underfitting and overfitting: one must trade off bias against complexity (a toy numeric illustration of the decomposition follows below)
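
As referenced above, a toy numeric illustration of \(L_{D}(h_{S}) = \epsilon_{app} + \epsilon_{est}\); the distribution (uniform inputs with a noisy threshold target) and the finite threshold class are assumptions made up for this sketch, not anything from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: x ~ Uniform[0, 1]; the label is 1[x >= 0.3], flipped with
# probability eta; H = threshold classifiers h_t(x) = 1[x >= t] on a grid.
eta = 0.1
thresholds = np.linspace(0.0, 1.0, 11)

def true_risk(t):
    # Exact L_D(h_t) for this particular distribution.
    return eta + (1 - 2 * eta) * abs(t - 0.3)

def sample(m):
    x = rng.uniform(size=m)
    y = (x >= 0.3).astype(int)
    flip = rng.uniform(size=m) < eta
    return x, np.where(flip, 1 - y, y)

def erm(x, y):
    # Empirical risk minimization over the finite class H.
    errs = [np.mean((x >= t).astype(int) != y) for t in thresholds]
    return thresholds[int(np.argmin(errs))]

eps_app = min(true_risk(t) for t in thresholds)  # best risk attainable inside H
x_train, y_train = sample(20)                    # a deliberately small sample
h_S = erm(x_train, y_train)
eps_est = true_risk(h_S) - eps_app
print(f"L_D(h_S) = {true_risk(h_S):.3f} = eps_app {eps_app:.3f} + eps_est {eps_est:.3f}")
```

In general, enlarging \(\mathcal{H}\) pushes \(\epsilon_{app}\) down while, for the same small \(m\), it tends to push \(\epsilon_{est}\) up, which is exactly the tradeoff described above.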

VC-Dimension

Earlier we saw that "every finite \(\mathcal{H}\) is PAC learnable"; what about an infinite \(\mathcal{H}\)?

  • DEFINITION 6.2 (Restriction of \(\mathcal{H}\) to \(C\))

    • Let \(\mathcal{H}\) be a class of functions from \(\mathcal{X}\) to \(\{0, 1\}\) and let \(C = \{c_1, \dots, c_m\} \subset \mathcal{X}\). The restriction of \(\mathcal{H}\) to \(C\) is the set of functions from \(C\) to \(\{0, 1\}\) that can be derived from \(\mathcal{H}\). That is,

      \[ \mathcal{H}_C = \{(h(c_1), \dots, h(c_m)) : h \in \mathcal{H}\}, \]
    • where we represent each function from \(C\) to \(\{0, 1\}\) as a vector in \(\{0, 1\}^{|C|}\).

If the restriction of \(\mathcal{H}\) to \(C\) is the set of all functions from \(C\) to \(\{0, 1\}\), then we say that \(\mathcal{H}\) shatters the set \(C\). Formally:

  • DEFINITION 6.3 (Shattering)
    • A hypothesis class \(\mathcal{H}\) shatters a finite set \(C \subset \mathcal{X}\) if the restriction of \(\mathcal{H}\) to \(C\) is the set of all functions from \(C\) to \(\{0, 1\}\). That is, \(|\mathcal{H}_C| = 2^{|C|}\).
  • DEFINITION 6.5 (VC-dimension)
    • The VC-dimension of a hypothesis class \(\mathcal{H}\), denoted \(\text{VCdim}(\mathcal{H})\), is the maximal size of a set \(C \subset \mathcal{X}\) that can be shattered by \(\mathcal{H}\). If \(\mathcal{H}\) can shatter sets of arbitrarily large size we say that \(\mathcal{H}\) has infinite VC-dimension. (A brute-force computation of \(\mathcal{H}_C\), shattering, and the VC-dimension for small finite classes is sketched at the end of this section.)
  • THEOREM 6.6
    • Let \(\mathcal{H}\) be a class of infinite VC-dimension. Then, \(\mathcal{H}\) is not PAC learnable.
  • Conclusion: the following statements are equivalent
    • \(\mathcal{H}\) is PAC learnable
    • \(\mathcal{H}\) has finite VC-dimension
    • \(\mathcal{H}\) has the uniform convergence property
    • Any ERM (empirical risk minimization) rule is a successful learner
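
As referenced under Definition 6.5, a brute-force Python sketch of the restriction \(\mathcal{H}_C\), shattering, and the VC-dimension; it only applies to small finite classes over a small finite domain, and the threshold-classifier example at the end is an assumption chosen for illustration.

```python
from itertools import combinations

def restriction(H, C):
    # H_C: the set of behaviours of H on C, each written as a vector in {0,1}^|C|.
    return {tuple(h(c) for c in C) for h in H}

def shatters(H, C):
    # H shatters C iff H_C contains all 2^|C| functions from C to {0,1}.
    return len(restriction(H, C)) == 2 ** len(C)

def vc_dimension(H, X):
    # Largest size of a subset C of X shattered by H (brute force, finite case only).
    d = 0
    for k in range(1, len(X) + 1):
        if any(shatters(H, C) for C in combinations(X, k)):
            d = k
        else:
            break  # if no set of size k is shattered, no larger set can be
    return d

# Hypothetical finite example: threshold classifiers on X = {0,...,9}.
X = list(range(10))
H = [lambda x, t=t: int(x >= t) for t in range(11)]
print(vc_dimension(H, X))  # thresholds shatter any single point but no pair -> 1
```

Running the same check on, say, the class of intervals over a fine enough finite grid returns the familiar VC-dimension of 2.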
