Uniform convergence in probability is a concept in probability theory with applications to statistical learning theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities.
Contents
- Definitions
- Uniform convergence theorem statement
- SauerShelah lemma
- Proof of uniform convergence theorem
- Symmetrization
- Permutations
- Reduction to a finite class
- References
The law of large numbers says that, for each single event, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. But in some applications, we are interested not in a single event but in a whole family of events. We would like to know whether the empirical frequency of every event in the family converges to its theoretical probability simultaneously. The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.
Definitions
For a class of predicates
The theoretical probability of
The Uniform Convergence Theorem states, roughly, that if
Here "simple" means that the Vapnik–Chervonenkis dimension of the class
The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis using the concept of growth function.
Uniform convergence theorem statement
If
where, for any
and
And for any natural number
From the point of Learning Theory one can consider
Sauer–Shelah lemma
The Sauer–Shelah lemma relates the shattering number
Lemma:
Corollary:
Proof of uniform convergence theorem
and are the sources of the proof below. Before we get into the details of the proof of the Uniform Convergence Theorem we will present a high level overview of the proof.
- Symmetrization: We transform the problem of analyzing
| Q P ( h ) − Q ^ x ( h ) | ≥ ε into the problem of analyzing | Q ^ r ( h ) − Q ^ s ( h ) | ≥ ε / 2 , where r and s are i.i.d samples of size m drawn according to the distribution P . One can view r as the original randomly drawn sample of length m , while s may be thought as the testing sample which is used to estimate Q P ( h ) . - Permutation: Since
r and s are picked identically and independently, so swapping elements between them will not change the probability distribution on r and s . So, we will try to bound the probability of | Q ^ r ( h ) − Q ^ s ( h ) | ≥ ε / 2 for some h ∈ H by considering the effect of a specific collection of permutations of the joint sample x = r | | s . Specifically, we consider permutations σ ( x ) which swap x i and x m + i in some subset of 1 , 2 , . . . , m . The symbol r | | s means the concatenation of r and s . - Reduction to a finite class: We can now restrict the function class
H to a fixed joint sample and hence, if H has finite VC Dimension, it reduces to the problem to one involving a finite function class.
We present the technical details of the proof.
Symmetrization
Lemma: Let
Then for
Proof: By the triangle inequality,
if
Therefore,
since
Now for
Thus for any
Notice,
for the mentioned bound on
Permutations
Let
Lemma: Let
where the expectation is over
Proof: For any
(since coordinate permutations preserve the product distribution
The maximum is guaranteed to exist since there is only a finite set of values that probability under a random permutation can take.
Reduction to a finite class
Lemma: Basing on the previous lemma,
Proof: Let us define
We see that
For
Since, the distribution over the permutations
Thus,
where the probability on the right is over
Finally, combining all the three parts of the proof we get the Uniform Convergence Theorem.