Large deviations: Cramér vs Sanov
I have recently been reading up on two classical results from large deviation theory: the Cramér-Chernoff theorem and Sanov’s theorem. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent.
It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information-theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…
The Cramér-Chernoff Theorem
Let \(X, X_1, \ldots, X_n\) be independent, identically distributed random variables, and \(\DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator*{\argmin}{arg\,min}\) let \(\mu(\lambda) = \log \E[e^{\lambda X}]\) be their cumulant generating function. Then many standard concentration inequalities (Hoeffding, Bernstein, Bennett) can be derived [0, Appendix A.1] from a single basic result:
\[\Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) \leq e^{-n\{\lambda a - \mu(\lambda)\}}\qquad(\lambda > 0).\]This result can easily be proved using Markov’s inequality and is nontrivial as long as \(a > \E[X]\). Its ubiquity is explained by the Cramér-Chernoff theorem [1], which states that it asymptotically achieves the best possible exponent if we optimize our choice of \(\lambda\):
\[\lim_{n \to \infty} \frac{1}{n}\log \Pr\Big(\frac{1}{n}\sum_{i=1}^n X_i \geq a\Big) = -\sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\]
Sanov’s Theorem
The empirical distribution \(P_n\) of \(X_1, \ldots, X_n\) gives probability \(P_n(A) = \frac{1}{n}\sum_{i=1}^n {\bf 1}_{\{X_i \in A\}}\) to any event \(A\), which is the fraction of variables taking their value in \(A\). If the distribution of the variables is \(P^*\), then asymptotically we would expect \(P_n\) to be close to \(P^*\). So then if \(\mathcal{P}\) is a set of distributions that is far away from \(P^*\) in some sense, the probability that \(P_n \in \mathcal{P}\) should be small. This intuition is quantified by Sanov’s theorem [2, 3].
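To make the definition concrete, here is a tiny sketch (the sample values are made up for illustration) of how the empirical distribution assigns frequencies to values and probabilities to events:

```python
from collections import Counter
from fractions import Fraction

# Hypothetical sample of n = 8 discrete observations.
xs = [1, 2, 2, 3, 1, 2, 1, 1]
n = len(xs)

# Empirical distribution P_n: the fraction of samples taking each value.
P_n = {x: Fraction(c, n) for x, c in Counter(xs).items()}

# P_n(A) for the event A = {x >= 2}: the fraction of the X_i falling in A.
A = {2, 3}
prob_A = sum(p for x, p in P_n.items() if x in A)
```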
To avoid measure-theoretic complications, let us assume that the variables \(X_1, \ldots, X_n\) are discrete, with a finite number of possible values. Suppose \(\mathcal{P}\) is a convex set of distributions and assume that the information projection
\[Q^* = \argmin_{P \in \mathcal{P}}\ D(P\|P^*)\]of \(P^*\) on \(\mathcal{P}\) exists, where \(D(P\|P^*) = \sum_x P(x) \log \frac{P(x)}{P^*(x)}\) is the Kullback-Leibler divergence. Then \(D(Q^*\|P^*)\) provides an information-theoretic measure of the “distance” of \(\mathcal{P}\) from \(P^*\), and indeed [2]
\[\Pr\Big(P_n \in \mathcal{P}\Big) \leq e^{-nD(Q^*\|P^*)}.\]Moreover, Sanov’s theorem tells us that this bound is asymptotically tight in the exponent:
\[\lim_{n \to \infty} \frac{1}{n}\log \Pr\Big(P_n \in \mathcal{P}\Big) = -D(Q^*\|P^*).\]Csiszár [2, p. 790] has an elegant information-theoretic proof of the upper bound. He also works out sufficient measure-theoretic conditions for the theorem to hold in continuous spaces, which are quite clean, but still require considerable care to verify.
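As a small numerical sanity check (the choices \(P^* = \mathrm{Ber}(0.5)\), \(\mathcal{P} = \{\mathrm{Ber}(q) : q \geq 0.6\}\), and \(n = 100\) are an arbitrary toy example), we can locate the information projection by a crude grid search and verify the upper bound by Monte Carlo:

```python
import math
import random

random.seed(1)
p_star, a, n = 0.5, 0.6, 100

def kl(q, p):
    """D(Ber(q) || Ber(p)) in nats."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

# Information projection of P* = Ber(p_star) onto {Ber(q) : q >= a}, by grid search.
qs = [a + k * (1 - a - 1e-6) / 10_000 for k in range(10_001)]
q_star = min(qs, key=lambda q: kl(q, p_star))  # minimizer sits on the boundary q = a
d_star = kl(q_star, p_star)

# Monte Carlo estimate of Pr(P_n in P), i.e. Pr(empirical mean >= a).
trials = 50_000
hits = sum(
    sum(random.random() < p_star for _ in range(n)) >= a * n
    for _ in range(trials)
)
estimate = hits / trials
bound = math.exp(-n * d_star)  # the upper bound e^{-n D(Q*||P*)}
```

The minimizer lands on the boundary \(q = a\), matching the intuition that the projection of \(P^* \notin \mathcal{P}\) lies on the boundary of \(\mathcal{P}\), and the Monte Carlo estimate indeed stays below the bound.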
One may extend Sanov’s theorem to non-convex sets \(\mathcal{P}\) by a union bound argument. For example, Cover and Thomas [3] take a union bound over all possible values for \(P_n\), which they call types. By discretization arguments one may further extend the theorem to infinite spaces [4], but then things get a bit too asymptotic for my taste.
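To get a feel for why the union bound over types is cheap, here is a small count (the alphabet size and \(n\) are arbitrary choices): the number of types, i.e. the number of possible empirical distributions of \(n\) samples, grows only polynomially in \(n\), while the number of sequences grows exponentially.

```python
from math import comb

def num_types(n, k):
    """Number of empirical distributions of n samples over a k-symbol
    alphabet: compositions of n into k non-negative parts."""
    return comb(n + k - 1, k - 1)

n, k = 50, 3
types = num_types(n, k)  # polynomial in n: at most (n + 1)^k
sequences = k ** n       # exponential in n
```

So paying a union bound over all types costs only a polynomial factor, which vanishes in the exponent as \(n \to \infty\).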
From Sanov to Cramér-Chernoff
It turns out that the Cramér-Chernoff theorem can be obtained from Sanov’s theorem. (This is called contraction.) The trick is to observe that
\[\E_{X \sim P_n}[X] = \frac{1}{n} \sum_{i=1}^n X_i,\]so if we define the convex set \(\mathcal{P} = \{P \mid \E_{X \sim P}[X] \geq a\}\), then
\[\Pr\Big(P_n \in \mathcal{P}\Big) = \Pr\Big(\frac{1}{n} \sum_{i=1}^n X_i \geq a\Big).\]It remains to evaluate \(D(Q^*\|P^*)\) in this case, which can be done as follows:

First, introduce a Lagrange multiplier to obtain:
\[\min_{P \in \mathcal{P}} D(P\|P^*) = \min_{\{P \mid \E_{X \sim P}[X] \geq a\}} D(P\|P^*) = \min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]
Next, use Sion’s minimax theorem to swap the order of the min and the sup:
\[\min_P \sup_{\lambda > 0} \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} = \sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\}.\]
Finally, recognize the following convex duality for the Kullback-Leibler divergence:
\[\sup_P\ \big(\E_{X \sim P}[\lambda X] - D(P\|P^*)\big) = \mu(\lambda),\]
where \(\mu(\lambda) = \log \E_{P^*}[e^{\lambda X}]\) is the cumulant generating function defined above. (The supremum is attained by the exponentially tilted distribution \(P_\lambda(x) = P^*(x)e^{\lambda x - \mu(\lambda)}\).) We get:
\[\begin{split}\sup_{\lambda > 0}\inf_P \Big\{D(P\|P^*) - \lambda (\E_P[X] - a)\Big\} &= \sup_{\lambda > 0}\Big\{\lambda a - \sup_P\Big(\E_P[\lambda X] - D(P\|P^*)\Big)\Big\}\\&= \sup_{\lambda > 0}\ \{\lambda a - \mu(\lambda)\}.\end{split}\]Chaining everything together we exactly recover the Cramér-Chernoff theorem, and we see that the upper bounds have exactly the same constants.
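For a concrete check of this chain of equalities, one can verify numerically that the two exponents coincide (again with the toy choice \(P^* = \mathrm{Ber}(0.5)\) and \(a = 0.6\); the grid search is a crude stand-in for the exact optimization):

```python
import math

p_star, a = 0.5, 0.6

def mu(lam):
    """Cumulant generating function of X ~ Ber(p_star)."""
    return math.log(1 - p_star + p_star * math.exp(lam))

# Cramér-Chernoff exponent: sup_{lambda > 0} {lambda * a - mu(lambda)}, by grid search.
grid = [k / 1000 for k in range(1, 5001)]
chernoff = max(lam * a - mu(lam) for lam in grid)

# Sanov exponent: D(Q* || P*) with Q* = Ber(a), the information
# projection of Ber(p_star) onto {P : E_P[X] >= a}.
sanov = a * math.log(a / p_star) + (1 - a) * math.log((1 - a) / (1 - p_star))
```

For Bernoulli variables the supremum is attained at \(\lambda = \log\frac{a(1-p)}{(1-a)p}\), and the common value of the two exponents is exactly the binary Kullback-Leibler divergence.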
From Cramér-Chernoff to Sanov
One may also view things the other way around. The Cramér-Chernoff theorem bounds the probability that the value of the empirical mean \(\frac{1}{n} \sum_{i=1}^n X_i\) lies in the set \(A = [a,\infty)\). As discussed by Van der Vaart [1, pp. 208-210], both the notion of empirical mean and the set \(A\) can be generalized. In particular, one may regard the empirical distribution \(P_n\) as the mean of \(n\) point masses (i.e. Dirac measures) on the points \(X_1, \ldots, X_n\). Van der Vaart then presents Sanov’s theorem as just one instance of such generalized Cramér-Chernoff theorems.
Conclusion
We have seen the close similarities between the Cramér-Chernoff theorem and Sanov’s theorem. For me Sanov’s theorem seems easier to interpret, but in continuous spaces one has to deal with the more complicated measure-theoretic conditions of Csiszár [2]. For technical reasons it may therefore be preferable to use the Cramér-Chernoff result.
Last Words
It turns out that the upper bound in the Cramér-Chernoff theorem does leave some slack: the true probability typically carries an extra factor of order \(1/\sqrt{n}\), which is negligible on the exponential scale compared to the term in the exponent.
References
[0] N. Cesa-Bianchi, G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006
[1] A.W. van der Vaart, Asymptotic Statistics, Cambridge University Press, 2000
[2] I. Csiszár, Sanov property, generalized I-projection and a conditional limit theorem, The Annals of Probability, 1984, Vol. 12, No. 3, pp. 768-793
[3] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, 1991
[4] F. den Hollander, Large Deviations, Vol. 14 of Fields Institute Monographs, 2000