## April 20, 2012

### Just How Quickly Do We Forget?

Attention conservation notice: 2500+ words on estimating how quickly time series forget their own history. Only of interest if you care about the intersection of stochastic processes and statistical learning theory. Full of jargon, equations, log-rolling and self-promotion, yet utterly abstract.

I promised to say something about the content of Daniel's thesis, so let me talk about two of his papers, which go into chapter 4; there is a short conference version and a long journal version.

Daniel J. McDonald, Cosma Rohilla Shalizi and Mark Schervish, "Estimating beta-mixing coefficients", AIStats 2011, arxiv:1103.0941
Abstract: The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the $\beta$-mixing rate based on a single stationary sample path and show it is $L_1$-risk consistent.
----, "Estimating beta-mixing coefficients via histograms", arxiv:1109.5998
Abstract: The literature on statistical learning for time series often assumes asymptotic independence or "mixing" of data sources. Beta-mixing has long been important in establishing the central limit theorem and invariance principle for stochastic processes; recent work has identified it as crucial to extending results from empirical processes and statistical learning theory to dependent data, with quantitative risk bounds involving the actual beta coefficients. There is, however, presently no way to actually estimate those coefficients from data; while general functional forms are known for some common classes of processes (Markov processes, ARMA models, etc.), specific coefficients are generally beyond calculation. We present an $L_1$-risk consistent estimator for the beta-mixing coefficients, based on a single stationary sample path. Since mixing coefficients involve infinite-order dependence, we use an order-d Markov approximation. We prove high-probability concentration results for the Markov approximation and show that as $d \rightarrow \infty$, the Markov approximation converges to the true mixing coefficient. Our estimator is constructed using d dimensional histogram density estimates. Allowing asymptotics in the bandwidth as well as the dimension, we prove $L_1$ concentration for the histogram as an intermediate step.

Recall the world's simplest ergodic theorem: if $X_t$ is a sequence of random variables with common expectation $m$ and variance $v$, and stationary covariance $\mathrm{Cov}[X_t, X_{t+h}] = c_h$. Then the time average $\overline{X}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{X_i}$ also has expectation $m$, and the question is whether it converges on that expectation. The world's simplest ergodic theorem asserts that if the correlation time $T = \frac{\sum_{h=1}^{\infty}{|c_h|}}{v} < \infty$ then $\mathrm{Var}\left[ \overline{X}_n \right] \leq \frac{v}{n}(1+2T)$

Since, as I said, the expectation of $\overline{X}_n$ is $m$ and its variance is going to zero, we say that $\overline{X}_n \rightarrow m$ "in mean square".

From this, we can get a crude but often effective deviation inequality, using Chebyshev's inequality: $\Pr{\left(|\overline{X}_n - m| > \epsilon\right)} \leq \frac{v}{\epsilon^2}\frac{1+2T}{n}$

The meaning of the condition that the correlation time $T$ be finite is that the correlations themselves have to trail off as we consider events which are widely separated in time — they don't ever have to be zero, but they do need to get smaller and smaller as the separation $h$ grows. (One can actually weaken the requirement on the covariance function to just $\lim_{n\rightarrow \infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0$, but this would take us too far afield.) In fact, as these formulas show, the convergence looks just like what we'd see for independent data, only with $\frac{n}{1+2T}$ samples instead of $n$, so we call the former the effective sample size.

All of this is about the convergence of averages of $X_t$, and based on its covariance function $c_h$. What if we care not about $X$ but about $f(X)$? The same idea would apply, but unless $f$ is linear, we can't easily get its covariance function from $c_h$. The mathematicians' solution to this has been to invent stronger notions of decay-of-correlations, called "mixing". Very roughly speaking, we say that $X$ is mixing when, if you pick any two (nice) functions $f$ and $g$, I can always show that $\lim_{h\rightarrow\infty}{\mathrm{Cov}\left[ f(X_t), g(X_{t+h}) \right]} = 0$

Note (or believe) that this is "convergence in distribution"; it happens if, and only if, the distribution of events up to time $t$ is becoming independent of the distribution of events from time $t+h$ onwards.

To get useful results, it is necessary to quantify mixing, which is usually done through somewhat stronger notions of dependence. (Unfortunately, none of these have meaningful names. The review by Bradley ought to be the standard reference.) For instance, the "total variation" or $L_1$ distance between probability measures $P$ and $Q$, with densities $p$ and $q$ is, $d_{TV}(P,Q) = \frac{1}{2}\int{\left|p(u) - q(u)\right| du}$ This has several interpretations, but the easiest to grasp is that it says how much $P$ and $Q$ can differ in the probability they give to any one event: for any $E$, $d_{TV}(P,Q) \geq |P(E) - Q(E)|$. One use of this distance is to measure how the dependence between random variables, by seeing far their joint distribution is from the product of their marginal distributions. Abusing notation a little to write $P(U,V)$ for the joint distribution of $U$ and $V$, we measure dependence as $\beta(U,V) \equiv d_{TV}(P(U,V), P(U) \otimes P(V)) = \frac{1}{2}\int{\left|p(u,v)-p(u)p(v)\right|du dv}$ This will be zero just when $U$ and $V$ are statistically independent, and one when, on average, conditioning on $U$ confines $V$ to a set which would otherwise have probability zero. (For instance if $U$ has a continuous distribution and $V$ is a function of $U$ — or one of two randomly chosen functions of $U$.)

We can relate this back to the earlier idea of correlations between functions by realizing that $\beta(U,V) = \sup_{|r|\leq 1}{\left|\int{r(u,v) dP(U,V)} - \int{r(u,v)dP(U)dP(V)}\right|} ~,$ that $\beta$ says how much the expected value of a bounded function $r$ could change between the dependent and the independent distributions. (There is no assumption that the test function $r$ factorizes, and in fact it's important to allow $r(u,v) \neq f(u)g(v)$.)

We apply these ideas to time series by looking at the dependence between the past and the future: $\begin{eqnarray*} \beta(h) & \equiv & d_{TV}(P(X^t_{-\infty}, X_{t+h}^{\infty}), P(X^t_{-\infty}) \otimes P(X_{t+h}^{\infty})) \\ & = & \frac{1}{2}\int{\left|p(x^t_{-\infty},x_{t+h}^{\infty})-p(x^t_{-\infty})p(x^{\infty}_{t+h})\right|dx^t_{-\infty}dx^{\infty}_{t+h}} \end{eqnarray*}$ (By stationarity, the integral actually does not depend on $t$.) When $\beta(h) \rightarrow 0$ as $h \rightarrow \infty$, we have a "beta-mixing" process. (These are also called "absolutely regular".) Convergence in total variation implies convergence in distribution, but not vice versa, so beta-mixing is stronger than common-or-garden mixing.

Notions like beta-mixing were originally introduced purely for probabilistic convenience, to handle questions like "when does the central limit theorem hold for stochastic processes?" These are interesting for people who like stochastic processes, or indeed for those who want to do Markov chain Monte Carlo and want to know how long to let the chain run. For our purposes, though, what's important is that when people in statistical learning theory have given serious attention to dependent data, they have usually relied on a beta-mixing assumption.

The reason for this focus on beta-mixing is that it "plays nicely" with approximating dependent processes by independent ones. The usual form of such arguments is as follows. We want to prove a result about our dependent but mixing process $X$. For instance, we realize that our favorite prediction model will tend to do worse out-of-sample than on the data used to fit it, and we might want to bound the probability that this over-fitting will exceed $\epsilon$. If we know the beta-mixing coefficients $\beta(h)$, we can pick a separation, call it $a$, where $\beta(a)$ is reasonably small. Now we divide $X$ up into $\mu = n/a$ blocks of length $a$. If we take every other block, they're nearly independent of each other (because $\beta(a)$ is small) but not quite (because $\beta(a) \neq 0$). Introduce a (fictitious) random sequence $Y$, where blocks of length $a$ have the same distribution as the blocks in $X$, but there's no dependence between blocks. Since $Y$ is an IID process, it is easy for us to prove that, for instance, the probability of over-fitting $Y$ by more than $\epsilon$ is at most some small $\delta(\epsilon,\mu/2)$. Since $\beta$ tells us about how well dependent probabilities are approximated by independent ones, the probability of the bad event happening with the dependent data is at most $\delta(\epsilon,\mu/2) + (\mu/2)\beta(a)$. We can make this as small as we like by letting $\mu$ and $a$ both grow as the time series gets longer. Basically, anything result which holds for an IID process will also hold for a beta-mixing one, with a penalty in the probability that depends on $\beta$. There are some details to fill in here (how to pick the separation $a$? should the blocks always be the same length as the "filler" between blocks?), but this is the basic frame.

What it leaves open, however, is how to estimate the mixing coefficients $\beta(h)$. For Markov models, one could it principle calculate it from the transition probabilities. For more general processes, though, calculating beta from the known distribution is not easy. In fact, we are not aware of any previous work on estimating the $\beta(h)$ coefficients from observational data. (References welcome!) Because of this, even in learning theory, people have just assumed that the mixing coefficients were known, or that it was known they went to zero at a certain rate. This was not enough for what we wanted to do, which was actually calculate bounds on error from data.

There were two tricks to actually coming up with an estimator. The first was to reduce the ambitions a little bit. If you look at the equation for $\beta(h)$ above, you'll see that it involves integrating over the infinite-dimensional distribution. This is daunting, so instead of looking at the whole past and future, we'll introduce a horizon, $d$ steps away, and cut things off there: $\begin{eqnarray*} \beta^{(d)}(h) & \equiv & d_{TV}(P(X^t_{t-d}, X_{t+h}^{t+h+d}), P(X^t_{t-d}) \otimes P(X_{t+h}^{t+h+d})) \\ & = & \frac{1}{2}\int{\left|p(x^t_{t-d},x_{t+h}^{t+h+d})-p(x^t_{t-d})p(x^{t+h+d}_{t+h})\right|dx^t_{t-d}dx^{t+h+d}_{t+h}} \end{eqnarray*}$ If $X$ is a Markov process, then there's no difference between $\beta^{(d)}(h)$ and $\beta(h)$. If $X$ is a Markov process of order $p$, then $\beta^{(d)}(h) = \beta(h)$ once $d \geq p$. If $X$ is not Markov at any order, it is still the case that $\beta^{(d)}(h) \rightarrow \beta(h)$ as $d$ grows. So we have an approximation to $\beta$ which only involves finite-dimensional integrals, which we might have some hope of doing.

The other trick is to get rid of those integrals. Another way of writing the beta-dependence between the random variables $U$ and $V$ is $\beta(U,V) = \sup_{\mathcal{A},\mathcal{B}}{\frac{1}{2}\sum_{a\in\mathcal{A}}{\sum_{b\in\mathcal{B}}{\left| \Pr{(a \cap b)} - \Pr{(a)}\Pr{(b)} \right|}}}$ where $\mathcal{A}$ runs over finite partitions of values of $U$, and $\mathcal{B}$ likewise runs over finite partitions of values of $V$. I won't try to show that this formula is equivalent to the earlier definition, but I will contend that if you think about how that integral gets cashed out as a sum, you can sort of see how it would be. If we want $\beta^{(d)}(h)$, we can take $U = X^{t}_{t-d}$ and $V = X^{t+h+d}_{t+h}$, and we could find the dependence by taking the supremum over partitions of those two variables.

Now, suppose that the joint density $p(x^t_{t-d},x_{t+h}^{t+h+d})$ was piecewise constant, with those pieces being rectangles parallel to the coordinate axes. Then sub-dividing those rectangles would not change the sum, and the $\sup$ would actually be attained for that particular partition. Most densities are not of course piecewise constant, but we can approximate them by such piecewise-constant functions, and make the approximation arbitrarily close (in total variation). More, we can estimate those piecewise-constant approximating densities from a time series. Those estimates are, simply, histograms, which are about the oldest form of density estimation. We show that histogram density estimates converge in total variation on the true densities, when the bin-width is allowed to shrink as we get more data.

Because the total variation distance is in fact a metric, we can use the triangle inequality to get an upper bound on the true beta coefficient, in terms of the beta coefficients of the estimated histograms, and the expected error of the histogram estimates. All of the error terms shrink to zero as the time series gets longer, so we end up with consistent estimates of $\beta^{(d)}(h)$. That's enough if we have a Markov process, but in general we don't. So we can let $d$ grow as $n$ does, and that (after a surprisingly long measure-theoretic argument) turns out to do the job: our histogram estimates of $\beta^{(d)}(h)$, with suitably-growing $d$, converge on the true $\beta(h)$.

To confirm that this works, the papers go through some simulation examples, where it's possible to cross-check our estimates. We can of course also do this for empirical time series. For instance, in his thesis Daniel took four standard macroeconomic time series for the US (GDP, consumption, investment, and hours worked, all de-trended in the usual way). This data goes back to 1948, and is measured four times a year, so there are 255 quarterly observations. Daniel estimated a $\beta$ of 0.26 at one quarter's separation, $\widehat{\beta}(2) = 0.15$, $\widehat{\beta}(3) = 0.02$, and somewhere between 0 and 0.11 for $\widehat{\beta}(4)$. (That last is a sign that we don't have enough data to go beyond $h = 4$.) Optimistically assuming no dependence beyond a year, one can calculate the effective number of independent data points, which is not 255 but 31. This has morals for macroeconomics which are worth dwelling on, but that will have to wait for another time. (Spoiler: $\sqrt{\frac{1}{31}} \approx 0.18$, and that's if you're lucky.)

It's inelegant to have to construct histograms when all we want is a single number, so it wouldn't surprise us if there were a slicker way of doing this. (For estimating mutual information, which is in many ways analogous, estimating the joint distribution as an intermediate step is neither necessary nor desirable.) But for now, we can do it, when we couldn't before.

Posted at April 20, 2012 14:57 | permanent link