Notebooks

Ergodic Theory of Markov and Related Processes

Last update: 14 Aug 2025 21:54
First version: 20 August 2007, actual content added 1 May 2025

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

Yet Another Inadequate Placeholder of references.

--- Oh, OK, I'll take a stab at this. Suppose we have a $k$ state Markov chain, with current state \( X_t \), and with transition matrix $\mathbf{p}$. By an ancient convention, \( \mathbf{p}_{ij} \) is the probability of going from state $i$ to state $j$, \( \mathbf{p}_{ij} = \Prob{X_{t+1}=j|X_t=i} \), so if $f$ is a length-$k$ row vector of state probabilities at one time, \( f \mathbf{p} \) is the vector of state probabilities at the next time step, and \( f \mathbf{p}^t \) is the vector of state probabilities $t$ time steps later. Let's assume that $\mathbf{p}$ is irreducible and aperiodic, so we can go from every state to every other state and back again, and there are no periodicities in the possible return times. Then we know that (i) the largest eigenvalue of $\mathbf{p}$, \( \lambda_1 \), is 1, and (ii) there is a unique all-positive left eigenvector going along with this eigenvalue, say \( v_1 \). (That's the Frobenius-Perron theorem, or at any rate its application to this situation.) We can scale this vector until its entries sum to 1. Since they're non-negative and add up to 1, they're a possible probability distribution on the states of the chain. Since \( v_1 \mathbf{p} = v_1 \), this distribution is invariant.
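A quick numerical sketch of this, using numpy on a made-up 3-state chain (the transition probabilities are arbitrary, chosen only so that the chain is irreducible and aperiodic):

```python
import numpy as np

# A hypothetical 3-state transition matrix; rows sum to 1, and all
# entries are positive, so the chain is irreducible and aperiodic.
p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Left eigenvectors of p are right eigenvectors of p's transpose.
eigvals, eigvecs = np.linalg.eig(p.T)

# Pick out the eigenvalue closest to 1 (the Frobenius-Perron
# eigenvalue) and normalize its eigenvector to sum to 1, giving a
# probability distribution over the states.
v1 = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
v1 = v1 / v1.sum()

print(v1)                          # the invariant distribution
print(np.abs(v1 @ p - v1).max())   # ~0, since v1 p = v1
```

(Dividing by the sum also fixes the arbitrary sign that `eig` may attach to the eigenvector, since all its entries have the same sign.)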

Now further assume that the linear-algebra gods are kind and the left eigenvectors of $\mathbf{p}$, \( v_1, v_2, \ldots v_k \), form a basis, so we can write \[ f=\sum_{i=1}^{k}{c_i v_i} \] Now the dynamics under the Markov chain are simple: \[ f \mathbf{p} = \sum_{i=1}^{k}{c_i v_i \mathbf{p}} = \sum_{i=1}^{k}{c_i v_i \lambda_i} \] and indeed \[ f \mathbf{p}^t = \sum_{i=1}^{k}{c_i v_i \lambda_i^t} \] Since the chain is irreducible and aperiodic, 1 is the unique eigenvalue of largest modulus, so \( |\lambda_i| < 1 \) for \( i > 1 \) (ordering the eigenvalues by decreasing modulus). Thus every component of \( f \mathbf{p}^t \) which comes from the other eigenvectors is shrinking exponentially fast, and \[ f \mathbf{p}^t \rightarrow c_1 v_1 \] Since \( f \mathbf{p}^t \) must be a probability distribution, and so add up to 1, and since \( v_1 \) adds up to 1, we can conclude that \( c_1=1 \). Thus \[ f \mathbf{p}^t \rightarrow v_1 \]
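One can watch this convergence happen numerically; a sketch, re-using the same made-up 3-state chain as above:

```python
import numpy as np

# Same hypothetical 3-state chain as before.
p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
eigvals, eigvecs = np.linalg.eig(p.T)
v1 = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
v1 = v1 / v1.sum()

# Start from a point mass on state 0 and iterate f -> f p, tracking
# the distance from the invariant distribution.
f = np.array([1.0, 0.0, 0.0])
errors = []
for t in range(30):
    errors.append(np.abs(f - v1).max())
    f = f @ p

# The error shrinks geometrically, at a rate set by the
# second-largest eigenvalue modulus.
lam2 = np.sort(np.abs(eigvals))[-2]
print(errors[1], errors[5], errors[10], lam2)
```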

To sum up: starting from any initial distribution, the distribution of the Markov chain approaches the invariant distribution. This holds in particular even if $f$ puts probability 1 on a particular state $x$: \[ X_{t}|X_0=x \rightsquigarrow v_1 \] This means that \( X_t \) becomes asymptotically independent of \( X_0 \) as $t\rightarrow\infty$. But since there's nothing special about starting at time 0, that means that \( X_t \) and \( X_s \) become asymptotically independent as $|t-s| \rightarrow \infty$.

Now consider some real-valued function of the state \( h(X_t) \). By the argument above, \( h(X_t) \) and \( h(X_s) \) will be asymptotically independent as $|t-s| \rightarrow \infty$. In fact, one can show that the correlation between \( h(X_t) \) and \( h(X_s) \) will decay exponentially fast, with the exponent depending on the second-largest (in modulus) eigenvalue $\lambda_2$. (I'll write this out some other time; it's more linear algebra than this nap of my son's will allow.) But now we invoke the world's simplest ergodic theorem and conclude that $\frac{1}{n}\sum_{t=1}^{n}{h(X_t)}$ converges (in mean square) to the expected value of $h(X)$ under the distribution \( v_1 \).
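A simulation sketch of the ergodic theorem, on the same made-up chain, with an arbitrary $h$ (the specific values of $h$ and the seed are of course just for illustration): simulate a long sample path and compare the time average of $h$ to its expectation under \( v_1 \).

```python
import numpy as np

p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
eigvals, eigvecs = np.linalg.eig(p.T)
v1 = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
v1 = v1 / v1.sum()

h = np.array([1.0, -2.0, 5.0])   # an arbitrary function of the state

# Simulate one long sample path of the chain.
rng = np.random.default_rng(42)
n = 100_000
x = 0
running_sum = 0.0
for _ in range(n):
    running_sum += h[x]
    x = rng.choice(3, p=p[x])

time_avg = running_sum / n
print(time_avg, h @ v1)   # time average vs. expectation under v1
```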

The promised argument about exponential decay of correlations, done during another of my son's naps (14 August 2025): Let's assume, for simplicity, that the chain starts in its invariant distribution, \( X_0 \sim v_1 \). We're interested in the covariance, at lag $t$, between \( h(X_0) \) and \( h(X_t) \), let us say $\gamma(t)$: \[ \gamma(t) \equiv \Expect{h(X_0) h(X_t)} - \Expect{h(X_0)} \Expect{h(X_t)} \] By stationarity, \( \Expect{h(X_0)} = \Expect{h(X_t)} = h(v_1) \), abusing (or extending) notation to denote the expected value of a function under a distribution over states. So let's tackle the product term in the covariance, using the law of total expectation: \[ \begin{eqnarray} \Expect{h(X_0) h(X_t)} & = & \Expect{ \Expect{h(X_0) h(X_t)|X_0} }\\ & = & \Expect{ h(X_0) \Expect{h(X_t)|X_0} } \end{eqnarray} \] As above, the distribution which puts probability 1 on the state \( x_0 \) can be expressed as a linear combination of the \( v_i \), with coefficients \( c_i(x_0) \), with \( c_1(x_0)=1 \) always. So \[ \begin{eqnarray} \Expect{h(X_t)|X_0 = x_0} & = & \sum_{j=1}^{k}{h(j) \Prob{X_t=j|X_0=x_0}}\\ & = & \sum_{j=1}^{k}{h(j) \sum_{i=1}^{k}{c_i(x_0) \lambda_i^t v_{ij}}}\\ & = & \sum_{i=1}^{k}{ c_i(x_0) \lambda_i^t \sum_{j=1}^{k}{h(j) v_{ij}}}\\ & = & \sum_{i=1}^{k}{ h(v_i) c_i(x_0) \lambda_i^t}\\ & = & h(v_1) + \sum_{i=2}^{k}{h(v_i) c_i(x_0) \lambda_i^t} \end{eqnarray} \] exchanging the order of summation in the middle. Substituting in, \[ \begin{eqnarray} \Expect{h(X_0) h(X_t)} & = & h(v_1)\Expect{h(X_0)} + \Expect{h(X_0) \sum_{i=2}^{k}{h(v_i) c_i(X_0) \lambda_i^t}}\\ & = & h(v_1)^2 + \sum_{i=2}^{k}{h(v_i) \Expect{h(X_0) c_i(X_0)} \lambda_i^t}\\ \gamma(t) & = & \sum_{i=2}^{k}{h(v_i) \Expect{h(X_0) c_i(X_0)} \lambda_i^t} \end{eqnarray} \] remembering the definition of \( h(v_1) \), and interchanging (finite) sums and expectations.
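This decay can be checked exactly, with no simulation, since for a chain started from \( v_1 \) we have \( \Expect{h(X_0) h(X_t)} = \sum_{x}{v_1(x) h(x) (\mathbf{p}^t h)(x)} \), the $x$-th entry of \( \mathbf{p}^t h \) being \( \Expect{h(X_t)|X_0=x} \). A sketch, again on the same made-up chain and the same arbitrary $h$:

```python
import numpy as np

p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
eigvals, eigvecs = np.linalg.eig(p.T)
v1 = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
v1 = v1 / v1.sum()

h = np.array([1.0, -2.0, 5.0])   # same arbitrary h as before
mean_h = h @ v1

def gamma(t):
    """Exact lag-t autocovariance of h(X_t), chain started from v1.

    E[h(X_0) h(X_t)] = sum_x v1(x) h(x) E[h(X_t)|X_0=x], and
    E[h(X_t)|X_0=x] is the x-th entry of p^t h.
    """
    cond_mean = np.linalg.matrix_power(p, t) @ h
    return v1 @ (h * cond_mean) - mean_h**2

lam2 = np.sort(np.abs(eigvals))[-2]   # second-largest eigenvalue modulus
for t in [1, 5, 10, 20]:
    print(t, gamma(t), lam2**t)   # gamma(t) shrinks like lam2**t
```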

Since \( |\lambda_i| < 1 \) for \( i > 1 \), each of the (finitely many) terms in the sum is going to zero exponentially in $t$. In fact, for large $t$, the covariance must be at most \( O(|\lambda_2|^t) \), because all the other exponentials are that small or smaller. Thus, as promised, the covariance of any function of the state decays to zero exponentially fast, with the exponent being controlled by the modulus of the second-largest eigenvalue \( \lambda_2 \).

(What if we didn't start the chain in its invariant distribution, but said \( X_0 \sim \rho \neq v_1 \) instead? Basically the same thing, but with more annoying terms to keep track of, and to reassure ourselves are exponentially small. Express \( \rho = \sum_{i=1}^{k}{d_i v_i} \), where again \( d_1 = 1 \) to ensure proper normalization. Re-doing the algebra above, we get \( \Expect{h(X_0)} = h(\rho) \), \( \Expect{h(X_t)} = h(v_1) + O(|\lambda_2|^t) \), and \( \Expect{h(X_0) h(X_t)} = h(\rho) h(v_1) + O(|\lambda_2|^t) \), so, once again, everything's as promised.)
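The non-invariant-start claim about \( \Expect{h(X_t)} \) is also easy to verify exactly: with \( X_0 \sim \rho \), \( \Expect{h(X_t)} = \rho \mathbf{p}^t h \), and the gap from \( h(v_1) \) should shrink like \( |\lambda_2|^t \). A sketch on the same made-up chain, taking $\rho$ to be a point mass:

```python
import numpy as np

p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
eigvals, eigvecs = np.linalg.eig(p.T)
v1 = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
v1 = v1 / v1.sum()

h = np.array([1.0, -2.0, 5.0])
rho = np.array([1.0, 0.0, 0.0])   # decidedly not the invariant distribution

# E[h(X_t)] when X_0 ~ rho is rho p^t h; track its gap from h(v1).
gaps = [abs(rho @ np.linalg.matrix_power(p, t) @ h - h @ v1)
        for t in range(25)]
print(gaps[0], gaps[5], gaps[20])   # exponentially shrinking
```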

(I am pretty sure I had a different, more linear-algebraic argument in mind in May, but I forget what it was.)

