Notebooks

Projectivity in Statistical Models

01 Dec 2023 21:59

Suppose we consider a sequence of statistical observations, where we keep gathering more and more data. (Perhaps we're running more and more replications of an experiment, or doing larger and larger surveys, or sequentially extending a time-series.) We'll get a sequence of sample spaces, where each one contains all the previous spaces, plus some new variables for the additional information. If we're at the \( n^{\mathrm{th}} \) sample space, we can recover an earlier one, say the \( m^{\mathrm{th}} \) with \( m < n \), by just dropping the extra data. Mathematically, this amounts to "projecting" onto the first \( m \) coordinates. Let's write \( \pi_{n\mapsto m} \) for the function which does this projection. The inverse of this, \( \pi_{n\mapsto m}^{-1} \), will be a set-valued function, i.e., \( \pi_{n\mapsto m}^{-1}(a) \) will consist of all size-\( n \) data values which would be mapped down to \( a \) when we just look at their first \( m \) coordinates.

Suppose also that we have a sequence of probability distributions, one for each sample space, say \( P_n \) for the \( n^{\mathrm{th}} \) space. Then we say the distributions are projective, or form a projective family, when, for any (measurable) set \( A \) in the \( m^{\mathrm{th}} \) sample space, \[ P_m(X_m \in A) = P_n(X_n \in \pi_{n\mapsto m}^{-1}(A)) \] We also write this as \( P_m = \pi_{n \mapsto m} P_n \), i.e., \( P_m \) is the push-forward of \( P_n \) under the projection.
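
On finite sample spaces, the definition can be checked mechanically: push \( P_n \) down by summing it over the preimage of each length-\( m \) prefix, and compare with \( P_m \). Here is a minimal sketch in Python; the helper names are mine, purely for illustration.

    from itertools import product

    def push_forward(p_n, m):
        """Compute pi_{n->m} P_n on a finite space: sum P_n over the preimage of
        each length-m prefix, i.e. over all extensions of the first m coordinates."""
        p_m = {}
        for x, prob in p_n.items():
            prefix = x[:m]                      # pi_{n->m}(x)
            p_m[prefix] = p_m.get(prefix, 0.0) + prob
        return p_m

    def is_projective(p_m, p_n, m, tol=1e-12):
        """Does P_m equal the push-forward of P_n under pi_{n->m}?"""
        pushed = push_forward(p_n, m)
        keys = set(p_m) | set(pushed)
        return all(abs(p_m.get(k, 0.0) - pushed.get(k, 0.0)) < tol for k in keys)

    # Example: two IID fair-coin tosses project down to one.
    p1 = {(0,): 0.5, (1,): 0.5}
    p2 = {x: 0.25 for x in product((0, 1), repeat=2)}
    print(is_projective(p1, p2, m=1))           # True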

(If you wonder about sample spaces which aren't in a sequence, or different projections --- what if you wanted to ignore the first observation? --- you can work out how to extend the notation.)

You might think that this is too trivial a property to need a name, let alone to have to worry about. The point of giving it a name comes from the Kolmogorov extension theorem: if the \( P_n \) form a projective family over all finite \( n \), then there exists a unique, well-defined probability measure on infinite sequences whose finite-dimensional distributions are exactly the \( P_n \).

That's probability theory. The statistical issue comes when we specify models through their distributions over different sample spaces. Often, in surveys or in regression, we just give a marginal distribution for samples of size 1, and say "and we assume the data are IID", which means the joint distributions over larger samples are products of the one-dimensional distribution, and projectivity is automatic. In time series, we often specify the model in conditional form, e.g., "Here's \( P(X_t|X_{t-1}) \)" for a Markov model, and then again projectivity is automatic: summing the joint distribution over the last observation just recovers the joint distribution for the shorter series. But it turns out there are many situations in network data analysis and relational learning where we specify models in a way which gives us a \( P_n \) directly for each \( n \), and then it seems to me to be important to know whether those specifications are projective, because otherwise, what on Earth do those distributions even mean?
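
To make the "automatic" case concrete, here is a small sketch (with made-up numbers) of the Markov construction: build \( P_n \) by chaining the conditionals, drop the last coordinate, and you recover \( P_{n-1} \). The IID case works the same way, with a transition matrix that ignores the previous state.

    from itertools import product

    # A two-state Markov chain, specified (as in the text) through an initial
    # distribution and the one-step conditionals P(X_t | X_{t-1});
    # the numbers here are hypothetical.
    init = {0: 0.3, 1: 0.7}
    trans = {0: {0: 0.9, 1: 0.1},
             1: {0: 0.2, 1: 0.8}}

    def joint(n):
        """P_n over length-n trajectories, built by chaining the conditionals."""
        dist = {}
        for x in product((0, 1), repeat=n):
            p = init[x[0]]
            for a, b in zip(x, x[1:]):
                p *= trans[a][b]
            dist[x] = p
        return dist

    def drop_last(dist_n):
        """Project P_n down onto the first n-1 coordinates."""
        out = {}
        for x, p in dist_n.items():
            out[x[:-1]] = out.get(x[:-1], 0.0) + p
        return out

    p4, p5 = joint(4), joint(5)
    print(all(abs(p4[x] - drop_last(p5)[x]) < 1e-12 for x in p4))   # True: projective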

Alessandro Rinaldo and I were able to give necessary and sufficient conditions for projectivity in exponential families. The conditions concern the sufficient statistics of the family, specifically how the values of those statistics can be altered by the additional data you get at the larger sample size. (The exact conditions are too algebraic-combinatorial for me to summarize pithily.) Applying those conditions to exponential-family random graph models shows that many popular specifications are not, in fact, projective, so that the distribution they give you on social networks of (say) 2499 people is not what you'd get by summing over networks of 2500 people. (There was important prior art here by Tom Snijders, which we didn't cite because we weren't aware of it, and we should have been.)
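
For a brute-force illustration of the failure (not the argument in the paper, just enumeration on tiny graphs), take an ERGM whose sufficient statistics are the edge and triangle counts, compute it on 3 and on 4 nodes at the same parameter value, and compare the 3-node model with the marginal of the 4-node model on the induced subgraph. The parameter values below are arbitrary.

    from itertools import combinations, product
    from math import exp

    def ergm(n, theta_edge, theta_tri):
        """ERGM on n labelled nodes with sufficient statistics
        (number of edges, number of triangles)."""
        pairs = list(combinations(range(n), 2))
        w = {}
        for g in product((0, 1), repeat=len(pairs)):
            edges = {p for p, present in zip(pairs, g) if present}
            n_tri = sum(1 for a, b, c in combinations(range(n), 3)
                        if {(a, b), (a, c), (b, c)} <= edges)
            w[g] = exp(theta_edge * len(edges) + theta_tri * n_tri)
        z = sum(w.values())
        return {g: v / z for g, v in w.items()}

    def marginal_on_first_m(dist_n, n, m):
        """Push the n-node distribution down to the induced subgraph on nodes 0..m-1."""
        pairs_n = list(combinations(range(n), 2))
        keep = [pairs_n.index(p) for p in combinations(range(m), 2)]
        marg = {}
        for g, p in dist_n.items():
            sub = tuple(g[i] for i in keep)
            marg[sub] = marg.get(sub, 0.0) + p
        return marg

    theta = (-1.0, 0.5)          # hypothetical parameter values
    p3 = ergm(3, *theta)
    p4_down = marginal_on_first_m(ergm(4, *theta), 4, 3)
    # Total-variation distance between the 3-node model and the projected 4-node model:
    tv = 0.5 * sum(abs(p3[g] - p4_down[g]) for g in p3)
    print(tv)   # nonzero when theta_tri != 0: this specification is not projective
    # Setting theta_tri = 0 gives an IID-edge (Bernoulli) graph model,
    # and the distance drops to (numerically) zero.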

Some queries I have, to which I am not devoting a lot of time at the moment, but want to keep track of:

  1. Can the result be extended beyond exponential families to all distributions with sufficient statistics? (I've tried showing this, using the Neyman factorization criterion / characterization for sufficiency, and made some headway, but haven't been able to make it work.)
  2. Can we characterize projective families of models even if they don't have sufficient statistics? (This'd be great but I'd be very surprised.)
  3. When projectivity fails, so that \( P_m(\cdot;\theta) \neq \pi_{n \mapsto m} P_n(\cdot;\theta) \), there is presumably some least-false effective parameter value at the smaller size, i.e., a \( \pi_{n \mapsto m}(\theta) \) (you should excuse the expression) so that \( P_m(\cdot; \pi_{n \mapsto m}(\theta)) \) comes closest to \( \pi_{n \mapsto m}(P_n(\cdot; \theta)) \), perhaps in Kullback-Leibler divergence. Can we characterize those least-false parameter values? (A brute-force version of this calculation is sketched after this list.)
  4. When a family isn't projective, might we rescue something useful by seeing if it is, in some sense, asymptotically projective? Of course, we'd have to fix exactly what that meant. (Perhaps: the distance between \( P_m \) and \( \pi_{n \mapsto m} P_{n} \) is upper-bounded for all \( n > m \), and the upper bound is decreasing in \( m \) towards 0?)
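
For query 3, here is the sort of brute-force calculation I have in mind, on a one-parameter toy ERGM small enough to enumerate: project the 4-node model down to 3 nodes and grid-search for the size-3 parameter minimizing the Kullback-Leibler divergence to that projection. All names and numbers here are mine, for illustration only, and a grid search is of course no substitute for a characterization.

    from itertools import combinations, product
    from math import exp, log

    def tri_ergm(n, theta):
        """One-parameter ERGM on n nodes whose sufficient statistic is the
        triangle count (a toy specification which also fails projectivity)."""
        pairs = list(combinations(range(n), 2))
        w = {}
        for g in product((0, 1), repeat=len(pairs)):
            edges = {p for p, on in zip(pairs, g) if on}
            tri = sum(1 for a, b, c in combinations(range(n), 3)
                      if {(a, b), (a, c), (b, c)} <= edges)
            w[g] = exp(theta * tri)
        z = sum(w.values())
        return {g: v / z for g, v in w.items()}

    def project(dist_n, n, m):
        """Marginalize an n-node graph distribution onto the subgraph on nodes 0..m-1."""
        pairs_n = list(combinations(range(n), 2))
        keep = [pairs_n.index(p) for p in combinations(range(m), 2)]
        out = {}
        for g, p in dist_n.items():
            sub = tuple(g[i] for i in keep)
            out[sub] = out.get(sub, 0.0) + p
        return out

    def kl(p, q):
        """Kullback-Leibler divergence D(p || q) over a common finite support."""
        return sum(pk * log(pk / q[k]) for k, pk in p.items() if pk > 0)

    theta, n, m = 0.5, 4, 3
    target = project(tri_ergm(n, theta), n, m)      # pi_{n->m} P_n(.; theta)
    # Crude grid search for the least-false parameter at the smaller size:
    grid = [i / 100 for i in range(-200, 201)]
    best = min(grid, key=lambda t: kl(target, tri_ergm(m, t)))
    print(best)   # typically differs from theta = 0.5, since the family is not projective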

