Attention conservation notice: Over 3000 words of technical commentary on a paper on the statistical analysis of networks. Does a poor job of explaining things to those without background knowledge of networks, statistical inference and Markov chains. Includes some geeky jokes and many equations.
Yesterday in the Statistical Learning for Networks seminar, we discussed the following paper; what follows are a mix of my notes from before and after the discussion.
It's perhaps not completely clear from the abstract that their method works for a particular class of network growth models, which they call "duplication-attachment" (DA) models. These are pure growth models, where the network only expands, and never loses nodes or edges. (The network is assumed to be undirected, without self-loops or multiple edges between a given pair of nodes.) At each time step, we add one node. This is randomly chosen, with fixed probability, to be either a duplication or an attachment. If it's an attachment, the new node attaches to an old one, chosen uniformly over the graph (possibly with some fixed probability < 1). If it's a duplication event, we pick an existing node to duplicate, and the new one gets a certain probability of copying each of its model's links (independently), and a different probability of being linked to its model. It is entirely possible that a new node is added with no links to any existing node. Notice that nodes (and edges) have no intrinsic properties in this model; their names are arbitrary. Any two isomorphic graphs should therefore be assigned the same probability.
(The motivation for such models is that gene duplication is apparently fairly common, at least over evolutionary time, which would duplicate the interactions between genes, or their proteins. Attachment, here, is supposed to summarize all the other processes of network growth. There are several models of the evolution of protein interaction networks, e.g. those of Sole et al., 2002, and Vazquez et al., 2003, which are basically of the duplication-attachment type, and yield networks which at least qualitatively match some features of real-world examples, like degree distributions. These papers are not cited here.)
From any starting graph, it is easy to run the model forward and generate new, larger random graphs; the probabilities involved are all pretty simple uniform and binomial things. In fact, the current state of the graph completely determines the distribution of future graphs, so this is a Markov chain. The transition probabilities are fixed by the duplication and attachment parameters, collectively \( \theta \), and these, together with a distribution over starting graphs, give us a fully-specified stochastic model.
Normally, statistical inference for Markov chains is fairly straightforward, because most of the classical conveniences which make inference for IID data tractable are still available. (This is, after all, what led Markov to his chains!) So why then does the paper not end on the second page, with a citation "See Billingsley (1961)"? Because normally, when we observe a Markov chain, we observe a sequence of values from the chain, and that lets us estimate the transition parameters. Here, however, we have only the end-state, the final graph, as our data. The Markov chain will let us assign likelihoods to paths, but we don't know where we started from, and we don't know where we went from there, just how we ended up here.
Suppose we did know where we started from, some graph of
size \( t_0 \),
\( G_{t_0} \).  (Remember here that each step of the chain adds one node,
so \( t \) really counts nodes, not clock-time.  This is why it's natural to
start at \( t_0 \), as opposed to 0 or 1.  The paper does not seem to
explain this point.)  If we knew our initial state, then in principle we could
figure out the probability of reaching our final state from it, as a sum over
all possible paths:
\[
L(G_t,\theta) \equiv \mathrm{Pr}(G_t|G_0;\theta) = \sum_{G_{t_0 +1}, \ldots G_{t-1}, G_{t} \in \mathcal{G}(t_0,t)}{\prod_{s={t_0+1}}^{t}{\mathrm{Pr}(G_s|G_{s-1};\theta)}}
\]
where \( \mathcal{G}(t_0,t) \) is set of all growing sequences of graphs which
start at \( G_{t_0} \) and end at \( G_t \).  This mathematical expression is a
mouthful, admittedly, but it's probably clearer in a picture.
 
At this point, any physicists in the audience should be nodding their heads; what I've just said is that the likelihood, from a given starting configuration, is a sum over histories or sum over paths. Along with the authors, I'll return to how to evaluate this sum over histories presently, but first we need to figure out how to get that starting configuration.
If we had a known distribution over starting graphs, we could (in principle) just evaluate the likelihood conditional on each starting graph, and then take a weighted sum over graphs. This, however, is not what the authors do. (I'm really not sure where one would find such a distribution, other than another model for graph development. Bayesian practice would suggest picking something which led to easy computations, but this makes a mockery of any pretense to either modeling nature, or to representing incomplete prior knowledge.) Instead, they try to use the known dynamics of the DA model to fix on an unambiguous starting point, and do everything conditional on that.
They observe that you can take any graph, and, for each node, identify the other nodes it could have been copied from. (If A could have been copied directly from B, then A's neighbors must be a subset of B's [ignoring the link between A and B, if any].) So, from any starting graph, you can recursively remove nodes that could have arise through simple duplications. In general, at each stage in this recursion there will be multiple nodes which could be removed, and their choice is arbitrary. Remarkably enough, no matter which choices one makes, the recursion always terminates at the same graph. (More exactly, any two end-points are isomorphic to each other, and so identical for statistical purposes.) The proof is basically a fixed point theorem about a partial order defined on graphs through deletion-of-duplicates, but they confine it to the supplementary materials, so you can take it on trust (and they don't use such lattice-theoretic language even there). This graph --- the data, minus everything that could be pure duplication --- is what they take as their starting point. This is the \( G_{t_0} \) to the data's \( G_t \). Everything is then done conditional on \( G_{t_0} \).
OK, we have our initial condition and our final condition, and we have our Markov chain, so all we've got to do so is evaluate the sum over histories linking the two.
Problem: There are too many paths. In the worst case, the number of paths is going to grow factorially with the number of nodes in the observed graph. Even though along each path we've just got to do some straight-forward multiplication, simply enumerating all the paths and summing over them will take us forever. (The authors discuss some algorithmic tricks for speeding up the exact calculation, but still get something super-exponential!) Thus, evaluating the sum over histories for the likelihood is intractable, even for a single parameter value.
Solution: Don't look at all the paths. Rather, sample some paths, say \( N \) of them, evaluate the likelihood along each, and average. Hope that this converges quickly (in \( N \) ) to the exact value of the sum. This is, after all, how physicists approach many sums over histories.
Problem: Even if \( N \)> is fairly small, we need to examine many settings of the parameter \( \theta \). It could still kill us to have to sample \( N \) distinct paths for each parameter value.
Solution: Use importance sampling. Draw a single path, valid for all parameter values, and evaluate the likelihood in terms of the value of an "importance weight" along this path. The weight has to be a function of \( \theta \), but it should be the only thing which is. We do this here by writing the likelihood, \( L(G_t,\theta) \), as an expectation with respect to a reference measure, which the authors write \( \theta_0 \). This reference measure is given by another Markov chain, called the "driving chain"; despite its name, it is not a member of the DA family of chains. The trick here is that one sample of possible paths, generated according to this chain, can be used to (approximately) evaluate the likelihood at all parameter settings of the DA model.
The crucial equation is [3] on p. 7568 \[ L(\theta,G_t) = \mathbf{E}_{\theta_0}\left[ \prod_{s=t_0}^{t}S(\theta_0,\theta,G_s,\nu)\right] \] where \[ S(\theta_0,\theta,G_t,\nu) = \frac{1}{t} \omega(\theta,G_t,\nu) \frac{\omega(\theta_0,G_t)}{\omega(\theta_0,G_t,\nu)} \] (N.B., the paper writes the second factor as \( \omega(\nu,G_t,\nu) \), but this is wrong.) Let's unpack this a bit.
\( \omega(\theta,G_t,\nu) \) is the probability of producing the graph \( G_t \) through the addition of the node \( \nu \), with parameter setting \( \theta \). (N.B., \( \nu \) must be a "removable" node, one which could have been added by duplication.) \( \omega(\theta,G_t) \) is this transition probability, summed over all possible \( \nu \). The first two factors in S are what we want, the probability we'd get moving forward along the path according to the parameter \( \theta \). The third term is the reciprocal of the transition probabilities according to the driving chain. Its only job is to cancel those probabilities out.
The algorithm for generating the ith sample path is then as follows. Start with the observed graph \( G_t \). Count backwards, \( s = t, t-1, \ldots t_0+1 \) Pick a node \( \nu^{(i)}_s \) to delete, with probability proportional to \( \omega(\theta_0, G^{(i)}_s, \nu^{(i)}_s) \). (Once again, this limits us to the "removable" nodes, the ones which could have been produced by duplication.) Set \( G^{(i)}_{s-1} \) to be the result of deleting that node. Keep going back until we hit the irreducible core, \( G_{t_0} \). (We will always hit this core, by the fixed point theorem proved in the supplementary results.) Then \[ \begin{eqnarray*} l^{(i)}(\theta) & = & \prod_{s=t_0}^{t}{S(\theta_0,\theta,G^{(i)}_{s},\nu^{(i)}_s)}\\ \hat{L}(\theta,G_t) & = & \frac{1}{N}\sum_{i=1}^{N}{l^{(i)}(\theta)} \end{eqnarray*} \] To ease the calculation of this for multiple parameter settings, it may be worth noting (though the authors do not) that \[ l^{(i)}(\theta) = \left(\frac{(t_0-1)!}{t!}\right)\left(\prod_{s=t_0}^{t}{\omega(\theta,G^{(i)}_s,\nu^{(i)}_s)} \right)\left(\prod_{s=t_0}^{t}{\frac{\omega(\theta_0,G^{(i)}_s)}{\omega(\theta_0,G^{(i)}_s,\nu^{(i)}_s)}}\right) \] and the middle factor is the only one which depends on \( \theta \).
So, to summarize: We can generate a sample of paths connecting the observed final graph to the unobserved initial graph, according to the driving chain, and then approximate the likelihood for any parameter value by multiplying the importance weights along those paths and summing over paths. (The importance weights themselves even factor nicely.) We have thus solved the problem of evaluating a sum over histories, when we've got only one end of too many possible paths.
The trick used here to pull this off depended on having a uniquely-defined starting point for all parameter settings, namely the \( G_{t_0} \) defined through undoing duplications. (According to the authors, they took this from papers on the coalescent process in population genetics, but I have not been able to track down their references.) Strictly speaking, everything is conditional on that starting point. Of this, more below.
Left unaddressed by the above is the question of how many paths we need to sample. Remember, the whole point is to not have to look at every possible path! If it turns out that accurate approximations to the likelihood require us to sample some substantial fraction, then this is all for nothing. However, the authors' figures reveal something quite remarkable. Whether \( N \) is 10 or 1000, the approximate likelihood changes very little (at least on a log scale), even with real data. This suggests that we don't, actually, need a lot of paths, but why?
For each i, \( l^{(i)}(\theta) \) is an independent realization of a random variable, whose distribution depends only \( \theta \) (holding fixed the driving chain, and the initial and final graphs). Since they are IID, we can apply the central limit theorem, which tells us that their mean should converge at rate \( 1/\sqrt{N} \). Since that's noticeably smaller for N=1000 than for N=10, it must be the case that the variance of the likelihood along the individual sample paths is already pretty small. Why?
The lame physics answer is, "the principle of least action". There will be optimal, most-probable paths, and they will dominate the sum, the others tending to make negligible contributions. With high probability, a random sample will pick out the most probable paths. Q.E.D. This argument could, perhaps, be made less lame through an application of large deviations results, specifically conditional-limit-theorem- (or "Gibbs's conditioning principle"-) type results for Markov chains, which roughly say that if something improbable (a passage from \( G_{t_0} \) to \( G_t \)) happens, it does so in the least-improbable possible way, and deviations from that trajectory are exponentially suppressed. <televangelist>In the name of Cramér, in the name of Varadhan, in the name of Freidlin and Wentzell, I call on you, Brother Argument, to be healed! Arise and prove! And, cf. Eyink (2000).</televangelist>
An information-theoretic answer is to invoke the asymptotic equipartition principle, a.k.a. the Shannon-McMillan-Breiman theorem. This says that if \( X_1, X_2, \ldots X_t \) are generated according to a (well-behaved) stochastic process, whose distribution is \( \mu \), and \( \theta \) is a sufficiently well-behaved model, then \[ -\frac{1}{t}\log{L(\theta,X_1^t)} \rightarrow h(\mu) + d(\mu,\theta), ~\mu-a.s. \] where \( h(\mu) \) is the entropy rate of the data-generating process \( \mu \), and \( d(\mu,\theta) \) is the relative entropy rate between the data source and the model, i.e., the asymptotic growth rate of the Kullback-Leibler divergence. (For details, see Algoet and Cover, 1988, or Gray, 1990). So \[ \log{L(\theta)} = - (h+d(\theta))t + o(t), ~\mu-a.s. \] In words, there are only two kinds of sample long sample paths: those which aren't generated at all, and those where the normalized log-likelihood are equal, at least to first order. It's not clear to me here whether \( t \) is big enough in the examples for this effect to kick in.
The biggest unclarity of all, probably, is the role of \( G_{t_0} \). Recall that we reached this by removing nodes which could have been added by pure duplication. There is, however, no particular reason to think that the actual growth of the graph ever passed through this state. It has the advantage of giving us a unique starting point for the chain, but there are, potentially, others. One, of course, is the trivial network consisting of a single node! Another possibility (which came up in the discussion, I think mostly due to Anna Goldenberg) is to first remove potential duplicates, as the authors do, and then remove nodes which have only one link to them, as clear attachments. This process of unwinding the attachments could potentially be iterated, until no "danglers" are left. This, too, is a uniquely-defined point. We can then go back to removing nodes which are, now, potential duplicates, and so on. Someone (I forget who) suggested that this might always terminate at the one-node network; it would be nice to either show this or give a counter-example. But if there is some principled reason, other than tractability, to use their \( G_{t_0} \), I can't figure out what it is from this paper.
Only using a growing network, and in particular only focusing on growth through duplication, certainly simplifies the computational problem, by reducing the number of possible paths which could terminate in the observed graph. Deletion of nodes and edges is however going to be very important in more biologically-plausible models, to say nothing of models of social or technological networks. Presumably the trick of using a backward-looking chain which stops when it hits a unique starting configuration could still be used with deletions --- I think the authors are hinting as much in their conclusion --- but it's not clear to me that a unique starting point is appropriate. With biological interaction networks, for example, one might argue that, e.g., metazoans have been around for a long time, so the distribution of networks ought to be close to stationary, and so starting configurations should be drawn from an invariant distribution of the appropriate chain...
This raises two further points, which are not un-related: the asymptotics of the DA model, and the biological utility of such a model. Run for a long time, the DA model will produce graphs of unbounded size, but it's not immediately obvious what these graphs will look like. In particular, what will be their degree distribution? The Barabasi-Albert model (Albert and Barabasi, 2002) produces scale-free distributions, <boosterism>because it uses the same mechanism as Herbert Simon's classic paper</boosterism> (Bornholdt and Ebel, 2001). This relies on a rich-get-richer dynamic, where nodes with high degree are more likely to attract new edges. My initial thought was that this wasn't present in the DA model, because targets for attachment and targets for duplication are both chosen uniformly. However, someone in the discussion --- I think, though I may be mis-remembering, that it was Tanzy Love --- pointed out that while high-degree nodes are no more likely to be copied than low-degree nodes, edges into high-degree nodes are more likely to be copied than edges into low-degree nodes. This is because if a node has degree k, there are k other nodes whose duplicated could end up linking to it. It may even be the case that this is falls under the theorems in Simon... Presumably the asymptotics would only become harder to handle if we added events deleting nodes or edges.
As for the biological utility, I'll repeat that none of the nodes have any identity of their own; only their role in the network of relations represented by the edges has any bearing on the model. "If this be structuralism, make the most of it": by turning it into a neutral model for the evolution of biological networks. After all, there is no reason here to duplicate certain nodes or edges, it's all just uniform chance. One key use of neutral models is to provide a background against which to detect adaptation; how could we do that here?
Update, 18 January 2019: Fixed some typos, replaced potentially-misleading use of "path integral" with "sum over histories"; left in even the worst jokes as a reminder-to-self.
Posted at November 29, 2006 02:30 | permanent link