March 20, 2007

Eigenfactor (Why Oh Why Can't We Have a Better Academic Publishing System? Dept.)

There is an old Soviet-era joke about a nail factory which was assigned a target, under the five year plan, of 1600 tons of nails, and spent the whole five years producing a single gargantuan nail weighing (of course) 1600 tons. The joke illustrates not only the follies of "actually existing socialism", but a broader problem with using quantitative performance targets, namely that people will tend to adjust their efforts to meet the quantitative criteria, which can be only very poorly related to the real job they are supposed to be doing. This is not to say that objective performance criteria are always bad, because often the alternative is subjective evaluations by superiors, i.e., prejudice and caprice; but it does point to the need to carefully design those criteria, so that, as far as possible, they track what you actually want to have happen, and not just what's easy to measure or to calculate.

One place where easy calculation threatens to overwhelm substantive validity is in "bibliometrics", or the use of numerical methods to study patterns of scientific publication. For many years now, scientific journals have been advertising their "impact factor", as determined by ISI/Thomson Scientific, which is roughly the number of citations (as tracked by ISI/Thomson) to that journal, divided by the number of papers published in the journal. The idea is that journals with high impact factors are ones which publish articles people take note of, and go on to cite. Now, leaving to one side the big gap between "is cited a lot" and "is good science", there are huge, glaring holes in this as a way of measuring the quality or influence of a journal. An obvious one is that a citation from the World Journal of Cartesian Snooker and Even More Obscure Problems means much less than one from Nature. But another problem, perhaps even larger, is that different fields have different patterns of citation.

A stereotypical math paper, for example, will use a huge number of previously existing results, but contain very few citations, on the presumption that most of those results are assimilated background which its readers have already absorbed from any number of standard sources. If I write a paper on stochastic processes, I might well use the ergodic theorem for Markov chains, which says (roughly) that there is a way of assigning probabilities to states which is invariant under the chain's dynamics, and moreover the fraction of time any sufficiently long trajectory spends in any one state is equal to that state's probability. This is a result with a very intricate history, going back to Markov himself in his struggles with his arch-enemy, but I'd look ridiculous if I cited any of this history, or even a textbook like Grimmett and Stirzaker. On the other hand, sociologists have a reputation for providing as many citations as possible for absolutely everything, and a pious habit of referring back to the 19th and early 20th century Masters. A leading sociology journal, then (say, American Journal of Sociology) might have an impact factor of around 5, while a leading mathematics journal (say, Annals of Probability) would have one significantly lower, even though both are near the top of their respective prestige hierarchies.

Now, you could say this is just another reason why we shouldn't try to rank journals. But there are times when doing things like this is going to be very helpful, e.g. when trying to decide which journals to spend a limited subscription budget on. So it would be nice if there were a way of doing something like this, which corrected for problems like the differences in citation customs across academic tribes.

One way to imagine doing this is as follows. Pick a completely random journal, and a random article from that journal. Now pick one of its references, again completely at random, and follow it up. Repeat this process by following a random reference in that paper, until you come to a dead end, namely a citation to something outside of your data set. Pick another random starting point and repeat, many times. Looking back over your random walks through the scientific literature, how much time did you spend in any given journal? It's not hard to convince yourself that you will spend more time in journals whose papers are highly cited by papers in other journals which are themselves highly cited. If you come to a paper with many references, you are that much less likely to follow any one of them, and so you will spend less time, all else being equal, on each of the papers it cites than you will on the references of papers which are more sparing of citation. Saying "influential journals are ones which are often cited by influential journals" makes the definition sound hopelessly circular, but the random walk procedure makes it clear that it's not, or at least not hopelessly so.
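
To make the walk concrete, here is a minimal sketch in Python of the procedure just described. Everything in it is made up for illustration: the five-paper data set, the names PAPERS and journal_visit_shares, and the max_steps safety cap are my assumptions, not anything taken from a real citation database or from anyone's actual implementation.

```python
import random
from collections import Counter

# Hypothetical toy data set: paper -> (journal, references).
# A reference to anything not listed here ("outside") is a dead end.
PAPERS = {
    "a1": ("Journal A", ["b1", "b2"]),
    "a2": ("Journal A", ["b1", "outside"]),
    "b1": ("Journal B", ["c1"]),
    "b2": ("Journal B", ["b1", "c1"]),
    "c1": ("Journal C", ["outside"]),
}


def journal_visit_shares(papers, n_walks=20000, max_steps=50, seed=0):
    """Estimate the share of walk time spent in each journal."""
    rng = random.Random(seed)

    # Group papers by journal so a walk can start from a random journal.
    by_journal = {}
    for paper, (journal, _refs) in papers.items():
        by_journal.setdefault(journal, []).append(paper)
    journals = sorted(by_journal)

    visits = Counter()
    for _walk in range(n_walks):
        # Random journal first, then a random article from that journal.
        paper = rng.choice(by_journal[rng.choice(journals)])
        # max_steps is only a safety cap (an assumption of this sketch).
        for _step in range(max_steps):
            journal, refs = papers[paper]
            visits[journal] += 1
            if not refs:
                break
            paper = rng.choice(refs)  # follow one reference at random
            if paper not in papers:
                break  # dead end: the citation leaves the data set

    total = sum(visits.values())
    return {j: visits[j] / total for j in journals}


shares = journal_visit_shares(PAPERS)
for journal, share in sorted(shares.items()):
    print(f"{journal}: {share:.3f}")
```

In this toy example nothing cites Journal A, so walkers are only ever there when they happen to start there, while Journal B also collects the walkers arriving from A.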

It turns out that the random walk scheme is computationally very demanding (you need a lot of random walkers, taking a lot of very long walks, to get good results), but there is a short cut. The random process I've described is a well-behaved Markov chain. The ergodic theorem now tells us that a time average (how often does the walk hit a given journal?) can be replaced with a "space" average (what is the probability of being at a given journal?), where the probability weights are left unchanged by the action of the Markov chain. Finding this invariant distribution is an exercise in linear algebra; specifically, it is the leading eigenvector of the chain's transition matrix. (One of the beauties of the theory of Markov processes is how it lets us replace nasty nonlinear problems about individual trajectories with clean linear problems about probabilities.) And there are very nice, very fast algorithms for finding eigenvectors, even of very large matrices.
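
As a rough illustration of the short cut, the sketch below finds the invariant distribution of a tiny journal-level chain by power iteration, which is one simple way (among many) to get a leading eigenvector. The three-journal citation counts and the names CITES, normalize_rows and stationary_distribution are hypothetical, chosen so that the chain is well-behaved; this is not a description of how any real ranking is computed.

```python
def normalize_rows(counts):
    """Turn raw citation counts into transition probabilities, row by row."""
    return [[c / sum(row) for c in row] for row in counts]


def stationary_distribution(P, tol=1e-12, max_iter=10000):
    """Power iteration for the distribution pi satisfying pi = pi P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the chain applied to the whole distribution at once.
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical counts: CITES[i][j] = references from journal i to journal j.
CITES = [
    [0, 8, 2],
    [3, 0, 7],
    [1, 9, 0],
]

P = normalize_rows(CITES)
pi = stationary_distribution(P)
print([round(p, 4) for p in pi])
```

Each pass through the loop is a single matrix-vector product, which is why this scales to very large (and sparse) citation matrices far better than simulating individual walkers.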

Thus the reasoning behind Eigenfactor, the latest brainstorm from Carl Bergstrom's lab, with most of the actual code and elbow-grease being provided by Jevin West and Ben Althouse. It covers all the journals that impact factor would, but also gives an estimate of the impact of citations to non-journals (which lets us see that some software is more influential than some journals). Plus you get to see all kinds of useful things about how much the journals cost (something Carl's been interested in for some time), and how that breaks down by paper or by citation. All in all, it's a very fun and potentially very useful tool for anyone interested in the academic publishing system, and/or applications of Markov chains.

Disclaimer: Rumors that Carl arranged for me to publicize everything his lab does in this weblog in exchange for beers from his private collection whenever I'm in Seattle are — sadly exaggerated.

(Thanks to Owen "Vlorbik" Thomas for typo correction.)

Networks; Learned Folly; Enigmas of Chance; Incestuous Amplification

Posted at March 20, 2007 21:08

Three-Toed Sloth