## April 25, 2022

### Intermittent Finds in Complex Systems and Stuff, No. 2

Attention conservation notice: Links to forbiddingly-technical scientific papers and lecture notes, about obscure corners of academia you don't care about, and whose only connecting logic is having come to the attention of someone with all the discernment and taste of a magpie (who's been taught elementary probability theory).
Or whatever the heck it is I study these days. (I did promise that this series would be intermittent.) In no particular order.
Modibo K. Camara, "Computationally Tractable Choice" [PDF]
I'll quote the abstract in full:
I incorporate computational constraints into decision theory in order to capture how cognitive limitations affect behavior. I impose an axiom of computational tractability that only rules out behaviors that are thought to be fundamentally hard. I use this framework to better understand common behavioral heuristics: if choices are tractable and consistent with the expected utility axioms, then they are observationally equivalent to forms of choice bracketing. Then I show that a computationally-constrained decisionmaker can be objectively better off if she is willing to use heuristics that would not appear rational to an outside observer.
If you like seeing SATISFIABILITY reduced to decision-theoretic optimization problems, this is the paper for you. I enjoyed this partly out of technical interest, and partly to see Simon and Lindblom's heuristic arguments from the 1950s rigorously validated.
One last remark: the slippage of "rationality" in the last sentence of the abstract is fascinating. We started by wanting to define "rational behavior" as being about effectively adapting means to ends; we had an intuition, inherited from 18th century philosophy, that calculating the expectation values in terms of rat orgasm equivalents would be a good way to adapt means to ends; we re-defined "rational behavior" as "acting as though one were calculating and then maximizing an expected number of rat orgasm equivalents"; now it turns out that that is provably an inferior way of adapting means to ends, and we have to worry about what it says about rationality. There's something very wrong with this picture! §
(Thanks to Suresh Naidu for sharing this paper with me.)
Carlos Fernández-Loría and Foster Provost, "Causal Decision Making and Causal Effect Estimation Are Not the Same... and Why It Matters", arxiv:2104.04103
To make an (admirably simple) argument even simpler: Think of decision-making as a classification problem, rather than an estimation problem. If your classifier mis-estimates $\mathbb{P}\left( Y|X=x \right)$, but the estimate is nonetheless on the correct side of 1/2 (or whatever your optimal boundary might be), it doesn't matter for classification accuracy! So if you over-estimate the benefits of treatment for those you decide to treat, well, you're still treating them...
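The point is easy to check numerically. Here is a minimal sketch (my toy example, not anything from the paper): an estimator that is badly biased, but always biased toward the same side of the decision boundary, yields exactly the same decisions as the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: true conditional probabilities p(x) = P(Y=1|X=x) for 1000
# units, and a deliberately biased estimate that pushes each probability
# 0.2 further from 1/2, always staying on the same side of the boundary.
p_true = rng.uniform(0.0, 1.0, size=1000)
p_hat = np.clip(p_true + np.where(p_true > 0.5, 0.2, -0.2), 0.0, 1.0)

# The estimation error is substantial...
print("mean estimation error:", np.mean(np.abs(p_hat - p_true)))

# ...but the treat/don't-treat decisions are identical, so decision
# quality is untouched by the estimation error.
decide_true = p_true > 0.5
decide_hat = p_hat > 0.5
print("decisions agree:", bool(np.all(decide_true == decide_hat)))
```

Of course the argument cuts the other way when the bias straddles the boundary; the point is just that estimation loss and decision loss are different objectives.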
Ira Globus-Harris, Michael Kearns, Aaron Roth, "Beyond the Frontier: Fairness Without Privacy Loss", arxiv:2201.10408
My comments got long enough to go elsewhere.
Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", arxiv:2110.01584
I was very excited to read this --- look at the authors! --- and it did not disappoint. It's a lovely paper which both makes a lot of sense at the conceptual level and gives decent, calculable bounds for realistic situations. I'd love to teach this in my learning-theory class, even though I'd have to cut other stuff to make room for the information-theoretic background.
Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler, "Memorization in Overparameterized Autoencoders", arxiv:1810.10333
I was blown away when Uhler demonstrated some of the results in a talk here, and the paper did not disappoint.
Mikhail Belkin, "Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation", arxiv:2105.14368
Further to the theme.
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel, "Extracting Training Data from Large Language Models", arxiv:2012.07805
Demonstrates that from GPT-2 they can extract "(public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs", even though "each of the above sequences are included in just one document in the training data".
• I don't understand why they compare zlib entropy to language-model perplexity, when entropy density is basically log(perplexity). This probably wouldn't make a big difference to any results but it bugged me.
• This has to be connected to Radhakrishnan et al., right?
• I'd really like to see someone throw this many parameters, and this much data, at something like Pereira, Singer and Tishby 1996 and see how it does in comparison, both in terms of the usual performance metrics and memorizing random (and inappropriate) bits of the training data. (Pereira may be in a position to do the experiment!)
• Some people will, of course, interpret this as evidence that GPT-2 knows who you are, and so is that much closer to basilisk-dom: judging the quick and the dead, and being amenable to bargaining under the canons of timeless decision theory.
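On the first bullet, the interchangeability of the two scales is easy to illustrate (my toy calculation, not the paper's): a zlib compression ratio gives a crude entropy-density estimate in bits per byte, and two raised to that power is the corresponding perplexity, so log(perplexity) and entropy density are the same quantity up to the base of the logarithm.

```python
import random
import zlib

random.seed(0)

def bits_per_byte(s: bytes) -> float:
    # compressed length per input byte, as a crude entropy-density proxy
    return 8 * len(zlib.compress(s, 9)) / len(s)

low = b"abab" * 2500                                       # very predictable
high = bytes(random.getrandbits(8) for _ in range(10000))  # incompressible

for name, s in [("repetitive", low), ("random", high)]:
    h = bits_per_byte(s)
    # 2**h is the perplexity implied by an entropy density of h bits/symbol
    print(name, round(h, 3), "bits/byte; implied perplexity", round(2 ** h, 2))
```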
Gabriel Rossman and Jacob C. Fisher, "Network hubs cease to be influential in the presence of low levels of advertising", Proceedings of the National Academy of Sciences 118 (2021): e2013391118
In a pure social-contagion/diffusion-of-innovations process, the contagion/innovation will spread farther, and spread faster, if it begins at one of the most central nodes in the network than if it begins at a randomly chosen node, let alone a deliberately-peripheral one. This motivates a lot of effort in applications to search for influential figures and target them. What Rossman and Fisher do is extend the model very modestly, to include "advertising", i.e., a probability for nodes to contract the contagion / adopt the innovation spontaneously, without direct contact with an infected / adopter node. What they show is that even a very small amount of advertising massively reduces the advantage of beginning at a central node. It's a very convincing, lovely, and potentially-applicable result. I also strongly suspect there's a genuine phase transition here, with the transition point moving towards zero external field as the size of the network goes to infinity, but I haven't been able to show that (yet). --- Many thanks to Prof. Rossman for presenting this paper to CMU's Networkshop.
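The mechanism can be sketched in a few dozen lines (my toy model, not the authors' exact specification): an SI-style contagion on a preferential-attachment network, with an "advertising" rate of spontaneous adoption, seeded either at the highest-degree hub or at a peripheral node.

```python
import random

random.seed(42)

# Toy preferential-attachment network: each new node links to two
# existing nodes chosen proportionally to degree.
n = 500
edges = {0: {1}, 1: {0}}
stubs = [0, 1]  # degree-weighted pool for preferential attachment
for v in range(2, n):
    edges[v] = set()
    for u in set(random.choices(stubs, k=2)):
        edges[v].add(u)
        edges[u].add(v)
        stubs += [u, v]

def spread(seed_node, p_contact=0.3, p_advert=0.0, steps=12):
    # SI diffusion: transmission along edges, plus spontaneous
    # ("advertising") adoption at rate p_advert per node per step
    adopted = {seed_node}
    for _ in range(steps):
        new = set()
        for v in adopted:
            for u in edges[v]:
                if u not in adopted and random.random() < p_contact:
                    new.add(u)
        for v in range(n):
            if v not in adopted and random.random() < p_advert:
                new.add(v)
        adopted |= new
    return len(adopted)

hub = max(edges, key=lambda v: len(edges[v]))
periph = min(edges, key=lambda v: len(edges[v]))

results = {}
for adv in (0.0, 0.05):
    h = sum(spread(hub, p_advert=adv) for _ in range(30)) / 30
    p = sum(spread(periph, p_advert=adv) for _ in range(30)) / 30
    results[adv] = (h, p)
    print(f"advert={adv}: hub seed -> {h:.0f}, peripheral seed -> {p:.0f}")
```

In runs of this sketch, a small advertising rate narrows the hub's head start, in the qualitative spirit of the paper's result; the paper itself does the comparison carefully.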
Yuan Zhang, Dong Xia, "Edgeworth expansions for network moments", arxiv:2004.06615
This is technical, but valuable for all of us interested in being able to quantify uncertainty in network data analysis, especially those of us working in the graph-limits/graphons/conditionally-independent-dyads framework. --- Thanks to Prof. Zhang for a very enjoyable conversation about this paper during a "visit" to Ohio State via Zoom.
David Childers, "Forecasting for Economics and Business"
Great materials for an undergraduate economics course (73-423) at CMU. Thanks to David for the pointer.
Vera Melinda Galfi, Valerio Lucarini, Francesco Ragone, Jeroen Wouters, "Applications of large deviation theory in geophysical fluid dynamics and climate science", La Rivista del Nuovo Cimento 44 (2021): 291--363, arxiv:2106.13546
The laws of large numbers say that, on large enough scales, random systems converge on their expected values. ("Large scales" here might indeed be number of samples, or length of time series, or something similar.) In symbols which you should not take too literally here, as $n \rightarrow \infty$, $\mathbb{P} \left( |A_n - a_{\infty}| > \epsilon \right) \rightarrow 0$ for every $\epsilon > 0$, where $a_{\infty}$ is the limiting behavior of the process. Large deviations theory is about fluctuations away from the expected behavior, and specifically about finding rate functions $r$ such that $\mathbb{P} \left( |A_n - a_{\infty}| \geq \epsilon \right) \sim \exp{\left( -n r(\epsilon)\right) }$. This is a "large" deviation because the size $\epsilon$ is staying the same as $n$ grows. We'd anticipate seeing this kind of behavior if $A_n$ was the result of some number $\propto n$ of independent random variables, all of which had to cooperate in order to produce that $\epsilon$-sized fluctuation. More specifically, a good point-wise rate function will let us say that $\frac{1}{n}\log{\mathbb{P}\left( A_n \in B \right) } \rightarrow - \inf_{x \in B}{I(x)}$ so that, as the saying goes, an unlikely large deviation is overwhelmingly (exponentially) likely to happen in the least unlikely possible way. Large deviations theory gives us lots of tools for calculating rate functions, and so saying how unlikely various large deviations are (at least to within asymptotic log factors), and for characterizing those least-unlikely paths to improbable events. (I am glossing over all kinds of lovely mathematical details, but follow some links.)
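The scaling is easy to see in the simplest possible case (a toy instance of my own, nothing from the review): for $A_n$ the mean of $n$ fair coin flips, Cramér's theorem gives $-\frac{1}{n}\log{\mathbb{P}\left( A_n \geq a \right)} \rightarrow I(a) = a\log{2a} + (1-a)\log{2(1-a)}$, the KL divergence of Bernoulli($a$) from Bernoulli(1/2).

```python
import numpy as np

rng = np.random.default_rng(1)

def rate(a, p=0.5):
    # Cramer rate function for Bernoulli(p) sample means:
    # the KL divergence of Bernoulli(a) from Bernoulli(p)
    return a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))

def empirical_rate(n, a, trials=200_000):
    # estimate -(1/n) log P(A_n >= a) by brute-force simulation
    means = rng.binomial(n, 0.5, size=trials) / n
    return -np.log(np.mean(means >= a)) / n

a = 0.7
for n in (20, 40, 80):
    print(n, round(empirical_rate(n, a), 4), "-> I(a) =", round(rate(a), 4))
```

The empirical exponents decrease toward $I(0.7) \approx 0.082$ from above; the slow approach is the "asymptotic log factors" caveat in action.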
Now climate systems contain a lot of random variables, which are mostly tightly dependent on each other but not completely so. And a lot of what we should worry about with climate comes from large fluctuations away from typical behavior. (E.g., transitions from one meta-stable state of the climate, where, say, there is a Gulf Stream in the North Atlantic keeping western Europe warmer than Labrador or Kamchatka, to another meta-stable state where there is not.) So climate modeling is actually a very natural application for large deviations theory. This is a well-written review paper surveying those applications, with a minimum of mathematical apparatus. (The implied reader does, however, remember fluid mechanics and thermodynamics.) It makes me want to learn more about rare-event simulation techniques. §

Posted at April 25, 2022 10:41 | permanent link