December 02, 2013

Speed, Complexity and Interfacing with Other Systems (Introduction to Statistical Computing)

(My notes from this lecture are too fragmentary to type up; here's the sketch)

Programmer time is (usually) much more valuable than computer time; therefore, "premature optimization is the root of all evil". That said, computer time isn't free, and if you have to wait around for the machine to finish, computer time is programmer time.

Vectorize

In R, vectorization can lead to great speed-ups. The reason has to do with how R changes existing objects: it makes a new instance of the object, copies over the unchanged values and fills in the changed ones, and then deletes the old instance. This makes repeated small changes to large objects very inefficient. (On the other hand, incrementally updating a handful of parameters in optimization isn't that bad.) Changing the size of the object is even more inefficient. Vectorization avoids all this allocation, copying, and de-allocation.
very.stupid.adder <- function(x,y) {
  z <- c() # start from an empty vector
  for (i in 1:length(x)) {
    z <- c(z,x[i]+y[i]) # grow z by one element, copying everything so far
  }
  return(z)
}

stupid.adder <- function(x,y) {
  z <- vector(length=length(x)) # pre-allocate z to its final size
  for (i in 1:length(x)) {
    z[i] <- x[i] + y[i] # still looping element by element
  }
  return(z)
}
> a <- rnorm(1e5)
> b <- rnorm(1e5)
> 1e6*system.time(very.stupid.adder(a,b)) # time in microseconds
    user   system  elapsed 
20266000  3224000 23409000 
> 1e6*system.time(stupid.adder(a,b))
   user  system elapsed 
 204000    1000  204000 
> 1e6*system.time(a+b)
   user  system elapsed 
      0       0       0 
(R's copy-on-change policy isn't a pointless perversity; knowing that objects don't "change in place" makes some memory management tasks much simpler and more efficient.)

One way to speed up your code, then, is to think carefully about how to vectorize it. Again, the major advantage of vectorization is clarity, but orders-of-magnitude speed-ups aren't bad either.
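As a small, made-up illustration of the kind of rewriting involved: a loop with a conditional inside it can usually be replaced by ifelse() or logical indexing.

looped.threshold <- function(x, cutoff) {
  y <- vector(length=length(x))
  for (i in 1:length(x)) {
    if (x[i] > cutoff) { y[i] <- x[i] } else { y[i] <- 0 }
  }
  return(y)
}

vectorized.threshold <- function(x, cutoff) {
  ifelse(x > cutoff, x, 0) # one vectorized pass, no explicit loop
}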

Cultivate Slop and Arbitrary Ignorance

We can often speed things up by accepting a more approximate answer in place of a more exact one. For statistical problems, in particular, there is no point in making our answers more precise than the data will justify. We've emphasized this for optimization, but it applies to lots of numerical computations as well.

(Here, it is useful to understand the trade-off between computing effort and approximation. If getting a precision of \( \pm \epsilon \) takes \( O(\log{1/\epsilon}) \) operations, then doubling the \( \epsilon \) we tolerate shaves only a constant amount of time off the calculation --- we'd have to accept exponentially more error to get real speed-ups. [Optimistically, if we can do the calculation at all, we should be able to do it precisely.] If however the scaling is \( O(1/\epsilon^2) \), doubling \( \epsilon \) would reduce the time by a factor of four.)
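To make the second scaling concrete with a made-up example: the standard error of a Monte Carlo average falls like \( 1/\sqrt{n} \), so hitting a tolerance of \( \pm \epsilon \) needs \( O(1/\epsilon^2) \) samples, and accepting twice the error lets us get away with a quarter of the work.

# average absolute error of the sample mean of n standard normals
mc.error <- function(n, reps=200) {
  mean(abs(replicate(reps, mean(rnorm(n)))))
}
mc.error(1e3) # roughly 0.025
mc.error(4e3) # roughly half that, for four times the work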

Computational problems also generally slow down as the data size increases. For many statistical problems, we can actually make use of this, by sampling the data and only running our procedure on a sample. This introduces some approximation error, compared to running on the full data, but this trade-off can be well worth it. Moreover, we can cross-check different randomized subsets of the data, if we're worried about being misled by small parts of it.
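A quick sketch of the idea, with made-up data and a plain linear model standing in for whatever the real procedure is:

n <- 1e6
df <- data.frame(x=rnorm(n))
df$y <- 2 + 3*df$x + rnorm(n)
full.fit <- lm(y ~ x, data=df)                 # uses all 10^6 rows
sub.fit <- lm(y ~ x, data=df[sample(n,1e4),])  # uses a 1% sample, much faster
coefficients(full.fit)
coefficients(sub.fit) # typically agrees with the full fit to a couple of decimal places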

If you can't sample, try to reduce how often you have to go over the data.
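For instance (a sketch, assuming a hypothetical file with one number per line that is too big to want to re-read), accumulate everything you need in a single streaming pass, rather than making one pass per summary statistic:

con <- file("huge-data-file.txt", open="r")
n <- 0; s <- 0; s2 <- 0
repeat {
  chunk <- readLines(con, n=1e5)
  if (length(chunk) == 0) break
  x <- as.numeric(chunk)
  n <- n + length(x); s <- s + sum(x); s2 <- s2 + sum(x^2)
}
close(con)
c(mean = s/n, var = s2/n - (s/n)^2) # (population) mean and variance from the accumulated sums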

There is also a whole, very important, field of computer science concerned with developing randomized approximation algorithms (see e.g. Mahoney on randomized matrix algorithms).

Parallelize

If you can break your problem up into independent chunks, you can farm each chunk out to a separate processor, and do them in parallel. (This presumes you have a lot of processors.) The split/apply/combine strategy is designed to work with this. It may be possible to do this for one stage in your problem, followed by others which don't divide neatly; you then alternate between embarrassingly-parallel and centralized processing. There are also problems which lend themselves to parallel processing with communication between the processors; these tend to be ones which are somehow "local" in space.

I am vague here, because the technology for parallelizing data processing is advancing rapidly.
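Still, base R's parallel package already gives the flavor of split/apply/combine. Here is a minimal sketch; the bootstrap task is just a stand-in for your own independent chunks, and mclapply forks processes, so it works on Unix-type systems:

library(parallel)
x <- rnorm(1e4)
boot.median <- function(i, x) { median(sample(x, length(x), replace=TRUE)) }
# split: 1000 independent resamples; apply: one per task, spread over 4 cores;
# combine: unlist the results into a single vector
boot.dist <- unlist(mclapply(1:1000, boot.median, x=x, mc.cores=4))
sd(boot.dist) # bootstrap standard error of the median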

Exploit Others' Work

Don't re-invent the wheel: look for well-executed software which does a part of your job and use it. (Don't write your own linear algebra package, unless you really know what you are doing, and have a good reason not to just improve existing packages.) Modularity and abstraction make it easier to use others' work, and to replace one external library with another, better one.

Give Up on R

The Unix shell has highly optimized tools for particular data-manipulation jobs (Bradnam and Koch just scratch the surface). Cultivating their acquaintance is a good use of your time. Remember that system() lets you run arbitrary shell commands from R, and capture their output.
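For instance (file name made up), counting the lines of a huge file with wc is far faster than reading the whole thing into R:

lines <- system("wc -l < big-log-file.txt", intern=TRUE) # intern=TRUE captures the output
as.numeric(lines)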

At some point, speed (and maybe memory) efficiency points to a more streamlined, compiled language for specific parts of your work. The natural choices here are C and (still) Fortran; see Writing R Extensions. You could write all your software for data analysis in C, but then you would spend all your time tracking down memory-allocation bugs.
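As a sketch of what this looks like (the file and function names here are invented, and in practice you would keep the C code in its own source file), the .C interface described in Writing R Extensions lets R call a compiled C routine:

c.source <- "
void c_add(double *x, double *y, int *n, double *z) {
  for (int i = 0; i < *n; i++) z[i] = x[i] + y[i];
}
"
writeLines(c.source, "adder.c")
system("R CMD SHLIB adder.c") # compile to a shared library
dyn.load(paste0("adder", .Platform$dynlib.ext))
c.adder <- function(x, y) {
  .C("c_add", as.double(x), as.double(y), as.integer(length(x)),
     z=double(length(x)))$z
}
all.equal(c.adder(a, b), a + b) # should be TRUE, and fast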

Actual Complexity Analysis

Computational complexity: resources (time [in operations, not seconds], space [in bits, not liters], communication, samples) needed as a function of problem size.

Merge-sort, for example: split the list in half, sort each half recursively, and merge the sorted halves. The merging is \( O(n) \) and there are \( O(\log{n}) \) levels of splitting, so sorting takes \( O(n\log{n}) \) comparisons rather than the \( O(n^2) \) of naive methods.
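A bare-bones recursive version (my own toy implementation, not tuned for R) makes the split/sort/merge structure visible:

merge.sort <- function(x) {
  n <- length(x)
  if (n <= 1) { return(x) }
  left <- merge.sort(x[1:floor(n/2)])
  right <- merge.sort(x[(floor(n/2)+1):n])
  merged <- numeric(n) # merge the two sorted halves in one O(n) pass
  i <- j <- 1
  for (k in 1:n) {
    if (j > length(right) || (i <= length(left) && left[i] <= right[j])) {
      merged[k] <- left[i]; i <- i+1
    } else {
      merged[k] <- right[j]; j <- j+1
    }
  }
  return(merged)
}
x <- runif(100); all(merge.sort(x) == sort(x)) # should be TRUE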

Intrinsic limits (so far as we can tell) to how easily certain sorts of problems can be solved.

Clever algorithms vs. exploiting special structure in problems.

Really deserves attention on its own: read Moore and Mertens. (Though Kleinberg and Tardos may be a bit more directly practical, Moore and Mertens will do more for your soul.)
