Wednesday, December 16, 2009

And first, we play with the hardware...

In order to run the climate-modelling software more effectively, our second volunteer (Eric Raymond) bought a new motherboard that should roughly double the speed of his computer. It's a 2.66 GHz Intel Core 2 Duo. It is *not* a super-high-end quad-core hotrod; and thereby hangs a tale.

Some kinds of algorithms parallelize well - graphics rendering is one of the classic examples, signal analysis for oil exploration is another. If you are repeatedly transforming large arrays of measurements in such a way that the transform of each point depends on simple global rules, or at most information from nearby measurements, the process has what computer scientists call "good locality". Algorithms with good locality can, in theory, be distributed to large numbers of processors for a large speedup.
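
To make "good locality" concrete, here is a minimal sketch - generic C with OpenMP, chosen only for illustration and not taken from any climate model - of the kind of loop that parallelizes trivially: every output element depends only on its own input and a fixed global rule, so the iterations can be handed to different cores with no coordination.

/* transform.c - an "embarrassingly parallel" point-wise transform.
 * Each output element depends only on its own input and a global
 * constant, so the loop can be split across cores with no coordination.
 * Build with, e.g.:  gcc -O2 -fopenmp transform.c -lm -o transform
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 10000000L

int main(void)
{
    double *in  = malloc(N * sizeof(double));
    double *out = malloc(N * sizeof(double));
    if (!in || !out)
        return 1;

    for (long i = 0; i < N; i++)
        in[i] = (double)i / N;            /* stand-in "measurements" */

    const double scale = 2.5;             /* a simple global rule */

    /* Good locality: no iteration reads or writes another iteration's
     * data, so OpenMP can distribute the loop without further effort. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        out[i] = scale * sin(in[i]);

    printf("out[N/2] = %f\n", out[N / 2]);
    free(in);
    free(out);
    return 0;
}

The same pragma buys nothing for a loop in which each iteration depends on the result of the previous one - which is the situation the next paragraph describes.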

Some algorithms are intrinsically serial. A good example is the kind of complex, stateful logic used in optimizing the generation of compiled code from a language like C or FORTRAN.

Sometimes, you can artificially carve an intrinsically serial problem into chunks that are coupled in a controlled way - for example, to get faster compilation of a program linked from multiple modules, compile the modules in parallel and run a link pass on them all when done. This approach requires a human programmer to partition the code into module-sized chunks and to control the interfaces carefully.

The kind of math used in climate modeling has some parts with good locality (and thus could theoretically parallelize well) and others that don't. Unfortunately, it's difficult to capture any benefit from throwing the parts with good locality onto multi-core machines, because recognizing that locality and exploiting it to do automatic partitioning is hard.

Here's what "hard" means: computer scientists have been poking at the problem for four decades, and parallelizing compilers are still weak, poor makeshifts that tend to be tied to specialized hardware, require tricky maneuvers from programmers, and don't work all that well even on the limited categories of code they can handle at all.

What it comes down to is that if you're compiling C or Fortran climate-modeling code on a general-purpose machine, each model run is going to use one core and one only. Two cores are handy so that one of them can run flat-out doing arithmetic while the other does housekeeping and OS stuff, but above two cores diminishing returns start to set in pretty rapidly. By the time you get to quad-core machines, two of the processors will be space heaters.

This is good news, in a way. It means really expensive hardware is pointless for this job - or, to put it another way, a modern commodity PC is already nearly as well suited to the task as any hardware anywhere.

Now, on to downloading the code so we can look more closely at what it does and poke at it. Yeah, Eric was going to upgrade his computer soon anyway, and yes, this was an excuse...

8 comments:

  1. A suggestion on building the models: Implementations of IEEE 754 vary wildly in quality, and it is quite possible (depending on just which functions you use) to get diverging results on different FPUs, making it difficult to determine reproducibility. If at all possible I'd try using something like fdlibm (http://www.netlib.org/fdlibm/) to get decent FP reproducibility and forestall some of the inevitable complaints if people copy your work and find that it doesn't reproduce perfectly. A minimal bit-for-bit check is sketched below.

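    For example, a minimal check - just a sketch in generic C, not tied to any particular model, with an arbitrary file name - is to print the raw bit patterns of a few libm results so the output can be diffed across machines, or before and after relinking against fdlibm:

    /* fpcheck.c - print the exact bit patterns of a few libm results so
     * two machines (or two libm builds) can be compared with plain diff.
     * Build with, e.g.:  gcc -O2 fpcheck.c -lm -o fpcheck
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Reinterpret a double as its 64-bit pattern without aliasing tricks. */
    static uint64_t bits(double x)
    {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        return u;
    }

    int main(void)
    {
        const double inputs[] = { 0.1, 1.0, 2.5, 100.0, 1e-6 };
        const size_t n = sizeof inputs / sizeof inputs[0];

        for (size_t i = 0; i < n; i++) {
            double x = inputs[i];
            printf("x=%-8g sin=%016llx exp=%016llx log=%016llx\n", x,
                   (unsigned long long)bits(sin(x)),
                   (unsigned long long)bits(exp(x)),
                   (unsigned long long)bits(log(x)));
        }
        return 0;
    }

    If two platforms print identical lines for these (and similar) inputs, their math libraries agree bit-for-bit; if not, the fdlibm suggestion above is worth the trouble.
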
  2. This is just a feeling I get, never having really played with a full-sized one of these, but my understanding is that they run quite well, given their gridded nature, on multi (and I do mean *multi*) processor machines. This is why NCAR, GISS, GFDL and other top labs have systems listed on the Top 500 list. My guess is that CCSM is usually run on "bluefire", a 4064-processor IBM Power system which currently ranks at 80 on the list.

    Not to put down your effort, but you might want to start with a simpler model which is freely available, EdGCM. This is a model which runs on a PC and is based on the code for GISS Model II, which dates from the late '80s and early '90s. I've downloaded this model and it has a fighting chance of running on your hardware. On a 2.8 GHz single-core Pentium 4/HT (I know, time to buy a new machine) it runs OK, but running to equilibrium takes a while.

    Good luck, and EdGCM comes with preconfigured run decks for suggested experiments. Does CCSM have this sort of support available? If not, getting to your first goal of simulating early 20th century temps may be difficult since you will have to develop the dataset to use for the run deck.

  3. You might take a look at the FAQ to get an idea of how long a run might take. Using the coarsest grids it would take at least a week to run the 20th C sims (4 days spin-up, 4 days with forcings), and this is with a 64-processor Power machine (a wimpy supercomputer).

    I also noticed that they do not provide run decks for the 20th C simulations, so that might be a fun project too.

    Good luck!

  4. I'm not dead... I'm just in the middle of tearing out my upstairs and completely redoing it. Right now I've put a hole in a wall that abuts my attic, and the wife is sort of unhappy about the 10-degree air pouring into the house. (OK, I got that fixed in a day, but the shower is still torn apart.) So until I get all this fixed, I won't be doing much on this project.

    Rattus and John,

    Thanks for your comments. They dovetail with the gut feeling I've got from reading the first of the many manuals UCAR/NCAR puts out for these. I have downgraded my planned resolution to a much coarser grid because I doubt my machine can do much with the full-resolution grid.

  5. Larry,

    Even with the lowest-resolution grid which is verified (T31_gx3v5), it just ain't gonna fly. This is why I keep pushing EdGCM. It is still a production model used where "computing resources are limited" (GISS), the source code for the model itself is available if you want to look at it here, and it has the run decks you have expressed interest in running.

    My guess is that you might see 4 years per wall-clock day using the lowest verified resolution. That means about 2 months to run an experiment with 20th century forcings: the century of forcings alone is roughly 100 years / 4 years per day = 25 days, and you have to remember that the model needs spin-up time before you apply the 20th century forcings, which accounts for the rest. With EdGCM/Model II you can do this in less than a day.

  6. CCSM4 is out; grab it.

  7. Any news (ESR's blog censors my comments) on this project?

    Or is this just another blog infested by Chinese porno spam?

    You guys sure gave up easily.

  8. I gave up because I'm not an atmospheric chemist, I'm a computer programmer. I simply do not understand the subtleties of what numbers to plug in and why.
