Friday, November 27, 2009

On Data Sets and Merges

One of the problems with science (observational and experimental both) is that you can get conflicting data sets. Indeed, more scientific papers are discredited by misused (or non-repeatable) data sets than by any other cause.

Data set gathering is incredibly tedious and expensive, and in many ways nonreplicable. For example, there are plenty of dendrochronology data sets out there (measurements of the width of tree rings as a measurement of rainfall, CO2 and nitrogen in soils); not all of them are useful as climate proxies. Some - coming from very carefully selected points - may be good measurements of local climate effects.

This is particularly important for creating climate records prior to about 1640, when the first scientifically reliable thermometers came about.

One of the important tasks in ANY scientific endeavor is to winnow the data. In climatology particularly, you have a lot of noisy, geographically discontiguous data sets that need to be interpreted and a chronology built. Even the thermometer records from roughly 1860 onwards have issues; Pielke et al. 2007 documents some of the noise signals in the thermometer data. (I'm not going to touch the issue of whether Dr. Pielke's work was blocked from publication by Dr. Karl at NCDC; the paper is a good discussion of where noise comes into contemporary data.)

What this means is that I'm somewhat more forgiving of fudge factors. I've been around practicing science enough to know that everything has fudge factors in science, because if you don't limit the variable set, you can't quantify the outcomes with any reliability.

That lack of reliability is why science reports things in probabilities. People on both sides of the political spectrum turn that into "But that means you're not certain of the outcome. I'll change my position when you can give me absolute certainty." Normally, this comes about when you have a data read or analysis that reveals an uncomfortable truth.

Good science discusses where the uncertainties in the data set and analysis methods are. Unfortunately, good science that does this routinely gets picked apart by agenda-hawks in the method described above.

Between noisy data sets coming from geographically discontiguous areas, and having to state uncertainty percentages, it is NOT unreasonable to say 'the data are clearly wrong'. The data could come from bad instrumentation, the data could be corrupted by factors that aren't being accounted for, or the data could be correct; when you have multiple data sources and one or two of them are clear outliers, there may well be an instance where 'the data are clearly wrong'. (This is why I don't consider the Phil Jones email to be a smoking gun.)
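For the technically inclined, here is a rough sketch of what "one or two of them are clear outliers" can look like in practice. It is not how any climate group actually screens its data; it is just one common robust check (the modified z-score, built on the median and the median absolute deviation). The station names, the readings, and the 3.5 cutoff are all made up for illustration.

```python
from statistics import median


def flag_outlier_sources(readings, threshold=3.5):
    """Return the names of sources whose value sits far from the group median.

    `readings` maps a (hypothetical) source name to a single measurement.
    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(readings.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that one wild
    # reading cannot drag around the way it drags a standard deviation.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half the sources agree exactly; the score is undefined,
        # so this sketch declines to flag anything.
        return []
    return [
        name
        for name, value in readings.items()
        if 0.6745 * abs(value - center) / mad > threshold
    ]


# Invented numbers, not real station data.
readings = {
    "station_a": 14.1,
    "station_b": 14.3,
    "station_c": 13.9,
    "station_d": 14.2,
    "station_e": 19.8,
}
print(flag_outlier_sources(readings))
```

A flagged source is only a candidate for a closer look. The check cannot tell you whether the cause is a bad instrument, an unaccounted-for factor, or a real local effect; that part still takes a human going through the records.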

A later post is going to cover what data sets are being used for climate science, what their noise sources are, and how those noise sources are corrected for. It will be written for a technical layman (because that's what I really am); I've got friends who know more who'll look things over to make sure that any obvious gaffes are fixed. There will, almost certainly, be elisions of technical information.


  1. Thanks for your evenhanded commentary. The folks who have been pulling comments from the "HARRY_READ_ME.txt" file and extrapolating are being quite unfair.

    I've worked with observational data before, and even after it's been put in a reasonably coherent form, it can be a real bear to clean up, analyze, and use for other purposes.

    I just can't imagine that those folks calling for Phil Jones' head on a platter would be willing to invest the effort it takes to go through the genuine "raw data" and make it usable. It is, as you say, a very tedious task.

  2. G-Man, I'm still not happy with the behavior of Phil Jones et al. They have a lot to explain, and having them lose some credibility over this is quite reasonable for ducking FOI requests, treating people with contrary opinions as political hacks and more.

    EG, at some point, I'm likely to take a position that you disagree with. I hope you'll still regard me as even-handed when it's an ox you're fond of on the spit. *grin*

    That being said, if you find this blog useful - please spread the word. The only way I get the message out is to have people refer back to this blog.

  3. Scientists are VERY VERY careful to say,

    A causes B within parameters C, D, and E with probability R.

    The above explains why I hate engaging in debates with people who didn't go through the same training and work experience that I did. They drop the C, D, and E, then claim that since R does not equal 1, you are only guessing. That is why Phil is treating anyone with contrary opinions as a political hack.... This topic is SO politicized that it is impossible to disentangle the genuine scientific dissent from the hackery. I think it is important to point out how incredibly frustrating that is.