Tuesday, September 6, 2016

Questionable "Scientific" Papers - 10

Fig. 1
As a follow-up to the previous post in this series, I wanted to share some facts about the data I use to generate graphs that regular readers peruse.

I want you to know that a lot of work goes into it, but sometimes it may not show.

Database management, such as the World Ocean Database (WOD) and the Permanent Service for Mean Sea Level (PSMSL) are as much a part of modern science as looking through a microscope is.

Especially in the sense that they make the data available to us for our close scrutiny, to help us discern the junk from the science.

Looking back on Labor Day, I wanted you to have a better understanding of why I criticize those who want to take the work out of science, by taking the data gathering and sharing out of it.

In today's post, I will also show some appreciation for the scientists who painstakingly gather data for us.

For example, Fig. 1 is a screen capture of my query of an SQL database "rawwod" where I store downloaded WOD data.
Fig. 2 Each '4' has '9', and each '9' has '18'

I call it "raw" because no averaging has been done, it is the raw measurements of scientists who study the ocean and then share the results of their work.

Anyway, as you can see, there are a lot of measurements stored in those four tables:
raw1000_v1: 42,538,956
raw3000_v1: 41,307,714
raw5000_v1: 52,364,650
raw7000_v1: 36,979,185
total:            173,190,505
Think of the scientists who, over many years, have been, and still are, taking billions of measurements on the stormy, windy, and undulating high seas.

Fig. 3 Limit 25 rows
That over one hundred seventy three million measurements I am working with is a tiny fraction of the whole (it will double, or more as I download more zones and add the v2 salinity).

BTW, you may be wondering what the "raw1000_v1, raw3000_v1, raw5000_v1, and raw7000_v1" names are about (the 'v1' means WOD variable 1, which is temperature; 'v2' is salinity which I have not gotten to yet, but I have them substantially downloaded and ready for their SQL tables).

Those who have viewed the WOD database will note that there are four basic zone sections (1000, 3000, 5000, and 7000).

From a programing perspective, they are defined by the first digit (1,3,5,7).

Each of those has nine subsections (0 - 8), which again, from a programming perspective, are defined by the second digit (1000, 3100, 5500, and 7800, etc.).

Those in turn have eighteen sub-subsections (1000-1017), which are defined by the last two digits of a zone's number (incidentally the zone number tells us its latitude and longitude boundaries too).

Anyway, as shown in Fig. 2, the zone numbers add up to 648 WOD zones (4 × 9 × 18 = 648).
Fig. 4 2,007,831 rows

A software developer has to deal with the original situation as it is, not how they would do it (I like the WOD numbering system).

In this case, a "tree" structure is the way I chose to handle this problem, because there are gaps.

One can't use a 648 element array because the zone numbers are not sequential.

They proceed as 1000-1017 (gap 1018 - 1099), then proceed as 1100-1117 (more gaps on the way to 1817), then the 3000-3017 series begins, (gap 3018 - 3099) 3100-3117 (gaps ...)  etc. etc. so the tree structure is an efficient way to code a traverse of the WOD zones (see Fig. 2).

Notice Fig. 3 and Fig. 4 to get a glimpse of what the rows of records look like "in the raw."

The SQL query that produced the list in Fig. 3 has an SQL "limit 25" clause in it, so it only produced 25 rows.

Fig. 5 the history table
I "let fly" with Fig. 4 by removing that limit, so it retrieved all rows for that zone.

But, I can't show you all 2,007,831 rows in zone 1112 ... it would fill way too many blog pages.

But you get my drift, there is a lot of data to look at, so I have to tailor results to viable blog post lengths.

Notice the many different depths in meters (next to last column @ Fig. 4) just so you understand that I also had to deal a blow to the volume of data involved.

Fig. 6 Finally, a graph !
I did so by using the "0-200", "200-400", "400-600", "600-800", "800-1000", "1000-3000", and ">3000" depth levels ('>' means greater than).

All the many variant depths and values are isolated to, or are averaged (by year) into seven levels, covering the entire water column from the surface to the bottom.

What this all boils down to is depicted in Fig. 5, which shows the SQL query in the 'history' table (which is what is produced from the '2,007,831 rows' @ Fig. 4) on WOD Zone 1112, after the condensing.

Then, what you typically are shown is the graph at Fig. 6.

In this post I have not mentioned the downloading and slicing of the PI data (my nickname for their file format) because that was covered in the previous post.

Now you know why I was also critical, in the previous post, of the scientist who wanted to estimate these billions of measurements with a "mathematical model":
With recently improved instrumental accuracy, the change in the heat content of the oceans and the corresponding contribution to the change of the sea level can be determined from in situ measurements of temperature variation with depth. Nevertheless, it would be favourable if the same changes could be evaluated from just the sea surface temperatures because the past record could then be reconstructed and future scenarios explored. Using a single column model we show that the average change in the heat content of the oceans and the corresponding contribution to a global change in the sea level can be evaluated from the past sea surface temperatures.
(Ocean Science, 6, 179–184, 2010, emphasis added). And that ill-advised misadventure was advocated without even going to ARGO land where the PFL measurements are made by accurate automated submersibles (ARGO net).

[The WOD "PFL" files are robust:  "Note: WOD includes GTSPP and Argo data, plus more" - Ocean Profile Data]

The previous post in this series is here.

1 comment:

  1. Data are getting important:

    "As the research community embraces data sharing, academic journals can do their bit to help. Starting this month, all research papers accepted for publication in Nature and an initial 12 other Nature titles will be required to include information on whether and how others can access the underlying data" (link).