Thursday, September 20, 2018

Databases Galore - 23

Fig. 1 Modified WOD Yearly Data Scheme
Just in case you regular readers think I have been goofing off, this is a status report.

As you know, I decided to go "whole hog" on the World Ocean Database (WOD) data.

That means downloading all of it.

Don't try this at home unless you have some patience and some fairly hefty computing machines (whatever that means).

It took me 8 hours of straight download time to get the data depicted in Fig. 1.

I used a wget script with the parameters supplied by WOD ("wget -N -nH -nd -r -e robots=off --no-parent --force-html ").

It downloads 240 compressed (.gz) files (each file is one year of data) in each category shown in Fig. 1 (the total is "2,382 files, totaling 21.8 GB" of compressed files).

After downloading those compressed files, I decompress them and place the ones that contain data into super-compressed (.tar.gz) files as shown in Fig. 1.
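
For anyone who wants to try that step, here is a minimal Python sketch of the decompress-and-bundle idea. It is not my actual module, just an illustration under some assumptions: the function name `bundle_nonempty` is made up, "has data" is taken to mean "decompresses to something other than whitespace," and each file is read fully into memory, which is fine for a sketch but not for the biggest files.

```python
import gzip
import io
import tarfile
from pathlib import Path


def bundle_nonempty(gz_dir, tar_path):
    """Decompress every .gz in gz_dir and add the ones with data to one .tar.gz.

    Hypothetical helper. Returns (names_added, names_skipped).
    """
    added, skipped = [], []
    with tarfile.open(tar_path, "w:gz") as tar:
        for gz_file in sorted(Path(gz_dir).glob("*.gz")):
            # Do not try to bundle a bundle (including the one being written).
            if gz_file.name.endswith(".tar.gz"):
                continue
            with gzip.open(gz_file, "rb") as fh:
                data = fh.read()  # whole file in memory (assumption: it fits)
            # Assumed emptiness test: nothing but whitespace after decompression.
            if not data.strip():
                skipped.append(gz_file.name)
                continue
            info = tarfile.TarInfo(name=gz_file.name[:-3])  # drop the ".gz"
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            added.append(info.name)
    return added, skipped
```

Calling something like `bundle_nonempty("downloads/category", "category.tar.gz")` (both paths are placeholders) would give you one bundle per category and a list of the placeholder files that were left out.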

Some of the resulting files are very large (e.g. 6+ gigabytes) and some are only 29 bytes (all hat, no cattle ... i.e. empty placeholder files).

The reason for all that is that the WOD download script is designed to handle data that may show up from some museum somewhere (LOL), and then there is also the future.

The download script works beautifully because the next time you run it, it only downloads the files that have changed on the server since your last download (that is the -N timestamping option at work).

And history being what it is, that means the next download will only take a few minutes instead of a few hours (assuming you keep those files intact in a persistent download directory ... meaning dedicating 20 or 30 gigabytes to the cause as I have).

Anyway, after decompressing and re-compressing those 2,382 some-odd files into eleven (11) mega-files as shown in Fig. 1 (and answering the question "why am I doing this"), one is ready to "begin."

Since I am bitching about all the "hardddd weeeerrrrkk" it takes to please you hungry-for-knowledge readers, I also had to write a module to fragment/divide some of the huge (3-6 gig) files into smaller files because my computer tasted them and said "ptewy ... you must be kidding" (too large).

I have gigs of RAM, but even that is not enough in some cases.
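
The fragmenting idea can be sketched roughly like this in Python. Again, this is an illustration and not the module I wrote: the name `split_by_lines` and the `.partNNN` naming are made up, and it assumes it is safe to cut between any two lines. If a record spans several lines, the cut test would also need to check that the next line starts a new record.

```python
from pathlib import Path


def split_by_lines(src, out_dir, max_bytes):
    """Split src into numbered parts of roughly max_bytes, cutting only between lines.

    Hypothetical helper. Streams the input, so it never holds the whole file in RAM.
    A single line longer than max_bytes still gets written, in a part of its own.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    written = 0
    try:
        with src.open("rb") as fh:
            for line in fh:
                # Start a new part for the first line, or when this line would
                # push a non-empty part past the size limit.
                if out is None or (written and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    part = out_dir / f"{src.name}.part{len(parts):03d}"
                    parts.append(part)
                    out = part.open("wb")
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Gluing the parts back together in order gives back the original file byte for byte, so nothing is lost by fragmenting.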

I now have 98% of the WOD format files that I have been telling you about converted into .CSV files (which I can load into my SQL server).
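
If you want to see what the "load a .CSV into SQL" step looks like in miniature, here is a hedged Python sketch. It uses the built-in `sqlite3` module purely as a stand-in (it is not my SQL server), the function name `load_csv` is made up, and it assumes the first row of the file is a header and that every value can be stored as text.

```python
import csv
import sqlite3


def load_csv(conn, csv_path, table):
    """Create `table` from the CSV header row and insert every data row as text.

    Hypothetical helper; sqlite3 stands in for a real SQL server here.
    Returns the number of rows in the table after loading.
    """
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)  # assumption: first row holds the column names
        cols = ", ".join(f'"{name}"' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
```

With a connection from `sqlite3.connect(":memory:")` and a small test file you can check the row count before pointing anything at the big files.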

And truth be told, I am excited to see how the new data looks.

The next post in this series is here; the previous post in this series is here.

Thanks Mark ...