2011-03-11

the smallest way to store data

is not to store it at all.
A unique ID for a US Census block winds up being 15 decimal digits, which fits handily into an 8-byte int.
Actually there are fewer than 10,000,000 blocks in the US, so a block's position in the full list could easily be a 32-bit number.
But if what I really want to do is store a mapping from each block to its district number (easily a 1-byte number), the smallest way to store this is just a list of district numbers. Use the Census data file as the canonical ordering of the blocks, so the Nth byte is the district for the Nth block and the block IDs never need to be stored.
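Here is a minimal sketch of that in Python. It assumes the canonical ordering is already loaded as a list of block ID strings; the function names `encode_districts` and `decode_districts` and the sample IDs are made up for illustration.

```python
def encode_districts(canonical_ids, district_of):
    """Pack one district number (0-255) per block, in canonical order."""
    # bytes() raises ValueError if any district number does not fit in a byte.
    return bytes(district_of[block_id] for block_id in canonical_ids)


def decode_districts(canonical_ids, packed):
    """Rebuild the block -> district mapping from the byte list."""
    if len(packed) != len(canonical_ids):
        raise ValueError("byte list does not match the canonical block list")
    return dict(zip(canonical_ids, packed))


# Made-up 15-digit block IDs, for illustration only.
canonical_ids = ["480019501001000", "480019501001001", "480019501001002"]
district_of = {
    "480019501001000": 5,
    "480019501001001": 5,
    "480019501001002": 17,
}

packed = encode_districts(canonical_ids, district_of)
restored = decode_districts(canonical_ids, packed)
```

Both sides have to agree on exactly the same Census file, since the ordering is the only thing tying a byte back to its block.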
As CSV, each row becomes 15 decimal digits, a comma, one to three decimal digits, and a newline: up to 20 bytes per block versus 1.
For the hundreds of thousands of blocks in Texas, the gzipped CSV is a 2372 KB file. The gzipped byte list is 32 KB.
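A rough way to try this kind of comparison yourself, on synthetic data rather than the real Texas file. The block IDs here are made-up sequential numbers, the district count is arbitrary, and districts are assigned in long runs on the assumption that neighboring blocks mostly share a district. The sizes it prints will not match the numbers above.

```python
import gzip

NUM_BLOCKS = 100_000   # arbitrary, not the real Texas count
NUM_DISTRICTS = 36     # arbitrary, just needs to fit in one byte
RUN_LENGTH = 3_000     # assumed: consecutive blocks tend to share a district

# Synthetic 15-digit block IDs in "canonical" order.
block_ids = [f"{480000000000000 + i:015d}" for i in range(NUM_BLOCKS)]
districts = [(i // RUN_LENGTH) % NUM_DISTRICTS + 1 for i in range(NUM_BLOCKS)]

# Format 1: CSV rows of "block_id,district\n".
csv_bytes = "".join(
    f"{block_id},{district}\n" for block_id, district in zip(block_ids, districts)
).encode("ascii")

# Format 2: one byte per block, IDs implied by position.
byte_list = bytes(districts)

csv_gz = gzip.compress(csv_bytes)
list_gz = gzip.compress(byte_list)

print("raw CSV:      ", len(csv_bytes), "bytes")
print("raw byte list:", len(byte_list), "bytes")
print("gzipped CSV:  ", len(csv_gz), "bytes")
print("gzipped list: ", len(list_gz), "bytes")
```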
Sadly, a CSV file in a .zip archive seems to be the common interchange format for these things.
At least I get to use my format between my client and my server.
