The following sequence could lead to this corruption:
(1) Append new data to bulkdata
(2) Update interval file positions in SQL
(3) Flush (2)
(4) Crash before flushing (1)
(5) Reload database without running fsck
(6) Start writing new data to the end of bulkdata, introducing a new interval
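The inconsistency left by steps (1)-(5) shows up as SQL interval positions that point past the end of the actual data file. A minimal fsck-style check for that condition might look like this (a sketch; `check_interval_positions` and the `(filename, start_pos, end_pos)` row shape are assumptions, not nilmdb's actual schema):

```python
import os

def check_interval_positions(bulkdata_path, intervals):
    """Hypothetical fsck-style check: every interval's recorded end
    position must lie within the actual bulkdata file.  'intervals'
    is a list of (filename, start_pos, end_pos) rows from the SQL
    side.  Returns a list of (filename, recorded_end, actual_size)
    tuples for intervals that point past the end of their file."""
    errors = []
    for filename, start_pos, end_pos in intervals:
        actual = os.path.getsize(os.path.join(bulkdata_path, filename))
        if end_pos > actual:
            # SQL metadata points past the end of the data file: the
            # appended data from step (1) was lost in the crash, but
            # the flushed metadata from step (3) survived.
            errors.append((filename, end_pos, actual))
    return errors
```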
This probably means that the file was extended and its metadata was
journaled, but the system crashed before the data blocks were written.
If that happens, the end of the file will be zeroed out. We don't
bother checking the entire file here; if we see even one timestamp
that is unexpectedly zero, we truncate the data at that point.
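The scan described above can be sketched as follows. The fixed-size-row layout, little-endian signed 64-bit timestamp, and the `find_zero_timestamp` name are all assumptions for illustration, not nilmdb's actual on-disk format:

```python
import struct

def find_zero_timestamp(raw, row_size, ts_offset=0):
    """Hypothetical scan: return the byte offset of the first row
    whose timestamp field is zero, or None if no timestamp is zero.
    Assumes a little-endian signed 64-bit timestamp at 'ts_offset'
    within each fixed-size row."""
    for pos in range(0, len(raw) - row_size + 1, row_size):
        (ts,) = struct.unpack_from("<q", raw, pos + ts_offset)
        if ts == 0:
            # Unexpectedly zero: assume the filesystem zero-filled
            # the tail after the crash, and truncate here.
            return pos
    return None
```

The caller would then truncate the data file at the returned offset and trim the SQL intervals to match.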
A bulkdata dir may get created for a new stream with an empty or
corrupted _format, before any data actually gets written. In that
case, we can just delete the new stream; worst case, we lose some
metadata.
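That cleanup might look like the sketch below. The `_format`-as-pickle assumption, the directory layout, and the `remove_stream_if_unstarted` name are all hypothetical; the key point is that deletion is only safe when no data files exist yet:

```python
import os
import pickle
import shutil

def remove_stream_if_unstarted(stream_dir):
    """Hypothetical cleanup: if the stream's _format file is empty or
    unreadable AND no data files exist yet, delete the directory.
    Returns True if the stream was removed."""
    fmt = os.path.join(stream_dir, "_format")
    try:
        with open(fmt, "rb") as f:
            pickle.load(f)
        return False  # _format parses fine; leave the stream alone
    except (OSError, EOFError, pickle.UnpicklingError):
        pass
    # Only data-free streams are safe to drop; anything else needs
    # real repair, not deletion.
    others = [n for n in os.listdir(stream_dir) if n != "_format"]
    if others:
        raise RuntimeError("corrupt _format but data present: %s"
                           % stream_dir)
    shutil.rmtree(stream_dir)
    return True
```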
Note: The info in _format should really get moved into the database.
This was born when bulkdata switched from PyTables to a custom storage
system, and was probably stored this way to avoid tying the main DB
to specific implementation details while they were still in flux.
Previous commits went back and forth a bit on whether the various APIs
should use bytes or strings, but bytes appears to be a better answer,
because actual data in streams will always be 7-bit ASCII or raw
binary. There's no reason to apply the performance penalty of
constantly converting between bytes and strings.
One drawback is that lots of code now needs "b" prefixes on string
literals, especially in tests, which inflates this commit quite a bit.
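A small illustration of the tradeoff (the sample line format is hypothetical): in Python 3, ASCII stream data can be split and parsed directly as bytes, with no decode on read or encode on write.

```python
# Hypothetical line of ASCII stream data: timestamp plus two values.
line = b"1234567890.000000 1.5 2.5\n"

# Working directly in bytes: split and parse without any conversion.
# float() accepts ASCII bytes directly in Python 3.
fields = line.split()
timestamp = float(fields[0])
values = [float(f) for f in fields[1:]]

# The str-based alternative pays a decode on every read (and an
# encode on every write) just to do the same work:
decoded = line.decode("ascii")
```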
Normally, indexes for an array are expected to fit in a platform's
native long (32- or 64-bit). In nilmdb, tables aren't real arrays,
and we need to handle unbounded indices.
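Since Python ints have no fixed width, the pure-Python side handles this for free; only C-level code needs care with its integer types. A sketch of the idea (the rows-per-file constant, row size, and `row_to_location` name are made up for illustration):

```python
ROWS_PER_FILE = 2 ** 22  # hypothetical: rows stored per data file

def row_to_location(index, rows_per_file=ROWS_PER_FILE, row_size=16):
    """Map an unbounded row index to (file number, byte offset).
    Python ints are arbitrary-precision, so this works even when
    'index' exceeds a C long."""
    filenum, row_in_file = divmod(index, rows_per_file)
    return filenum, row_in_file * row_size
```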