nilmdb.nilmdb is the NILM database interface. A nilmdb.BulkData interface stores data in flat files, and a SQL database tracks metadata and ranges.
Access to the nilmdb must be single-threaded. This is handled with the nilmdb.serializer class. In the future this could probably be turned into a per-path serialization.
nilmdb.server is an HTTP server that provides an interface to talk, through the serialization layer, to the nilmdb object.
nilmdb.client is an HTTP client that connects to this server.
Committing a transaction in the default sync mode (PRAGMA synchronous=FULL) takes about 125 ms. sqlite3 will commit transactions at three points:
explicit con.commit()
between a series of DML commands and non-DML commands, e.g. after a series of INSERT, SELECT, but before a CREATE TABLE or PRAGMA.
at the end of an explicit transaction, e.g. “with self.con as con:”
To speed up testing, or if this transaction speed becomes an issue, the sync=False option to NilmDB will set PRAGMA synchronous=OFF.
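A minimal sketch of how a sync option could map onto the pragma, using Python's standard sqlite3 module. The open_database helper and the ranges table are hypothetical names for illustration only; this is not NilmDB's actual code.

```python
import sqlite3

def open_database(path, sync=True):
    """Open a SQLite database; sync=False trades durability for speed."""
    con = sqlite3.connect(path)
    # The pragma is issued before any transaction has been opened.
    con.execute("PRAGMA synchronous=" + ("FULL" if sync else "OFF"))
    return con

con = open_database(":memory:", sync=False)
con.execute("CREATE TABLE ranges (start_pos INTEGER, end_pos INTEGER)")

# Leaving the "with" block commits, or rolls back if an exception occurred.
with con:
    con.execute("INSERT INTO ranges VALUES (?, ?)", (0, 100))
    con.execute("INSERT INTO ranges VALUES (?, ?)", (100, 200))
```

Grouping several statements inside one "with" block means the commit cost is paid once rather than once per statement.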
We need to send the contents of “data” as POST. Do we need chunked transfer?
Before timestamps are added:
Raw data is about 440 kB/s (9 channels)
Prep data is about 12.5 kB/s (1 phase)
How do we know how much data to send?
Converting from ASCII to PyTables:
Maybe:
# threaded side creates this object
parser = nilmdb.layout.Parser("layout_name")
# threaded side parses and fills it with data
parser.parse(textdata)
# serialized side pulls out rows
for n in range(parser.nrows):
    parser.fill_row(rowinstance, n)
    table.append()
stream_get_ranges(path)
-> return IntervalSet?
First approach was quadratic. Adding four hours of data:
$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
real 24m31.093s
$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
real 43m44.528s
$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
real 93m29.713s
$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
real 166m53.007s
Disabling pytables indexing didn’t help:
real 31m21.492s
real 52m51.963s
real 102m8.151s
real 176m12.469s
Server RAM usage is constant.
Speed problems were due to IntervalSet: on every insert, all intervals were parsed from the database and the new one was added to them.
First optimization is to cache the result of nilmdb:_get_intervals, which gives the best speedup.
Also switched to internally using bxInterval from the bx-python package.
Speed of tests/test_interval:TestIntervalSpeed is pretty decent and seems to be growing logarithmically now. About 85 μs per insertion for inserting 131k entries.
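To illustrate why a sorted structure makes the overlap check logarithmic, here is a minimal sketch using Python's bisect module. It is a stand-in technique, not the bxInterval or NilmDB implementation, and the SimpleIntervalSet class is a hypothetical name. Intervals are assumed to be half-open [start, end) and non-overlapping.

```python
import bisect

class SimpleIntervalSet:
    """Sorted, non-overlapping, half-open [start, end) intervals."""

    def __init__(self):
        self._starts = []
        self._ends = []

    def overlaps(self, start, end):
        # Find the first stored interval whose end lies beyond 'start';
        # it is the only candidate that needs checking.
        i = bisect.bisect_right(self._ends, start)
        return i < len(self._starts) and self._starts[i] < end

    def add(self, start, end):
        if start >= end:
            raise ValueError("start must be less than end")
        if self.overlaps(start, end):
            raise ValueError("interval overlaps existing data")
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._ends.insert(i, end)
```

The search is O(log n), but list.insert still shifts elements, so a tree-based structure is the better fit for very large sets.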
Storing the interval data in SQL might be better, with a scheme like: http://www.logarithmic.net/pfh/blog/01235197474
Next slowdown target is nilmdb.layout.Parser.parse().
Rewrote parsers using cython and sscanf
Stats (rev 10831), with _add_interval disabled:
layout.pyx.Parser.parse:128          6303 sec, 262k calls
layout.pyx.parse:63                  13913 sec, 5.1g calls
numpy:records.py.fromrecords:569     7410 sec, 262k calls
Probably OK for now.
After all updates, it now takes about 8.5 minutes to insert an hour of data, constant after adding 171 hours (4.9 billion data points).
Data set size: 98 gigs = 20 bytes per data point. 6 uint16 data + 1 uint32 timestamp = 16 bytes per point. So compression must be off -- will retry with compression forced on.
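The arithmetic above can be checked quickly with the struct module. This assumes "98 gigs" means 98e9 bytes; the variable names are for illustration only.

```python
import struct

# One row as described above: 1 uint32 timestamp + 6 uint16 values.
row_bytes = struct.calcsize("<I6H")

# Observed bytes per data point, assuming 98e9 bytes over 4.9e9 points.
observed_bytes = 98e9 / 4.9e9

# Ratio of what is on disk to the raw row size.
ratio = observed_bytes / row_bytes
```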
Initial implementation was pretty slow, even with binary search in sorted list
Replaced with bxInterval; now takes about log n time for an insertion
__iadd__
Tried blist too, worse than bxinterval.
Might be algorithmic improvements to be made in Interval.py, like in __and__
Replaced again with rbtree. Seems decent. Numbers are time per insert for 2**17 insertions, followed by total wall time and RAM usage for running “make test” with test_rbtree and test_interval with range(5,20):
Would like to move Interval itself back to Python so other
non-cythonized code like client code can use it more easily.
Testing speed with just test_interval being tested, with range(5,22), using /usr/bin/time -v python tests/runtests.py, times recorded for 2097152:
52ae397 (Interval in cython):
12.6133 μs each, ratio 0.866533, total 47 sec, 399 MB RAM
Current/old design has specific layouts: RawData, PrepData, RawNotchedData. Let’s get rid of this entirely and switch to simpler data types that are just collections and counts of a single type. We’ll still use strings to describe them, with format:
type_count
where type is “uint16”, “float32”, or “float64”, and count is an integer.
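A minimal sketch of parsing such a string, assuming only the three types listed above. The parse_layout and struct_format helpers are hypothetical names and are not the nilmdb.layout API.

```python
import re

# struct-module codes for the three element types named above
STRUCT_CODES = {"uint16": "H", "float32": "f", "float64": "d"}

def parse_layout(name):
    """Split a layout string like "uint16_6" into (type, count)."""
    match = re.fullmatch(r"(uint16|float32|float64)_([0-9]+)", name)
    if match is None:
        raise ValueError("unrecognized layout: %r" % name)
    count = int(match.group(2))
    if count < 1:
        raise ValueError("count must be at least 1")
    return match.group(1), count

def struct_format(name):
    """Build a little-endian struct format for the data columns only."""
    dtype, count = parse_layout(name)
    return "<%d%s" % (count, STRUCT_CODES[dtype])
```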
nilmdb.layout.named() will parse these strings into the appropriate handlers. For compatibility:
"RawData" == "uint16_6"
"RawNotchedData" == "uint16_9"
"PrepData" == "float32_8"
BulkData is a custom bulk data storage system that was written to replace PyTables. The general structure is a data subdirectory in the main NilmDB directory. Within data, paths are created for each created stream. These locations are called tables. For example, tables might be located at
nilmdb/data/newton/raw/
nilmdb/data/newton/prep/
nilmdb/data/cottage/raw/
Each table contains:
An unchanging _format file (Python pickle format) that describes parameters of how the data is broken up, like files per directory, rows per file, and the binary data format
Hex named subdirectories ("%04x", although more than 65536 can exist)
Hex named files within those subdirectories, like:
/nilmdb/data/newton/raw/000b/010a
The data format of these files is raw binary, interpreted by the Python struct module according to the format string in the _format file.
Same as above, with .removed suffix, is an optional file (Python pickle format) containing a list of row numbers that have been logically removed from the file. If this range covers the entire file, the entire file will be removed.
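The directory layout above implies a simple mapping from a row number to a file. A minimal sketch, assuming rows fill files in order and files fill subdirectories in order; the row_location helper and its parameter names are hypothetical, chosen to mirror the "rows per file" and "files per directory" parameters in _format.

```python
import posixpath

def row_location(table_root, row, rows_per_file, files_per_dir):
    """Return (path, row offset within that file) for a global row number."""
    filenum, offset = divmod(row, rows_per_file)
    dirnum, fileidx = divmod(filenum, files_per_dir)
    path = posixpath.join(table_root, "%04x" % dirnum, "%04x" % fileidx)
    return path, offset
```

The byte offset within the file would then be the row offset multiplied by the size of one packed row.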
Note that the bulkdata.nrows variable is calculated once in BulkData.__init__(), and only ever incremented during use. Thus, even if all data is removed, nrows can remain high. However, if the server is restarted, the newly calculated nrows may be lower than in a previous run due to deleted data. To be specific, this sequence of events:
will result in having different row numbers in the database, and differently numbered files on the filesystem, than the sequence:
This is okay! Everything should remain consistent both in the BulkData and NilmDB. Not attempting to readjust nrows during deletion makes the code quite a bit simpler.
Similarly, data files are never truncated shorter. Removing data from the end of the file will not shorten it; it will only be deleted when it has been fully filled and all of the data has been subsequently removed.
Original design had the nilmdb.nilmdb thread (through bulkdata) convert from on-disk layout to a Python list, and then the nilmdb.server thread (from cherrypy) converts to ASCII. For at least the extraction side of things, it’s easy to pass the bulkdata a layout name instead, and have it convert directly from on-disk to ASCII format, because this conversion can then be shoved into a C module. This module, which provides a means for converting directly from on-disk format to ASCII or Python lists, is the “rocket” interface. Python is still used to manage the files and figure out where the data should go; rocket just puts binary data directly in or out of those files at specified locations.
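A pure-Python sketch of the idea, in the spirit of pyrocket rather than rocket.c: seek to a row position in a binary file and emit ASCII directly. The extract_ascii helper is a hypothetical name, and the row format (little-endian int64 timestamp followed by uint16 values) is an assumption for illustration.

```python
import struct

def extract_ascii(f, start_row, stop_row, count=6):
    """Read rows [start_row, stop_row) from a binary file as ASCII lines."""
    row = struct.Struct("<q%dH" % count)
    f.seek(start_row * row.size)
    raw = f.read((stop_row - start_row) * row.size)
    # Ignore a trailing partial row if the file is shorter than requested.
    usable = len(raw) - len(raw) % row.size
    lines = []
    for fields in row.iter_unpack(raw[:usable]):
        lines.append(b" ".join(b"%d" % v for v in fields) + b"\n")
    return b"".join(lines)
```

No intermediate list of Python rows is handed to the caller; the function goes from bytes on disk to bytes of ASCII in one step, which is the part a C module can do much faster.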
Before rocket, testing speed with uint16_6 data, with an end-to-end test (extracting data with nilmtool):
After switching to the rocket design, but using the Python version (pyrocket):
After switching to a C extension module (rocket.c)
After client block updates (described below):
Using “insert --timestamp” or “extract --bare” cuts the speed in half.
Generally want to avoid parsing the bulk of the data as lines if possible, and transfer things in bigger blocks at once.
Current places where we use lines:
All data returned by client.stream_extract, since it comes from httpclient.get_gen, which iterates over lines. Not sure if this should be changed, because a nilmtool extract is just about the same speed as curl -q .../stream/extract!
client.StreamInserter.insert_iter and client.StreamInserter.insert_line, which should probably get replaced with block versions. There’s no real need to keep updating the timestamp every time we get a new line of data.
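One way to handle data in blocks instead of lines is to cut each incoming buffer at its last newline and carry the remainder forward. A minimal sketch; split_complete_lines is a hypothetical helper, not part of nilmdb.client.

```python
def split_complete_lines(buf):
    """Split bytes into (block of whole lines, leftover partial line)."""
    end = buf.rfind(b"\n")
    if end < 0:
        # No complete line yet; keep everything for the next round.
        return b"", buf
    return buf[:end + 1], buf[end + 1:]
```

The block can be passed along as a single unit, and only its first and last lines need to be inspected for timestamps.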
Timestamps are currently double-precision floats (64 bit). Since the mantissa is 53-bit, this can only represent about 15-17 significant figures, and microsecond Unix timestamps like 1222333444.000111 are already 16 significant figures. Rounding is therefore an issue; it’s hard to be sure that converting from ASCII, then back to ASCII, will always give the same result.
Also, if the client provides a floating point value like 1.9999999999, we need to be careful that we don’t store it as 1.9999999999 but later print it as 2.000000, because then round-trips change the data.
Possible solutions:
When the client provides a floating point value to the server, always round to the 6th decimal digit before verifying & storing. Good for compatibility and simplicity. But still might have rounding issues, and clients will also need to round when doing their own verification. Having every piece of code need to know which digit to round at is not ideal.
Always store int64 timestamps on the server, representing microseconds since epoch. int64 timestamps are used in all HTTP parameters, in insert/extract ASCII strings, client API, commandline raw timestamps, etc. Pretty big change.
This is what we’ll go with...
Client programs that interpret the timestamps as doubles instead of ints will remain accurate until 2^53 microseconds, or year 2255.
On insert, maybe it’s OK to send floating point microsecond values (1234567890123456.0), just to cope with clients that want to print everything as a double. Server could try parsing as int64, and if that fails, parse as double and truncate to int64. However, this wouldn’t catch imprecise inputs like “1.23456789012e+15”. But maybe that can just be ignored; it’s likely to cause a non-monotonic error at the client.
Timestamps like 1234567890.123456 never show up anywhere, except for interfacing to datetime_tz etc. Command line “raw timestamps” are always printed as int64 values, and a new format “@1234567890123456” is added to the parser for specifying them exactly.
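A small sketch of the points above: parsing as int64 first with a truncating double fallback, the rounding hazard when printing floats, and the 2^53 limit. The parse_timestamp_us helper is a hypothetical name; this is not the server's actual parser.

```python
def parse_timestamp_us(text):
    """Parse a microsecond timestamp, preferring an exact integer parse."""
    try:
        return int(text)
    except ValueError:
        # Fall back to a double and truncate toward zero.
        return int(float(text))

# Formatting to six decimals rounds, so a float round-trip can change data.
rounded = "%.6f" % 1.9999999999

# Doubles hold every integer up to 2**53 exactly; as microseconds since
# the epoch, that limit falls in this (fractional) calendar year.
last_exact_year = 1970 + (2**53 / 1e6) / (365.25 * 86400)
```

As noted above, an imprecise input such as "1.23456789012e+15" is accepted by the fallback and silently loses its low digits.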
The ASCII interface is too slow for high-bandwidth processing, like sinefits, prep, etc. A binary interface was added so that you can extract the raw binary out of the bulkdata storage. This binary is a little-endian format, e.g. in C a uint16_6 stream would be:
#include <endian.h>
#include <stdint.h>
struct {
    int64_t timestamp_le;
    uint16_t data_le[6];
} __attribute__((packed));
Remember to byteswap on big-endian hosts (e.g. with le16toh() and le64toh() from <endian.h> in C)!
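The same record can be decoded in Python with the struct module, where the "<" prefix selects little-endian byte order with no padding and so takes care of the byteswap. A minimal sketch for a uint16_6 stream; decode_records is a hypothetical helper name.

```python
import struct

# Little-endian, no padding: int64 timestamp + 6 x uint16 (uint16_6).
RECORD = struct.Struct("<q6H")

def decode_records(payload):
    """Yield (timestamp, values) tuples from a raw uint16_6 byte string."""
    if len(payload) % RECORD.size:
        raise ValueError("payload is not a whole number of records")
    for fields in RECORD.iter_unpack(payload):
        yield fields[0], fields[1:]
```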
This interface is used by the new nilmdb.client.numpyclient.NumpyClient class, which is a subclass of the normal nilmdb.client.client.Client and has all of the same functions. It adds three new functions:
stream_extract_numpy to extract data as a Numpy array
stream_insert_numpy to insert data as a Numpy array
stream_insert_numpy_context is the context manager for incrementally inserting data
It is significantly faster! It is about 20 times faster to decimate a stream with nilm-decimate when the filter code is using the new binary/numpy interface.
mod_wsgi requires “WSGIChunkedRequest On” to handle “Transfer-encoding: Chunked” requests. However, /stream/insert doesn’t handle this correctly right now, because:
The cherrypy.request.body.read() call needs to be fixed for chunked requests
We don’t want to just buffer endlessly in the server, and it will require some thought on how to handle data in chunks (what to do about interval endpoints).
It is probably better to just keep the endpoint management on the client side, so leave “WSGIChunkedRequest off” for now.
Stream data is passed back and forth as raw bytes objects in most places, including the nilmdb.client and command-line interfaces. This is done partially for performance reasons, and partially to support the binary insert/extract options, where character-set encoding would not apply.
For the HTTP server, the raw bytes transferred over HTTP are interpreted as follows:
For /stream/insert, the client-provided Content-Type is ignored, and the data is read as if it were application/octet-stream.
For /stream/extract, the returned data is application/octet-stream.
/version
/dbinfo
/stream/list
/stream/create
/stream/destroy
/stream/rename
/stream/get_metadata
/stream/set_metadata
/stream/update_metadata
/stream/remove
/stream/intervals