Structure
---------

nilmdb.nilmdb is the NILM database interface.  A nilmdb.BulkData
interface stores data in flat files, and a SQL database tracks
metadata and ranges.

Access to the nilmdb must be single-threaded.  This is handled with
the nilmdb.serializer class.  In the future this could probably be
turned into per-path serialization.

nilmdb.server is an HTTP server that provides an interface to talk,
through the serialization layer, to the nilmdb object.

nilmdb.client is an HTTP client that connects to this.


Sqlite performance
------------------

Committing a transaction in the default sync mode
(PRAGMA synchronous=FULL) takes about 125 ms.  sqlite3 will commit
transactions at three times:

1. On an explicit con.commit()

2. Between a series of DML commands and non-DML commands, e.g. after
   a series of INSERTs and SELECTs, but before a CREATE TABLE or
   PRAGMA.

3. At the end of an explicit transaction, e.g. "with self.con as con:"

To speed up testing, or if this transaction speed becomes an issue,
the sync=False option to NilmDB will set PRAGMA synchronous=OFF.


Inserting streams
-----------------

We need to send the contents of "data" as POST.  Do we need chunked
transfer?

- We don't know the size in advance, so we would need chunked
  encoding if we sent the entire thing in one request.
- But we shouldn't send one chunk per line, so we need to buffer some
  data anyway; why not just make new requests instead?
- Consider the infinite-streaming case: we might want to send data
  immediately?  Not really -- the server should still do explicit
  inserts of fixed-size chunks.
- Even chunked encoding needs the size of each chunk beforehand, so
  everything still gets buffered.  It's just a tradeoff of buffer
  size.

Before timestamps are added:

- Raw data is about 440 kB/s (9 channels).
- Prep data is about 12.5 kB/s (1 phase).
- How do we know how much data to send?

  - Remember that we can only do maybe 8-50 transactions per second
    on the sqlite database.  So if one block of inserted data is one
    transaction, we'd need the raw case to be around 64 kB per
    request, ideally more.
  - Maybe use a range, based on how long it's taking to read the
    data:

    - If there is no more data, send it.
    - If the buffered data exceeds 1 MB, send it.
    - If more than 10 seconds have elapsed, send it.

  - Should those numbers come from the server?

Converting from ASCII to PyTables:

- For each row being added, we need to set attributes on a PyTables
  Row object and call table.append().  This means that there isn't a
  particularly efficient way of converting from ASCII.
- We could create a function like
  nilmdb.layout.Layout("foo").fillRow(asciiline)

  - But this means we're doing parsing on the serialized side.
  - Let's keep parsing on the threaded server side so we can detect
    errors better, and not block the serialized nilmdb behind a slow
    parsing process.

- The client sends ASCII data.
- The server converts this ASCII data to a list of values.
- Maybe:

      # threaded side creates this object
      parser = nilmdb.layout.Parser("layout_name")
      # threaded side parses and fills it with data
      parser.parse(textdata)
      # serialized side pulls out rows
      for n in xrange(parser.nrows):
          parser.fill_row(rowinstance, n)
          table.append()


Inserting streams, inside nilmdb
--------------------------------

- First check that the new stream doesn't overlap (see the sketch
  after this list):

  - Get the minimum and maximum timestamps from the data parser
    (extend the parser to verify monotonicity and track extents).
  - Get all intervals for this stream in the database.
  - See if the new interval overlaps any existing ones.
  - If so, bail.

- Question: should we cache intervals inside NilmDB?

  - Assume the database is fast for now, and always rebuild from the
    DB.
  - A caching layer can be added later if we need it.
  - `stream_get_ranges(path)` -> return IntervalSet?
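The overlap check described above amounts to comparing the new
interval against each existing interval for the stream.  A minimal
sketch, assuming a hypothetical get_intervals(path) helper that
yields existing (start, end) pairs; this is not the actual NilmDB
API:

    # Hypothetical sketch of the overlap check; get_intervals() and
    # the (start, end) tuple format are assumptions, not the real
    # NilmDB interface.
    def overlaps_existing(get_intervals, path, new_start, new_end):
        """Return True if [new_start, new_end) overlaps any interval
        already stored for this stream."""
        for (start, end) in get_intervals(path):
            # Half-open intervals overlap unless one ends at or
            # before the point where the other begins.
            if new_start < end and start < new_end:
                return True
        return False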
Speed
-----

- The first approach was quadratic.  Adding four hours of data:

      $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
      real    24m31.093s
      $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
      real    43m44.528s
      $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
      real    93m29.713s
      $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
      real    166m53.007s

- Disabling pytables indexing didn't help:

      real    31m21.492s
      real    52m51.963s
      real    102m8.151s
      real    176m12.469s

- Server RAM usage is constant.

- The speed problems were due to IntervalSet speed: parsing intervals
  from the database and adding the new one each time.

  - The first optimization is to cache the result of
    `nilmdb:_get_intervals`, which gives the best speedup.
  - Also switched to internally using bxInterval from the bx-python
    package.  Speed of `tests/test_interval:TestIntervalSpeed` is
    pretty decent and seems to be growing logarithmically now.  About
    85 μs per insertion when inserting 131k entries.
  - Storing the interval data in SQL might be better, with a scheme
    like: http://www.logarithmic.net/pfh/blog/01235197474

- The next slowdown target is nilmdb.layout.Parser.parse().

  - Rewrote the parsers using Cython and sscanf.
  - Stats (rev 10831), with _add_interval disabled:

        layout.pyx.Parser.parse:128          6303 sec, 262k calls
        layout.pyx.parse:63                 13913 sec, 5.1g calls
        numpy:records.py.fromrecords:569     7410 sec, 262k calls

  - Probably OK for now.

- After all updates, it now takes about 8.5 minutes to insert an hour
  of data, constant after adding 171 hours (4.9 billion data points).

- Data set size: 98 GB = 20 bytes per data point.  6 uint16 data
  values + 1 uint32 timestamp = 16 bytes per point, so compression
  must be off -- will retry with compression forced on.


IntervalSet speed
-----------------

- The initial implementation was pretty slow, even with binary search
  in a sorted list.

- Replaced with bxInterval; an insertion now takes about O(log n)
  time.

  - TestIntervalSpeed with range(17,18) and profiling:

    - 85 μs each
    - 131072 calls to `__iadd__`
    - 131072 to bx.insert_interval
    - 131072 to bx.insert:395
    - 2355835 to bx.insert:106 (18x as many?)

- Tried blist too; it was worse than bxInterval.

- There might be algorithmic improvements to be made in Interval.py,
  like in `__and__`.

- Replaced again with rbtree.  Seems decent.  Numbers are time per
  insert for 2**17 insertions, followed by total wall time and RAM
  usage for running "make test" with `test_rbtree` and
  `test_interval` with range(5,20):

  - old values with bxInterval: 20.2 μs, total 20 s, 177 MB RAM
  - rbtree, plain Python: 97 μs, total 105 s, 846 MB RAM
  - rbtree converted to Cython: 26 μs, total 29 s, 320 MB RAM
  - rbtree and interval converted to Cython: 8.4 μs, total 12 s,
    134 MB RAM


Layouts
-------

The current/old design has specific layouts: RawData, PrepData,
RawNotchedData.  Let's get rid of this entirely and switch to simpler
data types that are just collections and counts of a single type.
We'll still use strings to describe them, with the format:

    type_count

where type is "uint16", "float32", or "float64", and count is an
integer.  nilmdb.layout.named() will parse these strings into the
appropriate handlers (a rough sketch of that parsing follows the
compatibility list below).  For compatibility:

    "RawData"        == "uint16_6"
    "RawNotchedData" == "uint16_9"
    "PrepData"       == "float32_8"
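To illustrate the kind of parsing nilmdb.layout.named() needs to do,
here is a minimal sketch; parse_layout and its tuple return value are
illustrative assumptions, not the actual function, which returns
layout handler objects:

    # Illustrative only: split a "type_count" layout string into its
    # components.  The real nilmdb.layout.named() builds handler
    # objects rather than returning a tuple.
    _TYPES = ("uint16", "float32", "float64")

    def parse_layout(name):
        """Split a layout name like "uint16_6" into ("uint16", 6)."""
        try:
            dtype, count = name.rsplit("_", 1)
            count = int(count)
        except ValueError:
            raise ValueError("bad layout name: " + name)
        if dtype not in _TYPES or count < 1:
            raise ValueError("bad layout name: " + name)
        return dtype, count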
BulkData design
---------------

BulkData is a custom bulk data storage system that was written to
replace PyTables.  The general structure is a `data` subdirectory in
the main NilmDB directory.  Within `data`, paths are created for each
created stream.  These locations are called tables.  For example,
tables might be located at:

    nilmdb/data/newton/raw/
    nilmdb/data/newton/prep/
    nilmdb/data/cottage/raw/

Each table contains:

- An unchanging `_format` file (Python pickle format) that describes
  parameters of how the data is broken up, like files per directory,
  rows per file, and the binary data format.

- Hex named subdirectories (`"%04x"`, although more than 65536 can
  exist).

- Hex named files within those subdirectories, like:

      /nilmdb/data/newton/raw/000b/010a

  The data format of these files is raw binary, interpreted by the
  Python `struct` module according to the format string in the
  `_format` file.

- Same as above, with a `.removed` suffix: an optional file (Python
  pickle format) containing a list of row numbers that have been
  logically removed from the file.  If this range covers the entire
  file, the entire file will be removed.

- Note that the `bulkdata.nrows` variable is calculated once in
  `BulkData.__init__()`, and only ever incremented during use.  Thus,
  even if all data is removed, `nrows` can remain high.  However, if
  the server is restarted, the newly calculated `nrows` may be lower
  than in a previous run due to deleted data.  To be specific, this
  sequence of events:

  - insert data
  - remove all data
  - insert data

  will result in different row numbers in the database, and
  differently numbered files on the filesystem, than the sequence:

  - insert data
  - remove all data
  - restart server
  - insert data

  This is okay!  Everything should remain consistent in both
  `BulkData` and `NilmDB`.  Not attempting to readjust `nrows` during
  deletion makes the code quite a bit simpler.

- Similarly, data files are never truncated shorter.  Removing data
  from the end of a file will not shorten it; the file will only be
  deleted once it has been completely filled and all of its data has
  subsequently been removed.


Rocket
------

The original design had the nilmdb.nilmdb thread (through bulkdata)
convert from the on-disk layout to a Python list, and then the
nilmdb.server thread (from cherrypy) convert that to ASCII.

For at least the extraction side of things, it's easy to pass
bulkdata a layout name instead, and have it convert directly from
on-disk to ASCII format, because this conversion can then be shoved
into a C module.  This module, which provides a means for converting
directly from on-disk format to ASCII or Python lists, is the
"rocket" interface.  Python is still used to manage the files and
figure out where the data should go; rocket just puts binary data
directly in or out of those files at specified locations.
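To make that division of labor concrete, here is a rough pure-Python
sketch of the extraction path that rocket (and pyrocket before it)
speeds up.  The function name and the "<I6H" format string (one
uint32 timestamp plus six uint16 values, matching a uint16_6 layout)
are illustrative assumptions, not the real rocket API; the actual
format string comes from the table's `_format` file:

    # Hypothetical pyrocket-style extraction: convert raw binary rows
    # from a table file into ASCII lines.
    import struct

    def extract_ascii(f, start_row, nrows, fmt="<I6H"):
        """Read nrows rows starting at start_row from file object f
        and yield one ASCII line per row."""
        rowsize = struct.calcsize(fmt)
        f.seek(start_row * rowsize)
        for _ in xrange(nrows):
            raw = f.read(rowsize)
            if len(raw) < rowsize:
                break  # hit the end of the file
            fields = struct.unpack(fmt, raw)
            yield " ".join(str(x) for x in fields)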
Before rocket, testing speed with uint16_6 data, with an end-to-end
test (extracting data with nilmtool):

- insert: 65 klines/sec
- extract: 120 klines/sec

After switching to the rocket design, but using the Python version
(pyrocket):

- insert: 57 klines/sec
- extract: 120 klines/sec

After switching to a C extension module (rocket.c):

- insert: 74 klines/sec through insert.py; 99.6 klines/sec through
  nilmtool
- extract: 335 klines/sec

After client block updates (described below):

- insert: 180 klines/sec through nilmtool (pre-timestamped)
- extract: 390 klines/sec through nilmtool

Using "insert --timestamp" or "extract --bare" cuts the speed in
half.


Blocks versus lines
-------------------

We generally want to avoid parsing the bulk of the data as lines if
possible, and transfer things in bigger blocks at once.

Current places where we use lines:

- All data returned by `client.stream_extract`, since it comes from
  `httpclient.get_gen`, which iterates over lines.  Not sure if this
  should be changed, because a `nilmtool extract` is just about the
  same speed as `curl -q .../stream/extract`!

- `client.StreamInserter.insert_iter` and
  `client.StreamInserter.insert_line`, which should probably get
  replaced with block versions.  There's no real need to keep
  updating the timestamp every time we get a new line of data.

  - Finished.  There is now just a single insert() that takes a
    string of any length and does very little processing until it's
    time to send it to the server (see the sketch below).
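A minimal sketch of what such a block-buffering insert() might look
like, reusing the 1 MB / 10 second heuristics from the "Inserting
streams" section; the class and method names here are illustrative,
not the actual client.StreamInserter implementation:

    # Illustrative block buffer; send_block() stands in for whatever
    # issues the actual HTTP insert request.
    import time

    class BlockBuffer(object):
        max_bytes = 1024 * 1024   # send once the buffer exceeds 1 MB
        max_seconds = 10          # ... or once 10 seconds have passed

        def __init__(self, send_block):
            self.send_block = send_block
            self.buf = []
            self.buflen = 0
            self.last_send = time.time()

        def insert(self, data):
            """Accept an arbitrary-length string of ASCII data and
            send it to the server in large blocks."""
            self.buf.append(data)
            self.buflen += len(data)
            if (self.buflen >= self.max_bytes or
                    time.time() - self.last_send >= self.max_seconds):
                self.flush()

        def flush(self):
            if self.buf:
                self.send_block("".join(self.buf))
                self.buf = []
                self.buflen = 0
            self.last_send = time.time()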