Structure
nilmdb.nilmdb is the NILM database interface. A nilmdb.BulkData interface stores data in flat files, and a SQL database tracks metadata and ranges.
Access to the nilmdb must be single-threaded. This is handled with the nilmdb.serializer class. In the future this could probably be turned into a per-path serialization.
nilmdb.server is an HTTP server that provides an interface to talk, through the serialization layer, to the nilmdb object.
nilmdb.client is an HTTP client that connects to this.
Sqlite performance
Committing a transaction in the default sync mode (PRAGMA synchronous=FULL) takes about 125msec. sqlite3 will commit transactions at 3 times:
- explicit con.commit()
- between a series of DML commands and non-DML commands, e.g. after a series of INSERT, SELECT, but before a CREATE TABLE or PRAGMA.
- at the end of an explicit transaction, e.g. "with self.con as con:"
To speed up testing, or if this transaction speed becomes an issue, the sync=False option to NilmDB will set PRAGMA synchronous=OFF.
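A minimal sketch of how such a sync option could map onto the pragma, using only the standard sqlite3 module. The names open_db and current_sync_mode and the ranges table are assumptions made for illustration; they are not taken from the NilmDB source.

```python
import os
import sqlite3
import tempfile


def open_db(path, sync=True):
    """Open a SQLite database, optionally trading durability for speed.

    Hypothetical helper: sync=False selects PRAGMA synchronous=OFF.
    """
    con = sqlite3.connect(path)
    # The synchronous level cannot be changed inside a transaction, so set
    # it right after connecting, before any DML has run.
    con.execute("PRAGMA synchronous=" + ("FULL" if sync else "OFF"))
    return con


def current_sync_mode(con):
    """Return the numeric synchronous level: 0 = OFF, 1 = NORMAL, 2 = FULL."""
    return con.execute("PRAGMA synchronous").fetchone()[0]


path = os.path.join(tempfile.mkdtemp(), "example.db")
con = open_db(path, sync=False)
with con:  # the transaction is committed when this block exits
    con.execute("CREATE TABLE ranges (start_ts INTEGER, end_ts INTEGER)")
    con.execute("INSERT INTO ranges VALUES (?, ?)", (0, 100))
fast_mode = current_sync_mode(con)
con.close()
```

The pragma is per-connection, so it has to be reapplied each time the database is opened.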
Inserting streams
We need to send the contents of "data" as POST. Do we need chunked transfer?
- Don't know the size in advance, so we would need to use chunked if we send the entire thing in one request.
- But we shouldn't send one chunk per line, so we need to buffer some anyway; why not just make new requests?
- Consider the infinite-streaming case, we might want to send it immediately? Not really -- server still should do explicit inserts of fixed-size chunks.
- Even chunked encoding needs the size of each chunk beforehand, so everything still gets buffered. Just a tradeoff of buffer size.
Before timestamps are added:
- Raw data is about 440 kB/s (9 channels)
- Prep data is about 12.5 kB/s (1 phase)
- How do we know how much data to send?
- Remember that we can only do maybe 8-50 transactions per second on the sqlite database. So if one block of inserted data is one transaction, we'd need the raw case to be around 64kB per request, ideally more.
- Maybe use a range, based on how long it's taking to read the data
- If no more data, send it
- If data > 1 MB, send it
- If more than 10 seconds have elapsed, send it
- Should those numbers come from the server?
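The three "send it" rules above can be sketched as a generator that groups input lines into blocks, one block per request. This is an illustration only: batch_lines and its parameters are hypothetical names, the 1 MB and 10 second defaults are the numbers floated above, and size is counted in characters, which equals bytes for ASCII data.

```python
import time


def batch_lines(lines, max_bytes=1024 * 1024, max_seconds=10.0,
                clock=time.monotonic):
    """Yield blocks of joined lines, flushing on size, age, or end of input."""
    buf = []
    size = 0
    started = clock()
    for line in lines:
        buf.append(line)
        size += len(line)
        # Flush when the block is big enough or has been open too long.
        if size >= max_bytes or clock() - started >= max_seconds:
            yield "".join(buf)
            buf, size, started = [], 0, clock()
    # No more data: send whatever is left.
    if buf:
        yield "".join(buf)


# Ten 2-character lines with a 6-character limit give blocks of 3, 3, 3, 1 lines.
blocks = list(batch_lines(["a\n"] * 10, max_bytes=6))
```

Note that the age check only runs when a new line arrives, so a stalled input would not be flushed by this sketch; a real client reading from a live source would need a timeout on the read as well.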
Converting from ASCII to PyTables:
- For each row getting added, we need to set attributes on a PyTables Row object and call table.append(). This means that there isn't a particularly efficient way of converting from ascii.
- Could create a function like nilmdb.layout.Layout("foo").fillRow(asciiline)
- But this means we're doing parsing on the serialized side
- Let's keep parsing on the threaded server side so we can detect errors better, and not block the serialized nilmdb for a slow parsing process.
- Client sends ASCII data
- Server converts this ASCII data to a list of values
- Maybe:
    # threaded side creates this object
    parser = nilmdb.layout.Parser("layout_name")
    # threaded side parses and fills it with data
    parser.parse(textdata)
    # serialized side pulls out rows
    for n in xrange(parser.nrows):
        parser.fill_row(rowinstance, n)
        table.append()
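As a rough, self-contained illustration of the threaded-side parsing step, the sketch below converts ASCII lines into lists of numbers and reports malformed input before anything reaches the serialized side. SketchParser and ParseError are hypothetical names, not the nilmdb.layout API, and the line format (a timestamp followed by a fixed number of whitespace-separated values, with strictly increasing timestamps) is an assumption.

```python
class ParseError(ValueError):
    """Raised when the ASCII input is malformed."""


class SketchParser:
    def __init__(self, count):
        self.count = count          # number of data values per row
        self.rows = []              # parsed rows: [timestamp, v1, ..., vN]
        self.min_timestamp = None
        self.max_timestamp = None

    @property
    def nrows(self):
        return len(self.rows)

    def parse(self, textdata):
        for lineno, line in enumerate(textdata.splitlines(), 1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != self.count + 1:
                raise ParseError("line %d: expected %d fields, got %d"
                                 % (lineno, self.count + 1, len(fields)))
            try:
                row = [float(f) for f in fields]
            except ValueError:
                raise ParseError("line %d: non-numeric field" % lineno)
            ts = row[0]
            # Assumption: timestamps must be strictly increasing.
            if self.max_timestamp is not None and ts <= self.max_timestamp:
                raise ParseError("line %d: timestamp not increasing" % lineno)
            if self.min_timestamp is None:
                self.min_timestamp = ts
            self.max_timestamp = ts
            self.rows.append(row)


parser = SketchParser(count=2)
parser.parse("1.0 10 20\n2.0 11 21\n")
```

Tracking the first and last timestamp during parsing means the serialized side can get the extents of the new data without scanning the rows again.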
Inserting streams, inside nilmdb
- First check that the new stream doesn't overlap.
- Get minimum timestamp, maximum timestamp from data parser.
- (extend parser to verify monotonicity and track extents)
- Get all intervals for this stream in the database
- See if new interval overlaps any existing ones
- If so, bail
- Question: should we cache intervals inside NilmDB?
- Assume database is fast for now, and always rebuild from DB.
- Can add a caching layer later if we need to.
- stream_get_ranges(path) -> return IntervalSet?
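The overlap check described in this list could look roughly like the following. It is a plain linear scan over intervals already fetched from the database, and it assumes half-open intervals [start, end), so that back-to-back inserts do not count as overlapping. The function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def check_insert(existing, new_start, new_end):
    """Return the new interval, or raise if it overlaps any existing one.

    existing is an iterable of (start, end) pairs for one stream.
    """
    for start, end in existing:
        if overlaps(start, end, new_start, new_end):
            raise ValueError("new interval [%r, %r) overlaps [%r, %r)"
                             % (new_start, new_end, start, end))
    return (new_start, new_end)


existing = [(0, 100), (200, 300)]
accepted = check_insert(existing, 100, 200)   # fits exactly in the gap
```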
Speed
- First approach was quadratic. Adding four hours of data:
    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
    real 24m31.093s
    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
    real 43m44.528s
    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
    real 93m29.713s
    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
    real 166m53.007s
- Disabling pytables indexing didn't help:
    real 31m21.492s
    real 52m51.963s
    real 102m8.151s
    real 176m12.469s
- Server RAM usage is constant.
- Speed problems were due to IntervalSet speed: parsing the intervals from the database and adding the new one on every insert.
- First optimization is to cache result of nilmdb:_get_intervals, which gives the best speedup.
- Also switched to internally using bxInterval from bx-python package. Speed of tests/test_interval:TestIntervalSpeed is pretty decent and seems to be growing logarithmically now. About 85 μs per insertion for inserting 131k entries.
- Storing the interval data in SQL might be better, with a scheme like: http://www.logarithmic.net/pfh/blog/01235197474
- Next slowdown target is nilmdb.layout.Parser.parse().
- Rewrote parsers using cython and sscanf
- Stats (rev 10831), with _add_interval disabled
    layout.pyx.Parser.parse:128        6303 sec, 262k calls
    layout.pyx.parse:63               13913 sec, 5.1g calls
    numpy:records.py.fromrecords:569   7410 sec, 262k calls
- Probably OK for now.
- After all updates, now takes about 8.5 minutes to insert an hour of data, constant after adding 171 hours (4.9 billion data points)
- Data set size: 98 gigs = 20 bytes per data point. 6 uint16 data + 1 uint32 timestamp = 16 bytes per point. So compression must be off -- will retry with compression forced on.
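The size arithmetic in the last item can be checked with the struct module. This assumes "98 gigs" means 98 * 10^9 bytes and that a row is packed with no padding; both are assumptions, not facts from the source.

```python
import struct

# One row: a uint32 timestamp followed by six uint16 values, no padding.
row_size = struct.calcsize("<I6H")

points = 4900000000            # 4.9 billion data points
dataset_bytes = 98 * 10**9     # assumption: decimal gigabytes

observed_per_point = dataset_bytes / points
overhead_per_point = observed_per_point - row_size

print(row_size, observed_per_point, overhead_per_point)
```

On disk each point takes more space than its packed size, which is consistent with the conclusion that compression was not in effect.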
IntervalSet speed
- Initial implementation was pretty slow, even with binary search in sorted list
- Replaced with bxInterval; now takes about log n time for an insertion
  - TestIntervalSpeed with range(17,18) and profiling
    - 85 μs each
    - 131072 calls to __iadd__
    - 131072 to bx.insert_interval
    - 131072 to bx.insert:395
    - 2355835 to bx.insert:106 (18x as many?)
- Tried blist too, worse than bxinterval.
- Might be algorithmic improvements to be made in Interval.py, like in __and__
- Replaced again with rbtree. Seems decent. Numbers are time per insert for 2**17 insertions, followed by total wall time and RAM usage for running "make test" with test_rbtree and test_interval with range(5,20):
  - old values with bxinterval: 20.2 μs, total 20 s, 177 MB RAM
  - rbtree, plain python: 97 μs, total 105 s, 846 MB RAM
  - rbtree converted to cython: 26 μs, total 29 s, 320 MB RAM
  - rbtree and interval converted to cython: 8.4 μs, total 12 s, 134 MB RAM
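For comparison with the tree-based versions, here is a sketch of the "binary search in sorted list" approach using the standard bisect module. It is not the project's Interval.py; SortedIntervals is a hypothetical class, and intervals are assumed to be half-open and non-overlapping. The binary search finds the position in O(log n), but list.insert still shifts the elements after it, so each insertion is O(n) in the worst case. Whether that was the actual bottleneck in the original implementation is not stated in these notes.

```python
import bisect


class SortedIntervals:
    """Non-overlapping half-open intervals kept sorted by start time."""

    def __init__(self):
        self._starts = []
        self._ends = []

    def __len__(self):
        return len(self._starts)

    def add(self, start, end):
        if start >= end:
            raise ValueError("empty or reversed interval")
        # Index of the first stored interval whose start is >= the new start.
        i = bisect.bisect_left(self._starts, start)
        # The interval before it must end at or before the new start.
        if i > 0 and self._ends[i - 1] > start:
            raise ValueError("overlaps previous interval")
        # The interval at position i must begin at or after the new end.
        if i < len(self._starts) and self._starts[i] < end:
            raise ValueError("overlaps next interval")
        # O(n): every element after position i is shifted by one.
        self._starts.insert(i, start)
        self._ends.insert(i, end)

    def intervals(self):
        return list(zip(self._starts, self._ends))


s = SortedIntervals()
s.add(0, 10)
s.add(20, 30)
s.add(10, 20)
```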
Layouts
Current/old design has specific layouts: RawData, PrepData, RawNotchedData. Let's get rid of this entirely and switch to simpler data types that are just collections and counts of a single type. We'll still use strings to describe them, with format:
type_count
where type is "uint16", "float32", or "float64", and count is an integer.
nilmdb.layout.named() will parse these strings into the appropriate handlers. For compatibility:
"RawData" == "uint16_6"
"RawNotchedData" == "uint16_9"
"PrepData" == "float32_8"
BulkData design
BulkData is a custom bulk data storage system that was written to
replace PyTables. The general structure is a data subdirectory in
the main NilmDB directory. Within data, paths are created for each
created stream. These locations are called tables. For example,
tables might be located at
nilmdb/data/newton/raw/
nilmdb/data/newton/prep/
nilmdb/data/cottage/raw/
Each table contains:
- An unchanging _format file (Python pickle format) that describes parameters of how the data is broken up, like files per directory, rows per file, and the binary data format
- Hex named subdirectories ("%04x", although more than 65536 can exist)
- Hex named files within those subdirectories, like:
    /nilmdb/data/newton/raw/000b/010a
  The data format of these files is raw binary, interpreted by the Python struct module according to the format string in the _format file.
- Same as above, with .removed suffix, is an optional file (Python pickle format) containing a list of row numbers that have been logically removed from the file. If this range covers the entire file, the entire file will be removed.
- Note that the bulkdata.nrows variable is calculated once in BulkData.__init__(), and only ever incremented during use. Thus, even if all data is removed, nrows can remain high. However, if the server is restarted, the newly calculated nrows may be lower than in a previous run due to deleted data. To be specific, this sequence of events:
  - insert data
  - remove all data
  - insert data
  will result in having different row numbers in the database, and differently numbered files on the filesystem, than the sequence:
  - insert data
  - remove all data
  - restart server
  - insert data
  This is okay! Everything should remain consistent both in the BulkData and NilmDB. Not attempting to readjust nrows during deletion makes the code quite a bit simpler.
- Similarly, data files are never truncated shorter. Removing data from the end of the file will not shorten it; it will only be deleted when it has been fully filled and all of the data has been subsequently removed.
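One plausible way a row number could map onto this directory layout is sketched below. The exact scheme is not spelled out in these notes, so locate_row, its parameters, and the assumption that rows fill files sequentially and files fill subdirectories sequentially are all illustrative.

```python
import struct


def locate_row(row, rows_per_file, files_per_dir, fmt):
    """Return (subdir, filename, byte_offset) for a row number.

    fmt is a struct format string describing one row, as would be stored
    in the _format file.
    """
    file_index, row_in_file = divmod(row, rows_per_file)
    dir_index, file_in_dir = divmod(file_index, files_per_dir)
    # "%04x" pads to four hex digits but does not truncate larger values,
    # which is why more than 65536 subdirectories can exist.
    subdir = "%04x" % dir_index
    filename = "%04x" % file_in_dir
    offset = row_in_file * struct.calcsize(fmt)
    return subdir, filename, offset


# Tiny parameters so the arithmetic is easy to follow by hand.
location = locate_row(11, rows_per_file=4, files_per_dir=2, fmt="<I6H")
```

With 4 rows per file and 2 files per directory, row 11 is the last row of the third file, which is the first file of the second subdirectory.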