nilmdb

7.3 KiB

Raw Blame History

Structure

nilmdb.nilmdb is the NILM database interface. It tracks a PyTables database holds actual rows of data, and a SQL database tracks metadata and ranges.

Access to the nilmdb must be single-threaded. This is handled with the nilmdb.serializer class.

nilmdb.server is a HTTP server that provides an interface to talk, thorugh the serialization layer, to the nilmdb object.

nilmdb.client is a HTTP client that connects to this.

Sqlite performance

Committing a transaction in the default sync mode (PRAGMA synchronous=FULL) takes about 125msec. sqlite3 will commit transactions at 3 times:

1: explicit con.commit()

2: between a series of DML commands and non-DML commands, e.g. after a series of INSERT, SELECT, but before a CREATE TABLE or PRAGMA.

3: at the end of an explicit transaction, e.g. “with self.con as con:”

To speed up testing, or if this transaction speed becomes an issue, the sync=False option to NilmDB will set PRAGMA synchronous=OFF.

Inserting streams

We need to send the contents of “data” as POST. Do we need chunked transfer?

Don’t know the size in advance, so we would need to use chunked if we send the entire thing in one request.
But we shouldn’t send one chunk per line, so we need to buffer some anyway; why not just make new requests?
Consider the infinite-streaming case, we might want to send it immediately? Not really -- server still should do explicit inserts of fixed-size chunks.
Even chunked encoding needs the size of each chunk beforehand, so everything still gets buffered. Just a tradeoff of buffer size.

Before timestamps are added:

Raw data is about 440 kB/s (9 channels)
Prep data is about 12.5 kB/s (1 phase)
How do we know how much data to send?
- Remember that we can only do maybe 8-50 transactions per second on the sqlite database. So if one block of inserted data is one transaction, we’d need the raw case to be around 64kB per request, ideally more.
- Maybe use a range, based on how long it’s taking to read the data
  - If no more data, send it
  - If data > 1 MB, send it
  - If more than 10 seconds have elapsed, send it
- Should those numbers come from the server?

Converting from ASCII to PyTables:

For each row getting added, we need to set attributes on a PyTables Row object and call table.append(). This means that there isn’t a particularly efficient way of converting from ascii.
Could create a function like nilmdb.layout.Layout(“foo”.fillRow(asciiline)
- But this means we’re doing parsing on the serialized side
- Let’s keep parsing on the threaded server side so we can detect errors better, and not block the serialized nilmdb for a slow parsing process.
Client sends ASCII data

Server converts this ACSII data to a list of values

Maybe:

# threaded side creates this object
parser = nilmdb.layout.Parser("layout_name")
# threaded side parses and fills it with data
parser.parse(textdata)
# serialized side pulls out rows
for n in xrange(parser.nrows):
    parser.fill_row(rowinstance, n)
    table.append()

Inserting streams, inside nilmdb

First check that the new stream doesn’t overlap.
- Get minimum timestamp, maximum timestamp from data parser.
  - (extend parser to verify monotonicity and track extents)
- Get all intervals for this stream in the database
- See if new interval overlaps any existing ones
  - If so, bail
- Question: should we cache intervals inside NilmDB?
  - Assume database is fast for now, and always rebuild fom DB.
  - Can add a caching layer later if we need to.
- stream_get_ranges(path) -> return IntervalSet?

Speed

First approach was quadratic. Adding four hours of data:

$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw real 24m31.093s $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw real 43m44.528s $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw real 93m29.713s $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw real 166m53.007s
Disabling pytables indexing didn’t help:

real 31m21.492s real 52m51.963s real 102m8.151s real 176m12.469s
Server RAM usage is constant.
Speed problems were due to IntervalSet speed, of parsing intervals from the database and adding the new one each time.
- First optimization is to cache result of nilmdb:_get_intervals, which gives the best speedup.
- Also switched to internally using bxInterval from bx-python package. Speed of tests/test_interval:TestIntervalSpeed is pretty decent and seems to be growing logarithmically now. About 85μs per insertion for inserting 131k entries.
- Storing the interval data in SQL might be better, with a scheme like: http://www.logarithmic.net/pfh/blog/01235197474
Next slowdown target is nilmdb.layout.Parser.parse().
- Rewrote parsers using cython and sscanf
- Stats (rev 10831), with _add_interval disabled layout.pyx.Parser.parse:128 6303 sec, 262k calls layout.pyx.parse:63 13913 sec, 5.1g calls numpy:records.py.fromrecords:569 7410 sec, 262k calls
- Probably OK for now.
After all updates, now takes about 8.5 minutes to insert an hour of data, constant after adding 171 hours (4.9 billion data points)
Data set size: 98 gigs = 20 bytes per data point. 6 uint16 data + 1 uint32 timestamp = 16 bytes per point So compression must be off -- will retry with compression forced on.

IntervalSet speed

Initial implementation was pretty slow, even with binary search in sorted list
Replaced with bxInterval; now takes about log n time for an insertion
- TestIntervalSpeed with range(17,18) and profiling
  - 85 μs each
  - 131072 calls to __iadd__
  - 131072 to bx.insert_interval
  - 131072 to bx.insert:395
  - 2355835 to bx.insert:106 (18x as many?)
Tried blist too, worse than bxinterval.
Might be algorithmic improvements to be made in Interval.py, like in __and__
Replaced again with rbtree. Seems decent. Numbers are time per insert for 2**17 insertions, followed by total wall time and RAM usage for running “make test” with test_rbtree and test_interval with range(5,20):
- old values with bxinterval: 20.2 μS, total 20 s, 177 MB RAM
- rbtree, plain python: 97 μS, total 105 s, 846 MB RAM
- rbtree converted to cython: 26 μS, total 29 s, 320 MB RAM
- rbtree and interval converted to cython: 8.4 μS, total 12 s, 134 MB RAM

Layouts

Current/old design has specific layouts: RawData, PrepData, RawNotchedData. Let’s get rid of this entirely and switch to simpler data types that are just collections and counts of a single type. We’ll still use strings to describe them, with format:

type_count

where type is “uint16”, “float32”, or “float64”, and count is an integer.

nilmdb.layout.named() will parse these strings into the appropriate handlers. For compatibility:

"RawData" == "uint16_6"
"RawNotchedData" == "uint16_9"
"PrepData" == "float32_8"

7.3 KiB Raw Blame History