nilmdb/design.md

Structure

nilmdb.nilmdb is the NILM database interface. A PyTables database holds the actual rows of data, and a SQL database tracks metadata and interval ranges.

Access to the nilmdb must be single-threaded. This is handled with the nilmdb.serializer class.

nilmdb.server is an HTTP server that provides an interface to talk, through the serialization layer, to the nilmdb object.

nilmdb.client is an HTTP client that connects to this server.

Sqlite performance

Committing a transaction in the default sync mode (PRAGMA synchronous=FULL) takes about 125 ms. sqlite3 commits transactions at three times:

1: explicit con.commit()

2: between a series of DML commands and non-DML commands, e.g. after a series of INSERT or SELECT statements, but before a CREATE TABLE or PRAGMA.

3: at the end of an explicit transaction, e.g. "with self.con as con:"

To speed up testing, or if this transaction speed becomes an issue, the sync=False option to NilmDB will set PRAGMA synchronous=OFF.
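
A minimal sketch of what this looks like with a plain sqlite3 connection (the sync option above is NilmDB's; this standalone helper is just illustrative):

    import sqlite3

    def open_db(path, sync=True):
        con = sqlite3.connect(path)
        # synchronous=OFF skips the fsync() at commit time, trading
        # durability for much faster transactions (fine for tests).
        con.execute("PRAGMA synchronous=" + ("FULL" if sync else "OFF"))
        return con

    con = open_db("test.db", sync=False)
    con.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    with con:                     # case 3: an explicit transaction
        con.execute("INSERT INTO t VALUES (1)")
    # the commit happens here, as the "with" block exits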

Inserting streams

We need to send the contents of "data" as a POST request. Do we need chunked transfer?

  • Don't know the size in advance, so we would need to use chunked if we send the entire thing in one request.
  • But we shouldn't send one chunk per line, so we need to buffer some anyway; why not just make new requests?
  • Consider the infinite-streaming case: we might want to send data immediately? Not really -- the server should still do explicit inserts of fixed-size chunks.
  • Even chunked encoding needs the size of each chunk beforehand, so everything still gets buffered. Just a tradeoff of buffer size.

Before timestamps are added:

  • Raw data is about 440 kB/s (9 channels)

  • Prep data is about 12.5 kB/s (1 phase)

  • How do we know how much data to send?

    • Remember that we can only do maybe 8-50 transactions per second on the sqlite database. So if one block of inserted data is one transaction, we'd need the raw case to be around 64kB per request, ideally more.
    • Maybe use a range, based on how long it's taking to read the data (see the sketch after this list):
      • If no more data, send it
      • If data > 1 MB, send it
      • If more than 10 seconds have elapsed, send it
    • Should those numbers come from the server?
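
A hypothetical sketch of that client-side send policy (the function and parameter names here are illustrative, not NilmDB API):

    import time

    def send_in_chunks(reader, send, max_bytes=1024*1024, max_secs=10.0):
        buf = []
        buflen = 0
        start = time.time()
        for line in reader:
            buf.append(line)
            buflen += len(line)
            # Send if we've buffered over 1 MB, or more than
            # 10 seconds have elapsed since the last request.
            if buflen >= max_bytes or time.time() - start >= max_secs:
                send("".join(buf))
                buf, buflen, start = [], 0, time.time()
        # No more data: send whatever remains.
        if buf:
            send("".join(buf))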

Converting from ASCII to PyTables:

  • For each row getting added, we need to set attributes on a PyTables Row object and call its append() method. This means that there isn't a particularly efficient way of converting from ASCII.
  • Could create a function like nilmdb.layout.Layout("foo").fillRow(asciiline)
    • But this means we're doing parsing on the serialized side
    • Let's keep parsing on the threaded server side so we can detect errors better, and not block the serialized nilmdb for a slow parsing process.
  • Client sends ASCII data
  • Server converts this ASCII data to a list of values
    • Maybe:

      # threaded side creates this object
      parser = nilmdb.layout.Parser("layout_name")
      # threaded side parses and fills it with data
      parser.parse(textdata)
      # serialized side pulls out rows and appends them to the
      # PyTables table (rowinstance is table.row; Row.append()
      # queues the filled row for insertion into the table)
      for n in xrange(parser.nrows):
          parser.fill_row(rowinstance, n)
          rowinstance.append()
      table.flush()
      

Inserting streams, inside nilmdb

  • First check that the new stream doesn't overlap (see the sketch after this list).
    • Get minimum timestamp, maximum timestamp from data parser.
      • (extend parser to verify monotonicity and track extents)
    • Get all intervals for this stream in the database
    • See if new interval overlaps any existing ones
      • If so, bail
    • Question: should we cache intervals inside NilmDB?
      • Assume database is fast for now, and always rebuild from DB.
      • Can add a caching layer later if we need to.
    • stream_get_ranges(path) -> return IntervalSet?
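
A hypothetical sketch of that overlap check (stream_get_ranges and OverlapError are illustrative names, not necessarily the real API):

    class OverlapError(Exception):
        pass

    def check_no_overlap(db, path, parser):
        # Parser was extended to verify monotonicity and track extents.
        new_start, new_end = parser.min_timestamp, parser.max_timestamp
        # Always rebuild intervals from the database for now; add a
        # caching layer later if this turns out to be slow.
        # stream_get_ranges is assumed to yield (start, end) pairs.
        for (start, end) in db.stream_get_ranges(path):
            # Half-open intervals [start, end) overlap iff each one
            # starts before the other ends.
            if new_start < end and start < new_end:
                raise OverlapError("new data overlaps existing interval")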

Speed

  • First approach was quadratic. Adding four hours of data:

    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
    real    24m31.093s
    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
    real    43m44.528s
    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
    real    93m29.713s
    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
    real    166m53.007s

  • Disabling pytables indexing didn't help:

    real    31m21.492s
    real    52m51.963s
    real    102m8.151s
    real    176m12.469s

  • Server RAM usage is constant.

  • Speed problems were due to IntervalSet: parsing all of the intervals from the database, then adding the new one, on every insert.

    • First optimization is to cache the result of nilmdb:_get_intervals, which gives the best speedup (a sketch follows this list).

    • Also switched to internally using bxInterval from the bx-python package. Speed of tests/test_interval:TestIntervalSpeed is pretty decent and seems to be growing logarithmically now: about 85 μs per insertion when inserting 131k entries.

    • Storing the interval data in SQL might be better, with a scheme like: http://www.logarithmic.net/pfh/blog/01235197474

  • Next slowdown target is nilmdb.layout.Parser.parse().

    • Rewrote parsers using cython and sscanf
    • Stats (rev 10831), with _add_interval disabled:

        layout.pyx.Parser.parse:128         6303 sec, 262k calls
        layout.pyx.parse:63                13913 sec, 5.1g calls
        numpy:records.py.fromrecords:569    7410 sec, 262k calls
    • Probably OK for now.
  • After all updates, it now takes about 8.5 minutes to insert an hour of data, and this stays constant after adding 171 hours (4.9 billion data points).

  • Data set size: 98 gigs = 20 bytes per data point. 6 uint16 data + 1 uint32 timestamp = 16 bytes per point, so compression must be off -- will retry with compression forced on.
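
Going back to the first optimization above, a hypothetical sketch of the _get_intervals cache (the builder method name is illustrative):

    def _get_intervals(self, stream_id):
        # Return the cached IntervalSet if we've already built it;
        # otherwise parse the intervals out of the database once
        # and remember the result.
        try:
            return self._interval_cache[stream_id]
        except KeyError:
            iset = self._build_interval_set_from_db(stream_id)
            self._interval_cache[stream_id] = iset
            return iset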

IntervalSet speed

  • Initial implementation was pretty slow, even with binary search in sorted list

  • Replaced with bxInterval; now takes about log n time for an insertion

    • TestIntervalSpeed with range(17,18) and profiling
      • 85 μs each
      • 131072 calls to __iadd__
      • 131072 to bx.insert_interval
      • 131072 to bx.insert:395
      • 2355835 to bx.insert:106 (18x as many?)
  • Tried blist too, worse than bxinterval.

  • Might be algorithmic improvements to be made in Interval.py, like in __and__

  • Replaced again with rbtree. Seems decent. Numbers are time per insert for 2**17 insertions, followed by total wall time and RAM usage for running "make test" with test_rbtree and test_interval with range(5,20):

    • old values with bxinterval: 20.2 μs, total 20 s, 177 MB RAM
    • rbtree, plain python: 97 μs, total 105 s, 846 MB RAM
    • rbtree converted to cython: 26 μs, total 29 s, 320 MB RAM
    • rbtree and interval converted to cython: 8.4 μs, total 12 s, 134 MB RAM

Layouts

Current/old design has specific layouts: RawData, PrepData, RawNotchedData. Let's get rid of these entirely and switch to simpler data types that are just collections and counts of a single type. We'll still use strings to describe them, with the format:

type_count

where type is "uint16", "float32", or "float64", and count is an integer.

nilmdb.layout.named() will parse these strings into the appropriate handlers. For compatibility:

"RawData" == "uint16_6"
"RawNotchedData" == "uint16_9"
"PrepData" == "float32_8"