nilmdb/design.md

Structure
---------
nilmdb.nilmdb is the NILM database interface.  It tracks a PyTables
database holds actual rows of data, and a SQL database tracks metadata
and ranges.

Access to the nilmdb must be single-threaded.  This is handled with
the nilmdb.serializer class.

nilmdb.server is a HTTP server that provides an interface to talk,
thorugh the serialization layer, to the nilmdb object.

nilmdb.client is a HTTP client that connects to this.

Sqlite performance
------------------

Committing a transaction in the default sync mode (PRAGMA synchronous=FULL)
takes about 125msec.  sqlite3 will commit transactions at 3 times:

  1: explicit con.commit()

  2: between a series of DML commands and non-DML commands, e.g.
   after a series of INSERT, SELECT, but before a CREATE TABLE or
   PRAGMA.

3: at the end of an explicit transaction, e.g. "with self.con as con:"

To speed up testing, or if this transaction speed becomes an issue,
the sync=False option to NilmDB will set PRAGMA synchronous=OFF.


Inserting streams
-----------------

We need to send the contents of "data" as POST.  Do we need chunked
transfer?

- Don't know the size in advance, so we would need to use chunked if
  we send the entire thing in one request.
- But we shouldn't send one chunk per line, so we need to buffer some
  anyway; why not just make new requests?
- Consider the infinite-streaming case, we might want to send it
  immediately?  Not really -- server still should do explicit inserts
  of fixed-size chunks.
- Even chunked encoding needs the size of each chunk beforehand, so
  everything still gets buffered.  Just a tradeoff of buffer size.

Before timestamps are added:
- Raw data is about 440 kB/s    (9 channels)
- Prep data is about 12.5 kB/s  (1 phase)
- How do we know how much data to send?

  - Remember that we can only do maybe 8-50 transactions per second on
    the sqlite database.  So if one block of inserted data is one
    transaction, we'd need the raw case to be around 64kB per request,
	ideally more.
  - Maybe use a range, based on how long it's taking to read the data
    - If no more data, send it
    - If data > 1 MB, send it
	- If more than 10 seconds have elapsed, send it
  - Should those numbers come from the server?

Converting from ASCII to PyTables:
- For each row getting added, we need to set attributes on a PyTables
  Row object and call table.append().  This means that there isn't a
  particularly efficient way of converting from ascii.
- Could create a function like nilmdb.layout.Layout("foo".fillRow(asciiline)
  - But this means we're doing parsing on the serialized side
  - Let's keep parsing on the threaded server side so we can detect
    errors better, and not block the serialized nilmdb for a slow
    parsing process.
- Client sends ASCII data
- Server converts this ACSII data to a list of values
  - Maybe:

		# threaded side creates this object
		parser = nilmdb.layout.Parser("layout_name")
		# threaded side parses and fills it with data
		parser.parse(textdata)
		# serialized side pulls out rows
		for n in xrange(parser.nrows):
		    parser.fill_row(rowinstance, n)
			table.append()


Inserting streams, inside nilmdb
--------------------------------

- First check that the new stream doesn't overlap.
  - Get minimum timestamp, maximum timestamp from data parser.
    - (extend parser to verify monotonicity and track extents)
  - Get all intervals for this stream in the database
  - See if new interval overlaps any existing ones
    - If so, bail
  - Question: should we cache intervals inside NilmDB?
    - Assume database is fast for now, and always rebuild fom DB.
	- Can add a caching layer later if we need to.
  - `stream_get_ranges(path)` -> return IntervalSet?

Speed
-----

- First approach was quadratic.  Adding four hours of data:

    $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
	real    24m31.093s
	$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
	real    43m44.528s
	$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
	real    93m29.713s
	$ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
	real    166m53.007s

- Disabling pytables indexing didn't help:

    real    31m21.492s
	real    52m51.963s
	real    102m8.151s
	real    176m12.469s

- Server RAM usage is constant.

- Speed problems were due to IntervalSet speed, of parsing intervals
  from the database and adding the new one each time.

  - First optimization is to cache result of `nilmdb:_get_intervals`,
    which gives the best speedup.

  - Also switched to internally using bxInterval from bx-python package.
    Speed of `tests/test_interval:TestIntervalSpeed` is pretty decent
	and seems to be growing logarithmically now.  About 85μs per insertion
	for inserting 131k entries.

  - Storing the interval data in SQL might be better, with a scheme like:
    http://www.logarithmic.net/pfh/blog/01235197474

- Next slowdown target is nilmdb.layout.Parser.parse().
  - Rewrote parsers using cython and sscanf
  - Stats (rev 10831), with _add_interval disabled
     layout.pyx.Parser.parse:128        6303 sec, 262k calls
	 layout.pyx.parse:63               13913 sec, 5.1g calls
	 numpy:records.py.fromrecords:569   7410 sec, 262k calls
  - Probably OK for now.

IntervalSet speed
-----------------
- Initial implementation was pretty slow, even with binary search in
  sorted list

- Replaced with bxInterval; now takes about log n time for an insertion
  - TestIntervalSpeed with range(17,18) and profiling
    - 85 μs each
    - 131072 calls to `__iadd__`
    - 131072 to bx.insert_interval
    - 131072 to bx.insert:395
    - 2355835 to bx.insert:106  (18x as many?)

- Tried blist too, worse than bxinterval.

- Might be algorithmic improvements to be made in Interval.py,
  like in `__and__`


Layouts
-------
Current/old design has specific layouts: RawData, PrepData, RawNotchedData.
Let's get rid of this entirely and switch to simpler data types that are
just collections and counts of a single type.  We'll still use strings
to describe them, with format:

    type_count

where type is "uint16", "float32", or "float64", and count is an integer.

nilmdb.layout.named() will parse these strings into the appropriate
handlers.  For compatibility:

    "RawData" == "uint16_6"
    "RawNotchedData" == "uint16_9"
    "PrepData" == "float32_8"