2012-03-25 15:12:49 -04:00

Structure
---------
nilmdb.nilmdb is the NILM database interface.  A PyTables database
holds the actual rows of data, and a SQL database tracks metadata
and ranges.

Access to the nilmdb must be single-threaded.  This is handled with
the nilmdb.serializer class.

nilmdb.server is an HTTP server that provides an interface to talk,
through the serialization layer, to the nilmdb object.

nilmdb.client is an HTTP client that connects to this server.

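A minimal sketch of how such a serialization layer could work, assuming
a single worker thread that owns the database object and a queue of
pending calls.  The `Serializer` class and its `call`/`close` methods
are illustrative names, not the actual nilmdb.serializer API.

```python
import queue
import threading


class Serializer:
    """Run every method call on `target` in one dedicated worker thread."""

    def __init__(self, target):
        self._target = target
        self._calls = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._calls.get()
            if item is None:            # shutdown marker queued by close()
                break
            name, args, kwargs, reply = item
            try:
                result = getattr(self._target, name)(*args, **kwargs)
                reply.put((True, result))
            except Exception as exc:    # hand the error back to the caller
                reply.put((False, exc))

    def call(self, name, *args, **kwargs):
        """Queue a call and block until the worker thread has run it."""
        reply = queue.Queue(maxsize=1)
        self._calls.put((name, args, kwargs, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def close(self):
        """Stop the worker thread after all queued calls have run."""
        self._calls.put(None)
        self._thread.join()
```

Any number of HTTP handler threads can then use `serializer.call("method", ...)`;
the wrapped object only ever sees calls from the worker thread, one at a time.
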
SQLite performance
------------------

Committing a transaction in the default sync mode (PRAGMA synchronous=FULL)
takes about 125 ms.  sqlite3 will commit transactions at three points:

1: on an explicit con.commit()

2: between a series of DML commands and non-DML commands, e.g.
   after a series of INSERT, SELECT, but before a CREATE TABLE or
   PRAGMA.

3: at the end of an explicit transaction, e.g. "with self.con as con:"

To speed up testing, or if this transaction speed becomes an issue,
the sync=False option to NilmDB will set PRAGMA synchronous=OFF.


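A small sketch of the sync option using Python's sqlite3 module directly.
The `open_db` and `insert_ranges` helpers and the `ranges` table are
made up for illustration; only the `sync` flag mirrors the option
described above.

```python
import os
import sqlite3
import tempfile


def open_db(path, sync=True):
    """Open a SQLite database; sync=False trades durability for commit speed."""
    con = sqlite3.connect(path)
    # Set right after connecting, before any transaction has been opened.
    con.execute("PRAGMA synchronous=FULL" if sync else "PRAGMA synchronous=OFF")
    con.execute("CREATE TABLE IF NOT EXISTS ranges (start_ts REAL, end_ts REAL)")
    return con


def insert_ranges(con, ranges):
    # Using the connection as a context manager commits once on success
    # and rolls back if an exception escapes the block.
    with con:
        con.executemany("INSERT INTO ranges VALUES (?, ?)", ranges)


tmpdir = tempfile.mkdtemp()
con = open_db(os.path.join(tmpdir, "meta.db"), sync=False)
insert_ranges(con, [(0.0, 1.0), (2.0, 3.0)])
sync_mode = con.execute("PRAGMA synchronous").fetchone()[0]   # 0 means OFF
row_count = con.execute("SELECT COUNT(*) FROM ranges").fetchone()[0]
con.close()
```

Batching many rows into one `with con:` block keeps the number of
commits, and therefore the number of slow syncs, low.
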
Inserting streams
-----------------

We need to send the contents of "data" as POST.  Do we need chunked
transfer?

- We don't know the size in advance, so we would need to use chunked
  transfer if we send the entire thing in one request.
- But we shouldn't send one chunk per line, so we need to buffer some
  anyway; why not just make new requests?
- Consider the infinite-streaming case: we might want to send it
  immediately?  Not really -- the server should still do explicit
  inserts of fixed-size chunks.
- Even chunked encoding needs the size of each chunk beforehand, so
  everything still gets buffered.  It's just a tradeoff of buffer size.

Before timestamps are added:
- Raw data is about 440 kB/s (9 channels)
- Prep data is about 12.5 kB/s (1 phase)
- How do we know how much data to send?

  - Remember that we can only do maybe 8-50 transactions per second on
    the sqlite database.  So if one block of inserted data is one
    transaction, we'd need the raw case to be around 64 kB per request,
    ideally more.
  - Maybe use a range, based on how long it's taking to read the data
    - If no more data, send it
    - If data > 1 MB, send it
    - If more than 10 seconds have elapsed, send it
  - Should those numbers come from the server?

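The three "send it" rules above could be sketched as a generator that
groups input lines into request-sized blocks.  `chunk_lines` and its
parameters are hypothetical names; the 1 MB and 10 second limits are
the ones proposed above.  The last line works out the request size for
the slow end of the transaction estimate: 440 kB/s at 8 transactions
per second is 55 kB per request, in line with the ~64 kB figure.

```python
import time

MAX_BYTES = 1000000     # "If data > 1 MB, send it"
MAX_SECONDS = 10.0      # "If more than 10 seconds have elapsed, send it"


def chunk_lines(lines, max_bytes=MAX_BYTES, max_seconds=MAX_SECONDS,
                clock=time.monotonic):
    """Group an iterable of ASCII text lines into blocks, one per request."""
    buf, size, started = [], 0, clock()
    for line in lines:
        buf.append(line)
        size += len(line)       # characters == bytes for ASCII data
        if size > max_bytes or clock() - started > max_seconds:
            yield "".join(buf)
            buf, size, started = [], 0, clock()
    if buf:
        # No more data: send whatever is left
        yield "".join(buf)


# Smallest request that keeps up with the raw stream at 8 transactions/s
min_request_kb = 440 / 8
```

Note that the elapsed-time check only runs when a new line arrives, so
a reader that blocks for a long time would delay the send; a real
client might need a timeout on the read itself.
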
Converting from ASCII to PyTables:
- For each row getting added, we need to set the fields on a PyTables
  Row object and call row.append().  This means that there isn't a
  particularly efficient way of converting from ASCII.
- Could create a function like nilmdb.layout.Layout("foo").fillRow(asciiline)
  - But this means we're doing parsing on the serialized side
  - Let's keep parsing on the threaded server side so we can detect
    errors better, and not block the serialized nilmdb for a slow
    parsing process.
- Client sends ASCII data
- Server converts this ASCII data to a list of values
- Maybe:

        # threaded side creates this object
        parser = nilmdb.layout.Parser("layout_name")
        # threaded side parses and fills it with data
        parser.parse(textdata)
        # serialized side pulls out rows
        rowinstance = table.row
        for n in range(parser.nrows):
            parser.fill_row(rowinstance, n)
            # PyTables queues the row through the Row object
            rowinstance.append()

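A rough sketch of what the threaded-side parser could look like,
assuming each ASCII line is a timestamp followed by a fixed number of
values, all parsed as floats.  This `Parser` is a stand-in written for
illustration, not the real nilmdb.layout.Parser; the `count` argument,
`ParseError`, and the index-based `fill_row` are assumptions.

```python
class ParseError(ValueError):
    """Raised with a line number so the client gets a useful error."""


class Parser:
    def __init__(self, count):
        self.count = count      # values per row, not counting the timestamp
        self.rows = []

    @property
    def nrows(self):
        return len(self.rows)

    def parse(self, textdata):
        """Parse whitespace-separated lines; runs on the threaded side."""
        for lineno, line in enumerate(textdata.splitlines(), start=1):
            fields = line.split()
            if not fields:
                continue        # skip blank lines
            if len(fields) != self.count + 1:
                raise ParseError("line %d: expected %d fields, got %d"
                                 % (lineno, self.count + 1, len(fields)))
            try:
                self.rows.append([float(f) for f in fields])
            except ValueError as exc:
                raise ParseError("line %d: %s" % (lineno, exc))

    def fill_row(self, row, n):
        """Copy parsed row n into `row`, keyed by column index."""
        for column, value in enumerate(self.rows[n]):
            row[column] = value
```

All the slow and error-prone work happens in `parse()`; the serialized
side only copies already-validated numbers in `fill_row()`.
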
Inserting streams, inside nilmdb
--------------------------------

- First check that the new stream doesn't overlap.
  - Get minimum timestamp, maximum timestamp from data parser.
    - (extend parser to verify monotonicity and track extents)
  - Get all intervals for this stream in the database
  - See if new interval overlaps any existing ones
    - If so, bail
  - Question: should we cache intervals inside NilmDB?
    - Assume database is fast for now, and always rebuild from DB.
    - Can add a caching layer later if we need to.
  - `stream_get_ranges(path)` -> return IntervalSet?

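The checks listed above might be sketched like this.  The function
names are hypothetical, intervals are assumed to be closed
`(start, end)` pairs, and "monotonic" is taken to mean strictly
increasing timestamps; the notes do not pin down either choice.

```python
def extents(timestamps):
    """Return (min, max) of the timestamps, verifying they strictly increase."""
    first = last = None
    for ts in timestamps:
        if last is not None and ts <= last:
            raise ValueError("timestamp %r is not after %r" % (ts, last))
        if first is None:
            first = ts
        last = ts
    if first is None:
        raise ValueError("no data")
    return first, last


def overlaps(start, end, intervals):
    """True if closed interval [start, end] intersects any (s, e) pair."""
    return any(start <= e and s <= end for s, e in intervals)


def check_insert(timestamps, existing):
    """Bail with ValueError if the new data overlaps an existing interval."""
    start, end = extents(timestamps)
    if overlaps(start, end, existing):
        raise ValueError("new data [%r, %r] overlaps an existing interval"
                         % (start, end))
    return start, end
```

The linear scan in `overlaps` matches the "always rebuild from DB"
assumption; it is the part a caching layer or a tree structure would
replace later.
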
Speed
-----

- First approach was quadratic.  Adding four hours of data:

        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
        real    24m31.093s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
        real    43m44.528s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
        real    93m29.713s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
        real    166m53.007s

- Disabling pytables indexing didn't help:

        real    31m21.492s
        real    52m51.963s
        real    102m8.151s
        real    176m12.469s

- Server RAM usage is constant.

- Speed problems were due to IntervalSet speed: parsing intervals
  from the database and adding the new one each time.

  - First optimization is to cache the result of `nilmdb:_get_intervals`,
    which gives the best speedup.

  - Also switched to internally using bxInterval from the bx-python
    package.  Speed of `tests/test_interval:TestIntervalSpeed` is pretty
    decent and seems to be growing logarithmically now.  About 85 μs per
    insertion for inserting 131k entries.

  - Storing the interval data in SQL might be better, with a scheme like:
    http://www.logarithmic.net/pfh/blog/01235197474

- Next slowdown target is nilmdb.layout.Parser.parse().
  - Rewrote parsers using cython and sscanf
  - Stats (rev 10831), with _add_interval disabled

        layout.pyx.Parser.parse:128         6303 sec, 262k calls
        layout.pyx.parse:63                13913 sec, 5.1g calls
        numpy:records.py.fromrecords:569    7410 sec, 262k calls

  - Probably OK for now.

IntervalSet speed
-----------------
- Initial implementation was pretty slow, even with binary search in
  a sorted list

- Replaced with bxInterval; now takes about log n time for an insertion
  - TestIntervalSpeed with range(17,18) and profiling
    - 85 μs each
    - 131072 calls to `__iadd__`
    - 131072 to bx.insert_interval
    - 131072 to bx.insert:395
    - 2355835 to bx.insert:106 (18x as many?)

- Tried blist too, worse than bxinterval.

- Might be algorithmic improvements to be made in Interval.py,
  like in `__and__`

- Replaced again with rbtree.  Seems decent.  Numbers are time per
  insert for 2**17 insertions, followed by total wall time and RAM
  usage for running "make test" with `test_rbtree` and `test_interval`
  with range(5,20):
  - old values with bxinterval:
    20.2 μs, total 20 s, 177 MB RAM
  - rbtree, plain python:
    97 μs, total 105 s, 846 MB RAM
  - rbtree converted to cython:
    26 μs, total 29 s, 320 MB RAM
  - rbtree and interval converted to cython:
    8.4 μs, total 12 s, 134 MB RAM

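For comparison, a sorted-list approach with binary search might look
like the sketch below.  This is not the original Interval.py code; the
`SortedIntervals` class is hypothetical and assumes closed,
non-overlapping intervals.  It shows one likely reason such a design
stays slow: `bisect` finds the position in O(log n), but `list.insert`
still shifts O(n) elements on every insertion.

```python
import bisect


class SortedIntervals:
    """Non-overlapping closed intervals [start, end], sorted by start."""

    def __init__(self):
        self._starts = []
        self._ends = []

    def add(self, start, end):
        if start > end:
            raise ValueError("start must not be after end")
        # O(log n): index of the first stored interval with start >= new start
        i = bisect.bisect_left(self._starts, start)
        if i > 0 and self._ends[i - 1] >= start:
            raise ValueError("overlaps the previous interval")
        if i < len(self._starts) and self._starts[i] <= end:
            raise ValueError("overlaps the next interval")
        # O(n): every later element is shifted one slot to the right
        self._starts.insert(i, start)
        self._ends.insert(i, end)

    def __len__(self):
        return len(self._starts)

    def __iter__(self):
        return iter(zip(self._starts, self._ends))
```

A balanced tree such as the rbtree mentioned above avoids the O(n)
shift, which is consistent with the per-insert times listed.
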
Layouts
-------
Current/old design has specific layouts: RawData, PrepData, RawNotchedData.
Let's get rid of this entirely and switch to simpler data types that are
just collections and counts of a single type.  We'll still use strings
to describe them, with format:

    type_count

where type is "uint16", "float32", or "float64", and count is an integer.

nilmdb.layout.named() will parse these strings into the appropriate
handlers.  For compatibility:

    "RawData" == "uint16_6"
    "RawNotchedData" == "uint16_9"
    "PrepData" == "float32_8"

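A sketch of how such layout strings could be parsed.  `parse_layout`
is a hypothetical helper, not nilmdb.layout.named() itself, and the
requirement that count be at least 1 is an assumption; the type names
and compatibility aliases are the ones listed above.

```python
import re

# Old layout names and their new-style equivalents
COMPAT = {
    "RawData": "uint16_6",
    "RawNotchedData": "uint16_9",
    "PrepData": "float32_8",
}

_LAYOUT_RE = re.compile(r"(uint16|float32|float64)_([0-9]+)")


def parse_layout(name):
    """Return (type, count) for a "type_count" string or an old layout name."""
    name = COMPAT.get(name, name)
    match = _LAYOUT_RE.fullmatch(name)
    if match is None:
        raise ValueError("unknown layout: %r" % (name,))
    count = int(match.group(2))
    if count < 1:
        raise ValueError("layout count must be at least 1: %r" % (name,))
    return match.group(1), count
```

For example, `parse_layout("PrepData")` and `parse_layout("float32_8")`
both give `("float32", 8)`, so old stream definitions keep working.
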