Structure
nilmdb.nilmdb is the NILM database interface. It tracks a PyTables database that holds the actual rows of data, and a SQL database that tracks metadata and ranges.
Access to the nilmdb must be single-threaded. This is handled with the nilmdb.serializer class.
nilmdb.server is an HTTP server that provides an interface to talk, through the serialization layer, to the nilmdb object.
nilmdb.client is an HTTP client that connects to this server.
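The single-threaded access rule can be illustrated with a small sketch. This is not the nilmdb.serializer implementation; `SerializedProxy` and its `call` method are hypothetical names, and the only assumption is that requests from many server threads are funneled through a queue to one worker thread that owns the database object.

```python
import queue
import threading


class SerializedProxy:
    """Run every method call on `target` in one dedicated worker thread.

    Hypothetical illustration, not the real nilmdb.serializer class.
    """

    def __init__(self, target):
        self._target = target
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            name, args, kwargs, reply = item
            try:
                result = getattr(self._target, name)(*args, **kwargs)
                reply.put((True, result))
            except Exception as exc:  # hand the error back to the caller
                reply.put((False, exc))

    def call(self, name, *args, **kwargs):
        """Queue a call and block until the worker thread has run it."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((name, args, kwargs, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def close(self):
        self._requests.put(None)
        self._worker.join()
```

Callers on any thread use `proxy.call("method_name", ...)`; the wrapped object only ever sees the worker thread.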
Sqlite performance
Committing a transaction in the default sync mode (PRAGMA synchronous=FULL) takes about 125 ms. sqlite3 will commit transactions at three points:
1: explicit con.commit()
2: between a series of DML commands and non-DML commands, e.g. after a series of INSERT, SELECT, but before a CREATE TABLE or PRAGMA.
3: at the end of an explicit transaction, e.g. "with self.con as con:"
To speed up testing, or if this transaction speed becomes an issue, the sync=False option to NilmDB will set PRAGMA synchronous=OFF.
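A minimal sketch of how a sync flag might map onto the pragma, using only the standard sqlite3 module. The helper names `open_db` and `insert_range`, and the `ranges` table with its columns, are assumptions for illustration and are not taken from NilmDB.

```python
import sqlite3


def open_db(path, sync=True):
    """Open a database and select the durability mode (hypothetical helper)."""
    con = sqlite3.connect(path)
    # Set the pragma right after connecting, before any transaction is open.
    con.execute("PRAGMA synchronous=%s" % ("FULL" if sync else "OFF"))
    return con


def insert_range(con, start, end):
    """Insert one row inside an explicit transaction."""
    # "with con:" commits when the block succeeds and rolls back on an
    # exception, which corresponds to commit point 3 above.
    with con:
        con.execute(
            "INSERT INTO ranges (start_time, end_time) VALUES (?, ?)",
            (start, end),
        )
```

With synchronous=OFF, SQLite no longer waits for data to reach the disk on each commit, so a crash or power loss can corrupt the database; that tradeoff is usually acceptable for tests.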
Inserting streams
We need to send the contents of "data" as POST. Do we need chunked transfer?
- Don't know the size in advance, so we would need to use chunked if we send the entire thing in one request.
- But we shouldn't send one chunk per line, so we need to buffer some anyway; why not just make new requests?
- Consider the infinite-streaming case, we might want to send it immediately? Not really -- server still should do explicit inserts of fixed-size chunks.
- Even chunked encoding needs the size of each chunk beforehand, so everything still gets buffered. Just a tradeoff of buffer size.
Before timestamps are added:
- Raw data is about 440 kB/s (9 channels)
- Prep data is about 12.5 kB/s (1 phase)
- How do we know how much data to send?
- Remember that we can only do maybe 8-50 transactions per second on the sqlite database. So if one block of inserted data is one transaction, we'd need the raw case to be around 64kB per request, ideally more.
- Maybe use a range, based on how long it's taking to read the data
- If no more data, send it
- If data > 1 MB, send it
- If more than 10 seconds have elapsed, send it
- Should those numbers come from the server?
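The three "send it" rules above can be sketched as a generator that groups input lines into blocks, one block per request. The name `batch_lines` and its parameters are hypothetical; the defaults mirror the 1 MB and 10 second thresholds from the list, and sizes are counted in characters, which equals bytes for ASCII data.

```python
import time


def batch_lines(lines, max_bytes=1024 * 1024, max_seconds=10.0,
                clock=time.monotonic):
    """Group an iterable of text lines into blocks, one per request."""
    block, size, started = [], 0, clock()
    for line in lines:
        block.append(line)
        size += len(line)
        # Flush when the block is big enough or has been open too long.
        if size >= max_bytes or clock() - started >= max_seconds:
            yield "".join(block)
            block, size, started = [], 0, clock()
    if block:
        # No more data: send whatever is left.
        yield "".join(block)
```

The time limit is only checked when a new line arrives, so a source that stalls completely will not flush until it produces another line or ends. At 440 kB/s of raw data, a 1 MB threshold works out to a few requests per second, which is below the 8-50 transactions per second estimated above.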
Converting from ASCII to PyTables:
- For each row getting added, we need to set attributes on a PyTables Row object and call table.append(). This means that there isn't a particularly efficient way of converting from ascii.
- Could create a function like nilmdb.layout.Layout("foo").fillRow(asciiline)
- But this means we're doing parsing on the serialized side
- Let's keep parsing on the threaded server side so we can detect errors better, and not block the serialized nilmdb for a slow parsing process.
- Client sends ASCII data
- Server converts this ASCII data to a list of values
- Maybe:

        # threaded side creates this object
        parser = nilmdb.layout.Parser("layout_name")
        # threaded side parses and fills it with data
        parser.parse(textdata)
        # serialized side pulls out rows
        for n in xrange(parser.nrows):
            parser.fill_row(rowinstance, n)
            table.append()
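To make the split between the two sides concrete, here is a pure-Python sketch of an object with the same three operations (`parse`, `nrows`, `fill_row`). It is not the real nilmdb.layout.Parser. It assumes each ASCII line is a timestamp followed by a fixed number of values, the column names `timestamp` and `c0`, `c1`, ... are invented, and the extent tracking and ordering check are additions for illustration.

```python
class SketchParser:
    """Parse lines of 'timestamp v1 v2 ...' into rows (hypothetical format)."""

    def __init__(self, count):
        self.count = count  # values per row, not counting the timestamp
        self.rows = []
        self.min_timestamp = None
        self.max_timestamp = None

    @property
    def nrows(self):
        return len(self.rows)

    def parse(self, textdata):
        """Threaded side: convert text to numbers and validate it."""
        for lineno, line in enumerate(textdata.splitlines(), start=1):
            if not line.strip():
                continue
            fields = [float(f) for f in line.split()]  # ValueError on bad text
            if len(fields) != self.count + 1:
                raise ValueError("line %d: expected %d fields, got %d"
                                 % (lineno, self.count + 1, len(fields)))
            ts = fields[0]
            if self.max_timestamp is not None and ts <= self.max_timestamp:
                raise ValueError("line %d: timestamp is not increasing" % lineno)
            if self.min_timestamp is None:
                self.min_timestamp = ts
            self.max_timestamp = ts
            self.rows.append(fields)

    def fill_row(self, row, n):
        """Serialized side: copy parsed row n into a mapping-like row object."""
        row["timestamp"] = self.rows[n][0]
        for i, value in enumerate(self.rows[n][1:]):
            row["c%d" % i] = value
```

All the slow and error-prone work happens in `parse`, so the serialized side only copies already-validated numbers.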
Inserting streams, inside nilmdb
- First check that the new stream doesn't overlap.
- Get minimum timestamp, maximum timestamp from data parser.
- (extend parser to verify monotonicity and track extents)
- Get all intervals for this stream in the database
- See if new interval overlaps any existing ones
- If so, bail
- Question: should we cache intervals inside NilmDB?
- Assume database is fast for now, and always rebuild from DB.
- Can add a caching layer later if we need to.
- stream_get_ranges(path) -> return IntervalSet?
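The overlap check described above can be sketched with plain tuples. This is an illustration only: `overlaps` and `check_insert` are hypothetical names, and the sketch assumes intervals are half-open, `[start, end)`, so that two intervals that merely touch do not count as overlapping.

```python
def overlaps(start, end, existing):
    """Return True if [start, end) intersects any (s, e) pair in existing."""
    return any(start < e and s < end for s, e in existing)


def check_insert(start, end, existing):
    """Raise ValueError if the new extent is invalid or overlaps stored data."""
    if start >= end:
        raise ValueError("empty or reversed interval")
    if overlaps(start, end, existing):
        raise ValueError("new data overlaps an existing interval")
```

Here `start` and `end` would be the minimum and maximum timestamps reported by the parser, and `existing` the intervals read back from the database for that stream.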
Speed
- First approach was quadratic. Adding four hours of data:

        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
        real    24m31.093s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
        real    43m44.528s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
        real    93m29.713s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
        real    166m53.007s

- Disabling pytables indexing didn't help:

        real    31m21.492s
        real    52m51.963s
        real    102m8.151s
        real    176m12.469s
- Server RAM usage is constant.
- Speed problems were due to IntervalSet speed: parsing intervals from the database and adding the new one each time.
- First optimization is to cache the result of nilmdb:_get_intervals, which gives the best speedup.
- Also switched to internally using bxInterval from the bx-python package. Speed of tests/test_interval:TestIntervalSpeed is pretty decent and seems to be growing logarithmically now. About 85 μs per insertion for inserting 131k entries.
- Storing the interval data in SQL might be better, with a scheme like: http://www.logarithmic.net/pfh/blog/01235197474
- Next slowdown target is nilmdb.layout.Parser.parse().
- Rewrote parsers using cython and sscanf
- Stats (rev 10831), with _add_interval disabled

        layout.pyx.Parser.parse:128         6303 sec, 262k calls
        layout.pyx.parse:63                 13913 sec, 5.1g calls
        numpy:records.py.fromrecords:569    7410 sec, 262k calls
- Probably OK for now.
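The interval caching mentioned in the notes above might look roughly like the following. This is a sketch of the general idea, not the code behind nilmdb:_get_intervals; `IntervalCache` and the `load` callable (standing in for the database query) are hypothetical.

```python
class IntervalCache:
    """Keep each stream's intervals in memory after the first database read."""

    def __init__(self, load):
        # load: callable taking a stream path and returning an iterable of
        # (start, end) pairs, e.g. a wrapper around a SQL query.
        self._load = load
        self._cache = {}

    def get(self, path):
        """Return the interval list for path, loading it once if needed."""
        if path not in self._cache:
            self._cache[path] = list(self._load(path))
        return self._cache[path]

    def add(self, path, start, end):
        """Record a newly inserted interval in the cached copy.

        The caller is still responsible for writing it to the database.
        """
        self.get(path).append((start, end))
```

Each insert then touches the database once for the new interval instead of re-reading and re-parsing every stored interval, which is what removes the quadratic growth.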
IntervalSet speed
- Initial implementation was pretty slow, even with binary search in sorted list
- Replaced with bxInterval; now takes about log n time for an insertion
- TestIntervalSpeed with range(17,18) and profiling
- 85 μs each
- 131072 calls to __iadd__
- 131072 to bx.insert_interval
- 131072 to bx.insert:395
- 2355835 to bx.insert:106 (18x as many?)
- Tried blist too, worse than bxinterval.
- Might be algorithmic improvements to be made in Interval.py, like in __and__
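For comparison, a sorted-list approach with binary search can be sketched with the standard bisect module. This is not the original Interval.py code; `SortedIntervals` is a hypothetical class that stores disjoint half-open intervals and rejects overlaps.

```python
import bisect


class SortedIntervals:
    """Disjoint [start, end) intervals kept sorted by start time."""

    def __init__(self):
        self._starts = []
        self._ends = []

    def __len__(self):
        return len(self._starts)

    def add(self, start, end):
        if start >= end:
            raise ValueError("empty or reversed interval")
        # Binary search for the insertion point: O(log n).
        i = bisect.bisect_left(self._starts, start)
        # Stored intervals are disjoint and sorted, so only the two
        # neighbours of the insertion point can overlap the new one.
        if i > 0 and self._ends[i - 1] > start:
            raise ValueError("overlaps previous interval")
        if i < len(self._starts) and self._starts[i] < end:
            raise ValueError("overlaps next interval")
        # list.insert shifts every later element: O(n) per insertion.
        self._starts.insert(i, start)
        self._ends.insert(i, end)
```

The search itself is logarithmic, but inserting into a Python list is linear in the list length, so the total cost of n insertions can still grow quadratically. A tree-based structure avoids that shift, which fits the roughly logarithmic per-insertion time reported for bxInterval.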
Layouts
Current/old design has specific layouts: RawData, PrepData, RawNotchedData. Let's get rid of this entirely and switch to simpler data types that are just collections and counts of a single type. We'll still use strings to describe them, with format:
type_count
where type is "uint16", "float32", or "float64", and count is an integer.
nilmdb.layout.named() will parse these strings into the appropriate handlers. For compatibility:
"RawData" == "uint16_6"
"RawNotchedData" == "uint16_9"
"PrepData" == "float32_8"
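Parsing a type_count string could be sketched as below. This is not the nilmdb.layout.named() implementation, which returns handlers; the hypothetical `parse_layout_name` only splits a name into its type and count, using the three types and the three compatibility aliases listed above.

```python
import re

# Compatibility aliases from the old layout names.
COMPAT = {
    "RawData": "uint16_6",
    "RawNotchedData": "uint16_9",
    "PrepData": "float32_8",
}
TYPES = ("uint16", "float32", "float64")


def parse_layout_name(name):
    """Return (type, count) for a layout string such as 'float32_8'."""
    name = COMPAT.get(name, name)
    match = re.fullmatch(r"([a-z]+[0-9]+)_([0-9]+)", name)
    if match is None or match.group(1) not in TYPES:
        raise ValueError("unknown layout: %r" % name)
    count = int(match.group(2))
    if count < 1:
        raise ValueError("layout count must be at least 1: %r" % name)
    return match.group(1), count
```

For example, `parse_layout_name("RawData")` and `parse_layout_name("uint16_6")` both give `("uint16", 6)`.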