Structure
---------
nilmdb.nilmdb is the NILM database interface.  A nilmdb.BulkData
interface stores data in flat files, and a SQL database tracks
metadata and ranges.

Access to the nilmdb must be single-threaded.  This is handled with
the nilmdb.serializer class.  In the future this could probably
be turned into a per-path serialization.

nilmdb.server is an HTTP server that provides an interface to talk,
through the serialization layer, to the nilmdb object.

nilmdb.client is an HTTP client that connects to this.

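As a rough illustration of the single-threaded access rule, the sketch
below funnels calls from many threads through one worker thread.  It is
not the actual nilmdb.serializer code; `Serializer`, `call`, `close`, and
`Counter` are hypothetical names made up for this example.

```python
import queue
import threading


class Serializer:
    """Run every call against `target` on one dedicated worker thread."""

    def __init__(self, target):
        self._target = target
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Only this thread ever touches self._target.
        while True:
            item = self._requests.get()
            if item is None:
                break
            name, args, kwargs, reply = item
            try:
                reply.put((True, getattr(self._target, name)(*args, **kwargs)))
            except Exception as exc:
                reply.put((False, exc))

    def call(self, name, *args, **kwargs):
        """Queue a method call and block until the worker returns its result."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((name, args, kwargs, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def close(self):
        """Stop the worker after it has drained earlier requests."""
        self._requests.put(None)
        self._worker.join()


class Counter:
    """Stand-in for an object that must only be used from one thread."""

    def __init__(self):
        self.value = 0
        self.thread_ids = set()

    def increment(self):
        self.thread_ids.add(threading.get_ident())
        self.value += 1
        return self.value
```

Any number of server threads can then use `Serializer(Counter()).call("increment")`;
the wrapped object only ever sees calls from the single worker thread.
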
SQLite performance
------------------

Committing a transaction in the default sync mode (PRAGMA synchronous=FULL)
takes about 125 ms.  sqlite3 will commit transactions at three points:

1. explicit con.commit()

2. between a series of DML commands and non-DML commands, e.g.
   after a series of INSERT, SELECT, but before a CREATE TABLE or
   PRAGMA.

3. at the end of an explicit transaction, e.g. "with self.con as con:"

To speed up testing, or if this transaction speed becomes an issue,
the sync=False option to NilmDB will set PRAGMA synchronous=OFF.

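A minimal sketch of how a `sync` flag could map onto the pragma, using
only the standard `sqlite3` module.  `open_db` and `current_sync_mode`
are hypothetical helpers for illustration, not NilmDB functions.

```python
import sqlite3


def open_db(path, sync=True):
    """Open a SQLite database; sync=False trades durability for commit speed."""
    con = sqlite3.connect(path)
    # The pragma has to be issued outside a transaction; a freshly
    # opened connection has none in progress.
    con.execute("PRAGMA synchronous=" + ("FULL" if sync else "OFF"))
    return con


def current_sync_mode(con):
    """Return the numeric synchronous setting (0 = OFF, 2 = FULL)."""
    return con.execute("PRAGMA synchronous").fetchone()[0]
```

With synchronous=OFF, SQLite no longer waits for data to reach the disk,
so an operating system crash or power loss can corrupt the database.
That is usually acceptable for tests, less so for production data.
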
Inserting streams
-----------------

We need to send the contents of "data" as POST.  Do we need chunked
transfer?

- We don't know the size in advance, so we would need to use chunked
  if we send the entire thing in one request.
- But we shouldn't send one chunk per line, so we need to buffer some
  anyway; why not just make new requests?
- Consider the infinite-streaming case: might we want to send it
  immediately?  Not really -- the server should still do explicit
  inserts of fixed-size chunks.
- Even chunked encoding needs the size of each chunk beforehand, so
  everything still gets buffered.  Just a tradeoff of buffer size.

Before timestamps are added:

- Raw data is about 440 kB/s (9 channels)
- Prep data is about 12.5 kB/s (1 phase)
- How do we know how much data to send?

  - Remember that we can only do maybe 8-50 transactions per second on
    the SQLite database.  So if one block of inserted data is one
    transaction, we'd need the raw case to be around 64 kB per request,
    ideally more.
  - Maybe use a range, based on how long it's taking to read the data
    - If no more data, send it
    - If data > 1 MB, send it
    - If more than 10 seconds have elapsed, send it
  - Should those numbers come from the server?

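The three "send it" rules above can be sketched as a generator that
groups input lines into blocks, one block per request.  `batch_lines`
and its parameters are hypothetical; the 1 MB and 10 second defaults are
the tentative numbers from the list, not settled values.

```python
import time


def batch_lines(lines, max_bytes=1_000_000, max_seconds=10.0,
                clock=time.monotonic):
    """Group an iterable of text lines into blocks, one block per request.

    A block is emitted when it reaches max_bytes, when max_seconds have
    passed since its first line arrived, or when the input runs out.
    """
    block = []
    size = 0
    started = None
    for line in lines:
        if started is None:
            started = clock()
        block.append(line)
        # len() counts characters, which equals bytes for ASCII data.
        size += len(line)
        if size >= max_bytes or clock() - started >= max_seconds:
            yield "".join(block)
            block, size, started = [], 0, None
    if block:
        # No more data: send whatever is left.
        yield "".join(block)
```

One limitation of this sketch: the elapsed-time rule is only evaluated
when a new line arrives, so a source that stalls mid-block will not be
flushed until it produces another line or ends.
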
Converting from ASCII to PyTables:

- For each row getting added, we need to set attributes on a PyTables
  Row object and call table.append().  This means that there isn't a
  particularly efficient way of converting from ASCII.
- Could create a function like nilmdb.layout.Layout("foo").fillRow(asciiline)
  - But this means we're doing parsing on the serialized side
  - Let's keep parsing on the threaded server side so we can detect
    errors better, and not block the serialized nilmdb for a slow
    parsing process.
- Client sends ASCII data
- Server converts this ASCII data to a list of values
  - Maybe:

        # threaded side creates this object
        parser = nilmdb.layout.Parser("layout_name")
        # threaded side parses and fills it with data
        parser.parse(textdata)
        # serialized side pulls out rows
        for n in xrange(parser.nrows):
            parser.fill_row(rowinstance, n)
            table.append()

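To make the "parse on the threaded side" idea concrete, here is a small
stand-alone parser sketch.  It is not the real nilmdb.layout.Parser:
`SimpleParser`, `ParseError`, and the assumed input format (one row per
line, a timestamp followed by `count` whitespace-separated values) are
all assumptions for illustration.

```python
class ParseError(ValueError):
    """Raised when a line of ASCII data does not match the expected shape."""


class SimpleParser:
    """Parse whitespace-separated ASCII rows: a timestamp, then `count` values."""

    def __init__(self, count, convert=int):
        self.count = count
        self.convert = convert
        self.rows = []

    @property
    def nrows(self):
        return len(self.rows)

    def parse(self, textdata):
        """Parse every line up front so errors surface before any insert."""
        for lineno, line in enumerate(textdata.splitlines(), start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != self.count + 1:
                raise ParseError(
                    f"line {lineno}: expected {self.count + 1} fields, "
                    f"got {len(fields)}")
            try:
                timestamp = float(fields[0])
                values = [self.convert(f) for f in fields[1:]]
            except ValueError as exc:
                raise ParseError(f"line {lineno}: {exc}") from None
            self.rows.append((timestamp, *values))
```

Because all parsing and validation happens in `parse()`, the serialized
side would only need to walk `rows`, which keeps slow or failing input
from holding up the single database thread.
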
Inserting streams, inside nilmdb
--------------------------------

- First check that the new stream doesn't overlap.
  - Get minimum timestamp, maximum timestamp from data parser.
    - (extend parser to verify monotonicity and track extents)
  - Get all intervals for this stream in the database
  - See if new interval overlaps any existing ones
    - If so, bail
  - Question: should we cache intervals inside NilmDB?
    - Assume database is fast for now, and always rebuild from DB.
    - Can add a caching layer later if we need to.
  - `stream_get_ranges(path)` -> return IntervalSet?

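The overlap check can be sketched in a few lines.  This is an
illustration only: `extents` and `overlaps` are hypothetical helpers, and
two details are assumptions rather than anything stated in these notes,
namely that timestamps must strictly increase and that intervals are
half-open, `[start, end)`.

```python
def extents(timestamps):
    """Return (first, last) timestamp, insisting that they strictly increase."""
    it = iter(timestamps)
    try:
        first = last = next(it)
    except StopIteration:
        raise ValueError("no data") from None
    for t in it:
        if t <= last:
            raise ValueError(f"timestamp {t} is not after {last}")
        last = t
    return first, last


def overlaps(start, end, intervals):
    """True if the half-open interval [start, end) intersects any existing one."""
    return any(start < e and s < end for s, e in intervals)
```

A linear scan like this costs O(n) per insert, so the total cost grows
quadratically as intervals accumulate; an interval tree or a cached,
sorted structure avoids that.
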
Speed
-----

- First approach was quadratic.  Adding four hours of data:

        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
        real    24m31.093s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
        real    43m44.528s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
        real    93m29.713s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
        real    166m53.007s

- Disabling pytables indexing didn't help:

        real    31m21.492s
        real    52m51.963s
        real    102m8.151s
        real    176m12.469s

- Server RAM usage is constant.

- Speed problems were due to IntervalSet speed: parsing intervals
  from the database and adding the new one each time.

  - First optimization is to cache result of `nilmdb:_get_intervals`,
    which gives the best speedup.

  - Also switched to internally using bxInterval from bx-python package.
    Speed of `tests/test_interval:TestIntervalSpeed` is pretty decent
    and seems to be growing logarithmically now.  About 85 μs per insertion
    for inserting 131k entries.

  - Storing the interval data in SQL might be better, with a scheme like:
    http://www.logarithmic.net/pfh/blog/01235197474

- Next slowdown target is nilmdb.layout.Parser.parse().
  - Rewrote parsers using cython and sscanf
  - Stats (rev 10831), with _add_interval disabled

        layout.pyx.Parser.parse:128        6303 sec, 262k calls
        layout.pyx.parse:63               13913 sec, 5.1g calls
        numpy:records.py.fromrecords:569   7410 sec, 262k calls

  - Probably OK for now.

- After all updates, now takes about 8.5 minutes to insert an hour of
  data, constant after adding 171 hours (4.9 billion data points)

- Data set size: 98 gigs = 20 bytes per data point.
  6 uint16 data + 1 uint32 timestamp = 16 bytes per point
  So compression must be off -- will retry with compression forced on.

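The bytes-per-point arithmetic can be checked directly.  This snippet
assumes "98 gigs" means 98 × 10^9 bytes, and it uses the `struct` module
only as a convenient way to add up field sizes; it says nothing about
how the data was actually stored at the time.

```python
import struct

# 6 uint16 samples ("6H") plus one uint32 timestamp ("I"); "<" selects
# little-endian layout with no padding between fields.
row_bytes = struct.calcsize("<6HI")

# Figures quoted above: 98 GB on disk for 4.9 billion data points.
bytes_per_point = 98e9 / 4.9e9

# Storage used beyond the raw payload, per data point.
overhead_per_point = bytes_per_point - row_bytes
```

Here `row_bytes` is 16 and `bytes_per_point` comes out at about 20, so
each point costs roughly 4 bytes more than its raw payload, which is
consistent with compression being off.
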
IntervalSet speed
-----------------
- Initial implementation was pretty slow, even with binary search in
  sorted list

- Replaced with bxInterval; now takes about log n time for an insertion
  - TestIntervalSpeed with range(17,18) and profiling
    - 85 μs each
    - 131072 calls to `__iadd__`
    - 131072 to bx.insert_interval
    - 131072 to bx.insert:395
    - 2355835 to bx.insert:106 (18x as many?)

- Tried blist too, worse than bxinterval.

- Might be algorithmic improvements to be made in Interval.py,
  like in `__and__`

- Replaced again with rbtree.  Seems decent.  Numbers are time per
  insert for 2**17 insertions, followed by total wall time and RAM
  usage for running "make test" with `test_rbtree` and `test_interval`
  with range(5,20):
  - old values with bxinterval:
    20.2 μs, total 20 s, 177 MB RAM
  - rbtree, plain python:
    97 μs, total 105 s, 846 MB RAM
  - rbtree converted to cython:
    26 μs, total 29 s, 320 MB RAM
  - rbtree and interval converted to cython:
    8.4 μs, total 12 s, 134 MB RAM

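For reference, a "time per insert" figure like the ones above can be
measured with a small harness.  The sketch below is not the project's
`TestIntervalSpeed`; `insert_sorted` and `time_per_insert` are
hypothetical, and a plain sorted list stands in for the container under
test.  It also shows why binary search alone does not make a sorted list
fast: finding the slot is O(log n), but inserting into a Python list
shifts every later element.

```python
import bisect
import timeit


def insert_sorted(items, value):
    """Binary-search for the slot (O(log n)), then insert (O(n) element shift)."""
    bisect.insort(items, value)


def time_per_insert(n):
    """Average seconds per insert when building a sorted list of n items."""
    def build():
        items = []
        # Descending input makes every insert land at index 0, the worst
        # case for the element shift.
        for value in range(n, 0, -1):
            insert_sorted(items, value)
        return items
    return timeit.timeit(build, number=1) / n
```

Comparing `time_per_insert(2**10)` with `time_per_insert(2**17)` shows
how the per-insert cost grows with the list; a balanced tree keeps it
close to logarithmic.  Absolute numbers depend on the machine.
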
Layouts
-------
Current/old design has specific layouts: RawData, PrepData, RawNotchedData.
Let's get rid of this entirely and switch to simpler data types that are
just collections and counts of a single type.  We'll still use strings
to describe them, with format:

    type_count

where type is "uint16", "float32", or "float64", and count is an integer.

nilmdb.layout.named() will parse these strings into the appropriate
handlers.  For compatibility:

    "RawData" == "uint16_6"
    "RawNotchedData" == "uint16_9"
    "PrepData" == "float32_8"

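A minimal sketch of parsing the `type_count` strings, including the
compatibility names.  This is not the actual nilmdb.layout.named();
`parse_layout` and `LEGACY_LAYOUTS` are hypothetical names, and rejecting
a count of zero is an assumption.

```python
import re

# Old layout names and their equivalents, as listed above.
LEGACY_LAYOUTS = {
    "RawData": "uint16_6",
    "RawNotchedData": "uint16_9",
    "PrepData": "float32_8",
}

_LAYOUT_RE = re.compile(r"(uint16|float32|float64)_([0-9]+)")


def parse_layout(name):
    """Split a layout string such as "float32_8" into ("float32", 8)."""
    name = LEGACY_LAYOUTS.get(name, name)
    match = _LAYOUT_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unknown layout: {name!r}")
    count = int(match.group(2))
    if count < 1:
        # Assumption: a layout with zero columns is not meaningful.
        raise ValueError(f"layout count must be positive: {name!r}")
    return match.group(1), count
```

For example, `parse_layout("RawData")` and `parse_layout("uint16_6")`
both yield the same type and count.
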
BulkData design
---------------

BulkData is a custom bulk data storage system that was written to
replace PyTables.  The general structure is a `data` subdirectory in
the main NilmDB directory.  Within `data`, paths are created for each
created stream.  These locations are called tables.  For example,
tables might be located at

    nilmdb/data/newton/raw/
    nilmdb/data/newton/prep/
    nilmdb/data/cottage/raw/

Each table contains:

- An unchanging `_format` file (Python pickle format) that describes
  parameters of how the data is broken up, like files per directory,
  rows per file, and the binary data format

- Hex named subdirectories (`"%04x"`, although more than 65536 can exist)

- Hex named files within those subdirectories, like:

        /nilmdb/data/newton/raw/000b/010a

  The data format of these files is raw binary, interpreted by the
  Python `struct` module according to the format string in the
  `_format` file.

- An optional file with the same name as a data file plus a `.removed`
  suffix (Python pickle format), containing a list of row numbers that
  have been logically removed from the file.  If the removed rows cover
  the entire file, the entire file will be removed.

- Note that the `bulkdata.nrows` variable is calculated once in
  `BulkData.__init__()`, and only ever incremented during use.  Thus,
  even if all data is removed, `nrows` can remain high.  However, if
  the server is restarted, the newly calculated `nrows` may be lower
  than in a previous run due to deleted data.  To be specific, this
  sequence of events:

  - insert data
  - remove all data
  - insert data

  will result in having different row numbers in the database, and
  differently numbered files on the filesystem, than the sequence:

  - insert data
  - remove all data
  - restart server
  - insert data

  This is okay!  Everything should remain consistent both in the
  `BulkData` and `NilmDB`.  Not attempting to readjust `nrows` during
  deletion makes the code quite a bit simpler.

- Similarly, data files are never truncated shorter.  Removing data
  from the end of the file will not shorten it; it will only be
  deleted when it has been fully filled and all of the data has been
  subsequently removed.

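To illustrate how a row number could be turned into a file location
under this layout, here is a small sketch.  The mapping is an
assumption: these notes do not spell out how rows are assigned to
subdirectories and files, so `locate_row` and `read_row` are hypothetical
helpers that simply fill files and directories in order.

```python
import os
import struct


def locate_row(row, rows_per_file, files_per_dir):
    """Map a global row number to (subdir, filename, row offset within file)."""
    file_index, offset = divmod(row, rows_per_file)
    dir_index, file_in_dir = divmod(file_index, files_per_dir)
    # "%04x" pads to four hex digits but grows beyond that if needed.
    return "%04x" % dir_index, "%04x" % file_in_dir, offset


def read_row(table_dir, row, fmt, rows_per_file, files_per_dir):
    """Read one row from a table directory and unpack it with `struct`."""
    subdir, filename, offset = locate_row(row, rows_per_file, files_per_dir)
    size = struct.calcsize(fmt)
    with open(os.path.join(table_dir, subdir, filename), "rb") as f:
        f.seek(offset * size)
        return struct.unpack(fmt, f.read(size))
```

Because every row has the same size, locating a row is pure arithmetic
plus a single seek, and no index is needed.  Under this scheme a row
number keeps pointing at the same file and offset for as long as the
row exists, which fits with never renumbering rows after deletions.
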