
Update doc formatting, .gitignore

tags/nilmdb-0.1
Jim Paris 11 years ago
commit f0c2a64ae3
2 changed files with 62 additions and 58 deletions
  1. .gitignore (+2, -0)
  2. design.md (+60, -58)

.gitignore (+2, -0)

@@ -3,3 +3,5 @@ tests/*testdb/
.coverage
*.pyc
design.html
timeit*out


design.md (+60, -58)

@@ -19,13 +19,13 @@ Sqlite performance
Committing a transaction in the default sync mode (PRAGMA synchronous=FULL)
takes about 125msec. sqlite3 will commit transactions at 3 times:

1. explicit con.commit()

2. between a series of DML commands and non-DML commands, e.g.
   after a series of INSERT, SELECT, but before a CREATE TABLE or
   PRAGMA.

3. at the end of an explicit transaction, e.g. "with self.con as con:"

To speed up testing, or if this transaction speed becomes an issue,
the sync=False option to NilmDB will set PRAGMA synchronous=OFF.
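
As an illustration (not nilmdb's actual code), this is roughly what that option
amounts to with the stdlib `sqlite3` module; the `connect()` helper and its
`sync` argument are hypothetical names used only for this sketch:

    import sqlite3

    def connect(dbfile, sync=True):
        # Hypothetical wrapper: sync=False trades durability for commit speed.
        con = sqlite3.connect(dbfile)
        if not sync:
            con.execute("PRAGMA synchronous=OFF")  # commits no longer wait on fsync
        return con

    con = connect("test.db", sync=False)
    con.execute("CREATE TABLE IF NOT EXISTS data (t REAL, v REAL)")
    con.execute("INSERT INTO data VALUES (0.0, 1.0)")
    con.commit()   # fast; with synchronous=FULL this is the ~125 msec step
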
@@ -48,56 +48,58 @@ transfer?
everything still gets buffered. Just a tradeoff of buffer size.

Before timestamps are added:

- Raw data is about 440 kB/s (9 channels)
- Prep data is about 12.5 kB/s (1 phase)
- How do we know how much data to send?
  - Remember that we can only do maybe 8-50 transactions per second on
    the sqlite database. So if one block of inserted data is one
    transaction, we'd need the raw case to be around 64kB per request,
    ideally more.
  - Maybe use a range, based on how long it's taking to read the data
    (see the sketch after this list):
    - If no more data, send it
    - If data > 1 MB, send it
    - If more than 10 seconds have elapsed, send it
  - Should those numbers come from the server?
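
A minimal sketch of that flushing policy, assuming a hypothetical
`send_block()` that issues one insert request (and therefore roughly one
sqlite transaction) per call; the 1 MB and 10 second thresholds are the ones
listed above, not tuned values:

    import sys
    import time

    MAX_BYTES = 1024 * 1024    # send once the buffer exceeds 1 MB
    MAX_SECONDS = 10           # ... or once 10 seconds have elapsed

    def send_block(lines):
        # Hypothetical: POST one block of ASCII data to the server.
        pass

    def stream_insert(infile):
        buf, nbytes, start = [], 0, time.time()
        for line in infile:
            buf.append(line)
            nbytes += len(line)
            if nbytes >= MAX_BYTES or time.time() - start >= MAX_SECONDS:
                send_block(buf)
                buf, nbytes, start = [], 0, time.time()
        if buf:                # no more data: send whatever is left
            send_block(buf)

    stream_insert(sys.stdin)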

Converting from ASCII to PyTables:

- For each row getting added, we need to set attributes on a PyTables
  Row object and call table.append(). This means that there isn't a
  particularly efficient way of converting from ascii.
- Could create a function like nilmdb.layout.Layout("foo").fillRow(asciiline)
  - But this means we're doing parsing on the serialized side
  - Let's keep parsing on the threaded server side so we can detect
    errors better, and not block the serialized nilmdb for a slow
    parsing process.
- Client sends ASCII data
- Server converts this ASCII data to a list of values
- Maybe:

        # threaded side creates this object
        parser = nilmdb.layout.Parser("layout_name")
        # threaded side parses and fills it with data
        parser.parse(textdata)
        # serialized side pulls out rows
        for n in xrange(parser.nrows):
            parser.fill_row(rowinstance, n)
            table.append()


Inserting streams, inside nilmdb
--------------------------------

- First check that the new stream doesn't overlap.
  - Get minimum timestamp, maximum timestamp from data parser.
    - (extend parser to verify monotonicity and track extents)
  - Get all intervals for this stream in the database
  - See if new interval overlaps any existing ones (see the sketch
    after this list)
    - If so, bail
  - Question: should we cache intervals inside NilmDB?
    - Assume database is fast for now, and always rebuild from DB.
    - Can add a caching layer later if we need to.
    - `stream_get_ranges(path)` -> return IntervalSet?
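
An illustrative version of that overlap test, assuming existing intervals
arrive as half-open `(start, end)` timestamp pairs; nilmdb's real check would
go through its IntervalSet class rather than a plain list:

    def overlaps(existing, new_start, new_end):
        """True if [new_start, new_end) intersects any (start, end) pair."""
        return any(new_start < end and start < new_end
                   for (start, end) in existing)

    existing = [(0.0, 100.0), (200.0, 300.0)]
    print overlaps(existing, 100.0, 200.0)   # False: endpoints only touch
    print overlaps(existing, 150.0, 250.0)   # True: collides with (200, 300)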

Speed
-----
@@ -105,44 +107,44 @@ Speed
- First approach was quadratic. Adding four hours of data:

        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
        real 24m31.093s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
        real 43m44.528s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
        real 93m29.713s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
        real 166m53.007s

- Disabling pytables indexing didn't help:

        real 31m21.492s
        real 52m51.963s
        real 102m8.151s
        real 176m12.469s

- Server RAM usage is constant.

- Speed problems were due to IntervalSet speed, of parsing intervals
  from the database and adding the new one each time.

- First optimization is to cache result of `nilmdb:_get_intervals`,
  which gives the best speedup (see the sketch after this list).

- Also switched to internally using bxInterval from bx-python package.
  Speed of `tests/test_interval:TestIntervalSpeed` is pretty decent
  and seems to be growing logarithmically now. About 85μs per insertion
  for inserting 131k entries.

- Storing the interval data in SQL might be better, with a scheme like:
  http://www.logarithmic.net/pfh/blog/01235197474

- Next slowdown target is nilmdb.layout.Parser.parse().
  - Rewrote parsers using cython and sscanf
  - Stats (rev 10831), with _add_interval disabled

        layout.pyx.Parser.parse:128        6303 sec, 262k calls
        layout.pyx.parse:63               13913 sec, 5.1g calls
        numpy:records.py.fromrecords:569   7410 sec, 262k calls

- Probably OK for now.
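
For reference, a minimal sketch of the kind of per-path caching meant above;
the class and method names are illustrative, not nilmdb's actual internals,
and the cache entry would be dropped whenever that stream gains a new interval:

    class IntervalCache(object):
        """Memoize per-path interval lists so every insert doesn't
        re-read and re-parse all intervals from the database."""
        def __init__(self, fetch):
            self.fetch = fetch          # e.g. wraps _get_intervals(path)
            self.cache = {}

        def get(self, path):
            if path not in self.cache:
                self.cache[path] = self.fetch(path)
            return self.cache[path]

        def invalidate(self, path):
            # call after adding an interval so the next get() hits the DB
            self.cache.pop(path, None)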

@@ -213,8 +215,8 @@ created stream. These locations are called tables. For example,
tables might be located at

    nilmdb/data/newton/raw/
    nilmdb/data/newton/prep/
    nilmdb/data/cottage/raw/

Each table contains:


