
Update doc formatting, .gitignore

tags/nilmdb-0.1
Jim Paris 11 years ago
commit f0c2a64ae3
2 changed files with 62 additions and 58 deletions
  1. .gitignore (+2, -0)
  2. design.md (+60, -58)

.gitignore (+2, -0)

@@ -3,3 +3,5 @@ tests/*testdb/
.coverage
*.pyc
design.html
timeit*out


design.md (+60, -58)

@@ -19,13 +19,13 @@ Sqlite performance
Committing a transaction in the default sync mode (PRAGMA synchronous=FULL)
takes about 125msec. sqlite3 will commit transactions at 3 times:

1. explicit con.commit()

2. between a series of DML commands and non-DML commands, e.g.
   after a series of INSERT, SELECT, but before a CREATE TABLE or
   PRAGMA.

3. at the end of an explicit transaction, e.g. "with self.con as con:"

To speed up testing, or if this transaction speed becomes an issue,
the sync=False option to NilmDB will set PRAGMA synchronous=OFF.
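
As an illustration (not nilmdb's actual code), this is roughly what that option
amounts to with the stdlib `sqlite3` module; the `connect()` helper and its
`sync` argument are hypothetical names used only for this sketch:

    import sqlite3

    def connect(dbfile, sync=True):
        # Hypothetical wrapper: sync=False trades durability for commit speed.
        con = sqlite3.connect(dbfile)
        if not sync:
            con.execute("PRAGMA synchronous=OFF")  # commits no longer wait on fsync
        return con

    con = connect("test.db", sync=False)
    con.execute("CREATE TABLE IF NOT EXISTS data (t REAL, v REAL)")
    con.execute("INSERT INTO data VALUES (0.0, 1.0)")
    con.commit()   # fast; with synchronous=FULL this is the ~125 msec step
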
@@ -48,56 +48,58 @@ transfer?
everything still gets buffered. Just a tradeoff of buffer size.

Before timestamps are added:

- Raw data is about 440 kB/s (9 channels)
- Prep data is about 12.5 kB/s (1 phase)
- How do we know how much data to send?
  - Remember that we can only do maybe 8-50 transactions per second on
    the sqlite database. So if one block of inserted data is one
    transaction, we'd need the raw case to be around 64kB per request,
    ideally more.
  - Maybe use a range, based on how long it's taking to read the data
    (see the sketch after this list):
    - If no more data, send it
    - If data > 1 MB, send it
    - If more than 10 seconds have elapsed, send it
  - Should those numbers come from the server?
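
A minimal sketch of that flushing policy, assuming a hypothetical
`send_block()` that issues one insert request (and therefore roughly one
sqlite transaction) per call; the 1 MB and 10 second thresholds are the ones
listed above, not tuned values:

    import sys
    import time

    MAX_BYTES = 1024 * 1024    # send once the buffer exceeds 1 MB
    MAX_SECONDS = 10           # ... or once 10 seconds have elapsed

    def send_block(lines):
        # Hypothetical: POST one block of ASCII data to the server.
        pass

    def stream_insert(infile):
        buf, nbytes, start = [], 0, time.time()
        for line in infile:
            buf.append(line)
            nbytes += len(line)
            if nbytes >= MAX_BYTES or time.time() - start >= MAX_SECONDS:
                send_block(buf)
                buf, nbytes, start = [], 0, time.time()
        if buf:                # no more data: send whatever is left
            send_block(buf)

    stream_insert(sys.stdin)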

Converting from ASCII to PyTables:

- For each row getting added, we need to set attributes on a PyTables
  Row object and call table.append(). This means that there isn't a
  particularly efficient way of converting from ascii.
- Could create a function like nilmdb.layout.Layout("foo").fillRow(asciiline)
  - But this means we're doing parsing on the serialized side
  - Let's keep parsing on the threaded server side so we can detect
    errors better, and not block the serialized nilmdb for a slow
    parsing process.
- Client sends ASCII data
- Server converts this ASCII data to a list of values
- Maybe:

        # threaded side creates this object
        parser = nilmdb.layout.Parser("layout_name")
        # threaded side parses and fills it with data
        parser.parse(textdata)
        # serialized side pulls out rows
        for n in xrange(parser.nrows):
            parser.fill_row(rowinstance, n)
            table.append()


Inserting streams, inside nilmdb
--------------------------------

- First check that the new stream doesn't overlap.
  - Get minimum timestamp, maximum timestamp from data parser.
    - (extend parser to verify monotonicity and track extents)
  - Get all intervals for this stream in the database
  - See if new interval overlaps any existing ones (see the sketch
    after this list)
    - If so, bail
  - Question: should we cache intervals inside NilmDB?
    - Assume database is fast for now, and always rebuild from DB.
    - Can add a caching layer later if we need to.
    - `stream_get_ranges(path)` -> return IntervalSet?
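
An illustrative version of that overlap test, assuming existing intervals
arrive as half-open `(start, end)` timestamp pairs; nilmdb's real check would
go through its IntervalSet class rather than a plain list:

    def overlaps(existing, new_start, new_end):
        """True if [new_start, new_end) intersects any (start, end) pair."""
        return any(new_start < end and start < new_end
                   for (start, end) in existing)

    existing = [(0.0, 100.0), (200.0, 300.0)]
    print overlaps(existing, 100.0, 200.0)   # False: endpoints only touch
    print overlaps(existing, 150.0, 250.0)   # True: collides with (200, 300)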

Speed
-----
@@ -105,44 +107,44 @@ Speed
- First approach was quadratic. Adding four hours of data:

        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-110000 /bpnilm/1/raw
        real 24m31.093s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-120001 /bpnilm/1/raw
        real 43m44.528s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-130002 /bpnilm/1/raw
        real 93m29.713s
        $ time zcat /home/jim/bpnilm-data/snapshot-1-20110513-110002.raw.gz | ./nilmtool.py insert -s 20110513-140003 /bpnilm/1/raw
        real 166m53.007s

- Disabling pytables indexing didn't help:

        real 31m21.492s
        real 52m51.963s
        real 102m8.151s
        real 176m12.469s

- Server RAM usage is constant.

- Speed problems were due to IntervalSet speed, of parsing intervals
  from the database and adding the new one each time.

- First optimization is to cache result of `nilmdb:_get_intervals`,
  which gives the best speedup (see the sketch after this list).

- Also switched to internally using bxInterval from bx-python package.
  Speed of `tests/test_interval:TestIntervalSpeed` is pretty decent
  and seems to be growing logarithmically now. About 85μs per insertion
  for inserting 131k entries.

- Storing the interval data in SQL might be better, with a scheme like:
  http://www.logarithmic.net/pfh/blog/01235197474

- Next slowdown target is nilmdb.layout.Parser.parse().
  - Rewrote parsers using cython and sscanf
  - Stats (rev 10831), with _add_interval disabled

        layout.pyx.Parser.parse:128        6303 sec, 262k calls
        layout.pyx.parse:63               13913 sec, 5.1g calls
        numpy:records.py.fromrecords:569   7410 sec, 262k calls

- Probably OK for now.
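
For reference, a minimal sketch of the kind of per-path caching meant above;
the class and method names are illustrative, not nilmdb's actual internals,
and the cache entry would be dropped whenever that stream gains a new interval:

    class IntervalCache(object):
        """Memoize per-path interval lists so every insert doesn't
        re-read and re-parse all intervals from the database."""
        def __init__(self, fetch):
            self.fetch = fetch          # e.g. wraps _get_intervals(path)
            self.cache = {}

        def get(self, path):
            if path not in self.cache:
                self.cache[path] = self.fetch(path)
            return self.cache[path]

        def invalidate(self, path):
            # call after adding an interval so the next get() hits the DB
            self.cache.pop(path, None)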

@@ -213,8 +215,8 @@ created stream. These locations are called tables. For example,
tables might be located at

    nilmdb/data/newton/raw/
    nilmdb/data/newton/prep/
    nilmdb/data/cottage/raw/

Each table contains:


