Previous commits went back and forth a bit on whether the various APIs
should use bytes or strings, but bytes appears to be a better answer,
because actual data in streams will always be 7-bit ASCII or raw
binary. There's no reason to apply the performance penalty of
constantly converting between bytes and strings.
One drawback is that lots of code now needs "b" prefixes on string
literals, especially in tests, which inflates this commit quite a bit.
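
For example (a hypothetical test snippet; the values are made up),
comparisons that used to be written with string literals now need
bytes literals:

    # Stream data stays bytes end-to-end, so every expected value
    # needs a "b" prefix.
    data = b"1234567890.000000 2.5\n"
    assert data == b"1234567890.000000 2.5\n"
    assert data != "1234567890.000000 2.5\n"   # str never equals bytes
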
The default bisect module includes a fast C implementation, which
requires that array indices fit within the system "long" type. For
32-bit systems, that's not acceptable, as the table indices for raw
data can exceed 2^32 very quickly. A pure Python version works fine.
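
For reference, the pure Python bisect_left is just the textbook
binary search; lo, hi, and mid are arbitrary-precision Python ints,
so nothing here is bounded by the platform's long:

    def bisect_left(a, x, lo=0, hi=None):
        # Works for any index range: mid is a plain Python int.
        if hi is None:
            hi = len(a)
        while lo < hi:
            mid = (lo + hi) // 2
            if a[mid] < x:
                lo = mid + 1
            else:
                hi = mid
        return lo
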
This gives an easy way to get large values into the database's
start_pos and end_pos fields, which is necessary for testing failure
modes when
those get too large (e.g. on 32-bit systems). Adjust tests to make
use of this knob.
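
A sketch of the idea, with made-up names (the actual knob in this
commit may be spelled differently):

    class Table:
        def __init__(self, initial_nrows=0):
            # Hypothetical knob: pretend rows 0..initial_nrows-1
            # already exist, so new data immediately gets huge
            # start_pos/end_pos values.
            self.nrows = initial_nrows

        def append(self, count):
            start_pos, end_pos = self.nrows, self.nrows + count
            self.nrows = end_pos
            return start_pos, end_pos

    t = Table(initial_nrows=2**32)
    print(t.append(100))   # (4294967296, 4294967396): past 32 bits
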
Normally, indices for an array are expected to fit in a platform's
native long (32 or 64-bit). In nilmdb, tables aren't real arrays and
we need to handle unbounded indices.
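
Building on the pure-Python bisect_left above, a sketch of how such a
table might look (hypothetical names, not nilmdb's real classes):

    class Table:
        # Behaves like a sorted array of timestamps, but rows are
        # looked up on demand, so indices have no inherent bound.
        def __init__(self, nrows):
            self.nrows = nrows
        def __getitem__(self, row):
            return row * 1000   # stand-in for a real row lookup

    table = Table(nrows=2**40)
    # Pass hi explicitly so len() is never needed; huge indices work.
    row = bisect_left(table, 5_000_000, 0, table.nrows)
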
We do this by creating a new requests.Session object for each request,
sending a "Connection: close" request header, and then explicitly
marking the connection for close after the response is read.
This is to avoid a longstanding race condition with HTTP keepalive
and server timeouts. Due to data processing, capture, etc., requests
may be separated by an arbitrary delay. If this delay is shorter
than the server's KeepAliveTimeout, the same connection is used.
If the delay is longer, a new connection is used. If the delay is
the same, however, the request may be sent on the old connection at
the exact same time that the server closes it. Typically, the
client sees the connection as closing between the request and the
response, which leads to "httplib.BadStatusLine" errors.
This patch avoids the race condition entirely by not using persistent
connections.
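
A minimal sketch of the approach using the requests library (not
necessarily the exact code in this commit):

    import requests

    def do_request(method, url, **kwargs):
        # One-shot session: no connection reuse, so there is no
        # window where the server can close a connection we are
        # about to send a request on.
        session = requests.Session()
        try:
            headers = kwargs.setdefault("headers", {})
            headers["Connection"] = "close"
            response = session.request(method, url, **kwargs)
            response.content   # read the full body before closing
            return response
        finally:
            session.close()
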
Another solution would be to detect those errors, reopen the
connection, and resend the request. However, the race condition could
potentially show up in other places, like the connection closing
during the request body rather than after it, and such an error could
also indicate a legitimate network problem. This solution should be
more reliable,
and the overhead of each new connection will hopefully be minimal for
typical workloads.
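
For comparison, the rejected retry approach would look roughly like
this sketch; requests wraps errors like BadStatusLine in
ConnectionError, which is also what a genuine network failure raises,
so the except clause cannot tell the two apart:

    import requests

    def request_with_retry(session, method, url, retries=1, **kwargs):
        for attempt in range(retries + 1):
            try:
                return session.request(method, url, **kwargs)
            except requests.exceptions.ConnectionError:
                # Could be the keepalive race, or a real network
                # problem; there is no way to distinguish here.
                if attempt == retries:
                    raise
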
We saw a bug where renamed streams had missing data at the end. I
think what happened is:
- Write data to /old/path
- Rename to /new/path
- Write data to /new/path
- Cache entry for /old/path gets evicted, file gets truncated
Instead, make sure we evict /old/path right away when renaming.
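
A sketch of the fix, with hypothetical names (not nilmdb's actual
classes):

    import os

    class StreamStore:
        def __init__(self, root):
            self.root = root
            self.cache = {}   # stream path -> open table object

        def rename(self, oldpath, newpath):
            # Evict the old path before moving anything: if a stale
            # entry stayed cached, a later eviction would flush it
            # and truncate files that now belong to the new path.
            table = self.cache.pop(oldpath, None)
            if table is not None:
                table.close()
            os.rename(self.root + oldpath, self.root + newpath)
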