Previous commits went back and forth a bit on whether the various APIs
should use bytes or strings, but bytes appears to be a better answer,
because actual data in streams will always be 7-bit ASCII or raw
binary. There's no reason to apply the performance penalty of
constantly converting between bytes and strings.
One drawback now is that lots of code now has to have "b" prefixes on
strings, especially in tests, which inflates this commit quite a bit.
This gives an easy way to get a large values in the database start_pos
and end_pos fields, which is necessary for testing failure modes when
those get too large (e.g. on 32-bit systems). Adjust tests to make
use of this knob.
We do this by creating a new requests.Session object for each request,
sending a "Connection: close" request header, and then explicitly
marking the connection for close after the response is read.
This is to avoid a longstanding race condition with HTTP keepalive
and server timeouts. Due to data processing, capture, etc, requests
may be separated by an arbitrary delay. If this delay is shorter
than the server's KeepAliveTimeout, the same connection is used.
If the delay is longer, a new connection is used. If the delay is
the same, however, the request may be sent on the old connection at
the exact same time that the server closes it. Typically, the
client sees the connection as closing between the request and the
response, which leads to "httplib.BadStatusLine" errors.
This patch avoids the race condition entirely by not using persistent
Another solution may be to detect those errors and retry the
connection, resending the request. However, the race condition could
potentially show up in other places, like a closed connection during
the request body, not after. Such an error could also be a legitimate
network condition or problem. This solution should be more reliable,
and the overhead of each new connection will hopefully be minimal for
We were still calculating the maximum number of rows correctly,
so the extra data was really extra and would get re-written to the
beginning of the subsequent file.
The only case in which this would lead to database issues is if the
very last file was lengthened incorrectly, and the "nrows" calculation
would therefore be wrong when the database was reopened. Still, even
in that case, it should just leave a small gap in the data, not cause