OnPGError is called on every error response received from Postgres and can
be used to close connections on specific errors. Defaults to closing on
FATAL-severity errors.
Fixes#1803
stdlib_test.TestConnConcurrency had been flickering on CI deadline /
timeout errors. This was extremely confusing because the test deadline
was set for 2 minutes and the errors would occur much quicker.
The problem only manifested in an extremely specific and timing
sensitive situation.
1. The watchdog timer for deadlocked writes starts the goroutine to
start the background reader
2. The background reader is stopped
3. The next operation is a read without a preceding write (AFAIK only
CheckConn does this)
4. The deadline is set to interrupt the read
5. The goroutine from 1 actually starts the background reader
6. The background reader gets an error reading the connection with the
deadline
7. The deadline is cleared
8. The next read on the connection will get the timeout error
connect_timeout given in conn string was not obeyed if sslmode is not specified (default is prefer) or equals sslmode=allow|prefer. It took twice the amount of time specified by connect_timeout in conn string. While this behavior is correct if multi-host is provided in conn string, it doesn't look correct in case of single host. This behavior was also not matching with libpq.
The root cause was to implement sslmode=allow|prefer conn are tried twice. First with TLSConfig and if that doesn't work then without TLSConfig. The fix for this issue now uses the same context if same host is being tried out. This change won't affect the existing multi-host behavior.
This PR goal is to close issue [jackc/pgx/issues/1672](https://github.com/jackc/pgx/issues/1672)
The context timeouts for tests are designed to give a better error
message when something hangs rather than the test just timing out.
Unfortunately, the potato CI frequently has some test or another
randomly take a long time. While the increased times are somewhat less
than optimal on a real computer, hopefully this will solve the
flickering CI.
on Windows connecting on a closed port takes about 2 seconds.
You can test with something like this
start := time.Now()
_, err := d.DialContext(context.Background(), "tcp", "127.0.0.1:1")
fmt.Printf("finished, time %s, err: %v\n", time.Since(start), err)
This seems by design
https://groups.google.com/g/comp.os.ms-windows.programmer.win32/c/jV6kRVY3BqM
Generally TestConnectWithFallback takes about 8-9 seconds on Windows.
Increase timeout to avoid random failures under load
This commit adds a background reader that can optionally buffer reads.
It is used whenever a potentially blocking write is made to the server.
The background reader is started on a slight delay so there should be no
meaningful performance impact as it doesn't run for quick queries and
its overhead is minimal relative to slower queries.
The non-blocking IO system was designed to solve three problems:
1. Deadlock that can occur when both sides of a connection are blocked
writing because all buffers between are full.
2. The inability to use a write deadline with a TLS.Conn without killing
the connection.
3. Efficiently check if a connection has been closed before writing.
This reduces the cases where the application doesn't know if a query
that does a INSERT/UPDATE/DELETE was actually sent to the server or
not.
However, the nbconn package is extraordinarily complex, has been a
source of very tricky bugs, and has OS specific code paths. It also does
not work at all with underlying net.Conn implementations that do not
have platform specific non-blocking IO syscall support and do not
properly implement deadlines. In particular, this is the case with
golang.org/x/crypto/ssh.
I believe the deadlock problem can be solved with a combination of a
goroutine for CopyFrom like v4 used and a watchdog for regular queries
that uses time.AfterFunc.
The write deadline problem actually should be ignorable. We check for
context cancellation before sending a query and the actual Write should
be almost instant as long as the underlying connection is not blocked.
(We should only have to wait until it is accepted by the OS, not until
it is fully sent.)
Efficiently checking if a connection has been closed is probably the
hardest to solve without non-blocking reads. However, the existing code
only solves part of the problem. It can detect a closed or broken
connection the OS knows about, but it won't actually detect other types
of broken connections such as a network interruption. This is currently
implemented in CheckConn and called automatically when checking a
connection out of the pool that has been idle for over one second. I
think that changing CheckConn to a very short deadline read and changing
the pool to do an actual Ping would be an acceptable solution.
Remove nbconn and non-blocking code. This does not leave the system in
an entirely working state. In particular, CopyFrom is broken, deadlocks
can occur for extremely large queries or batches, and PgConn.CheckConn
is now a `select 1` ping. These will be resolved in subsequent commits.
Tests should timeout in a reasonable time if something is stuck. In
particular this is important when testing deadlock conditions such as
can occur with the copy protocol if both the client and the server are
blocked writing until the other side does a read.