So far the code frequently traversed all the keys (ignoring the flag that marks whether a key is a bucket) and tried to open each key as a bucket, seeking the same entry from scratch each time. In this proposal, we iterate only over bucket keys.
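A minimal sketch of the idea in terms of bolt's public API (forEachBucket is a hypothetical helper, not the code this change adds): nested buckets surface in Bucket.ForEach with a nil value, so we can recurse only into those keys instead of trying to open every key as a bucket.
```
package main

import bolt "go.etcd.io/bbolt"

// forEachBucket visits only the sub-buckets of b: ForEach reports a nil
// value for bucket keys, so plain key/value pairs are skipped outright.
func forEachBucket(b *bolt.Bucket, fn func(name []byte, sub *bolt.Bucket) error) error {
	return b.ForEach(func(k, v []byte) error {
		if v != nil {
			return nil // regular key/value pair, not a bucket
		}
		return fn(k, b.Bucket(k))
	})
}
```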
Signed-off-by: Piotr Tabor <ptab@google.com>
This makes it easy to find which page actually appears to be corrupted:
```
go build ./cmd/bbolt/ && ./bbolt check ~/Downloads/db
panic: freepages: failed to get all reachable pages (page 8314893338927566090: out of bounds: 6258 (stack: [4517 395 821]))
goroutine 18 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
/Users/ptab/gits/bbolt/db.go:1056 +0x8c
created by go.etcd.io/bbolt.(*DB).freepages
/Users/ptab/gits/bbolt/db.go:1054 +0x138
```
Signed-off-by: Piotr Tabor <ptab@google.com>
After the checkptr fixes in 2fc6815c, new issues were discovered in
production systems, in particular when a single process opened and
updated multiple separate databases. This indicates that some bug
relating to bad unsafe usage was introduced by that commit.
This commit combines several attempts at fixing the new issue. For
example, slices are once again created by slicing an array of "max
allocation" elements, but this time with the cap set to the intended
length. This operation is expressly permitted by the Go wiki, so it
should be preferred over type-converting a reflect.SliceHeader.
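A minimal sketch of that pattern (maxAllocSize and byteSliceAt are illustrative stand-ins, not bbolt's exact names):
```
package main

import (
	"fmt"
	"unsafe"
)

// maxAllocSize stands in for bbolt's platform-dependent constant.
const maxAllocSize = 0x7FFFFFFF

// byteSliceAt builds a []byte of length and capacity n over the memory at p
// by slicing a pointer to a "max allocation" array, with the cap set to n.
func byteSliceAt(p unsafe.Pointer, n int) []byte {
	return (*[maxAllocSize]byte)(p)[:n:n]
}

func main() {
	buf := []byte{0xde, 0xad, 0xbe, 0xef}
	s := byteSliceAt(unsafe.Pointer(&buf[0]), 2)
	fmt.Printf("%x\n", s) // dead
}
```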
When the database has a lot of free pages, the cost of syncing them
all to disk is high. If the total database size is small (<10GB) and
the application can tolerate ~10 seconds of recovery time, it is
reasonable to simply not sync the freelist and instead rescan the db
to rebuild the freelist on recovery.
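Opting in looks roughly like this (a sketch using the NoFreelistSync option; path and error handling are illustrative):
```
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Skip syncing the freelist on commit; bbolt rebuilds it by rescanning
	// the db the next time the file is opened.
	db, err := bolt.Open("my.db", 0600, &bolt.Options{NoFreelistSync: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```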
Read txns would pin pages allocated after the txn began, keeping those
pages off the free list until the read txn closed. Instead, track the
allocating txid to compute each page's lifetime, freeing a page once all
txns between its allocation and its free are closed.
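A hypothetical sketch of that lifetime rule (not bbolt's actual data structures): a page allocated at txid a and freed at txid f can be reused once no open read txn's id falls in [a, f).
```
type txid uint64

// releasable reports whether a page allocated at txid alloc and freed at
// txid free can go back on the free list: it can once every txn between
// allocation and free has closed, i.e. no open reader sits in [alloc, free).
func releasable(alloc, free txid, openReads []txid) bool {
	for _, r := range openReads {
		if alloc <= r && r < free {
			return false
		}
	}
	return true
}
```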
Using a large (50GB) database with a read-write-delete heavy load,
nearly 100% of allocated space came from freelists.
1/3 came from freelist.release, 1/3 from freelist.write,
and 1/3 from tx.allocate making space for freelist.write.
In the case of freelist.write, the newly allocated giant slice gets
copied to the space prepared by tx.allocate and then discarded.
To avoid this, add func mergepgids that accepts a destination slice,
and use it in freelist.write.
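A simplified sketch of the idea (bbolt's real mergepgids handles some edge cases differently): merge two sorted pgid lists directly into a caller-provided destination, so the result lands in the space tx.allocate already prepared rather than in a temporary slice.
```
type pgid uint64
type pgids []pgid

// mergepgids writes the sorted union of a and b into dst, which must be
// presized to at least len(a)+len(b); nothing new is allocated.
func mergepgids(dst, a, b pgids) {
	if len(dst) < len(a)+len(b) {
		panic("mergepgids bad len")
	}
	dst = dst[:0] // appends below stay within dst's backing array
	for len(a) > 0 && len(b) > 0 {
		if a[0] <= b[0] {
			dst = append(dst, a[0])
			a = a[1:]
		} else {
			dst = append(dst, b[0])
			b = b[1:]
		}
	}
	dst = append(dst, a...)
	dst = append(dst, b...)
}
```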
This has a mild negative impact on the existing benchmarks,
but cuts allocated space in my real world db by over 30%.
name old time/op new time/op delta
_FreelistRelease10K-8 18.7µs ±10% 18.2µs ± 4% ~ (p=0.548 n=5+5)
_FreelistRelease100K-8 233µs ± 5% 258µs ±20% ~ (p=0.151 n=5+5)
_FreelistRelease1000K-8 3.34ms ± 8% 3.13ms ± 8% ~ (p=0.151 n=5+5)
_FreelistRelease10000K-8 32.3ms ± 1% 32.2ms ± 7% ~ (p=0.690 n=5+5)
DBBatchAutomatic-8 2.18ms ± 3% 2.19ms ± 4% ~ (p=0.421 n=5+5)
DBBatchSingle-8 140ms ± 6% 140ms ± 4% ~ (p=0.841 n=5+5)
DBBatchManual10x100-8 4.41ms ± 2% 4.37ms ± 3% ~ (p=0.548 n=5+5)
name old alloc/op new alloc/op delta
_FreelistRelease10K-8 82.5kB ± 0% 82.5kB ± 0% ~ (all samples are equal)
_FreelistRelease100K-8 805kB ± 0% 805kB ± 0% ~ (all samples are equal)
_FreelistRelease1000K-8 8.05MB ± 0% 8.05MB ± 0% ~ (all samples are equal)
_FreelistRelease10000K-8 80.4MB ± 0% 80.4MB ± 0% ~ (p=1.000 n=5+5)
DBBatchAutomatic-8 384kB ± 0% 384kB ± 0% ~ (p=0.095 n=5+5)
DBBatchSingle-8 17.2MB ± 1% 17.2MB ± 1% ~ (p=0.310 n=5+5)
DBBatchManual10x100-8 908kB ± 0% 902kB ± 1% ~ (p=0.730 n=4+5)
name old allocs/op new allocs/op delta
_FreelistRelease10K-8 5.00 ± 0% 5.00 ± 0% ~ (all samples are equal)
_FreelistRelease100K-8 5.00 ± 0% 5.00 ± 0% ~ (all samples are equal)
_FreelistRelease1000K-8 5.00 ± 0% 5.00 ± 0% ~ (all samples are equal)
_FreelistRelease10000K-8 5.00 ± 0% 5.00 ± 0% ~ (all samples are equal)
DBBatchAutomatic-8 10.2k ± 0% 10.2k ± 0% +0.07% (p=0.032 n=5+5)
DBBatchSingle-8 58.6k ± 0% 59.6k ± 0% +1.70% (p=0.008 n=5+5)
DBBatchManual10x100-8 6.02k ± 0% 6.03k ± 0% +0.17% (p=0.029 n=4+4)
This commit fixes a timing bug where `DB.StrictMode` could panic
before the goroutine reading the database had finished. If an error
is found in strict mode, it now finishes reading the entire
database before panicking.
This commit changes `Tx.WriteTo()` to use the transaction's
in-memory meta page instead of copying it from disk. This is
needed because the transaction sizes the copy from its own meta
page, but previously wrote out the meta page currently on disk,
which may reference pages allocated after the transaction started.
Fixes #513
This commit refactors the test suite to make it cleaner and to use the
standard testing library better. The `assert()`, `equals()`, and `ok()`
functions have been removed and some test names have been changed for
clarity.
No functionality has been changed.
Only grow the database size when the high watermark increases.
We also grow the database size somewhat aggressively to
save a few ftruncates.
I have tested this in various environments. The performance impact
is negligible with 16MB of over-allocation. Without over-allocation,
performance might drop by 100% when each Tx.Commit needs a new
page on a very slow disk (seek time dominates the total write).
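A hypothetical sketch of the policy (names are illustrative, not the actual code):
```
const overAllocation = 16 * 1024 * 1024 // 16MB, per the measurements above

// newFileSize grows the file only when the data high watermark passes the
// current size, over-allocating to amortize the cost of ftruncate calls.
func newFileSize(current, highWatermark int64) int64 {
	if highWatermark <= current {
		return current // high watermark unchanged: no ftruncate needed
	}
	return highWatermark + overAllocation
}
```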
This commit removes references to the last write transaction and
its dirty pages after the transaction has been closed. This
bug did not affect the safety of the data; however, it caused
dirty pages to linger in memory until the next write
transaction began.
Added a few lines of documentation to clarify that read-only
transactions need to be rolled back and not committed, as per the
discussion in issue #344.
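The documented pattern, as a sketch (db is an open *bolt.DB; error handling abbreviated):
```
tx, err := db.Begin(false) // false = read-only
if err != nil {
	return err
}
// Read-only transactions must be rolled back, not committed;
// calling Commit on a read-only tx returns an error.
defer tx.Rollback()
```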
When accessing node data, we used to cast to *[maxAllocSize]byte,
which breaks if the access crosses the maxAllocSize boundary.
This leads to occasional panics.
Sample stack trace:
```
panic: runtime error: slice bounds out of range
goroutine 1 [running]:
github.com/boltdb/bolt.(*node).write(0xc208010f50, 0xc27452a000)
$GOPATH/src/github.com/boltdb/bolt/node.go:228 +0x5a5
github.com/boltdb/bolt.(*node).spill(0xc208010f50, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/node.go:364 +0x506
github.com/boltdb/bolt.(*node).spill(0xc208010700, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/node.go:336 +0x12d
github.com/boltdb/bolt.(*node).spill(0xc208010620, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/node.go:336 +0x12d
github.com/boltdb/bolt.(*Bucket).spill(0xc22b6ae880, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/bucket.go:535 +0x1c4
github.com/boltdb/bolt.(*Bucket).spill(0xc22b6ae840, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/bucket.go:502 +0xac2
github.com/boltdb/bolt.(*Bucket).spill(0xc22f4e2018, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/bucket.go:502 +0xac2
github.com/boltdb/bolt.(*Tx).Commit(0xc22f4e2000, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/tx.go:150 +0x1ee
github.com/boltdb/bolt.(*DB).Update(0xc2080e4000, 0xc24d077508, 0x0, 0x0)
$GOPATH/src/github.com/boltdb/bolt/db.go:483 +0x169
```
It usually happens when working with large (50MB/100MB) values.
One way to reproduce it is to change maxAllocSize in bolt_amd64.go to 70000 and run the tests;
TestBucket_Put_Large crashes.
This commit moves the functionality in Tx.Copy() to Tx.WriteTo(). This
allows Tx to be used as an io.WriterTo, which makes it easier to mock.
The Tx.Copy() function still exists but is now simply a wrapper around
Tx.WriteTo().
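For example, a consistent backup can now stream to any io.Writer (a sketch; file names are illustrative):
```
package main

import (
	"log"
	"os"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("my.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	f, err := os.Create("backup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Tx is an io.WriterTo: stream a consistent snapshot straight to f.
	if err := db.View(func(tx *bolt.Tx) error {
		_, err := tx.WriteTo(f)
		return err
	}); err != nil {
		log.Fatal(err)
	}
}
```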
OpenBSD's kernel does not include a unified buffer cache (UBC), so
writes must be synchronized with the msync(2) syscall. In addition,
the NoSync field of the DB struct should be ignored on OpenBSD, since
unlike on other platforms, missing msyncs will result in data corruption.
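A sketch of the approach (flag values mirror <sys/mman.h>; bolt's real version lives in its OpenBSD-specific source file):
```
//go:build openbsd

package main

import (
	"syscall"
	"unsafe"
)

// msync(2) flag bits, defined locally as in <sys/mman.h>.
const (
	msSync       = 1 << 1 // perform synchronous writes
	msInvalidate = 1 << 2 // invalidate cached data
)

// msyncRegion flushes an mmap'd byte range to disk via msync(2).
func msyncRegion(data []byte) error {
	_, _, errno := syscall.Syscall(syscall.SYS_MSYNC,
		uintptr(unsafe.Pointer(&data[0])),
		uintptr(len(data)), msSync|msInvalidate)
	if errno != 0 {
		return errno
	}
	return nil
}
```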
Depends on PR #258.
Fixes #257.