Expected behavior
With track_and_verify_wals_in_manifest=true, a WalAddition should only be recorded in the MANIFEST once, when it is synced, no matter what write options we use.
Actual behavior
When writing with write_options.sync=true, the WalAddition is duplicated hundreds of times in the MANIFEST. This causes the MANIFEST to get very large and causes additional useless disk IO. Raw output from ldb manifest_dump is at the bottom of this issue.
Steps to reproduce the behavior
I have created a unit test in a branch. See DBTestTrackWalCountRecords in db_basic_test.cc here: https://github.com/evanj/rocksdb/blob/evan.jones/test-track-with-sync/db/db_basic_test.cc#L4149-L4184
Workarounds
Don't set track_and_verify_wals_in_manifest=true. However, the RocksDB wiki recommends setting this to true: "We recommend to set track_and_verify_wals_in_manifest to true for production".
Set atomic_flush=true, which handles this better (it only duplicates the record twice instead of hundreds of times, see below).
Additional details
I attempted to debug this problem, and while I think I understand it, I do not know how to fix it.
My test executes many writes so RocksDB switches to a new WAL 3 times (4 total WAL files). It then counts the number of WalAddition records in MANIFEST. It tries the following options, with both atomic_flush=true and atomic_flush=false.
Writes with sync=false
Writes with sync=false followed by FlushWAL(true)
Writes with sync=true
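Counting the records can be sketched as a plain substring count over the dump text. This is a minimal sketch, not the code from the linked unit test: the helper name CountOccurrences is hypothetical, and it assumes the dump text contains the literal string "WalAddition" exactly once per record, which is an assumption about the ldb manifest_dump output format.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count non-overlapping occurrences of `needle` in `text`.
// Assumption: each MANIFEST record of interest contains `needle` exactly once.
inline std::size_t CountOccurrences(const std::string& text,
                                    const std::string& needle) {
  if (needle.empty()) return 0;
  std::size_t count = 0;
  for (std::size_t pos = text.find(needle); pos != std::string::npos;
       pos = text.find(needle, pos + needle.size())) {
    ++count;
  }
  return count;
}
```

Calling CountOccurrences(dump_text, "WalAddition") on the captured dump would then give the per-configuration totals compared in the results.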
The results are the following:
sync=false; atomic_flush=false: 0 records in MANIFEST (CORRECT): Nothing is flushed, so I think this is right.
sync=false; atomic_flush=true: 3 records in MANIFEST (CORRECT): Strictly speaking I'm not sure these records are needed, but this is correct.
sync=false + FlushWAL(true); atomic_flush=*: 3 records in MANIFEST (CORRECT)
sync=true; atomic_flush=false: 194 records in MANIFEST (ERROR): This should be 3. Each record is replicated many times with the exact same values (see below).
sync=true; atomic_flush=true: 6 records in MANIFEST (ERROR): This should be 3. This is not as bad, since it only does this twice (see below).
My (maybe incorrect) analysis of the two incorrect cases:
sync=true; atomic_flush=false:
In the "correct" case when we use FlushWAL(true), the WAL file is closed by SyncWalImpl, then removed from DBImpl::logs_ by MarkLogsSynced. Since the log is removed from logs_, it only gets added to the MANIFEST once.
In this "incorrect" case, the new log file is added to logs_ in SwitchMemtable, but the file itself is closed by a background task: BGWorkFlush, which eventually calls FindObsoleteFiles, which actually closes the file. Until FindObsoleteFiles closes the WAL file, each write finds two files in logs_, so each call to Put then calls MarkLogsSynced, which decides it needs to add the WAL file to synced_wals here:
rocksdb/db/db_impl/db_impl.cc, lines 1860 to 1864 in 1095810
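The mechanism described above can be sketched with a toy model. This is not RocksDB code: ToyDb, SwitchWal, MarkSynced, BackgroundClose, and CountRecords are all hypothetical names, and the model only encodes the analysis in this issue (which may itself be incorrect). It compares a path where the old WAL is removed from the list as soon as it is synced with a path where it stays in the list until a background step removes it.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model only: hypothetical types that do not mirror RocksDB's real
// data structures or control flow.
struct ToyDb {
  std::deque<uint64_t> logs;       // live WAL numbers, oldest first
  std::vector<uint64_t> manifest;  // WAL numbers recorded as WalAddition
  bool remove_on_sync;             // true: old WAL leaves `logs` when synced

  explicit ToyDb(bool remove) : logs{1}, remove_on_sync(remove) {}

  // Start a new WAL; the old one stays in `logs`.
  void SwitchWal() { logs.push_back(logs.back() + 1); }

  // Runs after every synced write: records every WAL except the newest.
  void MarkSynced() {
    for (auto it = logs.begin(); it != logs.end() && logs.size() > 1;) {
      if (*it == logs.back()) break;  // newest WAL is still being written
      manifest.push_back(*it);        // one WalAddition record per visit
      if (remove_on_sync) {
        it = logs.erase(it);          // old WAL is never visited again
      } else {
        ++it;                         // old WAL stays until BackgroundClose
      }
    }
  }

  // Background cleanup that finally drops old WALs from `logs`.
  void BackgroundClose() {
    while (logs.size() > 1) logs.pop_front();
  }
};

// WalAddition records written by one WAL switch followed by `writes`
// synced writes, with the background cleanup running only afterwards.
inline std::size_t CountRecords(bool remove_on_sync, int writes) {
  ToyDb db(remove_on_sync);
  db.SwitchWal();
  for (int i = 0; i < writes; ++i) db.MarkSynced();
  db.BackgroundClose();
  return db.manifest.size();
}
```

In this model, CountRecords(true, 50) yields one record, while CountRecords(false, 50) yields one record per synced write that lands before the background cleanup, which is the shape of the duplication reported here.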
I'm not sure how to fix this. Maybe we need SwitchMemtable to close the log file, so this code in MarkLogsSynced only gets executed once? Or maybe we need this code in MarkLogsSynced to close the log file?
sync=true; atomic_flush=true: I think the problem is that the MANIFEST is updated FIRST by the Put(sync=true), then it is updated AGAIN by SyncWalImpl, which is called because atomic_flush=true. This is similar to the atomic_flush=false case, but less bad. I think if we fix the above case, it should fix this also.
Example MANIFEST dumps
sync=true; atomic_flush=false
Notice that the WalAdditions repeat many times. This is truncated output from the unit test, showing the MANIFEST records switching from one WAL to the next.
sync=true; atomic_flush=true
Notice that the WalAdditions are all exactly duplicated. This is the complete output from the unit test.