linux-kernel - Re: [PATCH] bcache: enhancing the security of dirty data writeback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mu7u23kbguzgzfovqpadr6id2pi5a3l6tca2gengjiqgndutw2@qu4aj5didb4h>
Date: Mon, 4 Aug 2025 21:31:38 -0400
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Zhou Jifeng <zhoujifeng@...inos.com.cn>
Cc: Coly Li <colyli@...nel.org>, 
	linux-bcache <linux-bcache@...r.kernel.org>, linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] bcache: enhancing the security of dirty data writeback

On Tue, Aug 05, 2025 at 09:17:31AM +0800, Zhou Jifeng wrote:
> On Tue, 5 Aug 2025 at 00:07, Kent Overstreet <kent.overstreet@...ux.dev> wrote:
> >
> > On Mon, Aug 04, 2025 at 11:31:30PM +0800, Coly Li wrote:
> > > On Mon, Aug 04, 2025 at 12:17:28AM -0400, Kent Overstreet wrote:
> > > > On Mon, Aug 04, 2025 at 11:47:57AM +0800, Zhou Jifeng wrote:
> > > > > On Sun, 3 Aug 2025 at 01:30, Coly Li <colyli@...nel.org> wrote:
> > > > > >
> > > > > > On Fri, Aug 01, 2025 at 02:10:12PM +0800, Zhou Jifeng wrote:
> > > > > > > On Fri, 1 Aug 2025 at 11:42, Kent Overstreet <kent.overstreet@...ux.dev> wrote:
> > > > > > > >
> > > > > > > > On Fri, Aug 01, 2025 at 11:30:43AM +0800, Zhou Jifeng wrote:
> > > > > > > > > On Fri, 1 Aug 2025 at 10:37, Kent Overstreet <kent.overstreet@...ux.dev> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 01, 2025 at 10:27:21AM +0800, Zhou Jifeng wrote:
> > > > > > > > > > > In the writeback mode, the current bcache code uses the
> > > > > > > > > > > REQ_OP_WRITE operation to handle dirty data, and clears the bkey
> > > > > > > > > > > dirty flag in the btree during the bio completion callback. I think
> > > > > > > > > > > there might be a potential risk: if in the event of an unexpected
> > > > > > > > > > > power outage, the data in the HDD hardware cache may not have
> > > > > > > > > > > had time to be persisted, then the data in the HDD hardware cache
> > > > > > > > > > > that is pending processing may be lost. Since at this time the bkey
> > > > > > > > > > > dirty flag in the btree has been cleared, the data status recorded
> > > > > > > > > > > by the bkey does not match the actual situation of the SSD and
> > > > > > > > > > > HDD.
> > > > > > > > > > > Am I understanding this correctly?
> > > > > > > > > >
> > > > > > > > > > For what you're describing, we need to make sure the backing device is
> > > > > > > > > > flushed when we're flushing the journal.
> > > > > > > > > >
> > > > > > > > > > It's possible that this isn't handled correctly in bcache; bcachefs
> > > > > > > > > > does, and I wrote that code after bcache - but the bcache version would
> > > > > > > > > > look quite different.
> > > > > > > > > >
> > > > > > > > > > You've read that code more recently than I have - have you checked for
> > > > > > > > > > that?
> > > > > > > > >
> > > > > > > > > In the `write_dirty_finish` function, there is an attempt to update the
> > > > > > > > > `bkey` status, but I did not observe any logging writing process. In the
> > > > > > > > > core function `journal_write_unlocked` of bcache for writing logs, I
> > > > > > > > > also couldn't find the code logic for sending a FLUSH command to the
> > > > > > > > > backend HDD.
> > > > > > > >
> > > > > > > > The right place for it would be in the journal code: before doing a
> > > > > > > > journal write, issue flushes to the backing devices.
> > > > > > > >
> > > > > > > > Can you check for that?
> > > > > > > >
> > > > > > >
> > > > > > > I checked and found that there was no code for sending a flush request
> > > > > > > to the backend device before the execution log was written. Additionally,
> > > > > > > in the callback function after the dirty data was written back, when it
> > > > > > > updated the bkey, it did not insert this update into the log.
> > > > > > >
> > > > > >
> > > > > > It doesn't have to. If the previous dirty version of the key is on cache device
> > > > > > already, and power off happens, even the clean version of the key is gone, the
> > > > > > dirty version and its data are all valid. If the LBA range of this key is
> > > > > > allocated to a new key, a GC must have alrady happened, and the dirty key is
> > > > > > invalid due to bucket generation increased. So don't worry, the clean key is
> > > > > > unncessary to go into journal in the writeback scenario.
> > > > > >
> > > > > > IMHO, you may try to flush backing device in a kworker before calling
> > > > > > set_gc_sectors() in bch_gc_thread(). The disk cache can be flushed in time
> > > > > > before the still-dirty-on-disk keys are invalidated by increase bucket key
> > > > > > gen. And also flushing backing device after searched_full_index becomes
> > > > > > true in the writeback thread main loop (as you did now).
> > > > > >
> > > > >
> > > > > The "flush" command previously issued by GC was supposed to alleviate
> > > > > the problems in some scenarios. However, I thought of a situation where
> > > > > this "flush" command issued before GC might not be sufficient to solve
> > > > > the issue.
> > > > >
> > > > > Suppose such a scenario: after a power failure, some hardware cache data
> > > > > in the HDD is lost, while the corresponding bkey(with the dirty flag cleared)
> > > > > update in the SSD has been persisted. After the power is restored, if
> > > > > bcache sends a flush before GC, will this cause data loss?
> > > >
> > > > Yes.
> > >
> > > The cleared key is updated in-place within the in-memory btree node,
> > > flushing backing devices before committing journal doesn't help.
> >
> > Yes, it would, although obviously we wouldn't want to do a flush every
> > time we clear the dirty bit - it needs batching.
> >
> > Any time you're doing writes to multiple devices that have ordering
> > dependencies, a flush needs to be involved.
> >
> > > I try to avoid adding the cleared key into journal, in high write pressure,
> > > such synchronized link between writeback, gc and journal makes me really
> > > uncomfortable.
> > >
> > > Another choice might be adding a tag in struct btree, and set the tag when
> > > the cleared key in-place updated in the btree node. When writing a bset and
> > > the tag is set, then flush corresponding backing devcie before writing the
> > > btree node. Maybe hurts less performance than flushing backing device before
> > > committing journal set.
> > >
> > > How do you think of it, Kent?
> >
> > Have a look at the code for this in bcachefs, fs/bcachefs/journal_io.c,
> > journal_write_preflush().
> >
> > If it's a multi device filesystem, we issue flushes separately from the
> > journal write and wait for them to complete before doing the REQ_FUA
> > journal write - that ensures that any cross device IO dependencies are
> > ordered correctly.
> >
> > That approach would work in bcache as well, but it'd have higher
> > performance overhead than in bcachefs because bcache doesn't have the
> > concept of noflush (non commit) journal writes - every journal write is
> > FLUSH/FUA, and there's also writes that bypass the cache, which we we'll
> > be flushing unnecessarily.
> >
> > Having a flag/bitmask for "we cleared dirty bits, these backing
> > device(s) need flushes" would probably have acceptable performance
> > overhead.
> >
> > Also, we're getting damn close to being ready to lift the experimental
> > label on bcachefs, so maybe have a look at that too :)
> >
> 
> Could we consider the solution I submitted, which is based on the
> following main principle:
> 1. Firstly, in the write_dirty_finish stage, the dirty marking bkeys are
> not inserted into the btree immediately. Instead, they are temporarily
> stored in an internal memory queue called Alist.
> 2. Then, when the number of bkeys in Alist exceeds a certain limit, a
> flush request is sent to the backend HDD.
> 3. After the flush is sent, the bkeys recorded in Alist are then
> inserted into the btree.
> This process ensures that the written dirty data is written to the disk
> before the btree is updated. The length of Alist can be configured,
> allowing for better control of the flush sending frequency and reducing
> the impact of the flush on the write speed.

That approach should work as well. You'll want to make the list size
rather bit, and add statistics for how ofter flushes are being issued.