Message-ID: <ZeDQd7l6PQF5IpBa@bfoster>
Date: Thu, 29 Feb 2024 13:44:07 -0500
From: Brian Foster <bfoster@...hat.com>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: linux-bcachefs@...r.kernel.org, linux-kernel@...r.kernel.org,
	djwong@...nel.org
Subject: Re: [PATCH 03/21] bcachefs: btree write buffer knows how to
 accumulate bch_accounting keys

On Wed, Feb 28, 2024 at 05:42:39PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:50:23AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> > > +	if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> > > +		struct bkey u;
> > > +		struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> > > +
> > > +		if (k.k->type == KEY_TYPE_accounting)
> > > +			bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> > > +						   bkey_s_c_to_accounting(k));
> > 
> > So it looks like we're accumulating from the btree key into the write
> > buffer key. Is this so the following code will basically insert a new
> > btree key based on the value of the write buffer key?
> 
> Correct, this is where we go from "accounting key is a delta" to
> "accounting key is the new version of the key".
> 
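For illustration, "accumulate" here is element-wise addition of the counters
carried by an accounting key: the delta sitting in the write buffer key has
the current on-btree totals folded into it, so the subsequent insert writes a
full new value rather than a delta. A minimal sketch with hypothetical names
(acct_key, acct_accumulate), not the real bch_accounting layout:

	struct acct_key {
		unsigned	nr;	/* number of counters in this key */
		long long	d[8];	/* counter values / deltas */
	};

	/* Fold @src into @dst, leaving @dst holding the accumulated totals. */
	static void acct_accumulate(struct acct_key *dst, const struct acct_key *src)
	{
		for (unsigned i = 0; i < dst->nr && i < src->nr; i++)
			dst->d[i] += src->d[i];
	}
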
> > >  	darray_for_each(wb->sorted, i) {
> > >  		struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> > > +		bool accounting_accumulated = false;
> > 
> > Should this live within the interior flush loop?
> 
> We can't define it within the loop because then we'd be setting it to
> false on every loop iteration... but it does belong _with_ the loop, so
> I'll move it to right before.
> 

Ah, right.

> > > -			bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > -						bch2_btree_write_buffer_journal_flush);
> > > +			if (!accounting_replay_done &&
> > > +			    i->k.k.type == KEY_TYPE_accounting) {
> > > +				could_not_insert++;
> > > +				continue;
> > > +			}
> > > +
> > > +			if (!could_not_insert)
> > > +				bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > +							bch2_btree_write_buffer_journal_flush);
> > 
> > Hmm.. so this is sane because the slowpath runs in journal sorted order,
> > right?
> 
> yup, which means as soon as we hit a key we can't insert, we can't
> release any more journal pins
> 
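To make that ordering argument concrete, here is a self-contained sketch of
the invariant (hypothetical names, not the actual bcachefs code): keys are
walked in journal-seq order, and the flush pin may only advance while every
key seen so far has been inserted; the first skipped key freezes it.

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	struct wb_key { uint64_t journal_seq; bool insertable; };

	/* Returns the newest journal seq the flush pin may advance to. */
	static uint64_t flush_and_advance_pin(struct wb_key *keys, size_t nr)
	{
		uint64_t pin_seq = 0;
		bool hit_uninsertable = false;

		for (size_t i = 0; i < nr; i++) {	/* journal-seq order */
			if (!keys[i].insertable) {	/* e.g. accounting before replay is done */
				hit_uninsertable = true;
				continue;		/* kept for the next flush */
			}
			if (!hit_uninsertable)
				pin_seq = keys[i].journal_seq;
			/* ... insert keys[i] into the btree ... */
		}
		return pin_seq;
	}
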
> > 
> > >  
> > >  			bch2_trans_begin(trans);
> > >  
> > > @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> > >  					btree_write_buffered_insert(trans, i));
> > >  			if (ret)
> > >  				goto err;
> > > +
> > > +			i->journal_seq = 0;
> > > +		}
> > > +
> > 
> > 		/*
> > 		 * Condense the remaining keys <reasons reasons>...??
> > 		 */
> 
> yup, that's a good comment
> 
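One possible wording for that comment, inferred from this thread rather than
taken from the final patch:

		/*
		 * Keys that couldn't be inserted (accounting updates arriving
		 * before accounting replay has finished) keep their
		 * journal_seq; compact them to the front of the buffer so
		 * they're retried on the next flush, and drop everything that
		 * was inserted.
		 */
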
> > > +		if (could_not_insert) {
> > > +			struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> > > +
> > > +			darray_for_each(wb->flushing.keys, i)
> > > +				if (i->journal_seq)
> > > +					*dst++ = *i;
> > > +			wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> > >  		}
> > >  	}
> > >  err:
> > > +	if (ret || !could_not_insert) {
> > > +		bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > +		wb->flushing.keys.nr = 0;
> > > +	}
> > > +
> > >  	bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
> > > -	trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> > > -	bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > -	wb->flushing.keys.nr = 0;
> > > +	trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
> > 
> > I feel like the last time I looked at the write buffer stuff the flush
> > wasn't reentrant in this way. I.e., the flush switched out the active
> > buffer and so had to process all entries in the current buffer (or
> > something like that). Has something changed or do I misunderstand?
> 
> Yeah, originally we were adding keys to the write buffer directly from
> the transaction commit path, so that necessitated the super fast
> lockless stuff where we'd toggle between buffers so one was always
> available.
> 
> Now keys are pulled from the journal, so we can use (somewhat) simpler
> locking and buffering; the complication now is that we can't predict in
> advance how many keys are going to come out of the journal for the write
> buffer.
> 

Ok.

> > 
> > >  	return ret;
> > >  }
> > >  
> > > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > > index 6829d80bd181..b8289af66c8e 100644
> > > --- a/fs/bcachefs/recovery.c
> > > +++ b/fs/bcachefs/recovery.c
> > > @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> > >  			goto err;
> > >  	}
> > >  
> > > +	set_bit(BCH_FS_accounting_replay_done, &c->flags);
> > > +
> > 
> > I assume this ties into the question on the previous patch..
> > 
> > Related question.. if the write buffer can't flush during journal
> > replay, is there concern/risk of overflowing it?
> 
> Shouldn't be any actual risk. It's just new accounting updates that the
> write buffer can't flush, and those are only going to be generated by
> interior btree node updates as journal replay has to split/rewrite nodes
> to make room for its updates.
> 
> And for those new accounting updates, updates to the same counters get
> accumulated as they're flushed from the journal to the write buffer -
> see the patch for eytzinger tree accumulation. So we could only overflow
> if the number of distinct counters touched somehow was very large.
> 
> And the number of distinct counters will be growing significantly, but
> the new counters will all be for user data, not metadata.
> 
> (Except: that reminds me, we do want to add per-btree counters, so users
> can see "I have x amount of extents, x amount of dirents, etc.").
> 
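To make the overflow argument concrete: because updates to the same counter
are merged as they move from the journal into the write buffer, the buffer
grows with the number of distinct counters touched, not with the number of
updates. A self-contained sketch with hypothetical names (the real code does
the lookup via an eytzinger tree rather than a linear scan, per the patch
referenced above):

	#include <stddef.h>
	#include <stdint.h>

	struct wb_acct { uint64_t counter_id; int64_t delta; };

	/* Add one accounting update; the buffer only grows when a counter id
	 * is seen for the first time, otherwise the delta folds into the
	 * existing entry. Returns the new entry count. */
	static size_t wb_acct_add(struct wb_acct *buf, size_t nr,
				  uint64_t counter_id, int64_t delta)
	{
		for (size_t i = 0; i < nr; i++)
			if (buf[i].counter_id == counter_id) {
				buf[i].delta += delta;
				return nr;
			}
		buf[nr].counter_id = counter_id;
		buf[nr].delta = delta;
		return nr + 1;
	}
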

Heh, Ok. This all does sound a little open ended to me. Maybe the better
question is: suppose this hypothetically does happen after adding a
bunch of new counters, what would the expected side effect be in the
recovery scenario where the write buffer can't be flushed?

If write buffer updates now basically just journal a special entry, would
that mean we'd deadlock during recovery because a pinned write buffer keeps
us from inserting any further journal entries? If so, that actually seems
reasonable to me in the sense that in theory it at least doesn't break the
filesystem on-disk, but obviously it would require some kind of enhancement
in order to complete the recovery (even if what that is is currently
unknown). Hm?

Brian

