linux-kernel - Re: [PATCH v7] zswap: replace RB tree with xarray

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zfs6TP8pdmNHcDfA@google.com>
Date: Wed, 20 Mar 2024 19:34:36 +0000
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Chris Li <chrisl@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	Nhat Pham <nphamcs@...il.com>, "Matthew Wilcox (Oracle)" <willy@...radead.org>, 
	Chengming Zhou <zhouchengming@...edance.com>, Barry Song <v-songbaohua@...o.com>
Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote:
> On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > > [..]
> > > > > > > -     /* map */
> > > > > > > -     spin_lock(&tree->lock);
> > > > > > >       /*
> > > > > > > -      * The folio may have been dirtied again, invalidate the
> > > > > > > -      * possibly stale entry before inserting the new entry.
> > > > > > > +      * We finish initializing the entry while it's already in xarray.
> > > > > > > +      * This is safe because:
> > > > > > > +      *
> > > > > > > +      * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > > +      *
> > > > > > > +      * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > > +      *    The publishing order matters to prevent writeback from seeing
> > > > > > > +      *    an incoherent entry.
> > > > > >
> > > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > > so I don't think the publishing order actually matters here. Right?
> > > > > 
> > > > > Right. This comment is explaining why this publishing order does not
> > > > > matter. I think we are talking about the same thing here?
> > > > 
> > > > The comment literally says "the publishing order matters.." :)
> > > > 
> > > > I believe Johannes meant that we should only publish the entry to the
> > > > LRU once it is fully initialized, to prevent writeback from using a
> > > > partially initialized entry.
> > > > 
> > > > What I am saying is that, even if we add a partially initialized entry
> > > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > > locked in the swapcache.
> > > > 
> > > > So basically I think the comment should say:
> > > > 
> > > > 	/*
> > > > 	 * We finish initializing the entry while it's already in the
> > > > 	 * xarray. This is safe because the folio is locked in the swap
> > > > 	 * cache, which should protect against concurrent stores,
> > > > 	 * invalidations, and writeback.
> > > > 	 */
> > > > 
> > > > Johannes, what do you think?
> > > 
> > > I don't think that's quite right.
> > > 
> > > Writeback will bail on swapcache insert, yes, but it will access the
> > > entry before attempting it. If LRU publishing happened before setting
> > > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > > it would be safe to rearrange the code like this.
> > > 
> > > So LRU publishing order does matter.
> > 
> > Ah yes, you are right. entry->swpentry should be set to make sure we
> > lookup the correct entry in the swapcache and the tree.
> > 
> > Perhaps we should spell this out in the comment and make the
> > initialization ordering more explicit? Maybe something like:
> > 
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index d8a14b27adcd7..70924b437743a 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> >  		goto put_pool;
> >  
> >  insert_entry:
> > -	entry->swpentry = swp;
> > -	entry->objcg = objcg;
> > -
> >  	old = xa_store(tree, offset, entry, GFP_KERNEL);
> >  	if (xa_is_err(old)) {
> >  		int err = xa_err(old);
> > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> >  	if (old)
> >  		zswap_entry_free(old);
> >  
> > +	entry->objcg = objcg;
> >  	if (objcg) {
> >  		obj_cgroup_charge_zswap(objcg, entry->length);
> >  		count_objcg_event(objcg, ZSWPOUT);
> > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
> >  
> >  	/*
> >  	 * We finish initializing the entry while it's already in xarray.
> > -	 * This is safe because:
> > -	 *
> > -	 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > +	 * This is safe because the folio is locked in the swapcache, which
> > +	 * protects against concurrent stores and invalidations.
> >  	 *
> > -	 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -	 *    The publishing order matters to prevent writeback from seeing
> > -	 *    an incoherent entry.
> > +	 * Concurrent writeback is not possible until we add the entry to the
> > +	 * LRU. We need to at least initialize entry->swpentry *before* adding
> > +	 * the entry to the LRU to make sure writeback looks up the correct
> > +	 * entry in the swapcache.
> >  	 */
> >  	if (entry->length) {
> > +		entry->swpentry = swp;
> >  		INIT_LIST_HEAD(&entry->lru);
> >  		zswap_lru_add(&zswap_list_lru, entry);
> >  		atomic_inc(&zswap_nr_stored);
> > 
> > 
> > This also got me wondering, do we need a write barrier between
> > initializing entry->swpentry and zswap_lru_add()?
> > 
> > I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> > eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> > bug-prone.
> 
> I think it's more robust the way Chris has it now. Writeback only
> derefs ->swpentry today, but who knows if somebody wants to make a
> changes that relies on a different member. Having submembers follow
> different validity rules and timelines is error prone and makes the
> code less hackable without buying all that much. The concept of
> "publishing" an object like this is more common: if you can see it,
> you can expect it to be coherent.

Fair enough, but don't we still need a barrier there? Couldn't some
initializations still be reorder after zswap_lru_add()?