[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240130031727.GA780575@cmpxchg.org>
Date: Mon, 29 Jan 2024 22:17:27 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Chengming Zhou <zhouchengming@...edance.com>
Cc: Yosry Ahmed <yosryahmed@...gle.com>, Nhat Pham <nphamcs@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Chris Li <chriscli@...gle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and
swapoff
On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >> {
> >> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >> bool *encountered_page_in_swapcache = (bool *)arg;
> >> - struct zswap_tree *tree;
> >> - pgoff_t swpoffset;
> >> + swp_entry_t swpentry;
> >> enum lru_status ret = LRU_REMOVED_RETRY;
> >> int writeback_result;
> >>
> >> + /*
> >> + * Rotate the entry to the tail before unlocking the LRU,
> >> + * so that in case of an invalidation race concurrent
> >> + * reclaimers don't waste their time on it.
> >> + *
> >> + * If writeback succeeds, or failure is due to the entry
> >> + * being invalidated by the swap subsystem, the invalidation
> >> + * will unlink and free it.
> >> + *
> >> + * Temporary failures, where the same entry should be tried
> >> + * again immediately, almost never happen for this shrinker.
> >> + * We don't do any trylocking; -ENOMEM comes closest,
> >> + * but that's extremely rare and doesn't happen spuriously
> >> + * either. Don't bother distinguishing this case.
> >> + *
> >> + * But since they do exist in theory, the entry cannot just
> >> + * be unlinked, or we could leak it. Hence, rotate.
> >
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
>
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)
I agree with the suggestion of a central point to document this.
How about something like this:
/*
* As soon as we drop the LRU lock, the entry can be freed by
* a concurrent invalidation. This means the following:
*
* 1. We extract the swp_entry_t to the stack, allowing
* zswap_writeback_entry() to pin the swap entry and
* then validate the zwap entry against that swap entry's
* tree using pointer value comparison. Only when that
* is successful can the entry be dereferenced.
*
* 2. Usually, objects are taken off the LRU for reclaim. In
* this case this isn't possible, because if reclaim fails
* for whatever reason, we have no means of knowing if the
* entry is alive to put it back on the LRU.
*
* So rotate it before dropping the lock. If the entry is
* written back or invalidated, the free path will unlink
* it. For failures, rotation is the right thing as well.
*
* Temporary failures, where the same entry should be tried
* again immediately, almost never happen for this shrinker.
* We don't do any trylocking; -ENOMEM comes closest,
* but that's extremely rare and doesn't happen spuriously
* either. Don't bother distinguishing this case.
*/
> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
>
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
>
> /*
> * folio is locked, and the swapcache is now secured against
> * concurrent swapping to and from the slot. Verify that the
> * swap entry hasn't been invalidated and recycled behind our
> * backs (our zswap_entry reference doesn't prevent that), to
> * avoid overwriting a new swap folio with old compressed data.
> */
The bit in () is now stale, since we're not even holding a ref ;)
Otherwise, a brief note that entry can not be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of
the scheme I'd probably describe in shrink_memcg_cb(), though.
Can I ask a favor, though?
For non-critical updates to this patch, can you please make them
follow-up changes? I just sent out 20 cleanup patches on top of this
patch which would be super painful and error prone to rebase. I'd like
to avoid that if at all possible.
Powered by blists - more mailing lists