linux-kernel - Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7BL7cpnFtgKPL0Jud_2oaVwAa+jQ8V9hKWvJGVtFOtQGw@mail.gmail.com>
Date: Mon, 13 Jan 2025 14:33:12 +0800
From: Kairui Song <ryncsn@...il.com>
To: Baoquan He <bhe@...hat.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Chris Li <chrisl@...nel.org>, Barry Song <v-songbaohua@...o.com>, 
	Ryan Roberts <ryan.roberts@....com>, Hugh Dickins <hughd@...gle.com>, 
	Yosry Ahmed <yosryahmed@...gle.com>, "Huang, Ying" <ying.huang@...ux.alibaba.com>, 
	Nhat Pham <nphamcs@...il.com>, Johannes Weiner <hannes@...xchg.org>, 
	Kalesh Singh <kaleshsingh@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock

On Fri, Jan 10, 2025 at 7:24 PM Baoquan He <bhe@...hat.com> wrote:
>
> On 01/09/25 at 10:15am, Kairui Song wrote:
> > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@...hat.com> wrote:
> > >
> >
> > Thanks for the very detailed review!
> >
> > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > ......snip.....
> > > > ---
> > > >  include/linux/swap.h |   3 +-
> > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > >
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > >   * throughput.
> > > >   */
> > > >  struct percpu_cluster {
> > > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > >  };
> > > >
> > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > >                                       /* list of cluster that contains at least one free slot */
> > > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > >                                       /* list of cluster that are fragmented or contented */
> > > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > >       unsigned int pages;             /* total of usable pages of swap */
> > > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index 7795a3d27273..dadd4fead689 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> ...snip...
> > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > >
> > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > >  {
> > > > -     lockdep_assert_held(&si->lock);
> > > >       lockdep_assert_held(&ci->lock);
> > > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > >       ci->order = 0;
> > > >  }
> > > >
> > > > +/*
> > > > + * Isolate and lock the first cluster that is not contented on a list,
> > > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > > + * with list status, so cluster updaters can always know the cluster
> > > > + * list status without touching si lock.
> > > > + *
> > > > + * Note it's possible that all clusters on a list are contented so
> > > > + * this returns NULL for an non-empty list.
> > > > + */
> > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > +             struct swap_info_struct *si, struct list_head *list)
> > > > +{
> > > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > > +
> > > > +     spin_lock(&si->lock);
> > > > +
> > > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > +             goto out;
> > > > +
> > > > +     list_for_each_entry(ci, list, list) {
> > > > +             if (!spin_trylock(&ci->lock))
> > > > +                     continue;
> > > > +
> > > > +             /* We may only isolate and clear flags of following lists */
> > > > +             VM_BUG_ON(!ci->flags);
> > > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > > +
> > > > +             list_del(&ci->list);
> > > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > > +             ret = ci;
> > > > +             break;
> > > > +     }
> > > > +out:
> > > > +     spin_unlock(&si->lock);
> > > > +
> > > > +     return ret;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > -*/
> > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > > + */
> > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > >  {
> > > >       struct swap_cluster_info *ci;
> > > > +     bool ret = false;
> > > >       unsigned int idx;
> > > >
> > > > +     spin_lock(&si->lock);
> > > >       while (!list_empty(&si->discard_clusters)) {
> > > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > +             /*
> > > > +              * Delete the cluster from list but don't clear its flags until
> > > > +              * discard is done, so isolation and relocation will skip it.
> > > > +              */
> > > >               list_del(&ci->list);
> > >
> > > I don't understand above comment. ci has been taken off list. While
> > > allocation need isolate from a usable list. Even though we clear
> > > ci->flags now, how come isolation and relocation will touch it. I may
> > > miss anything here.
> >
> > There are many cases, one possible and common situation is that the
> > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > to it.
> >
> > Also, this commit removed protection of si lock on allocation, and
> > allocation path may also drop ci lock to call reclaim, which means one
> > cluster could be used or freed by anyone before allocator reacquire
> > the ci lock again. In that case, the allocator could see a discard
> > cluster.
> >
> > So we don't clear the discard flag, in case anyone misuse it.
> >
> > I can add more inline comments on this, this is already some related
> > comments above the function relocate_cluster, could add some more
> > referencing that.

Hi Baoquan,

>
> Thanks for your great explanation. I understand that si->percpu_cluster
> could point to a discarded ci, and a ci could be got from non-full,
> frag lists but later become discarded if that ci is freed on other cpu
> during cluster_reclaim_range() invocation. I haven't got how isolation
> could see a discarded ci in cluster_isolate_lock(). Could you help give
> an exmaple on how that happen?

cluster_isolate_lock shouldn't see a discard cluster, and there is a
VM_BUG_ON for that.

>
> Surely, I understand keeping the discarded flag is very necessary so
> that checking like cluster_is_usable() will return expected value.
>
> And by the way, I haven't got when the ' if (!ci->count)' case could
> happen in relocate_cluster() since we have filtered away discarded ci
> with the 'if (cluster_is_discard(ci))' checking. I asked in another
> thread, could you help explain it?

Many swap devices doesn't need discard so the cluster could be freed
directly. And actually the ci->count check in relocate_cluster is not
necessarily related to that.

The caller of relocate_cluster, may fail an allocation (eg. race with
swapoff) and that could end up calling relocate_cluster with a empty
cluster, such cluster should go to the free list (swapoff might fail
too).

The swapoff case is extremely rare but let's just be more robust here,
covering free cluster have almost no overhead but save a lot of
efforts. I can add some comments on this.

>
> static void relocate_cluster(struct swap_info_struct *si,
>                              struct swap_cluster_info *ci)
> {
>         lockdep_assert_held(&ci->lock);
>
>         /* Discard cluster must remain off-list or on discard list */
>         if (cluster_is_discard(ci))
>                 return;
>
>         if (!ci->count) {
>                 free_cluster(si, ci);
> ...
> }