linux-kernel - Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7DDUx0EYHzQfpSYyLGMTnV3q_V+kKMku+uSPRd=BxxYhg@mail.gmail.com>
Date: Mon, 13 Jan 2025 16:07:38 +0800
From: Kairui Song <ryncsn@...il.com>
To: Baoquan He <bhe@...hat.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Chris Li <chrisl@...nel.org>, Barry Song <v-songbaohua@...o.com>, 
	Ryan Roberts <ryan.roberts@....com>, Hugh Dickins <hughd@...gle.com>, 
	Yosry Ahmed <yosryahmed@...gle.com>, "Huang, Ying" <ying.huang@...ux.alibaba.com>, 
	Nhat Pham <nphamcs@...il.com>, Johannes Weiner <hannes@...xchg.org>, 
	Kalesh Singh <kaleshsingh@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock

On Mon, Jan 13, 2025 at 2:33 PM Kairui Song <ryncsn@...il.com> wrote:
>
> On Fri, Jan 10, 2025 at 7:24 PM Baoquan He <bhe@...hat.com> wrote:
> >
> > On 01/09/25 at 10:15am, Kairui Song wrote:
> > > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@...hat.com> wrote:
> > > >
> > >
> > > Thanks for the very detailed review!
> > >
> > > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > > ......snip.....
> > > > > ---
> > > > >  include/linux/swap.h |   3 +-
> > > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > > >   * throughput.
> > > > >   */
> > > > >  struct percpu_cluster {
> > > > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > > > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > > >  };
> > > > >
> > > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > > >                                       /* list of cluster that contains at least one free slot */
> > > > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > > >                                       /* list of cluster that are fragmented or contented */
> > > > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > > >       unsigned int pages;             /* total of usable pages of swap */
> > > > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > > > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index 7795a3d27273..dadd4fead689 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > ...snip...
> > > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > > >
> > > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > > >  {
> > > > > -     lockdep_assert_held(&si->lock);
> > > > >       lockdep_assert_held(&ci->lock);
> > > > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > > >       ci->order = 0;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * Isolate and lock the first cluster that is not contented on a list,
> > > > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > > > + * with list status, so cluster updaters can always know the cluster
> > > > > + * list status without touching si lock.
> > > > > + *
> > > > > + * Note it's possible that all clusters on a list are contented so
> > > > > + * this returns NULL for an non-empty list.
> > > > > + */
> > > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > > +             struct swap_info_struct *si, struct list_head *list)
> > > > > +{
> > > > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > > > +
> > > > > +     spin_lock(&si->lock);
> > > > > +
> > > > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > > +             goto out;
> > > > > +
> > > > > +     list_for_each_entry(ci, list, list) {
> > > > > +             if (!spin_trylock(&ci->lock))
> > > > > +                     continue;
> > > > > +
> > > > > +             /* We may only isolate and clear flags of following lists */
> > > > > +             VM_BUG_ON(!ci->flags);
> > > > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > > > +
> > > > > +             list_del(&ci->list);
> > > > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > > > +             ret = ci;
> > > > > +             break;
> > > > > +     }
> > > > > +out:
> > > > > +     spin_unlock(&si->lock);
> > > > > +
> > > > > +     return ret;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > > -*/
> > > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > > > + */
> > > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > >  {
> > > > >       struct swap_cluster_info *ci;
> > > > > +     bool ret = false;
> > > > >       unsigned int idx;
> > > > >
> > > > > +     spin_lock(&si->lock);
> > > > >       while (!list_empty(&si->discard_clusters)) {
> > > > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > > +             /*
> > > > > +              * Delete the cluster from list but don't clear its flags until
> > > > > +              * discard is done, so isolation and relocation will skip it.
> > > > > +              */
> > > > >               list_del(&ci->list);
> > > >
> > > > I don't understand above comment. ci has been taken off list. While
> > > > allocation need isolate from a usable list. Even though we clear
> > > > ci->flags now, how come isolation and relocation will touch it. I may
> > > > miss anything here.
> > >
> > > There are many cases, one possible and common situation is that the
> > > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > > to it.
> > >
> > > Also, this commit removed protection of si lock on allocation, and
> > > allocation path may also drop ci lock to call reclaim, which means one
> > > cluster could be used or freed by anyone before allocator reacquire
> > > the ci lock again. In that case, the allocator could see a discard
> > > cluster.
> > >
> > > So we don't clear the discard flag, in case anyone misuse it.
> > >
> > > I can add more inline comments on this, this is already some related
> > > comments above the function relocate_cluster, could add some more
> > > referencing that.
>
> Hi Baoquan,
>
> >
> > Thanks for your great explanation. I understand that si->percpu_cluster
> > could point to a discarded ci, and a ci could be got from non-full,
> > frag lists but later become discarded if that ci is freed on other cpu
> > during cluster_reclaim_range() invocation. I haven't got how isolation
> > could see a discarded ci in cluster_isolate_lock(). Could you help give
> > an exmaple on how that happen?
>
> cluster_isolate_lock shouldn't see a discard cluster, and there is a
> VM_BUG_ON for that.

Oh, now I realize what you mean, the comment in
swap_do_scheduled_discard mentioned cluster_isolate_lock may see a
discard cluster, that is not true, it was added in an early version of
this series and forgot to update the comment. I'll just drop that.

>
> >
> > Surely, I understand keeping the discarded flag is very necessary so
> > that checking like cluster_is_usable() will return expected value.
> >
> > And by the way, I haven't got when the ' if (!ci->count)' case could
> > happen in relocate_cluster() since we have filtered away discarded ci
> > with the 'if (cluster_is_discard(ci))' checking. I asked in another
> > thread, could you help explain it?
>
> Many swap devices doesn't need discard so the cluster could be freed
> directly. And actually the ci->count check in relocate_cluster is not
> necessarily related to that.
>
> The caller of relocate_cluster, may fail an allocation (eg. race with
> swapoff) and that could end up calling relocate_cluster with a empty
> cluster, such cluster should go to the free list (swapoff might fail
> too).
>
> The swapoff case is extremely rare but let's just be more robust here,
> covering free cluster have almost no overhead but save a lot of
> efforts. I can add some comments on this.
>
> >
> > static void relocate_cluster(struct swap_info_struct *si,
> >                              struct swap_cluster_info *ci)
> > {
> >         lockdep_assert_held(&ci->lock);
> >
> >         /* Discard cluster must remain off-list or on discard list */
> >         if (cluster_is_discard(ci))
> >                 return;
> >
> >         if (!ci->count) {
> >                 free_cluster(si, ci);
> > ...
> > }