linux-kernel - Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin


Message-ID: <CAMgjq7D9-6JXOzpd18t8MSBAotHgEG2YZbi4efNkJiwiSJyJmw@mail.gmail.com>
Date: Wed, 27 Mar 2024 15:14:03 +0800
From: Kairui Song <ryncsn@...il.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: linux-mm@...ck.org, Chris Li <chrisl@...nel.org>, Minchan Kim <minchan@...nel.org>, 
	Barry Song <v-songbaohua@...o.com>, Ryan Roberts <ryan.roberts@....com>, 
	Yu Zhao <yuzhao@...gle.com>, SeongJae Park <sj@...nel.org>, David Hildenbrand <david@...hat.com>, 
	Yosry Ahmed <yosryahmed@...gle.com>, Johannes Weiner <hannes@...xchg.org>, 
	Matthew Wilcox <willy@...radead.org>, Nhat Pham <nphamcs@...il.com>, 
	Chengming Zhou <zhouchengming@...edance.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin

On Wed, Mar 27, 2024 at 2:49 PM Huang, Ying <ying.huang@...el.com> wrote:
>
> Kairui Song <ryncsn@...il.com> writes:
>
> > On Wed, Mar 27, 2024 at 2:24 PM Huang, Ying <ying.huang@...el.com> wrote:
> >>
> >> Kairui Song <ryncsn@...il.com> writes:
> >>
> >> > From: Kairui Song <kasong@...cent.com>
> >> >
> >> > Interestingly the major performance overhead of synchronous swapin is actually
> >> > from the workingset nodes update, that's because synchronous swap in
> >>
> >> If it's the major overhead, why not make it the first optimization?
> >
> > This performance issue became much more obvious after doing the other
> > optimizations, and those optimizations are for general swapin, not only
> > for synchronous swapin. That's also how I optimized things step by
> > step, so I kept my patch order...
> >
> > And it is easier to do this after Patch 8/10 which introduces the new
> > interface for swap cache.
> >
> >>
> >> > keeps adding single folios into an xa_node, making the node no longer
> >> > a shadow node, so it has to be removed from shadow_nodes; the folio is
> >> > then removed very shortly, making the node a shadow node again,
> >> > so it has to be added back to shadow_nodes.
> >>
> >> The folio is removed only if should_try_to_free_swap() returns true?
> >>
> >> > Mark synchronous swapin folios with a special bit in the swap entry
> >> > embedded in folio->swap, as we still have some usable bits there. Skip
> >> > the workingset node update on insertion of such a folio because it will
> >> > be removed very quickly, and the removal will trigger the update,
> >> > ensuring the workingset info is eventually consistent.
> >>
> >> Is this safe?  Is it possible for the shadow node to be reclaimed after
> >> the folio is added into the node and before it is removed?
> >
> > If an xa node contains any non-shadow entry, it can't be reclaimed;
> > shadow_lru_isolate() will check and skip such nodes in case of a race.
>
> In shadow_lru_isolate(),
>
>         /*
>          * The nodes should only contain one or more shadow entries,
>          * no pages, so we expect to be able to remove them all and
>          * delete and free the empty node afterwards.
>          */
>         if (WARN_ON_ONCE(!node->nr_values))
>                 goto out_invalid;
>         if (WARN_ON_ONCE(node->count != node->nr_values))
>                 goto out_invalid;
>
> So, this isn't considered normal and will cause a warning now.

Yes, I added an exception in this patch:
-       if (WARN_ON_ONCE(node->count != node->nr_values))
+       if (WARN_ON_ONCE(node->count != node->nr_values && mapping->host != NULL))

This code is not a good final solution, but the idea might not be that
bad. list_lru provides operations like LRU_ROTATE, so we could even
lazily remove all the nodes as a general optimization, or add a
threshold for adding/removing a node from the LRU.
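
To make the add/remove thrashing concrete, here is a toy userspace
model, not kernel code. It assumes a node is tracked on the shadow list
only while it is non-empty and holds nothing but shadow entries, and it
counts list operations for one node during a run of synchronous
swapins. All toy_* names and the skip_on_insert flag are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an xa_node; only the fields the argument needs. */
struct toy_node {
	unsigned int count;	/* occupied slots (folios + shadow entries) */
	unsigned int nr_values;	/* slots holding shadow (value) entries */
	bool on_shadow_lru;	/* is the node linked on the shadow list? */
};

/*
 * Track the node only while it is non-empty and shadow-only.
 * Returns the number of list operations performed (0 or 1).
 */
static unsigned int toy_update_node(struct toy_node *node)
{
	bool shadow_only = node->count && node->count == node->nr_values;

	assert(node->nr_values <= node->count);
	if (shadow_only == node->on_shadow_lru)
		return 0;
	node->on_shadow_lru = shadow_only;
	return 1;
}

/*
 * Swap in every entry of a node that starts out holding nr_swapins
 * shadow entries and is already on the list. With skip_on_insert set,
 * the update on insertion is skipped and only the removal updates.
 */
static unsigned int toy_simulate_swapins(unsigned int nr_swapins,
					 bool skip_on_insert)
{
	struct toy_node node = { nr_swapins, nr_swapins, true };
	unsigned int list_ops = 0;

	for (unsigned int i = 0; i < nr_swapins; i++) {
		node.nr_values--;	/* folio replaces a shadow entry */
		if (!skip_on_insert)
			list_ops += toy_update_node(&node);
		node.count--;		/* folio is freed right away */
		list_ops += toy_update_node(&node);
	}
	return list_ops;
}
```

For a node with 64 shadow entries, toy_simulate_swapins(64, false)
performs 127 list operations, while toy_simulate_swapins(64, true)
performs 1: the final removal that takes the now-empty node off the
list. The model ignores locking and reclaim racing with the window
where the node state is stale.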

>
> >>
> >> If so, we may consider some other methods.  Make shadow_nodes per-cpu?
> >
> > That's also an alternative solution if there are other risks.
>
> This appears to be a more general and cleaner optimization.

I'm not sure whether synchronization between CPUs would add more
overhead: shadow nodes are globally shared, and one node can be
referenced by multiple CPUs. I can give it a try to see if this is
doable. Maybe a per-cpu batch is better, but synchronization might
still be an issue.
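
A rough single-threaded sketch of what a per-cpu batch could look like,
only to illustrate the trade-off. Everything here is hypothetical
(toy_* names, batch size, CPU count); it is not an existing kernel
interface. Each "CPU" queues node updates locally and takes the shared
list lock once per full batch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS	4
#define TOY_BATCH	8
#define TOY_NR_NODES	16

/* Pending updates queued by one CPU (hypothetical). */
struct toy_batch {
	int node_id[TOY_BATCH];
	size_t nr;
};

/* Shared list state; counters stand in for the real work. */
struct toy_shadow_lru {
	unsigned long lock_acquisitions;
	unsigned long updates_applied;
};

/* Apply one CPU's pending updates under a single lock round trip. */
static void toy_flush(struct toy_shadow_lru *lru, struct toy_batch *batch)
{
	if (!batch->nr)
		return;
	lru->lock_acquisitions++;
	/* A real flush would re-check each queued node here. */
	lru->updates_applied += batch->nr;
	batch->nr = 0;
}

/* Queue an update locally; flush only when the batch is full. */
static void toy_queue_update(struct toy_shadow_lru *lru,
			     struct toy_batch *batch, int node_id)
{
	assert(batch->nr < TOY_BATCH);
	batch->node_id[batch->nr++] = node_id;
	if (batch->nr == TOY_BATCH)
		toy_flush(lru, batch);
}

/* Every CPU touches the same small set of shared nodes. */
static struct toy_shadow_lru toy_run(unsigned long updates_per_cpu)
{
	struct toy_shadow_lru lru = { 0, 0 };
	struct toy_batch batches[TOY_NR_CPUS] = { { { 0 }, 0 } };

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		for (unsigned long i = 0; i < updates_per_cpu; i++)
			toy_queue_update(&lru, &batches[cpu],
					 (int)(i % TOY_NR_NODES));
		toy_flush(&lru, &batches[cpu]);
	}
	return lru;
}
```

With 100 updates per CPU, toy_run(100) applies 400 updates with 52 lock
acquisitions instead of 400. Note that the same node ids end up queued
on every CPU, so a real flush would have to cope with stale or
conflicting pending updates for one node, which is the synchronization
problem mentioned above.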