linux-kernel - Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list


Message-ID: <CACePvbWe9wraG2FjBcX9OmHN5ynB4et9WEHqh6NPSVK5mzJi2A@mail.gmail.com>
Date: Fri, 26 Jul 2024 00:10:31 -0700
From: Chris Li <chrisl@...nel.org>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: Ryan Roberts <ryan.roberts@....com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Kairui Song <kasong@...cent.com>, Hugh Dickins <hughd@...gle.com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	Barry Song <baohua@...nel.org>
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@...el.com> wrote:
>
> Chris Li <chrisl@...nel.org> writes:
>
> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@...el.com> wrote:
> >> > If the freeing of swap entries is randomly distributed, you need 16
> >> > contiguous swap entries free at the same time at a 16-aligned base
> >> > location. The total order 4 free swap space is much lower than the
> >> > order 0 allocatable swap space.
> >> > If each entry is free with 50% probability (swapfile half full), then
> >> > the probability that 16 swap entries are contiguously free is
> >> > 0.5^16 = 1.5e-5. If the swapfile is 80% full, that number drops to
> >> > 0.2^16 = 6.5e-12.
> >>
> >> This depends on the workload.  Quite a few workloads show some degree
> >> of spatial locality.  For a workload with no spatial locality at all, as
> >> above, mTHP may not be a good choice in the first place.
> >
> > The fragmentation comes from the order 0 entries, not from the mTHP. mTHP
> > has its own valid use cases, which should be separate from how you
> > use the order 0 entries. That is why I consider that this kind of strategy
> > only works in the lucky case. I would much prefer a strategy that
> > is guaranteed to work and does not depend on luck.
>
> It seems that you have some perfect solution.  Will learn it when you
> post it.

No, I don't have a perfect solution. I see putting a limit on order 0 swap
usage and writing out discontiguous swap entries from a folio as more
deterministic approaches that do not depend on luck. Both have their price
to pay as well.
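
As a rough illustration of the estimate quoted earlier in this thread, here
is a small Python sketch that reproduces the numbers. It assumes each swap
entry is free independently with the same probability, which real workloads
with spatial locality need not satisfy. The function name is hypothetical
and not taken from the kernel.

```python
def prob_aligned_slot_free(fill_ratio: float, order: int) -> float:
    """Chance that all 2**order entries of one aligned slot are free.

    Assumption (for illustration only): every entry is free independently
    with probability 1 - fill_ratio.
    """
    if not 0.0 <= fill_ratio <= 1.0:
        raise ValueError("fill_ratio must be between 0 and 1")
    nr_entries = 1 << order  # order 4 -> 16 entries
    return (1.0 - fill_ratio) ** nr_entries


if __name__ == "__main__":
    for fill in (0.5, 0.8):
        p0 = prob_aligned_slot_free(fill, 0)
        p4 = prob_aligned_slot_free(fill, 4)
        print(f"swapfile {fill:.0%} full: order 0 = {p0:.2f}, order 4 = {p4:.2e}")
```

Under this model the order 4 chance falls off much faster than the order 0
chance as the swapfile fills up.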

>
> >> >> - Order-4 pages need to be swapped out, but not enough order-4 non-full
> >> >>   clusters are available.
> >> >
> >> > Exactly.
> >> >
> >> >>
> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> >> the various situations automatically.
> >> >
> >> > There is no easy way to migrate swap entries to different locations.
> >> > That is why I like to have discontiguous swap entries allocation for
> >> > mTHP.
> >>
> >> We suggest migrating non-full swap clusters among different lists, not
> >> swap entries.
> >
> > Then you have the downside of reducing the total number of high order
> > clusters. By chance, it is much easier to fragment a cluster than to
> > defragment one.  The orders of clusters have a natural
> > tendency to move down rather than up, given a long enough period of
> > random access. We will likely run out of high order clusters in the
> > long run if we don't have any separation of orders.
>
> As in my example above, you may have almost 0 high-order clusters forever.
> So, your solution only works for very specific use cases.  It's not a
> general solution.

One simple solution is an optional limit on order 0 swap usage.
I understand you don't like that option, but so far there is no other
easy solution that achieves the same effectiveness. If there is, I
would like to hear it.

>
> >> >> But yes, data is needed for any performance related change.
> >>
> >> BTW: I think non-full cluster isn't a good name.  Partial cluster is
> >> much better and follows the same convention as partial slab.
> >
> > I am not opposed to it. The only reason I hold off on the rename is
> > because there are patches from Kairui I am testing depending on it.
> > Let's finish up the V5 patch with the swap cache reclaim code path
> > then do the renaming as one batch job. We actually have more than one
> > list holding partially full clusters. That helps reduce repeated
> > scanning of clusters that are not full but also cannot provide
> > swap entries for this order.  Naming just one of them
> > "partial" is not precise either, because the other lists are also
> > partially full. We'd better give them precise meanings systematically.
>
> I don't think that it's hard to do a search/replace before the next
> version.

The overhead is on the other internal experimental patches. Again,
I am not opposed to renaming it. I just want to do it in one batch,
including the other list names, rather than many times.

Chris