linux-kernel - Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order

Message-ID: <CACePvbW9scoOJzA_O2fPBCvZBwa0yQumFnXuhdtO0pkutD2P+Q@mail.gmail.com>
Date: Mon, 19 Aug 2024 14:27:11 -0700
From: Chris Li <chrisl@...nel.org>
To: Kairui Song <ryncsn@...il.com>
Cc: "Huang, Ying" <ying.huang@...el.com>, Hugh Dickins <hughd@...gle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Ryan Roberts <ryan.roberts@....com>, 
	Kalesh Singh <kaleshsingh@...gle.com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	Barry Song <baohua@...nel.org>
Subject: Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order

Hi Kairui,

On Mon, Aug 19, 2024 at 1:48 AM Kairui Song <ryncsn@...il.com> wrote:
>
> On Mon, Aug 19, 2024 at 4:31 PM Huang, Ying <ying.huang@...el.com> wrote:
> >
> > Kairui Song <ryncsn@...il.com> writes:
> >
> > > On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@...nel.org> wrote:
> > >>
> > >> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@...el.com> wrote:
> > >> >
> > >> > Chris Li <chrisl@...nel.org> writes:
> > >> >
> > >> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@...el.com> wrote:
> > >> > >>
> > >> > >> Hi, Chris,
> > >> > >>
> > >> > >> Chris Li <chrisl@...nel.org> writes:
> > >> > >>
> > >> > >> > This is the short term solutions "swap cluster order" listed
> > >> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
> > >> > >> > LSF/MM conference.
> > >> > >> >
> > >> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > >> > >> > orders" is introduced, it only allocates the mTHP swap entries
> > >> > >> > from the new empty cluster list.  It has a fragmentation issue
> > >> > >> > reported by Barry.
> > >> > >> >
> > >> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > >> > >> >
> > >> > >> > The reason is that all the empty clusters have been exhausted while
> > >> > >> > there are plenty of free swap entries in the cluster that are
> > >> > >> > not 100% free.
> > >> > >> >
> > >> > >> > Remember the swap allocation order in the cluster.
> > >> > >> > Keep track of the per order non full cluster list for later allocation.
> > >> > >> >
> > >> > >> > This series gives the swap SSD allocation a new separate code path
> > >> > >> > from the HDD allocation. The new allocator uses the cluster list only
> > >> > >> > and no longer scans swap_map[] globally without a lock.
> > >> > >>
> > >> > >> This sounds good.  Can we use SSD allocation method for HDD too?
> > >> > >> We may not need a swap entry allocator optimized for HDD.
> > >> > >
> > >> > > Yes, that is the plan as well. That way we can completely get rid of
> > >> > > the old scan_swap_map_slots() code.
> > >> >
> > >> > Good!
> > >> >
> > >> > > However, considering the size of the series, let's focus on the
> > >> > > cluster allocation path first, get it tested and reviewed.
> > >> >
> > >> > OK.
> > >> >
> > >> > > For HDD optimization, mostly just the new block allocation portion
> > >> > > needs a separate code path from the new cluster allocator so that it
> > >> > > does not do the per-CPU allocation.  Allocating from the non-free list
> > >> > > doesn't need to change either.
> > >> >
> > >> > I suggest not consider HDD optimization at all.  Just use SSD algorithm
> > >> > to simplify.
> > >>
> > >> Adding a global next allocating CI rather than the per CPU next CI
> > >> pointer is pretty trivial as well. It is just a different way to fetch
> > >> the next cluster pointer.
> > >
> > > Yes, if we enable the new cluster based allocator for HDD, we can
> > > enable THP and mTHP for HDD too, and use a global cluster_next instead
> > > of Per-CPU for it.
> > > It's easy to do with minimal changes, and should actually boost
> > > performance for HDD SWAP. Currently testing this locally.
> >
> > I think that it's better to start with SSD algorithm.  Then, you can add
> > HDD specific optimization on top of it with supporting data.
>
> Yes, we are having the same idea.
>
> >
> > BTW, I don't know why HDD shouldn't use per-CPU cluster.  Sequential
> > writing is more important for HDD.
> > >> > >>
> > >> > >> Hi, Hugh,
> > >> > >>
> > >> > >> What do you think about this?
> > >> > >>
> > >> > >> > This streamlines the swap allocation for SSD. The code matches the
> > >> > >> > execution flow much better.
> > >> > >> >
> > >> > >> > User impact: For users that allocate and free mixed-order mTHP swap
> > >> > >> > entries, it greatly improves the success rate of the mTHP swap
> > >> > >> > allocation after the initial phase.
> > >> > >> >
> > >> > >> > It also performs faster when the swapfile is close to full, because the
> > >> > >> > allocator can get the non full cluster from a list rather than scanning
> > >> > >> > a lot of swap_map entries.
> > >> > >>
> > >> > >> Do you have some test results to prove this?  Or which test below can
> > >> > >> prove this?
> > >> > >
> > >> > > The two zram tests are already proving this. The system time
> > >> > > improvement is about 2% on my low CPU count machine.
> > >> > > Kairui has a higher core count machine and the difference is higher
> > >> > > there. The theory is that higher CPU count has higher contentions.
> > >> >
> > >> > I will interpret this as the performance is better in theory.  But
> > >> > there's almost no measurable results so far.
> > >>
> > >> I am trying to understand why you don't see the performance improvement in
> > >> the zram setup in my cover letter as a measurable result?
> > >
> > > Hi Ying, you can check the test with the 32 cores AMD machine in the
> > > cover letter, as Chris pointed out the performance gain is higher as
> > > core number grows. The performance gain is still not much (*yet, based
> > > on this design thing can go much faster after HDD codes are
> > > dropped which enables many other optimizations, this series
> > > is mainly focusing on the fragmentation issue), but I think a
> > > stable ~4 - 8% improvement with a build linux kernel test
> > > could be considered measurable?
> >
> > Is this the test result for "when the swapfile is close to full"?
>
> Yes, it's about 60% to 90% full during the whole test process. If ZRAM
> is completely full the workload will go OOM, but testing with madvise

BTW, one trick to keep a completely full ZRAM from causing OOM is to
set up two zram devices and assign them different priorities. Let the
first zram device get 100% full; swap then overflows to the second
zram device, which still has free swap entries, so the OOM is avoided.
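
The overflow behaviour can be sketched with a toy model. This is
not kernel code: the SwapDevice class, the allocate() helper, the
device names and the sizes are all made up for illustration. The one
rule assumed from the real system is that swap entries are taken from
the highest-priority device until it is full. On a real machine the
priorities would be set when the devices are enabled, for example with
the priority option of swapon.

```python
from dataclasses import dataclass


@dataclass
class SwapDevice:
    name: str          # illustrative device name
    priority: int      # higher priority is used first
    capacity: int      # total number of swap entries
    used: int = 0      # entries currently allocated

    @property
    def free(self) -> int:
        return self.capacity - self.used


def allocate(devices, count):
    """Take `count` entries, draining higher-priority devices first.

    Returns {device name: entries taken}. Raises MemoryError (the toy
    stand-in for OOM) when all devices together cannot satisfy the request.
    """
    if count > sum(dev.free for dev in devices):
        raise MemoryError("all swap devices are full")
    taken = {}
    for dev in sorted(devices, key=lambda d: d.priority, reverse=True):
        n = min(count, dev.free)
        if n:
            dev.used += n
            taken[dev.name] = n
            count -= n
        if count == 0:
            break
    return taken


if __name__ == "__main__":
    # Small high-priority device under test, larger low-priority overflow.
    devices = [SwapDevice("zram0", 10, 100), SwapDevice("zram1", 5, 400)]
    print(allocate(devices, 90))   # served by zram0 only
    print(allocate(devices, 30))   # fills zram0, remainder goes to zram1
```

In this model the first device can be driven to 100% full, which is the
case under test, while the second device absorbs the overflow.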

Chris

> showed no performance drop.