Message-ID: <CAMgjq7CM8-ZyQ5E-6Qwky5uwx+w=nrc4_nMX0oWHzv3Q3xz=Lg@mail.gmail.com>
Date: Thu, 22 May 2025 12:13:37 +0800
From: Kairui Song <ryncsn@...il.com>
To: Nhat Pham <nphamcs@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, Hugh Dickins <hughd@...gle.com>, Chris Li <chrisl@...nel.org>,
David Hildenbrand <david@...hat.com>, Yosry Ahmed <yosryahmed@...gle.com>,
"Huang, Ying" <ying.huang@...ux.alibaba.com>, Johannes Weiner <hannes@...xchg.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>, Kalesh Singh <kaleshsingh@...gle.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Tim Chen <tim.c.chen@...ux.intel.com>,
Ryan Roberts <ryan.roberts@....com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 28/28] mm, swap: implement dynamic allocation of swap table
On Thu, May 22, 2025 at 3:38 AM Nhat Pham <nphamcs@...il.com> wrote:
>
> On Wed, May 14, 2025 at 1:20 PM Kairui Song <ryncsn@...il.com> wrote:
> >
> > From: Kairui Song <kasong@...cent.com>
> >
> > Now swap table is cluster based, which means free clusters can free its
> > table since no one should modify it.
> >
> > There could be speculative readers, like swap cache look up, protect
> > them by making them RCU safe. All swap table should be filled with null
> > entries before free, so such readers will either see a NULL pointer or
> > a null filled table being lazy freed.
> >
> > On allocation, allocate the table when a cluster is used by any order.
> >
> > This way, we can reduce the memory usage of large swap device
> > significantly.
> >
> > This idea to dynamically release unused swap cluster data was initially
> > suggested by Chris Li while proposing the cluster swap allocator and
> > I found it suits the swap table idea very well.
> >
> > Suggested-by: Chris Li <chrisl@...nel.org>
> > Signed-off-by: Kairui Song <kasong@...cent.com>
>
> Nice optimization!
Thanks!
>
> However, please correct me if I'm wrong - but we are only dynamically
> allocating the swap table with this patch. What we are getting here is
> the dynamic allocation of the swap entries' metadata (through the swap
> table), which my virtual swap prototype already provides. The cluster
> metadata struct (struct swap_cluster_info) itself is statically
> allocated still (at swapon time), correct?
That's true for now, but note that the static data is much smaller and
unified now, which enables more work in the following ways
(I didn't include it in the series because it is getting too long already):

The static data is only 48 bytes per 2M of swap space, so for example
with a 1TB swap device / space it's only about 20M in total. Previously
it would be at least 768M (and could be much higher, as I'm only counting
swap_map and the cgroup array here). The memory overhead is now roughly
0.002% of the swap space.
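For reference, the arithmetic behind these figures can be checked with a
small user-space C sketch. It is only an illustration: the function and
macro names are made up here, and the per-entry sizes (4K pages, 1 byte
of swap_map and 2 bytes of cgroup array per entry) are assumptions chosen
to match the 768M figure. Computed strictly from 48 bytes per 2M cluster,
1TB comes out at 24M (about 0.0023%), the same order of magnitude as the
20M quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants, based on the figures quoted in this mail. */
#define PAGE_BYTES         4096ULL        /* assumption: 4K pages */
#define CLUSTER_BYTES      (2ULL << 20)   /* one cluster covers 2M of swap */
#define CLUSTER_META_BYTES 48ULL          /* static metadata per cluster */
#define SWAP_MAP_BYTES     1ULL           /* assumption: 1 byte per entry */
#define SWAP_CGROUP_BYTES  2ULL           /* assumption: 2 bytes per entry */

/* Static metadata needed with the cluster-based layout. */
static inline uint64_t new_static_bytes(uint64_t swap_bytes)
{
    assert(swap_bytes % CLUSTER_BYTES == 0);
    return swap_bytes / CLUSTER_BYTES * CLUSTER_META_BYTES;
}

/* Static metadata of the old layout, counting only swap_map and cgroup array. */
static inline uint64_t old_static_bytes(uint64_t swap_bytes)
{
    return swap_bytes / PAGE_BYTES * (SWAP_MAP_BYTES + SWAP_CGROUP_BYTES);
}

/* New static overhead as a percentage of the swap space. */
static inline double new_overhead_percent(uint64_t swap_bytes)
{
    return 100.0 * (double)new_static_bytes(swap_bytes) / (double)swap_bytes;
}
```

With these assumptions, new_static_bytes(1ULL << 40) evaluates to 24M and
old_static_bytes(1ULL << 40) to 768M.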
Also, the static data is now only an intermediate cluster table, used in
only one place (si->cluster_info), so reallocating it is doable now:
readers of the actual swap table are protected by RCU and won't modify
the cluster metadata. The only updaters of cluster metadata are
allocation and freeing, and they can be organized in better ways to
allow the cluster data to be reallocated.
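A rough user-space sketch of what such a reallocation could look like
follows. It is not code from the series: struct cluster_meta, struct
swap_dev and extend_cluster_info() are hypothetical stand-ins, and the
synchronization against allocation/freeing (and any RCU-deferred free of
the old array) is left out entirely. The point it illustrates is that
only the small per-cluster metadata is copied, while the swap tables
themselves stay where they are.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-cluster metadata. */
struct cluster_meta {
    unsigned long *table;   /* per-cluster swap table, allocated on demand */
    unsigned int count;     /* entries in use */
};

/* Hypothetical stand-in for the swap device. */
struct swap_dev {
    struct cluster_meta *cluster_info;
    size_t nr_clusters;
};

/* Grow the cluster array to new_nr entries; returns 0 on success, -1 on failure. */
static inline int extend_cluster_info(struct swap_dev *si, size_t new_nr)
{
    struct cluster_meta *old_ci, *new_ci;

    assert(si != NULL);
    if (new_nr <= si->nr_clusters)
        return 0;               /* nothing to do, never shrink here */

    old_ci = si->cluster_info;
    new_ci = calloc(new_nr, sizeof(*new_ci));
    if (!new_ci)
        return -1;

    /* Copy only the metadata; the tables it points to are not moved. */
    if (si->nr_clusters)
        memcpy(new_ci, old_ci, si->nr_clusters * sizeof(*new_ci));

    /* A real implementation would have to order this switch against
     * concurrent allocation/freeing; this sketch is single-threaded. */
    si->cluster_info = new_ci;
    si->nr_clusters = new_nr;
    free(old_ci);
    return 0;
}
```

After a call such as extend_cluster_info(&si, 16), any table pointer
stored in an existing cluster is still valid, and the new clusters start
out with a NULL table.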
And due to the low memory overhead of cluster metadata, it's totally
acceptable to preallocate a much larger space now. For example, we can
always preallocate a 4TB space on boot, that's 80M in total. That might
still seem non-trivial, but there is another planned series to make the
vmalloc space dynamic too, leveraging the page table directly, so the
20M per TB overhead can be avoided as well. Not sure if it will be
needed though, the overhead is so tiny already.
So in summary, what I have in mind is that we can either:
- Extend the cluster data when it's not enough (or getting fragmented).
  The table data stays accessible during the reallocation and the
  amount of copied data is minimal, so it shouldn't be a heavy operation.
- Preallocate a larger amount of cluster data on swapon; the
  overhead is still very controllable.
- (Once we have a dynamic vmalloc) preallocate a super large space
  for swap and allocate each page when needed.
These ideas can be combined, and they are related to each other.
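As a loose user-space analogy for the third option (not the planned
kernel mechanism, which does not exist in this series), an anonymous
mmap() reserves a large zero-filled virtual range whose pages are only
backed by memory once touched. The struct layout and the helper names
below are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the per-cluster metadata. */
struct cluster_meta {
    unsigned long *table;
    unsigned int count;
    unsigned int flags;
};

/* Reserve address space for nr_clusters entries; pages are populated lazily. */
static inline struct cluster_meta *reserve_cluster_space(size_t nr_clusters)
{
    void *p;

    assert(nr_clusters > 0);
    p = mmap(NULL, nr_clusters * sizeof(struct cluster_meta),
             PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release the whole reservation again. */
static inline void release_cluster_space(struct cluster_meta *ci,
                                         size_t nr_clusters)
{
    if (ci)
        munmap(ci, nr_clusters * sizeof(struct cluster_meta));
}
```

Reserving enough entries for 4TB of 2M clusters this way costs address
space up front, while physical pages are only consumed for the clusters
that actually get written.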
> That will not work for a
> large virtual swap space :( So unfortunately, even with this swap
> table series, swap virtualization is still not trivial - definitely
> not as trivial as a new swap device type...
>
> Reading your physical swapfile allocator gives me some ideas though -
> let me build it into my prototype :) I'll send it out once it's ready.
>
Yeah, a virtual swap is definitely not trivial. It's challenging and
very important, just like you have demonstrated. It requires quite
some work beyond metadata-level things; I never expected it to be as
simple as "just another swap table entry type" :)

What I meant is that to get it done with minimal overhead and better
flexibility, swap needs better infrastructure, which is what this
series is working on.