Message-ID: <87zg0pfyux.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 11 Oct 2023 14:37:58 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Matthew Wilcox <willy@...radead.org>,
Gao Xiang <xiang@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Yang Shi <shy828301@...il.com>, Michal Hocko <mhocko@...e.com>,
<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>
Subject: Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
Ryan Roberts <ryan.roberts@....com> writes:
> Hi All,
>
> This is an RFC for a small series to add support for swapping out small-sized
> THP without needing to first split the large folio via __split_huge_page(). It
> closely follows the approach already used by PMD-sized THP.
>
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [1].
>
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). However, I have a few questions on
> whether we should consider relaxing those requirements in certain circumstances:
>
>
> 1) block-backed vs file-backed
> ==============================
>
> The code only attempts to allocate a contiguous set of entries if swap is backed
> by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
> ("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
> stated "It's hard to write a whole transparent huge page (THP) to a file backed
> swap device". But didn't state why. Does this imply there is a size limit at
> which it becomes hard? And does that therefore imply that for "small enough"
> sizes we should now allow use with file-backed swap?
>
> This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
> swap: fix allocating cluster for swapfile by mistake"), which said the original
> commit was using the wrong flag to determine if it was a block device and
> therefore in some cases was actually doing large allocations for a file-backed
> swap device, and this was causing file-system corruption. But that implies some
> sort of correctness issue to me, rather than the performance issue I inferred
> from the original commit.
>
> If anyone can offer an explanation, that would be helpful in determining if we
> should allow some large sizes for file-backed swap.

Swap uses 'swap extents' (swap_info_struct.swap_extent_root) to map from
swap offset to storage block number. For block-backed swap, the mapping
is purely linear, so you can use an arbitrarily large page size. But for
file-backed swap, only PAGE_SIZE alignment is guaranteed.
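
The extent mapping described above can be sketched in userspace C. This is
only an illustration, not the kernel code: `struct extent` is a simplified,
hypothetical stand-in for the kernel's swap extent, the helper names
`offset_to_block()` and `range_is_contiguous()` are made up for this
example, and a linear search replaces the tree lookup. It shows why a run of
consecutive swap offsets is always consecutive on storage with a single
linear extent, but may straddle two extents on a file-backed device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a swap extent. */
struct extent {
    uint64_t start_page;   /* first swap offset covered by this extent */
    uint64_t nr_pages;     /* number of pages covered */
    uint64_t start_block;  /* first storage block, in PAGE_SIZE units */
};

/*
 * Map a swap offset to a storage block. Returns false if no extent
 * covers the offset. Linear search for clarity only.
 */
static bool offset_to_block(const struct extent *ext, size_t n,
                            uint64_t offset, uint64_t *block)
{
    for (size_t i = 0; i < n; i++) {
        if (offset >= ext[i].start_page &&
            offset - ext[i].start_page < ext[i].nr_pages) {
            *block = ext[i].start_block + (offset - ext[i].start_page);
            return true;
        }
    }
    return false;
}

/* True if swap offsets [offset, offset + nr) map to consecutive blocks. */
static bool range_is_contiguous(const struct extent *ext, size_t n,
                                uint64_t offset, uint64_t nr)
{
    uint64_t first, cur;

    assert(nr > 0);
    if (!offset_to_block(ext, n, offset, &first))
        return false;
    for (uint64_t i = 1; i < nr; i++) {
        if (!offset_to_block(ext, n, offset + i, &cur) || cur != first + i)
            return false;
    }
    return true;
}
```

With one extent covering the whole device, any range passes the check. With
two extents such as `{0, 6, 100}` and `{6, 10, 500}`, a 16-page range starting
at offset 0 does not, because offset 6 jumps to block 500.
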
> 2) rotating vs non-rotating
> ===========================
>
> I notice that the clustered approach is only used for non-rotating swap. That
> implies that for rotating media, we will always fail a large allocation, and
> fall back to splitting THPs to single pages. Which implies that the regression
> I'm fixing here may still be present on rotating media? Or perhaps rotating disk
> is so slow that the cost of writing the data out dominates the cost of
> splitting?
>
> I considered that potentially the free swap entry search algorithm that is used
> in this case could be modified to look for (small) contiguous runs of entries;
> Up to ~16 pages (order-4) could be done by doing 2x 64bit reads from map instead
> of single byte.
>
> I haven't looked into this idea in detail, but wonder if anybody thinks it is
> worth the effort? Or perhaps it would end up causing bad fragmentation.

I doubt anybody uses rotating storage to back swap these days.
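
For reference, the "2x 64bit reads" idea quoted above could look roughly
like the userspace sketch below. It is an illustration under stated
assumptions, not a proposal for the kernel allocator: the map is assumed to
hold one byte per swap entry with 0 meaning free, only 16-aligned runs are
considered, and the names `run16_is_free()` and `find_free_run16()` are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumption: one byte per swap entry, 0 means "free".
 * Check 16 entries with two 64-bit loads instead of 16 byte loads.
 * memcpy() keeps the loads well-defined regardless of alignment.
 */
static bool run16_is_free(const unsigned char *map, size_t start)
{
    uint64_t lo, hi;

    assert(map != NULL);
    memcpy(&lo, map + start, sizeof(lo));
    memcpy(&hi, map + start + 8, sizeof(hi));
    return (lo | hi) == 0;
}

/*
 * Return the offset of the first 16-aligned run of free entries,
 * or nr_entries if there is none.
 */
static size_t find_free_run16(const unsigned char *map, size_t nr_entries)
{
    for (size_t start = 0; start + 16 <= nr_entries; start += 16) {
        if (run16_is_free(map, start))
            return start;
    }
    return nr_entries;
}
```

Restricting the search to aligned runs keeps the scan simple, at the cost
of missing free runs that cross a 16-entry boundary.
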
> Finally on testing, I've run the mm selftests and see no regressions, but I
> don't think there is anything in there specifically aimed towards swap? Are
> there any functional or performance tests that I should run? It would certainly
> be good to confirm I haven't regressed PMD-size THP swap performance.

I have used the swap sub-test cases of vm-scalability to test this.
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/
--
Best Regards,
Huang, Ying