Message-ID: <f61235d6-5d33-4853-a498-72db2fb13b10@redhat.com>
Date: Tue, 30 Jul 2024 10:47:17 +0200
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>, Matthew Wilcox
<willy@...radead.org>, Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, ying.huang@...el.com,
baolin.wang@...ux.alibaba.com, chrisl@...nel.org, hannes@...xchg.org,
hughd@...gle.com, kaleshsingh@...gle.com, kasong@...cent.com,
linux-kernel@...r.kernel.org, mhocko@...e.com, minchan@...nel.org,
nphamcs@...il.com, senozhatsky@...omium.org, shakeel.butt@...ux.dev,
shy828301@...il.com, surenb@...gle.com, v-songbaohua@...o.com,
xiang@...nel.org, yosryahmed@...gle.com
Subject: Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy
On 30.07.24 10:36, Ryan Roberts wrote:
> On 29/07/2024 04:52, Matthew Wilcox wrote:
>> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>>> A user space interface can be implemented to select different swap-in
>>> order policies, similar to the mTHP allocation order policy. We need
>>> a distinct policy because the performance characteristics of memory
>>> allocation differ significantly from those of swap-in. For example,
>>> SSD read speeds can be much slower than memory allocation. With
>>> policy selection, I believe we can implement mTHP swap-in for
>>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>>> the implications of their choices. I think it's better to start with
>>> at least "always" and "never". I believe we will add "auto" in the
>>> future to tune automatically, which can eventually become the default.
>>
>> I strongly disagree. Use the same sysctl as the other anonymous memory
>> allocations.
>
> I vaguely recall arguing in the past that just because the user has requested 2M
> THP, that doesn't mean it's the right thing for performance to swap in the
> whole 2M in one go. That's potentially a pretty huge latency, depending on where
> the backend is, and it could be a waste of IO if the application never touches
> most of the 2M. Although the fact that the application hinted for a 2M THP in
> the first place hopefully means that they are storing objects that need to be
> accessed at similar times. Today it will be swapped in page-by-page then
> eventually collapsed by khugepaged.
>
> But I think those arguments become weaker as the THP size gets smaller. 16K/64K
> swap-in will likely yield significant performance improvements, and I think
> Barry has numbers for this?
>
> So I guess we have a few options:
>
> - Just use the same sysfs interface as for anon allocation, and see if anyone
> reports performance regressions. Investigate one of the options below if an
> issue is raised. That's the simplest and cleanest approach, I think.
>
> - New sysfs interface as Barry has implemented; nobody really wants more
> controls if it can be helped.
>
> - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts
> and never got any traction.
>
> - Secret option 4: Can we allocate a full-size folio but only choose to swap in
> to it bit by bit? You would need a way to mark which pages of the folio are
> valid (e.g. a per-page flag), but I guess that's a non-starter given the
> strategy to remove per-page flags?

Maybe we could allocate for folios in the swapcache a bitmap to store
that information (folio->private).

But I am not convinced that is the right thing to do.
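
Roughly what I have in mind, purely as a sketch (all of the helper names
below are invented, nothing from this patch set, and locking/lifetime
questions are ignored):

#include <linux/bitmap.h>
#include <linux/mm.h>

/*
 * Invented helpers, only to make the idea concrete: hang a bitmap off
 * folio->private while the large folio sits in the swapcache and flip
 * a bit for each subpage whose read has completed.
 */
static int swapin_bitmap_alloc(struct folio *folio, gfp_t gfp)
{
	unsigned long *bitmap = bitmap_zalloc(folio_nr_pages(folio), gfp);

	if (!bitmap)
		return -ENOMEM;
	folio->private = bitmap;
	return 0;
}

/* Called from the read completion path for one subpage. */
static void swapin_bitmap_mark_uptodate(struct folio *folio, unsigned int idx)
{
	set_bit(idx, (unsigned long *)folio->private);
}

/* The fault path would have to check this before mapping a subpage. */
static bool swapin_bitmap_uptodate(struct folio *folio, unsigned int idx)
{
	return test_bit(idx, (unsigned long *)folio->private);
}

/* And the bitmap has to go away when the folio leaves the swapcache. */
static void swapin_bitmap_free(struct folio *folio)
{
	bitmap_free(folio->private);
	folio->private = NULL;
}

Mapping a not-yet-uptodate subpage would then have to trigger the
missing read, which is where it stops being simple.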

If we know some basic properties of the backend, can't we automatically
make a pretty good decision regarding the folio size to use? E.g., slow
disk, avoid 2M ...
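
To make that concrete, even something as simple as the following might
already cover the common cases (swapin_suggested_order() is an invented
name, and the thresholds are only placeholders):

#include <linux/blkdev.h>
#include <linux/swap.h>

/*
 * Pick a swap-in order from properties of the backend instead of
 * adding a new tunable: full order for fast synchronous devices,
 * order-0 for rotational media, something like 64K otherwise.
 */
static int swapin_suggested_order(struct swap_info_struct *si, int order)
{
	/* Fast synchronous backends (e.g. zram) can afford the full order. */
	if (si->flags & SWP_SYNCHRONOUS_IO)
		return order;

	/* Rotational media: a 2M read per fault is likely a bad idea. */
	if (si->bdev && !bdev_nonrot(si->bdev))
		return 0;

	/* Everything else: cap the read latency at, say, 64K (order-4). */
	return min(order, 4);
}

Something that crude would at least not require every admin to know
what their swap backend can handle.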

Avoiding sysctls if possible here would really be preferable...
--
Cheers,
David / dhildenb