Message-ID: <f61235d6-5d33-4853-a498-72db2fb13b10@redhat.com>
Date: Tue, 30 Jul 2024 10:47:17 +0200
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>, Matthew Wilcox
<willy@...radead.org>, Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, ying.huang@...el.com,
baolin.wang@...ux.alibaba.com, chrisl@...nel.org, hannes@...xchg.org,
hughd@...gle.com, kaleshsingh@...gle.com, kasong@...cent.com,
linux-kernel@...r.kernel.org, mhocko@...e.com, minchan@...nel.org,
nphamcs@...il.com, senozhatsky@...omium.org, shakeel.butt@...ux.dev,
shy828301@...il.com, surenb@...gle.com, v-songbaohua@...o.com,
xiang@...nel.org, yosryahmed@...gle.com
Subject: Re: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy
On 30.07.24 10:36, Ryan Roberts wrote:
> On 29/07/2024 04:52, Matthew Wilcox wrote:
>> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>>> A user space interface can be implemented to select different swap-in
>>> order policies, similar to the mTHP allocation order policy. We need
>>> a distinct policy because the performance characteristics of memory
>>> allocation differ significantly from those of swap-in. For example,
>>> SSD read speeds can be much slower than memory allocation. With
>>> policy selection, I believe we can implement mTHP swap-in for
>>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>>> the implications of their choices. I think it's better to start with
>>> at least "always" and "never". I believe we will add "auto" in the
>>> future to tune automatically, which can eventually become the default.
>>
>> I strongly disagree. Use the same sysctl as the other anonymous memory
>> allocations.
>
> I vaguely recall arguing in the past that just because the user has requested 2M
> THP, that doesn't mean it's the right thing for performance to swap in the
> whole 2M in one go. That's potentially a pretty huge latency, depending on where
> the backend is, and it could be a waste of IO if the application never touches
> most of the 2M. Although the fact that the application hinted for a 2M THP in
> the first place hopefully means that they are storing objects that need to be
> accessed at similar times. Today it will be swapped in page-by-page then
> eventually collapsed by khugepaged.
>
> But I think those arguments become weaker as the THP size gets smaller. 16K/64K
> swap-in will likely yield significant performance improvements, and I think
> Barry has numbers for this?
>
> So I guess we have a few options:
>
> - Just use the same sysfs interface as for anon allocation, and see if anyone
> reports performance regressions. Investigate one of the options below if an
> issue is raised. That's the simplest and cleanest approach, I think.
>
> - New sysfs interface as Barry has implemented; nobody really wants more
> controls if it can be helped.
>
> - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts
> and never got any traction.
>
> - Secret option 4: Can we allocate a full-size folio but only choose to swap in
> to it bit by bit? You would need a way to mark which pages of the folio are
> valid (e.g. a per-page flag), but I guess that's a non-starter given the
> strategy to remove per-page flags?

Maybe we could allocate for folios in the swapcache a bitmap to store
that information (folio->private).

But I am not convinced that is the right thing to do.
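
Roughly what I have in mind, purely as a sketch (all of the helper names
below are invented, nothing from this patch set, and locking/lifetime
questions are ignored):

#include <linux/bitmap.h>
#include <linux/mm.h>

/*
 * Invented helpers, only to make the idea concrete: hang a bitmap off
 * folio->private while the large folio sits in the swapcache and flip
 * a bit for each subpage whose read has completed.
 */
static int swapin_bitmap_alloc(struct folio *folio, gfp_t gfp)
{
	unsigned long *bitmap = bitmap_zalloc(folio_nr_pages(folio), gfp);

	if (!bitmap)
		return -ENOMEM;
	folio->private = bitmap;
	return 0;
}

/* Called from the read completion path for one subpage. */
static void swapin_bitmap_mark_uptodate(struct folio *folio, unsigned int idx)
{
	set_bit(idx, (unsigned long *)folio->private);
}

/* The fault path would have to check this before mapping a subpage. */
static bool swapin_bitmap_uptodate(struct folio *folio, unsigned int idx)
{
	return test_bit(idx, (unsigned long *)folio->private);
}

/* And the bitmap has to go away when the folio leaves the swapcache. */
static void swapin_bitmap_free(struct folio *folio)
{
	bitmap_free(folio->private);
	folio->private = NULL;
}

Mapping a not-yet-uptodate subpage would then have to trigger the
missing read, which is where it stops being simple.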

If we know some basic properties of the backend, can't we automatically
make a pretty good decision regarding the folio size to use? E.g., slow
disk, avoid 2M ...
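
To make that concrete, even something as simple as the following might
already cover the common cases (swapin_suggested_order() is an invented
name, and the thresholds are only placeholders):

#include <linux/blkdev.h>
#include <linux/swap.h>

/*
 * Pick a swap-in order from properties of the backend instead of
 * adding a new tunable: full order for fast synchronous devices,
 * order-0 for rotational media, something like 64K otherwise.
 */
static int swapin_suggested_order(struct swap_info_struct *si, int order)
{
	/* Fast synchronous backends (e.g. zram) can afford the full order. */
	if (si->flags & SWP_SYNCHRONOUS_IO)
		return order;

	/* Rotational media: a 2M read per fault is likely a bad idea. */
	if (si->bdev && !bdev_nonrot(si->bdev))
		return 0;

	/* Everything else: cap the read latency at, say, 64K (order-4). */
	return min(order, 4);
}

Something that crude would at least not require every admin to know
what their swap backend can handle.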

Avoiding sysctls if possible here would really be preferable...
--
Cheers,
David / dhildenb