linux-kernel - Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4a73b16e-9317-477a-ac23-8033004b0637@arm.com>
Date: Wed, 28 Feb 2024 14:57:33 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Matthew Wilcox <willy@...radead.org>, Huang Ying <ying.huang@...el.com>,
 Gao Xiang <xiang@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
 Yang Shi <shy828301@...il.com>, Michal Hocko <mhocko@...e.com>,
 Kefeng Wang <wangkefeng.wang@...wei.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from
 swap_cluster_info:flags

On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author) commented
>> "I believe this should be OK.  Better to compare the performance too." at [1].
>> That implies to me that perhaps the optimization wasn't in response to a
>> specific problem after all. Do you have any thoughts, Huang?
> 
> Might make sense to include that in the patch description!
> 
>> OK so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap slot
>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>> to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>> cluster will contain only one size of THPs, but this is not the case when a THP
>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>> cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>> time it will be the full size of the swap entry, but sometimes it will cover
>> only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>> prototyping.
> 
> Taking a step back: what about we simply batch unmapping of swap entries?
> 
> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
> lock) that reference consecutive swap offsets in the same swap file.

Yes in principle, but there are 4 places where free_swap_and_cache() is called,
and only 2 of those are really amenable to batching (zap_pte_range() and
madvise_free_pte_range()). So the other two users will still take the "slow"
path. Maybe those 2 callsites are the only ones that really matter? I can
certainly have a stab at this approach.

> 
> There, we can then first decrement all the swap counts, and then try minimizing
> how often we actually have to try reclaiming swap space (lookup folio, see it's
> a large folio that we cannot reclaim or could reclaim, ...).
> 
> Might need some fine-tuning in swap code to "advance" to the next entry to try
> freeing up, but we certainly can do better than what we would do right now.

I'm not sure I've understood this. Isn't advancing just a matter of:

entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);