[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <531c6702-1389-42c5-9cdd-062989d40133@arm.com>
Date: Wed, 28 Feb 2024 14:59:28 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Matthew Wilcox <willy@...radead.org>
Cc: David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, Huang Ying
<ying.huang@...el.com>, Gao Xiang <xiang@...nel.org>,
Yu Zhao <yuzhao@...gle.com>, Yang Shi <shy828301@...il.com>,
Michal Hocko <mhocko@...e.com>, Kefeng Wang <wangkefeng.wang@...wei.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from
swap_cluster_info:flags
On 28/02/2024 14:24, Ryan Roberts wrote:
> On 28/02/2024 13:33, Matthew Wilcox wrote:
>> On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>>> to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> I can't quite bring myself to look up the encoding of swap entries
>> but as long as we're willing to restrict ourselves to naturally aligning
>> the clusters, there's an encoding (which I believe I invented) that lets
>> us encode arbitrary power-of-two sizes with a single bit.
>>
>> I describe it here:
>> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
>>
>> Let me know if it's not clear.
>
> Ahh yes, I'm familiar with this encoding scheme from other settings. Although
> I've previously thought of it as having a bit to indicate whether the scheme is
> enabled or not, and if it is enabled then the encoded PFN is:
>
> PFNe = PFNd | (1 << (log2(n) - 1))
>
> Where n is the power-of-2 page count.
>
> Same thing, I think.
>
> I think we would have to steal a bit from the offset to make this work, and it
> looks like the size of that is bottlnecked on the arch's swp_entry PTE
> representation. Looks like there is a MIPS config that only has 17 bits for
> offset to begin with, so I doubt we would be able to spare a bit here? Although
> it looks possible that there are some unused low bits that could be used...
>
I think the other problem with this is that it won't tell us which slot in the
"swap slot block" each entry is targetting?
Powered by blists - more mailing lists