Message-ID: <7774edad-a194-4259-a95f-88bcef846f90@arm.com>
Date: Fri, 1 Mar 2024 17:14:00 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>, Matthew Wilcox <willy@...radead.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Huang Ying <ying.huang@...el.com>, Gao Xiang <xiang@...nel.org>,
Yu Zhao <yuzhao@...gle.com>, Yang Shi <shy828301@...il.com>,
Michal Hocko <mhocko@...e.com>, Kefeng Wang <wangkefeng.wang@...wei.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from
 swap_cluster_info:flags

On 01/03/2024 17:00, David Hildenbrand wrote:
> On 01.03.24 17:44, Ryan Roberts wrote:
>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>> correct. The only problem is that during testing I can't provoke the code to
>>>> take the path. I've been poring through the code but struggling to figure out
>>>> in what situation you would expect the swap entry passed to
>>>> free_swap_and_cache() to still have a cached folio. Does anyone have any idea?
>>>>
>>>> This is the original (unbatched) function, after my change, which caused
>>>> David's concern that we would end up calling __try_to_reclaim_swap() far
>>>> too much:
>>>>
>>>> int free_swap_and_cache(swp_entry_t entry)
>>>> {
>>>>         struct swap_info_struct *p;
>>>>         unsigned char count;
>>>>
>>>>         if (non_swap_entry(entry))
>>>>                 return 1;
>>>>
>>>>         p = _swap_info_get(entry);
>>>>         if (p) {
>>>>                 count = __swap_entry_free(p, entry);
>>>>                 if (count == SWAP_HAS_CACHE)
>>>>                         __try_to_reclaim_swap(p, swp_offset(entry),
>>>>                                               TTRS_UNMAPPED | TTRS_FULL);
>>>>         }
>>>>         return p != NULL;
>>>> }
>>>>
>>>> The trouble is, whenever it's called, count is always 0, so
>>>> __try_to_reclaim_swap() never gets called.
>>>>
>>>> My test case allocates 1G of anon memory, then does madvise(MADV_PAGEOUT)
>>>> over it, then does either munmap() or madvise(MADV_FREE), both of which cause
>>>> this function to be called for every PTE; but count is always 0 after
>>>> __swap_entry_free(), so __try_to_reclaim_swap() is never called. I've tried
>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
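>>>>
>>>> For concreteness, the test is roughly the sketch below (error handling
>>>> trimmed; the MADV_HUGEPAGE hint and the memset population are just how I
>>>> set it up, not essential to the point):
>>>>
>>>> #include <string.h>
>>>> #include <sys/mman.h>
>>>>
>>>> #define SZ (1UL << 30)  /* 1G of anon memory */
>>>>
>>>> int main(void)
>>>> {
>>>>         char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
>>>>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>
>>>>         madvise(p, SZ, MADV_HUGEPAGE);  /* only for the THP runs */
>>>>         memset(p, 1, SZ);               /* populate the range */
>>>>
>>>>         madvise(p, SZ, MADV_PAGEOUT);   /* swap the lot out */
>>>>
>>>>         /* both of these end up calling free_swap_and_cache() per PTE */
>>>>         madvise(p, SZ, MADV_FREE);
>>>>         /* or: munmap(p, SZ); */
>>>>
>>>>         return 0;
>>>> }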
>>>
>>> I think you have to page it back in again, then it will have an entry in
>>> the swap cache. Maybe. I know little about anon memory ;-)
>>
>> Ahh, I was under the impression that the original folio is put into the swap
>> cache at swap out, then (I guess) it's removed once the IO is complete? I'm sure
>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
>
> I thought with most (disk) backends you will add it to the swapcache and leave
> it there until there is actual memory pressure. Only then, under memory
> pressure, you'd actually reclaim the folio.
OK, my problem is that I'm using a VM whose disk shows up as rotating media, so
the swap subsystem refuses to swap out THPs to it and they get split instead. To
solve that (and to speed up testing) I moved to the block ram disk, which
convinces swap to swap out THPs. But that causes the folios to be removed from
the swap cache (I assumed because it's synchronous, but maybe there is a flag
somewhere to affect that behavior?). And I can't convince QEMU to emulate an SSD
to the guest under macOS. Perhaps the easiest thing is to hack the kernel to
ignore the rotating media flag.
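
For reference, the flag I mean is the one the guest exposes at
/sys/block/<dev>/queue/rotational. A possibly simpler alternative to hacking the
kernel, assuming that attribute is writable for the device in question and that
the swap code respects whatever it reads there at swapon time, would be to clear
it from userspace before swapon; a rough sketch (the device name is just an
example):

#include <stdio.h>

int main(void)
{
        /* example device name; substitute whatever backs the swap partition */
        const char *attr = "/sys/block/vda/queue/rotational";
        FILE *f = fopen(attr, "w");

        if (!f) {
                perror(attr);
                return 1;
        }
        fputs("0\n", f);        /* 0 == report the device as non-rotating */
        return fclose(f) ? 1 : 0;
}

This would need to run as root and before swapon, since I believe the rotational
state of the backing device is sampled when the swap device is activated.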
>
> You can fault it back in from the swapcache without having to go to disk.
>
> That's how you can today end up with a THP in the swapcache: during swapin from
> disk (after the folio was reclaimed) you'd currently only get order-0 folios.
>
> At least that was my assumption with my MADV_PAGEOUT testing so far :)
>