Message-ID: <7774edad-a194-4259-a95f-88bcef846f90@arm.com>
Date: Fri, 1 Mar 2024 17:14:00 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>, Matthew Wilcox <willy@...radead.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Huang Ying <ying.huang@...el.com>, Gao Xiang <xiang@...nel.org>,
Yu Zhao <yuzhao@...gle.com>, Yang Shi <shy828301@...il.com>,
Michal Hocko <mhocko@...e.com>, Kefeng Wang <wangkefeng.wang@...wei.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from
 swap_cluster_info:flags

On 01/03/2024 17:00, David Hildenbrand wrote:
> On 01.03.24 17:44, Ryan Roberts wrote:
>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>> correct. The only problem is that during testing I can't provoke the code to
>>>> take the path. I've been poring through the code but struggling to figure out
>>>> in what situation you would expect the swap entry passed to
>>>> free_swap_and_cache() to still have a cached folio. Does anyone have any idea?
>>>>
>>>> This is the original (unbatched) function, after my change, which caused
>>>> David's concern that we would end up calling __try_to_reclaim_swap() far
>>>> too much:
>>>>
>>>> int free_swap_and_cache(swp_entry_t entry)
>>>> {
>>>>         struct swap_info_struct *p;
>>>>         unsigned char count;
>>>>
>>>>         if (non_swap_entry(entry))
>>>>                 return 1;
>>>>
>>>>         p = _swap_info_get(entry);
>>>>         if (p) {
>>>>                 count = __swap_entry_free(p, entry);
>>>>                 if (count == SWAP_HAS_CACHE)
>>>>                         __try_to_reclaim_swap(p, swp_offset(entry),
>>>>                                               TTRS_UNMAPPED | TTRS_FULL);
>>>>         }
>>>>         return p != NULL;
>>>> }
>>>>
>>>> The trouble is, whenever it's called, count is always 0, so
>>>> __try_to_reclaim_swap() never gets called.
>>>>
>>>> My test case allocates 1G of anon memory, then does madvise(MADV_PAGEOUT)
>>>> over it, then does either munmap() or madvise(MADV_FREE), both of which cause
>>>> this function to be called for every PTE; but count is always 0 after
>>>> __swap_entry_free(), so __try_to_reclaim_swap() is never called. I've tried
>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
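>>>>
>>>> For concreteness, the test is roughly the sketch below (error handling
>>>> trimmed; the MADV_HUGEPAGE hint and the memset population are just how I
>>>> set it up, not essential to the point):
>>>>
>>>> #include <string.h>
>>>> #include <sys/mman.h>
>>>>
>>>> #define SZ (1UL << 30)  /* 1G of anon memory */
>>>>
>>>> int main(void)
>>>> {
>>>>         char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
>>>>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>
>>>>         madvise(p, SZ, MADV_HUGEPAGE);  /* only for the THP runs */
>>>>         memset(p, 1, SZ);               /* populate the range */
>>>>
>>>>         madvise(p, SZ, MADV_PAGEOUT);   /* swap the lot out */
>>>>
>>>>         /* both of these end up calling free_swap_and_cache() per PTE */
>>>>         madvise(p, SZ, MADV_FREE);
>>>>         /* or: munmap(p, SZ); */
>>>>
>>>>         return 0;
>>>> }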
>>>
>>> I think you have to page it back in again, then it will have an entry in
>>> the swap cache. Maybe. I know little about anon memory ;-)
>>
>> Ahh, I was under the impression that the original folio is put into the swap
>> cache at swap out, then (I guess) it's removed once the IO is complete? I'm sure
>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
>
> I thought with most (disk) backends you will add it to the swapcache and leave
> it there until there is actual memory pressure. Only then, under memory
> pressure, you'd actually reclaim the folio.
OK, my problem is that I'm using a VM whose disk shows up as rotating media, so
the swap subsystem refuses to swap out THPs to it and they get split instead. To
solve that (and to speed up testing) I moved to the block ram disk, which
convinces swap to swap out THPs. But that causes the folios to be removed from
the swap cache (I assumed because it's synchronous, but maybe there is a flag
somewhere to affect that behavior?). And I can't convince QEMU to emulate an SSD
to the guest under macOS. Perhaps the easiest thing is to hack the kernel to
ignore the rotating media flag.
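
For reference, the flag I mean is the one the guest exposes at
/sys/block/<dev>/queue/rotational. A possibly simpler alternative to hacking the
kernel, assuming that attribute is writable for the device in question and that
the swap code respects whatever it reads there at swapon time, would be to clear
it from userspace before swapon; a rough sketch (the device name is just an
example):

#include <stdio.h>

int main(void)
{
        /* example device name; substitute whatever backs the swap partition */
        const char *attr = "/sys/block/vda/queue/rotational";
        FILE *f = fopen(attr, "w");

        if (!f) {
                perror(attr);
                return 1;
        }
        fputs("0\n", f);        /* 0 == report the device as non-rotating */
        return fclose(f) ? 1 : 0;
}

This would need to run as root and before swapon, since I believe the rotational
state of the backing device is sampled when the swap device is activated.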
>
> You can fault it back in from the swapcache without having to go to disk.
>
> That's how you can today end up with a THP in the swapcache: during swapin from
> disk (after the folio was reclaimed) you'd currently only get order-0 folios.
>
> At least that was my assumption with my MADV_PAGEOUT testing so far :)
>