[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <804524c8-772c-42d0-93a5-90d77f13f304@redhat.com>
Date: Mon, 4 Mar 2024 22:15:02 +0100
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
linux-mm@...ck.org, chrisl@...nel.org, yuzhao@...gle.com,
hanchuanhua@...o.com, linux-kernel@...r.kernel.org, willy@...radead.org,
ying.huang@...el.com, xiang@...nel.org, mhocko@...e.com,
shy828301@...il.com, wangkefeng.wang@...wei.com,
Barry Song <v-songbaohua@...o.com>, Hugh Dickins <hughd@...gle.com>
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a
large folio
>>> Do we need a Fixes tag?
>
> I am not quite sure which commit should be here for a fixes tag.
> I think it's more of an optimization.
Good, that helps!
>
>>>
>>
>> What would be the description of the problem we are fixing?
>>
>> 1) failing to unmap?
>>
>> That can happen with small folios as well IIUC.
>>
>> 2) Putting the large folio on the deferred split queue?
>>
>> That sounds more reasonable.
>
> I don't feel it is reasonable. Avoiding this kind of accident splitting
> from the kernel's improper code is a more reasonable approach
> as there is always a price to pay for splitting and unfolding PTEs
> etc.
>
> While we can't avoid splitting coming from userspace's
> MADV_DONTNEED, munmap, mprotect, we have a way
> to ensure the kernel itself doesn't accidently break up a
> large folio.
Note that on the next vmscan we would retry, find the remaining present
entries and swapout that thing completely :)
>
> In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> in try_to_unmap_one. hardly could we fix it from the root cause. with
> various races, figuring out their timings was really a big pain :-)
>
I can imagine. I assume, though, that it might be related to the way the
cont-pte bit was handled. Ryan's implementation should be able to cope
with that.
> But we did "resolve" those bugs by entirely untouching all PTEs if we
> found some PTEs were skipped in try_to_unmap_one [1].
>
> While we find we only get the PTL from 2nd, 3rd but not
> 1st PTE, we entirely give up on try_to_unmap_one, and leave
> all PTEs untouched.
>
> /* we are not starting from head */
> if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> ret = false;
> atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> set_pte_at(mm, address, pvmw.pte, pteval);
> page_vma_mapped_walk_done(&pvmw);
> break;
> }
> This will ensure all PTEs still have a unified state such as CONT-PTE
> after try_to_unmap fails.
> I feel this could have some false postive because when racing
> with unmap, 1st PTE might really become pte_none. So explicitly
> holding PTL from 1st PTE seems a better way.
Can we estimate the "cost" of holding the PTL?
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists