[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGsJ_4yqUW46xyDtZ4X1wQZ2_0bLM85Euz2BufERa75Rg+gVyw@mail.gmail.com>
Date: Tue, 5 Mar 2024 11:29:31 +1300
From: Barry Song <21cnbao@...il.com>
To: David Hildenbrand <david@...hat.com>
Cc: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org, linux-mm@...ck.org,
chrisl@...nel.org, yuzhao@...gle.com, hanchuanhua@...o.com,
linux-kernel@...r.kernel.org, willy@...radead.org, ying.huang@...el.com,
xiang@...nel.org, mhocko@...e.com, shy828301@...il.com,
wangkefeng.wang@...wei.com, Barry Song <v-songbaohua@...o.com>,
Hugh Dickins <hughd@...gle.com>
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a
large folio
On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@...hat.com> wrote:
>
>
> >>> Do we need a Fixes tag?
> >
> > I am not quite sure which commit should be here for a fixes tag.
> > I think it's more of an optimization.
>
> Good, that helps!
>
> >
> >>>
> >>
> >> What would be the description of the problem we are fixing?
> >>
> >> 1) failing to unmap?
> >>
> >> That can happen with small folios as well IIUC.
> >>
> >> 2) Putting the large folio on the deferred split queue?
> >>
> >> That sounds more reasonable.
> >
> > I don't feel it is reasonable. Avoiding this kind of accident splitting
> > from the kernel's improper code is a more reasonable approach
> > as there is always a price to pay for splitting and unfolding PTEs
> > etc.
> >
> > While we can't avoid splitting coming from userspace's
> > MADV_DONTNEED, munmap, mprotect, we have a way
> > to ensure the kernel itself doesn't accidently break up a
> > large folio.
>
> Note that on the next vmscan we would retry, find the remaining present
> entries and swapout that thing completely :)
This is true, but since we can finish the job the first time, it seems
second retry is a cost :-)
>
> >
> > In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> > in try_to_unmap_one. hardly could we fix it from the root cause. with
> > various races, figuring out their timings was really a big pain :-)
> >
>
> I can imagine. I assume, though, that it might be related to the way the
> cont-pte bit was handled. Ryan's implementation should be able to cope
> with that.
I guess you are probably right. Ryan's implementation decouples CONT-PTE
from mm core. nice to have it.
>
> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> > found some PTEs were skipped in try_to_unmap_one [1].
> >
> > While we find we only get the PTL from 2nd, 3rd but not
> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> > all PTEs untouched.
> >
> > /* we are not starting from head */
> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> > ret = false;
> > atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> > set_pte_at(mm, address, pvmw.pte, pteval);
> > page_vma_mapped_walk_done(&pvmw);
> > break;
> > }
> > This will ensure all PTEs still have a unified state such as CONT-PTE
> > after try_to_unmap fails.
> > I feel this could have some false postive because when racing
> > with unmap, 1st PTE might really become pte_none. So explicitly
> > holding PTL from 1st PTE seems a better way.
>
> Can we estimate the "cost" of holding the PTL?
>
This is just moving PTL acquisition one or two PTE earlier in those corner
cases. In normal cases, it doesn't affect when PTL is held.
In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
PTL immediately. in corner cases, page_vma_mapped_walk races with break-
before-make, after skipping one or two PTEs whose states are transferring,
it will find a present pte then acquire lock.
> --
> Cheers,
>
> David / dhildenb
Thanks
Barry
Powered by blists - more mailing lists