linux-kernel - Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <804524c8-772c-42d0-93a5-90d77f13f304@redhat.com>
Date: Mon, 4 Mar 2024 22:15:02 +0100
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
 linux-mm@...ck.org, chrisl@...nel.org, yuzhao@...gle.com,
 hanchuanhua@...o.com, linux-kernel@...r.kernel.org, willy@...radead.org,
 ying.huang@...el.com, xiang@...nel.org, mhocko@...e.com,
 shy828301@...il.com, wangkefeng.wang@...wei.com,
 Barry Song <v-songbaohua@...o.com>, Hugh Dickins <hughd@...gle.com>
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a
 large folio


>>> Do we need a Fixes tag?
> 
> I am not quite sure which commit should be here for a fixes tag.
> I think it's more of an optimization.

Good, that helps!

> 
>>>
>>
>> What would be the description of the problem we are fixing?
>>
>> 1) failing to unmap?
>>
>> That can happen with small folios as well IIUC.
>>
>> 2) Putting the large folio on the deferred split queue?
>>
>> That sounds more reasonable.
> 
> I don't feel it is reasonable. Avoiding this kind of accident splitting
> from the kernel's improper code is a more reasonable approach
> as there is always a price to pay for splitting and unfolding PTEs
> etc.
> 
> While we can't avoid splitting coming from userspace's
> MADV_DONTNEED, munmap, mprotect, we have a way
> to ensure the kernel itself doesn't accidently break up a
> large folio.

Note that on the next vmscan we would retry, find the remaining present 
entries and swapout that thing completely :)

> 
> In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> in try_to_unmap_one. hardly could we fix it from the root cause. with
> various races, figuring out their timings was really a big pain :-)
> 

I can imagine. I assume, though, that it might be related to the way the 
cont-pte bit was handled. Ryan's implementation should be able to cope 
with that.

> But we did "resolve" those bugs by entirely untouching all PTEs if we
> found some PTEs were skipped in try_to_unmap_one [1].
> 
> While we find we only get the PTL from 2nd, 3rd but not
> 1st PTE, we entirely give up on try_to_unmap_one, and leave
> all PTEs untouched.
> 
> /* we are not starting from head */
> if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
>                     ret = false;
>                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
>                     set_pte_at(mm, address, pvmw.pte, pteval);
>                     page_vma_mapped_walk_done(&pvmw);
>                     break;
> }
> This will ensure all PTEs still have a unified state such as CONT-PTE
> after try_to_unmap fails.
> I feel this could have some false postive because when racing
> with unmap, 1st PTE might really become pte_none. So explicitly
> holding PTL from 1st PTE seems a better way.

Can we estimate the "cost" of holding the PTL?

-- 
Cheers,

David / dhildenb