linux-kernel - Re: [PATCH v2 12/20] mm/mshare: prepare for page table sharing support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1fe10a54-64c9-4808-9d4a-50edf5d1662a@oracle.com>
Date: Mon, 2 Jun 2025 15:02:38 -0700
From: Anthony Yznaga <anthony.yznaga@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: akpm@...ux-foundation.org, willy@...radead.org, markhemm@...glemail.com,
        viro@...iv.linux.org.uk, david@...hat.com, khalid@...nel.org,
        andreyknvl@...il.com, dave.hansen@...el.com, luto@...nel.org,
        brauner@...nel.org, arnd@...db.de, ebiederm@...ssion.com,
        catalin.marinas@....com, linux-arch@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org, mhiramat@...nel.org,
        rostedt@...dmis.org, vasily.averin@...ux.dev, xhao@...ux.alibaba.com,
        pcc@...gle.com, neilb@...e.de, maz@...nel.org,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Liam Howlett <liam.howlett@...cle.com>
Subject: Re: [PATCH v2 12/20] mm/mshare: prepare for page table sharing
 support



On 6/2/25 8:26 AM, Jann Horn wrote:
> On Fri, May 30, 2025 at 6:42 PM Anthony Yznaga
> <anthony.yznaga@...cle.com> wrote:
>> On 5/30/25 7:56 AM, Jann Horn wrote:
>>> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@...cle.com> wrote:
>>>> In preparation for enabling the handling of page faults in an mshare
>>>> region provide a way to link an mshare shared page table to a process
>>>> page table and otherwise find the actual vma in order to handle a page
>>>> fault. Modify the unmap path to ensure that page tables in mshare regions
>>>> are unlinked and kept intact when a process exits or an mshare region
>>>> is explicitly unmapped.
>>>>
>>>> Signed-off-by: Khalid Aziz <khalid@...nel.org>
>>>> Signed-off-by: Matthew Wilcox (Oracle) <willy@...radead.org>
>>>> Signed-off-by: Anthony Yznaga <anthony.yznaga@...cle.com>
>>> [...]
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index db558fe43088..68422b606819 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>> [...]
>>>> @@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
>>>>                   next = p4d_addr_end(addr, end);
>>>>                   if (p4d_none_or_clear_bad(p4d))
>>>>                           continue;
>>>> -               free_pud_range(tlb, p4d, addr, next, floor, ceiling);
>>>> +               if (unlikely(shared_pud))
>>>> +                       p4d_clear(p4d);
>>>> +               else
>>>> +                       free_pud_range(tlb, p4d, addr, next, floor, ceiling);
>>>>           } while (p4d++, addr = next, addr != end);
>>>>
>>>>           start &= PGDIR_MASK;
>>> [...]
>>>> +static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
>>>> +                               struct vm_area_struct *vma,
>>>> +                               unsigned long addr, unsigned long end,
>>>> +                               struct zap_details *details)
>>>> +{
>>>> +       /*
>>>> +        * The msharefs vma is being unmapped. Do not unmap pages in the
>>>> +        * mshare region itself.
>>>> +        */
>>>> +}
>>>
>>> Unmapping a VMA has three major phases:
>>>
>>> 1. unlinking the VMA from the VMA tree
>>> 2. removing the VMA contents
>>> 3. removing unneeded page tables
>>>
>>> The MM subsystem broadly assumes that after phase 2, no stuff is
>>> mapped in the region anymore and therefore changes to the backing file
>>> don't need to TLB-flush this VMA anymore, and unlinks the mapping from
>>> rmaps and such. If munmap() of an mshare region only removes the
>>> mapping of shared page tables in step 3, as implemented here, that
>>> means things like TLB flushes won't be able to discover all
>>> currently-existing mshare mappings of a host MM through rmap walks.
>>>
>>> I think it would make more sense to remove the links to shared page
>>> tables in step 2 (meaning in mshare_vm_op_unmap_page_range), just like
>>> hugetlb does, and not modify free_pgtables().
>>
>> That makes sense. I'll make this change.
> 
> Related: I think there needs to be a strategy for preventing walking
> of mshare host page tables through an mshare VMA by codepaths relying
> on MM/VMA locks, because those locks won't have an effect on the
> underlying host MM. For example, I think the only reason fork() is
> safe with your proposal is that copy_page_range() skips shared VMAs,
> and I think non-fast get_user_pages() could maybe hit use-after-free
> of page tables or such?
> 
> I guess the only clean strategy for that is to ensure that all
> locking-based page table walking code does a check for "is this an
> mshare VMA?" and, if yes, either bails immediately or takes extra
> locks on the host MM (which could get messy).

Thanks. Yes, I need to audit all VMA / page table scanning. This series 
already has a patch to avoid scanning mshare VMAs for numa migration, 
but more issues are lurking.