linux-kernel - Re: [PATCH] mm: reuse the unshared swapcache page in do_wp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <43e41259-b228-2a75-e59d-0c6a1e81912f@redhat.com>
Date:   Thu, 20 Jan 2022 16:46:45 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Matthew Wilcox <willy@...radead.org>,
        "zhangliang (AG)" <zhangliang5@...wei.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        wangzhigang17@...wei.com
Subject: Re: [PATCH] mm: reuse the unshared swapcache page in do_wp_page

On 20.01.22 16:37, Linus Torvalds wrote:
> On Thu, Jan 20, 2022 at 5:26 PM David Hildenbrand <david@...hat.com> wrote:
>>
>> So your claim is that read-only, PTE mapped pages are weird? How do you
>> come to that conclusion?
>>
>> If we adjust the THP reuse logic to split on additional references
>> (page_count() == 1) -- similarly as suggested by Linus to fix the CVE --
>> we're going to end up with exactly that more frequently.
> 
> If you write to a THP page that has page_count() elevated - presumably
> because of a fork() - then that COW is exactly what you want to
> happen.
> 
> And presumably you want it to happen page-by-page.
> 
> So I fail to see what the problem is.
> 
> The *normal* THP case is that there hasn't been a fork, and there is
> no COW activity. That's the only thing worth trying to optimize for
> and worry about.
> 
> If you do some kind of fork with huge-pages, and actually write to
> those pages (as opposed to just execve() in the child and wait in the
> parent), you only have yourself to blame. You *will* take COW faults,
> and you have to do it, and at that point spliting the THP in whoever
> did the COW is likely the right thing to do just to hope that you
> don't have to allocate a whole new hugepage. So it will cause new
> (small-page) allocations and copies.
> 
> And yes, at that point, writes to the THP page will cause COW's for
> both sides as they both end up making that "split it" decision.
> 
> Honestly, would anything else ever even make sense?
> 
> If you care about THP performance, you make sure that the COW THP case
> simply never happens.

I'm, not concerned about fork(), I'm concerned about other false positives.

Here is what I currently have, I hope that makes sense:

commit f8fe5e15e04f7563606d58daee795aa0cc0e454c (HEAD)
Author: David Hildenbrand <david@...hat.com>
Date:   Fri Jan 14 09:53:35 2022 +0100

    mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
    
    We currently have a different COW logic for THP than we have for
    ordinary pages in do_wp_page(): the effect is that the issue reported in
    CVE-2020-29374 is currently still possible for THP: a child process can
    observe memory modifications of the parent process to MAP_PRIVATE pages.
    
    Let's apply the same logic (page_count() == 1), conditionally trying to
    remove the page from the swapcache after freeing the swap entry, however,
    before actually mapping our page.
    
    Note that KSM does not apply to THP and that we won't be waiting on
    writeback. In the future, we might want to wait for writeback in case
    the page is not swapped by any other process. For now, let's keep it
    simple.
    
    If we find a page_count() > 1, we'll have to split and fallback to
    do_wp_page(), which will copy the page.
    
    This change has the following effects:
    
    (1) Mapping a THP with GUP references read-only and triggering a write
    fault will result in a split and disconnect of GUP from the actual page
    table just as we know it for ordinary pages already: this will require a
    fix for all anonymous pages, for example, by marking anon pages exclusive
    to a single process and allowing GUP pins only on exclusive anon pages.
    
    (2) Additional references on the compound page, or concurrent writeback,
    will now result in the THP getting mapped via PTEs, whereby each PTE
    requires a reference on the compound page. Once we have a PTE-mapped
    PMD, do_wp_page() will always copy and never reuse.
    
    Signed-off-by: David Hildenbrand <david@...hat.com>

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..a4f2754142aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1303,7 +1303,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
        page = pmd_page(orig_pmd);
        VM_BUG_ON_PAGE(!PageHead(page), page);
 
-       /* Lock page for reuse_swap_page() */
        if (!trylock_page(page)) {
                get_page(page);
                spin_unlock(vmf->ptl);
@@ -1319,10 +1318,14 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
        }
 
        /*
-        * We can only reuse the page if nobody else maps the huge page or it's
-        * part.
+        * See do_wp_page(): we can reuse only if there are no additional
+        * references (page_count() is 1). Try to free the swapcache to get
+        * rid of the swapcache references in case it's likely that we will
+        * succeed.
         */
-       if (reuse_swap_page(page)) {
+       if (PageSwapCache(page) && page_count(page) == 1 + thp_nr_pages(page))
+               try_to_free_swap(page);
+       if (page_count(page) == 1) {
                pmd_t entry;
                entry = pmd_mkyoung(orig_pmd);
                entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);


-- 
Thanks,

David / dhildenb