linux-kernel - Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <B1B85771-B211-4FCC-AEEF-BDFD37332C25@vmware.com>
Date:   Tue, 5 Jan 2021 21:22:51 +0000
From:   Nadav Amit <namit@...are.com>
To:     Andrea Arcangeli <aarcange@...hat.com>
CC:     linux-mm <linux-mm@...ck.org>, lkml <linux-kernel@...r.kernel.org>,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on
 soft-dirty cleanup

> On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli <aarcange@...hat.com> wrote:
> 
> On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote:
>>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@...hat.com> wrote:
>>> 
>>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
>>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
>>> 
>>> Targeting a backport down to 2013 when nothing could wrong in practice
>>> with page_mapcount sounds backwards and unnecessarily risky.
>>> 
>>> In theory it was already broken and in theory
>>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
>>> previous code of 2013 is completely wrong, but in practice the code
>>> from 2013 worked perfectly until Aug 21 2020.
>> 
>> Well… If you consider the bug that Will recently fixed [1], then soft-dirty
>> was broken (for a different, yet related reason) since 0758cd830494
>> ("asm-generic/tlb: avoid potential double flush”).
>> 
>> This is not to say that I argue that the patch should be backported to 2013,
>> just to say that memory corruption bugs can be unnoticed.
>> 
>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/
> 
> Is this a fix or a cleanup?
> 
> The above is precisely what I said earlier that tlb_gather had no
> reason to stay in clear_refs and it had to use inc_tlb_flush_pending
> as mprotect, but it's not a fix? Is it? I suggested it as a pure
> cleanup. So again no backport required. The commit says fix this but
> it means "clean this up".

It is actually a fix. I think the commit log is not entirely correct and
should include:

  Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”).

Since 0758cd830494, calling tlb_finish_mmu() without any previous call to
pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug
producer that I sent fails without this patch of Will.

> So setting a Fixes back to 2013 that would go mess with all stable
> tree by actively backporting a performance regressions to clear_refs
> that can break runtime performance to fix a philosophical issue that
> isn't even a theoretical issue, doesn't sound ideal to me.

Point taken.

> 
>> To summarize my action items based your (and others) feedback on both
>> patches:
>> 
>> 1. I will break the first patch into two different patches, one with the
>> “optimization” for write-unprotect, based on your feedback. It will not
>> be backported.
>> 
>> 2. I will try to add a patch to avoid TLB flushes on
>> userfaultfd-writeunprotect. It will also not be backported.
> 
> I think 1 and 2 above could be in the same patch. Mixing an uffd-wp optimization with the
> actual fix the memory corruption wasn't ideal, but doing the same
> optimization to both wrprotect and un-wrprotect in the same patch
> sounds ideal. The commit explanation would be identical and it can be
> de-duplicated this way.
> 
> I'd suggest to coordinate with Peter on that, since I wasn't planning
> to work on this if somebody else offered to do it.
> 
>> 3. Let me know if you want me to use your version of testing
>> mm_tlb_flush_pending() and conditionally flushing, wait for new version fro
>> you or Peter or to go with taking mmap_lock for write.
> 
> Yes, as you suggested, I'm trying to clean it up and send a new
> version.
> 
> Ultimately my view is there are an huge number of cases where
> mmap_write_lock or some other heavy lock that will require
> occasionally to block on I/O is beyond impossible not to take. Even
> speculative page faults only attack the low hanging anon memory and
> there's still MADV_DONTNEED/FREE and other stuff that may have to run
> in parallel with UFFDIO_WRITEPROTECT and clear_refs, not just page
> faults.
> 
> As a reminder: the only case when modifying the vmas is allowed under
> mmap_read_lock (I already tried once to make it safer by adding
> READ_ONCE/WRITE_ONCE but wasn't merged see
> https://www.spinics.net/lists/linux-mm/msg173420.html), is when
> updating vm_end/vm_start in growsdown/up, where the vma is extended
> down or up in the page fault under only mmap_read_lock.
> 
> I'm doing all I can to document and make it more explicit the
> complexity we deal with in the code (as well as reducing the gcc
> dependency in emitting atomic writes to update vm_end/vm_start, as we
> should do in ptes as well in theory). As you may notice in the
> feedback from the above submission not all even realized that we're
> modifying vmas already under mmap_read_lock. So it'd be great to get
> help to merge that READ_ONCE/WRITE_ONCE cleanup that is still valid
> and pending for merge but it needs forward porting.
> 
> This one, for both soft dirty and uffd_wrprotect, is a walk in the
> park to optimize in comparison to the vma modifications.

I am sure you are right.

> 
> From my point of view in fact, doing the tlb flush or the wait on the
> atomic to be released, does not increase kernel complexity compared to
> what we had until now.

It is also about performance due to unwarranted TLB flushes.

I think avoiding them requires some finer granularity detection of pending
page-faults. But anyhow, I still owe some TLB optimization patches (and v2
for userfaultfd+iouring) before I can even look at that.

In addition, as I stated before, having some clean interfaces that tell
whether a TLB flush is needed or not would be helpful and simpler to follow.
For instance, we can have is_pte_prot_demotion(oldprot, newprot) to figure
out whether a TLB flush is needed in change_pte_range() and avoid
unnecessary flushes when unprotecting pages with either mprotect() or
userfaultfd.