Message-Id: <719DF2CD-A0BC-4B67-9FBA-A9E0A98AA45E@gmail.com>
Date: Tue, 22 Dec 2020 12:58:18 -0800
From: Nadav Amit <nadav.amit@...il.com>
To: Andy Lutomirski <luto@...nel.org>
Cc: Andrea Arcangeli <aarcange@...hat.com>,
linux-mm <linux-mm@...ck.org>, Peter Xu <peterx@...hat.com>,
lkml <linux-kernel@...r.kernel.org>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
stable <stable@...r.kernel.org>,
Minchan Kim <minchan@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
> On Dec 22, 2020, at 12:34 PM, Andy Lutomirski <luto@...nel.org> wrote:
>
> On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit <nadav.amit@...il.com> wrote:
>>> [ I have in mind another solution, such as keeping in each page-table a
>>> “table-generation” which is the mm-generation at the time of the change,
>>> and only flush if “table-generation”==“mm-generation”, but it requires
>>> some thought on how to avoid adding new memory barriers. ]
>>>
>>> IOW: I think the change that you suggest is insufficient, and a proper
>>> solution is too intrusive for “stable”.
>>>
>>> As for performance, I can add another patch later to remove the TLB flush
>>> that is unnecessarily performed during change_protection_range() that does
>>> permission promotion. I know that your concern is about the “protect” case
>>> but I cannot think of a good immediate solution that avoids taking mmap_lock
>>> for write.
>>>
>>> Thoughts?
>>
>> On second thought (i.e., I don't know what I was thinking), doing so
>> (checking mm_tlb_flush_pending() on every potentially dangerous PTE read
>> and flushing if needed) can lead to a huge number of TLB flushes and
>> shootdowns, since the counter might stay elevated for a considerable
>> amount of time.
>
> I've lost track as to whether we still think that this particular
> problem is really a problem,

If by “problem” you mean whether there is a correctness issue with
userfaultfd and soft-dirty deferred flushes under mmap_read_lock(): yes,
there is a problem, and I reproduced these failures on upstream.

If by “problem” you mean performance: I do not know.
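
[ For illustration only, here is a rough, untested sketch of the kind of
per-PTE-read check I referred to above. The helper name is made up;
mm_tlb_flush_pending() and flush_tlb_page() are existing interfaces. ]

#include <linux/mm.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical helper, not part of the patch: before trusting a PTE that
 * may have a deferred permission change, check whether the mm has a TLB
 * flush pending and, if so, flush the local entry. With the per-mm
 * counter this check keeps firing for as long as the counter stays
 * elevated, which is the cost concern mentioned above.
 */
static inline void maybe_flush_stale_pte(struct vm_area_struct *vma,
					 unsigned long address)
{
	if (mm_tlb_flush_pending(vma->vm_mm))
		flush_tlb_page(vma, address);
}
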
> but could we perhaps make the
> tlb_flush_pending field be per-ptl instead of per-mm? Depending on
> how it gets used, it could plausibly be done without atomics or
> expensive barriers by using PTL to protect the field.
>
> FWIW, x86 has a mm generation counter, and I don't think it would be
> totally crazy to find a way to expose an mm generation to core code.
> I don't think we'd want to expose the specific data structures that
> x86 uses to track it -- they're very tailored to the oddities of x86
> TLB management. x86 also doesn't currently have any global concept of
> which mm generation is guaranteed to have been propagated to all CPUs
> -- we track the generation in the pagetables and, per cpu, the
> generation that we know that CPU has seen. x86 could offer a function
> "ensure that all CPUs catch up to mm generation G and don't return
> until this happens" and its relative "have all CPUs caught up to mm
> generation G", but these would need to look at data from multiple CPUs
> and would probably be too expensive on very large systems to use in
> normal page faults unless we were to cache the results somewhere.
> Making a nice cache for this is surely doable, but maybe more
> complexity than we'd want.
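
[ If I understand correctly, the interface you describe would look roughly
like the following prototypes; the names are made up and nothing like this
is currently exported by x86: ]

/* Wait until every CPU using this mm has observed generation 'gen'. */
void arch_mm_sync_to_gen(struct mm_struct *mm, u64 gen);

/* Non-blocking variant: have all CPUs already caught up to 'gen'? */
bool arch_mm_gen_synced(struct mm_struct *mm, u64 gen);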

I had somewhat similar ideas: saving the generation in each page-table's
page-struct, which would allow us to (1) extend pte_same() to detect
interim changes that were reverted (RO->RW->RO) and (2) track per-PTE
pending flushes.

Obviously, I cannot do that as part of this fix. Moreover, it seems to me
that it would require a memory barrier after updating the PTEs and before
reading the current generation (which would be saved per page-table). I
have tried to think of schemes that use the per-CPU generation instead,
but so far I could not, and did not have the time to, figure one out.
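
[ To make the above concrete, a very rough sketch of what I have in mind;
none of the types or helpers below exist upstream, and the barrier
placement is exactly the part I am unsure about: ]

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <asm/barrier.h>

/* Hypothetical per-mm generation; NOT an existing mm_struct field. */
struct mm_gen {
	atomic64_t gen;		/* bumped whenever a deferred flush is queued */
};

/*
 * Writer side: after changing PTEs under the PTL, record the current
 * generation in the page-table (e.g. in its struct page). The barrier is
 * needed so the PTE stores are visible before the generation is sampled;
 * otherwise a reader could pair a stale generation with the new PTEs.
 */
static inline void pt_record_change(struct mm_gen *mg, u64 *table_gen)
{
	smp_mb();
	WRITE_ONCE(*table_gen, atomic64_read(&mg->gen));
}

/*
 * Reader side: a deferred flush is only needed if the table was last
 * changed in the generation that is still pending.
 */
static inline bool pt_flush_needed(struct mm_gen *mg, u64 table_gen)
{
	return table_gen == atomic64_read(&mg->gen);
}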