linux-kernel - Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7a7377c8-1852-88d3-a768-ed1a6eac0c0c@redhat.com>
Date:   Thu, 21 Jul 2022 09:45:59 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Linux MM <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Andrew Cooper <andrew.cooper3@...rix.com>,
        Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Peter Xu <peterx@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Nick Piggin <npiggin@...il.com>
Subject: Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be
 writable

On 20.07.22 19:25, Nadav Amit wrote:
> On Jul 20, 2022, at 8:19 AM, David Hildenbrand <david@...hat.com> wrote:
> 
>> On 18.07.22 14:02, Nadav Amit wrote:
>>> From: Nadav Amit <namit@...are.com>
>>>
>>> Anonymous pages might have the dirty bit clear, but this should not
>>> prevent mprotect from making them writable if they are exclusive.
>>> Therefore, skip the test whether the page is dirty in this case.
>>>
>>> Cc: Andrea Arcangeli <aarcange@...hat.com>
>>> Cc: Andrew Cooper <andrew.cooper3@...rix.com>
>>> Cc: Andrew Morton <akpm@...ux-foundation.org>
>>> Cc: Andy Lutomirski <luto@...nel.org>
>>> Cc: Dave Hansen <dave.hansen@...ux.intel.com>
>>> Cc: David Hildenbrand <david@...hat.com>
>>> Cc: Peter Xu <peterx@...hat.com>
>>> Cc: Peter Zijlstra <peterz@...radead.org>
>>> Cc: Thomas Gleixner <tglx@...utronix.de>
>>> Cc: Will Deacon <will@...nel.org>
>>> Cc: Yu Zhao <yuzhao@...gle.com>
>>> Cc: Nick Piggin <npiggin@...il.com>
>>> Signed-off-by: Nadav Amit <namit@...are.com>
>>> ---
>>> mm/mprotect.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 34c2dfb68c42..da5b9bf8204f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>>
>>> 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>>>
>>> -	if (pte_protnone(pte) || !pte_dirty(pte))
>>> +	if (pte_protnone(pte))
>>> 		return false;
>>>
>>> 	/* Do we need write faults for softdirty tracking? */
>>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>> 		page = vm_normal_page(vma, addr, pte);
>>> 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>>> 			return false;
>>> -	}
>>> +	} else if (!pte_dirty(pte))
>>> +		return false;
>>>
>>> 	return true;
>>> }
>>
>> When I wrote that code, I was wondering how often that would actually
>> happen in practice -- and if we care about optimizing that. Do you have
>> a gut feeling in which scenarios this would happen and if we care?
>>
>> If the page is in the swapcache and was swapped out, you'd be requiring
>> a writeback even though nobody modified the page and possibly isn't
>> going to do so in the near future.
> 
> So here is my due diligence: I did not really encounter a scenario in which
> it showed up. When I looked at your code, I assumed this was an oversight
> and not a thoughtful decision. For me the issue is more of the discrepancy
> between how a certain page is handled before and after it was pages out.
> 
> The way that I see it, there is a tradeoff in the way dirty-bit should
> be handled:
> (1) Writable-clean PTEs introduce some non-negligible overhead.
> (2) Marking a PTE dirty speculatively would require a write back.
> 
> … But this tradeoff should not affect whether a PTE is writable, i.e.,
> mapping the PTE as writable-clean should not cause a writeback. In other
> words, if you are concerned about unnecessary writebacks, which I think is a
> fair concern, then do not set the dirty-bit. In a later patch I try to avoid
> TLB flushes on clean-writable entries that are write-protected.
> 
> So I do not think that the writeback you mentioned should be a real issue.
> Yet if you think that using the fact that the page is not-dirty is a good
> hueristics to avoid future TLB flushes (for P->NP; as I said there is a
> solution for RW->RO), or if you are concerned about the cost of
> vm_normal_page(), perhaps those are valid concerned (although I do not think
> so).

I think I now understand what you mean. I somehow remembered that some
architectures set a PTE dirty when marking it writable, but I guess this
is not true -- and setting it writable will keep it !dirty until really
accessed (either by the HW or by a fault). [I'll do some more digging
just to confirm]

With that in mind, your patch makes sense, and I guess you'd want that
as patch #1, because otherwise I fail to see how current patch #1 would
even succeed in reaching the "pte_mkdirty" call -- if !pte_dirty()
protects the code from running.

> 
> --
> 
> [ Regarding (1): After some discussions with Peter and reading more code, I
> thought at some point that perhaps avoiding having writable-clean PTE as
> much as possible makes sense [*], since setting the dirty-bit costs ~550
> cycles and a page fault is not a lot more than 1000. But with all the
> mitigations (and after adding IBRS for retbless) page-fault entry is kind of
> expensive. 

I understand the reasoning for anonymous memory, but not for page cache
pages. And for anonymous memory I think there are still cases where we
don't want to do that (swapcache and MADV_FREE comes to mind) -- and
IMHO some of the scenarios where we have clean anonymous memory at all,
we just don't care about optimizing for setting the dirty bit faster on
x86 ;) .

So my gut feeling is to keep it as simple as possible. To me, it
translates to setting the dirty bit only if we have clear indication
that write access is likely next.

Thanks Nadav for refreshing my memory :) !

-- 
Thanks,

David / dhildenb