[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <79e0a85e-1ec4-e359-649d-618ca79c36f7@redhat.com>
Date: Thu, 26 Jan 2023 09:57:27 +0100
From: David Hildenbrand <david@...hat.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"bsingharora@...il.com" <bsingharora@...il.com>,
"hpa@...or.com" <hpa@...or.com>,
"Syromiatnikov, Eugene" <esyr@...hat.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"rdunlap@...radead.org" <rdunlap@...radead.org>,
"keescook@...omium.org" <keescook@...omium.org>,
"Eranian, Stephane" <eranian@...gle.com>,
"kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"fweimer@...hat.com" <fweimer@...hat.com>,
"nadav.amit@...il.com" <nadav.amit@...il.com>,
"jannh@...gle.com" <jannh@...gle.com>,
"dethoma@...rosoft.com" <dethoma@...rosoft.com>,
"kcc@...gle.com" <kcc@...gle.com>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
"pavel@....cz" <pavel@....cz>, "oleg@...hat.com" <oleg@...hat.com>,
"hjl.tools@...il.com" <hjl.tools@...il.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"Yang, Weijiang" <weijiang.yang@...el.com>,
"Lutomirski, Andy" <luto@...nel.org>,
"arnd@...db.de" <arnd@...db.de>,
"jamorris@...ux.microsoft.com" <jamorris@...ux.microsoft.com>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"bp@...en8.de" <bp@...en8.de>,
"Schimpe, Christina" <christina.schimpe@...el.com>,
"x86@...nel.org" <x86@...nel.org>,
"mike.kravetz@...cle.com" <mike.kravetz@...cle.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"andrew.cooper3@...rix.com" <andrew.cooper3@...rix.com>,
"john.allen@....com" <john.allen@....com>,
"rppt@...nel.org" <rppt@...nel.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"corbet@....net" <corbet@....net>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
"gorcunov@...il.com" <gorcunov@...il.com>
Cc: "Yu, Yu-cheng" <yu-cheng.yu@...el.com>
Subject: Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk
On 25.01.23 19:43, Edgecombe, Rick P wrote:
> On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
>>>> Roughly speaking: if we abstract it that way and get all of the
>>>> "how
>>>> to
>>>> set it writable now?" out of core-MM, it not only is cleaner and
>>>> less
>>>> error prone, it might even allow other architectures that
>>>> implement
>>>> something comparable (e.g., using a dedicated HW bit) to actually
>>>> reuse
>>>> some of that work. Otherwise most of that "shstk" is really just
>>>> x86
>>>> specific ...
>>>>
>>>> I guess the only cases we have to special case would be page
>>>> pinning
>>>> code where pte_write() would indicate that the PTE is writable
>>>> (well,
>>>> it
>>>> is, just not by "ordinary CPU instruction" context directly): but
>>>> you
>>>> do
>>>> that already, so ... :)
>>>>
>>>> Sorry for stumbling over that this late, I only started looking
>>>> into
>>>> this when you CCed me on that one patch.
>>>
>>> Sorry for not calling more attention to it earlier. Appreciate your
>>> comments.
>>>
>>> Previously versions of this series had changed some of these
>>> pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
>>> vma.
>>> This way an x86 implementation could use the VM_SHADOW_STACK vma
>>> flag
>>> to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
>>> feedback
>>> was that in some of these code paths "maybe" isn't really an
>>> option, it
>>> *needs* to make it writable. Even though the logic was the same,
>>> the
>>> name of the function made it look wrong.
>>>
>>> But another option could be to change pte_mkwrite() to take a vma.
>>> This
>>> would save using another software bit on x86, but instead requires
>>> a
>>> small change to each arch's pte_mkwrite().
>>
>> I played with that idea shortly as well, but discarded it. I was not
>> able to convince myself that it wouldn't be required to pass in the
>> VMA
>> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ...
>> which would end up fairly ugly (or even impossible in thing slike
>> GUP-fast).
>>
>> For example, I wonder how we'd be handling stuff like do_numa_page()
>> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
>> either
>> call might set the PTE writable and maintain dirty bit ...
>
> pte_modify() is handled like this currently:
>
> https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/
>
> There has been a couple iterations on that. The current solution is to
> do the Dirty->SavedDirty fixup if needed after the new prots are added.
>
> Of course pte_modify() can't know whether you are are attempting to
> create a shadow stack PTE with the prot you are passing in. But the
> callers today explicitly call pte_mkwrite() after filling in the other
> bits with pte_modify().
See below on my MAP_PRIVATE vs. MAP_SHARED comment.
> Today this patch causes the pte_mkwrite() to be
> skipped and another fault may be required in the mprotect() and numa
> cases, but if we change pte_mkwrite() to take a VMA we can just make it
> shadow stack to start.
>
> It might be worth mentioning, there was a suggestion in the past to try
> to have the shadow stack bits come out of vm_get_page_prot(), but MM
> code would then try to map the zero page as (shadow stack) writable
> when there was a normal (non-shadow stack) read access. So I had to
> abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
> to make it shadow stack.
Thanks, do you have a pointer?
>
>>
>> Having that said, maybe it could work with only a single saved-dirty
>> bit
>> and passing in the VMA for pte_mkwrite() only.
>>
>> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
>> bit
>> to the soft-dirty bit instead, resulting in
>> "writable=0,dirty=0,saved-dirty=1",
>>
>> pte_dirty() would return dirty==1||saved-dirty==1.
>>
>> pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
>> depending
>> on the writable bit.
>>
>> pte_mkclean() would clean both bits.
>>
>> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
>>
>> pte_mkwrite() would act according to the VMA, and in addition, merge
>> the
>> saved-dirty bit into the dirty bit.
>>
>> pte_modify() and mk_pte() .... would require more thought ...
>
> Not sure I'm following what the mk_pte() problem would be. You mean if
> Write=0,Dirty=1 is manually added to the prot?
>
> Shouldn't people generally use the pte_mkwrite() helpers unless they
> are drawing from a prot that was already created with the helpers or
> vm_get_page_prot()?
pte_mkwrite() is mostly only used (except for writenotify ...) for
MAP_PRIVATE memory ("COW-able"). For MAP_SHARED memory,
vma->vm_page_prot in a VM_WRITE mapping already contains the write
permissions. pte_mkwrite() is not necessary (again, unless writenotify
is active).
I assume shstk VMAs don't apply to MAP_SHARED VMAs, which is why you
didn't stumble over that issue yet? Because I don't see how it could
work with MAP_SHARED VMAs.
The other thing I had in mind was that we have to make sure that we're
not accidentally setting "Write=0,Dirty=1" in mk_pte() / pte_modify().
Assume we had a "Write=1,Dirty=1" PTE, and we effectively wrprotect
using pte_modify(), we have to make sure to move the dirty bit to the
saved_dirty bit.
> I think they can't manually create prot's from bits
> in core mm code, right? And x86 arch code already has to be aware of
> shadow stack. It's a bit of an assumption I guess, but I think maybe
> not too crazy of one?
I think that's true. Arch code is supposed to deal with that IIRC.
>
>>
>>
>> Further, ptep_modify_prot_commit() might have to be adjusted to
>> properly
>> flush in all relevant cases IIRC.
>
> Sorry, I'm not following. Can you elaborate? There is an adjustment
> made in pte_flags_need_flush().
Note that I did not fully review all bits of this patch set, just
throwing out what was on my mind. If already handled, great.
>
>>
>>>
>>> x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
>>> but
>>> maybe it could additionally warn if the vma is not writable. It
>>> also
>>> seems more aligned with your changes to stop taking hints from PTE
>>> bits
>>> and just look at the VMA? (I'm thinking about the dropping of the
>>> dirty
>>> check in GUP and dropping pte_saved_write())
>>
>> The soft-shstk bit wouldn't be a hint, it would be logically
>> changing
>> the "type" of the PTE such that any other PTE functions can do the
>> right
>> thing without having to consume the VMA.
>
> Yea, true.
>
> Thanks for your comments and ideas here, I'll give the:
> pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
> ...solution a try.
Good!
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists