linux-kernel - Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <dd06b54291ad5721da392a42f2d8e5636301ffef.camel@intel.com>
Date:   Wed, 25 Jan 2023 18:43:47 +0000
From:   "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To:     "david@...hat.com" <david@...hat.com>,
        "bsingharora@...il.com" <bsingharora@...il.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "Syromiatnikov, Eugene" <esyr@...hat.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "rdunlap@...radead.org" <rdunlap@...radead.org>,
        "keescook@...omium.org" <keescook@...omium.org>,
        "Eranian, Stephane" <eranian@...gle.com>,
        "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "fweimer@...hat.com" <fweimer@...hat.com>,
        "nadav.amit@...il.com" <nadav.amit@...il.com>,
        "jannh@...gle.com" <jannh@...gle.com>,
        "dethoma@...rosoft.com" <dethoma@...rosoft.com>,
        "kcc@...gle.com" <kcc@...gle.com>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "pavel@....cz" <pavel@....cz>, "oleg@...hat.com" <oleg@...hat.com>,
        "hjl.tools@...il.com" <hjl.tools@...il.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "Yang, Weijiang" <weijiang.yang@...el.com>,
        "Lutomirski, Andy" <luto@...nel.org>,
        "arnd@...db.de" <arnd@...db.de>,
        "jamorris@...ux.microsoft.com" <jamorris@...ux.microsoft.com>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "bp@...en8.de" <bp@...en8.de>,
        "Schimpe, Christina" <christina.schimpe@...el.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "mike.kravetz@...cle.com" <mike.kravetz@...cle.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "andrew.cooper3@...rix.com" <andrew.cooper3@...rix.com>,
        "john.allen@....com" <john.allen@....com>,
        "rppt@...nel.org" <rppt@...nel.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "corbet@....net" <corbet@....net>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
        "gorcunov@...il.com" <gorcunov@...il.com>
CC:     "Yu, Yu-cheng" <yu-cheng.yu@...el.com>
Subject: Re: [PATCH v5 18/39] mm: Handle faultless write upgrades for shstk

On Wed, 2023-01-25 at 10:27 +0100, David Hildenbrand wrote:
> > > Roughly speaking: if we abstract it that way and get all of the
> > > "how
> > > to
> > > set it writable now?" out of core-MM, it not only is cleaner and
> > > less
> > > error prone, it might even allow other architectures that
> > > implement
> > > something comparable (e.g., using a dedicated HW bit) to actually
> > > reuse
> > > some of that work. Otherwise most of that "shstk" is really just
> > > x86
> > > specific ...
> > > 
> > > I guess the only cases we have to special case would be page
> > > pinning
> > > code where pte_write() would indicate that the PTE is writable
> > > (well,
> > > it
> > > is, just not by "ordinary CPU instruction" context directly): but
> > > you
> > > do
> > > that already, so ... :)
> > > 
> > > Sorry for stumbling over that this late, I only started looking
> > > into
> > > this when you CCed me on that one patch.
> > 
> > Sorry for not calling more attention to it earlier. Appreciate your
> > comments.
> > 
> > Previously versions of this series had changed some of these
> > pte_mkwrite() calls to maybe_mkwrite(), which of course takes a
> > vma.
> > This way an x86 implementation could use the VM_SHADOW_STACK vma
> > flag
> > to decide between pte_mkwrite() and pte_mkwrite_shstk(). The
> > feedback
> > was that in some of these code paths "maybe" isn't really an
> > option, it
> > *needs* to make it writable. Even though the logic was the same,
> > the
> > name of the function made it look wrong.
> > 
> > But another option could be to change pte_mkwrite() to take a vma.
> > This
> > would save using another software bit on x86, but instead requires
> > a
> > small change to each arch's pte_mkwrite().
> 
> I played with that idea shortly as well, but discarded it. I was not 
> able to convince myself that it wouldn't be required to pass in the
> VMA 
> as well for things like pte_dirty(), pte_mkdirty(), pte_write(), ... 
> which would end up fairly ugly (or even impossible in thing slike
> GUP-fast).
> 
> For example, I wonder how we'd be handling stuff like do_numa_page() 
> cleanly correctly, where we use pte_modify() + pte_mkwrite(), and
> either 
> call might set the PTE writable and maintain dirty bit ...

pte_modify() is handled like this currently:

https://lore.kernel.org/lkml/20230119212317.8324-12-rick.p.edgecombe@intel.com/

There has been a couple iterations on that. The current solution is to
do the Dirty->SavedDirty fixup if needed after the new prots are added.

Of course pte_modify() can't know whether you are are attempting to
create a shadow stack PTE with the prot you are passing in. But the
callers today explicitly call pte_mkwrite() after filling in the other
bits with pte_modify(). Today this patch causes the pte_mkwrite() to be
skipped and another fault may be required in the mprotect() and numa
cases, but if we change pte_mkwrite() to take a VMA we can just make it
shadow stack to start.

It might be worth mentioning, there was a suggestion in the past to try
to have the shadow stack bits come out of vm_get_page_prot(), but MM
code would then try to map the zero page as (shadow stack) writable
when there was a normal (non-shadow stack) read access. So I had to
abandon that approach and rely on explicit calls to pte_mkwrite/shstk()
to make it shadow stack.

> 
> Having that said, maybe it could work with only a single saved-dirty
> bit 
> and passing in the VMA for pte_mkwrite() only.
> 
> pte_wrprotect() would detect "writable=0,dirty=1" and move the dirty
> bit 
> to the soft-dirty bit instead, resulting in 
> "writable=0,dirty=0,saved-dirty=1",
> 
> pte_dirty() would return dirty==1||saved-dirty==1.
> 
> pte_mkdirty() would set either set dirty=1 or saved-dirty=1,
> depending 
> on the writable bit.
> 
> pte_mkclean() would clean both bits.
> 
> pte_write() would detect "writable == 1 || (writable==0 && dirty==1)"
> 
> pte_mkwrite() would act according to the VMA, and in addition, merge
> the 
> saved-dirty bit into the dirty bit.
> 
> pte_modify() and mk_pte() .... would require more thought ...

Not sure I'm following what the mk_pte() problem would be. You mean if
Write=0,Dirty=1 is manually added to the prot?

Shouldn't people generally use the pte_mkwrite() helpers unless they
are drawing from a prot that was already created with the helpers or
vm_get_page_prot()? I think they can't manually create prot's from bits
in core mm code, right? And x86 arch code already has to be aware of
shadow stack. It's a bit of an assumption I guess, but I think maybe
not too crazy of one?

> 
> 
> Further, ptep_modify_prot_commit() might have to be adjusted to
> properly 
> flush in all relevant cases IIRC.

Sorry, I'm not following. Can you elaborate? There is an adjustment
made in pte_flags_need_flush().

> 
> > 
> > x86's pte_mkwrite() would then be pretty close to maybe_mkwrite(),
> > but
> > maybe it could additionally warn if the vma is not writable. It
> > also
> > seems more aligned with your changes to stop taking hints from PTE
> > bits
> > and just look at the VMA? (I'm thinking about the dropping of the
> > dirty
> > check in GUP and dropping pte_saved_write())
> 
> The soft-shstk bit wouldn't be a hint, it would be logically
> changing 
> the "type" of the PTE such that any other PTE functions can do the
> right 
> thing without having to consume the VMA.

Yea, true.

Thanks for your comments and ideas here, I'll give the:
pte_t pte_mkwrite(struct vm_area_struct *vma, pte_t pte)
...solution a try.