linux-kernel - Re: [GIT PULL] x86/shstk for 6.4

Message-ID: <CAHk-=wiVLvz3RdZiSjLNGKKgR3s-=2goRPnNWg6cbrcwMVvndQ@mail.gmail.com>
Date:   Mon, 8 May 2023 16:31:09 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "keescook@...omium.org" <keescook@...omium.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>
Subject: Re: [GIT PULL] x86/shstk for 6.4

On Mon, May 8, 2023 at 3:57 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> There's a wrinkle to enforcing that universally.  From the SDM's
> "ACCESSED AND DIRTY FLAGS" section:
>
>         If software on one logical processor writes to a page while
>         software on another logical processor concurrently clears the
>         R/W flag in the paging-structure entry that maps the page,
>         execution on some processors may result in the entry’s dirty
>         flag being set.

I was actually wondering about that.

I had this memory that we've done special things in the past to make
sure that the dirty bit is guaranteed stable (ie the whole
"ptep_clear()" dance). But I wasn't sure.

> This behavior is gone on shadow stack CPUs

Ok, so Intel has actually tightened up the rules on setting dirty, and
now guarantees that it will set dirty only if the pte is actually
writable?

> We could probably tolerate the cost for some of the users like ksm.  But
> I can't think of a way to do it without making fork() suffer.  fork() of
> course modifies the PTE (RW->RO) and flushes the TLB now.  But there
> would need to be a Present=0 PTE in there somewhere before the TLB flush.

Yeah, we don't want to make fork() any worse than it already is.  No
question about that.

But if we make the rule be that having the exact dirty bit vs rw bit
semantics only matters for CPUs that do the shadow stack thing, and on
*those* CPUs it's ok to not go through the dance, can we then come up
with a sequence that works for everybody?

> So, the rule would be something like:
>
>         The *kernel* will never itself create Write=0,Dirty=1 PTEs
>
> That won't prevent the hardware from still being able to do it behind
> our backs on older CPUs.  But it does avoid a few of the special cases.

Right. So looking at the fork() case as a nasty example, right now we have

        ptep_set_wrprotect()

on the source pte of a fork(), which atomically just clears the WRITE
bit (and thus guarantees that dirty bits cannot get lost, simply
because it doesn't matter if some other CPU atomically sets another
bit concurrently).
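
[Editorial sketch, not part of the original message.] The point about the
atomic clear can be illustrated in portable C11. This is only a user-space
model, assuming the x86 bit positions for Present (bit 0), R/W (bit 1) and
Dirty (bit 6); the function name sketch_set_wrprotect() is hypothetical and
is not the kernel's implementation. It operates on the whole 64-bit entry,
whereas the "lock andb" form mentioned later touches only the low byte.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed x86 PTE bit positions, for illustration only. */
#define PTE_PRESENT (UINT64_C(1) << 0)
#define PTE_RW      (UINT64_C(1) << 1)
#define PTE_DIRTY   (UINT64_C(1) << 6)

/*
 * Hypothetical model of an atomic write-protect: a single atomic
 * read-modify-write that clears only the R/W bit. Any other bit that
 * another agent sets atomically (for example Dirty) before or after
 * this operation is preserved, because no stale copy of the entry is
 * ever written back.
 */
static inline void sketch_set_wrprotect(_Atomic uint64_t *ptep)
{
    atomic_fetch_and(ptep, ~PTE_RW);
}

/*
 * Stand-in for "some other CPU atomically sets another bit". Here it
 * is just an atomic OR of the Dirty bit.
 */
static inline void sketch_mark_dirty(_Atomic uint64_t *ptep)
{
    atomic_fetch_or(ptep, PTE_DIRTY);
}
```

Whichever order the two operations land in, the result is the same: R/W
clear, Dirty set. What this model cannot show is the older-CPU behavior
quoted from the SDM, where the hardware may still set Dirty after R/W has
been cleared.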

On the destination we don't have any races with concurrent accesses,
and just do entirely non-atomic

                pte = pte_wrprotect(pte);

and then eventually (after other bit games) do

        set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);

and basically you're saying that there is no possible common sequence
for that ptep_set_wrprotect() that doesn't penalize some case.

Hmm.

Yeah, right now the non-shadow-stack ptep_set_wrprotect() can just be
an atomic clear_bit(), which turns into just

        lock andb $-3, (%reg)

and I guess that would inevitably become a horror of a cmpxchg loop
when you need to move the dirty bit to the SW dirty bit on CPUs where the
dirty bit can come in late.

How very very horrid.
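
[Editorial sketch, not part of the original message.] For comparison, the
kind of cmpxchg loop described above might look roughly like this in C11.
It is a user-space model under stated assumptions: the software-dirty bit
position (bit 58 here) and the name sketch_wrprotect_move_dirty() are
illustrative choices, not taken from the message, and the real kernel code
may differ.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_RW       (UINT64_C(1) << 1)
#define PTE_DIRTY    (UINT64_C(1) << 6)
#define PTE_SW_DIRTY (UINT64_C(1) << 58) /* hypothetical software bit */

/*
 * Hypothetical write-protect that never leaves Write=0,Dirty=1 behind:
 * clear R/W and, if the hardware Dirty bit is set, move it to the
 * software bit. Because the new value depends on the old one, a single
 * atomic AND is not enough; the update is retried until the entry did
 * not change between the read and the compare-and-exchange.
 */
static inline void sketch_wrprotect_move_dirty(_Atomic uint64_t *ptep)
{
    uint64_t old = atomic_load(ptep);
    uint64_t new;

    do {
        new = old & ~PTE_RW;
        if (new & PTE_DIRTY)
            new = (new & ~PTE_DIRTY) | PTE_SW_DIRTY;
        /* On failure, 'old' is reloaded with the current value. */
    } while (!atomic_compare_exchange_weak(ptep, &old, new));
}
```

The loop closes the window only against atomic updates that happen before
the exchange succeeds. On CPUs where Dirty "can come in late", the hardware
could still set it after the loop has finished, which is the case the
discussion above is worried about.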

                     Linus