Message-ID: <Ztuj7KapTJyBVCVR@google.com>
Date: Sat, 7 Sep 2024 00:53:00 +0000
From: Sean Christopherson <seanjc@...gle.com>
To: James Houghton <jthoughton@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Oliver Upton <oliver.upton@...ux.dev>, Marc Zyngier <maz@...nel.org>, Peter Xu <peterx@...hat.com>, 
	James Houghton <jthoughton@...gle.com>
Subject: Re: [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)

On Fri, Sep 06, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@...gle.com> wrote:
> >
> > Add two phases to mmu_stress_test to verify that KVM correctly handles
> > guest memory that was writable, and then made read-only in the primary MMU,
> > and then made writable again.
> >
> > Add bonus coverage for x86 to verify that all of guest memory was marked
> > read-only.  Making forward progress (without making memory writable)
> > requires arch specific code to skip over the faulting instruction, but the
> > test can at least verify each vCPU's starting page was made read-only.
> >
> > Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> 
> Writing off-list because I just have some smaller questions that I
> don't want to bother the list with.

Pulling everyone and the lists back in :-)

IMO, no question is too small for kvm@, and lkml@ is a gigantic firehose that's 99%
archival and 1% list, at best.  Odds are very, very good that if you have a
question, however trivial or small, someone else has the exact same question,
or _will_ have it in the future.

I strongly prefer that all questions, review, feedback, etc. happen on list, even
if the questions/feedback may seem trivial or noisy.  The only exception is if
information can't/shouldn't be made public, e.g. because of an embargo, NDA,
security implications, etc.

> For the next version, feel free to add:
> 
> Reviewed-by: James Houghton <jthoughton@...gle.com>
> 
> All of the selftest patches look fine to me.
> 
> > ---
> >  tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
> >  1 file changed, 84 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > index 50c3a17418c4..98f9a4660269 100644
> > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > @@ -16,6 +16,8 @@
> >  #include "guest_modes.h"
> >  #include "processor.h"
> >
> > +static bool mprotect_ro_done;
> > +
> >  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> >  {
> >         uint64_t gpa;
> > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> >                 *((volatile uint64_t *)gpa);
> >         GUEST_SYNC(2);
> >
> > +       /*
> > +        * Write to the region while mprotect(PROT_READ) is underway.  Keep
> > +        * looping until the memory is guaranteed to be read-only, otherwise
> > +        * vCPUs may complete their writes and advance to the next stage
> > +        * prematurely.
> > +        */
> > +       do {
> > +               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> > +#ifdef __x86_64__
> > +                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
> 
> Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = gpa. :)
> 
> Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov
> BYTE PTR [rax], 0x0`. (just curious, no need to change it)

LOL, yes, but as evidenced by the trailing comment, my intent was to generate
"mov rax, [rax]", not "movb $0, [rax]".  I suspect I was too lazy to consult the
SDM to recall the correct opcode, and simply copied an instruction from some random
disassembly output without looking too closely at it.

	asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */

> And I take it you wrote it out like this (instead of using mnemonics)
> so that you could guarantee that IP + 4 would be the right way to skip
> forwards. Does it make sense to leave a comment about that? 

Yes and yes.

> The translation from mnemonic --> bytes won't change...

Heh, this is x86; by no means is that guaranteed.  E.g. see the above, where the
same mnemonic can be encoded multiple ways.

> so could you just write the proper assembly? (not a request, just curious)

In practice, probably.  But the rules for inline assembly are, at best, fuzzy.
So long as the correct instruction is generated, the assembler has quite a bit
of freedom.

E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as:

  48 89 00

but can also be encoded as

  48 89 40 00

Now, it's _extremely_ unlikely a compiler will actually generate the latter, but
it's perfectly legal to do so.  E.g. with gcc-13, this

  mov %rax, 0x0(%rax)

generates

  48 89 00

even though a more literal interpretation would be

  48 89 40 00

So yeah, the hand-coded opcode is gross and annoying, but a failure due to the
"wrong" instruction being generated would be painful and time-consuming to debug,
so hand-coding is worth it to avoid the risk and potential pain if the compiler
decides to be mean :-)
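
If anyone wants to see the two encodings side by side, hand-coding each one
(throwaway snippet, not part of the series) shows they differ only in length:

  static inline void store_3byte(unsigned long gpa)
  {
  	/* 48 89 00: "mov %rax,(%rax)", ModR/M mod=00, no displacement. */
  	asm volatile(".byte 0x48,0x89,0x00" :: "a" (gpa) : "memory");
  }

  static inline void store_4byte(unsigned long gpa)
  {
  	/* 48 89 40 00: "mov %rax,0x0(%rax)", ModR/M mod=01 with a zero disp8. */
  	asm volatile(".byte 0x48,0x89,0x40,0x00" :: "a" (gpa) : "memory");
  }

Both perform the exact same store; the only difference is the length, which is
of course the one thing the test actually cares about when skipping the
instruction.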

> A comment that 0x40 corresponds to %rax and that "a" also corresponds
> to %rax would have been helpful for me. :)

Eh, I get what you're saying, but giving a play-by-play of the encoding isn't
really all that reasonable because _so_ much information needs to be conveyed to
capture the entire picture, and some things are essentially table stakes when it
comes to x86 kernel programming.

E.g. 0x40 doesn't simply mean "(%rax)"; it's a full ModR/M byte that defines the
addressing mode, which in turn depends on the operating mode (64-bit).

And "a" isn't just %rax; it's specifically an input register constraint, e.g. is
distinctly different than:

  asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax, (%rax) */

even though in this specific scenario they generate the same code.

And with the correct "48 89 00", understanding the full encoding requires describing
REX prefixes, which are a mess unto themselves.

So, a trailing comment (with the correct mnemonic) is all I'm willing to do, even
though I 100% agree that it puts a decent-sized load on the reader.  There's just
_too_ much information to communicate, at least for x86.
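
To make the "a" vs. "+a" distinction above concrete, a throwaway snippet (not
from the series):

  static inline void store_in_only(unsigned long gpa)
  {
  	/* "a" is a pure input: the asm promises not to modify %rax, so the
  	 * compiler may continue to assume %rax holds gpa afterwards.
  	 */
  	asm volatile(".byte 0x48,0x89,0x00" :: "a" (gpa) : "memory");
  }

  static inline unsigned long store_in_out(unsigned long gpa)
  {
  	/* "+a" is input+output: gpa is treated as potentially rewritten by the
  	 * asm, even though this particular store never touches %rax.
  	 */
  	asm volatile(".byte 0x48,0x89,0x00" : "+a" (gpa) :: "memory");
  	return gpa;
  }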

> > +#else
> > +                       vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> > +#endif
> > +       } while (!READ_ONCE(mprotect_ro_done));
> > +
> > +       /*
> > +        * Only x86 can explicitly sync, as other architectures will be stuck
> > +        * on the write fault.
> 
> It would also have been a little clearer if the comment also said how
> this is just because the test has been written to increment the PC
> upon getting these write faults *for x86 only*. IDK, something like
> "For x86, the test will adjust the PC for each write fault, allowing
> the above loop to complete. Other architectures will get stuck, so the
> #3 sync step is skipped."

Ya.  Untested, but how about this?

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 2d66c2724336..29acb22ea387 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         * looping until the memory is guaranteed to be read-only, otherwise
         * vCPUs may complete their writes and advance to the next stage
         * prematurely.
+        *
+        * For architectures that support skipping the faulting instruction,
+        * generate the store via inline assembly to ensure the exact length
+        * of the instruction is known and stable (vcpu_arch_put_guest() on
+        * fixed-length architectures should work, but the cost of paranoia
+        * is low in this case).  For x86, hand-code the exact opcode so that
+        * there is no room for variability in the generated instruction.
         */
        do {
                for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
+                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
                        asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
@@ -163,7 +170,7 @@ static void *vcpu_worker(void *data)
                TEST_ASSERT_EQ(errno, EFAULT);
 #ifdef __x86_64__
                WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
-               vcpu->run->s.regs.regs.rip += 4;
+               vcpu->run->s.regs.regs.rip += 3;
 #endif
 #ifdef __aarch64__
                vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
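
And for anyone who hasn't seen the skip-the-faulting-instruction trick outside
of KVM, the same idea can be sketched in plain userspace (illustrative only,
assumes x86-64 + glibc, error checks omitted, not part of the test):

  /* sigskip.c - a SIGSEGV handler advances RIP past a hand-coded 3-byte store
   * by a known, fixed amount, so execution makes forward progress without the
   * write ever landing.
   */
  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <ucontext.h>

  static void segv_handler(int sig, siginfo_t *info, void *ucontext)
  {
  	ucontext_t *ctx = ucontext;

  	/* Skip the faulting 3-byte "mov %rax,(%rax)" (0x48,0x89,0x00). */
  	ctx->uc_mcontext.gregs[REG_RIP] += 3;
  }

  int main(void)
  {
  	struct sigaction sa = {
  		.sa_sigaction = segv_handler,
  		.sa_flags = SA_SIGINFO,
  	};
  	void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
  			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	unsigned long addr = (unsigned long)page;

  	sigaction(SIGSEGV, &sa, NULL);
  	mprotect(page, 4096, PROT_READ);

  	/* Faults, the handler bumps RIP past the store, execution continues. */
  	asm volatile(".byte 0x48,0x89,0x00" :: "a" (addr) : "memory");

  	printf("made forward progress without the store retiring\n");
  	return 0;
  }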

