[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqz7bxfsz6l.fsf@google.com>
Date: Wed, 01 Oct 2025 10:18:10 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Christian Borntraeger <borntraeger@...ux.ibm.com>,
Janosch Frank <frankja@...ux.ibm.com>, Claudio Imbrenda <imbrenda@...ux.ibm.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, David Hildenbrand <david@...hat.com>,
Fuad Tabba <tabba@...gle.com>
Subject: Re: [PATCH 5/6] KVM: selftests: Add wrappers for mmap() and munmap()
to assert success
Sean Christopherson <seanjc@...gle.com> writes:
> On Tue, Sep 30, 2025, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@...gle.com> writes:
>> > To be perfectly honest, I forgot test_util.h existed :-)
>>
>> Merging/dropping one of kvm_util.h vs test_util.h is a good idea. The
>> distinction is not clear and it's already kind of messy between the two.
>
> That's a topic for another day.
>
>> It's a common pattern in KVM selftests to have a syscall/ioctl wrapper
>> foo() that asserts defaults and a __foo() that doesn't assert anything
>> and allows tests to assert something else, but I have a contrary
>> opinion.
>>
>> I think it's better that tests be explicit about what they're testing
>> for, so perhaps it's better to use macros like TEST_ASSERT_EQ() to
>> explicitly call a function and check the results.
>
> No, foo() and __foo() is a well-established pattern in the kernel, and in KVM
> selftests it is a very well-established pattern for syscalls and ioctls. And
> I feel very, very strong about handling errors in the core infrastructure.
>
> Relying on developers to remember to add an assert is 100% guaranteed to result
> in missed asserts. That makes everyone's life painful, because inevitably an
> ioctl will fail on someone else's system, and then they're stuck debugging a
> super random failure with no insight into what the developer _meant_ to do.
>
> And requiring developers to write (i.e. copy+paste) boring, uninteresting code
> to handle failures adds a lot of friction to development, is a terrible use of
> developers' time, and results in _awful_ error messages. Bad or missing error
> messages in tests have easily wasted tens of hours of just _my_ time; I suspect
> the total cost throughout the KVM community can be measured in tens of days.
>
> E.g. pop quiz, what state did I clobber that generated this error message with
> a TEST_ASSERT_EQ(ret, 0)? Answer at the bottom.
>
> ==== Test Assertion Failure ====
> lib/x86/processor.c:1128: ret == 0
> pid=2456 tid=2456 errno=22 - Invalid argument
> 1 0x0000000000415465: vcpu_load_state at processor.c:1128
> 2 0x0000000000402805: save_restore_vm at hyperv_evmcs.c:221
> 3 0x000000000040204d: main at hyperv_evmcs.c:286
> 4 0x000000000041df43: __libc_start_call_main at libc-start.o:?
> 5 0x00000000004200ec: __libc_start_main_impl at ??:?
> 6 0x0000000000402220: _start at ??:?
> 0xffffffffffffffff != 0 (ret != 0)
>
> You might say "oh, I can go look at the source". But what if you don't have the
> source because you got a test failure from CI? Or because the assert came from
> a bug report due to a failure in someone else's CI pipeline?
>
> That is not a contrived example. Before the ioctl assertion framework was added,
> KVM selftests was littered with such garbage. Note, I'm not blaming developers
> in any way. After having to add tens of asserts on KVM ioctls just to write a
> simple test, it's entirely natural to become fatigued and start throwing in
> TEST_ASSERT_EQ(ret, 0) or TEST_ASSERT(!ret, "ioctl failed").
>
> There's also the mechanics of requiring the caller to assert. KVM ioctls that
> return a single value, e.g. register accessors, then need to use an out-param to
> communicate the value or error code, e.g. this
>
> val = vcpu_get_reg(vcpu, reg_id);
> TEST_ASSERT_EQ(val, 0);
>
> would become this:
>
> ret = vcpu_get_reg(vcpu, reg_id, &val);
> TEST_ASSERT_EQ(ret, 0);
> TEST_ASSERT_EQ(val, 0);
>
> But of course, the developer would bundle that into:
>
> TEST_ASSERT(!ret && !val, "get_reg failed");
>
> And then the user is really sad when the "!val" condition fails, because they
> can't even tell. Again, this't a contrived example, it literally happend to me
> when dealing with the guest_memfd NUMA testcase, and was what prompted me to
> write this syscall framework. This also shows the typical error message that a
> developer will write.
>
> This TEST_ASSERT() failed on me due to a misguided cleanup I made:
>
> ret = syscall(__NR_get_mempolicy, &get_policy, &get_nodemask,
> maxnode, mem, MPOL_F_ADDR);
> TEST_ASSERT(!ret && get_policy == MPOL_DEFAULT && get_nodemask == 0,
> "Policy should be MPOL_DEFAULT and nodes zero");
>
> generating this error message:
>
> ==== Test Assertion Failure ====
> guest_memfd_test.c:120: !ret && get_policy == MPOL_DEFAULT && get_nodemask == 0
> pid=52062 tid=52062 errno=22 - Invalid argument
> 1 0x0000000000404113: test_mbind at guest_memfd_test.c:120 (discriminator 6)
> 2 (inlined by) __test_guest_memfd at guest_memfd_test.c:409 (discriminator 6)
> 3 0x0000000000402320: test_guest_memfd at guest_memfd_test.c:432
> 4 (inlined by) main at guest_memfd_test.c:529
> 5 0x000000000041eda3: __libc_start_call_main at libc-start.o:?
> 6 0x0000000000420f4c: __libc_start_main_impl at ??:?
> 7 0x00000000004025c0: _start at ??:?
> Policy should be MPOL_DEFAULT and nodes zero
>
> At first glance, it would appear that get_mempolicy() failed with -EINVAL. Nope.
> ret==0, but errno was left set from an earlier syscall. It took me a few minutes
> of digging and a run with strace to figure out that get_mempolicy() succeeded.
>
> Constrast that with:
>
> kvm_get_mempolicy(&policy, &nodemask, maxnode, mem, MPOL_F_ADDR);
> TEST_ASSERT(policy == MPOL_DEFAULT && !nodemask,
> "Wanted MPOL_DEFAULT (%u) and nodemask 0x0, got %u and 0x%lx",
> MPOL_DEFAULT, policy, nodemask);
>
> ==== Test Assertion Failure ====
> guest_memfd_test.c:120: policy == MPOL_DEFAULT && !nodemask
> pid=52700 tid=52700 errno=22 - Invalid argument
> 1 0x0000000000404915: test_mbind at guest_memfd_test.c:120 (discriminator 6)
> 2 (inlined by) __test_guest_memfd at guest_memfd_test.c:407 (discriminator 6)
> 3 0x0000000000402320: test_guest_memfd at guest_memfd_test.c:430
> 4 (inlined by) main at guest_memfd_test.c:527
> 5 0x000000000041eda3: __libc_start_call_main at libc-start.o:?
> 6 0x0000000000420f4c: __libc_start_main_impl at ??:?
> 7 0x00000000004025c0: _start at ??:?
> Wanted MPOL_DEFAULT (0) and nodemask 0x0, got 1 and 0x1
>
> Yeah, there's still some noise with errno=22, but it's fairly clear that the
> returned values mismatches, and super obvious that the syscall succeeded when
> looking at the code. This is not a cherry-picked example. There are hundreds,
> if not thousands, of such asserts in KVM selftests and KVM-Unit-Tests in
> particular. And that's when developers _aren't_ forced to manually add boilerplate
> asserts in ioctls succeeding.
>
> For people that are completely new to KVM selftests, I can appreciate that it
> might take a while to acclimate to the foo() and __foo() pattern, but I have a
> hard time believing that it adds significant cognitive load after you've spent
> a decent amount of time in KVM selftests. And I 100% want to cater to the people
> that are dealing with KVM selftests day in, day out.
>
Thanks for taking the time to write this up. I'm going to start a list
of "most useful explanations" and this will go on that list.
>> Or perhaps it should be more explicit, like in the name, that an
>> assertion is made within this function?
>
> No, that's entirely inflexible, will lead to confusion, and adds a copious amount
> of noise. E.g. this
>
> /* emulate hypervisor clearing CR4.OSXSAVE */
> vcpu_sregs_get(vcpu, &sregs);
> sregs.cr4 &= ~X86_CR4_OSXSAVE;
> vcpu_sregs_set(vcpu, &sregs);
>
> versus
>
> /* emulate hypervisor clearing CR4.OSXSAVE */
> vcpu_sregs_get_assert(vcpu, &sregs);
> sregs.cr4 &= ~X86_CR4_OSXSAVE;
> vcpu_sregs_set_assert(vcpu, &sregs);
>
> The "assert" is pure noise and makes it harder to see the "get" versus "set".
>
> If we instead annotate the the "no_assert" case, then we'll end up with ambigous
> cases where a developer won't be able to determine if an unannotated API asserts
> or not, and conflict cases where a "no_assert" API _does_ assert, just not on the
> primary ioctl it's invoking.
>
> IMO, foo() and __foo() is quite explicit once you become accustomed to the
> environment.
>
>> In many cases a foo() exists without the corresponding __foo(), which
>> seems to be discouraging testing for error cases.
>
> That's almost always because no one has needed __foo().
>
>> Also, I guess especially for vcpu_run(), tests would like to loop/take
>> different actions based on different errnos and then it gets a bit
>> unwieldy to have to avoid functions that have assertions within them.
>
> vcpu_run() is a special case. KVM_RUN is so much more than a normal ioctl, and
> so having vcpu_run() follow the "standard" pattern isn't entirely feasible.
>
> Speaking of vcpu_run(), and directly related to idea of having developers manually
> do TEST_ASSERT_EQ(), one of the top items on my selftests todo list is to have
> vcpu_run() handle GUEST_ASSERT and GUEST_PRINTF whenever possible. Having to add
> UCALL_PRINTF handling just to get a debug message out of a test's guest code is
> beyond frustrating. Ditto for the 60+ tests that had to manually add UCALL_ABORT
> handling, which leads to tests having code like this, which then gets copy+pasted
> all over the place and becomes a nightmare to maintain.
+1000 this is exactly where I had to avoid assertions!
>
> static void __vcpu_run_expect(struct kvm_vcpu *vcpu, unsigned int cmd)
> {
> struct ucall uc;
>
> vcpu_run(vcpu);
> switch (get_ucall(vcpu, &uc)) {
> case UCALL_ABORT:
> REPORT_GUEST_ASSERT(uc);
> break;
> default:
> if (uc.cmd == cmd)
> return;
>
> TEST_FAIL("Unexpected ucall: %lu", uc.cmd);
> }
> }
>
>> I can see people forgetting to add TEST_ASSERT_EQ()s to check results of
>> setup/teardown functions but I think those errors would surface some
>> other way anyway.
>
> Heh, I don't mean to be condescending, but I highly doubt you'll have this
> opinion after you've had to debug a completely unfamiliar test that's failing
> in weird ways, for the tenth time.
>
>> Not a strongly-held opinion,
>
> As you may have noticed, I have extremely strong opinions in this area :-)
>
>> and no major concerns on the naming either. It's a selftest after all and
>> IIUC we're okay to have selftest interfaces change anyway?
>
> Yes, changes are fine. It's the churn I want to avoid.
>
> Oh, and here's the "answer" to the TEST_ASSERT_EQ() failure:
>
> ==== Test Assertion Failure ====
> include/kvm_util.h:794: !ret
> pid=43866 tid=43866 errno=22 - Invalid argument
> 1 0x0000000000415486: vcpu_sregs_set at kvm_util.h:794 (discriminator 4)
> 2 (inlined by) vcpu_load_state at processor.c:1125 (discriminator 4)
> 3 0x0000000000402805: save_restore_vm at hyperv_evmcs.c:221
> 4 0x000000000040204d: main at hyperv_evmcs.c:286
> 5 0x000000000041dfc3: __libc_start_call_main at libc-start.o:?
> 6 0x000000000042016c: __libc_start_main_impl at ??:?
> 7 0x0000000000402220: _start at ??:?
> KVM_SET_SREGS failed, rc: -1 errno: 22 (Invalid argument)
Powered by blists - more mailing lists