[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aEkqtfcOJDrxAAcs@nvidia.com>
Date: Wed, 11 Jun 2025 00:05:25 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: Jason Gunthorpe <jgg@...dia.com>, Thomas Weißschuh
<thomas.weissschuh@...utronix.de>
CC: Shuah Khan <shuah@...nel.org>, Shuah Khan <skhan@...uxfoundation.org>,
Willy Tarreau <w@....eu>, Thomas Weißschuh
<linux@...ssschuh.net>, Kees Cook <kees@...nel.org>, Andy Lutomirski
<luto@...capital.net>, Will Drewry <wad@...omium.org>, Mark Brown
<broonie@...nel.org>, Muhammad Usama Anjum <usama.anjum@...labora.com>,
<linux-kernel@...r.kernel.org>, <linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH v4 09/14] selftests: harness: Move teardown conditional
into test metadata
On Tue, Jun 10, 2025 at 08:46:57PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 10, 2025 at 11:48:44AM -0700, Nicolin Chen wrote:
> > On Tue, Jun 10, 2025 at 09:09:02AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 10, 2025 at 01:38:22PM +0200, Thomas Weißschuh wrote:
> > > > > ------------------------------------------------------------------
> > > > > # RUN iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty ...
> > > > > # enforce_dirty: Test terminated unexpectedly by signal 11
> > >
> > > Sig 11 is weird..
> >
> > > > On another note, the selftest should use the kselftest_harness' ASSERT_*()
> > > > macros instead of plain assert().
> > >
> > > IIRC the kselftest stuff explodes if you try to use it's assert
> > > functions within a fixture setup/teardown context.
> > >
> > > I also wasn't able to reproduce this (x86 ubuntu 24 LTS OS) Maybe
> > > it is ARM specific, I think Nicolin is running on ARM..
> >
> > Yes. And I was running with 64KB page size. I just quickly retried
> > with 4KB page size (matching x86), and all failed tests pass now.
>
> That's a weird thing to be sensitive too. Can you get a backtrace from
> the crash, what function/line is crashing?
I think I am getting what's going on. Here the harness code has a
parent process and a child process:
--------------------------------------------------------------
428- /* _metadata and potentially self are shared with all forks. */ \
429: child = fork(); \
430: if (child == 0) { \
431- fixture_name##_setup(_metadata, self, variant->data); \
432- /* Let setup failure terminate early. */ \
433- if (_metadata->exit_code) \
434- _exit(0); \
435- *_metadata->no_teardown = false; \
436- fixture_name##_##test_name(_metadata, self, variant->data); \
437- _metadata->teardown_fn(false, _metadata, self, variant->data); \
438- _exit(0); \
439: } else if (child < 0 || child != waitpid(child, &status, 0)) { \
440- ksft_print_msg("ERROR SPAWNING TEST GRANDCHILD\n"); \
441- _metadata->exit_code = KSFT_FAIL; \
442- } \
443- _metadata->teardown_fn(true, _metadata, self, variant->data); \
444- munmap(_metadata->no_teardown, sizeof(*_metadata->no_teardown)); \
445- _metadata->no_teardown = NULL; \
446- if (self && fixture_name##_teardown_parent) \
447- munmap(self, sizeof(*self)); \
448- if (WIFEXITED(status)) { \
449- if (WEXITSTATUS(status)) \
450- _metadata->exit_code = WEXITSTATUS(status); \
451- } else if (WIFSIGNALED(status)) { \
452- /* Forward signal to __wait_for_test(). */ \
453- kill(getpid(), WTERMSIG(status)); \
454- } \
....
456- static void wrapper_##fixture_name##_##test_name##_teardown( \
457- bool in_parent, struct __test_metadata *_metadata, \
458- void *self, const void *variant) \
459- { \
460- if (fixture_name##_teardown_parent == in_parent && \
461- !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
462- fixture_name##_teardown(_metadata, self, variant); \
463- } \
--------------------------------------------------------------
I found there is a race between those two processes, resulting in
the teardown() not getting invoked: I added some ksft_print_msg()
calls in-between the lines to debug, those tests can pass mostly,
as teardown() got invoked.
I think the reason why those huge page cases fail is just because
the huge version of setup() takes longer time..
I haven't figured out a proper fix yet, but something smells bad:
1) *no_teardown is set non-atomically, while both processes calls
__atomic_test_and_set()
2) parent doesn't seem to wait for the setup() to complete..
3) when parent runs faster than the child that is still running
setup(), the parent unmaps the no_teardown and set it to NULL,
then UAF in the child, i.e. signal 11?
I think Thomas should have an idea with these info :)
Thanks
Nicolin
Powered by blists - more mailing lists