Message-ID: <aEm6tuzy7WK12sMh@nvidia.com>
Date: Wed, 11 Jun 2025 10:19:50 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: Thomas Weißschuh <thomas.weissschuh@...utronix.de>
CC: Jason Gunthorpe <jgg@...dia.com>, Shuah Khan <shuah@...nel.org>, "Shuah
Khan" <skhan@...uxfoundation.org>, Willy Tarreau <w@....eu>, Thomas
Weißschuh <linux@...ssschuh.net>, Kees Cook
<kees@...nel.org>, Andy Lutomirski <luto@...capital.net>, Will Drewry
<wad@...omium.org>, Mark Brown <broonie@...nel.org>, Muhammad Usama Anjum
<usama.anjum@...labora.com>, <linux-kernel@...r.kernel.org>,
<linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH v4 09/14] selftests: harness: Move teardown conditional
into test metadata
On Wed, Jun 11, 2025 at 10:04:35AM +0200, Thomas Weißschuh wrote:
> On Wed, Jun 11, 2025 at 12:05:25AM -0700, Nicolin Chen wrote:
> > On Tue, Jun 10, 2025 at 08:46:57PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 10, 2025 at 11:48:44AM -0700, Nicolin Chen wrote:
> > > > On Tue, Jun 10, 2025 at 09:09:02AM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Jun 10, 2025 at 01:38:22PM +0200, Thomas Weißschuh wrote:
> > > > > > > ------------------------------------------------------------------
> > > > > > > # RUN iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty ...
> > > > > > > # enforce_dirty: Test terminated unexpectedly by signal 11
> > > > >
> > > > > Sig 11 is weird..
> > > >
> > > > > > On another note, the selftest should use the kselftest_harness' ASSERT_*()
> > > > > > macros instead of plain assert().
> > > > >
> > > > > IIRC the kselftest stuff explodes if you try to use its assert
> > > > > functions within a fixture setup/teardown context.
> > > > >
> > > > > I also wasn't able to reproduce this (x86 Ubuntu 24 LTS OS). Maybe
> > > > > it is ARM-specific; I think Nicolin is running on ARM..
> > > >
> > > > Yes. And I was running with 64KB page size. I just quickly retried
> > > > with 4KB page size (matching x86), and all failed tests pass now.
> > >
> > > That's a weird thing to be sensitive to. Can you get a backtrace from
> > > the crash, what function/line is crashing?
> >
> > I think I see what's going on. Here the harness code has a
> > parent process and a child process:
> >
> > --------------------------------------------------------------
> > 428- /* _metadata and potentially self are shared with all forks. */ \
> > 429: child = fork(); \
> > 430: if (child == 0) { \
> > 431- fixture_name##_setup(_metadata, self, variant->data); \
> > 432- /* Let setup failure terminate early. */ \
> > 433- if (_metadata->exit_code) \
> > 434- _exit(0); \
> > 435- *_metadata->no_teardown = false; \
> > 436- fixture_name##_##test_name(_metadata, self, variant->data); \
> > 437- _metadata->teardown_fn(false, _metadata, self, variant->data); \
> > 438- _exit(0); \
> > 439: } else if (child < 0 || child != waitpid(child, &status, 0)) { \
> > 440- ksft_print_msg("ERROR SPAWNING TEST GRANDCHILD\n"); \
> > 441- _metadata->exit_code = KSFT_FAIL; \
> > 442- } \
> > 443- _metadata->teardown_fn(true, _metadata, self, variant->data); \
> > 444- munmap(_metadata->no_teardown, sizeof(*_metadata->no_teardown)); \
> > 445- _metadata->no_teardown = NULL; \
> > 446- if (self && fixture_name##_teardown_parent) \
> > 447- munmap(self, sizeof(*self)); \
> > 448- if (WIFEXITED(status)) { \
> > 449- if (WEXITSTATUS(status)) \
> > 450- _metadata->exit_code = WEXITSTATUS(status); \
> > 451- } else if (WIFSIGNALED(status)) { \
> > 452- /* Forward signal to __wait_for_test(). */ \
> > 453- kill(getpid(), WTERMSIG(status)); \
> > 454- } \
> > ....
> > 456- static void wrapper_##fixture_name##_##test_name##_teardown( \
> > 457- bool in_parent, struct __test_metadata *_metadata, \
> > 458- void *self, const void *variant) \
> > 459- { \
> > 460- if (fixture_name##_teardown_parent == in_parent && \
> > 461- !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> > 462- fixture_name##_teardown(_metadata, self, variant); \
> > 463- } \
> > --------------------------------------------------------------
> >
> > I found there is a race between those two processes, resulting in
> > teardown() not getting invoked: when I added some ksft_print_msg()
> > calls in between the lines to debug, those tests mostly pass,
> > since teardown() does get invoked.
> >
> > I think the reason those hugepage cases fail is just that the huge
> > version of setup() takes longer..
>
> Can you try to recreate this issue with changes to
> tools/testing/selftests/kselftest_harness/harness-selftest.c ?
Hmm, I assume all 9 cases should pass? Mine only passes 4 on rc1,
without any change (perhaps we should start from here?):
TAP version 13
1..9
# Starting 9 tests from 4 test cases.
# RUN global.standalone_pass ...
# harness-selftest.c:19:standalone_pass:before
# harness-selftest.c:23:standalone_pass:after
# OK global.standalone_pass
ok 1 global.standalone_pass
# RUN global.standalone_fail ...
# harness-selftest.c:27:standalone_fail:before
# harness-selftest.c:29:standalone_fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:30:standalone_fail:Expected 0 (0) == 1 (1)
# standalone_fail: Test terminated by assertion
# FAIL global.standalone_fail
not ok 2 global.standalone_fail
# RUN global.signal_pass ...
# harness-selftest.c:35:signal_pass:before
# harness-selftest.c:37:signal_pass:after
# OK global.signal_pass
ok 3 global.signal_pass
# RUN global.signal_fail ...
# harness-selftest.c:42:signal_fail:before
# harness-selftest.c:43:signal_fail:Expected 0 (0) == 1 (1)
# signal_fail: Test terminated by assertion
# FAIL global.signal_fail
not ok 4 global.signal_fail
# RUN fixture.pass ...
# harness-selftest.c:53:pass:setup
# harness-selftest.c:62:pass:before
# harness-selftest.c:19:pass:before
# harness-selftest.c:23:pass:after
# harness-selftest.c:66:pass:after
# harness-selftest.c:58:pass:teardown same-process=1
# OK fixture.pass
ok 5 fixture.pass
# RUN fixture.fail ...
# harness-selftest.c:53:fail:setup
# harness-selftest.c:70:fail:before
# harness-selftest.c:71:fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:58:fail:teardown same-process=1
# fail: Test terminated by assertion
# FAIL fixture.fail
not ok 6 fixture.fail
# RUN fixture.timeout ...
# harness-selftest.c:53:timeout:setup
# harness-selftest.c:77:timeout:before
# timeout: Test terminated by timeout
# FAIL fixture.timeout
not ok 7 fixture.timeout
# RUN fixture_parent.pass ...
# harness-selftest.c:87:pass:setup
# harness-selftest.c:96:pass:before
# harness-selftest.c:98:pass:after
# harness-selftest.c:92:pass:teardown same-process=0
# OK fixture_parent.pass
ok 8 fixture_parent.pass
# RUN fixture_setup_failure.pass ...
# harness-selftest.c:106:pass:setup
# harness-selftest.c:108:pass:Expected 0 (0) == 1 (1)
# pass: Test terminated by assertion
# FAIL fixture_setup_failure.pass
not ok 9 fixture_setup_failure.pass
# FAILED: 4 / 9 tests passed.
# Totals: pass:4 fail:5 xfail:0 xpass:0 skip:0 error:0
> > I haven't figured out a proper fix yet, but something smells bad:
> > 1) *no_teardown is set non-atomically, while both processes call
> > __atomic_test_and_set()
>
> Does this make a difference?
>
> diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
> index 2925e47db995..89fb37a21d9d 100644
> --- a/tools/testing/selftests/kselftest_harness.h
> +++ b/tools/testing/selftests/kselftest_harness.h
> @@ -410,7 +410,7 @@
> /* Makes sure there is only one teardown, even when child forks again. */ \
> _metadata->no_teardown = mmap(NULL, sizeof(*_metadata->no_teardown), \
> PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); \
> - *_metadata->no_teardown = true; \
> + __atomic_store_n(_metadata->no_teardown, true, __ATOMIC_SEQ_CST); \
> if (sizeof(*self) > 0) { \
> if (fixture_name##_teardown_parent) { \
> self = mmap(NULL, sizeof(*self), PROT_READ | PROT_WRITE, \
> @@ -429,7 +429,7 @@
> /* Let setup failure terminate early. */ \
> if (_metadata->exit_code) \
> _exit(0); \
> - *_metadata->no_teardown = false; \
> + __atomic_store_n(_metadata->no_teardown, false, __ATOMIC_SEQ_CST); \
> fixture_name##_##test_name(_metadata, self, variant->data); \
> _metadata->teardown_fn(false, _metadata, self, variant->data); \
> _exit(0); \
> @@ -455,7 +455,7 @@
> void *self, const void *variant) \
> { \
> if (fixture_name##_teardown_parent == in_parent && \
> - !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> + !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_SEQ_CST)) \
> fixture_name##_teardown(_metadata, self, variant); \
> } \
> static struct __test_metadata *_##fixture_name##_##test_name##_object; \
Unfortunately, no :(
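For what it's worth, here is a minimal standalone sketch of the pattern as
I read it out of the harness (simplified: I dropped the teardown_parent
side-selection, and the names are mine, not the harness'):
--------------------------------------------------------------
/* Sketch only, not the harness: a bool shared through an anonymous
 * MAP_SHARED mapping, cleared by the child once setup() completes,
 * and tested-and-set by whoever attempts the (single) teardown.
 */
#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	bool *no_teardown;
	pid_t child;
	int status;

	no_teardown = mmap(NULL, sizeof(*no_teardown),
			   PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	*no_teardown = true;

	child = fork();
	if (child == 0) {
		/* setup() would run here; on a setup failure the child
		 * would _exit(0) without clearing the flag, so nobody
		 * runs teardown.
		 */
		*no_teardown = false;
		/* test body would run here */
		if (!__atomic_test_and_set(no_teardown, __ATOMIC_RELAXED))
			printf("teardown in child\n");
		_exit(0);
	}
	waitpid(child, &status, 0);
	/* The parent only tears down if the child never got to it. */
	if (!__atomic_test_and_set(no_teardown, __ATOMIC_RELAXED))
		printf("teardown in parent\n");
	munmap(no_teardown, sizeof(*no_teardown));
	return 0;
}
--------------------------------------------------------------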
>
> > 2) the parent doesn't seem to wait for setup() to complete..
>
> setup() is called in the child (L431) right before the testcase itself is
> called (L436). The parent waits for the child to exit (L439) before unmapping.
>
> > 3) when the parent runs faster than the child that is still running
> > setup(), the parent unmaps no_teardown and sets it to NULL,
> > then UAF in the child, i.e. signal 11?
>
> That should never happen as the waitpid() will block until the child running
> setup() and the testcase itself have exited.
Ah, maybe I was wrong about those points. But the results show
that iommufd_dirty_tracking_teardown() was not called in the failed
cases:
// I added a huge.c file to run only 4 cases on one variant,
// and added two fprintf() calls to its FIXTURE_SETUP/TEARDOWN().
TAP version 13
1..4
# Starting 4 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking ...
---------iommufd_dirty_tracking_setup
---------iommufd_dirty_tracking_teardown
# OK iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
ok 1 iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability ...
---------iommufd_dirty_tracking_setup
# device_dirty_capability: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
not ok 2 iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
not ok 3 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap_no_clear: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
not ok 4 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
# FAILED: 1 / 4 tests passed.
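For reference, the "---------" markers above come from a plain fprintf() at
the top of FIXTURE_SETUP()/FIXTURE_TEARDOWN(). On a trivial fixture the same
instrumentation would look roughly like this (a sketch against the harness
macros, not the actual huge.c):
--------------------------------------------------------------
#include <stdio.h>
#include "../kselftest_harness.h"	/* include path is an assumption */

FIXTURE(demo)
{
	int dummy;
};

FIXTURE_SETUP(demo)
{
	fprintf(stderr, "---------demo_setup\n");
	self->dummy = 0;
}

FIXTURE_TEARDOWN(demo)
{
	fprintf(stderr, "---------demo_teardown\n");
}

TEST_F(demo, noop)
{
	ASSERT_EQ(0, self->dummy);
}

TEST_HARNESS_MAIN
--------------------------------------------------------------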
>
> Does the issue also happen when you only execute a single testcase?
> $ ./iommufd -r iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty
$ sudo tools/testing/selftests/iommu/iommufd -r iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty128M.enforce_dirty ...
# OK iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
ok 1 iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
This one passes. It looks like the first hugepage case passes but
the following ones fail when run sequentially..
Thanks
Nicolin