Message-ID: <aEm6tuzy7WK12sMh@nvidia.com>
Date: Wed, 11 Jun 2025 10:19:50 -0700
From: Nicolin Chen <nicolinc@...dia.com>
To: Thomas Weißschuh <thomas.weissschuh@...utronix.de>
CC: Jason Gunthorpe <jgg@...dia.com>, Shuah Khan <shuah@...nel.org>, "Shuah
Khan" <skhan@...uxfoundation.org>, Willy Tarreau <w@....eu>, Thomas
Weißschuh <linux@...ssschuh.net>, Kees Cook
<kees@...nel.org>, Andy Lutomirski <luto@...capital.net>, Will Drewry
<wad@...omium.org>, Mark Brown <broonie@...nel.org>, Muhammad Usama Anjum
<usama.anjum@...labora.com>, <linux-kernel@...r.kernel.org>,
<linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH v4 09/14] selftests: harness: Move teardown conditional
into test metadata
On Wed, Jun 11, 2025 at 10:04:35AM +0200, Thomas Weißschuh wrote:
> On Wed, Jun 11, 2025 at 12:05:25AM -0700, Nicolin Chen wrote:
> > On Tue, Jun 10, 2025 at 08:46:57PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 10, 2025 at 11:48:44AM -0700, Nicolin Chen wrote:
> > > > On Tue, Jun 10, 2025 at 09:09:02AM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Jun 10, 2025 at 01:38:22PM +0200, Thomas Weißschuh wrote:
> > > > > > > ------------------------------------------------------------------
> > > > > > > # RUN iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty ...
> > > > > > > # enforce_dirty: Test terminated unexpectedly by signal 11
> > > > >
> > > > > Sig 11 is weird..
> > > >
> > > > > > On another note, the selftest should use the kselftest_harness' ASSERT_*()
> > > > > > macros instead of plain assert().
> > > > >
> > > > > IIRC the kselftest stuff explodes if you try to use its assert
> > > > > functions within a fixture setup/teardown context.
> > > > >
> > > > > I also wasn't able to reproduce this (x86 Ubuntu 24 LTS OS). Maybe
> > > > > it is ARM-specific; I think Nicolin is running on ARM..
> > > >
> > > > Yes. And I was running with 64KB page size. I just quickly retried
> > > > with 4KB page size (matching x86), and all failed tests pass now.
> > >
> > > That's a weird thing to be sensitive to. Can you get a backtrace from
> > > the crash, what function/line is crashing?
> >
> > I think I see what's going on. Here the harness code has a
> > parent process and a child process:
> >
> > --------------------------------------------------------------
> > 428- /* _metadata and potentially self are shared with all forks. */ \
> > 429: child = fork(); \
> > 430: if (child == 0) { \
> > 431- fixture_name##_setup(_metadata, self, variant->data); \
> > 432- /* Let setup failure terminate early. */ \
> > 433- if (_metadata->exit_code) \
> > 434- _exit(0); \
> > 435- *_metadata->no_teardown = false; \
> > 436- fixture_name##_##test_name(_metadata, self, variant->data); \
> > 437- _metadata->teardown_fn(false, _metadata, self, variant->data); \
> > 438- _exit(0); \
> > 439: } else if (child < 0 || child != waitpid(child, &status, 0)) { \
> > 440- ksft_print_msg("ERROR SPAWNING TEST GRANDCHILD\n"); \
> > 441- _metadata->exit_code = KSFT_FAIL; \
> > 442- } \
> > 443- _metadata->teardown_fn(true, _metadata, self, variant->data); \
> > 444- munmap(_metadata->no_teardown, sizeof(*_metadata->no_teardown)); \
> > 445- _metadata->no_teardown = NULL; \
> > 446- if (self && fixture_name##_teardown_parent) \
> > 447- munmap(self, sizeof(*self)); \
> > 448- if (WIFEXITED(status)) { \
> > 449- if (WEXITSTATUS(status)) \
> > 450- _metadata->exit_code = WEXITSTATUS(status); \
> > 451- } else if (WIFSIGNALED(status)) { \
> > 452- /* Forward signal to __wait_for_test(). */ \
> > 453- kill(getpid(), WTERMSIG(status)); \
> > 454- } \
> > ....
> > 456- static void wrapper_##fixture_name##_##test_name##_teardown( \
> > 457- bool in_parent, struct __test_metadata *_metadata, \
> > 458- void *self, const void *variant) \
> > 459- { \
> > 460- if (fixture_name##_teardown_parent == in_parent && \
> > 461- !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> > 462- fixture_name##_teardown(_metadata, self, variant); \
> > 463- } \
> > --------------------------------------------------------------
> >
> > I found there is a race between those two processes, resulting in
> > teardown() not getting invoked: when I added some ksft_print_msg()
> > calls in between the lines to debug, those tests mostly pass,
> > since teardown() does get invoked.
> >
> > I think the reason those hugepage cases fail is just that the huge
> > version of setup() takes longer..
>
> Can you try to recreate this issue with changes to
> tools/testing/selftests/kselftest_harness/harness-selftest.c ?
Hmm, I assume all 9 cases should pass? Mine only passes 4 on rc1,
without any change (perhaps we should start from here?):
TAP version 13
1..9
# Starting 9 tests from 4 test cases.
# RUN global.standalone_pass ...
# harness-selftest.c:19:standalone_pass:before
# harness-selftest.c:23:standalone_pass:after
# OK global.standalone_pass
ok 1 global.standalone_pass
# RUN global.standalone_fail ...
# harness-selftest.c:27:standalone_fail:before
# harness-selftest.c:29:standalone_fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:30:standalone_fail:Expected 0 (0) == 1 (1)
# standalone_fail: Test terminated by assertion
# FAIL global.standalone_fail
not ok 2 global.standalone_fail
# RUN global.signal_pass ...
# harness-selftest.c:35:signal_pass:before
# harness-selftest.c:37:signal_pass:after
# OK global.signal_pass
ok 3 global.signal_pass
# RUN global.signal_fail ...
# harness-selftest.c:42:signal_fail:before
# harness-selftest.c:43:signal_fail:Expected 0 (0) == 1 (1)
# signal_fail: Test terminated by assertion
# FAIL global.signal_fail
not ok 4 global.signal_fail
# RUN fixture.pass ...
# harness-selftest.c:53:pass:setup
# harness-selftest.c:62:pass:before
# harness-selftest.c:19:pass:before
# harness-selftest.c:23:pass:after
# harness-selftest.c:66:pass:after
# harness-selftest.c:58:pass:teardown same-process=1
# OK fixture.pass
ok 5 fixture.pass
# RUN fixture.fail ...
# harness-selftest.c:53:fail:setup
# harness-selftest.c:70:fail:before
# harness-selftest.c:71:fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:58:fail:teardown same-process=1
# fail: Test terminated by assertion
# FAIL fixture.fail
not ok 6 fixture.fail
# RUN fixture.timeout ...
# harness-selftest.c:53:timeout:setup
# harness-selftest.c:77:timeout:before
# timeout: Test terminated by timeout
# FAIL fixture.timeout
not ok 7 fixture.timeout
# RUN fixture_parent.pass ...
# harness-selftest.c:87:pass:setup
# harness-selftest.c:96:pass:before
# harness-selftest.c:98:pass:after
# harness-selftest.c:92:pass:teardown same-process=0
# OK fixture_parent.pass
ok 8 fixture_parent.pass
# RUN fixture_setup_failure.pass ...
# harness-selftest.c:106:pass:setup
# harness-selftest.c:108:pass:Expected 0 (0) == 1 (1)
# pass: Test terminated by assertion
# FAIL fixture_setup_failure.pass
not ok 9 fixture_setup_failure.pass
# FAILED: 4 / 9 tests passed.
# Totals: pass:4 fail:5 xfail:0 xpass:0 skip:0 error:0
> > I haven't figured out a proper fix yet, but something smells bad:
> > 1) *no_teardown is set non-atomically, while both processes call
> > __atomic_test_and_set()
>
> Does this make a difference?
>
> diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
> index 2925e47db995..89fb37a21d9d 100644
> --- a/tools/testing/selftests/kselftest_harness.h
> +++ b/tools/testing/selftests/kselftest_harness.h
> @@ -410,7 +410,7 @@
> /* Makes sure there is only one teardown, even when child forks again. */ \
> _metadata->no_teardown = mmap(NULL, sizeof(*_metadata->no_teardown), \
> PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); \
> - *_metadata->no_teardown = true; \
> + __atomic_store_n(_metadata->no_teardown, true, __ATOMIC_SEQ_CST); \
> if (sizeof(*self) > 0) { \
> if (fixture_name##_teardown_parent) { \
> self = mmap(NULL, sizeof(*self), PROT_READ | PROT_WRITE, \
> @@ -429,7 +429,7 @@
> /* Let setup failure terminate early. */ \
> if (_metadata->exit_code) \
> _exit(0); \
> - *_metadata->no_teardown = false; \
> + __atomic_store_n(_metadata->no_teardown, false, __ATOMIC_SEQ_CST); \
> fixture_name##_##test_name(_metadata, self, variant->data); \
> _metadata->teardown_fn(false, _metadata, self, variant->data); \
> _exit(0); \
> @@ -455,7 +455,7 @@
> void *self, const void *variant) \
> { \
> if (fixture_name##_teardown_parent == in_parent && \
> - !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> + !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_SEQ_CST)) \
> fixture_name##_teardown(_metadata, self, variant); \
> } \
> static struct __test_metadata *_##fixture_name##_##test_name##_object; \
Unfortunately, no :(
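For what it's worth, here is a minimal standalone sketch of the pattern as
I read it out of the harness (simplified: I dropped the teardown_parent
side-selection, and the names are mine, not the harness'):
--------------------------------------------------------------
/* Sketch only, not the harness: a bool shared through an anonymous
 * MAP_SHARED mapping, cleared by the child once setup() completes,
 * and tested-and-set by whoever attempts the (single) teardown.
 */
#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	bool *no_teardown;
	pid_t child;
	int status;

	no_teardown = mmap(NULL, sizeof(*no_teardown),
			   PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	*no_teardown = true;

	child = fork();
	if (child == 0) {
		/* setup() would run here; on a setup failure the child
		 * would _exit(0) without clearing the flag, so nobody
		 * runs teardown.
		 */
		*no_teardown = false;
		/* test body would run here */
		if (!__atomic_test_and_set(no_teardown, __ATOMIC_RELAXED))
			printf("teardown in child\n");
		_exit(0);
	}
	waitpid(child, &status, 0);
	/* The parent only tears down if the child never got to it. */
	if (!__atomic_test_and_set(no_teardown, __ATOMIC_RELAXED))
		printf("teardown in parent\n");
	munmap(no_teardown, sizeof(*no_teardown));
	return 0;
}
--------------------------------------------------------------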
>
> > 2) the parent doesn't seem to wait for setup() to complete..
>
> setup() is called in the child (L431) right before the testcase itself is
> called (L436). The parent waits for the child to exit (L439) before unmapping.
>
> > 3) when the parent runs faster than the child that is still running
> > setup(), the parent unmaps no_teardown and sets it to NULL,
> > then UAF in the child, i.e. signal 11?
>
> That should never happen as the waitpid() will block until the child running
> setup() and the testcase itself have exited.
Ah, maybe I was wrong about those points. But the results show
that iommufd_dirty_tracking_teardown() was not called in the failed
cases:
// I added a huge.c file to run only 4 cases on one variant,
// and added two fprintf() calls to its FIXTURE_SETUP/TEARDOWN().
TAP version 13
1..4
# Starting 4 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking ...
---------iommufd_dirty_tracking_setup
---------iommufd_dirty_tracking_teardown
# OK iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
ok 1 iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability ...
---------iommufd_dirty_tracking_setup
# device_dirty_capability: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
not ok 2 iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
not ok 3 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
# RUN iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap_no_clear: Test terminated unexpectedly by signal 11
# FAIL iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
not ok 4 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
# FAILED: 1 / 4 tests passed.
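For reference, the "---------" markers above come from a plain fprintf() at
the top of FIXTURE_SETUP()/FIXTURE_TEARDOWN(). On a trivial fixture the same
instrumentation would look roughly like this (a sketch against the harness
macros, not the actual huge.c):
--------------------------------------------------------------
#include <stdio.h>
#include "../kselftest_harness.h"	/* include path is an assumption */

FIXTURE(demo)
{
	int dummy;
};

FIXTURE_SETUP(demo)
{
	fprintf(stderr, "---------demo_setup\n");
	self->dummy = 0;
}

FIXTURE_TEARDOWN(demo)
{
	fprintf(stderr, "---------demo_teardown\n");
}

TEST_F(demo, noop)
{
	ASSERT_EQ(0, self->dummy);
}

TEST_HARNESS_MAIN
--------------------------------------------------------------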
>
> Does the issue also happen when you only execute a single testcase?
> $ ./iommufd -r iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty
$ sudo tools/testing/selftests/iommu/iommufd -r iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
# RUN iommufd_dirty_tracking.domain_dirty128M.enforce_dirty ...
# OK iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
ok 1 iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
This one passes. It looks like the first hugepage case passes but
the following ones fail when run sequentially..
Thanks
Nicolin