lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9bfeae26-b4b1-eedb-6cbd-b4f9f1e1cc55@redhat.com>
Date:   Tue, 16 Aug 2022 20:02:14 +1000
From:   Gavin Shan <gshan@...hat.com>
To:     Sean Christopherson <seanjc@...gle.com>,
        kernel test robot <oliver.sang@...el.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Oliver Upton <oliver.upton@...ux.dev>,
        Andrew Jones <andrew.jones@...ux.dev>,
        LKML <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
        lkp@...ts.01.org, lkp@...el.com, xudong.hao@...el.com,
        regressions@...ts.linux.dev
Subject: Re: [KVM] e923b0537d: kernel-selftests.kvm.rseq_test.fail

Hi Sean,

On 8/16/22 3:02 PM, Gavin Shan wrote:
> On 8/16/22 7:42 AM, Sean Christopherson wrote:
>> On Mon, Aug 15, 2022, kernel test robot wrote:
>>> commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: selftests: Fix target thread to be migrated in rseq_test")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> ...
>>
>>> # selftests: kvm: rseq_test
>>> # ==== Test Assertion Failure ====
>>> #   rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2)
>>> #   pid=49599 tid=49599 errno=4 - Interrupted system call
>>> #      1    0x000000000040265d: main at rseq_test.c:278
>>> #      2    0x00007fe44eed07fc: ?? ??:0
>>> #      3    0x00000000004026d9: _start at ??:?
>>> #   Only performed 23174 KVM_RUNs, task stalled too much?
>>> #
>>> not ok 56 selftests: kvm: rseq_test # exit=254
>>
>> ...
>>
>>> # Automatically generated file; DO NOT EDIT.
>>> # Linux/x86_64 5.19.0-rc6 Kernel Configuration
>>> #
>>> CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0"
>>> CONFIG_CC_IS_GCC=y
>>> CONFIG_GCC_VERSION=110300
>>> CONFIG_CLANG_VERSION=0
>>> CONFIG_AS_IS_GNU=y
>>> CONFIG_AS_VERSION=23800
>>> CONFIG_LD_IS_BFD=y
>>> CONFIG_LD_VERSION=23800
>>> CONFIG_LLD_VERSION=0
>>
>> Assuming 23800 == 2.38, this is a known issue.
>>
>> https://lore.kernel.org/all/20220810104114.6838-1-gshan@redhat.com
>>
> 
> It's probably different story this time. The assert is triggered because
> of the following instructions. I would guess the reason is vcpu thread
> has been running on CPU where we has high CPU load. In this case, the
> vcpu thread can't be run in time. More specific, the vcpu thread can't
> be run in the 1 - 10us time window, which is specified by the migration
> worker (thread).
> 
>      TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
>                  "Only performed %d KVM_RUNs, task stalled too much?\n", i);
> 
> I think we need to improve the handshake mechanism between the vcpu thread
> and migration worker. In current implementation, the handshake is done through
> the atomic counter. The mechanism is simple enough, but vcpu thread can miss
> the aforementioned time window. Another issue is the test case much more time
> than expected to finish.
> 
> Sean, if you think it's reasonable, I can figure out something to do:
> 
> - Reuse the atomic counter for a full synchronization between these two
>    threads. Something like below:
> 
>    #define RSEQ_TEST_STATE_RUN_VCPU       0     // vcpu_run()
>    #define RSEQ_TEST_STATE_MIGRATE        1     // sched_setaffinity()
>    #define RSEQ_TEST_STATE_CHECK          2     // Check rseq.cpu_id and get_cpu()
> 
>    The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after RSEQ_TEST_STATE_RUN_VCPU
> 
> - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). With this,
>    less time is needed to finish the test case.
> 

I'm able to recreate the issue on my local arm64 system.

- From the source code, the iteration count is changed from 100000 to 1000
- Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other CPUs
   are cleared from @possible_mask
- Run some CPU bound task on CPU#0 and CPU#1
   # while true; do taskset -c 0 ./a; done
   # while true; do taskset -c 1 ./a; done
- Run 'rseq_test' and hit the issue
   # ./rseq_test
   calc_min_max_cpu: nproc=224
   calc_min_max_cpu: min_cpu=0, max_cpu=1
   main: Required tests: 1000   Succeeds: 1
   ==== Test Assertion Failure ====
     rseq_test.c:280: i > (NR_TASK_MIGRATIONS / 2)
     pid=9624 tid=9624 errno=4 - Interrupted system call
        1	0x0000000000401cf3: main at rseq_test.c:280
        2	0x0000ffffbc64679b: ?? ??:0
        3	0x0000ffffbc64686b: ?? ??:0
        4	0x0000000000401def: _start at ??:?
     Only performed 1 KVM_RUNs, task stalled too much?

Thanks,
Gavin


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ