Date:   Sat, 18 Sep 2021 11:37:39 +0200
From:   Marco Elver <elver@...gle.com>
To:     Liu Shixin <liushixin2@...wei.com>
Cc:     Kefeng Wang <wangkefeng.wang@...wei.com>,
        akpm@...ux-foundation.org, glider@...gle.com, dvyukov@...gle.com,
        jannh@...gle.com, mark.rutland@....com,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        kasan-dev@...glegroups.com, hdanton@...a.com
Subject: Re: [PATCH v2 2/3] kfence: maximize allocation wait timeout duration

On Sat, 18 Sept 2021 at 10:07, Liu Shixin <liushixin2@...wei.com> wrote:
>
> On 2021/9/16 16:49, Marco Elver wrote:
> > On Thu, 16 Sept 2021 at 03:20, Kefeng Wang <wangkefeng.wang@...wei.com> wrote:
> >> Hi Marco,
> >>
> >> We found that kfence_test fails on ARM64 with this patch, with or without
> >> CONFIG_DETECT_HUNG_TASK.
> >>
> >> Any thoughts?
> > Please share log and instructions to reproduce if possible. Also, if
> > possible, please share bisection log that led you to this patch.
> >
> > I currently do not see how this patch would cause that; it only
> > increases the timeout duration.
> >
> > I know that under QEMU TCG mode, there are occasionally timeouts in
> > the test simply due to QEMU being extremely slow or other weirdness.
> >
> >
> Hi Marco,
>
> Here are some results from the current tests:
> 1. Using qemu-kvm on an arm64 machine, all test cases pass.
> 2. Using qemu-system-aarch64 on an x86_64 machine, some test cases fail randomly.
> 3. Using qemu-system-aarch64 on x86_64, but with the check of
>    kfence_allocation_key removed from kfence_alloc(), all test cases pass.
>
> I added some printing to the kernel and got very strange results.
> I added a new variable, kfence_allocation_key_gate, to track the
> state of kfence_allocation_key. As shown in the following code, in theory,
> if kfence_allocation_key_gate is zero, then kfence_allocation_key must be
> enabled, so the value of the variable error in kfence_alloc() should always be
> zero. All the passing test cases are consistent with this. But as shown in the
> following failure log, even though kfence_allocation_key has been enabled, the
> check still fails here.
>
> So I think static_key might be problematic in my QEMU environment.
> The timeout change is not itself the problem; it only exposed this problem.
> I tried changing the wait_event to a loop: I set the timeout to HZ and
> re-enabled/disabled the static key in each iteration, and then the failing
> test cases disappeared.

Nice analysis, thanks! What I gather is that static_keys/jump_labels
are somehow broken in QEMU.
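
To make the workaround described above a bit more concrete, here is a rough
userspace model in C. It is not kernel code and not the actual patch: every
name in it (model_key_enable, single_wait, retry_loop, and so on) is made up
for illustration, and it simply assumes that an emulator can drop a fixed
number of "enable" updates so that the running code never sees them. Under
that assumption, one enable followed by one long wait never samples an
allocation, while re-enabling the key on every short timeout recovers as soon
as one update takes effect.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */

static bool key_effective; /* what the executing code actually observes */
static int lost_patches;   /* assumed number of enable updates that get dropped */
static int gate;           /* loosely models kfence_allocation_gate */

/* Request the key to be enabled; the first lost_patches requests are lost. */
void model_key_enable(void)
{
	if (lost_patches > 0)
		lost_patches--;       /* update never becomes visible */
	else
		key_effective = true;
}

void model_key_disable(void)
{
	key_effective = false;
}

/* One allocation attempt: it is only sampled if the key is seen as enabled. */
void model_alloc(void)
{
	if (key_effective)
		gate = 1;
}

/* Wait for up to 'ticks' allocation attempts; true if the gate opened. */
bool model_wait(int ticks)
{
	for (int i = 0; i < ticks && !gate; i++)
		model_alloc();
	return gate != 0;
}

/* Original scheme in this model: enable once, then one long wait. */
bool single_wait(int dropped, int ticks)
{
	lost_patches = dropped;
	gate = 0;
	model_key_enable();
	bool ok = model_wait(ticks);
	model_key_disable();
	return ok;
}

/*
 * Workaround in this model: re-enable/disable on every short timeout.
 * Returns the round in which an allocation was sampled, or -1.
 */
int retry_loop(int dropped, int ticks_per_round, int max_rounds)
{
	lost_patches = dropped;
	gate = 0;
	for (int round = 1; round <= max_rounds; round++) {
		model_key_enable();
		bool ok = model_wait(ticks_per_round);
		model_key_disable();
		if (ok)
			return round;
	}
	return -1;
}
```

In this model, single_wait(1, 1000) returns false (the one dropped update is
never retried), whereas retry_loop(1, 10, 5) returns 2. That would be
consistent with the retry loop hiding a lost jump-label update rather than
fixing it, but it is only a model under the stated assumption.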

This does remind me that I found a bug in QEMU that might be relevant:
https://bugs.launchpad.net/qemu/+bug/1920934
Looks like it was never fixed. :-/

The failures I encountered caused the kernel to crash, but I never saw
the kfence test fail because of that (I never managed to get that far).
That said, the bug I saw was in x86 TCG mode, and I never tried arm64. If
you can, try to build a QEMU with ASan and see if you also get the
same use-after-free bug.

Unless we observe the problem on a real machine, I think for now we
can conclude with fairly high confidence that QEMU TCG still has
issues and cannot be fully trusted here (see bug above).

Thanks,
-- Marco
