linux-kernel - Re: Kernel Concurrency Sanitizer (KCSAN)

Date:   Fri, 20 Sep 2019 19:51:04 +0200
From:   Marco Elver <elver@...gle.com>
To:     Dmitry Vyukov <dvyukov@...gle.com>
Cc:     Mark Rutland <mark.rutland@....com>,
        kasan-dev <kasan-dev@...glegroups.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Andrey Konovalov <andreyknvl@...gle.com>,
        Alexander Potapenko <glider@...gle.com>,
        "Paul E. McKenney" <paulmck@...ux.ibm.com>,
        Paul Turner <pjt@...gle.com>, Daniel Axtens <dja@...ens.net>,
        Anatol Pomazau <anatol@...gle.com>,
        Will Deacon <willdeacon@...gle.com>,
        Andrea Parri <parri.andrea@...il.com>,
        Alan Stern <stern@...land.harvard.edu>,
        LKMM Maintainers -- Akira Yokosawa <akiyks@...il.com>,
        Nicholas Piggin <npiggin@...il.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Daniel Lustig <dlustig@...dia.com>,
        Jade Alglave <j.alglave@....ac.uk>,
        Luc Maranget <luc.maranget@...ia.fr>
Subject: Re: Kernel Concurrency Sanitizer (KCSAN)

On Fri, 20 Sep 2019 at 18:47, Dmitry Vyukov <dvyukov@...gle.com> wrote:
>
> On Fri, Sep 20, 2019 at 6:31 PM Mark Rutland <mark.rutland@....com> wrote:
> >
> > On Fri, Sep 20, 2019 at 04:18:57PM +0200, Marco Elver wrote:
> > > We would like to share a new data-race detector for the Linux kernel:
> > > Kernel Concurrency Sanitizer (KCSAN) --
> > > https://github.com/google/ktsan/wiki/KCSAN  (Details:
> > > https://github.com/google/ktsan/blob/kcsan/Documentation/dev-tools/kcsan.rst)
> >
> > Nice!
> >
> > BTW kcsan_atomic_next() is missing a stub definition in <linux/kcsan.h>
> > when !CONFIG_KCSAN:
> >
> > https://github.com/google/ktsan/commit/a22a093a0f0d0b582c82cdbac4f133a3f61d207c#diff-19d7c475b4b92aab8ba440415ab786ec
> >
> > ... and I think the kcsan_{begin,end}_atomic() stubs need to be static
> > inline too.

Thanks for catching, fixed and pushed. Feel free to rebase your arm64 branch.
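
For readers following along, the stubs under discussion can be sketched roughly as below. This is a minimal user-space illustration, not the actual patch: the signatures of kcsan_begin_atomic(), kcsan_end_atomic() and kcsan_atomic_next() are assumptions, and stubs_are_noops() is a hypothetical helper that exists only to exercise them. The point is that with CONFIG_KCSAN undefined, each declared function needs a static inline no-op stub so that callers still build.

```c
#include <stdbool.h>

/* Sketch of a header such as <linux/kcsan.h>; the signatures are assumed. */
#ifdef CONFIG_KCSAN
void kcsan_begin_atomic(bool nest);
void kcsan_end_atomic(bool nest);
void kcsan_atomic_next(int n);
#else
/*
 * !CONFIG_KCSAN: no-op stubs. They must be static inline, otherwise
 * every file including the header emits its own external definition
 * and the link fails with duplicate symbols.
 */
static inline void kcsan_begin_atomic(bool nest) { (void)nest; }
static inline void kcsan_end_atomic(bool nest) { (void)nest; }
static inline void kcsan_atomic_next(int n) { (void)n; }
#endif

/* Hypothetical caller: with the stubs in use, the KCSAN calls compile away. */
static inline int stubs_are_noops(int value)
{
	kcsan_begin_atomic(false);
	kcsan_atomic_next(1);
	value += 1;
	kcsan_end_atomic(false);
	return value;
}
```

Built without CONFIG_KCSAN, stubs_are_noops(41) simply returns 42; the three KCSAN calls contribute nothing.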

> > It looks like this is easy enough to enable on arm64, with the only real
> > special case being secondary_start_kernel() which we might want to
> > refactor to allow some portions to be instrumented.
> >
> > I pushed the trivial patches I needed to get arm64 booting to my arm64/kcsan
> > branch:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/kcsan

Cool, thanks for testing!

> > We have some interesting splats at boot time in stop_machine, which
> > don't seem to have been hit/fixed on x86 yet in the kcsan-with-fixes
> > branch, e.g.
> >
> > [    0.237939] ==================================================================
> > [    0.239431] BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
> > [    0.241189]
> > [    0.241606] write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
> > [    0.243435]  set_state+0x80/0xb0
> > [    0.244328]  multi_cpu_stop+0x16c/0x198
> > [    0.245406]  cpu_stopper_thread+0x170/0x298
> > [    0.246565]  smpboot_thread_fn+0x40c/0x560
> > [    0.247696]  kthread+0x1a8/0x1b0
> > [    0.248586]  ret_from_fork+0x10/0x18
> > [    0.249589]
> > [    0.250006] read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
> > [    0.251804]  multi_cpu_stop+0xa8/0x198
> > [    0.252851]  cpu_stopper_thread+0x170/0x298
> > [    0.254008]  smpboot_thread_fn+0x40c/0x560
> > [    0.255135]  kthread+0x1a8/0x1b0
> > [    0.256027]  ret_from_fork+0x10/0x18
> > [    0.257036]
> > [    0.257449] Reported by Kernel Concurrency Sanitizer on:
> > [    0.258918] CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
> > [    0.261241] Hardware name: linux,dummy-virt (DT)
> > [    0.262517] ==================================================================

Thanks, the fixes in -with-fixes were ones I only encountered with
Syzkaller, where I disable KCSAN during boot. I've just added a fix
for this race and pushed to kcsan-with-fixes.
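
In the report above, set_state() writes a 4-byte state word while multi_cpu_stop() reads it on another CPU. One plausible shape for a fix is to mark both accesses. The sketch below is a user-space illustration under that assumption, not the patch pushed to kcsan-with-fixes: the READ_ONCE()/WRITE_ONCE() macros are simplified volatile-cast stand-ins for the kernel's, and struct multi_stop_data_sketch, its field, and the *_sketch helpers are hypothetical names.

```c
/* Simplified stand-ins for the kernel macros (GNU C __typeof__). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

enum stop_state_sketch { STATE_NONE, STATE_PREPARE, STATE_RUN, STATE_EXIT };

struct multi_stop_data_sketch {
	enum stop_state_sketch state;	/* written by one CPU, polled by others */
};

/* Writer side: the marked store documents that readers run concurrently. */
static void set_state_sketch(struct multi_stop_data_sketch *msdata,
			     enum stop_state_sketch newstate)
{
	WRITE_ONCE(msdata->state, newstate);
}

/*
 * Reader side: exactly one marked load per poll, so the compiler cannot
 * cache or refetch the value. Returns 1 if the state changed, else 0.
 */
static int poll_state_sketch(struct multi_stop_data_sketch *msdata,
			     enum stop_state_sketch *curstate)
{
	enum stop_state_sketch newstate = READ_ONCE(msdata->state);

	if (newstate == *curstate)
		return 0;
	*curstate = newstate;
	return 1;
}
```

Marking the accesses does not by itself make the algorithm correct. It only constrains the compiler and tells the reader that the access is intentionally concurrent.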

> > > To those of you to whom we mentioned at LPC that we're working on a
> > > watchpoint-based KTSAN inspired by DataCollider [1]: this is it (we
> > > renamed it to KCSAN to avoid confusion with KTSAN).
> > > [1] http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf
> > >
> > > In the coming weeks we're planning to:
> > > * Set up a syzkaller instance.
> > > * Share the dashboard so that you can see the races that are found.
> > > * Attempt to send fixes for some races upstream (if you find that the
> > > kcsan-with-fixes branch contains an important fix, please feel free to
> > > point it out and we'll prioritize that).
> > >
> > > There are a few open questions:
> > > * The big one: most of the reported races are due to unmarked
> > > accesses; prioritization or pruning of races to focus initial efforts
> > > to fix races might be required. Comments on how best to proceed are
> > > welcome. We're aware that these are issues that have recently received
> > > attention in the context of the LKMM
> > > (https://lwn.net/Articles/793253/).
> >
> > I think the big risk here is drive-by "fixes" masking the warnings
> > rather than fixing the actual issue. It's easy for people to suppress a
> > warning with {READ,WRITE}_ONCE(), so they're liable to do that even if
> > the resulting race isn't benign.
> >
> > I don't have a clue how to prevent that, though.
>
> I think this is a mostly orthogonal problem. E.g. for some syzbot bugs I
> see fixes that also try to simply "shut up" the immediate
> manifestation by whatever means, e.g. sprinkling some spinlocks. So
> (1) it's not unique to atomics, (2) the presence of READ/WRITE_ONCE will
> make the reader aware of the fact that this runs concurrently with
> something else, and then they may ask themselves why this runs
> concurrently with something when the object is supposed to be private
> to the thread, and then maybe they re-fix it properly. Whereas if it's
> completely unmarked, nobody will even notice that this code accesses
> the object concurrently with other code. So even if READ/WRITE_ONCE
> was a wrong fix, it's still better to have it rather than not.