linux-kernel - Re: [patch 00/13] preempt: Make preempt count unconditional

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKMK7uHL2dMv80b8uBXr=BqHD2TQeODQQM1MGYhAfCYbX7sLrA@mail.gmail.com>
Date:   Wed, 16 Sep 2020 23:43:02 +0200
From:   Daniel Vetter <daniel@...ll.ch>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ard Biesheuvel <ardb@...nel.org>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-arch <linux-arch@...r.kernel.org>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Valentin Schneider <valentin.schneider@....com>,
        Richard Henderson <rth@...ddle.net>,
        Ivan Kokshaysky <ink@...assic.park.msu.ru>,
        Matt Turner <mattst88@...il.com>,
        alpha <linux-alpha@...r.kernel.org>,
        Jeff Dike <jdike@...toit.com>,
        Richard Weinberger <richard@....at>,
        Anton Ivanov <anton.ivanov@...bridgegreys.com>,
        linux-um <linux-um@...ts.infradead.org>,
        Brian Cain <bcain@...eaurora.org>,
        linux-hexagon@...r.kernel.org,
        Geert Uytterhoeven <geert@...ux-m68k.org>,
        linux-m68k <linux-m68k@...ts.linux-m68k.org>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Will Deacon <will@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>, Ingo Molnar <mingo@...hat.com>,
        Russell King <linux@...linux.org.uk>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        Chris Zankel <chris@...kel.net>,
        Max Filippov <jcmvbkbc@...il.com>,
        linux-xtensa@...ux-xtensa.org,
        Jani Nikula <jani.nikula@...ux.intel.com>,
        Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
        Rodrigo Vivi <rodrigo.vivi@...el.com>,
        David Airlie <airlied@...ux.ie>,
        intel-gfx <intel-gfx@...ts.freedesktop.org>,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        Josh Triplett <josh@...htriplett.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Shuah Khan <shuah@...nel.org>, rcu@...r.kernel.org,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote:
> > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> > >
> > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> > > > <torvalds@...ux-foundation.org> wrote:
> > > > >
> > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@...utronix.de> wrote:
> > > > > >
> > > > > > OTOH, having a working 'preemptible()' or maybe better named
> > > > > > 'can_schedule()' check makes tons of sense to make decisions about
> > > > > > allocation modes or other things.
> > > > >
> > > > > No. I think that those kinds of decisions about actual behavior are
> > > > > always simply fundamentally wrong.
> > > > >
> > > > > Note that this is very different from having warnings about invalid
> > > > > use. THAT is correct. It may not warn in all configurations, but that
> > > > > doesn't matter: what matters is that it warns in common enough
> > > > > configurations that developers will catch it.
> > > > >
> > > > > So having a warning in "might_sleep()" that doesn't always trigger,
> > > > > because you have a limited configuration that can't even detect the
> > > > > situation, that's fine and dandy and intentional.
> > > > >
> > > > > But having code like
> > > > >
> > > > >        if (can_schedule())
> > > > >            .. do something different ..
> > > > >
> > > > > is fundamentally complete and utter garbage.
> > > > >
> > > > > It's one thing if you test for "am I in hardware interrupt context".
> > > > > Those tests aren't great either, but at least they make sense.
> > > > >
> > > > > But a driver - or some library routine - making a difference based on
> > > > > some nebulous "can I schedule" is fundamentally and basically WRONG.
> > > > >
> > > > > If some code changes behavior, it needs to be explicit to the *caller*
> > > > > of that code.
> > > > >
> > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > > > > do_something_atomic()" is pure shite.
> > > > >
> > > > > And I am not IN THE LEAST interested in trying to help people doing
> > > > > pure shite. We need to fix them. Like the crypto code is getting
> > > > > fixed.
> > > >
> > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> > > > Code that tries to cleverly adjust its behaviour depending upon the
> > > > context it's running in is harder to understand and blows up in more
> > > > interesting ways. We still have drm_can_sleep() and it's mostly just
> > > > used for debug code, and I've largely ended up just deleting
> > > > everything that used it because when you're driver is blowing up the
> > > > last thing you want is to realize your debug code and output can't be
> > > > relied upon. Or worse, that the only Oops you have is the one in the
> > > > debug code, because the real one scrolled away - the original idea
> > > > behind drm_can_sleep was to make all the modeset code work
> > > > automagically both in normal ioctl/kworker context and in the panic
> > > > handlers or kgdb callbacks. Wishful thinking at best.
> > > >
> > > > Also at least for me that extends to everything, e.g. I much prefer
> > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> > > > locks shared with interrupt handlers, since the former two gives me
> > > > clear information from which contexts such function can be called.
> > > > Other end is the memalloc_no*_save/restore functions, where I recently
> > > > made a real big fool of myself because I didn't realize how much that
> > > > impacts everything that's run within - suddenly "GFP_KERNEL for small
> > > > stuff never fails" is wrong everywhere.
> > > >
> > > > It's all great for debugging and sanity checks (and we run with all
> > > > that stuff enabled in our CI), but really semantic changes depending
> > > > upon magic context checks freak my out :-)
> > >
> > > All fair, but some of us need to write code that must handle being
> > > invoked from a wide variety of contexts.  Now perhaps you like the idea of
> > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
> > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
> > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
> > > are held, and so on.  However, from what I can see, most people instead
> > > consistently prefer that the RCU API instead be consolidated.
> > >
> > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
> > > needs to be able to allocate memory occasionally.  It can do that when
> > > invoked from some contexts, but not when invoked from others.  Right now,
> > > in !PREEMPT kernels, it cannot tell, and must either do things to the
> > > memory allocators that some of the MM hate or must unnecessarily invoke
> > > workqueues.  Thomas's patches would allow the code to just allocate in
> > > the common case when these primitives are invoked from contexts where
> > > allocation is permitted.
> > >
> > > If we want to restrict access to the can_schedule() or whatever primitive,
> > > fine and good.  We can add a check to checkpatch.pl, for example.  Maybe
> > > we can go back to the old brlock approach of requiring certain people's
> > > review for each addition to the kernel.
> > >
> > > But there really are use cases that it would greatly help.
> >
> > We can deadlock in random fun places if random stuff we're calling
> > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
> > make it extra fun to reproduce. Maybe most driver subsystems are less
> > brittle, but gpu drivers definitely need to know about the details for
> > exactly this example. And yes gpu drivers use rcu for freeing
> > dma_fence structures, and that tends to happen in code that we only
> > recently figured out should really not allocate memory.
> >
> > I think minimally you need to throw in an unconditional
> > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
> > with full debugging knows what might happen. It's kinda like
> > might_sleep, but a lot more specific. might_sleep() alone is not
> > enough, because in the specific code paths I'm thinking of (and
> > created special lockdep annotations for just recently) sleeping is
> > allowed, but any memory allocations with GFP_RECLAIM set are no-go.
>
> Completely agreed!  Any allocation on any free path must be handled
> -extremely- carefully.  To that end...
>
> First, there is always a fallback in case the allocation fails.  Which
> might have performance or corner-case robustness issues, but which will
> at least allow forward progress.  Second, we consulted with a number of
> MM experts to arrive at appropriate GFP_* flags (and their patience is
> greatly appreciated).  Third, the paths that can allocate will do so about
> one time of 500, so any issues should be spotted sooner rather than later.
>
> So you are quite right to be concerned, but I believe we will be doing the
> right things.  And based on his previous track record, I am also quite
> certain that Mr. Murphy will be on hand to provide me any additional
> education that I might require.
>
> Finally, I have noted down your point about fs_reclaim_acquire() and
> fs_reclaim_release().  Whether or not they prove to be needed, I do
> appreciate your calling them to my attention.

I just realized that since these dma_fence structs are refcounted and
userspace can hold references (directly, it can pass them around
behind file descriptors) we might never hit such a path until slightly
unusual or evil userspace does something interesting. Do you have
links to those patches? Some googling didn't turn up anything. I can
then figure out whether it's better to risk not spotting issues with
call_rcu vs slapping a memalloc_noio_save/restore around all these
critical section which force-degrades any allocation to GFP_ATOMIC at
most, but has the risk that we run into code that assumes "GFP_KERNEL
never fails for small stuff" and has a decidedly less tested fallback
path than rcu code.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch