linux-kernel - Re: [RFC-PATCH 1/2] mm: Add __GFP_NO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <874kp87mca.fsf@nanos.tec.linutronix.de>
Date:   Wed, 12 Aug 2020 02:13:25 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     paulmck@...nel.org
Cc:     Michal Hocko <mhocko@...e.com>,
        Uladzislau Rezki <urezki@...il.com>,
        LKML <linux-kernel@...r.kernel.org>, RCU <rcu@...r.kernel.org>,
        linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Matthew Wilcox <willy@...radead.org>,
        "Theodore Y . Ts'o" <tytso@....edu>,
        Joel Fernandes <joel@...lfernandes.org>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Oleksiy Avramchenko <oleksiy.avramchenko@...ymobile.com>
Subject: Re: [RFC-PATCH 1/2] mm: Add __GFP_NO_LOCKS flag

"Paul E. McKenney" <paulmck@...nel.org> writes:
> Hence Ulad's work on kfree_rcu().  The approach is to allocate a
> page-sized array to hold all the pointers, then fill in the rest of these
> pointers on each subsequent kfree_rcu() call.  These arrays of pointers
> also allows use of kfree_bulk() instead of kfree(), which can improve
> performance yet more.  It is no big deal if kfree_rcu()'s allocation
> attempts fail occasionally because it can simply fall back to the old
> linked-list approach.  And given that the various lockless caches in
> the memory allocator are almost never empty, in theory life is good.

Of course, it's always the damned reality which ruins the fun.

> But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING,
> and for good reason -- this Kconfig option makes it at least a
> little bit harder for mainline developers to mess up RT.  But with
> CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now
> sometimes complain if you invoke kfree_rcu() while holding a raw spinlock.
> This happens when kfree_rcu() needs to invoke the memory allocator and
> the memory allocator's caches are empty, thus resulting in the memory
> allocator attempting to acquire a non-raw spinlock.

Right.

> Because kfree_rcu() has a fallback available (just using the old linked
> list), kfree_rcu() would work well given a way to tell the memory
> allocator to return NULL instead of acquiring a non-raw spinlock.
> Which is exactly what Ulad's recent patches are intended to do.

That much I understood, but I somehow failed to figure the why out
despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my
brain :)

> Since then, this thread has discussed various other approaches,
> including using existing combinations of GFP_ flags, converting
> the allocator's zone lock to a raw spinlock, and so on.
>
> Does that help, or am I missing the point of your question?

Yes, that helps so far that I understand what the actual problem is. It
does not really help to make me more happy. :)

That said, we can support atomic allocations on RT up to the point where
zone->lock comes into play. We don't know yet exactly where the
zone->lock induced damage happens. Presumably it's inside
free_pcppages_bulk() - at least that's where I have faint bad memories
from 15+ years ago. Aside of that I seriously doubt that it can be made
work within a reasonable time frame.

But what makes me really unhappy is that my defense line against
allocations from truly atomic contexts (from RT POV) which was enforced
on RT gets a real big gap shot into it.

It becomes pretty hard to argue why atomic allocations via kmalloc() or
kmem_cache_alloc() should be treated any different. Technically they can
work similar to the page allocations up to the point where regular
spinlocks come into play or the slab cache is exhausted. Where to draw
the line?

It's also unclear for the page allocator case whether we can and should
stick a limit on the number of pages and/or the pageorder.

Myself and others spent a considerable amount of time to kill off these
kind of allocations from various interesting places including the guts
of send IPI, the affinity setting path and others where people just
slapped allocations into them because the stack checker warned or
because they happened to copy the code from some other place.

RT was pretty much a quick crap detector whenever new incarnations of
this got added and to some extent continuous education about these
issues made them less prominent over the years. Using atomic allocations
should always have a real good rationale, not only in the cases where
they collide with RT.

I can understand your rationale and what you are trying to solve. So, if
we can actually have a distinct GFP variant:

  GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY

which is easy to grep for then having the page allocator go down to the
point where zone lock gets involved is not the end of the world for
RT in theory - unless that damned reality tells otherwise. :)

The page allocator allocations should also have a limit on the number of
pages and eventually also page order (need to stare at the code or let
Michal educate me that the order does not matter).

To make it consistent the same GFP_ variant should allow the slab
allocator go to the point where the slab cache is exhausted.

Having a distinct and clearly defined GFP_ variant is really key to
chase down offenders and to make reviewers double check upfront why this
is absolutely required.

Thanks,

        tglx