linux-kernel - Re: [RFC-PATCH 1/2] mm: Add __GFP_NO

Message-ID: <20200814180141.GP4295@paulmck-ThinkPad-P72>
Date:   Fri, 14 Aug 2020 11:01:41 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Uladzislau Rezki <urezki@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>, RCU <rcu@...r.kernel.org>,
        linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Matthew Wilcox <willy@...radead.org>,
        "Theodore Y . Ts'o" <tytso@....edu>,
        Joel Fernandes <joel@...lfernandes.org>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        Oleksiy Avramchenko <oleksiy.avramchenko@...ymobile.com>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC-PATCH 1/2] mm: Add __GFP_NO_LOCKS flag

On Fri, Aug 14, 2020 at 04:06:04PM +0200, Michal Hocko wrote:
> On Fri 14-08-20 06:34:50, Paul E. McKenney wrote:
> > On Fri, Aug 14, 2020 at 02:48:32PM +0200, Michal Hocko wrote:
> > > On Fri 14-08-20 14:15:44, Uladzislau Rezki wrote:
> > > > > On Thu 13-08-20 19:09:29, Thomas Gleixner wrote:
> > > > > > Michal Hocko <mhocko@...e.com> writes:
> > > > > [...]
> > > > > > > Why should we limit the functionality of the allocator for something
> > > > > > > that is not a real problem?
> > > > > > 
> > > > > > We'd limit the allocator for exactly ONE new user which was aware of
> > > > > > this problem _before_ the code hit mainline. And that ONE user is
> > > > > > prepared to handle the fail.
> > > > > 
> > > > > If we are to limit the functionality to this one particular user then
> > > > > I would consider a dedicated gfp flag a huge overkill. It would be much
> > > > > > easier to have a preallocated pool of pages and use those and
> > > > > completely avoid the core allocator. That would certainly only shift the
> > > > > complexity to the caller but if it is expected there would be only that
> > > > > single user then it would be probably better than opening a can of worms
> > > > > like allocator usable from raw spin locks.
> > > > > 
> > > > > Vlastimil raised the same question earlier. I answered, but let me answer again:
> > > > 
> > > > > It is hard to achieve because the logic does not stick to a certain static test
> > > > > case, i.e. it depends on how heavily kfree_rcu(single/double) is used. The
> > > > > number of pages in use grows with that load until the drain/reclaimer
> > > > > thread frees them.
> > > 
> > > > How many pages are we talking about - ball park? 100s, 1000s?
> > 
> > Under normal operation, a couple of pages per CPU, which would make
> > preallocation entirely reasonable.  Except that if someone does something
> > that floods RCU callbacks (close(open) in a tight userspace loop, for but
> > one example), then 2000 per CPU might not be enough, which on a 64-CPU
> > system comes to about 500MB.  This is beyond excessive for preallocation
> > on the systems I am familiar with.
> > 
> > And the flooding case is where you most want the reclamation to be
> > efficient, and thus where you want the pages.
> 
> I am not sure the page allocator would help you with this scenario
> unless you are on very large machines. Pagesets scale with the available
> memory and percpu_pagelist_fraction sysctl (have a look at
> pageset_set_high_and_batch). It is roughly 1000th of the zone size for
> each zone. You can check that in /proc/vmstat (my 8G machine)

Small systems might have ~64G.  The medium-sized systems might have
~250G.  There are a few big ones that might have 1.5T.  None of the
/proc/vmstat files from those machines contain anything resembling
the list below, though.

> Node 0, zone      DMA
> Not interesting at all
> Node 0, zone    DMA32
>   pagesets
>     cpu: 0
>               count: 242
>               high:  378
>               batch: 63
>     cpu: 1
>               count: 355
>               high:  378
>               batch: 63
>     cpu: 2
>               count: 359
>               high:  378
>               batch: 63
>     cpu: 3
>               count: 366
>               high:  378
>               batch: 63
> Node 0, zone   Normal
>   pagesets
>     cpu: 0
>               count: 359
>               high:  378
>               batch: 63
>     cpu: 1
>               count: 241
>               high:  378
>               batch: 63
>     cpu: 2
>               count: 297
>               high:  378
>               batch: 63
>     cpu: 3
>               count: 227
>               high:  378
>               batch: 63
> 
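The pageset dump above can be totalled with a short script. This is only a sketch: the function names `parse_pagesets` and `total_cached_pages` are made up for illustration, the sample text is the DMA32 part of the dump quoted above, and the parser assumes the "Node N, zone NAME / cpu: / count: / high: / batch:" layout shown there. As far as can be told, that layout is what /proc/zoneinfo prints (rather than /proc/vmstat), so pointing the script at that file is also an assumption.

```python
import re

# Sample input: the DMA32 zone from the dump quoted in this thread.
SAMPLE = """\
Node 0, zone    DMA32
  pagesets
    cpu: 0
              count: 242
              high:  378
              batch: 63
    cpu: 1
              count: 355
              high:  378
              batch: 63
    cpu: 2
              count: 359
              high:  378
              batch: 63
    cpu: 3
              count: 366
              high:  378
              batch: 63
"""

def parse_pagesets(text):
    """Return {(node, zone): {cpu: {"count": n, "high": n, "batch": n}}}."""
    result = {}
    zone = None
    cpu = None
    for line in text.splitlines():
        m = re.match(r"\s*Node\s+(\d+),\s+zone\s+(\S+)", line)
        if m:
            # New zone header: forget the previous CPU so that zone-level
            # fields are not attributed to a pageset.
            zone = (int(m.group(1)), m.group(2))
            result.setdefault(zone, {})
            cpu = None
            continue
        m = re.match(r"\s*cpu:\s+(\d+)", line)
        if m and zone is not None:
            cpu = int(m.group(1))
            result[zone][cpu] = {}
            continue
        m = re.match(r"\s*(count|high|batch):\s+(\d+)", line)
        if m and zone is not None and cpu is not None:
            result[zone][cpu][m.group(1)] = int(m.group(2))
    return result

def total_cached_pages(pagesets):
    """Sum the per-CPU 'count' fields over all zones and CPUs."""
    return sum(fields.get("count", 0)
               for cpus in pagesets.values()
               for fields in cpus.values())

if __name__ == "__main__":
    parsed = parse_pagesets(SAMPLE)
    print(total_cached_pages(parsed), "pages cached in per-CPU lists")
```

On a live system the same functions could be fed the contents of /proc/zoneinfo instead of `SAMPLE`, assuming the file follows the layout above.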
> Besides that do you need to be per-cpu? Having 1000 pages available and
> managed under your raw spinlock should be good enough already no?

It needs to be almost entirely per-CPU for performance reasons.  Plus
a user could do a tight close(open()) loop on each CPU.
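For scale, the per-CPU preallocation cost mentioned earlier in this thread can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch only: the 4 KiB page size is an assumption (the thread does not state one), "a couple of pages" is taken to mean 2, and the helper names `prealloc_bytes` and `to_mib` are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; the thread does not state a page size

def prealloc_bytes(pages_per_cpu, nr_cpus, page_size=PAGE_SIZE):
    """Memory pinned if every CPU preallocates pages_per_cpu pages."""
    return pages_per_cpu * nr_cpus * page_size

def to_mib(nbytes):
    """Convert a byte count to MiB."""
    return nbytes / (1024 * 1024)

if __name__ == "__main__":
    # Flood case from the thread: 2000 pages per CPU on a 64-CPU system.
    print("flood case: ", to_mib(prealloc_bytes(2000, 64)), "MiB")
    # Normal operation: "a couple of pages per CPU", taken here as 2.
    print("normal case:", to_mib(prealloc_bytes(2, 64)), "MiB")
```

Under these assumptions the flood case comes to 500 MiB, which matches the "about 500MB" figure, while normal operation needs about half a MiB.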

> > This of course raises the question of how much memory the lockless caches
> > contain, but fortunately these RCU callback flooding scenarios also
> > involve process-context allocation of the memory that is being passed
> > to kfree_rcu().  That allocation should keep the lockless caches from
> > going empty in the common case, correct?
> 
> Yes, those are refilled both on the allocation/free paths. But you
> cannot really rely on that to happen early enough.

So the really ugly scenarios with the tight loops normally allocate
something and immediately either call_rcu() or kfree_rcu() it.
But you are right, someone doing "rm -rf" on a large file tree
with lots of small files might not be doing that many allocations.

> Do you happen to have any numbers that would show the typical usage
> and how often the slow path has to be taken because pcp lists are
> depleted? In other words, even if we provide a completely lockless
> way to allocate memory, how useful would that be?

Not yet, but let's see what we can do.

							Thanx, Paul