linux-kernel - Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160727182411.GE21859@dhcp22.suse.cz>
Date:	Wed, 27 Jul 2016 20:24:11 +0200
From:	Michal Hocko <mhocko@...nel.org>
To:	NeilBrown <neilb@...e.com>
Cc:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
	LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
	dm-devel@...hat.com, Mikulas Patocka <mpatocka@...hat.com>,
	Mel Gorman <mgorman@...e.de>,
	David Rientjes <rientjes@...gle.com>,
	Ondrej Kozina <okozina@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle
 PF_LESS_THROTTLE tasks

On Wed 27-07-16 13:43:35, NeilBrown wrote:
> On Mon, Jul 25 2016, Michal Hocko wrote:
> 
> > On Sat 23-07-16 10:12:24, NeilBrown wrote:
[...]
> >> > My thinking was that throttle_vm_writeout is there to prevent from
> >> > dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
> >> > is part of the writeout so throttling it on too many dirty pages is
> >> > questionable (well we get some bias but that is not really reliable). It
> >> > still makes sense to throttle when the backing device is congested
> >> > because the writeout path wouldn't make much progress anyway and we also
> >> > do not want to cycle through LRU lists too quickly in that case.
> >> 
> >> "dirtying ... from the reclaim context" ??? What does that mean?
> >
> > Say you would cause a swapout from the reclaim context. You would
> > effectively dirty that anon page until it gets written down to the
> > storage.
> 
> I should probably figure out how swap really works.  I have vague ideas
> which are probably missing important details...
> Isn't the first step that the page gets moved into the swap-cache - and
> marked dirty I guess.  Then it gets written out and the page is marked
> 'clean'.
> Then further memory pressure might push it out of the cache, or an early
> re-use would pull it back from the cache.
> If so, then "dirtying in reclaim context" could also be described as
> "moving into the swap cache" - yes?

Yes that is basically correct

> So should there be a limit on dirty
> pages in the swap cache just like there is for dirty pages in any
> filesystem (the max_dirty_ratio thing) ??
> Maybe there is?

There is no limit AFAIK. We are relying that the reclaim is throttled
when necessary.
 
> >> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
> >> avoid a live-lock.  A key premise is that nfsd only allocates unbounded
> >> memory when it is writing to the page cache.  So it only needs to be
> >> throttled when the backing device it is writing to is congested.  It is
> >> particularly important that it *doesn't* get throttled just because an
> >> NFS backing device is congested, because nfsd might be trying to clear
> >> that congestion.
> >
> > Thanks for the clarification. IIUC then removing throttle_vm_writeout
> > for the nfsd writeout should be harmless as well, right?
> 
> Certainly shouldn't hurt from the perspective of nfsd.
> 
> >> >> The purpose of that flag is to allow a thread to dirty a page-cache page
> >> >> as part of cleaning another page-cache page.
> >> >> So it makes sense for loop and sometimes for nfsd.  It would make sense
> >> >> for dm-crypt if it was putting the encrypted version in the page cache.
> >> >> But if dm-crypt is just allocating a transient page (which I think it
> >> >> is), then a mempool should be sufficient (and we should make sure it is
> >> >> sufficient) and access to an extra 10% (or whatever) of the page cache
> >> >> isn't justified.
> >> >
> >> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> >> > appropriate then would a PF_MEMPOOL be any better?
> >> 
> >> Why a PF rather than a GFP flag?
> >
> > Well, short answer is that gfp masks are almost depleted.
> 
> Really?  We have 26.
> 
> pagemap has a cute hack to store both GFP flags and other flag bits in
> the one 32 it number per address_space.  'struct address_space' could
> afford an extra 32 number I think.
> 
> radix_tree_root adds 3 'tag' flags to the gfp_mask.
> There is 16bits of free space in radix_tree_node (between 'offset' and
> 'count').  That space on the root node could store a record of which tags
> are set anywhere.  Or would that extra memory de-ref be a killer?

Yes these are reasons why adding new gfp flags is more complicated.

> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> need all those 26).

Well, maybe we are able to remove those hacks, I wouldn't definitely
be opposed.  But right now I am not even convinced that the mempool
specific gfp flags is the right way to go.

-- 
Michal Hocko
SUSE Labs