Message-ID: <87lh0n4ufs.fsf@notabene.neil.brown.name>
Date: Wed, 27 Jul 2016 13:43:35 +1000
From: NeilBrown <neilb@...e.com>
To: Michal Hocko <mhocko@...nel.org>
Cc: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
dm-devel@...hat.com, Mikulas Patocka <mpatocka@...hat.com>,
Mel Gorman <mgorman@...e.de>,
David Rientjes <rientjes@...gle.com>,
Ondrej Kozina <okozina@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
On Mon, Jul 25 2016, Michal Hocko wrote:
> On Sat 23-07-16 10:12:24, NeilBrown wrote:
>> Maybe that is impractical, but having firm rules like that would go a
>> long way to make it possible to actually understand and reason about how
>> MM works. As it is, there seems to be a tendency to put bandaids over
>> bandaids.
>
> Ohh, I would definitely wish for this to be clearer, but as it turned
> out over time there are quite a few interdependencies between the MM/FS/IO
> layers which make the picture really blurry. If there is a brave soul
> willing to make that clearer without breaking any of it, it would be
> really cool ;)
Just need that comprehensive regression-test-suite and off we go....
>> > My thinking was that throttle_vm_writeout is there to prevent
>> > dirtying too many pages from the reclaim context. PF_LESS_THROTTLE
>> > is part of the writeout so throttling it on too many dirty pages is
>> > questionable (well we get some bias but that is not really reliable). It
>> > still makes sense to throttle when the backing device is congested
>> > because the writeout path wouldn't make much progress anyway and we also
>> > do not want to cycle through LRU lists too quickly in that case.
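[Aside, to save people opening the file: as I read mm/page-writeback.c,
the core of throttle_vm_writeout() is roughly the loop below - simplified
and from memory, so the details may well be off:

void throttle_vm_writeout(gfp_t gfp_mask)
{
        unsigned long background_thresh;
        unsigned long dirty_thresh;

        for ( ; ; ) {
                global_dirty_limits(&background_thresh, &dirty_thresh);
                /* give page allocators a little extra headroom */
                dirty_thresh += dirty_thresh / 10;

                if (global_page_state(NR_UNSTABLE_NFS) +
                    global_page_state(NR_WRITEBACK) <= dirty_thresh)
                        break;
                congestion_wait(BLK_RW_ASYNC, HZ/10);

                /* don't sleep if the caller cannot issue IO anyway */
                if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
                        break;
        }
}

i.e. it throttles purely on the global dirty+writeback counts, not on the
state of any particular backing device.]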
>>
>> "dirtying ... from the reclaim context" ??? What does that mean?
>
> Say you would cause a swapout from the reclaim context. You would
> effectively dirty that anon page until it gets written out to
> storage.
I should probably figure out how swap really works. I have vague ideas
which are probably missing important details...
Isn't the first step that the page gets moved into the swap-cache - and
marked dirty, I guess?  Then it gets written out and the page is marked
'clean'.
Then further memory pressure might push it out of the cache, or an early
re-use would pull it back from the cache.
If so, then "dirtying in reclaim context" could also be described as
"moving into the swap cache" - yes? So should there be a limit on dirty
pages in the swap cache just like there is for dirty pages in any
filesystem (the max_dirty_ratio thing) ??
Maybe there is?
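Concretely, my (quite possibly wrong) mental model of the anon-page side
of shrink_page_list() is something like this sketch:

        if (PageAnon(page) && !PageSwapCache(page))
                /* get a swap slot, add the page to the swap cache,
                 * and SetPageDirty() it */
                add_to_swap(page, page_list);

        /* replace the ptes with swap entries */
        try_to_unmap(page, ttu_flags);

        if (PageDirty(page))
                /* swap_writepage(); the page is clean once the
                 * write completes */
                pageout(page, mapping, sc);

        /* A clean swap-cache page can be dropped and freed on a later
         * pass, or pulled straight back out of the swap cache if
         * someone faults it in first. */

If that is about right, then these dirty pages are ordinary dirty pages
belonging to a rather special address_space, which is what makes the
dirty_ratio comparison tempting.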
>> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
>> avoid a live-lock. A key premise is that nfsd only allocates unbounded
>> memory when it is writing to the page cache. So it only needs to be
>> throttled when the backing device it is writing to is congested. It is
>> particularly important that it *doesn't* get throttled just because an
>> NFS backing device is congested, because nfsd might be trying to clear
>> that congestion.
>
> Thanks for the clarification. IIUC then removing throttle_vm_writeout
> for the nfsd writeout should be harmless as well, right?
Certainly shouldn't hurt from the perspective of nfsd.
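For anyone who doesn't have the code in front of them,
current_may_throttle() is (approximately) just:

        /* A PF_LESS_THROTTLE task (nfsd, loop) doing writeout is only
         * throttled here if the device it is writing to is itself
         * congested; everyone else is throttled unconditionally. */
        static int current_may_throttle(void)
        {
                return !(current->flags & PF_LESS_THROTTLE) ||
                        current->backing_dev_info == NULL ||
                        bdi_write_congested(current->backing_dev_info);
        }

and, if I remember correctly, nfsd only sets PF_LESS_THROTTLE for requests
that arrived over the loopback interface (the RQ_LOCAL test in
nfsd_vfs_write()) - which is the "sometimes" below.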
>> >> The purpose of that flag is to allow a thread to dirty a page-cache page
>> >> as part of cleaning another page-cache page.
>> >> So it makes sense for loop and sometimes for nfsd. It would make sense
>> >> for dm-crypt if it was putting the encrypted version in the page cache.
>> >> But if dm-crypt is just allocating a transient page (which I think it
>> >> is), then a mempool should be sufficient (and we should make sure it is
>> >> sufficient) and access to an extra 10% (or whatever) of the page cache
>> >> isn't justified.
>> >
>> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
>> > appropriate then would a PF_MEMPOOL be any better?
>>
>> Why a PF rather than a GFP flag?
>
> Well, short answer is that gfp masks are almost depleted.
Really? We have 26.
pagemap has a cute hack to store both GFP flags and other flag bits in
the one 32-bit number per address_space.  'struct address_space' could
afford an extra 32-bit number, I think.
radix_tree_root adds 3 'tag' flags to the gfp_mask.
There are 16 bits of free space in radix_tree_node (between 'offset' and
'count').  That space on the root node could store a record of which tags
are set anywhere.  Or would that extra memory de-ref be a killer?
I think we'd end up with cleaner code if we removed the cute hacks.  And
we'd be able to use 6 more GFP flags!! (though I do wonder if we really
need all those 26).
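To be concrete about the cute hacks: pagemap.h squeezes the AS_* mapping
flags in above the gfp bits of the same unsigned long, roughly like this
(from memory, so treat it as a sketch):

        enum mapping_flags {
                AS_EIO          = __GFP_BITS_SHIFT + 0, /* IO error on async write */
                AS_ENOSPC       = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
                AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
                AS_UNEVICTABLE  = __GFP_BITS_SHIFT + 3, /* e.g. ramdisk, SHM_LOCK */
                AS_EXITING      = __GFP_BITS_SHIFT + 4, /* final truncate in progress */
        };

        static inline gfp_t mapping_gfp_mask(struct address_space *mapping)
        {
                return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
        }

and the radix-tree root plays the same game with its per-root tags:

        static inline void root_tag_set(struct radix_tree_root *root, unsigned tag)
        {
                root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));
        }

so every bit added to the gfp mask is a bit taken away from those users.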
Thanks,
NeilBrown
Download attachment "signature.asc" of type "application/pgp-signature" (819 bytes)
Powered by blists - more mailing lists