Message-ID: <1488916356.6405.4.camel@redhat.com>
Date:   Tue, 07 Mar 2017 14:52:36 -0500
From:   Rik van Riel <riel@...hat.com>
To:     Michal Hocko <mhocko@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>
Cc:     Mel Gorman <mgorman@...e.de>, Johannes Weiner <hannes@...xchg.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
        linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>,
        Michal Hocko <mhocko@...e.com>
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@...e.com>
> 
> Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
> 
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.

It only does this to some extent.  If reclaim made
no progress, for example due to immediately bailing
out because the number of already isolated pages is
too high (due to many parallel reclaimers), the code
could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
test without ever looking at the number of reclaimable
pages.
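
To make the concern concrete, here is a rough userspace model of the
retry accounting, not the actual page allocator code. All model_*
names are hypothetical, the progress value of 32 is arbitrary, and
the MAX_RECLAIM_RETRIES value of 16 is an assumption for this sketch.
It only illustrates how a reclaimer that bails out early every time
could exhaust the retry budget before the reclaimable page count is
ever consulted.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed retry budget for this sketch; check the tree you are on. */
#define MAX_RECLAIM_RETRIES 16

/*
 * Hypothetical stand-in for direct reclaim. If too many pages are
 * already isolated, it bails out immediately and reports no progress.
 */
static unsigned long model_direct_reclaim(bool too_many_isolated)
{
	if (too_many_isolated)
		return 0;
	return 32;	/* arbitrary "pages reclaimed" figure */
}

/*
 * Hypothetical retry decision. The no-progress counter is tested
 * first, so "reclaimable" is never looked at once the budget is gone.
 */
static bool model_should_retry(unsigned long progress,
			       int *no_progress_loops,
			       unsigned long reclaimable)
{
	if (progress)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;	/* give up; an OOM kill would follow */

	return reclaimable > 0;
}

/*
 * Returns the attempt number on which the model gives up, or "cap"
 * if it is still retrying after "cap" attempts.
 */
int model_attempts_before_giving_up(bool too_many_isolated,
				    unsigned long reclaimable, int cap)
{
	int no_progress_loops = 0;

	assert(cap > 0);
	for (int attempt = 1; attempt <= cap; attempt++) {
		unsigned long progress =
			model_direct_reclaim(too_many_isolated);

		if (!model_should_retry(progress, &no_progress_loops,
					reclaimable))
			return attempt;
	}
	return cap;
}
```

In this model, a caller that is always throttled gives up on the
17th attempt no matter how large "reclaimable" is, while a caller
that makes progress keeps retrying up to the cap.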

Could that create problems if we have many concurrent
reclaimers?

It may be OK, I just do not understand all the implications.

I like the general direction your patch takes the code in,
but I would like to understand it better...

-- 
All rights reversed
