[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com>
Date: Tue, 13 Oct 2015 09:37:06 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc: Michal Hocko <mhocko@...nel.org>,
David Rientjes <rientjes@...gle.com>,
Oleg Nesterov <oleg@...hat.com>,
Kyle Walker <kwalker@...hat.com>,
Christoph Lameter <cl@...ux.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Vladimir Davydov <vdavydov@...allels.com>,
linux-mm <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Stanislav Kozina <skozina@...hat.com>
Subject: Re: Silent hang up caused by pages being not scanned?
On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
<penguin-kernel@...ove.sakura.ne.jp> wrote:
>
> If I remove
>
> /* Any of the zones still reclaimable? Don't OOM. */
> if (zones_reclaimable)
> return 1;
>
> the OOM killer is invoked even when there are so much memory which can be
> reclaimed after written to disk. This is definitely premature invocation of
> the OOM killer.
Right. The rest of the code knows that the return value right now
means "there is no memory at all" rather than "I made progress".
> Yes. But we can't simply do
>
> if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?
So I think that whole thing is kind of senseless. Not just that
particular conditional, but what it *does* too.
What can easily happen is that we are a blocking allocation, but
because we're __GFP_FS or something, the code doesn't actually start
writing anything out. Nor is anything congested. So the thing just
loops.
And looping is stupid, because we may be not able to actually free
anything exactly because of limitations like __GFP_FS.
So
(a) the looping condition is senseless
(b) what we do when looping is senseless
and we actually do try to wake up kswapd in the loop, but we never
*wait* for it, so that's largely pointless too.
So *of*course* the direct reclaim code has to set "I made progress",
because if it doesn't lie and say so, then the code will randomly not
loop, and will oom, and things go to hell.
But I hate the "let's tweak the zone_reclaimable" idea, because it
doesn't actually fix anything. It just perpetuates this "the code
doesn't make sense, so let's add *more* senseless heusristics to this
whole loop".
So instead of that senseless thing, how about trying something
*sensible*. Make the code do something that we can actually explain as
making sense.
I'd suggest something like:
- add a "retry count"
- if direct reclaim made no progress, or made less progress than the target:
if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;
- regardless of whether we made progress or not:
if (retry count < X) goto retry;
if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
goto retry
where 'X" is something sane that limits our CPU use, but also
guarantees that we don't end up waiting *too* long (if a single
allocation takes more than a big fraction of a second, we should
probably stop trying).
The whole time-based thing might even be explicit. There's nothing
wrong with doing something like
unsigned long timeout = jiffies + HZ/4;
at the top of the function, and making the whole retry logic actually
say something like
if (time_after(timeout, jiffies)) goto noretry;
(or make *that* trigger the oom logic, or whatever).
Now, I realize the above suggestions are big changes, and they'll
likely break things and we'll still need to tweak things, but dammit,
wouldn't that be better than just randomly tweaking the insane
zone_reclaimable logic?
Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists