lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 13 Oct 2015 09:37:06 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc:	Michal Hocko <mhocko@...nel.org>,
	David Rientjes <rientjes@...gle.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Kyle Walker <kwalker@...hat.com>,
	Christoph Lameter <cl@...ux.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Vladimir Davydov <vdavydov@...allels.com>,
	linux-mm <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Stanislav Kozina <skozina@...hat.com>
Subject: Re: Silent hang up caused by pages being not scanned?

On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
<penguin-kernel@...ove.sakura.ne.jp> wrote:
>
> If I remove
>
>         /* Any of the zones still reclaimable?  Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> the OOM killer is invoked even when there are so much memory which can be
> reclaimed after written to disk. This is definitely premature invocation of
> the OOM killer.

Right. The rest of the code knows that the return value right now
means "there is no memory at all" rather than "I made progress".

> Yes. But we can't simply do
>
>         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?

So I think that whole thing is kind of senseless. Not just that
particular conditional, but what it *does* too.

What can easily happen is that we are a blocking allocation, but
because we're __GFP_FS or something, the code doesn't actually start
writing anything out. Nor is anything congested. So the thing just
loops.

And looping is stupid, because we may be not able to actually free
anything exactly because of limitations like __GFP_FS.

So

 (a) the looping condition is senseless

 (b) what we do when looping is senseless

and we actually do try to wake up kswapd in the loop, but we never
*wait* for it, so that's largely pointless too.

So *of*course* the direct reclaim code has to set "I made progress",
because if it doesn't lie and say so, then the code will randomly not
loop, and will oom, and things go to hell.

But I hate the "let's tweak the zone_reclaimable" idea, because it
doesn't actually fix anything. It just perpetuates this "the code
doesn't make sense, so let's add *more* senseless heusristics to this
whole loop".

So instead of that senseless thing, how about trying something
*sensible*. Make the code do something that we can actually explain as
making sense.

I'd suggest something like:

 - add a "retry count"

 - if direct reclaim made no progress, or made less progress than the target:

      if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

 - regardless of whether we made progress or not:

      if (retry count < X) goto retry;

      if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
goto retry

   where 'X" is something sane that limits our CPU use, but also
guarantees that we don't end up waiting *too* long (if a single
allocation takes more than a big fraction of a second, we should
probably stop trying).

The whole time-based thing might even be explicit. There's nothing
wrong with doing something like

    unsigned long timeout = jiffies + HZ/4;

at the top of the function, and making the whole retry logic actually
say something like

    if (time_after(timeout, jiffies)) goto noretry;

(or make *that* trigger the oom logic, or whatever).

Now, I realize the above suggestions are big changes, and they'll
likely break things and we'll still need to tweak things, but dammit,
wouldn't that be better than just randomly tweaking the insane
zone_reclaimable logic?

                    Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ