Message-ID: <20090625132544.GB9995@mit.edu>
Date:	Thu, 25 Jun 2009 09:25:44 -0400
From:	Theodore Tso <tytso@....edu>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	penberg@...helsinki.fi, arjan@...radead.org,
	linux-kernel@...r.kernel.org, cl@...ux-foundation.org,
	npiggin@...e.de
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 03:07:14PM -0700, Andrew Morton wrote:
> 
> fs/jbd/journal.c:       new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
> 
> But that isn't :(

Well, we could recode it to do what journal_alloc_head() does, which
is call the allocator in a loop:

	ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
	if (ret == NULL) {
		jbd_debug(1, "out of memory for journal_head\n");
		if (time_after(jiffies, last_warning + 5*HZ)) {
			printk(KERN_NOTICE "ENOMEM in %s, retrying.\n",
			       __func__);
			last_warning = jiffies;
		}
		while (ret == NULL) {
			yield();
			ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
		}
	}

Like journal_write_metadata_buffer(), which you quoted, it's called
out of the commit code, where about the only choice we have other than
looping or using GFP_NOFAIL is to abort the filesystem and remount it
read-only, or to panic.  It's not at all clear to me that looping
repeatedly is helpful; for example, the allocator doesn't know that it
should try really hard, and perhaps fall back to an order 0 allocation
if an order 1 allocation won't work.
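
For readers who want to play with the pattern outside the kernel, here
is a minimal userspace sketch of the same retry loop.  Everything in it
is an assumption made for illustration: struct flaky_alloc,
flaky_try_alloc() and alloc_retry_forever() are hypothetical names,
malloc() stands in for kmem_cache_alloc(), sched_yield() for yield(),
and a five-second time() check for the jiffies test.  It is not the jbd
code.

```c
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical allocator that fails a fixed number of times first. */
struct flaky_alloc {
	int failures_left;	/* attempts that will still return NULL */
	int calls;		/* total attempts made so far */
};

static void *flaky_try_alloc(struct flaky_alloc *a, size_t size)
{
	a->calls++;
	if (a->failures_left > 0) {
		a->failures_left--;
		return NULL;
	}
	return malloc(size);
}

/*
 * Retry until the allocation succeeds, warning at most once every
 * five seconds.  Never returns NULL, so it can spin forever if
 * memory never becomes available.
 */
static void *alloc_retry_forever(struct flaky_alloc *a, size_t size)
{
	static time_t last_warning;
	void *ret;

	assert(a != NULL);
	ret = flaky_try_alloc(a, size);
	if (ret == NULL) {
		time_t now = time(NULL);

		if (now - last_warning >= 5) {
			fprintf(stderr, "ENOMEM in %s, retrying.\n", __func__);
			last_warning = now;
		}
		while (ret == NULL) {
			sched_yield();	/* let other tasks run, then retry */
			ret = flaky_try_alloc(a, size);
		}
	}
	return ret;
}
```

Note that the caller passes the allocator no extra information on the
retries: each attempt looks exactly like the first one.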

Hmm.... it may be possible to do the memory allocation in advance,
before we get to the commit, and make it easier to fail and return
ENOMEM to userspace --- which I bet most applications won't handle
gracefully, either (a) not checking error codes and losing data, or
(b) dying on the spot, so it would effectively be an OOM kill.
And in some cases, we're calling journal_get_write_access() out of a
kernel daemon like pdflush, where the error recovery paths may get
rather interesting.
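
A rough userspace sketch of the "allocate in advance" idea, assuming a
small fixed-size reserve that is filled before the commit starts.  The
names (struct reserve, reserve_fill(), reserve_take(),
reserve_release(), RESERVE_MAX) are hypothetical and malloc() stands in
for the kernel allocator; this only shows the shape of the approach,
not how jbd would actually do it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_MAX 8	/* hypothetical upper bound on buffers per commit */

struct reserve {
	void *slot[RESERVE_MAX];
	size_t count;		/* number of buffers currently held */
};

/* Free every buffer still held in the reserve. */
static void reserve_release(struct reserve *r)
{
	assert(r != NULL);
	while (r->count > 0)
		free(r->slot[--r->count]);
}

/*
 * Fill the reserve with n buffers before the commit begins.  This is
 * the step that is allowed to fail: on error nothing is held and the
 * caller can still return -ENOMEM cleanly.
 */
static int reserve_fill(struct reserve *r, size_t n, size_t size)
{
	assert(r != NULL);
	r->count = 0;
	if (n > RESERVE_MAX)
		return -EINVAL;
	for (size_t i = 0; i < n; i++) {
		void *p = malloc(size);

		if (p == NULL) {
			reserve_release(r);
			return -ENOMEM;
		}
		r->slot[r->count++] = p;
	}
	return 0;
}

/*
 * Take one buffer during the commit.  No allocation happens here;
 * NULL means the caller reserved too few buffers up front.
 */
static void *reserve_take(struct reserve *r)
{
	assert(r != NULL);
	return r->count > 0 ? r->slot[--r->count] : NULL;
}
```

The failure point moves to reserve_fill(), where the error can still be
propagated, but that only helps if the caller knows up front how many
buffers the commit will need.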

The question then is what is the right strategy?  Use GFP_NOFAIL, and
let the memory allocator loop; let the allocating kernel code loop;
remount filesystems read-only and/or panic; pass a "try _really_ hard"
flag to the allocator and fall back to a ro-remount/panic if the
allocator still wasn't successful?  None of the alternatives seem
particularly appealing to me....
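
To make the last alternative concrete, here is a small userspace sketch
of "try a bounded number of times, then give up and go read-only".
struct fs_state, try_alloc_fn, alloc_or_remount_ro() and the max_tries
parameter are all made up for illustration; no such "try really hard"
flag or helper is implied to exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-filesystem state. */
struct fs_state {
	bool read_only;
};

/* Hypothetical allocator callback; returns NULL on failure. */
typedef void *(*try_alloc_fn)(void *ctx, size_t size);

/*
 * Try the allocation up to max_tries times.  If every attempt fails,
 * mark the filesystem read-only and return NULL so the caller can
 * abort the commit instead of looping forever.
 */
static void *alloc_or_remount_ro(struct fs_state *fs, try_alloc_fn try_alloc,
				 void *ctx, size_t size, int max_tries)
{
	assert(fs != NULL && try_alloc != NULL);
	for (int i = 0; i < max_tries; i++) {
		void *p = try_alloc(ctx, size);

		if (p != NULL)
			return p;
	}
	fs->read_only = true;
	return NULL;
}

/* Two trivial allocators for exercising both outcomes. */
static void *always_fail(void *ctx, size_t size)
{
	(void)ctx;
	(void)size;
	return NULL;
}

static void *always_succeed(void *ctx, size_t size)
{
	(void)ctx;
	return malloc(size);
}
```

The policy decision (how many tries, and what "giving up" means) sits
in one place in this sketch, which is the part that is hard to get
right.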

							- Ted