Message-ID: <20140529072633.GH6677@dastard>
Date:	Thu, 29 May 2014 17:26:33 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Jens Axboe <axboe@...nel.dk>, Minchan Kim <minchan@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-mm <linux-mm@...ck.org>, "H. Peter Anvin" <hpa@...or.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Hugh Dickins <hughd@...gle.com>,
	Rusty Russell <rusty@...tcorp.com.au>,
	"Michael S. Tsirkin" <mst@...hat.com>,
	Dave Hansen <dave.hansen@...el.com>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [RFC 2/2] x86_64: expand kernel stack to 16K

On Wed, May 28, 2014 at 07:42:40PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 6:30 PM, Dave Chinner <david@...morbit.com> wrote:
> >
> > You're focussing on the specific symptoms, not the bigger picture.
> > i.e. you're ignoring all the other "let's start IO" triggers in
> > direct reclaim. e.g. there's two separate plug flush triggers in
> > shrink_inactive_list(), one of which is:
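(For reference, that trigger is the congestion stall in 3.15-era
mm/vmscan.c - quoted here from memory, so treat it as a sketch
rather than the exact source:

	/*
	 * Stall direct reclaim for IO completions if the underlying
	 * BDI or zone is congested.  Only kswapd is allowed to keep
	 * going.
	 */
	if (!sc->hibernation_mode && !current_is_kswapd())
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

wait_iff_congested() can end up in io_schedule_timeout(), which
flushes the current task's plug inline, on whatever stack direct
reclaim is already running on.)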
> 
> Fair enough. I certainly agree that we should look at the other cases here too.
> 
> In fact, I also find it distasteful just how much stack space some
> of those VM routines are using up on their own, never mind any
> actual IO paths at all. The fact that __alloc_pages_nodemask() uses
> 350 bytes of stack space on its own is actually quite disturbing.
> The fact that kernel_map_pages() apparently has almost 400 bytes of
> stack is just crazy. Obviously that case only happens with
> CONFIG_DEBUG_PAGEALLOC, but still..

What concerns me about both __alloc_pages_nodemask() and
kernel_map_pages() is that when I read the source I see functions
with no obvious stack usage problem, yet the compiler is producing
functions with huge stack footprints, and nothing in the code makes
that visible. So in this case I'm more concerned that we have a
major disconnect between the source code structure and the code
that the compiler produces...
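As a toy illustration of the pattern - this is not kernel code, and
the sizes are made up: each helper below looks cheap in isolation,
but once the compiler inlines all of them into one caller, their
scratch buffers can end up co-resident in a single frame. Whether
they actually do depends on the compiler's stack-slot sharing, and
GCCs of this vintage are known to keep each inlined instance's
locals alive separately. Building with
-Wframe-larger-than=256 -fstack-usage makes the combined frame
visible without changing a line of code:

	#include <string.h>

	static inline void step(char *out, int tag)
	{
		char scratch[128];	/* harmless on its own */

		memset(scratch, tag, sizeof(scratch));
		memcpy(out, scratch, 16);
	}

	/*
	 * Reads as three trivial calls, but can compile to a single
	 * frame holding three 128-byte buffers plus its own locals.
	 */
	void pipeline(char *out)
	{
		step(out, 1);
		step(out + 16, 2);
		step(out + 32, 3);
	}

(In-tree, objdump -d vmlinux | scripts/checkstack.pl x86_64 lists
per-function frame sizes, which is an easy way to spot offenders
like the two above.)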

> > I'm not saying we shouldn't turn off swap from direct reclaim, just
> > that all we'd be doing by turning off swap is playing whack-a-stack
> > - the next report will simply be from one of the other direct
> > reclaim IO schedule points.
> 
> Playing whack-a-mole with this for a while might not be a bad idea,
> though. It's not like we will ever really improve unless we start
> whacking the worst cases. And it should still be a fairly limited
> number.

I guess I've been playing whack-a-stack for so long now, and some
of the overruns have been so large, that I just don't see it as a
viable medium- to long-term solution.

> After all, historically, some of the cases we've played whack-a-mole
> on have been in XFS, so I'd think you'd be thrilled to see some other
> code get blamed this time around ;)

Blame shifting doesn't thrill me - I'm still at the pointy end of
stack overrun reports, and we've still got to do the hard work of
solving the problem. However, I am happy to see acknowledgement of
the problem so we can work out how to solve the issues...

> > Regardless of whether it is swap or something external that
> > queues the bio on the plug, perhaps we should look at why it's
> > done inline rather than by kblockd, where it was moved because it
> > was blowing the stack from schedule():
> 
> So it sounds like we need to do this for io_schedule() too.
> 
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.
> 
> Of course, we do need to worry about latency for starting IO, but any
> of these kinds of memory-pressure writeback patterns are pretty much
> by definition not about the latency of one _particular_ IO, so they
> don't tend to be latency-sensitive. Quite the reverse: we start
> writeback and then end up waiting on something else altogether
> (possibly a writeback that got started much earlier).

*nod*

> swapout certainly is _not_ IO-latency-sensitive, especially these
> days. And while we _do_ want to throttle in direct reclaim, if it's
> about throttling I'd certainly think that it sounds quite reasonable
> to push any unplugging to kblockd rather than try to do that synchronously.
> If we are throttling in direct-reclaim, we need to slow things _down_
> for the writer, not worry about latency.

Right, we are already adding latency to the caller by having to
swap, so a small amount of additional IO dispatch latency for IO
we aren't going to wait on directly doesn't really matter at all.
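The fix has the same shape as the one schedule() already got: defer
the queue run to kblockd so it happens on a worker thread's fresh
stack. Roughly, following the 3.15-era block code (paraphrased from
memory, not an exact quote):

	static void queue_unplugged(struct request_queue *q,
				    unsigned int depth, bool from_schedule)
	{
		if (from_schedule)
			blk_run_queue_async(q);	/* punt to kblockd */
		else
			__blk_run_queue(q);	/* run on this stack */
		spin_unlock(q->queue_lock);
	}

io_schedule() currently flushes the plug with from_schedule == false;
having it take the blk_schedule_flush_plug() path that schedule()
uses would move the dispatch off the reclaim stack entirely.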

> >                That implies no IO in direct reclaim context
> > is safe - either from swap or io_schedule() unplugging. It also
> > lends a lot of weight to my assertion that the majority of the stack
> > growth over the past couple of years has been occurring outside the
> > filesystems....
> 
> I think Minchan's stack trace definitely backs you up on that. The
> filesystem part - despite that one ext4_writepages() function - is a
> very small part of the whole. It sits at about ~1kB of stack. Just the
> VM "top-level" writeback code is about as much, and then the VM page
> alloc/shrinking code when the filesystem needs memory is *twice* that,
> and then the block layer and the virtio code are another 1kB each.

*nod*

As I said earlier, look at this in the context of the bigger
picture. We can also have more stack-using layers in the IO stack
and/or more stack-expensive layers. e.g. it could be block -> dm ->
md -> SCSI -> mempool_alloc in that stack rather than block ->
virtio -> kmalloc. Hence 1k of virtio stack could be 1.5k of SCSI
stack, md/dm could contribute a few hundred bytes each (or more,
depending on how many layers of dm/md there are), and so on.
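A rough tally using the numbers from this thread makes the point:

	~1k	filesystem (the ext4_writepages() path)
	~1k	VM top-level writeback
	~2k	VM page alloc/shrink when the fs needs memory
	~1k	block layer
	~1.5k	SCSI in place of virtio's ~1k
	~0.5k	a couple of dm/md layers at a few hundred bytes each
	-----
	~7k	of the 8k x86_64 stack, before counting whatever
		syscall path got us into reclaim in the first place

The exact numbers vary by config and compiler, but it's the sum, not
any single layer, that blows the stack.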

When you start adding all that up, it doesn't paint a pretty
picture. That's one of the main reasons why I don't think the
whack-a-stack approach will solve the problem in the medium to long
term...

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com