[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20111222145426.5844df96.akpm@linux-foundation.org>
Date: Thu, 22 Dec 2011 14:54:26 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Tejun Heo <tj@...nel.org>
Cc: avi@...hat.com, nate@...nel.net, cl@...ux-foundation.org,
oleg@...hat.com, axboe@...nel.dk, vgoyal@...hat.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and
fix blkcg percpu alloc deadlock
On Thu, 22 Dec 2011 14:41:17 -0800
Tejun Heo <tj@...nel.org> wrote:
> Hello, Andrew.
>
> On Thu, Dec 22, 2011 at 02:20:58PM -0800, Andrew Morton wrote:
> > Don't just consider my suggestions - please try to come up with your own
> > alternatives too! If all else fails then this patch is a last resort.
>
> Umm... this is my alternative.
We're beyond the point where aany additional kernel complexity should
be considered a regression.
> > > but apparently those percpu stats reduced
> > > CPU overhead significantly.
> >
> > Deleting them would save even more CPU.
> >
> > Or make them runtime or compile-time configurable, so only the
> > developers see the impact.
> >
> > Some specifics on which counters are causing the problems would help here.
>
> These stats are userland visible and quite useful ones if blkcg is in
> use. I don't really see how these can be removed.
What stats?
And why are we doing percpu *allocation* so deep in the code? You mean
we're *creating* stats counters on an IO path? Sounds odd. Where is
this code?
> > > > Or how about we fix the percpu memory allocation code so that it
> > > > propagates the gfp flags, then delete this patchset?
> > >
> > > Oh, no, this is gonna make things *way* more complex. I tried.
> >
> > But there's a difference between fixing a problem and working around it.
>
> Yeah, that was my first direction too. The reason why percpu can't do
> NOIO is the same one why vmalloc can't do it. It reaches pretty deep
> into page table code and I don't think doing all that churning is
> worthwhile or even desirable. An altnernative approach would be
> implementing transparent front buffer to percpu allocator, which I
> *might* do if there really are more of these users, but I think
> keeping percpu allocator painful to use from reclaim context isn't
> such a bad idea.
>
> There have been multiple requests for atomic allocation and they all
> have been successfully pushed back, but IMHO this is a valid one and I
> don't see a better way around the problem, so while I agree using
> mempool for this is a workaround, I think it is a right choice, for
> now, anyway.
For starters, doing pagetable allocation on the I/O path sounds nutty.
Secondly, GFP_NOIO is a *weaker* allocation mode than GFP_KERNEL. By
permitting it with this patchset, we have a kernel which is more likely
to get oom failures. Fixing the kernel to not perform GFP_NOIO
allocations for these counters will result in a more robust kernel.
This is a good thing, which improves the kernel while avoiding adding
more compexity elsewhere.
This patchset is the worst option and we should try much harder to avoid
applying it!
> > > If we're gonna have many more NOIO percpu users, which I don't
> > > think we would or should, that might make sense but, for fringe
> > > cases, extending mempool to cover percpu is a much better sized
> > > solution.
> >
> > I've long felt that we goofed with the gfp_flags thing and that it
> > should be a field in the task_struct. Now *that* would be a large
> > patch!
>
> Yeah, some of PF_* flags already carry related role information. I'm
> not too sure how much pushing the whole thing into task_struct would
> change tho. We would need push/popping. It could be simpler in some
> cases but in essence wouldn't we have just relocated the position of
> parameter?
The code would get considerably simpler. The big benefit comes when
you have deep call stacks - we're presently passing a gfp_t down five
layers of function call while none of the intermediate functions even
use the thing - they just pass it on to the next guy. Pass it via the
task_struct and all that goes away. It would make maintenance a lot
easier - at present if you want to add a new kmalloc() to a leaf
function you need to edit all five layers of caller functions.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists