linux-kernel - Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation and remove stats

Message-Id: <20120307150549.955d6f9c.akpm@linux-foundation.org>
Date:	Wed, 7 Mar 2012 15:05:49 -0800
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Vivek Goyal <vgoyal@...hat.com>
Cc:	Tejun Heo <tj@...nel.org>, axboe@...nel.dk, hughd@...gle.com,
	avi@...hat.com, nate@...nel.net, cl@...ux-foundation.org,
	linux-kernel@...r.kernel.org, dpshah@...gle.com,
	ctalbott@...gle.com, rni@...gle.com
Subject: Re: [PATCHSET] mempool, percpu, blkcg: fix percpu stat allocation
 and remove stats_lock

On Wed, 7 Mar 2012 09:55:56 -0500
Vivek Goyal <vgoyal@...hat.com> wrote:

> On Tue, Mar 06, 2012 at 01:55:31PM -0800, Andrew Morton wrote:
> 
> [..]
> > > > hoo boy that looks like an infinite loop.  What's going on here?
> > > 
> > > > If allocation fails, I am trying to allocate it again in an infinite
> > > > loop. What should I do? Try it after sleeping a bit? Or give up after
> > > > a certain number of tries? This is in worker thread context though, so
> > > > the main IO path is not impacted.
> > 
> > On a non-preemptible uniprocessor kernel it's game over, isn't it? 
> > Unless someone frees some memory from interrupt context it is time for
> > the Big Red Button.
> 
> Yes.  It's an issue on non-preemptible UP kernels. I changed the logic to
> msleep(10) before retrying. Tested on a UP non-preemptible kernel with an
> always-failing allocation and things are fine.
> 
> > 
> > I'm not sure what to suggest, really - if an allocation failed then
> > there's nothing the caller can reliably do to fix that.  The best
> > approach is to fail all the way back to userspace with -ENOMEM.
> 
> As user space is not waiting for this allocation, -ENOMEM is really
> not an option.

Well, it would have to be -EIO, because the block layer is stupid about
errnos.

> > 
> > In this context I suppose you could drop a warning into the logs then
> > bale out and retry on the next IO attempt.
> 
> Yes, that also can be done. I found msleep(10) to be an easier solution
> than removing the group from the list and trying again when new IO comes
> in. Is this acceptable?

Seems a bit sucky to me.  That allocation isn't *needed* for the kernel
to be able to complete the IO operation.  It's just that we
(mis)designed things so that we're dependent upon it succeeding.  Sigh.

msleep() will cause that kernel thread to contribute to load average
when it is in this state.  Intentional?
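
For context: msleep() sleeps in TASK_UNINTERRUPTIBLE state, which is why the
thread would show up in the load average. The alternative mentioned above
(warn, bail out, retry on the next IO) might look roughly like the following
user-space sketch. It is not the blkcg code: `struct io_group`,
`struct group_stats`, `stats_alloc()`, `account_io()` and the `simulate_oom`
hook are hypothetical names invented for illustration, and calloc() stands in
for the kernel allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-group statistics; not the kernel layout. */
struct group_stats {
    unsigned long ios;
};

struct io_group {
    struct group_stats *stats;  /* NULL until an allocation succeeds */
    bool warned;                /* warn once, not on every failed attempt */
};

/* Test hook (assumption): when true, the allocator below always fails. */
static bool simulate_oom;

static struct group_stats *stats_alloc(void)
{
    return simulate_oom ? NULL : calloc(1, sizeof(struct group_stats));
}

/*
 * Called once per IO.  If the stats are missing, try to allocate them; on
 * failure, log a warning and carry on without accounting.  There is no
 * loop and no sleep: the next IO simply tries again.
 * Returns true if the IO was accounted, false if accounting was skipped.
 */
static bool account_io(struct io_group *g)
{
    assert(g != NULL);

    if (!g->stats) {
        g->stats = stats_alloc();
        if (!g->stats) {
            if (!g->warned) {
                fprintf(stderr,
                        "stats allocation failed; will retry on next IO\n");
                g->warned = true;
            }
            return false;       /* the IO itself still proceeds */
        }
    }
    g->stats->ios++;
    return true;
}
```

The cost of this shape is that IOs issued while the allocation keeps failing
go uncounted; whether that is acceptable is a policy decision the sketch does
not settle.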

> [..]
> > 
> > btw, speaking of uniprocessor: please do perform a uniprocessor build
> > and see what impact the patch has upon the size(1) output for the .o
> > files.  We should try to minimize the pointless bloat for the UP
> > kernel.
> 
> But this logic is required both for UP and SMP kernels. So bloat on UP
> is not unnecessary?

UP doesn't need a per-cpu variable, hence it doesn't need to run
alloc_percpu() at all.  For UP all we need to do is to aggregate a
`struct blkio_group_stats' within `struct blkg_policy_data'?

This could still be done with suitable abstraction and wrappers. 
Whether that's desirable depends on how fat the API ends up, I guess.
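
A minimal sketch of what such wrappers could look like, written as plain
user-space C rather than kernel code.  Everything here is an assumption made
for illustration: `struct policy_data`, `struct group_stats`, the `stats_*()`
helpers, and the use of an `NR_CPUS` macro with calloc() in place of the
kernel's per-cpu allocator.  The point is only that callers stay identical
while the UP build embeds the stats and never allocates.

```c
#include <assert.h>
#include <stdlib.h>

#ifndef NR_CPUS
#define NR_CPUS 1   /* assumption for this sketch: 1 stands in for a UP build */
#endif

struct group_stats {
    unsigned long sectors;
};

#if NR_CPUS > 1
/* SMP flavour: one slot per CPU behind a pointer, allocated separately. */
struct policy_data {
    struct group_stats *stats;  /* array of NR_CPUS entries */
};

static int stats_init(struct policy_data *pd)
{
    pd->stats = calloc(NR_CPUS, sizeof(*pd->stats));
    return pd->stats ? 0 : -1;
}

static void stats_exit(struct policy_data *pd)
{
    free(pd->stats);
    pd->stats = NULL;
}

static struct group_stats *stats_ptr(struct policy_data *pd, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return &pd->stats[cpu];
}
#else
/* UP flavour: stats embedded directly; no allocation, so nothing can fail. */
struct policy_data {
    struct group_stats stats;
};

static int stats_init(struct policy_data *pd)
{
    pd->stats.sectors = 0;
    return 0;
}

static void stats_exit(struct policy_data *pd)
{
    (void)pd;                   /* nothing to free */
}

static struct group_stats *stats_ptr(struct policy_data *pd, int cpu)
{
    (void)cpu;                  /* only one CPU, only one slot */
    return &pd->stats;
}
#endif

/* Callers are the same in both builds. */
static void stats_add(struct policy_data *pd, int cpu, unsigned long n)
{
    stats_ptr(pd, cpu)->sectors += n;
}

static unsigned long stats_sum(struct policy_data *pd)
{
    unsigned long total = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        total += stats_ptr(pd, cpu)->sectors;
    return total;
}
```

In the UP flavour stats_init() cannot fail and stats_ptr() is a plain member
access, so the retry logic and the extra pointer hop both disappear; the
price is five small wrappers in the API.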

> I ran size(1) on block/blk-cgroup.o with and without the patch and I can
> see some bloat.
> 
> Without patch (UP kernel)
> -------------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   12950    5248      50   18248    4748 block/blk-cgroup.o
> 
> With patch (UP kernel)
> ----------------------
> # size block/blk-cgroup.o
>    text    data     bss     dec     hex filename
>   13316    5376      58   18750    493e block/blk-cgroup.o

Yeah.

The additional text imposes runtime overhead, but there's also
additional cost from things like the extra pointer hops to access the
per-cpu data.
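
For reference, the growth implied by the two size(1) listings quoted above
can be worked out as below.  The numbers are copied from those listings; the
struct and function names are hypothetical and exist only for this
calculation.

```c
#include <assert.h>
#include <stdio.h>

/* Section sizes in bytes, as reported by size(1). */
struct section_sizes {
    long text, data, bss;
};

/* Figures copied from the quoted size(1) output for block/blk-cgroup.o. */
static const struct section_sizes before = { 12950, 5248, 50 };
static const struct section_sizes after  = { 13316, 5376, 58 };

/* The "dec" column of size(1) is the sum of the three sections. */
static long total(struct section_sizes s)
{
    return s.text + s.data + s.bss;
}

/* Print the per-section and overall growth caused by the patch. */
static void print_growth(void)
{
    printf("text +%ld, data +%ld, bss +%ld, total +%ld bytes\n",
           after.text - before.text,
           after.data - before.data,
           after.bss - before.bss,
           total(after) - total(before));
}
```

That is 366 bytes of text, 128 of data and 8 of bss, 502 bytes in all on the
UP build.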
