Message-Id: <20111222191144.78aec23a.akpm@linux-foundation.org>
Date: Thu, 22 Dec 2011 19:11:44 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Vivek Goyal <vgoyal@...hat.com>
Cc: Tejun Heo <tj@...nel.org>, avi@...hat.com, nate@...nel.net,
cl@...ux-foundation.org, oleg@...hat.com, axboe@...nel.dk,
linux-kernel@...r.kernel.org
Subject: Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and
fix blkcg percpu alloc deadlock
On Thu, 22 Dec 2011 21:54:11 -0500 Vivek Goyal <vgoyal@...hat.com> wrote:
> On Thu, Dec 22, 2011 at 05:38:20PM -0800, Andrew Morton wrote:
> > On Thu, 22 Dec 2011 20:21:12 -0500 Vivek Goyal <vgoyal@...hat.com> wrote:
> >
> > > On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > > >
> > > > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > > > options.
> > > >
> > > > a) Give userspace a new procfs/debugfs file to start stats gathering
> > > > on a particular cgroup/request_queue pair. Allocate the stats
> > > > memory in that.
> > > >
> > > > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> > > > and return zeroes for this first call.
> > >
> > > But isn't the purpose of stats that they are gathered even if nobody
> > > has read them even once?
> >
> > That's not a useful way of using stats. The normal usage would be to
> > record the stats, then start the workload, then monitor how the stats
> > change as work proceeds.
>
> I have at least one example, "iostat", which does not follow this. Its
> first report shows the total stats since system boot, and each
> subsequent report covers the time since the previous report. With stats
> being available since cgroup creation time, one could think of extending
> the iostat tool to display per-IO-cgroup stats too.
If that's useful (dubious) then it can be addressed by creating the
stats when a device is bound to the cgroup (below).
> Also, we have a knob "reset_stats" to reset all the stats to zero. So
> one can first reset the stats, start the workload and then monitor it
> (if one does not want the stats since cgroup creation time).
>
> >
> > > So if I create a cgroup and put some
> > > task into it which does some IO, I think stat collection should start
> > > immediately, without the user taking any action.
> >
> > If you really want to know the stats since cgroup creation then trigger
> > the stats initialisation from userspace when creating the blkio_cgroup.
>
> These per-cpu stats are per cgroup, per device. So if a workload in a
> cgroup is doing IO to 4 devices, we allocate 4 percpu stat areas. At
> cgroup creation time we just don't know how many of these to create, and
> it also does not cover the case of a device hotplugged after cgroup
> creation.
Mark the cgroup as "needing stats", then allocate the stats (if needed)
when a device is bound to the cgroup rather than on first I/O.
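Very roughly, something like the below. This is only a sketch: the
names blkio_bind_device and needs_stats are made up for illustration,
and only alloc_percpu() and the existing blkio structures are real.

	/*
	 * Sketch: allocate the per-cpu stats when the device is bound to
	 * the cgroup, in process context, instead of in the I/O path.
	 */
	static int blkio_bind_device(struct blkio_cgroup *blkcg,
				     struct blkio_group *blkg)
	{
		if (!blkcg->needs_stats)
			return 0;

		/* the percpu allocator may sleep - fine here, no queue lock held */
		blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
		if (!blkg->stats_cpu)
			return -ENOMEM;

		return 0;
	}

If the allocation fails we simply don't gather stats for that
cgroup/device pair, which the ->stats_cpu == NULL handling has to cope
with anyway.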
> >
> > > Forcing the user to first
> > > read a stat before the collection starts is kind of odd to me.
> >
> > Well, one could add a separate stats_enable knob. Doing it
> > automatically from read() would be for approximate back-compatibility
> > with existing behaviour.
> >
> > Plus (again) this way we also avoid burdening non-stats-users with the
> > overhead of stats.
>
> Even if we do that, we have the problem of a hotplugged device. Assume a
> cgroup is created and stats are enabled; now a new device shows up and
> some task in the group does IO on that device. Now we need to create the
> percpu data area for that cgroup-device pair dynamically in the IO path,
> and we are back to the same problem.
Why do the allocation during I/O? Can't it be done in the hotplug handler?
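Only a sketch, and everything here except struct request_queue and
list_for_each_entry() is a made-up name (blkio_queue_added,
blkio_cgroup_list, stats_enabled, blkio_alloc_group_stats):

	/*
	 * Sketch: when a new queue/device appears, walk the cgroups which
	 * have stats enabled and do the percpu allocations here, in
	 * process context, rather than in the I/O submission path.
	 */
	static int blkio_queue_added(struct request_queue *q)
	{
		struct blkio_cgroup *blkcg;
		int ret;

		list_for_each_entry(blkcg, &blkio_cgroup_list, list) {
			if (!blkcg->stats_enabled)
				continue;
			/* GFP_KERNEL allocation is fine in this context */
			ret = blkio_alloc_group_stats(blkcg, q);
			if (ret)
				return ret;
		}
		return 0;
	}

The I/O path then only ever looks at ->stats_cpu and skips the
accounting when it is NULL.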
Please, don't expend brain cells thinking of reasons why this all can't
be done. Instead expend them on finding a way in which it *can* be
done!
Maybe doing all this cgroups stuff within ->elevator_set_req_fn() was
the wrong design decision.