Message-Id: <20111222191144.78aec23a.akpm@linux-foundation.org>
Date: Thu, 22 Dec 2011 19:11:44 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Vivek Goyal <vgoyal@...hat.com>
Cc: Tejun Heo <tj@...nel.org>, avi@...hat.com, nate@...nel.net,
cl@...ux-foundation.org, oleg@...hat.com, axboe@...nel.dk,
linux-kernel@...r.kernel.org
Subject: Re: [PATCHSET] block, mempool, percpu: implement percpu mempool and
fix blkcg percpu alloc deadlock
On Thu, 22 Dec 2011 21:54:11 -0500 Vivek Goyal <vgoyal@...hat.com> wrote:
> On Thu, Dec 22, 2011 at 05:38:20PM -0800, Andrew Morton wrote:
> > On Thu, 22 Dec 2011 20:21:12 -0500 Vivek Goyal <vgoyal@...hat.com> wrote:
> >
> > > On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > > >
> > > > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > > > options.
> > > >
> > > > a) Give userspace a new procfs/debugfs file to start stats gathering
> > > > on a particular cgroup/request_queue pair. Allocate the stats
> > > > memory in that.
> > > >
> > > > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> > > > and return zeroes for this first call.
> > >
> > > But isn't the purpose of stats that they are gathered even if nobody
> > > has read them even once?
> >
> > That's not a useful way of using stats. The normal usage would be to
> > record the stats, then start the workload, then monitor how the stats
> > change as work proceeds.
>
> I have at least one example, "iostat", which does not follow this. Its
> first report shows the total stats since system boot, and each
> subsequent report covers the time since the previous report. With stats
> being available since cgroup creation time, one could think of extending
> the iostat tool to display per-IO-cgroup stats too.
If that's useful (dubious) then it can be addressed by creating the
stats when a device is bound to the cgroup (below).
> Also, we have a knob "reset_stats" to reset all the stats to zero. So
> one can first reset the stats, start the workload and then monitor it
> (if one does not want the stats since cgroup creation time).
>
> >
> > > So if I create a cgroup and put some
> > > task into it which does some IO, I think stat collection should start
> > > immediately, without the user taking any action.
> >
> > If you really want to know the stats since cgroup creation then trigger
> > the stats initialisation from userspace when creating the blkio_cgroup.
>
> These per-cpu stats are per cgroup, per device. So if a workload in a
> cgroup is doing IO to 4 devices, we allocate 4 percpu stat areas. At
> cgroup creation time we just don't know how many of these to create, and
> it also does not cover the case of a device hotplugged after cgroup
> creation.
Mark the cgroup as "needing stats", then allocate the stats (if needed)
when a device is bound to the cgroup rather than on first I/O.
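Very roughly, something like the below. This is only a sketch: the
names blkio_bind_device and needs_stats are made up for illustration,
and only alloc_percpu() and the existing blkio structures are real.

	/*
	 * Sketch: allocate the per-cpu stats when the device is bound to
	 * the cgroup, in process context, instead of in the I/O path.
	 */
	static int blkio_bind_device(struct blkio_cgroup *blkcg,
				     struct blkio_group *blkg)
	{
		if (!blkcg->needs_stats)
			return 0;

		/* the percpu allocator may sleep - fine here, no queue lock held */
		blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
		if (!blkg->stats_cpu)
			return -ENOMEM;

		return 0;
	}

If the allocation fails we simply don't gather stats for that
cgroup/device pair, which the ->stats_cpu == NULL handling has to cope
with anyway.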
> >
> > > Forcing the user to first
> > > read a stat before the collection starts is kind of odd to me.
> >
> > Well, one could add a separate stats_enable knob. Doing it
> > automatically from read() would be for approximate back-compatibility
> > with existing behaviour.
> >
> > Plus (again) this way we also avoid burdening non-stats-users with the
> > overhead of stats.
>
> Even if we do that, we have the problem of a hotplugged device. Assume a
> cgroup is created and stats are enabled; now a new device shows up and
> some task in the group does IO on that device. Now we need to create the
> percpu data area for that cgroup-device pair dynamically in the IO path,
> and we are back to the same problem.
Why do the allocation during I/O? Can't it be done in the hotplug handler?
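Only a sketch, and everything here except struct request_queue and
list_for_each_entry() is a made-up name (blkio_queue_added,
blkio_cgroup_list, stats_enabled, blkio_alloc_group_stats):

	/*
	 * Sketch: when a new queue/device appears, walk the cgroups which
	 * have stats enabled and do the percpu allocations here, in
	 * process context, rather than in the I/O submission path.
	 */
	static int blkio_queue_added(struct request_queue *q)
	{
		struct blkio_cgroup *blkcg;
		int ret;

		list_for_each_entry(blkcg, &blkio_cgroup_list, list) {
			if (!blkcg->stats_enabled)
				continue;
			/* GFP_KERNEL allocation is fine in this context */
			ret = blkio_alloc_group_stats(blkcg, q);
			if (ret)
				return ret;
		}
		return 0;
	}

The I/O path then only ever looks at ->stats_cpu and skips the
accounting when it is NULL.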
Please, don't expend brain cells thinking of reasons why this all can't
be done. Instead expend them on finding a way in which it *can* be
done!
Maybe doing all this cgroups stuff within ->elevator_set_req_fn() was
the wrong design decision.