linux-kernel - Re: [PATCH] task containersv11 add tasks file interface fix for cpusets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <6599ad830710061409p2dcaa1c8u8c6864beaaafb149@mail.gmail.com>
Date:	Sat, 6 Oct 2007 14:09:52 -0700
From:	"Paul Menage" <menage@...gle.com>
To:	"Paul Jackson" <pj@....com>
Cc:	"David Rientjes" <rientjes@...gle.com>, akpm@...ux-foundation.org,
	nickpiggin@...oo.com.au, a.p.zijlstra@...llo.nl, serue@...ibm.com,
	clg@...ibm.com, linux-kernel@...r.kernel.org,
	ebiederm@...ssion.com, svaidy@...ux.vnet.ibm.com, xemul@...nvz.org,
	containers@...ts.osdl.org, balbir@...ux.vnet.ibm.com
Subject: Re: [PATCH] task containersv11 add tasks file interface fix for cpusets

On 10/6/07, Paul Jackson <pj@....com> wrote:
> David wrote:
> > It would probably be better to just save references to the tasks.
> >
> >       struct cgroup_iter it;
> >       struct task_struct *p, **tasks;
> >       int i = 0;
> >
> >       cgroup_iter_start(cs->css.cgroup, &it);
> >       while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
> >               get_task_struct(p);
> >               tasks[i++] = p;
> >       }
> >       cgroup_iter_end(cs->css.cgroup, &it);
>
> Hmmm ... guess I'd have to loop over the cgroup twice, once to count
> them (the 'count' field is not claimed to be accurate) and then again,
> after I've kmalloc'd the tasks[] array, filling in the tasks[] array.
>
> On a big cgroup on a big system, this could easily be thousands of
> iteration loops.

But if userspace has to do it, the effect will be far more expensive.

>
> If I need to close the window all the way, completely solving the race
> condition, then I have the code in kernel/cpuset.c:update_nodemask(),
> which builds an mmarray[] using two loops and some retries if newly
> forked tasks are showing up too rapidly at the same time.  The first of
> the two loops is hidden in the cgroup_task_count() call.

In general, the loop inside cgroup_task_count() will only have a
single iteration. (It's iterating across the shared css_set objects,
not across member tasks.

What's wrong with:

  allocate a page of task_struct pointers
again:
  need_repeat = false;
  cgroup_iter_start();
  while (cgroup_iter_next()) {
    if (p->cpus_allowed != new_cpumask) {
      store p;
      if (page is full) {
        need_repeat = true;
        break;
      }
    }
  }
  for each saved task p {
    set_cpus_allowed(p, new_cpumask);
    release p;
  }
  if (need_repeat)
    goto again;

Advantages:

- no vmalloc needed
- just one iteration in the case where a cgroup has fewer than 512 members
- additional iterations only need to deal with tasks that don't have
the right cpu mask
- automatically handles fork races


Another option would be to have a cpuset fork callback that forces
p->cpus_allowed to its cpuset's cpus_allowed if another thread is in
the middle of update_cpumask(). Then you don't need to worry about
fork races at all, since all new threads will get the right cpumask.

> Or, if there is a good reason that must remain a spinlock, then the
> smallest amount of new code, and the easiest code to write, would
> perhaps be adding another cgroup callback, called only by cgroup attach
> () requests back to the same group.  Then code that wants to do
> something odd, such as cpusets, for what seems like a no-op, can do so.

I'd much rather not perpetuate that broken API requirement. The fact
that cpusets wants this odd behaviour is based on a nasty hack.

Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/