Message-ID: <6599ad830907150918r5ed6f2cevde9dbc9ef304fb2b@mail.gmail.com>
Date: Wed, 15 Jul 2009 09:18:33 -0700
From: Paul Menage <menage@...gle.com>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Benjamin Blum <bblum@...gle.com>,
containers@...ts.linux-foundation.org,
linux-kernel@...r.kernel.org, lizf@...fujitzu.com
Subject: Re: [PATCH 1/2] Adds a read-only "procs" file similar to "tasks" that
shows only unique tgids
On Wed, Jul 15, 2009 at 1:33 AM, Eric W. Biederman<ebiederm@...ssion.com> wrote:
>
> I think guaranteeing a truly atomic snapshot is likely to be a
> horrible idea requiring all kinds of nasty locking,
We don't guarantee a truly atomic snapshot unless you manage to read
the entire file with a single read() call.
> and smp
> scalability issues. So please walk the list of pids and
> just return those that belong to your cgroup.
The downside of that is that scanning any one cgroup takes O(n) in the
number of threads on the machine, so scanning them all becomes O(n^2).
We've definitely seen problems (on older kernels using cpusets, which
did something similar, i.e. walking the tasklist) where we have lots
of small cpusets and a few huge ones, and this blew up the cost of
accessing any of them.
But having said that, the idea of being able to maintain just a cursor
is something that would definitely be nice.
Here's another idea that might work:
Currently, each cgroup has a list running through the attached css_set
objects, and each css_set has a list running through its tasks; we
iterate through these lists of lists to produce the cgroup's task list.
Since this list isn't sorted in any way, there's no convenient way to
save/restore your position between seq_file invocations; this is why
we currently generate a sorted snapshot, so that even if the snapshot
is updated by someone else before our next read, we know where to pick
up from (the next pid above the last one that we returned).
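(For illustration only, since the real code in kernel/cgroup.c uses its
own names: resuming from the sorted snapshot boils down to a binary
search for the first pid strictly greater than the last one we handed
out. Identifiers like pids/npids/last_pid below are made up.)

/*
 * Sketch of "pick up from the next pid above the last one returned",
 * assuming a sorted snapshot pids[0..npids-1].
 */
static int resume_index(pid_t *pids, int npids, pid_t last_pid)
{
	int lo = 0, hi = npids;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (pids[mid] <= last_pid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;	/* == npids if nothing is left to show */
}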
Instead, we could actually store cursor objects in the list itself
whenever we need to pause/resume iterating through a large cgroup (due
to hitting the limits of a single seq_file read, i.e. probably after
every 700 threads).
Then we'd just need to teach cgroup_iter_next() to distinguish between
real tasks and cursors, and skip the latter. A simple way to do that
would be to change the existing declarations in task_struct:
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock */
struct css_set *cgroups;
/* cg_list protected by css_set_lock and tsk->alloc_lock */
struct list_head cg_list;
#endif
and instead define these two fields together as a struct
cgroup_css_list_elem. A cursor can just be a cgroup_css_list_elem
whose cgroups field points to a distinguishing address that identifies
it as a cursor.
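As a rough sketch of how that could look (the names below are invented
for illustration, not proposed kernel identifiers):

struct cgroup_css_list_elem {
	/* for a task: its attached css_set; for a cursor: a sentinel */
	struct css_set *cgroups;
	/* linked into the css_set's task list (or, for a cursor, wherever
	 * the previous read() stopped) */
	struct list_head cg_list;
};

/* any address that can never be a real css_set will do as the marker */
static struct css_set cgroup_iter_cursor_marker;

static inline bool cg_elem_is_cursor(struct cgroup_css_list_elem *elem)
{
	return elem->cgroups == &cgroup_iter_cursor_marker;
}

cgroup_iter_next() would then just step over any element for which
cg_elem_is_cursor() returns true before converting the list entry back
into a task.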
So we're guaranteed to hit every thread that is in the cgroup before we
start and stays there until we finish; there are no guarantees about
threads that move into or out of the cgroup while we're iterating.
It's a bit more complex than just iterating over the machine's entire
pid list, but I think it scales better.
Paul