linux-kernel - Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)

Message-ID: <AANLkTik-oMNJdakwsH=22LzgYvgd35aom+Zu2WkchUte@mail.gmail.com>
Date:	Tue, 21 Sep 2010 13:48:32 +0200
From:	Stephane Eranian <eranian@...gle.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	linux-kernel@...r.kernel.org, mingo@...e.hu, paulus@...ba.org,
	davem@...emloft.net, fweisbec@...il.com,
	perfmon2-devel@...ts.sf.net, eranian@...il.com,
	robert.richter@....com, acme@...hat.com,
	Paul Menage <menage@...gle.com>,
	Li Zefan <lizf@...fujitsu.com>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>
Subject: Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup
 monitoring (v3)

Peter,

On Tue, Sep 21, 2010 at 11:43 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Tue, 2010-09-21 at 11:38 +0200, Peter Zijlstra wrote:
>> On Thu, 2010-09-09 at 15:05 +0200, Stephane Eranian wrote:
>> > The cgroup to monitor is designated by passing a file descriptor opened
>> > on a new per-cgroup file in the cgroup filesystem (perf_event.perf). The
>> > option must be activated by setting perf_event_attr.cgroup=1 and passing
>> > a valid file descriptor in perf_event_attr.cgroup_fd. Those are the only
>> > two ABI extensions.
>>
>> > +++ b/include/linux/perf_event.h
>> > @@ -215,8 +215,9 @@ struct perf_event_attr {
>> >                                  */
>> >                                 precise_ip     :  2, /* skid constraint       */
>> >                                 mmap_data      :  1, /* non-exec mmap data    */
>> > +                               cgroup         :  1, /* cgroup aggregation    */
>> >
>> > -                               __reserved_1   : 46;
>> > +                               __reserved_1   : 45;
>> >
>> >         union {
>> >                 __u32           wakeup_events;    /* wakeup every n events */
>> > @@ -226,6 +227,8 @@ struct perf_event_attr {
>> >         __u32                   bp_type;
>> >         __u64                   bp_addr;
>> >         __u64                   bp_len;
>> > +
>> > +       int                     cgroup_fd;
>> >  };
>> >
>> >  /*
>>
>> I'm not sure I like this much.. so we attach to {pid,cpu}, for nodes we
>> can use cpu_to_node(cpu), which would suggest to use
>> cgroup_of_task(pid), except that a task can be part of multiple cgroups,
>> so it's not unique.
>>
>> One thing we could do is pass this cgroup identifier in the pid field
>> and use PERF_FLAG_CGROUP or something. Currently the syscall signature
>> uses pid_t, but I think we can safely change that to int.
>>
>> You create a special new file in the cgroup stuff, I'm not sure about
>> that either, but it's not something I feel too strongly about, why
>> wouldn't a fd of any file or even directory of that cgroup work? Do the
>> cgroup people have an opinion?
>
> Ahh, I just read more of the patch, and you create a full perf cgroup,
> in which case cgroup_of_task(pid) will work, simply pick the perf
> cgroup's tasks.
>
> No need to actually create that file, open it and pass fds around, just
> pick a task from that cgroup and attach to the cgroup through that.
>
If I understand correctly, you are proposing that we use the pid argument
of the syscall to designate the cgroup. Within the perf cgroup hierarchy,
a task belongs to only one cgroup at a time, so a pid is enough to identify
a cgroup. That removes the need for an entry in the cgroup filesystem and
therefore for cgroup_fd in perf_event_attr. We would still need a flag
somewhere to indicate that we want per-cpu per-cgroup mode rather than
per-thread mode. It could be a new field in the bitfield in
perf_event_attr. In fact, I already have such a field.

The main issue I see with this is that it relies on having at least one
task in the cgroup when you start the measurement. That is certainly
not always the case.
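
To make the empty-cgroup concern concrete, here is a minimal sketch in C.
It assumes the cgroup filesystem layout where each cgroup directory exposes
a `tasks` file listing one pid per line. The helper name
`first_task_in_cgroup` and any path passed to it are hypothetical and not
part of the patch. A pid-based interface would need some task from the
target cgroup in order to name it, and the helper returns -1 when no such
task can be found.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not part of the patch: return the first pid listed
 * in a cgroup "tasks"-style file (assumed format: one pid per line), or -1
 * if the file cannot be opened or lists no task at all.
 */
static pid_t first_task_in_cgroup(const char *tasks_path)
{
	FILE *f = fopen(tasks_path, "r");
	long pid;
	int n;

	if (!f)
		return -1;

	/* Only the first entry matters: any member task names the cgroup. */
	n = fscanf(f, "%ld", &pid);
	fclose(f);

	/* fscanf() returns EOF on an empty file, so an empty cgroup gives -1. */
	return (n == 1) ? (pid_t)pid : -1;
}
```

If a tool sets up monitoring before any workload has been moved into the
cgroup, a helper like this would return -1 and there would be no pid to pass
to the syscall. A file descriptor opened in the cgroup's directory does not
have that dependency, since the entry should exist as soon as the cgroup
does.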