[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.10.1701171743330.15892@vshiva-Udesk>
Date: Tue, 17 Jan 2017 18:14:50 -0800 (PST)
From: Shivappa Vikas <vikas.shivappa@...el.com>
To: Thomas Gleixner <tglx@...utronix.de>
cc: Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
vikas.shivappa@...el.com, davidcc@...gle.com, eranian@...gle.com,
linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
mingo@...nel.org, peterz@...radead.org, ravi.v.shankar@...el.com,
tony.luck@...el.com, fenghua.yu@...el.com, andi.kleen@...el.com,
h.peter.anvin@...el.com
Subject: Re: [PATCH 05/12] x86/cqm,perf/core: Cgroup support prepare
On Tue, 17 Jan 2017, Thomas Gleixner wrote:
> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>
>> From: David Carrillo-Cisneros <davidcc@...gle.com>
>>
>> cgroup hierarchy monitoring is not supported currently. This patch
>> builds all the necessary datastructures, cgroup APIs like alloc, free
>> etc and necessary quirks for supporting cgroup hierarchy monitoring in
>> later patches.
>>
>> - Introduce a architecture specific data structure arch_info in
>> perf_cgroup to keep track of RMIDs and cgroup hierarchical monitoring.
>> - perf sched_in calls all the cgroup ancestors when a cgroup is
>> scheduled in. This will not work with cqm as we have a common per pkg
>> rmid associated with one task and hence cannot write different RMIds
>> into the MSR for each event. cqm driver enables a flag
>> PERF_EV_CGROUP_NO_RECURSION which indicates the perf to not call all
>> ancestor cgroups for each event and let the driver handle the hierarchy
>> monitoring for cgroup.
>> - Introduce event_terminate as event_destroy is called after cgrp is
>> disassociated from the event to support rmid handling of the cgroup.
>> This helps cqm clean up the cqm specific arch_info.
>> - Add the cgroup APIs for alloc,free,attach and can_attach
>>
>> The above framework will be used to build different cgroup features in
>> later patches.
>
> That's not a framework. It's a hodgepodge of core and x86 specific changes.
>
> I'm not even trying to review it as a whole, simply because such changes
> want to be split into several preparatory changes in the core which provide
> the 'framework' parts and then an actual user in the architecture
> code. I'll give some general feedback whatsoever.
Will split them into multiple patches.
>
>> Tests: Same as before. Cgroup still doesnt work but we did the prep to
>> get it to work
>
> Oh well. What tests are the same as before? This information is just there
> to take room in the changelog, right?
Will fix the logs
>
>> Patch modified/refactored by Vikas Shivappa
>> <vikas.shivappa@...ux.intel.com> to support recycling removal.
>>
>> Signed-off-by: Vikas Shivappa <vikas.shivappa@...ux.intel.com>
>
> So this patch is from David, but where is Davids Signed-off-by?
>
>> ---
>> arch/x86/events/intel/cqm.c | 19 ++++++++++++++++++-
>> arch/x86/include/asm/perf_event.h | 27 +++++++++++++++++++++++++++
>> include/linux/perf_event.h | 32 ++++++++++++++++++++++++++++++++
>> kernel/events/core.c | 28 +++++++++++++++++++++++++++-
>> 4 files changed, 104 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
>> index 68fd1da..a9bd7bd 100644
>> --- a/arch/x86/events/intel/cqm.c
>> +++ b/arch/x86/events/intel/cqm.c
>> @@ -741,7 +741,13 @@ static int intel_cqm_event_init(struct perf_event *event)
>> INIT_LIST_HEAD(&event->hw.cqm_group_entry);
>> INIT_LIST_HEAD(&event->hw.cqm_groups_entry);
>>
>> - event->destroy = intel_cqm_event_destroy;
>> + /*
>> + * CQM driver handles cgroup recursion and since only noe
>> + * RMID can be programmed at the time in each core, then
>> + * it is incompatible with the way generic code handles
>> + * cgroup hierarchies.
>> + */
>> + event->event_caps |= PERF_EV_CAP_CGROUP_NO_RECURSION;
>>
>> mutex_lock(&cache_mutex);
>>
>> @@ -918,6 +924,17 @@ static int intel_cqm_event_init(struct perf_event *event)
>> .read = intel_cqm_event_read,
>> .count = intel_cqm_event_count,
>> };
>> +#ifdef CONFIG_CGROUP_PERF
>> +int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
>> + struct cgroup_subsys_state *new_css)
>> +{}
>> +void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css)
>> +{}
>> +void perf_cgroup_arch_attach(struct cgroup_taskset *tset)
>> +{}
>> +int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset)
>> +{}
>> +#endif
>
> What the heck is this for? It does not even compile because
> perf_cgroup_arch_css_alloc() and perf_cgroup_arch_can_attach() are empty
> functions.
>
> Crap, crap, crap.
>
>> static inline void cqm_pick_event_reader(int cpu)
>> {
>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>> index f353061..f38c7f0 100644
>> --- a/arch/x86/include/asm/perf_event.h
>> +++ b/arch/x86/include/asm/perf_event.h
>> @@ -299,4 +299,31 @@ static inline void perf_check_microcode(void) { }
>>
>> #define arch_perf_out_copy_user copy_from_user_nmi
>>
>> +/*
>> + * Hooks for architecture specific features of perf_event cgroup.
>> + * Currently used by Intel's CQM.
>> + */
>> +#ifdef CONFIG_INTEL_RDT_M
>> +#ifdef CONFIG_CGROUP_PERF
>> +
>> +#define perf_cgroup_arch_css_alloc perf_cgroup_arch_css_alloc
>> +
>> +int perf_cgroup_arch_css_alloc(struct cgroup_subsys_state *parent_css,
>> + struct cgroup_subsys_state *new_css);
>> +
>> +#define perf_cgroup_arch_css_free perf_cgroup_arch_css_free
>> +
>> +void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
>> +
>> +#define perf_cgroup_arch_attach perf_cgroup_arch_attach
>> +
>> +void perf_cgroup_arch_attach(struct cgroup_taskset *tset);
>> +
>> +#define perf_cgroup_arch_can_attach perf_cgroup_arch_can_attach
>> +
>> +int perf_cgroup_arch_can_attach(struct cgroup_taskset *tset);
>> +
>> +#endif
>> +
>> +#endif
>> #endif /* _ASM_X86_PERF_EVENT_H */
>
> How the heck is one supposed to figure out which endif is belonging to
> what? Random new lines are not helping for that.
Will add comments to indicate which ifdef the endif belongs
>
> Aside of that the double ifdef is horrible and this really is not at even
> remotely a framework. It's hardcoded crap to serve that CQM mess. Nothing
> else can ever use it. So don't pretend it to be a 'framework'.
>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index a8f4749..410642a 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -300,6 +300,12 @@ struct pmu {
>> int (*event_init) (struct perf_event *event);
>>
>> /*
>> + * Terminate the event for this PMU. Optional complement for a
>> + * successful event_init. Called before the event fields are tear down.
>> + */
>> + void (*event_terminate) (struct perf_event *event);
>
> And why does this need to be a PMU callback. It's called right before
> perf_cgroup_detach(). Why does it need extra treatment and cannot be done
> from the cgroup muck?
Will remove the terminate and do this in the cgroup api itself.
>
>> +
>> + /*
>> * Notification that the event was mapped or unmapped. Called
>> * in the context of the mapping task.
>> */
>> @@ -516,9 +522,13 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
>> * PERF_EV_CAP_SOFTWARE: Is a software event.
>> * PERF_EV_CAP_READ_ACTIVE_PKG: A CPU event (or cgroup event) that can be read
>> * from any CPU in the package where it is active.
>> + * PERF_EV_CAP_CGROUP_NO_RECURSION: A cgroup event that handles its own
>> + * cgroup scoping. It does not need to be enabled for all of its descendants
>> + * cgroups.
>> */
>> #define PERF_EV_CAP_SOFTWARE BIT(0)
>> #define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1)
>> +#define PERF_EV_CAP_CGROUP_NO_RECURSION BIT(2)
>>
>> #define SWEVENT_HLIST_BITS 8
>> #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
>> @@ -823,6 +833,8 @@ struct perf_cgroup_info {
>> };
>>
>> struct perf_cgroup {
>> + /* Architecture specific information. */
>
> That's a really useful comment.
>
>> + void *arch_info;
>> struct cgroup_subsys_state css;
>> struct perf_cgroup_info __percpu *info;
>> };
>> @@ -844,6 +856,7 @@ struct perf_cgroup {
>>
>> #ifdef CONFIG_PERF_EVENTS
>>
>> +extern int is_cgroup_event(struct perf_event *event);
>> extern void *perf_aux_output_begin(struct perf_output_handle *handle,
>> struct perf_event *event);
>> extern void perf_aux_output_end(struct perf_output_handle *handle,
>> @@ -1387,4 +1400,23 @@ ssize_t perf_event_sysfs_show(struct device *dev, struct device_attribute *attr,
>> #define perf_event_exit_cpu NULL
>> #endif
>>
>> +/*
>> + * Hooks for architecture specific extensions for perf_cgroup.
>
> No. That's not architecture specific. That's CQM specific hackery.
>
>> + */
>> +#ifndef perf_cgroup_arch_css_alloc
>> +#define perf_cgroup_arch_css_alloc(parent_css, new_css) 0
>> +#endif
>
> I really hate this define style. That can be solved nicely with weak
> functions which avoid all this define and ifdeffery mess.
>
>> +#ifndef perf_cgroup_arch_css_free
>> +#define perf_cgroup_arch_css_free(css) do { } while (0)
>> +#endif
>> +
>> +#ifndef perf_cgroup_arch_attach
>> +#define perf_cgroup_arch_attach(tskset) do { } while (0)
>> +#endif
>> +
>> +#ifndef perf_cgroup_arch_can_attach
>> +#define perf_cgroup_arch_can_attach(tskset) 0
>
> This one is exceptionally stupid. Here is the use case:
Will fix and use the weak functions.
>
>> +static int perf_cgroup_can_attach(struct cgroup_taskset *tset)
>> +{
>> + return perf_cgroup_arch_can_attach(tset);
>> }
>>
>> +
>> struct cgroup_subsys perf_event_cgrp_subsys = {
>> .css_alloc = perf_cgroup_css_alloc,
>> .css_free = perf_cgroup_css_free,
>> + .can_attach = perf_cgroup_can_attach,
>> .attach = perf_cgroup_attach,
>> };
>
> So you need a extra function for calling a stub macro if it just can
> be done by assigning the real function (if required) to the callback
> pointer and otherwise leave it NULL.
>
> But all of this is moot because this 'arch framework' is just crap because
> it's in reality a CQM extension of the core code, which is not going to
> happen.
>
>> +#endif
>> +
>> #endif /* _LINUX_PERF_EVENT_H */
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index ab15509..229f611 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -590,6 +590,9 @@ static inline u64 perf_event_clock(struct perf_event *event)
>> if (!cpuctx->cgrp)
>> return false;
>>
>> + if (event->event_caps & PERF_EV_CAP_CGROUP_NO_RECURSION)
>> + return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
>> +
>
> Comments explaining what this does are overrated, right?
Will fix. Add a comment.
perf sched_in calls the cgroup ancestors during a cgroup event sched in. But the
cqm events are package wide having mapped to RMIDs. Since only RMID can be
written to the PQR_ASSOC we need to handle this seperately. The
EV_CAP_NO_RECURSION event_caps indicates to not call all the ancestor groups.
>
>> /*
>> * Cgroup scoping is recursive. An event enabled for a cgroup is
>> * also enabled for all its descendant cgroups. If @cpuctx's
>> @@ -606,7 +609,7 @@ static inline void perf_detach_cgroup(struct perf_event *event)
>> event->cgrp = NULL;
>> }
>>
>> -static inline int is_cgroup_event(struct perf_event *event)
>> +int is_cgroup_event(struct perf_event *event)
>
> So this is made global because there is no actual user outside of the core.
Will fix
>
>> {
>> return event->cgrp != NULL;
>> }
>> @@ -4019,6 +4022,9 @@ static void _free_event(struct perf_event *event)
>> mutex_unlock(&event->mmap_mutex);
>> }
>>
>> + if (event->pmu->event_terminate)
>> + event->pmu->event_terminate(event);
>> +
>> if (is_cgroup_event(event))
>> perf_detach_cgroup(event);
>>
>> @@ -9246,6 +9252,8 @@ static void account_event(struct perf_event *event)
>> exclusive_event_destroy(event);
>>
>> err_pmu:
>> + if (event->pmu->event_terminate)
>> + event->pmu->event_terminate(event);
>> if (event->destroy)
>> event->destroy(event);
>> module_put(pmu->module);
>> @@ -10748,6 +10756,7 @@ static int __init perf_event_sysfs_init(void)
>> perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>> {
>> struct perf_cgroup *jc;
>> + int ret;
>>
>> jc = kzalloc(sizeof(*jc), GFP_KERNEL);
>> if (!jc)
>> @@ -10759,6 +10768,12 @@ static int __init perf_event_sysfs_init(void)
>> return ERR_PTR(-ENOMEM);
>> }
>>
>> + jc->arch_info = NULL;
>
> Never trust kzalloc to zero out a data structure correctly!
>
>> +
>> + ret = perf_cgroup_arch_css_alloc(parent_css, &jc->css);
>> + if (ret)
>> + return ERR_PTR(ret);
>
> And then leak the allocated memory in case of error.
>
> Another wonderful piece of trainwreck engineering.
>
> As I said above: Split this into bits and pieces and provide a proper
> justification for each of the items you add to the core: terminate,
> PERF_EV_CAP_CGROUP_NO_RECURSION.
>
> Then sit down and come up with a solution which allows to make use of the
> cgroup core extensions for more than a single instance of a particular
> piece of x86 perf hardware.
Will fix. Can split the patches (can remove the terminate..) and then add
generic extensions which can be used.
cqm would need a way to maintain cgroup specific information as shown in the
struct below because of the way h/w RMIDs are designed:
-the RMIDs are per package meaning the count(for mbm) and occupancy (for cqm)
cannot be updated per cpu on when a task is scheduled.
-and we cannot track two events in a hierarchy at the same as we cannot write
two RMIDs.
Because of this we maintain the mfa as shown below to keep track of which
ancestor the cgroup(or the task) needs to report data to.
* struct cgrp_cqm_info - perf_event cgroup metadata for cqm
* @mon_enabled Whether monitoring is enabled
* @level Level in the cgroup tree. Root is level 0.
* @rmid The rmids of the cgroup.
* @mfa 'Monitoring for ancestor' points to the cqm_info
* of the ancestor the cgroup is monitoring for. 'Monitoring for ancestor'
* means you will use an ancestors RMID at sched_in if you are
* not monitoring yourself.
*
* Due to the hierarchical nature of cgroups, every cgroup just
* monitors for the 'nearest monitored ancestor' at all times.
* Since root cgroup is always monitored, all descendents
* at boot time monitor for root and hence all mfa points to root except
* for root->mfa which is NULL.
* 1. RMID setup: When cgroup x start monitoring:
* for each descendent y, if y's mfa->level < x->level, then
* y->mfa = x. (Where level of root node = 0...)
* 2. sched_in: During sched_in for x
* if (x->mon_enabled) choose x->rmid
* else choose x->mfa->rmid.
* 3. read: for each descendent of cgroup x
* if (x->monitored) count += rmid_read(x->rmid).
* 4. evt_destroy: for each descendent y of x, if (y->mfa == x) then
* y->mfa = x->mfa. Meaning if any descendent was monitoring for x,
* set that descendent to monitor for the cgroup which x was monitoring for.
*
* @tskmon_rlist List of tasks being monitored in the cgroup
* When a task which belongs to a cgroup x is being monitored, it always uses
* its own task->rmid even if cgroup x is monitored during sched_in.
* To account for the counts of such tasks, cgroup keeps this list
* and parses it during read.
Thanks,
Vikas
>
> And please provide changelogs which explain WHY all of this is necessary,
> not just the WHAT.
>
> Thanks,
>
> tglx
>
>
Powered by blists - more mailing lists