Open Source and information security mailing list archives
 
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F0770589D8AB@SHSMSX103.ccr.corp.intel.com>
Date:	Thu, 25 Feb 2016 15:38:00 +0000
From:	"Liang, Kan" <kan.liang@...el.com>
To:	Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
	"peterz@...radead.org" <peterz@...radead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:	"ak@...ux.intel.com" <ak@...ux.intel.com>,
	"eranian@...gle.com" <eranian@...gle.com>,
	"vincent.weaver@...ne.edu" <vincent.weaver@...ne.edu>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...nel.org" <mingo@...nel.org>,
	"acme@...hat.com" <acme@...hat.com>,
	"jolsa@...hat.com" <jolsa@...hat.com>,
	"ying.huang@...ux.intel.com" <ying.huang@...ux.intel.com>
Subject: RE: [PATCH 1/1] perf/core: find auxiliary events in running pmus list



> kan.liang@...el.com writes:
> 
> > From: Kan Liang <kan.liang@...el.com>
> >
> > The perf_event_aux function goes through the pmus list to find the
> > proper auxiliary events to output. The pmus list consists of all
> > possible pmus in the system, which may or may not be running at the
> > moment, while the auxiliary events must come from running pmus.
> > Searching non-running pmus is therefore unnecessary and expensive,
> > especially when there are many non-running pmus on the list.
> >
> > For example, the brk test case in lkp triggers many mmap operations
> > while perf with cycles:pp is also running on the system. As a
> > result, many perf_event_aux calls are made, and each one searches
> > the whole pmus list. If we enable uncore support (even when uncore
> > events are not actually used), dozens of uncore pmus are added to
> > the pmus list, which can significantly decrease brk_test's
> > ops_per_sec. Based on our test, ops_per_sec without the uncore patch
> > is 2647573, while ops_per_sec with the uncore patch is only 1768444,
> > a 33.2% reduction.
> 
> What does this ops_per_sec measure, exactly? Just curious.

The brk_test just does the brk system call blindly in a loop. So I think it
basically measures the whole brk system call path.

> You'll probably also observe the same effect if you simply create a bunch of
> disabled events before you measure the time that it takes
> perf_event_aux() to notify everybody. Even worse, because you can have
> way more events than pmus. Question is, is this really a problem.
>

Yes, I can observe the same effect if I create a bunch of disabled events.
But the problem here is that the user did nothing except upgrade the
kernel, yet they still suffer the performance regression.

Here is more performance data on this issue.
On a kernel without uncore enabled, perf_event_aux uses 16.84% of CPU
cycles when we run brk_test and perf with cycles:pp.
On a kernel with uncore enabled, perf_event_aux uses 46.37% of CPU
cycles when we run the same brk_test and measure the same cycles:pp event.

As we can see, the user's behavior doesn't change; the only difference is
that many uncore PMUs are registered in the system. Even though no one is
using the uncore PMUs at the moment, they still impact performance. I
think that really is a problem, especially considering that more and more
PMUs will be added in the future.

> > This patch introduces a running_pmus list which tracks only the
> > running pmus in the system. perf_event_aux then uses the
> > running_pmus list instead of the pmus list to find auxiliary events.
> 
> This patch also adds a global mutex that serializes *all* event
> creation/freeing. Including the fork and exit paths.

I think we have to maintain a global mutex for running_pmus.
There will be some impact, but it should be limited: it only affects
event creation/freeing.
The running_pmus list is usually small, and most events come from the
same PMU, so the time spent in the critical section should not be big.


Thanks,
Kan
> 
> I mean:
> 
> > @@ -7740,6 +7770,29 @@ static void account_event_cpu(struct perf_event *event, int cpu)
> >  		atomic_inc(&per_cpu(perf_cgroup_events, cpu));
> >  }
> >
> > +static void account_running_pmu(struct perf_event *event)
> > +{
> > +	struct running_pmu *pmu;
> > +
> > +	mutex_lock(&running_pmus_lock);
> > +
> > +	list_for_each_entry(pmu, &running_pmus, entry) {
> > +		if (pmu->pmu == event->pmu) {
> > +			pmu->nr_event++;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	pmu = kzalloc(sizeof(struct running_pmu), GFP_KERNEL);
> > +	if (pmu != NULL) {
> > +		pmu->nr_event++;
> > +		pmu->pmu = event->pmu;
> > +		list_add_rcu(&pmu->entry, &running_pmus);
> > +	}
> > +out:
> > +	mutex_unlock(&running_pmus_lock);
> > +}
> > +
> >  static void account_event(struct perf_event *event)
> >  {
> >  	bool inc = false;
> > @@ -7772,6 +7825,8 @@ static void account_event(struct perf_event *event)
> >  		static_key_slow_inc(&perf_sched_events.key);
> >
> >  	account_event_cpu(event, event->cpu);
> > +
> > +	account_running_pmu(event);
> 
> doesn't look justified.
> 
> Regards,
> --
> Alex
