linux-kernel - Re: [PATCH v7 0/2] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8d14b630-cbd7-7569-2566-644cd9e3a564@linux.intel.com>
Date:   Wed, 23 Aug 2017 11:54:08 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Kan Liang <kan.liang@...el.com>,
        Dmitri Prokhorov <Dmitry.Prohorov@...el.com>,
        Valery Cherepennikov <valery.cherepennikov@...el.com>,
        Mark Rutland <mark.rutland@....com>,
        Stephane Eranian <eranian@...gle.com>,
        David Carrillo-Cisneros <davidcc@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v7 0/2] perf/core: addressing 4x slowdown during
 per-process profiling of STREAM benchmark on Intel Xeon Phi

On 22.08.2017 23:21, Peter Zijlstra wrote:
> On Fri, Aug 18, 2017 at 08:17:15AM +0300, Alexey Budankov wrote:
>> Hi,
> 
> Please don't post new versions in reply to old versions, that gets them
> lost in thread sorted views.

Accepted. Actually I followed these recommendations:

	https://git-scm.com/docs/git-send-email

and they say it needs to be like this:

[PATCH 0/2] Here is what I did...
  [PATCH 1/2] Clean up and tests
  [PATCH 2/2] Implementation
  [PATCH v2 0/3] Here is a reroll
    [PATCH v2 1/3] Clean up
    [PATCH v2 2/3] New tests
    [PATCH v2 3/3] Implementation

So I made v7 a descendant of v6.

Do you prefer new version be at the same level as the previous 
versions like this?

[PATCH 0/2] Here is what I did...
  [PATCH 1/2] Clean up and tests
  [PATCH 2/2] Implementation
[PATCH v2 0/3] Here is a reroll
  [PATCH v2 1/3] Clean up
  [PATCH v2 2/3] New tests
  [PATCH v2 3/3] Implementation

> 
>> This patch set v7 moves event groups into rb trees and implements 
>> skipping to the current CPU's list on hrtimer interrupt.
> 
> Does this depend on your timekeeping rework posted in that v6 thread?

This v7 includes timekeeping rework so it is complete patch set 
addressing your earlier concerns. The bunch of changes became 
smaller so we are going right way.

I tested v7 thru several nights on Xeon Phi under fuzzer like this:
for ((i=0;i<1000;i=i+1)) do ./perf_fuzzer; done
and there were no crashes or hangs in perf code the machine is still alive.

> If so, I would have expected to see that as part of these patches, if
> not, I'm confused, because part of the problem was that we currently
> need to update times for events we don't want to schedule etc..

Yes, you are right. We need to update times for that events and 
we still do, but on-demand - read() syscall. Doing so we may skip 
slow iterating of the whole bunch of events and get performance boost.

We may skip updating the times every timer interrupt and do it only
on read() call and on thread context switch out when 
update_event_times() is actually called.

> 
>> Events allocated for the same CPU are still kept in a linked list
>> of the event directly attached to the tree because it is unclear 
>> how to implement fast iteration thru events allocated for 
>> the same CPU when they are all attached to a tree employing 
>> additional 64bit index as a secondary treee key.
> 
> Finding the CPU subtree and rb_next() wasn't good?

I implemented the approach you had suggested (as I understood it),
tested it and got results that are drastically different from what 
I am getting for the tree of lists. Specifically I did:

1. keeping all groups in the same single tree by employing a 64-bit index
   additionally to CPU key;

2. implementing special _less() function and rotation by re-inserting
   group with incremented index;

3. replacing API with a callback in the signature by a macro
   perf_event_groups_for_each();

Employing all that shrunk the total patch size, however I am still 
struggling with the correctness issues.

Now I figured that not all indexed events are always located under 
the root with the same cpu, and it depends on the order of insertion
e.g. with insertion order 01,02,03,14,15,16 we get this:

     02
    /  \
   01  14
      /  \
     03  15
           \
           16

and it is unclear how to iterate cpu==0 part of tree in this case.

Iterating cpu specific subtree like this:

#define for_each_group_event(event, group, cpu, pmu, field)	 \
	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
				   typeof(*event), field);	 \
	     event && event->cpu == cpu && event->pmu == pmu;	 \
	     event = rb_entry_safe(rb_next(&event->field),	 \
				   typeof(*event), field))

misses event==03 for the case above and I guess this is where I loose 
samples in my testing.

> 
> 
>