Date:   Wed, 14 Jun 2017 14:27:10 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     David Carrillo-Cisneros <davidcc@...gle.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Kan Liang <kan.liang@...el.com>,
        Dmitri Prokhorov <Dmitry.Prohorov@...el.com>,
        Valery Cherepennikov <valery.cherepennikov@...el.com>,
        Stephane Eranian <eranian@...gle.com>,
        Mark Rutland <mark.rutland@....com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Arun Kalyanasundaram <arunkaly@...gle.com>
Subject: Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process
 profiling of STREAM benchmark on Intel Xeon Phi

On 01.06.2017 0:33, David Carrillo-Cisneros wrote:
> On Sat, May 27, 2017 at 4:19 AM, Alexey Budankov
> <alexey.budankov@...ux.intel.com> wrote:
>> Motivation:
>>
>> The issue manifests as a 4x slowdown when profiling a single-threaded STREAM
>> benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS distribution).
>> Perf profiling is done in per-process mode and involves about 30 core
>> events. When the benchmark is OpenMP based and runs under profiling with
>> 272 threads, the overhead becomes even more dramatic: 512.144s against
>> 36.192s (with this patch).
> 
> How long does it take without any perf monitoring? Could you provide
> more details about the benchmark? How many CPUs are being monitored?

STREAM runs for about 10 seconds without perf monitoring on Intel KNL with 272 
threads (272 CPUs are monitored). More details on the benchmark are here: 
https://www.cs.virginia.edu/stream/

> 
> SNIP
>> different from the one executing the handler. Additionally, for every
>> filtered-out group, group_sched_out() updates tstamp values to the current
>> interrupt time. This updating work is now done only once, by
>> update_context_time(), called by ctx_sched_out() before the cpu groups
>> iteration.
> 
> I don't see this. e.g.:

It is done here:

@@ -1383,6 +1384,9 @@ static void update_context_time(struct perf_event_context *ctx)

 	ctx->time += now - ctx->timestamp;
 	ctx->timestamp = now;
+
+	ctx->tstamp_data.running += ctx->time - ctx->tstamp_data.stopped;
+	ctx->tstamp_data.stopped = ctx->time;
 }

The tstamps of unscheduled events are updated all at once by 
update_context_time(), which is called from ctx_sched_out().
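
To illustrate the idea with a minimal userspace sketch (not the patch code; 
it assumes that events filtered out by the scheduling pass keep their 
event->tstamp pointer aimed at the shared ctx->tstamp_data):

/*
 * Toy model of the tstamp indirection: several "events" share one
 * timestamp structure through a pointer, so a single update of the
 * shared copy is visible through every event at once.
 */
#include <stdio.h>

struct tstamp { unsigned long long enabled, running, stopped; };

struct event {
	struct tstamp own;      /* private copy, used while the event is scheduled */
	struct tstamp *tstamp;  /* -> &own, or -> the shared per-context copy      */
};

int main(void)
{
	struct tstamp ctx_tstamp_data = { 0, 0, 0 };
	struct event e[3];
	int i;

	for (i = 0; i < 3; i++)
		e[i].tstamp = &ctx_tstamp_data;  /* "filtered out": share the ctx copy */

	ctx_tstamp_data.stopped = 42;            /* one update ...                     */

	for (i = 0; i < 3; i++)                  /* ... seen through every event       */
		printf("event %d: stopped=%llu\n", i, e[i].tstamp->stopped);

	return 0;
}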

> In your patch, task_ctx_sched_out() calls ctx_sched_out() with mux == 0,
> and that path does the exact same thing as before your patch.
>
> I understand why you want to move the event's times to a different
> structure and keep a pointer in the event, but I don't see that you
> are avoiding the update of the times of unscheduled events.
> 
>>
>>   static u64 perf_event_time(struct perf_event *event)
>> @@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event
>> *event)
>>          else if (ctx->is_active)
>>                  run_end = ctx->time;
>>          else
>> -               run_end = event->tstamp_stopped;
>> +               run_end = event->tstamp->stopped;
>>
>> -       event->total_time_enabled = run_end - event->tstamp_enabled;
>> +       event->total_time_enabled = run_end - event->tstamp->enabled;
>>
>>          if (event->state == PERF_EVENT_STATE_INACTIVE)
>> -               run_end = event->tstamp_stopped;
>> +               run_end = event->tstamp->stopped;
>>          else
>>                  run_end = perf_event_time(event);
>>
>> -       event->total_time_running = run_end - event->tstamp_running;
>> +       event->total_time_running = run_end - event->tstamp->running;
> 
> FWICT, this is run for ALL events in context with matching CPU.
> 
> 
> SNIP
>>   }
>> @@ -3051,9 +3277,9 @@ void __perf_event_task_sched_out(struct task_struct
>> *task,
>>    * Called with IRQs disabled
>>    */
>>   static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
>> -                             enum event_type_t event_type)
>> +                             enum event_type_t event_type, int mux)
>>   {
>> -       ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
>> +       ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
>>   }
>>
>>   static void
>> @@ -3061,29 +3287,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>>                      struct perf_cpu_context *cpuctx)
>>   {
>>          struct perf_event *event;
>> -
>> -       list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
>> -               if (event->state <= PERF_EVENT_STATE_OFF)
>> -                       continue;
>> -               if (!event_filter_match(event))
>> -                       continue;
> 
> Could we remove or simplify the tests in event_filter_match since the
> rb-tree filters by cpu?

Why do you want to remove or simplify event_filter_match()?
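
For reference, event_filter_match() in current kernels is roughly:

static inline int
event_filter_match(struct perf_event *event)
{
	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
	       perf_cgroup_match(event) && pmu_filter_match(event);
}

Even if the rb-tree already keys groups by cpu, the cgroup and PMU checks 
would still have to be performed somewhere.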

> 
>> -
>> -               /* may need to reset tstamp_enabled */
>> -               if (is_cgroup_event(event))
>> -                       perf_cgroup_mark_enabled(event, ctx);
>> -
>> -               if (group_can_go_on(event, cpuctx, 1))
>> -                       group_sched_in(event, cpuctx, ctx);
>> -
>> -               /*
>> -                * If this pinned group hasn't been scheduled,
>> -                * put it in error state.
>> -                */
>> -               if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> -                       update_group_times(event);
>> -                       event->state = PERF_EVENT_STATE_ERROR;
>> -               }
>> -       }
>> +       list_for_each_entry(event, &ctx->pinned_groups, group_entry)
>> +               ctx_sched_in_pinned_group(ctx, cpuctx, event);
>>   }
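
For reference, ctx_sched_in_pinned_group() is essentially the old per-event
loop body hoisted into a helper (a sketch, assuming the body is otherwise
unchanged):

static void
ctx_sched_in_pinned_group(struct perf_event_context *ctx,
			  struct perf_cpu_context *cpuctx,
			  struct perf_event *event)
{
	/* sketch: the old loop body, applied to a single event */
	if (event->state <= PERF_EVENT_STATE_OFF)
		return;
	if (!event_filter_match(event))
		return;

	/* may need to reset tstamp_enabled */
	if (is_cgroup_event(event))
		perf_cgroup_mark_enabled(event, ctx);

	if (group_can_go_on(event, cpuctx, 1))
		group_sched_in(event, cpuctx, ctx);

	/* if this pinned group hasn't been scheduled, put it in error state */
	if (event->state == PERF_EVENT_STATE_INACTIVE) {
		update_group_times(event);
		event->state = PERF_EVENT_STATE_ERROR;
	}
}
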
>>
>>   static void
>> @@ -3092,37 +3297,19 @@ ctx_flexible_sched_in(struct perf_event_context
>> *ctx,
>>   {
>>          struct perf_event *event;
>>          int can_add_hw = 1;
>> -
>> -       list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>> -               /* Ignore events in OFF or ERROR state */
>> -               if (event->state <= PERF_EVENT_STATE_OFF)
>> -                       continue;
>> -               /*
>> -                * Listen to the 'cpu' scheduling filter constraint
>> -                * of events:
>> -                */
>> -               if (!event_filter_match(event))
>> -                       continue;
> Same as before.
> 
