Message-ID: <CALcN6mgbT2j_sb4L-SXt9KNvZUk6pE9hRoHtPn_quVUEaixxXQ@mail.gmail.com>
Date: Wed, 31 May 2017 14:33:02 -0700
From: David Carrillo-Cisneros <davidcc@...gle.com>
To: Alexey Budankov <alexey.budankov@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Andi Kleen <ak@...ux.intel.com>,
Kan Liang <kan.liang@...el.com>,
Dmitri Prokhorov <Dmitry.Prohorov@...el.com>,
Valery Cherepennikov <valery.cherepennikov@...el.com>,
Stephane Eranian <eranian@...gle.com>,
Mark Rutland <mark.rutland@....com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Arun Kalyanasundaram <arunkaly@...gle.com>
Subject: Re: [PATCH v2]: perf/core: addressing 4x slowdown during per-process
 profiling of STREAM benchmark on Intel Xeon Phi
On Sat, May 27, 2017 at 4:19 AM, Alexey Budankov
<alexey.budankov@...ux.intel.com> wrote:
> Motivation:
>
> The issue manifests as a 4x slowdown when profiling a single-threaded
> STREAM benchmark on Intel Xeon Phi running RHEL7.2 (Intel MPSS
> distribution). Perf profiling is done in per-process mode and involves
> about 30 core events. When the benchmark is OpenMP-based and runs under
> profiling in 272 threads, the overhead becomes even more dramatic:
> 512.144s against 36.192s (with this patch).
How long does it take without any perf monitoring? Could you provide
more details about the benchmark? How many CPUs are being monitored?
SNIP
> different from the one executing the handler. Additionally, for every
> filtered-out group, group_sched_out() updates tstamp values to the
> current interrupt time. This updating work is now done only once, by
> update_context_time(), called from ctx_sched_out() before the cpu
> groups iteration.
I don't see this. E.g., in your patch task_ctx_sched_out() calls
ctx_sched_out() with mux == 0, and that path does exactly the same
thing as before your patch.
I understand why you want to move the event's times to a separate
structure and keep a pointer in the event, but I don't see where you
avoid updating the times of unscheduled events.
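To make that concrete, here is a small userspace model of the path I
mean (not kernel code; the names mirror the patch and the behavior is
my reading of it): with mux == 0 the per-event tstamp update still
runs for every event in the context, exactly as before.

#include <stdio.h>

struct event { int cpu; unsigned long long tstamp_stopped; };

/* toy stand-in for ctx_sched_out(): mux gates the rotation-only path */
static void ctx_sched_out(struct event *evts, int n,
			  unsigned long long now, int mux)
{
	int i;

	for (i = 0; i < n; i++) {
		if (mux && evts[i].cpu != 0 /* "current" cpu */)
			continue;	/* rotation path can skip other cpus */
		evts[i].tstamp_stopped = now;	/* old per-event update */
	}
}

int main(void)
{
	struct event evts[3] = { { .cpu = 0 }, { .cpu = 1 }, { .cpu = 2 } };
	int i;

	/* task_ctx_sched_out() passes mux == 0, so every event is touched */
	ctx_sched_out(evts, 3, 100, 0);
	for (i = 0; i < 3; i++)
		printf("event on cpu %d: stopped = %llu\n",
		       evts[i].cpu, evts[i].tstamp_stopped);
	return 0;
}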
>
> static u64 perf_event_time(struct perf_event *event)
> @@ -1424,16 +1428,16 @@ static void update_event_times(struct perf_event *event)
>  	else if (ctx->is_active)
>  		run_end = ctx->time;
>  	else
> -		run_end = event->tstamp_stopped;
> +		run_end = event->tstamp->stopped;
>
> -	event->total_time_enabled = run_end - event->tstamp_enabled;
> +	event->total_time_enabled = run_end - event->tstamp->enabled;
>
>  	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		run_end = event->tstamp_stopped;
> +		run_end = event->tstamp->stopped;
>  	else
>  		run_end = perf_event_time(event);
>
> -	event->total_time_running = run_end - event->tstamp_running;
> +	event->total_time_running = run_end - event->tstamp->running;
FWICT, this is run for ALL events in the context with a matching CPU.
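In other words (toy model again, same caveats as above): even though
update_context_time() advances ctx->time only once per sched-out, the
enabled/running totals are still recomputed per event, so the work
stays O(nr_events):

#include <stdio.h>

struct ctx { unsigned long long time; };
struct event {
	int cpu;
	unsigned long long tstamp_enabled, total_time_enabled;
};

/* mirrors the quoted hunk: for an active ctx, run_end = ctx->time */
static void update_event_times(struct ctx *ctx, struct event *e)
{
	e->total_time_enabled = ctx->time - e->tstamp_enabled;
}

int main(void)
{
	struct ctx ctx = { .time = 100 };	/* updated once per sched-out */
	struct event evts[4] = {
		{ .cpu = 0 }, { .cpu = 0 }, { .cpu = 1 }, { .cpu = -1 },
	};
	int i, touched = 0;

	for (i = 0; i < 4; i++) {
		/* cpu filter: cpu == -1 means "any cpu" */
		if (evts[i].cpu != 0 && evts[i].cpu != -1)
			continue;
		update_event_times(&ctx, &evts[i]);	/* per-event work */
		touched++;
	}
	printf("events updated: %d of 4\n", touched);
	return 0;
}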
SNIP
> }
> @@ -3051,9 +3277,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
>   * Called with IRQs disabled
>   */
>  static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
> -			      enum event_type_t event_type)
> +			      enum event_type_t event_type, int mux)
>  {
> -	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
> +	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type, mux);
>  }
>
>  static void
> @@ -3061,29 +3287,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		    struct perf_cpu_context *cpuctx)
>  {
>  	struct perf_event *event;
> -
> -	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
> -		if (event->state <= PERF_EVENT_STATE_OFF)
> -			continue;
> -		if (!event_filter_match(event))
> -			continue;
Could we remove or simplify the tests in event_filter_match(), since
the rb-tree already filters by cpu?
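For reference, event_filter_match() in current mainline is, if I
remember right (worth double-checking):

static inline int
event_filter_match(struct perf_event *event)
{
	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
	       perf_cgroup_match(event) && pmu_filter_match(event);
}

If the rb-tree walk already hands us only events for this cpu (plus the
cpu == -1 ones), something like the below might be enough on this path
(untested sketch with a made-up name, just to illustrate the question):

static inline int
event_filter_match_cpu_prefiltered(struct perf_event *event)
{
	/* cpu check done by the rb-tree lookup; only cgroup/pmu remain */
	return perf_cgroup_match(event) && pmu_filter_match(event);
}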
> -
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
> -		if (group_can_go_on(event, cpuctx, 1))
> -			group_sched_in(event, cpuctx, ctx);
> -
> -		/*
> -		 * If this pinned group hasn't been scheduled,
> -		 * put it in error state.
> -		 */
> -		if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -			update_group_times(event);
> -			event->state = PERF_EVENT_STATE_ERROR;
> -		}
> -	}
> +	list_for_each_entry(event, &ctx->pinned_groups, group_entry)
> +		ctx_sched_in_pinned_group(ctx, cpuctx, event);
>  }
>
> static void
> @@ -3092,37 +3297,19 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>  {
>  	struct perf_event *event;
>  	int can_add_hw = 1;
> -
> -	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
> -		/* Ignore events in OFF or ERROR state */
> -		if (event->state <= PERF_EVENT_STATE_OFF)
> -			continue;
> -		/*
> -		 * Listen to the 'cpu' scheduling filter constraint
> -		 * of events:
> -		 */
> -		if (!event_filter_match(event))
> -			continue;
Same as before.