Message-ID: <20180706172428.30aeef91@gandalf.local.home>
Date:   Fri, 6 Jul 2018 17:24:28 -0400
From:   Steven Rostedt <rostedt@...dmis.org>
To:     Claudio <claudio.fontana@...wa.com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org
Subject: Re: ftrace performance (sched events): cyclictest shows 25% more
 latency

On Fri, 6 Jul 2018 08:22:01 +0200
Claudio <claudio.fontana@...wa.com> wrote:

> Hello all,
> 
> I have been experimenting with the idea of leaving ftrace enabled, with sched events,
> on production systems.
> 
> The main concern that I have at the moment is the impact on the system.
> Enabling the sched events that I currently need for the tracing application
> seems to slow down context switches considerably and make the system less responsive.
> 
> I have tested with cyclictest on the mainline kernel, and noticed an increase of around 25% in the min and avg latencies.
> 
> Is this expected?

No it is not. And this is weird. I'm seeing it too.
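
Just to make sure we are looking at the same thing: I am assuming you
enable the events through tracefs in the usual way, something like
this (the path may be /sys/kernel/debug/tracing on older setups):

  # enable all sched events
  echo 1 > /sys/kernel/tracing/events/sched/enable

  # or only the events the application needs, e.g. sched_switch
  echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable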

> 
> Some initial investigation into ftrace seems to point at the reservation and commit of the
> events into the ring buffer as the highest sources of overhead, while copying the event
> parameters, including COMM, does not seem to have any noticeable effect relative to those costs.

Did you use perf to see where the issues were? If so what perf command
did you use?

When running just:
  perf record ./cyclictest --smp -p95 -m -s -N -l 1000000 -q

I get this:

   6.08%  cyclictest  [kernel.vmlinux]    [k] syscall_return_via_sysret
   4.03%  cyclictest  cyclictest          [.] timerthread
   3.97%  cyclictest  [kernel.vmlinux]    [k] __schedule
   3.81%  cyclictest  [kernel.vmlinux]    [k] native_write_msr
   3.13%  cyclictest  [kernel.vmlinux]    [k] trace_save_cmdline
   2.71%  cyclictest  [kernel.vmlinux]    [k] do_nanosleep
   2.48%  cyclictest  [kernel.vmlinux]    [k] do_syscall_64
   2.11%  cyclictest  [kernel.vmlinux]    [k] pick_next_task_fair
   2.03%  cyclictest  [kernel.vmlinux]    [k] native_sched_clock
   1.99%  cyclictest  [kernel.vmlinux]    [k] cpuacct_charge
   1.93%  cyclictest  [vdso]              [.] __vdso_clock_gettime
   1.86%  cyclictest  [kernel.vmlinux]    [k] hrtimer_nanosleep
   1.84%  cyclictest  [kernel.vmlinux]    [k] cpupri_set
   1.66%  cyclictest  [kernel.vmlinux]    [k] update_curr_rt
   1.65%  cyclictest  [kernel.vmlinux]    [k] __perf_event_task_sched_out
   1.59%  cyclictest  [kernel.vmlinux]    [k] dequeue_rt_stack
   1.55%  cyclictest  [kernel.vmlinux]    [k] __rb_reserve_next

Which shows there's a lot being done before we even get to the ring
buffer reserve.
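
To dig further into those paths, something like this should work; the
-g is just to add call graphs, and the report options are only my
habit, nothing special:

  # same workload, with call-graph recording
  perf record -g ./cyclictest --smp -p95 -m -s -N -l 1000000 -q

  # flat profile sorted by symbol
  perf report --stdio --sort symbol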

One thing that I notice too is that trace_save_cmdline also gets
activated when trace events are enabled. Perhaps I should try to
optimize that more.
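
A quick way to check how much that one costs is to turn off comm
recording entirely; note that you then lose task names in the trace
output and only get pids:

  echo 0 > /sys/kernel/tracing/options/record-cmd

If your cyclictest numbers recover with that off, we know where to
look first.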

> 
> I have been running the following test 20 times, throwing away the first results:
> 
> $ sudo ./cyclictest --smp -p95 -m -s -N -l 100000 -q
> 
> $ uname -a
> Linux claudio-HP-ProBook-470-G5 4.18.0-rc3+ #3 SMP Tue Jul 3 15:50:30 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> For brevity, this is a comparison of one test's results. All other test results show the same ~25% increase.
> 
> On the left side, the run without ftrace sched events, on the right side with ftrace sched events enabled.
> 
> All times below are in nanoseconds (cyclictest was run with -N).
> 
> CPU    Count      Min       Act        Avg        Max        Count      Min-ftrace Act-ftrace Avg-ftrace Max-ftrace
> 0     100000      2339       2936       2841     139478     100000       2900       3182       3566      93056
> 1      66742      2365       3386       2874      93639      66750       2959       3786       3646     154074
> 2      50080      2376       3058       2910     196221      50097       2997       4209       3655      18707
> 3      40076      2394       3461       2931      17914      40091       3006       4417       3750      17159
> 4      33404      2371       3612       2834      15336      33419       2997       3836       3594      23172
> 5      28635      2387       3313       2885      25863      28649       2995       3795       3647       9956
> 6      25058      2384       3428       2968      12162      25071       3051       4366       3719      18151
> 7      22275      2381       2859       2982      10706      22287       3046       5078       3825      10781
> 
> I would be thankful for any advice or comments on this,
> especially with the goal of lowering the runtime impact on the system as much as possible.

But still, we use all sorts of trace events when dealing with real
time, and trace events hardly show up at all in the PREEMPT_RT kernel.
Where cyclictest shows 20 microseconds, sched events hardly touch
that. But I see this weird side effect on a distro kernel too. Maybe
there's something weird going on when we have CONFIG_PREEMPT_VOLUNTARY
set :-/
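
(If you want to check which preemption model your distro kernel was
built with, and the config is installed in the usual place:

  grep PREEMPT /boot/config-$(uname -r)

or use "zgrep PREEMPT /proc/config.gz" if the kernel exposes its
config there instead.)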

I'll investigate further.

Thanks for the report.

-- Steve
