[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200123100318.GP2935@paulmck-ThinkPad-P72>
Date: Thu, 23 Jan 2020 02:03:18 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Masami Hiramatsu <mhiramat@...nel.org>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Brendan Gregg <brendan.d.gregg@...il.com>,
Alexei Starovoitov <ast@...nel.org>,
Ingo Molnar <mingo@...nel.org>, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org,
Daniel Borkmann <daniel@...earbox.net>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
"David S . Miller" <davem@...emloft.net>, joel@...lfernandes.org,
"Naveen N . Rao" <naveen.n.rao@...ux.ibm.com>,
Anil S Keshavamurthy <anil.s.keshavamurthy@...el.com>
Subject: Re: [RFT PATCH 04/13] kprobes: Make optimizer delay to 1 second
On Thu, Jan 23, 2020 at 03:13:28PM +0900, Masami Hiramatsu wrote:
> On Wed, 22 Jan 2020 18:26:47 -0800
> "Paul E. McKenney" <paulmck@...nel.org> wrote:
>
> > > Anyway, without this update, I added a printk to count optimizer
> > > queue-length and found that all optimizer call with a single kprobe
> > > on the quenes. I think this depends on the machine, but as far as
> > > I tested on 8-threads qemu x86, shows this result...
> > >
> > > Probes: 256 kprobe_events
> > > Enable events
> > > real 0m 0.02s
> > > user 0m 0.00s
> > > sys 0m 0.02s
> > > [ 17.730548] Queue-update: 180, skipped, Total Queued: 180
> > > [ 17.739445] Queue-update: 1, go, Total Queued: 180
> > > Disable events
> [...]
> > > [ 41.135594] Queue-update: 1, go, Total Queued: 1
> > > real 0m 21.40s
> > > user 0m 0.00s
> > > sys 0m 0.04s
> >
> > So 21.4s/256 = 84 milliseconds per event disable, correct?
>
> Actually, it seems only 172 probes are on the unoptimized list, so
> the number will be a bit different.
>
> Anyway, that above elapsed time is including non-batch optimizer
> working time as below.
>
> (1) start looping on probe events
> (2) disabling-kprobe
> (2.1) wait kprobe_mutex if optimizer is running
> (2.2) if the kprobe is on optimized kprobe, queue it to unoptimizing
> list and kick optimizer with 5 jiffies delay
> (4) unlink enabled event
> (5) wait synchronize_rcu()
> (6) optimizer start optimization before finishing (5)
> (7) goto (1)
>
> I think the disabling performance issue came from (6) (and (2.1)).
> Thus, if we change (2.2) to 1 HZ jiffies, the optimizer will start
> after some loops are done. (and the optimizer detects "active"
> queuing, postpone the process)
>
> >
> > It might be worth trying synchronize_rcu_expedited() just as an experiment
> > to see if it speeds things up significantly.
>
> Would you mean replacing synchronize_rcu() in disabling loop, or
> replacing synchronize_rcu_tasks() in optimizer?
The former. (There is no synchronize_rcu_tasks_expedited().)
> I think that is not a root cause of this behavior, since if we
> make the optimizer delay to 1 sec, it seems enough for making
> it a batch operation. See below, this is the result with patched
> kernel (1 HZ delay).
>
> Probes: 256 kprobe_events
> Enable events
> real 0m 0.07s
> user 0m 0.00s
> sys 0m 0.07s
> [ 19.191181] Queue-update: 180, skipped, Total Queued: 180
> Disable events
> [ 20.214966] Queue-update: 1, go, Total Queued: 172
> [ 21.302924] Queue-update: 86, skipped, Total Queued: 86
> real 0m 2.11s
> user 0m 0.00s
> sys 0m 0.03s
> [ 22.327173] Queue-update: 87, skipped, Total Queued: 172
> [ 23.350933] Queue-update: 1, go, Total Queued: 172
> Remove events
> real 0m 2.13s
> user 0m 0.02s
> sys 0m 0.02s
>
> As you can see, the optimizer ran outside of the disabling loop.
> In that case, it is OK to synchronize RCU tasks in the optimizer
> because it just runs *once* per multiple probe events.
Even better!
> >From above result, 86 probes are disabled per 1 sec delay.
> Each probe disabling took 11-12 msec in average. So
> (HZ / 10) can also be good. (But note that the optimizer
> will retry to run each time until the disabling loop is
> finished.)
>
> BTW, testing kernel was build with HZ=1000, if HZ=250 or HZ=100,
> the result will be different in the current code. So I think
> we should use HZ-based delay instead of fixed number.
Agreed, the HZ-based delay seems worth a try.
Thanx, Paul
Powered by blists - more mailing lists