Message-ID: <CANpmjNNwOOYxOXLixrUD25YoszYcy7SRwXMMfrj5zZvrETkp0g@mail.gmail.com>
Date: Fri, 10 Jun 2022 10:25:00 +0200
From: Marco Elver <elver@...gle.com>
To: Dmitry Vyukov <dvyukov@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Frederic Weisbecker <frederic@...nel.org>,
Ingo Molnar <mingo@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...hat.com>,
Namhyung Kim <namhyung@...nel.org>,
linux-perf-users@...r.kernel.org, x86@...nel.org,
linux-sh@...r.kernel.org, kasan-dev@...glegroups.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent
On Thu, 9 Jun 2022 at 17:00, 'Dmitry Vyukov' via kasan-dev
<kasan-dev@...glegroups.com> wrote:
>
> On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@...gle.com> wrote:
> >
> > Running the perf benchmark with (note: more aggressive parameters vs.
> > preceding changes, but same host with 256 CPUs):
> >
> > | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> > | # Running 'breakpoint/thread' benchmark:
> > | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> > | Total time: 1.953 [sec]
> > |
> > | 38.146289 usecs/op
> > | 4882.725000 usecs/op/cpu
> >
> > 16.29% [kernel] [k] rhashtable_jhash2
> > 16.19% [kernel] [k] osq_lock
> > 14.22% [kernel] [k] queued_spin_lock_slowpath
> > 8.58% [kernel] [k] task_bp_pinned
> > 8.30% [kernel] [k] mutex_spin_on_owner
> > 4.03% [kernel] [k] smp_cfm_core_cond
> > 2.97% [kernel] [k] toggle_bp_slot
> > 2.94% [kernel] [k] bcmp
> >
> > We can see that the largest single share of the time is now spent
> > hashing task pointers to index into task_bps_ht in task_bp_pinned().
> >
> > However, if task_bp_pinned()'s computation is independent of any CPU,
> > i.e. always `iter->cpu < 0`, the result for each invocation will be
> > identical. With increasing CPU-count, this problem worsens.
> >
> > Instead, identify if every call to task_bp_pinned() is CPU-independent,
> > and cache the result. Use the cached result instead of a call to
> > task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> > if the cached result can be used.
> >
> > After this optimization:
> >
> > 21.96% [kernel] [k] queued_spin_lock_slowpath
> > 16.39% [kernel] [k] osq_lock
> > 9.82% [kernel] [k] toggle_bp_slot
> > 9.81% [kernel] [k] find_next_bit
> > 4.93% [kernel] [k] mutex_spin_on_owner
> > 4.71% [kernel] [k] smp_cfm_core_cond
> > 4.30% [kernel] [k] __reserve_bp_slot
> > 2.65% [kernel] [k] cpumask_next
> >
> > This shows that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, however, we see no statistically
> > significant improvement in performance on the test system with 256 CPUs.
> > This is very likely due to the benchmark parameters being too aggressive
> > and contention elsewhere becoming dominant.
> >
> > Indeed, when using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > | # Running 'breakpoint/thread' benchmark:
> > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > | Total time: 0.071 [sec]
> > |
> > | 37.134896 usecs/op
> > | 2376.633333 usecs/op/cpu
> >
> > This is an improvement of 12% compared to without this optimization
> > (the baseline is 42 usecs/op). It is now only 5% slower than the
> > theoretical ideal (constraints disabled), and 18% slower than no
> > breakpoints at all.
> >
> > [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> > more consistent with other functions (which have bp first, before the
> > cpu argument). ]
>
> There are 3 main cases:
> 1. Per-cpu bp.
Yes, CPU-target breakpoint on just 1 CPU.
> 2. Per-task and per-cpu bp.
Task-target breakpoint but pinned to 1 CPU.
> 3. Per-task bp (on all cpus)
Task-target breakpoint that can run on all CPUs.
> right?
>
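(In terms of struct perf_event fields, the three cases map roughly to
the following -- my shorthand, not taken from the patch itself:)

	bp->hw.target == NULL && bp->cpu >= 0	/* 1: CPU-target bp */
	bp->hw.target != NULL && bp->cpu >= 0	/* 2: task bp, pinned to one CPU */
	bp->hw.target != NULL && bp->cpu < 0	/* 3: task bp, any CPU */
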
> For case 1 we still seem to do lots of unnecessary work in
> fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
> only the CPU's cpu_pinned, so that's the only CPU we need to
> fetch/check.
It'll just do 1 iteration, because cpumask_of_bp() will return a mask
with just the event's target CPU in it.
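For reference, cpumask_of_bp() is roughly the following (from
kernel/events/hw_breakpoint.c; quoted from memory, so double-check):

	static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
	{
		if (bp->cpu >= 0)
			return cpumask_of(bp->cpu);
		return cpu_possible_mask;
	}

So for cases 1 and 2 the for_each_cpu() loop in fetch_bp_busy_slots()
visits exactly one CPU.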
> For case 2 we also do lots of unnecessary work, again we also need to
> check only 1 CPU (don't need cached_tbp_pinned). Also don't need to do
> atomic_dec/inc on all other CPUs (they dec/inc the same variable).
Same as above, just 1 iteration because cpumask_of_bp() does the right
thing. cached_tbp_pinned may still be used if all existing task
breakpoints are CPU-independent (i.e. cpu==-1; granted, doing
task_bp_pinned() once or twice probably is irrelevant in this case).
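To make the caching concrete, a minimal sketch of the idea (the helper
name task_bps_cpu_independent() and the exact placement of the cache in
fetch_bp_busy_slots() are illustrative, not verbatim from the patch):

	static void fetch_bp_busy_slots(struct bp_busy_slots *slots,
					struct perf_event *bp,
					enum bp_type_idx type)
	{
		const struct cpumask *cpumask = cpumask_of_bp(bp);
		int cached_tbp_pinned = -1;	/* -1: not computed yet */
		int cpu;

		for_each_cpu(cpu, cpumask) {
			int nr;

			if (cached_tbp_pinned >= 0) {
				/* All of the task's bps are CPU-independent:
				 * the per-CPU result cannot differ. */
				nr = cached_tbp_pinned;
			} else {
				nr = __task_bp_pinned(bp, cpu, type);
				if (task_bps_cpu_independent(bp)) /* hypothetical */
					cached_tbp_pinned = nr;
			}
			/* ... accumulate nr into slots as before ... */
		}
	}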
> Case 3 is the only one when we need to check all CPUs and
> cached_tbp_pinned may be useful.
> But I wonder if we could instead add a per-task
> has_per_cpu_breakpoints flag. Then if the flag is set, we check all
> CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
> then we could optimize the code even more by making it O(1) instead of
> O(N).
> Namely, we add global tsk_pinned for tasks that don't have
> per-cpu breakpoints, and we update only that tsk_pinned instead of
> iterating over all CPUs.
That seems reasonable.
> I think this will require adding cpu_pinned as well (similar to
> tsk_pinned but aggregated over all CPUs).
> Then the fast path capacity check can become just:
>
> if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
>         if (max_cpu_bp_pinned(type) + task_bp_pinned(bp, -1 /*cpu*/, type) +
>             hw_breakpoint_weight(bp) > nr_slots[type])
>                 return -ENOSPC;
> }
>
> Does it make any sense?
Yes, I think this might work. I'll see if I can make it work.
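Roughly what I have in mind for the O(1) accounting (all names here are
provisional -- has_per_cpu_breakpoints and update_global_tsk_pinned()
don't exist yet -- and the locking/migration details still need thought):

	static void toggle_bp_slot(struct perf_event *bp, bool enable,
				   enum bp_type_idx type, int weight)
	{
		if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints &&
		    bp->cpu < 0) {
			/* O(1): update one global per-task count, instead
			 * of touching tsk_pinned on every CPU. */
			update_global_tsk_pinned(bp->hw.target, type,
						 enable ? weight : -weight);
			return;
		}
		/* ... fall back to the existing per-CPU accounting ... */
	}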