Date:   Thu, 4 Aug 2022 10:37:46 +0200
From:   Ingo Molnar <mingo@...nel.org>
To:     Namhyung Kim <namhyung@...nel.org>
Cc:     Arnaldo Carvalho de Melo <acme@...nel.org>,
        Jiri Olsa <jolsa@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Ian Rogers <irogers@...gle.com>,
        linux-perf-users <linux-perf-users@...r.kernel.org>,
        Will Deacon <will@...nel.org>,
        Waiman Long <longman@...hat.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Davidlohr Bueso <dave@...olabs.net>
Subject: Re: [PATCHSET 0/6] perf lock: Add contention subcommand (v1)


* Namhyung Kim <namhyung@...nel.org> wrote:

> Hi Ingo,
> 
> On Wed, Aug 3, 2022 at 2:51 AM Ingo Molnar <mingo@...nel.org> wrote:
> >
> >
> > * Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > > Hello,
> > >
> > > It's to add a new subcommand 'contention' (shortly 'con') to perf lock.
> > >
> > > The new subcommand is to handle the new lock:contention_{begin,end}
> > > tracepoints and shows lock type and caller address like below:
> > >
> > >   $ perf lock contention
> > >    contended   total wait     max wait     avg wait         type   caller
> > >
> > >          238      1.41 ms     29.20 us      5.94 us     spinlock   update_blocked_averages+0x4c
> > >            1    902.08 us    902.08 us    902.08 us      rwsem:R   do_user_addr_fault+0x1dd
> > >           81    330.30 us     17.24 us      4.08 us     spinlock   _nohz_idle_balance+0x172
> > >            2     89.54 us     61.26 us     44.77 us     spinlock   do_anonymous_page+0x16d
> > >           24     78.36 us     12.27 us      3.27 us        mutex   pipe_read+0x56
> > >            2     71.58 us     59.56 us     35.79 us     spinlock   __handle_mm_fault+0x6aa
> > >            6     25.68 us      6.89 us      4.28 us     spinlock   do_idle+0x28d
> > >            1     18.46 us     18.46 us     18.46 us      rtmutex   exec_fw_cmd+0x21b
> > >            3     15.25 us      6.26 us      5.08 us     spinlock   tick_do_update_jiffies64+0x2c
> > >    ...
> >
> > Wouldn't it also be useful to display a lock contention percentage value,
> > the ratio of fastpath vs. contended/wait events?
> >
> > That's usually the first-approximation metric to see how contended
> > different locks are, and the average wait time quantifies it.
> 
> Yeah, that'd be nice to have.  But it requires some action in the fast 
> path which I don't want because I'd like to use this in production.  So 
> these new tracepoints were added only in the slow path.

Yeah. Might make sense to re-measure the impact of possibly doing that 
though: most of the locking fast path is out of line already and could be 
instrumented, with only a handful of inlined primitives - 
CONFIG_UNINLINE_SPIN_UNLOCK in particular.

How many additional inlined NOP sequences does this add in a defconfig 
kernel? How much is the bloat, and would it be acceptable for production 
kernels?

The other question is to keep tracing overhead low in production systems.

For that we'd have to implement some concept of 'sampling tracepoints', 
which generate only one event for every 128 fast path invocations or so, 
but stay out of the way & don't slow down the system otherwise.

OTOH frequently used locking fastpaths are measured via regular PMU 
sampling based profiling already.

> Instead, I think we can display the ratio of (total) contended time vs. 
> wall clock time.  What do you think?

That looks useful too - but also the time spent waiting/spinning in a 
thread vs. the time spent actually running and executing real stuff.

That ratio could easily get over 100%, for wait-dominated workloads - so 
ordering by that ratio would highlight the tasks that make the least amount 
of real progress. Measuring the ratio based only on wall clock time would 
hide this aspect.

Thanks,

	Ingo
