linux-kernel - Re: [PATCH v3 0/2] kstats: kernel metric collector

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAMOZA0Lzf2r7rFvgBEWpf-B=wXvyED2CxfzuO7qUA_qVsNtL7g@mail.gmail.com>
Date:   Wed, 26 Feb 2020 09:26:22 -0800
From:   Luigi Rizzo <lrizzo@...gle.com>
To:     Toke Høiland-Jørgensen <toke@...hat.com>
Cc:     linux-kernel@...r.kernel.org,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Greg KH <gregkh@...uxfoundation.org>,
        naveen.n.rao@...ux.ibm.com, ardb@...nel.org,
        Luigi Rizzo <rizzo@....unipi.it>,
        Paolo Abeni <pabeni@...hat.com>, giuseppe.lettieri@...pi.it,
        Jesper Dangaard Brouer <hawk@...nel.org>, mingo@...hat.com,
        acme@...nel.org, Steven Rostedt <rostedt@...dmis.org>,
        peterz@...radead.org
Subject: Re: [PATCH v3 0/2] kstats: kernel metric collector

[this reply also addresses comments from Alexei and Peter]

On Wed, Feb 26, 2020 at 7:00 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Luigi Rizzo <lrizzo@...gle.com> writes:
>
> > This patchset introduces a small library to collect per-cpu samples and
> > accumulate distributions to be exported through debugfs.
> >
> > This v3 series addresses some initial comments (mostly style fixes in the
> > code) and revises commit logs.
>
> Could you please add a proper changelog spanning all versions of the
> patch as you iterate?

Will do (v2->v3 was just a removal of stray Change-Id from the log messages)

> As for the idea itself; picking up this argument you made on v1:
>
> > The tracepoint/kprobe/kretprobe solution is much more expensive --
> > from my measurements, the hooks that invoke the various handlers take
> > ~250ns with hot cache, 1500+ns with cold cache, and tracing an empty
> > function this way reports 90ns with hot cache, 500ns with cold cache.
>
> I think it would be good if you could include an equivalent BPF-based
> implementation of your instrumentation example so people can (a) see the
> difference for themselves and get a better idea of how the approaches
> differ in a concrete case and (b) quantify the difference in performance
> between the two implementations.

At the moment, a bpf version is probably beyond my skills and goals,
but I hope the
following comments can clarify the difference in approach/performance:

-  my primary goal, implemented in patch 1/2, is to have this code embedded in
  the kernel, _always available_ , even to users without the skills to
hack up the
  necessary bpf code, or load a bpf program (which may not be allowed in
  certain environments), and eventually replace and possibly improve custom
  variants of metric collections which we already have (or wished to,
but don't have
  because there wasn't a convenient library to use for them).

- I agree that this code can be recompiled in bpf (using a
  BPF_MAP_TYPE_PERCPU_ARRAY for storage, and kstats_record() and
  ks_show_entry() should be easy to convert).

- the runtime cost and complexity of hooking bpf code is still a bit
unclear to me.
  kretprobe or tracepoints are expensive, I suppose that some lean hook
  replace register_kretprobe() may exist and the difference from inline
  annotations would be marginal (we'd still need to put in the hooks around
  the code we want to time, though, so it wouldn't be a pure bpf solution).
  Any pointers to this are welcome; Alexei mentioned fentry/fexit and
  bpf trampolines, but I haven't found an example that lets me do something
  equivalent to kretprobe (take a timestamp before and one after a function
  without explicit instrumentation)

- I still see some huge differences in usability, and this is in my opinion
  one very big difference between the two approaches. The systems where data
  collection may be of interest are not necessarily accessible to developers
  with the skills to write custom bpf code, or load bpf modules
(security policies
  may prevent that). One thing is to tell a sysadmin to run
  "echo trace foo > /sys/kernel/debug/kstats/_config"
  or "watch grep CPUS /sys/kernel/debug/kstats/bar",
  another one is to tell them to load a bpf program (or write their own one).

thanks for the feedback
luigi