linux-kernel - RE: [PATCH 1/3] perf/x86: Add perf text poke event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <363DA0ED52042842948283D2FC38E4649C66B7C0@IRSMSX106.ger.corp.intel.com>
Date:   Thu, 19 Dec 2019 16:50:37 +0000
From:   "Hunter, Adrian" <adrian.hunter@...el.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H . Peter Anvin" <hpa@...or.com>,
        "x86@...nel.org" <x86@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Mathieu Poirier <mathieu.poirier@...aro.org>,
        Leo Yan <leo.yan@...aro.org>,
        "Arnaldo Carvalho de Melo" <acme@...nel.org>,
        Jiri Olsa <jolsa@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 1/3] perf/x86: Add perf text poke event

> -----Original Message-----
> From: Peter Zijlstra <peterz@...radead.org>
> Sent: Thursday, December 19, 2019 3:09 PM
> To: Hunter, Adrian <adrian.hunter@...el.com>
> Cc: Ingo Molnar <mingo@...hat.com>; Borislav Petkov <bp@...en8.de>; H .
> Peter Anvin <hpa@...or.com>; x86@...nel.org; Mark Rutland
> <mark.rutland@....com>; Alexander Shishkin
> <alexander.shishkin@...ux.intel.com>; Mathieu Poirier
> <mathieu.poirier@...aro.org>; Leo Yan <leo.yan@...aro.org>; Arnaldo
> Carvalho de Melo <acme@...nel.org>; Jiri Olsa <jolsa@...hat.com>; linux-
> kernel@...r.kernel.org
> Subject: Re: [PATCH 1/3] perf/x86: Add perf text poke event
> 
> On Wed, Dec 18, 2019 at 04:26:16PM +0200, Adrian Hunter wrote:
> > Record changes to kernel text (i.e. self-modifying code) in order to
> > support tracers like Intel PT decoding through jump labels.
> 
> I don't get the obsession with just jump-labels, we need a solution all
> modifying code. The fentry site usage is increasing, and optprobes are also
> fairly popular with a bunch of people.

Yes we need a solution for all modifying code.  Jump labels are just a good
place to start because they are the biggest problem by far.

> 
> > A copy of the running kernel code is needed as a reference point (e.g.
> > from /proc/kcore). The text poke event records the old bytes and the
> > new bytes so that the event can be processed forwards or backwards.
> >
> > In the case of Intel PT tracing, the executable code must be walked to
> > reconstruct the control flow. For x86 a jump label text poke begins:
> >   - write INT3 byte
> >   - IPI-SYNC
> >   - write instruction tail
> > At this point the actual control flow will be through the INT3 and
> > handler and not hit the old or new instruction. Intel PT outputs
> > FUP/TIP packets for the INT3, so the flow can still be decoded.
> Subsequently:
> >   - emit RECORD_TEXT_POKE with the new instruction
> >   - IPI-SYNC
> >   - write first byte
> >   - IPI-SYNC
> > So before the text poke event timestamp, the decoder will see either
> > the old instruction flow or FUP/TIP of INT3. After the text poke event
> > timestamp, the decoder will see either the new instruction flow or
> > FUP/TIP of INT3. Thus decoders can use the timestamp as the point at
> > which to modify the executable code.
> 
> I feel a much better justification for the design can be found in the discussion
> we've had around ARM-CS.
> 
> Basically SMP instruction coherency mandates something like this, it is just a
> happy accident x86 already had all the bits in place.
> 
> How is something like:
> 
> "Record (single instruction) changes to the kernel text (i.e.
> self-modifying code) in order to support tracers like Intel PT and ARM
> CoreSight.
> 
> A copy of the running kernel code is needed as a reference point (e.g.
> from /proc/kcore). The text poke event records the old bytes and the new
> bytes so that the event can be processed forwards or backwards.
> 
> The basic problem is recording the modified instruction in an unambiguous
> manner given SMP instruction cache (in)coherence. That is, when modifying
> an instruction concurrently any solution with one or multiple timestamps is
> not sufficient:
> 
> 	CPU0				CPU1
>  0
>  1	write insn A
>  2					execute insn A
>  3	sync-I$
>  4
> 
> Due to I$, CPU1 might execute either the old or new A. No matter where we
> record tracepoints on CPU0, one simply cannot tell what CPU1 will have
> observed, except that at 0 it must be the old one and at 4 it must be the new
> one.
> 
> To solve this, take inspiration from x86 text poking, which has to solve this
> exact problem due to variable length instruction encoding and I-fetch
> windows.
> 
>  1) overwrite the instruction with a breakpoint and sync I$
> 
> This guarantees that that code flow will never hit the target instruction
> anymore, on any CPU (or rather, it will cause an exception).
> 
>  2) issue the TEXT_POKE event
> 
>  3) overwrite the breakpoint with the new instruction and sync I$
> 
> Now we know that any execution after the TEXT_POKE event will either
> observe the breakpoint (and hit the exception) or the new instruction.
> 
> So by guarding the TEXT_POKE event with an exception on either side; we
> can now tell, without doubt, which instruction another CPU will have
> observed."
> 
> ?

Ok

> 
> > Signed-off-by: Adrian Hunter <adrian.hunter@...el.com>
> > ---
> >  arch/x86/kernel/alternative.c   | 37 +++++++++++++-
> 
> I'm thinking it might make sense to do this x86 part in a separate patch, and
> just present the generic thing first:
> 
> >  include/linux/perf_event.h      |  6 +++
> >  include/uapi/linux/perf_event.h | 19 ++++++-
> >  kernel/events/core.c            | 87 ++++++++++++++++++++++++++++++++-
> >  4 files changed, 146 insertions(+), 3 deletions(-)
> >
> 
> > @@ -1006,6 +1007,22 @@ enum perf_event_type {
> >  	 */
> >  	PERF_RECORD_BPF_EVENT			= 18,
> >
> > +	/*
> > +	 * Records changes to kernel text i.e. self-modified code.
> > +	 * 'len' is the number of old bytes, which is the same as the number
> > +	 * of new bytes. 'bytes' contains the old bytes followed immediately
> > +	 * by the new bytes.
> > +	 *
> > +	 * struct {
> > +	 *	struct perf_event_header	header;
> > +	 *	u64				addr;
> > +	 *	u16				len;
> > +	 *	u8				bytes[];
> 
> Would it make sense to have something like:
> 
> 	 *	u16				old_len;
> 	 *	u16				new_len;
> 	 *	u8				old_bytes[old_len];
> 	 *	u8				new_bytes[new_len];
> 
> That would allow using this for (short) trampolines (ftrace, optprobes etc..).
> {old_len=0, new_len>0} would indicate a new trampoline, while {old_len>0,
> new_len=0} would indicate the dissapearance of a trampoline.

Yes that makes sense.

> 
> > +	 *	struct sample_id		sample_id;
> > +	 * };
> > +	 */
> > +	PERF_RECORD_TEXT_POKE			= 19,
> > +
> >  	PERF_RECORD_MAX,			/* non-ABI */
> >  };
> 
> Then the x86 patch, hooking up the event, can also cover kprobes and stuff.

Ok