linux-kernel - Re: [BUG REPORT] perf tools: x86_64: Broken callchain when sampling taken at 'callq' instruction

Message-ID: <20151201072826.GB28270@gmail.com>
Date:	Tue, 1 Dec 2015 08:28:26 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	"Wangnan (F)" <wangnan0@...wei.com>, Jiri Olsa <jolsa@...nel.org>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	David Ahern <dsahern@...il.com>,
	Milian Wolff <milian.wolff@...b.com>,
	linux-kernel@...r.kernel.org, pi3orama <pi3orama@....com>,
	lizefan 00213767 <lizefan@...wei.com>
Subject: Re: [BUG REPORT] perf tools: x86_64: Broken callchain when sampling
 taken at 'callq' instruction


* Peter Zijlstra <peterz@...radead.org> wrote:

> On Fri, Nov 27, 2015 at 09:38:11AM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@...radead.org> wrote:
> > 
> > > On Thu, Nov 19, 2015 at 11:23:00AM +0100, Ingo Molnar wrote:
> > > > PEBS is an asynchronous hardware tracing mechanism; when batched PEBS is used it 
> > > > might not even result in any interruption of execution. The 'pt_regs' does not 
> > > > necessarily correspond to an interrupted, restartable context - we take the RIP 
> > > > from the PEBS machinery and also use LBR and disassembly to determine the previous 
> > > > instruction, before reporting it to user-space.
> > > 
> > > Note that modern PEBS hardware (hsw+) does the rollback in hardware.
> > > Prior to that we indeed do it manually using the LBR.
> > > 
> > > As to pt_regs, we construct a franken pt_regs based on the actual PEBS
> > > buffer overflow PMI and bits from the PEBS record (which also includes
> > > some register state). See
> > > arch/x86/kernel/cpu/perf_event_intel_ds.c:setup_pebs_sample_data().
> > > 
> > > We always copy the flags, ip, bp and sp from the PEBS record into the
> > > interrupt pt_regs.
> > > 
> > > And note that the PEBS record is constructed at instruction retirement,
> > > so it shows the state _after_ the instruction, with exception of the
> > > (hsw+) real_ip field.
> > > 
> > > So the unwinder will have to be taught that if the IP points at a stack
> > > altering instruction (call, push, etc.) it will have to 'undo' the
> > > effects on the actual stack (I appreciate this might be 'interesting'
> > > for things like: pop, ret, etc.).
> > 
> > So do we dump both the 'real' and the actual RIP, to not force tooling into having 
> > to decode instructions and such?
> 
> Nope, we only expose the corrected one.
> 
> > (Which is pretty hard and fragile and not always 
> > possible with instructions that destroy the original RIP, like JMP, etc.)
> 
> Not sure what you're getting at here. We don't need the uncorrected
> instruction.

Well, we need it for stack unwinding, as you point out:

> But the problem here is that we rewind the instruction stream, but not
> the stack. And the stack unwinder is (obviously) interested in the stack
> state.

Unwinding the stack state would fix it - but an equivalent solution would be to 
pass along the original RIP: either way we'd have a self-consistent pair of 
RIP/RSP.

Especially since unwinding the RSP is probably hard:

> I'm not sure we want (or need) to go undo the specific instruction's
> stack effect in-kernel. If the !DWARF unwinders are similarly confused
> we might need to put it in kernel (expensive *groan*). If it's only the
> DWARF muck then it's something that can be done in userspace just
> fine, although we might need to copy slightly more of the stack than SP
> is pointing at, such that we can undo RET/POP etc. which would have data
> beyond the head of stack.
> 
> The easiest solution might be to figure out the biggest stack offset for
> any instruction and always capture that much over the head of stack.

So I think the problem here is that the RSP does not match up with the RIP. We can 
either pass along the original RIP+RSP, or the fixed-up pair - but what we do 
currently is pass along only half of the fixup (corrected RIP, uncorrected RSP), 
which corrupts DWARF unwinding state that doesn't tolerate such errors.
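
To make the mismatch concrete, here is a minimal user-space sketch in C. It is
an illustration only, not kernel or perf code: the struct, the enum and the
helper names are all hypothetical, and it assumes the sampled instruction has
already been classified and that CALL and PUSH each move RSP by 8 bytes (64-bit
operands). It shows that pairing a rewound RIP with the post-retirement RSP
shifts the DWARF-style CFA by exactly the instruction's stack effect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of the sampled instruction's stack effect. */
enum insn_kind { INSN_OTHER, INSN_CALL, INSN_PUSH };

/* Hypothetical sample: ip is rewound to the sampled instruction, while
 * sp still reflects the state after that instruction retired. */
struct sample_regs {
	uint64_t ip;
	uint64_t sp;
};

/* Undo the stack effect so that sp matches the rewound ip.
 * Assumption: CALL and PUSH both pushed exactly 8 bytes. */
static uint64_t sp_before_insn(const struct sample_regs *regs,
			       enum insn_kind kind)
{
	assert(regs != NULL);

	switch (kind) {
	case INSN_CALL:		/* pushed an 8-byte return address */
	case INSN_PUSH:		/* pushed an 8-byte operand */
		return regs->sp + 8;
	default:
		return regs->sp;
	}
}

/* DWARF-style rule "CFA = SP + offset", where the offset is the one
 * that is valid at the (rewound) ip. */
static uint64_t compute_cfa(uint64_t sp, uint64_t cfa_offset)
{
	return sp + cfa_offset;
}
```

With a sample taken at a callq, compute_cfa(regs.sp, off) lands 8 bytes below
compute_cfa(sp_before_insn(&regs, INSN_CALL), off), so every return address
read relative to the CFA is wrong. Undoing POP or RET would additionally need
stack data from below the reported RSP, which this sketch does not attempt.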

Thanks,

	Ingo