[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALCETrXXe3Rc5UQPxZyD==EV1vBZ+ei3qtqhkTMTyuXnF9Jazw@mail.gmail.com>
Date: Thu, 21 Apr 2016 16:27:19 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Dmitry Safonov <dsafonov@...tuozzo.com>,
Thomas Gleixner <tglx@...utronix.de>,
Shuah Khan <shuahkh@....samsung.com>,
Ingo Molnar <mingo@...hat.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Borislav Petkov <bp@...en8.de>, khorenko@...tuozzo.com,
X86 ML <x86@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>, xemul@...tuozzo.com,
linux-kselftest@...r.kernel.org,
Cyrill Gorcunov <gorcunov@...nvz.org>,
Dmitry Safonov <0x7f454c46@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to
change compatible mode
On Thu, Apr 21, 2016 at 1:12 PM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Thu, Apr 21, 2016 at 12:39:42PM -0700, Andy Lutomirski wrote:
>> On Wed, Apr 20, 2016 at 12:05 PM, Peter Zijlstra <peterz@...radead.org> wrote:
>> > On Wed, Apr 20, 2016 at 08:40:23AM -0700, Andy Lutomirski wrote:
>
>> >> >> Peter, I got lost in the code that calls this. Are regs coming from
>> >> >> the overflow interrupt's regs, current_pt_regs(), or
>> >> >> perf_get_regs_user?
>> >> >
>> >> > So get_perf_callchain() will get regs from:
>> >> >
>> >> > - interrupt/NMI regs
>> >> > - perf_arch_fetch_caller_regs()
>> >> >
>> >> > And when user && !user_mode(), we'll use:
>> >> >
>> >> > - task_pt_regs() (which arguably should maybe be perf_get_regs_user())
>> >>
>> >> Could you point me to this bit of the code?
>> >
>> > kernel/events/callchain.c:198
>>
>> But that only applies to the callchain code, right?
>
> Yes, which is what I thought you were after..
>
>> AFAICS the PEBS
>> code is invoked through the x86_pmu NMI handler and always gets the
>> IRQ regs. Except for this case:
>>
>> static inline void intel_pmu_drain_pebs_buffer(void)
>> {
>> struct pt_regs regs;
>>
>> x86_pmu.drain_pebs(®s);
>> }
>>
>> which seems a bit confused.
>
> Yes, so that only gets used with 'large' pebs, which requires no other
> flags than PERF_FRERERUNNING_FLAGS, which precludes the regs set from
> being used.
>
> Could definitely use a comment.
>
>> I don't suppose we could arrange to pass something consistent into the
>> PEBS handlers...
>>
>> Or is the PEBS code being called from the callchain code somehow?
>
> No. I think we were/are slightly talking past one another.
>
>> >> One call to perf_get_user_regs per interrupt shouldn't be too bad --
>> >> certainly much better then one per PEBS record. One call to get user
>> >> ABI per overflow would be even less bad, but at that point, folding it
>> >> in to the PEBS code wouldn't be so bad either.
>> >
>> > Right; although note that the whole fixup_ip() thing requires a single
>> > record per interrupt (for we need the LBR state for each record in order
>> > to rewind).
>>
>> So do earlier PEBS events not get rewound? Or so we just program the
>> thing to only ever give us one event at a time?
>
> The latter; we program PEBS such that it can hold but a single record
> and thereby assure we get an interrupt for each record.
>
>> > The problem here is that the overflow stuff is designed for a single
>> > 'event' per interrupt, so passing it data for multiple events is
>> > somewhat icky.
>>
>> It also seems that there's a certain amount of confusion as to exactly
>> what "regs" means in various contexts. Or at least I'm confused by
>> it.
>
> Yes, there's too much regs.
>
> Typically 'regs' is the 'interrrupt'/'event' regs, that is the register
> set at eventing time. For sampling hardware PMUs this is NMI/IRQ like
> things, for software events this ends up being
> perf_arch_fetch_caller_regs().
>
> Then there's PERF_SAMPLE_REGS_USER|PERF_SAMPLE_STACK_USER, which, for
> each event with it set, use perf_get_regs_user() to dump the thing into
> our ringbuffer as part of the event record.
>
> And then there's the callchain code, which first unwinds kernel space if
> the 'interrupt'/'event' reg set points into the kernel, and then uses
> task_pt_regs() (which I think we agree should be perf_get_regs_user())
> to obtain the user regs to continue with the user stack unwind.
>
> Finally there's PERF_SAMPLE_REGS_INTR, which dumps whatever
> 'interrupt/event' regs we get into the ringbuffer sample record.
>
>
> Did that help? Or did I confuse you moar?
>
I think I'm starting to get it. What if we rearrange slightly, like this:
perf_sample_data already has a struct perf_regs in it. We could add a
flags field to the first chunk of perf_sample_data:
u64 sample_flags;
perf_sample_data_init sets sample_flags to zero.
Now we rename perf_sample_regs_user to __perf_sample_regs_user and
make it non-static. We also teach it to set do data->sample_flags |=
PERF_SAMPLE_FLAGS_HAS_REGS_USER. We add:
static void perf_fetch_regs_user(struct perf_sample_data *data, struct
pt_regs *interrupt_regs)
{
if (data->sample_flags & PERF_SAMPLE_FLAGS_HAS_REGS_USER)
return;
__perf_sample_regs_user(&data->regs_user, interrupt_regs,
&data->regs_user_copy);
}
(Hmm. This only really works well if we can guarantee that
interrupt_regs remains valid for the life of the perf_sample_data
object. Could we perhaps move the interrupt_regs pointer *into*
perf_sample_data and stop passing it all over the place?)
We change all the callers of perf_sample_regs_user to use
perf_fetch_regs_user instead.
Now we teach the PEBS fixup code to call perf_fetch_regs_user if it
sees a user IP. Then it can use regs_user->abi instead of TIF_IA32,
and my original goal of nuking TIF_IA32 can proceed apace. (Keep in
mind that, if the interrupt refers to user mode, this is very fast.
If the interrupt refers to kernel mode, it's slower, but that's
comparatively rare and it's the case where we actually care.)
There might be one or two other tweaks needed, but I think this should
mostly do the trick.
What do you think? If you like it, I can probably find some time to
give it a shot, but I don't guarantee that I won't miss some subtlety
in its interaction with the rest of the event output code.
On a vaguely related note, why is the big prebs-to-pt_regs copy
conditional on (sample_type & PERF_SAMPLE_REGS_INTR)? I bet it would
be faster to make it unconditional, because you could avoid copying
over the entire pt_regs struct if PERF_SAMPLE_REGS_INTR isn't set.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
Powered by blists - more mailing lists