[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ec514ab8-cf18-6fa7-d2b5-5f81e1a70983@oracle.com>
Date: Mon, 9 Jul 2018 18:42:29 -0400
From: Boris Ostrovsky <boris.ostrovsky@...cle.com>
To: Mark Rutland <mark.rutland@....com>,
Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Jin Yao <yao.jin@...ux.intel.com>
Subject: Re: [RFC PATCH] perf/core: don't sample kernel regs upon skid
On 07/02/2018 12:02 PM, Mark Rutland wrote:
> On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote:
>> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:
>>> Users can request that general purpose registers, instruction pointer,
>>> etc, are sampled when a perf event counter overflows. To try to avoid
>>> this resulting in kernel state being leaked, unprivileged users are
>>> usually forbidden from opening events which count while the kernel is
>>> running.
>>>
>>> Unfortunately, this is not sufficient to avoid leading kernel state.
>> 'leaking' surely.
>>
>>> For various reasons, there can be a delay between the overflow occurring
>>> and the resulting overflow exception (e.g. an NMI) being taken. During
>>> this window, other instructions may be executed, resulting in skid.
>>>
>>> This skid means that a userspace-only event overflowing may result in an
>>> exception being taken *after* entry to the kernel, allowing kernel
>>> registers to be sampled. Depending on the amount of skid, this may only
>>> leak the PC (breaking KASLR), or it may leak secrets which are currently
>>> live in GPRs.
>>>
>>> Let's avoid this by only sampling from the user registers when an event
>>> is supposed to exclude the kernel, providing the illusion that the
>>> overflow exception is taken from userspace.
>>>
>>> We also have similar cases when sampling a guest, where we get the host
>>> regs in some cases. It's not entirely clear to me how we should handle
>>> these.
>> Would not a similar:
>>
>> if ((event->attr.exclude_hv || event->attr.exclude_host) /* WTF both !? */ &&
>> perf_guest_cbs && !perf_guest_cbs->is_in_guest())
>> return perf_guest_cbs->guest_pt_regs();
>>
>> work there?
> Mostly. It's fun if the user also passed exclude_guest -- I have no idea
> what should be sampled there, if anything.
>
> It's easy to get stuck in a rabbit hole looking at this.
>
>> Of course, perf_guest_info_callbacks is currently lacking that
>> guest_pt_regs() thingy..
> Yeah; I started looking at implementing it, but ran away since it wasn't
> clear to me how to build that on most architectures.
>
>>> Signed-off-by: Mark Rutland <mark.rutland@....com>
>>> Cc: Ingo Molnar <mingo@...hat.com>
>>> Cc: Jin Yao <yao.jin@...ux.intel.com>
>>> Cc: Peter Zijlstra <peterz@...radead.org>
>>> ---
>>> kernel/events/core.c | 28 ++++++++++++++++++++++++++++
>>> 1 file changed, 28 insertions(+)
>>>
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 8f0434a9951a..2ab2548b2e66 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>>> return callchain ?: &__empty_callchain;
>>> }
>>>
>>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,
>>> + struct pt_regs *regs)
>>> +{
>>> + /*
>>> + * Due to interrupt latency (AKA "skid"), we may enter the kernel
>>> + * before taking an overflow, even if the PMU is only counting user
>>> + * events.
>>> + *
>>> + * If we're not counting kernel events, always use the user regs when
>>> + * sampling.
>>> + *
>>> + * TODO: what do we do about sampling a guest's registers? The IP is
>>> + * special-cased, but for the rest of the regs they'll get the
>>> + * user/kernel regs depending on whether exclude_kernel is set, which
>>> + * is nonsensical.
>>> + *
>>> + * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU
>>> + * can't provide the GPRs), so should we just zero the GPRs when in a
>>> + * guest? Or skip outputting the regs in perf_output_sample?
>> Seems daft Xen cannot provide registers; why is that? Boris?
> The xen_pmu_regs structure simply doesn't have them, so I assume there's
> no API to get them.
>
> Given we don't currently sample the guest regs, I'd be tempted to just
> zero them for now, or skip the sample at output time (if that doesn't
> break some other case).
(Was out on vacation, couldn't respond earlier)
Yes, PV guests only get a limited set of registers passed to the handler
by the hypervisor. GPRs are not part of this set.
I think we need do
diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index 7d00d4a..95997e6 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -478,7 +478,7 @@ static void xen_convert_regs(const struct
xen_pmu_regs *xen_regs,
irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id)
{
int err, ret = IRQ_NONE;
- struct pt_regs regs;
+ struct pt_regs regs = {0};
const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();
uint8_t xenpmu_flags = get_xenpmu_flags();
Do you want me to submit a separate patch or can you make this part of
yours?
Thanks.
-boris
Powered by blists - more mailing lists