Message-ID: <0e2cd6e4-8016-ccf6-eaff-2b304cf966ee@linux.intel.com>
Date: Thu, 17 Sep 2020 17:58:26 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Dave Hansen <dave.hansen@...el.com>, peterz@...radead.org,
mingo@...hat.com, acme@...nel.org, linux-kernel@...r.kernel.org
Cc: mark.rutland@....com, alexander.shishkin@...ux.intel.com,
jolsa@...hat.com, eranian@...gle.com, ak@...ux.intel.com,
kirill.shutemov@...ux.intel.com, mpe@...erman.id.au,
benh@...nel.crashing.org, paulus@...ba.org
Subject: Re: [PATCH V7 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

On 9/17/2020 5:24 PM, Dave Hansen wrote:
> On 9/17/20 2:16 PM, Liang, Kan wrote:
>>> One last concern as I look at this: I wish it was a bit more
>>> future-proof. There are lots of weird things folks are trying to do
>>> with the page tables, like Address Space Isolation. For instance, if
>>> you get a perf NMI when running userspace, current->mm->pgd is
>>> *different* than the PGD that was in use when userspace was running.
>>> It's close enough today, but it might not stay that way. But I can't
>>> think of any great ways to future proof this code, other than spitting
>>> out an error message if too many of the page table walks fail when they
>>> shouldn't.
>>>
>>
>> If the page table walks fail, page size 0 will return. So the worst case
>> is that the page size is not available for users, which is not a fatal
>> error.
>
> Right, it's not a fatal error. It will just more or less silently break
> this feature.
>
>> If my understanding is correct, when the above case happens, there is
>> nothing we can do for now (because we have no idea what it will become),
>> except disabling the page size support and throw an error/warning.
>>
>> From the user's perspective, throwing an error message or marking the
>> page size unavailable should be the same. I think we may leave the code
>> as-is. We can fix the future case later separately.
>
> The only thing I can think of is to record the number of consecutive
> page walk errors without a success. Occasional failures are OK and
> expected, such as if reclaim zeroes a PTE and a later perf event goes
> and looks at it. But a *LOT* of consecutive errors indicates a real
> problem somewhere.
>
> Maybe if you have 10,000 or 1,000,000 successive walk failures, you do a
> WARN_ON_ONCE().
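
For concreteness, the consecutive-failure counter suggested above might
look roughly like the sketch below. It is a user space stand-in, not
kernel code: record_walk_result() and WALK_FAIL_WARN_THRESHOLD are
made-up names, fprintf() stands in for WARN_ON_ONCE(), and the per-CPU
and NMI-context concerns a real implementation would have are ignored.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical threshold; the thread only suggests 10,000 or 1,000,000. */
#define WALK_FAIL_WARN_THRESHOLD 10000UL

static unsigned long consecutive_walk_failures;
static bool walk_warned;	/* stand-in for the "once" in WARN_ON_ONCE() */

/*
 * Record the outcome of one page table walk (page_size == 0 means the
 * walk failed). Returns true only on the call that emits the warning.
 */
static bool record_walk_result(unsigned long page_size)
{
	if (page_size != 0) {
		/* Any success resets the streak: occasional failures are OK. */
		consecutive_walk_failures = 0;
		return false;
	}

	if (++consecutive_walk_failures < WALK_FAIL_WARN_THRESHOLD || walk_warned)
		return false;

	walk_warned = true;
	fprintf(stderr, "page size walk failed %lu times in a row\n",
		consecutive_walk_failures);
	return true;
}
```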

The user space perf tool looks like a better place for this kind of
warning. The perf tool knows the total number of samples, and it also
knows how many of them report a page size of 0. We can set a threshold,
e.g., 90%: if at least 90% of the samples have a page size of 0, the
perf tool prints a warning.
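
A minimal sketch of that check, assuming the tool keeps two counters
while processing samples. The struct, the function names and the
ZERO_SIZE_WARN_PERCENT macro are hypothetical and not taken from the
perf tool sources; only the 90% figure comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold, matching the 90% example above. */
#define ZERO_SIZE_WARN_PERCENT 90

/* Hypothetical counters, updated once per processed sample. */
struct page_size_stats {
	uint64_t nr_samples;	/* samples carrying a data page size */
	uint64_t nr_zero_size;	/* of those, samples reporting size 0 */
};

static void account_sample(struct page_size_stats *st, uint64_t data_page_size)
{
	st->nr_samples++;
	if (data_page_size == 0)
		st->nr_zero_size++;
}

/* True if at least ZERO_SIZE_WARN_PERCENT of the samples had size 0. */
static bool should_warn_zero_page_size(const struct page_size_stats *st)
{
	if (st->nr_samples == 0)
		return false;

	/* Integer comparison; avoids floating point for the ratio. */
	return st->nr_zero_size * 100 >=
	       st->nr_samples * ZERO_SIZE_WARN_PERCENT;
}
```

The check would run once, after all samples have been processed, so
the cost per sample is just the two counter updates.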

The problem is that a warning from the perf tool usually includes some
hint about the cause of the warning, or a possible way to work around
or fix it. What message should we deliver to the users? Something like
"Warning: Too many samples with page size 0. An address space isolation
feature may be enabled, please check"?

Thanks,
Kan