Message-ID: <f03715ac-a4ac-415d-8daa-1914384319fb@linaro.org>
Date: Mon, 28 Apr 2025 09:56:42 +0100
From: James Clark <james.clark@...aro.org>
To: Yabin Cui <yabinc@...gle.com>, Leo Yan <leo.yan@....com>,
Ingo Molnar <mingo@...hat.com>
Cc: Ingo Molnar <mingo@...nel.org>, coresight@...ts.linaro.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org, Mike Leach <mike.leach@...aro.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
Liang Kan <kan.liang@...ux.intel.com>
Subject: Re: [PATCH 1/2] perf: Allow non-contiguous AUX buffer pages via PMU
capability

On 23/04/2025 8:52 pm, Yabin Cui wrote:
> On Tue, Apr 22, 2025 at 7:10 AM Leo Yan <leo.yan@....com> wrote:
>>
>> On Tue, Apr 22, 2025 at 02:49:54PM +0200, Ingo Molnar wrote:
>>
>> [...]
>>
>>>> Hi Yabin,
>>>>
>>>> I was wondering if this is just the opposite of
>>>> PERF_PMU_CAP_AUX_NO_SG, and that order 0 should be used by default
>>>> for all devices to solve the issue you describe. Because we already
>>>> have PERF_PMU_CAP_AUX_NO_SG for devices that need contiguous pages.
>>>> Then I found commit 5768402fd9c6 ("perf/ring_buffer: Use high order
>>>> allocations for AUX buffers optimistically") that explains that the
>>>> current allocation strategy is an optimization.
>>>>
>>>> Your change seems to decide that for certain devices we want to
>>>> optimize for fragmentation rather than performance. If these are
>>>> rarely used features specifically when looking at performance should
>>>> we not continue to optimize for performance? Or at least make it user
>>>> configurable?
>>>
>>> So there seems to be 3 categories:
>>>
>>> - 1) Must have physically contiguous AUX buffers, it's a hardware ABI.
>>> (PERF_PMU_CAP_AUX_NO_SG for Intel BTS and PT.)
>>>
>>> - 2) Would be nice to have contiguous AUX buffers, for a bit more
>>> performance.
>>>
>>> - 3) Doesn't really care.
>>>
>>> So we do have #1, and it appears Yabin's usecase is #3?
>
> Yes, in my usecase, I care much more about MM-friendly than a little potential
> performance when using PMU. It's not a rarely used feature. On Android, we
> collect ETM data periodically on internal user devices for AutoFDO optimization
> (for both userspace libraries and the kernel). Allocating a large chunk of
> contiguous AUX pages (4M for each CPU) periodically is almost unbearable. The
> kernel may need to kill many processes to fulfill the request. It affects user
> experience even after using PMU.
>
> I am totally fine to reuse PERF_PMU_CAP_AUX_NO_SG. If PMUs don't want to
> sacrifice performance for MM-friendly, why support scatter gather mode? If there
> are strong performance reasons to allocate contiguous AUX pages in scatter
> gather mode, I hope max_order is configurable in userspace.
>
> Currently, max_order is affected by aux_watermark. But aux_watermark also
> affects how frequently the PMU overflows AUX buffer and notifies userspace.
> It's not ideal to set aux_watermark to 1 page size. So if we want to make
> max_order user configurable, maybe we can add a one bit field in
> perf_event_attr?
>
>>
>> In Yabin's case, the AUX buffer works as a bounce buffer. The hardware
>> trace data is copied by a driver from the low level's contiguous buffer to
>> the AUX buffer.
>>
>> In this case we cannot benefit much from contiguous AUX buffers.
>>
>> Thanks,
>> Leo

Hi Yabin,

So after doing some testing it looks like there is no difference in
overhead between max_order=0 and ensuring the buffer is one contiguous
allocation for Arm SPE, and TRBE would be exactly the same. This makes
sense because we're vmapping pages individually anyway, regardless of the
base allocation.
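
A toy model of why the allocation order should not matter for a driver that
maps the buffer page by page: whatever max_order is, the table handed to the
mapping step has one entry per page, and only the number of underlying
allocations changes. This is an illustration only; fill_page_table() is a
made-up name and none of this is the actual ring buffer code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of filling a per-page table for an AUX buffer. Each "allocation"
 * yields a chunk of up to (1 << max_order) pages, which is then recorded
 * page by page. page_chunk[i] receives the index of the chunk that page i
 * came from. Returns the number of chunk allocations performed.
 */
size_t fill_page_table(size_t *page_chunk, size_t nr_pages,
		       unsigned int max_order)
{
	size_t filled = 0, chunks = 0;

	assert(page_chunk != NULL);
	while (filled < nr_pages) {
		size_t chunk = (size_t)1 << max_order;

		/* Shrink the chunk until it fits the pages still needed. */
		while (chunk > nr_pages - filled)
			chunk >>= 1;

		/* The table always gets one entry per page. */
		for (size_t i = 0; i < chunk; i++)
			page_chunk[filled + i] = chunks;

		filled += chunk;
		chunks++;
	}
	return chunks;
}
```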

Seems like the performance optimization of the optimistically large
allocations only helps devices that need extra buffer management beyond
normal virtual memory. Can we add a new capability,
PERF_PMU_CAP_AUX_PREFER_LARGE, and apply it to Intel PT and BTS? Then the
old max_order=0 behavior (from before the optimistic large allocs change)
becomes the default again, and PREFER_LARGE is just for those two
devices. Other and new devices would get the more memory friendly
allocations by default, as it's unlikely they'll benefit from anything
different.
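
A rough sketch of how that selection could look, as a userspace model rather
than a patch. The flag values, aux_max_order() and order_for_pages() are
made-up names for illustration, and chunk_pages stands in for whatever chunk
size the optimistic path would aim for (for example one derived from the
watermark); the real code may compute this differently.

```c
#include <assert.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define CAP_AUX_NO_SG        (1u << 0) /* stands in for PERF_PMU_CAP_AUX_NO_SG */
#define CAP_AUX_PREFER_LARGE (1u << 1) /* stands in for the proposed capability */

/* Smallest order such that (1 << order) pages cover nr_pages. */
unsigned int order_for_pages(unsigned long nr_pages)
{
	unsigned int order = 0;

	assert(nr_pages > 0);
	while ((1ul << order) < nr_pages)
		order++;
	return order;
}

/* Pick the largest allocation order to attempt for an AUX buffer. */
unsigned int aux_max_order(unsigned int caps, unsigned long nr_pages,
			   unsigned long chunk_pages)
{
	if (caps & CAP_AUX_NO_SG)
		return order_for_pages(nr_pages);    /* one contiguous allocation */
	if (caps & CAP_AUX_PREFER_LARGE)
		return order_for_pages(chunk_pages); /* optimistic high order */
	return 0;                                    /* default: single pages */
}
```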

Thanks
James