Message-ID: <aYZgGlh3e84ZrUNQ@x1>
Date: Fri, 6 Feb 2026 18:41:46 -0300
From: Arnaldo Carvalho de Melo <acme@...nel.org>
To: Thomas Richter <tmricht@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, linux-s390@...r.kernel.org,
linux-perf-users@...r.kernel.org, namhyung@...nel.org,
agordeev@...ux.ibm.com, gor@...ux.ibm.com, sumanthk@...ux.ibm.com,
hca@...ux.ibm.com, japo@...ux.ibm.com,
James Clark <james.clark@...aro.org>
Subject: Re: [PATCH] perf/test: Fix test case Leader sampling on s390.
On Fri, Nov 28, 2025 at 10:11:39AM +0100, Thomas Richter wrote:
> The subtest 'Leader sampling' sometimes fails on s390.
> - for z/VM guest: Disable the test for z/VM guest. There is no
> CPU Measurement facility to run the test successfully.
> - for LPAR: Use correct event names.
This one fell through the cracks. It still applies cleanly and the extra
logic affects only s390, so I'm applying it to perf-tools-next.
- Arnaldo
> A detailed analysis follows here:
> Now to the debugging and investigation:
> 1. With command
> perf record -e '{cycles,cycles}:S' -- ....
> the first cycles event starts sampling.
> On s390 this sets up sampling with a frequency of 4000 Hz.
> This translates to a hardware sample rate of 1377000 instructions per
> microsecond to meet a frequency of 4000 Hz.
>
> 2. With first event cycles now sampling into a hardware buffer, an
> interrupt is triggered each time a sampling buffer gets full.
> The interrupt handler is then invoked and debug output shows the
> processing of samples. The size of one hardware sample is 32 bytes.
> With an interrupt triggered when the hardware buffer page of 4KB
> gets full, the interrupt handler processes 128 samples.
> (This is taken from s390 specific fast debug data gathering)
> 2025-11-07 14:35:51.977248 000003ffe013cbfa \
> perf_event_count_update event->count 0x0 count 0x1502e8
> 2025-11-07 14:35:51.977248 000003ffe013cbfa \
> perf_event_count_update event->count 0x1502e8 count 0x1502e8
> 2025-11-07 14:35:51.977248 000003ffe013cbfa \
> perf_event_count_update event->count 0x2a05d0 count 0x1502e8
> 2025-11-07 14:35:51.977252 000003ffe013cbfa \
> perf_event_count_update event->count 0x3f08b8 count 0x1502e8
> 2025-11-07 14:35:51.977252 000003ffe013cbfa \
> perf_event_count_update event->count 0x540ba0 count 0x1502e8
> 2025-11-07 14:35:51.977253 000003ffe013cbfa \
> perf_event_count_update event->count 0x690e88 count 0x1502e8
> 2025-11-07 14:35:51.977254 000003ffe013cbfa \
> perf_event_count_update event->count 0x7e1170 count 0x1502e8
> 2025-11-07 14:35:51.977254 000003ffe013cbfa \
> perf_event_count_update event->count 0x931458 count 0x1502e8
> 2025-11-07 14:35:51.977254 000003ffe013cbfa \
> perf_event_count_update event->count 0xa81740 count 0x1502e8
>
> 3. The value is constantly increasing by the number of instructions
> executed to generate a sample entry. This is the first line of the
> pairs of lines. count 0x1502e8 --> 1377000
>
> # perf script | grep 1377000 | wc -l
> 214
> # perf script | wc -l
> 428
> #
> That is 428 lines in total, and half of the lines contain value
> 1377000.
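
The numbers in steps 1 to 3 can be cross-checked with plain shell
arithmetic. This is only a sketch: the variable names are made up for
illustration, the values are copied from the debug output above, and
nothing here touches perf itself.

```shell
# Hypothetical variable names; values copied from the quoted debug output.
sample_size=32          # bytes per hardware sample
page_size=4096          # one 4KB hardware buffer page
samples_per_irq=$((page_size / sample_size))

interval=$((0x1502e8))  # 'count' field in every debug line
third=$((0x2a05d0))     # event->count in the third debug line

echo "samples per interrupt: $samples_per_irq"
echo "sample interval: $interval"
echo "third event->count / interval: $((third / interval))"
```

The third event->count value is an exact multiple of the interval, which
matches the statement that the count grows by a constant amount per sample.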
>
> 4. The second event cycles is opened against the counting PMU, which
> is an independent PMU and is not interrupt driven. Once enabled it
> runs in the background and keeps running, incrementing silently
> about 400+ counters. The counter values are read via assembly
> instructions.
>
> The read callback function of this second, counting PMU is called when
> the interrupt handler of the sampling facility processes each sample. The
> function call sequence is:
>
> perf_event_overflow()
> +--> __perf_event_overflow()
> +--> __perf_event_output()
> +--> perf_output_sample()
> +--> perf_output_read()
> +--> perf_output_read_group()
> for_each_sibling_event(sub, leader) {
> values[n++] = perf_event_count(sub, self);
> printk("%s sub %p values %#lx\n", __func__, sub, values[n-1]);
> }
>
> The last function perf_event_count() is invoked on the second event
> cycles *on* the counting PMU. An added printk statement shows the
> following lines in the dmesg output:
>
> # dmesg|grep perf_output_read_group |head -10
> [ 332.368620] perf_output_read_group sub 00000000d80b7c1f values 0x3a80917 (1)
> [ 332.368624] perf_output_read_group sub 00000000d80b7c1f values 0x3a86c7f (2)
> [ 332.368627] perf_output_read_group sub 00000000d80b7c1f values 0x3a89c15 (3)
> [ 332.368629] perf_output_read_group sub 00000000d80b7c1f values 0x3a8c895 (4)
> [ 332.368631] perf_output_read_group sub 00000000d80b7c1f values 0x3a8f569 (5)
> [ 332.368633] perf_output_read_group sub 00000000d80b7c1f values 0x3a9204b
> [ 332.368635] perf_output_read_group sub 00000000d80b7c1f values 0x3a94790
> [ 332.368637] perf_output_read_group sub 00000000d80b7c1f values 0x3a9704b
> [ 332.368638] perf_output_read_group sub 00000000d80b7c1f values 0x3a99888
> #
>
> This correlates with the output of
> # perf report -D | grep 'id 00000000000000'|head -10
> ..... id 0000000000000006, value 00000000001502e8, lost 0
> ..... id 000000000000000e, value 0000000003a80917, lost 0 --> line (1) above
> ..... id 0000000000000006, value 00000000002a05d0, lost 0
> ..... id 000000000000000e, value 0000000003a86c7f, lost 0 --> line (2) above
> ..... id 0000000000000006, value 00000000003f08b8, lost 0
> ..... id 000000000000000e, value 0000000003a89c15, lost 0 --> line (3) above
> ..... id 0000000000000006, value 0000000000540ba0, lost 0
> ..... id 000000000000000e, value 0000000003a8c895, lost 0 --> line (4) above
> ..... id 0000000000000006, value 0000000000690e88, lost 0
> ..... id 000000000000000e, value 0000000003a8f569, lost 0 --> line (5) above
>
> Summary:
> - The above command starts the CPU sampling facility, which runs
> interrupt driven and raises an interrupt when a 4KB page is full. The
> interrupt handler processes the 128 samples and eventually calls
> perf_output_read_group() for each sample to save it in the event's
> ring buffer.
>
> - At that time the CPU counting facility is invoked to read the value of
> the event cycles. This value is saved as the second value in the
> sample_read structure.
>
> - The first and all other odd lines in the perf script output display
> the period value between 2 samples being created by hardware. It is the
> number of instructions executed before the hardware writes a sample.
>
> - The second and all other even lines in the perf script output display
> the number of CPU cycles needed to process each sample and save it in
> the event's ring buffer.
> These 2 different values can never be identical on s390.
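
To make the last point concrete, the deltas between consecutive reads of
the second event can be compared with the fixed period of the leader. A
rough sketch using the values from the dmesg output above (the variable
names are illustrative only):

```shell
# Hypothetical variable names; values copied from the quoted output.
interval=$((0x1502e8))     # leader period, constant for every sample
read1=$((0x3a80917))       # dmesg line (1)
read2=$((0x3a86c7f))       # dmesg line (2)
read3=$((0x3a89c15))       # dmesg line (3)

delta12=$((read2 - read1))
delta23=$((read3 - read2))

echo "leader period: $interval"
echo "sibling deltas: $delta12 $delta23"
```

The sibling deltas differ from each other and from the leader period,
which is consistent with the two values coming from independent PMUs.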
>
> Since event leader sampling is not possible on s390, the perf tool will
> return EOPNOTSUPP soon. Prepare the test case for that.
>
> Suggested-by: James Clark <james.clark@...aro.org>
> Signed-off-by: Thomas Richter <tmricht@...ux.ibm.com>
> Tested-by: Jan Polensky <japo@...ux.ibm.com>
> Reviewed-by: Jan Polensky <japo@...ux.ibm.com>
> ---
> tools/perf/tests/shell/record.sh | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/tests/shell/record.sh b/tools/perf/tests/shell/record.sh
> index 0f5841c479e7..46b96d565680 100755
> --- a/tools/perf/tests/shell/record.sh
> +++ b/tools/perf/tests/shell/record.sh
> @@ -260,7 +260,21 @@ test_uid() {
>
> test_leader_sampling() {
> echo "Basic leader sampling test"
> - if ! perf record -o "${perfdata}" -e "{cycles,cycles}:Su" -- \
> + events="{cycles,cycles}:Su"
> + [ $(uname -m) = "s390x" ] && {
> + [ ! -d /sys/devices/cpum_sf ] && {
> + echo "No CPUMF [Skipped record]"
> + return
> + }
> + events="{cpum_sf/SF_CYCLES_BASIC/,cycles}:Su"
> + perf record -o "${perfdata}" -e "$events" -- perf test -w brstack 2> /dev/null
> + # Perf grouping might be unsupported, depends on version.
> + [ "$?" -ne 0 ] && {
> + echo "Grouping not support [Skipped record]"
> + return
> + }
> + }
> + if ! perf record -o "${perfdata}" -e "$events" -- \
> perf test -w brstack 2> /dev/null
> then
> echo "Leader sampling [Failed record]"
> --
> 2.52.0