linux-kernel - Re: [PATCH 2/2] perf tests: add test for trace output loss

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAEmfU+vfYY8jVD-mLuRBVrk9M8OFCNiptXQjL1Rf+S5XtjiDVg@mail.gmail.com>
Date: Fri, 8 Nov 2024 08:59:11 -0800
From: Benjamin Peterson <benjamin@...flow.com>
To: Howard Chu <howardchu95@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Ian Rogers <irogers@...gle.com>, Adrian Hunter <adrian.hunter@...el.com>, 
	"Liang, Kan" <kan.liang@...ux.intel.com>, open list <linux-kernel@...r.kernel.org>, 
	"open list:PERFORMANCE EVENTS SUBSYSTEM" <linux-perf-users@...r.kernel.org>
Subject: Re: [PATCH 2/2] perf tests: add test for trace output loss

On Fri, Nov 8, 2024 at 8:20 AM Howard Chu <howardchu95@...il.com> wrote:
>
> Hello Benjamin,
>
> On Thu, Nov 7, 2024 at 3:18 PM Benjamin Peterson <benjamin@...flow.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 2:07 PM Howard Chu <howardchu95@...il.com> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Nov 6, 2024 at 3:46 PM Benjamin Peterson <benjamin@...flow.com> wrote:
> > > >
> > > > Add a test that checks that trace output is not lost to races. This is
> > > > accomplished by tracing the exit_group syscall of "true" multiple times and
> > > > checking for correct output.
> > > >
> > > > Conveniently, this test also serves as a regression test for 5fb8e56542a3 ("perf
> > > > trace: avoid garbage when not printing a trace event's arguments") because
> > > > exit_group triggers the previously buggy printing behavior.
> > > >
> > > > Signed-off-by: Benjamin Peterson <benjamin@...flow.com>
> > > > ---
> > > >  tools/perf/tests/shell/trace_exit_race.sh | 31 +++++++++++++++++++++++
> > > >  1 file changed, 31 insertions(+)
> > > >  create mode 100755 tools/perf/tests/shell/trace_exit_race.sh
> > > >
> > > > diff --git a/tools/perf/tests/shell/trace_exit_race.sh b/tools/perf/tests/shell/trace_exit_race.sh
> > > > new file mode 100755
> > > > index 000000000000..8b70324bc5b4
> > > > --- /dev/null
> > > > +++ b/tools/perf/tests/shell/trace_exit_race.sh
> > > > @@ -0,0 +1,31 @@
> > > > +#!/bin/sh
> > > > +# perf trace exit race
> > > > +# SPDX-License-Identifier: GPL-2.0
> > > > +
> > > > +# Check that the last events of a perf trace'd subprocess are not
> > > > +# lost. Specifically, trace the exiting syscall of "true" 100 times and ensure
> > > > +# the output contains 100 correct lines.
> > > > +
> > > > +# shellcheck source=lib/probe.sh
> > > > +. "$(dirname $0)"/lib/probe.sh
> > > > +
> > > > +skip_if_no_perf_trace || exit 2
> > > > +
> > > > +trace_shutdown_race() {
> > > > +       for i in $(seq 100); do
> > > > +               perf trace -e syscalls:sys_enter_exit_group true 2>>$file
> > > > +       done
> > > > +       [ $(grep -c -E " +[0-9]+\.[0-9]+ +true/[0-9]+ syscalls:sys_enter_exit_group\(\)$" $file) = "100" ]
> > >
> > > The test failed due to regex mismatched, I think because of this:
> > >
> > >      0.000 true/1526046 syscalls:sys_enter_exit_group( )
> > >      0.000 true/1526212 syscalls:sys_enter_exit_group( )
> > >      0.000 true/1526383 syscalls:sys_enter_exit_group(SSSSSSSSSSSSSSSS)
> > >
> > > Ironically the junk buffer problem you fixed last time, thanks for
> > > finding another printing problem in perf trace :), I'll figure out the
> > > cause of this.
> >
> > Thanks, the problem is a parallel issue with trace__fprintf_sys_enter.
> > I will include a patch for the problem in my spin of this series.
> >
> > >
> > > Another thing is this test takes a long time to finish
> > >
> > > perf $ time ./perf test 109
> > > 109: perf trace exit race                                            : FAILED!
> > >
> > > real    0m38.762s
> > > user    0m15.090s
> > > sys     0m21.794s
> > >
> > > Is it really necessary to run it 100 times? To me it seems to be just
> > > a wrong handling logic of draining samples, will there be coincidence?
> >
> > Yes, it's racy. In my testing, sometimes the correct output would
> > appear. However, I will reduce the number of iterations to 10.
>
> Could you be a bit more specific about this race condition issue?

>From the code, it's apparent that if the SIGCHLD from the traced
process is seen before the exit_group trace event, output will be
lost. If the signal is received after the trace event, the trace event
will be correctly printed.

> Could you please tell us how one might emulate and test the scenario
> you mentioned? I ran the command 1,000 times before and after applying
> the patch, and each time I got the expected output. Could you please
> provide more information about your use cases?

I can see it simply by running the unpatched perf few times in the
terminal, but it might be helpful for reproduction to create some
scheduling contention with stress.