Message-ID: <CANpmjNPrssMSw+jaP1ods1NZLxnk+bO_P9FUmRLs-ENEMPCcgg@mail.gmail.com>
Date: Wed, 13 Mar 2024 15:17:28 +0100
From: Marco Elver <elver@...gle.com>
To: Arnaldo Carvalho de Melo <acme@...nel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>, linux-perf-users@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Adrian Hunter <adrian.hunter@...el.com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Ian Rogers <irogers@...gle.com>, 
	Ingo Molnar <mingo@...hat.com>, Jiri Olsa <jolsa@...nel.org>, Mark Rutland <mark.rutland@....com>, 
	Namhyung Kim <namhyung@...nel.org>, Peter Zijlstra <peterz@...radead.org>, 
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH v2 0/4] perf: Make SIGTRAP and __perf_pending_irq() work
 on RT.

On Wed, 13 Mar 2024 at 14:47, Arnaldo Carvalho de Melo <acme@...nel.org> wrote:
>
> On Wed, Mar 13, 2024 at 10:28:44AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> > > One part I don't get: did you let it run or did you kill it?
>
> > If I let them run, they will finish and exit; no exec_child remains.
>
> > If I instead try to stop the loop that goes on forking the 100 of them,
> > then the exec_child processes remain spinning.
>
> > > `exec_child' spins until a signal is received or the parent kills it.
> > > So it shouldn't remain there forever. And my guess is that it is
> > > spinning in userland and not in the kernel.
>
> > Checking that now:
>
> tl;dr: the tight loop; full details at the end.
>
> 100.00  b6:   mov    signal_count,%eax
>               test   %eax,%eax
>             ↑ je     b6
>
> remove_on_exec.c
>
> /* For exec'd child. */
> static void exec_child(void)
> {
>         struct sigaction action = {};
>         const int val = 42;
>
>         /* Set up sigtrap handler in case we erroneously receive a trap. */
>         action.sa_flags = SA_SIGINFO | SA_NODEFER;
>         action.sa_sigaction = sigtrap_handler;
>         sigemptyset(&action.sa_mask);
>         if (sigaction(SIGTRAP, &action, NULL))
>                 _exit((perror("sigaction failed"), 1));
>
>         /* Signal parent that we're starting to spin. */
>         if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
>                 _exit((perror("write failed"), 1));
>
>         /* Should hang here until killed. */
>         while (!signal_count);
> }
>
> So probably just a test that needs a bit more polish?

Yes, possible.
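
If someone does polish it, one option (a minimal sketch, not the
actual selftest change; the helper name and the 10s budget are made
up, and it only roughly mirrors the selftest's signal_count /
sigtrap_handler pair) would be to bound the wait:

#include <signal.h>
#include <time.h>
#include <unistd.h>

static volatile int signal_count;

/* Roughly what the selftest's handler amounts to: note the SIGTRAP. */
static void sigtrap_handler(int sig, siginfo_t *info, void *uctx)
{
	signal_count++;
}

/* Hypothetical replacement for the bare `while (!signal_count);`:
 * poll with a deadline so a stray child eventually exits on its own
 * instead of spinning forever. */
static void spin_until_sigtrap_or_deadline(void)
{
	time_t deadline = time(NULL) + 10;	/* ~10s budget */

	while (!signal_count) {
		if (time(NULL) > deadline)
			_exit(1);	/* never got the SIGTRAP; give up */
		usleep(1000);		/* don't burn a CPU while waiting */
	}
}

That said, the busy-spin may well be intentional (keeping the child
on-CPU while the SIGTRAP is pending), in which case only the deadline
part would apply.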

> Seems like it: on a newer, faster machine I managed to reproduce it on
> a non-RT kernel, with one exec_child remaining:
>
>   1.44  b6:   mov   signal_count,%eax
>               test  %eax,%eax
>  98.56      ↑ je    b6

It's unclear to me why that happens. But I do recall seeing it before,
and my explanation was that with too many concurrent copies of the
test, the system simply ran out of resources (memory, perhaps): the
stress test itself spawns 30 parallel copies of the "exec_child"
subprocess, so with 100 parallel copies of the test we end up with
30 * 100 processes. Maybe that's too much?
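
(For scale: 30 * 100 = 3000 processes, each busy-spinning at 100% CPU
on `while (!signal_count);` until its SIGTRAP arrives, as the profile
above shows.)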

In any case, if the kernel didn't fall over during that kind of stress
testing, and the test itself passes when run as a single copy, then
I'd conclude all looks good.

This particular perf feature, along with testing it, once before
melted Peter's brain and mine [1]. I hope your experience didn't
result in complete brain-melt. ;-)

[1] https://lore.kernel.org/all/Y0VofNVMBXPOJJr7@elver.google.com/

Thanks,
-- Marco
