Message-ID: <20221128151623.GI4001@paulmck-ThinkPad-P17-Gen-1>
Date: Mon, 28 Nov 2022 07:16:23 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Zhouyi Zhou <zhouzhouyi@...il.com>, fweisbec@...il.com,
mingo@...nel.org, dave@...olabs.net, josh@...htriplett.org,
mpe@...erman.id.au, linuxppc-dev@...ts.ozlabs.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu

On Mon, Nov 28, 2022 at 09:12:28AM +0100, Thomas Gleixner wrote:
> On Sun, Nov 27 2022 at 09:53, Paul E. McKenney wrote:
> > On Sun, Nov 27, 2022 at 01:40:28PM +0100, Thomas Gleixner wrote:
> >> There are quite a few reasons why a CPU hot-plug or hot-unplug
> >> operation can fail, and such a failure is not really a fatal problem.
> >>
> >> So if a CPU hotplug operation fails, then why can't the torture test
> >> just move on and validate that the system still behaves correctly?
> >>
> >> That gives us more coverage than just testing the good case and giving
> >> up when something unexpected happens.
> >
> > Agreed, with access to a function like the tick_nohz_full_timekeeper()
> > suggested earlier in this email thread, then yes, it would make sense to
> > try to offline the CPU anyway, then forgive the failure in cases where
> > the CPU matches that indicated by tick_nohz_full_timekeeper().
>
> Why special-case this? There are other valid reasons why offlining can
> fail. So we special-case the timekeeper today, and then next week we
> special-case something else just because. That does not make sense. If
> it fails, there is a reason, and you can log it. The important part is
> that the system is functional and stable after the failure and the
> rollback.

Perhaps there are other valid reasons, but they have not been showing up
in my torture-test runs for well over a decade. Not saying that they
don't happen, of course. But if they involved (say) cgroups, then my
test setup would not exercise them.

So are you looking to introduce spurious CPU-hotplug failures? If so,
these will also affect things like suspend/resume. Plus it will make
it much more difficult to detect real but intermittent CPU-hotplug bugs,
which is the motivation for special-casing the tick_nohz_full_timekeeper()
failures.
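
For example, given the tick_nohz_full_timekeeper() accessor suggested
earlier in this thread (which, to be clear, does not yet exist), the
offline path in torture_offline() might do something like this rough
sketch:

	ret = remove_cpu(cpu);
	if (ret && cpu == tick_nohz_full_timekeeper()) {
		/*
		 * Sketch only:  Forgive failure to offline the CPU
		 * that the proposed tick_nohz_full_timekeeper()
		 * accessor says is currently doing timekeeping.
		 */
		if (verbose)
			pr_alert("%s" TORTURE_FLAG
				 "torture_onoff task: offline %d forgiven (timekeeping CPU)\n",
				 torture_type, cpu);
		ret = 0;
	}
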
So we should discuss introduction of any spurious failures that might
be under consideration.

Independently of that, the torture_onoff() functions can of course keep
some sort of histogram of the failure return codes. Or are there other
failure indications that should be captured?
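
For example, here is a rough sketch with invented names that bins the
errno value from each failed offline attempt and dumps the non-empty
bins at torture_cleanup() time:

	#define ONOFF_ERRNO_MAX 64	/* Invented bound. */
	static unsigned long n_offl_errnos[ONOFF_ERRNO_MAX];
	int i;

	ret = remove_cpu(cpu);
	if (ret < 0 && -ret < ONOFF_ERRNO_MAX)
		n_offl_errnos[-ret]++;	/* Bin indexed by errno. */

	/* Later, at cleanup time: */
	for (i = 0; i < ONOFF_ERRNO_MAX; i++)
		if (n_offl_errnos[i])
			pr_alert("%s" TORTURE_FLAG " offline errno -%d: %lu times\n",
				 torture_type, i, n_offl_errnos[i]);
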
> >> I even argue that the torture test should inject random failures into
> >> the hotplug state machine to achieve extended code coverage.
> >
> > I could imagine torture_onoff() telling various CPU-hotplug notifiers
> > to refuse the transition using some TBD interface.
>
> There is already an interface exposed via sysfs which allows you to
> enforce a "fail" at a defined hotplug state.

If you would like me to be testing this as part of my normal testing
regimen, I will need an in-kernel interface. Such an interface is of
course not needed for modprobe-style testing, in which case the script
doing the modprobe and rmmod can simply manipulate the sysfs files.
But I don't do that sort of testing very often. And when I do, it is
almost always with kernels configured for Meta's fleet, which almost
never do CPU-offline operations.
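
To be concrete, what I would need is an in-kernel analog of that sysfs
"fail" file, along the following lines, where the function below does
not exist and its name and signature are invented for illustration:

	/*
	 * Hypothetical in-kernel analog of the sysfs file
	 * /sys/devices/system/cpu/cpuN/hotplug/fail.
	 */
	int cpuhp_set_fail_state(unsigned int cpu, enum cpuhp_state state);

	/* torture_onoff() could then occasionally inject a failure: */
	if (!(torture_random(&rand) % 16))
		WARN_ON_ONCE(cpuhp_set_fail_state(cpu, CPUHP_AP_ONLINE_DYN));
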
							Thanx, Paul

> > That would better test the CPU-hotplug common code's ability to deal
> > with failures.
>
> Correct.
>
> > Or did you have something else/additional in mind?
>
> No.
>
> Thanks,
>
> tglx