[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cc2ac7cb99e09a3898f033690fd0b7978d0b120d.camel@redhat.com>
Date: Tue, 06 Feb 2024 10:18:33 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>, Jakub Kicinski
	 <kuba@...nel.org>
Cc: netdev@...r.kernel.org, davem@...emloft.net, edumazet@...gle.com, 
	linux-kselftest@...r.kernel.org, Willem de Bruijn <willemb@...gle.com>, 
	Matthieu Baerts <matttbe@...nel.org>
Subject: Re: [PATCH net-next] selftests/net: ignore timing errors in
 so_txtime if KSFT_MACHINE_SLOW
On Fri, 2024-02-02 at 19:31 -0500, Willem de Bruijn wrote:
> Jakub Kicinski wrote:
> > On Thu,  1 Feb 2024 11:21:19 -0500 Willem de Bruijn wrote:
> > > This test is time sensitive. It may fail on virtual machines and for
> > > debug builds.
> > > 
> > > Continue to run in these environments to get code coverage. But
> > > optionally suppress failure for timing errors (only). This is
> > > controlled with environment variable KSFT_MACHINE_SLOW.
> > > 
> > > The test continues to return 0 (KSFT_PASS), rather than KSFT_XFAIL
> > > as previously discussed. Because making so_txtime.c return that and
> > > then making so_txtime.sh capture runs that pass that vs KSFT_FAIL
> > > and pass it on added a bunch of (fragile bash) boilerplate, while the
> > > result is interpreted the same as KSFT_PASS anyway.
> > 
> > FWIW another idea that came up when talking to Matthieu -
> > isolate the VMs which run time-sensitive tests to dedicated
> > CPUs. Right now we kick off around 70 4 CPU VMs and let them 
> > battle for 72 cores. The machines don't look overloaded but
> > there can be some latency spikes (CPU use diagram attached).
> > 
> > So the idea would be to have a handful of special VMs running 
> > on dedicated CPUs without any CPU time competition. That could help 
> > with latency spikes. But we'd probably need to annotate the tests
> > which need some special treatment.
> > 
> > Probably too much work both to annotate tests and set up env,
> > but I thought I'd bring it up here in case you had an opinion.
> 
> I'm not sure whether the issue with timing in VMs is CPU affinity.
> Variance may just come from expensive hypercalls, even with a
> dedicated CPU. Though tests can tell.
FTR, I think the CPU affinity setup is a bit too complex, and hard to
reproduce for 3rd parties willing to investigate eventual future CI
failures, I think the current env-variable-based approach would help
with reproducibility.
> There's still the debug builds, as well.
I understand/hope you are investigating it? 
Cheers,
Paolo
Powered by blists - more mailing lists
 
