Message-ID: <5ejq3hjpoy3gxft2jbmoa5m656usetuxcs7g3ezyyiitj67rav@r5jhdz27foat>
Date:   Mon, 11 Sep 2023 14:51:26 +0200
From:   Maxime Ripard <mripard@...nel.org>
To:     Michel Dänzer <michel.daenzer@...lbox.org>
Cc:     Daniel Stone <daniels@...labora.com>, emma@...olt.net,
        linux-doc@...r.kernel.org, vignesh.raman@...labora.com,
        dri-devel@...ts.freedesktop.org, alyssa@...enzweig.io,
        jbrunet@...libre.com, robdclark@...gle.com, corbet@....net,
        khilman@...libre.com, sergi.blanch.torne@...labora.com,
        david.heidelberg@...labora.com, linux-rockchip@...ts.infradead.org,
        tzimmermann@...e.de, martin.blumenstingl@...glemail.com,
        robclark@...edesktop.org, Helen Koike <helen.koike@...labora.com>,
        anholt@...gle.com, linux-mediatek@...ts.infradead.org,
        matthias.bgg@...il.com, linux-amlogic@...ts.infradead.org,
        gustavo.padovan@...labora.com,
        linux-arm-kernel@...ts.infradead.org,
        angelogioacchino.delregno@...labora.com, neil.armstrong@...aro.org,
        guilherme.gallo@...labora.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v11] drm: Add initial ci/ subdirectory

On Mon, Sep 11, 2023 at 02:13:43PM +0200, Michel Dänzer wrote:
> On 9/11/23 11:34, Maxime Ripard wrote:
> > On Thu, Sep 07, 2023 at 01:40:02PM +0200, Daniel Stone wrote:
> >> Yeah, this is what our experience with Mesa (in particular) has taught us.
> >>
> >> Having 100% of the tests pass 100% of the time on 100% of the platforms is a
> >> great goal that everyone should aim for. But it will also never happen.
> >>
> >> Firstly, we're just not there yet today. Every single GPU-side DRM driver
> >> has userspace-triggerable faults which cause occasional errors in GL/Vulkan
> >> tests. Every single one. We deal with these in Mesa by retrying; if we
> >> didn't retry, across the breadth of hardware we test, I'd expect 99% of
> >> should-succeed merges to fail because of these intermittent bugs in the DRM
> >> drivers.
> > 
> > So the plan is only to ever test rendering devices? It should have been
> > made clearer then.
> > 
> >> We don't have the same figure for KMS - because we don't test it - but
> >> I'd be willing to bet no driver is 100% if you run tests often enough.
> > 
> > And I would still consider that a bug that we ought to fix, and
> > certainly not something we should sweep under the rug. If half the tests
> > are not running on a driver, then fine, they aren't. I'm not really
> > against having failing tests, I'm against not flagging unreliable tests
> > on a given hardware as failing tests.
> 
> A flaky test will by definition give a "pass" result at least some of
> the time, which would be considered a failure by the CI if the test is
> marked as failing.
> 
> 
> >> Secondly, we will never be there. If we could pause for five years and sit
> >> down making all the current usecases for all the current hardware on the
> >> current kernel run perfectly, we'd probably get there. But we can't: there's
> >> new hardware, new userspace, and hundreds of new kernel trees.
> > 
> > Not with that attitude :)
> 
> Attitude is not the issue, the complexity of the multiple systems
> involved is.

FTR, that was a meme/joke.

> > I'm not sure it's actually an argument, really. 10 years ago, we would
> > never have been at "every GPU on the market has an open-source driver"
> > here. 5 years ago, we would never have been at this-series-here. That
> > didn't stop anyone making progress, everyone involved in that thread
> > included.
> 
> Even assuming perfection is achievable at all (which is very doubtful,
> given the experience from the last few years of CI in Mesa and other
> projects), if you demand perfection before even taking the first step,
> it will never get off the ground.

Perfection and scale from the get-go aren't reasonable, agreed. Building
a small, "perfect" (your words, not mine) system that you can later
expand is doable, though. And that's very much a design choice.

> > How are we even supposed to detect those failures in the first
> > place if tests are flagged as unreliable?
> 
> Based on experience with Mesa, only a relatively small minority of
> tests should need to be marked as flaky / not run at all. The majority
> of tests are reliable and can catch regressions even while some tests
> are not yet.

I understand and acknowledge that it worked with Mesa. That's great for
Mesa. That still doesn't mean it's a panacea that will work for every
project.

> > No matter what we do here, what you describe will always happen. Like,
> > if we do flag those tests as unreliable, what exactly prevents another
> > issue from piling on top undetected, and what will happen when we
> > re-enable testing?
> 
> Any issues affecting a test will need to be fixed before (re-)enabling
> the test for CI.

If that underlying issue is never fixed, at what point do we consider it
a failure that should never be re-enabled? And who makes that call?

> > On top of that, you kind of hinted at that yourself, but what set of
> > tests will pass is a property linked to a single commit. Having that
> > list within the kernel already alters that: you'll need to merge a new
> > branch, add a bunch of fixes and then update the test state list. You
> > won't have the same tree you originally tested (and defined the test
> > state list for).
> 
> Ideally, the test state lists should be changed in the same commits
> which affect the test results. It'll probably take a while yet to get
> there for the kernel.
> 
> > It might or might not be an issue for Linus' release, but I can
> > definitely see the trouble already for stable releases where fixes will
> > be backported, but the test state list certainly won't be updated.
> 
> If the stable branch maintainers want to take advantage of CI for the
> stable branches, they may need to hunt for corresponding state list
> commits sometimes. They'll need to take that into account for their
> decision.

So we just expect the stable maintainers to track each and every patch
involved in a test run, make sure that they are all in a stable tree, and
then update the test list? Without having consulted them at all?

> >> By keeping those sets of expectations, we've been able to keep Mesa pretty
> >> clear of regressions, whilst having a very clear set of things that should
> >> be fixed to point to. It would be great if those set of things were zero,
> >> but it just isn't. Having that is far better than the two alternatives:
> >> either not testing at all (obviously bad), or having the test always be red
> >> so it's always ignored (might as well just not test).
> > 
> > Isn't that what happens with flaky tests anyway?
> 
> For a small minority of tests. Daniel was referring to whole test suites.
> 
> > Even more so since we have 0 context when updating that list.
> 
> The commit log can provide whatever context is needed.

Sure, I've yet to see that though.

There are around 240 reported flaky tests in 6.6-rc1. None of them have
any context. That new series adds a few dozen more, without any context
either. And there's no mention of any plan to change that, or a patch
adding a new policy for all tests going forward.

So I'm still fairly doubtful it will ever happen.
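To be concrete about what "context" could mean, something as simple as a
comment block next to each entry would already help (purely hypothetical
format and values, not something that exists in drm/ci today):

# Board: mt8173-elm-hana
# Kernel: v6.6-rc1
# IGT version: 1.28
# Failure mode: times out roughly once every ~50 runs
# Tracking: <link to a bug report>
kms_some_test@some-subtest

That's the kind of information that would let a driver maintainer (today
or in a year) know where to even start.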

> > I've asked a couple of times, I'll ask again. In that other series, on
> > the MT8173, kms_hdmi_inject@...ect-4k is marked as flaky (which is a
> > KMS test btw).
> > 
> > I'm a maintainer for that part of the kernel, and I'd like to look
> > into it, because it's seriously something that shouldn't fail, ever;
> > the hardware isn't involved.
> > 
> > How can I figure out now (or worse, let's say in a year) how to
> > reproduce it? What kernel version was affected? With what board? After
> > how many occurrences?
> > 
> > Basically, how can I see that the bug is indeed there (or got fixed
> > since), and how to start fixing it?
> 
> Many of those things should be documented in the commit log of the
> state list change.
> 
> How the CI works in general should be documented in some appropriate
> place in tree.

I think I'll stop the discussion there. It was merged anyway, so I'm not
quite sure why I was asked to give my feedback on this. Every concern I
raised was met with a giant "it worked on Mesa" handwave or "someone will
probably work on it at some point".

And fine, guess I'm wrong.

Thanks
Maxime

