linux-kernel - Re: RFC: starting a kernel-testers group for newbies

Message-ID: <20080430071526.1bce202c@infradead.org>
Date:	Wed, 30 Apr 2008 07:15:26 -0700
From:	Arjan van de Ven <arjan@...radead.org>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Adrian Bunk <bunk@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Rafael J. Wysocki" <rjw@...k.pl>, davem@...emloft.net,
	linux-kernel@...r.kernel.org, jirislaby@...il.com,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: RFC: starting a kernel-testers group for newbies

On Thu, 1 May 2008 01:13:46 -0700
Andrew Morton <akpm@...ux-foundation.org> wrote:

> On Wed, 30 Apr 2008 00:03:38 -0700 Arjan van de Ven
> <arjan@...radead.org> wrote:
> 
> > > First of all:
> > > I 100% agree with Andrew that our biggest problems are in
> > > reviewing code and resolving bugs, not in finding bugs (we
> > > already have far too many unresolved bugs).
> > 
> > I would argue instead that we don't know which bugs to fix first.
> 
> <boggle>
> 
> How about "a bug which we just added"?  One which is repeatable. 
> Repeatable by a tester who is prepared to work with us on resolving
> it. Those bugs.
> 
> Rafael has a list of them.  We release kernels when that list still
> has tens of unfixed regressions dating back up to a couple of months.
> 


I know he does. But I will still argue that if that is all we work from, and treat
all of those equally, we're doing the wrong thing.
I'm sorry, but I really do not consider "ext4 doesn't compile on m68k", which is
on that list, to be as relevant as an "i915 drm driver crashes" bug which has been with
us for a while and is not on that list, just based on the total user base for either of those.

Does that mean nobody should fix the m68k bug?
Someone who cares about m68k for sure should work on it, or if it's easy for an ext4 developer,
sure. But if the ext4 person has to spend 8 hours on it figuring out cross compilers, I say
we're doing something very wrong here. (no offense to the m68k people, but there's just
a few of you; maybe I should have picked voyager instead)

Maybe that's a "boggle" for you; but for me that's symptomatic of where we are today:
We don't make (effective) prioritization decisions. Such decisions are hard, because it 
effectively means telling people "I'm sorry but your bug is not yet important". That's
unpopular, especially if the reporter is very motivated on lkml. And it will involve a 
certain amount of non-quantifiable judgement calls, which also means we won't always be
right. Another hard thing is that lkml is a very self-selective audience. A bug may be 
reported three times there, but never hit otherwise, while another bug might not be reported
at all (or only once) while thousands and thousands of people are hitting it.

Not that we're doing all that badly; we ARE fixing the bugs (at least the oopses/warnings) that
are frequently hit. So I wouldn't blindly say we're doing a bad job at prioritizing. I would
rather say that if we focus only on what is left afterwards without doing a reality check,
we'll *always* have a negative view of quality, since there will *always* be bugs we don't 
fix. Linux has well over ten million users (much more if you count embedded devices).
A lot of them will have "standard" hardware, and a bunch of them will have "weird" stuff.
Cosmic rays happen. As do overclocking and bad DIMMs. And some BIOSes are just weird etc etc.
If we do not prioritize effectively we'll be stuck forever chasing ghosts, or we'll be stuck
saying "our quality sucks" forever without making progress.

Another trap is to only look at what goes wrong, not at what goes right... we tend to only
see what goes wrong on lkml, and it's easy to fall into doomthinking that way.
Are we doing worse on quality? My (subjective) opinion is that we are doing better than last year.
We are focused more on quality. We are fixing the bugs that people hit most. We are fixing most
of the regressions (yes, not all). Subsystems are seeing flat or lower bugcounts/bugrates. Take ACPI:
the number of outstanding bugs *halved* over the last year. Of course you can pick a single
bug and say "but this one did not get fixed", but that just loses the big picture (and 
proves the point :). All of this with a growing userbase and a rate of development that's a bit
faster than last year as well.

Can we do better? Always. More testing will help. Both to detect things early, and by 
letting us figure out which bugs are important. Just saying "more testing is not relevant
because we're not even fixing the bugs we have now" is just incorrect. Sorry.
More testers help. A wider range of hardware/usages allows us to find better patterns
in the hard to track down bugs. More testers means more people willing to see if they
can diagnose the bugs at least somewhat themselves, via bisection or otherwise. That's important,
because that's the part of the problem that scales well with a growing userbase.