Date:	Thu, 15 Jul 2010 00:23:28 -0700 (PDT)
From:	david@...g.hm
To:	David Newall <davidn@...idnewall.com>
cc:	Stefan Richter <stefanr@...6.in-berlin.de>,
	Marcin Letyns <mletyns@...il.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: stable? quality assurance?

On Tue, 13 Jul 2010, David Newall wrote:

> (Segue to a problem which follows from calling bleeding-edge kernels 
> "stable".)
>
> When reporting bugs, the first response is often, "we're not interested in 
> such an old kernel; try it with the latest."  That's not hugely useful when 
> the latest kernels are not suitable for production use.  If kernels weren't 
> marked stable until they had earned the moniker, for example 2.6.27, then the 
> expectation of developers and of users would be consistent: developers could 
> expect users to try it again with latest stable kernel, and users could 
> reasonably expect that trying it wouldn't break their system.

2.6.27 didn't get declared 'stable' because it had very few bugs; it was 
declared 'stable' because someone volunteered to maintain it longer and 
back-port patches to it long past the normal process.

2.6.32 got declared 'long-term stable' before 2.6.33 was released, again 
not because it was especially good, but because it didn't appear to be 
especially bad and several distros were shipping kernels based on it, so 
again someone volunteered (or was volunteered by the distro that pays 
their paycheck) to back-port patches to it longer.

I have been running kernel.org kernels on my production systems for >13 
years. I am _very_ short of time, so I generally don't get a chance to 
test the -rc kernels (once in a while I do get a chance to do so on my 
laptop). What I do is, every 2-3 kernel releases, wait a couple of days after 
the kernel release to see if there are show-stopper bugs, and if nothing 
shows up (which is the common case for the last several years) I compile a 
kernel and load it on machines in my lab. I try to have a selection of 
machines that match the systems I have in production in what I have found 
are the 'important' ways (a definition that changes once in a while when 
I find something that should 'just work' that doesn't ;-). This primarily 
includes systems with all the network card types and RAID card types that 
I use in production, but now also includes a machine with an SSD (after I 
found a bug that only affected that combination).
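
The coverage check itself is nothing fancy; here's a rough Python sketch of 
the idea (the host names and driver modules are made-up examples, not my 
real inventory):

# Rough sketch (not my actual tooling) of the lab-coverage check: every
# driver type used in production should exist on at least one lab box.
# Host names and module lists below are made-up examples.
production = {
    "web1": {"e1000e", "megaraid_sas"},
    "db1":  {"bnx2", "aacraid"},
    "log1": {"igb", "ahci"},      # the box that ended up needing an SSD
}
lab = {
    "lab1": {"e1000e", "megaraid_sas", "ahci"},
    "lab2": {"bnx2", "aacraid", "igb"},
}

needed  = set().union(*production.values())
covered = set().union(*lab.values())
missing = needed - covered
if missing:
    print("lab doesn't cover these production drivers:", sorted(missing))
else:
    print("every production driver type has a lab machine to break first")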

If my lab machines don't crash immediately, I leave them running (usually 
not even stress testing them, again lack of time) for a week or so, then I 
put the new kernel on my development machines, wait a few days, then put 
them on QA machines, wait a few days, then put them in production. I have 
the old kernel around so that I can re-boot into it if needed.
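
Spelled out, the promotion schedule is just a list of stages and rough wait 
times, with the previous kernel left installed at every step. A rough 
Python sketch (not a real deployment tool, just the steps above written 
down):

# Rough sketch of the promotion schedule described above; not a real
# deployment tool, just the stages and rough wait times written down.
STAGES = [
    ("lab machines",         7),   # leave them running for a week or so
    ("development machines", 3),   # wait a few days
    ("QA machines",          3),   # wait a few days
    ("production",           0),   # old kernel stays available for a re-boot
]

def promote(kernel_version):
    for stage, wait_days in STAGES:
        print("install %s on %s (previous kernel kept as a fallback)"
              % (kernel_version, stage))
        if wait_days:
            print("  wait ~%d days; boot back into the old kernel if anything breaks"
                  % wait_days)

promote("2.6.34")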

This tends to work very well for me. It's not perfect, and every couple of 
cycles I run into grief and have to report a bug to the kernel list. 
Usually I find it before I get into production, but I have run into cases 
that got all the way into production before I found a problem.

With the 'new' -stable series, I generally wait until at least 2.6.x.1 is 
released before I consider it ready to go anywhere outside my lab (I'll 
still install the 2.6.x kernel in the lab, but I'll wait for the 
additional testing that comes with the .1 stable kernels before moving it 
on).

I don't go through this entire process with the later -stable kernels. If 
I'm already running 2.6.x and there is a 2.6.x.y released that contains 
fixes that look like they are relevant to the configuration that I run 
(which rules out the majority of changes, since I do fairly minimal kernel 
configs), I will just smoke-test it in the lab, then schedule 
a rollout through the rest of my network. If there are no problems before 
I get permission to deploy to production, I put it on half my boxes, 
fail over to them, then wait a little bit (a day to a week) before 
upgrading the backups.
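
The "does this -stable release even apply to me" check is where the minimal 
config pays off. A rough Python sketch of that filter (the 
directory-to-CONFIG map is a made-up, partial example that you'd maintain 
by hand for the drivers you actually build):

# Rough sketch of the relevance filter: given the 2.6.x.y patch (a unified
# diff) and the .config I build from, does the release touch anything I
# actually run?  The directory-to-option map is a made-up, partial example.
import re

def touched_files(patch_path):
    """Files modified by the stable patch, taken from the diff headers."""
    files = set()
    with open(patch_path) as patch:
        for line in patch:
            m = re.match(r"^\+\+\+ [ab]/(\S+)", line)
            if m:
                files.add(m.group(1))
    return files

def enabled_options(config_path):
    """CONFIG_* options set to y or m in a (minimal) kernel config."""
    options = set()
    with open(config_path) as config:
        for line in config:
            m = re.match(r"^(CONFIG_\w+)=[ym]", line)
            if m:
                options.add(m.group(1))
    return options

# driver directories I actually build, and the options that pull them in
AREAS_I_CARE_ABOUT = {
    "drivers/net/e1000e/":    "CONFIG_E1000E",
    "drivers/scsi/megaraid/": "CONFIG_MEGARAID_SAS",
}

def release_is_relevant(patch_path, config_path):
    options = enabled_options(config_path)
    for path in touched_files(patch_path):
        if not path.startswith("drivers/"):
            return True        # core code: assume it matters
        for prefix, option in AREAS_I_CARE_ABOUT.items():
            if path.startswith(prefix) and option in options:
                return True
    return False

Anything outside drivers/ gets treated as relevant by default; the minimal 
config only buys me the ability to skip churn in drivers I don't build.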

This writeup actually makes it sound like I spend a lot of time working 
with kernels, but I really don't. I'll spend a couple of half days twice a year 
on testing, and then additional time rolling it out to the 150+ clusters 
of servers I have in place. If you can't spend at least this much time on 
the kernel, you are probably better off just running your distro kernel, 
but even there you really should do a very similar set of tests on its 
kernel releases.

There's another department in my company that uses distro kernels (big 
name distro, but I will avoid flames by not naming names) without the 
testing routine that I use, and my track record for stability compares 
favorably to theirs over the last 7 years or so (they haven't been 
running Linux as long as I have, so we can't go back as far ;-). They also 
do more updates than I do simply because they can't as easily look at the 
kernel release and decide it doesn't apply to them.

David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
