linux-kernel - Re: live kernel upgrades (was: live kernel patching design)

Date:	Mon, 23 Feb 2015 07:35:53 +0100
From:	Vojtech Pavlik <vojtech@...e.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Jiri Kosina <jkosina@...e.cz>, Ingo Molnar <mingo@...nel.org>,
	Josh Poimboeuf <jpoimboe@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Seth Jennings <sjenning@...hat.com>,
	linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Arjan van de Ven <arjan@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Borislav Petkov <bp@...en8.de>, live-patching@...r.kernel.org
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:

> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <jkosina@...e.cz> wrote:
> 
> > But if you ask the folks who are hungry for live bug patching, they 
> > wouldn't care.
> > 
> > You mentioned "10 seconds", that's more or less equal to infinity to them. 
> 
> 10 seconds outage is unacceptable, but we're running our service on a
> single machine with no failover.  Who is doing this??

This is the most common argument that's raised when live patching is
discussed. "Why do we need live patching when we have redundancy?"

People who are asking for live patching typically do have failover in
place, but prefer not to have to use it when they don't have to.

In many cases, the failover just can't be made transparent to the
outside world and there is a short outage. Examples would be legacy
applications which can't run in an active-active cluster and need to be
restarted on failover. Or trading systems, where the calculations must
be strictly serialized and response times are counted in tens of
microseconds. 

Another use case is large HPC clusters, where all nodes have to run
carefully synchronized. Once one gets behind in a calculation cycle,
others have to wait for the results and the efficiency of the whole
cluster goes down. There are people who run realtime on them for
that reason. Dumping all data and restarting the HPC cluster takes a lot
of time and many nodes (out of tens of thousands) may not come back up,
making the restore from media difficult. Doing a rolling upgrade causes
the nodes to stall one by one for 10+ seconds each, which times 10k is
a long time, too.
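
To put a rough number on "10+ seconds times 10k", here is a minimal
back-of-the-envelope sketch. It assumes the stalls are strictly
serialized (one node at a time) and that each stall holds up the whole
cluster; the function name and constants are illustrative only, taken
from the round figures above, not from any measurement.

```python
# Illustrative figures from the message: ~10 s stall per node, ~10k nodes.
STALL_SECONDS_PER_NODE = 10
NODE_COUNT = 10_000


def rolling_upgrade_stall(nodes: int, stall_s: float) -> float:
    """Total stall time in seconds, assuming nodes are upgraded one at a time
    and every stall delays the whole synchronized cluster (an assumption)."""
    return nodes * stall_s


total_s = rolling_upgrade_stall(NODE_COUNT, STALL_SECONDS_PER_NODE)
total_hours = total_s / 3600

print(f"{total_s:.0f} s is about {total_hours:.1f} h of accumulated stall")
```

Under those assumptions the accumulated stall is on the order of a day,
and since the per-node figure is "10+" seconds, this is a lower bound.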

And even the case where you have a perfect setup with everything
redundant and with instant failover does benefit from live patching.
Since you have to plan for failure, you have to plan for failure while
patching, too. With live patching you need 2 servers minimum (or N+1),
without it you need 3 (or N+2), as one will be offline during the
upgrade process.

10 seconds of outage may be acceptable in a disaster scenario. Not
necessarily for a regular update scenario.

The value of live patching is in near zero disruption.

-- 
Vojtech Pavlik
Director SUSE Labs