linux-kernel - Re: Linux 5.3-rc8

Message-ID: <20190917123015.sirlkvy335crozmj@debian-stretch-darwi.lab.linutronix.de>
Date:   Tue, 17 Sep 2019 12:30:15 +0000
From:   "Ahmed S. Darwish" <darwish.07@...il.com>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Martin Steigerwald <martin@...htvoll.de>, Willy Tarreau <w@....eu>,
        Matthew Garrett <mjg59@...f.ucam.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Vito Caputo <vcaputo@...garu.com>,
        Lennart Poettering <mzxreary@...inter.de>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        Jan Kara <jack@...e.cz>, Ray Strode <rstrode@...hat.com>,
        William Jon McCann <mccann@....edu>,
        "Alexander E. Patrakov" <patrakov@...il.com>,
        zhangjs <zachary@...shancloud.com>, linux-ext4@...r.kernel.org,
        lkml <linux-kernel@...r.kernel.org>
Subject: Re: Linux 5.3-rc8

On Tue, Sep 17, 2019 at 08:11:56AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
> > Willy Tarreau - 17.09.19, 07:24:38 CEST:
> > > On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> > > > >Well, the patch actually made getrandom() return an error too, but
> > > > >you seem more interested in the hypotheticals than in arguing
> > > > >actualities.
> > > > If you want to be safe, terminate the process.
> > >
> > > This is an interesting approach. At least it will cause bug reports in
> > > applications using getrandom() in an unreliable way and they will
> > > check for other options. Because one of the issues with systems that
> > > do not finish booting is that usually the user doesn't know what
> > > process is hanging.
> >
>
> I would be happy with a change which changes getrandom(0) to send a
> kill -9 to the process if it is called too early, with a new flag,
> getrandom(GRND_BLOCK) which blocks until entropy is available.  That
> leaves it up to the application developer to decide what behavior they
> want.
>

Yup, I'm convinced that's the sanest option too. I'll send a final RFC
patch tonight implementing the following:

config GETRANDOM_CRNG_ENTROPY_MAX_WAIT_MS
	int
	default 3000
	help
	  Default max wait in milliseconds, for the getrandom(2) system
	  call when asking for entropy from the urandom source, until
	  the Cryptographic Random Number Generator (CRNG) gets
	  initialized.  Any process exceeding this duration for entropy
	  wait will get killed by the kernel. The maximum wait can be
	  overridden through the "random.getrandom_max_wait_ms" kernel
	  boot parameter. Rationale follows.

	  When the getrandom(2) system call was created, it came with
	  the clear warning: "Any userspace program which uses this new
	  functionality must take care to assure that if it is used
	  during the boot process, that it will not cause the init
	  scripts or other portions of the system startup to hang
	  indefinitely."

	  Unfortunately, due to multiple factors, including not having
	  this warning written in scary enough language in the
	  manpages, and due to glibc since v2.25 implementing a BSD-like
	  getentropy(3) in terms of getrandom(2), modern user-space is
	  calling getrandom(2) in the boot path everywhere.

	  Embedded Linux systems were first hit by this, and reports of
	  embedded systems "getting stuck at boot" became common. Over
	  time, the issue even began to creep into consumer-level x86
	  laptops: mainstream distributions, like Debian Buster, began
	  to recommend installing haveged as a workaround, just to let
	  the system boot.

	  Filesystem optimizations in EXT4 and XFS exacerbated the
	  problem, due to aggressive batching of IO requests, and thus
	  minimizing sources of entropy at boot. This led to large
	  delays until the kernel's Cryptographic Random Number
	  Generator (CRNG) got initialized, and thus to reports of
	  getrandom(2) being stuck indefinitely at boot.

	  Solve this problem by setting a conservative upper bound for
	  getrandom(2) wait. Kill the process, instead of returning an
	  error code, because otherwise crypto-sensitive applications
	  may revert to less secure mechanisms (e.g. /dev/urandom). We
	  __deeply encourage__ system integrators and distribution
	  builders not to considerably increase this value: during
	  system boot, you either have entropy, or you don't. And if you
	  don't have entropy, it will stay like this forever, because
	  if you had, you wouldn't have blocked in the first place. It's
	  an atomic "either/or" situation, with no middle ground. Please
	  think twice.

	  Ideally, systems would be configured with hardware random
	  number generators, and/or configured to trust the CPU-provided
	  RNGs (CONFIG_RANDOM_TRUST_CPU) or boot-loader provided ones
	  (CONFIG_RANDOM_TRUST_BOOTLOADER).  In addition, userspace
	  should generate cryptographic keys only as late as possible,
	  when they are needed, instead of during early boot.  (For
	  non-cryptographic use cases, such as dictionary seeds or MIT
	  Magic Cookies, other mechanisms such as /dev/urandom or
	  random(3) may be more appropriate.)

Sounds good?

thanks,

--
Ahmed Darwish
http://darwish.chasingpointers.com