linux-kernel - Re: [GIT PULL] usercopy whitelisting for v4.15-rc1

Message-ID: <CA+55aFyBOWcz5jhni5PKEmwWMEiu8xm0-smorPA_wEJYNhbLaw@mail.gmail.com>
Date:   Tue, 21 Nov 2017 05:25:06 -1000
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     "Jason A. Donenfeld" <Jason@...c4.com>
Cc:     Kees Cook <keescook@...omium.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        David Windsor <dave@...lcore.net>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [GIT PULL] usercopy whitelisting for v4.15-rc1

[ This turned longer than it should have. I guess jet lag is a good thing ]

On Tue, Nov 21, 2017 at 3:48 AM, Jason A. Donenfeld <Jason@...c4.com> wrote:
>
> It might be news to you that actually some security people scoff at
> other security peoples' obsession with "security bugs".

Heh. I'm not actually surprised. It's just that the public "look at
this security bug" ones are the ones you see.

> The security industry is largely obsessed by finding (and
> selling/using/patching/reporting/showcasing/stockpiling/detecting/stealing)
> these "dangerous/useful" variety of bugs. And this obsession is
> continually fulfilled because bugs keep happening -- which is just the
> nature of software development -- and so this "security bug"
> infatuation continues. It even leads to people upset with you that you
> don't care about CVEs and so forth, because they're so fixated on
> individual bugs and their security impact.

Agreed.

> So what's the
> alternative to obsessing over each individual software bug?
>
> In the context of the kernel, the solution from Spender and Pipacs,
> and more recently "adopted" by Kees and his KSPP project, has been to
> try to eliminate the "security utility" of bugs.

And the thing is, I obviously agree very much about the whole "let's
have multiple layers of security even within the kernel, so that
random individual bugs don't end up being so exploitable". Bugs will
happen, let's aim to limit their damage.

To turn them "benign" in your words.

So I should be thrilled pink about the hardening efforts, right?

Well, I would - except for what "benign" means in that context, and
how security people have very different expectations from users - and
how those are both different from developers.

From a security standpoint, when you find an invalid access, and you
mitigate it, you've done a great job, and your hardening was
successful and you're done. "Look ma, it's not a security issue any
more", and you can basically ignore it as "just another bug" that is
now in a class that is no longer your problem.

So to you, the big win is when the access is _stopped_. That's the end
of the story from a security standpoint - at least if you are one of
those bad security people who don't care about anything else.

But from a developer standpoint, things _really_ are not done. Not
even close. From a developer standpoint, the bad access was just a
symptom, and it needs to be reported, and debugged, and fixed, so that
the bug actually gets corrected.

So from a developer standpoint, the end point of hardening is just
the starting point, and when _you_ think you're done, we're really
only getting started.

And from a _user_ standpoint, it's something else altogether. For a
user, pretty much EVERY SINGLE TIME, it wasn't actually a security
attack at all, it was just a latent bug that got exposed. And the
keyword here is that it was _latent_, and things used to work, and the
hardening patch did something - probably fairly drastic - to turn it
from "dangerous" to "benign" from a security perspective.

So from a user standpoint, the hardening was just a big nasty
annoyance, and probably made their workflow _break_, without actually
helping their case at all, because they never really saw the original
bug as a problem to begin with.

Notice? BIG disconnect in what "hardening" means for three groups, and
in particular, the number one rule of kernel development is that "we
don't break users".

Because without users, your program is pointless, and all the
development work you've done over decades is pointless.

.. and security is pointless too, in the end.

Now, the thing that annoys me and that makes me so _angry_ about this,
is that it shouldn't need to be that huge a disconnect.

It shouldn't need to be a big issue, because pretty much all the work
done for hardening should be able to actually make both the developers
and the users _happier_, instead of just making their lives miserable.

But that does mean that the hardening people need to really see past
that "endpoint" that they were looking at.

For a developer, the hardening effort _could_ be a great boon, in that
it could show nasty bugs early, it could make them easier to report,
and it could add a lot of useful information to that report that makes
them easier to fix too.

And from a user perspective, the hardening work shouldn't have to mean
"the latent bug that I didn't care about now screwed me over and is an
overt bug for me". It might not help the user directly, but if a year
from now, the latent bug that made their machine occasionally go all
wonky is fixed, the hardening effort did end up helping them too.

But what do we need for this to actually happen?

As a developer, I do want the report. But if you killed the user
program in the process, I'm actually _less_ likely to get the report,
because the latent access was most likely in some really rare and
nasty case, or we would have found it already. In the kernel, there's
a high likelihood that it was in a driver, for example. Maybe an
unusual ioctl() that is not getting a huge amount of attention,
because it's one driver among thousands, and it's probably not used
every time anyway. But because it's the kernel, and because it's a
driver, it's quite likely that killing the offender will do bad things
to various random locks that were held, or maybe it happens in an
interrupt and the whole machine is now dead if we're unlucky because
there really were some very core locks being held.

And as a user, my unhappiness is obvious. You don't even have to kill
the machine and make it hard to report to make a user unhappy, just
"the new kernel didn't work for me" will make that user skittish about
upgrading the kernel at all.

And the fix really looks fairly straightforward:

 - when adding hardening features, you as a security person should
always see that hardening to be the _endpoint_, but not the immediate
goal.

 - when adding hardening features, the first step should *ALWAYS* be
"just report it". Not killing things, not even stopping the access.
Report it. Nothing else.
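
[ Illustration of the "just report it" step above. This is a minimal
sketch, not code from this thread or from the kernel: the names
harden_mode, harden_reports and harden_check_copy() are hypothetical,
and a plain fprintf() stands in for whatever reporting mechanism real
code would use. The check starts in report-only mode, where a
violation is logged and counted but the access is still allowed, and
enforcement is a separate mode that would be switched on later. ]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical policy knob: new hardening starts out report-only. */
enum harden_mode { HARDEN_REPORT_ONLY, HARDEN_ENFORCE };

static enum harden_mode harden_mode = HARDEN_REPORT_ONLY;
static unsigned long harden_reports;	/* violations reported so far */

/*
 * Check a copy of 'len' bytes against an object of 'obj_size' bytes.
 * Returns true if the caller may go ahead with the copy.
 *
 * In report-only mode a violation is logged and counted, but the copy
 * is still allowed, so a latent bug does not become a broken workflow.
 * Only in enforce mode is the copy refused.
 */
static bool harden_check_copy(const char *site, size_t obj_size, size_t len)
{
	if (len <= obj_size)
		return true;

	harden_reports++;
	fprintf(stderr, "harden: %s: copy of %zu bytes exceeds object of %zu bytes\n",
		site, len, obj_size);

	return harden_mode == HARDEN_REPORT_ONLY;
}
```

[ In this sketch, flipping harden_mode to HARDEN_ENFORCE is the
"endpoint", and it is a one-line change that can wait until the
reports have been collected and the underlying bugs fixed. ]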

"Do no harm" should be your mantra for any new hardening work.

And that "do no harm" may feel antithetical to the whole point. You go
"but that doesn't work - then the bug still exists".

But remember - keep your eye on the endpoint, and that this is just
the first step. You need to not piss off users, and you need to not
piss off developers.

Because if you as a security person just piss off users, and piss off
developers, I'm not going to take your work, and I'm going to call you
a bad security person.

Because in the end, those users really do matter. Without those users,
your system may be "secure", but all your security work was still just
masturbation. You didn't do anything useful at all in the end.

So if hardening people can learn to "always report first", then I
think we can all work together.

But that really does mean that you don't start killing processes until
after you've shown that "look, the code to report these things has
been there for months, can we start doing more drastic things now?"

And remember: it's not that the code is months old. It's that the
code has been RUN BY USERS for months. If it's been in your tree, or
in grsecurity for five years, that doesn't mean a thing. It only means
that hardly anybody actually ever ran it.

If it's been on a random cellphone for a few months, and real users
used it, and had facebook and candy crush running on it, that's a
pretty different deal.

If it's been in a released kernel for a year, and Ubuntu and Fedora
and SuSE had it enabled, and there aren't reports, that's a big deal.

In contrast, if it's been on your server farm for three months, that
means _nothing_. You have pretty much zero coverage of the driver
situation, or of the random apps that people run.

See?

All I need is that the whole "let's kill processes" mentality goes
away, and that people acknowledge that the first step is always "just
report".

Do no harm.

Please.

                Linus