Message-ID: <CAHk-=whFQNkqPJ5zA1xAyvgtCPLN2C4xeJ181rU3k6bG+2zugg@mail.gmail.com>
Date:   Sat, 24 Aug 2019 10:40:47 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Arnd Bergmann <arnd@...db.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        syzbot <syzbot+8ab2d0f39fb79fe6ca40@...kaller.appspotmail.com>
Subject: Re: [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory.

On Sat, Aug 24, 2019 at 9:14 AM Ingo Molnar <mingo@...nel.org> wrote:
>
> What I noticed is that while reading regular RAM is reasonably fast even
> in very large chunks, accesses become very slow once they hit iomem - and
> delays even longer than the reported 145 seconds become possible if
> bs=100M is increased to say bs=1000M.

Ok, that makes sense.

So for IOMEM, we actually have two independent issues:

 (a) normal unmapped IOMEM (ie no device backing it), which is probably
what your test triggers anyway, quite possibly ends up with ISA
timings (ie around 1us per read)

 (b) we use "probe_kernel_read()" into a temporary buffer to avoid
page faults, and that in turn uses __copy_from_user_inatomic(). So far
so good. But on modern x86, because we treat it as just a regular
memcpy, we probably have ERMS and do "rep movsb".

So (a) means that each access is slow to begin with, and then (b)
means that we don't use a "copy_from_io()"-style routine but a slow
byte-by-byte access.
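
For context, the loop in question is roughly this shape (a simplified
sketch, not the literal drivers/char/mem.c code; the name
read_mem_sketch() is made up and error returns are elided):

/* Simplified sketch of the /dev/mem read path under discussion. */
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/io.h>

static ssize_t read_mem_sketch(char __user *buf, size_t count, loff_t *ppos)
{
        phys_addr_t p = *ppos;
        ssize_t read = 0;
        char *bounce = kmalloc(PAGE_SIZE, GFP_KERNEL);

        if (!bounce)
                return -ENOMEM;

        while (count > 0) {
                /* never more than one page per iteration: the 4K chunking */
                size_t sz = min_t(size_t, count,
                                  PAGE_SIZE - (p & (PAGE_SIZE - 1)));
                /* kernel direct-map address for RAM, ioremap for the rest */
                void *ptr = xlate_dev_mem_ptr(p);

                if (!ptr)
                        break;

                /*
                 * (b) above: this ends up as a byte-granular copy
                 * ("rep movsb" with ERMS), which is what makes uncached
                 * IO memory so painfully slow here.
                 */
                if (probe_kernel_read(bounce, ptr, sz)) {
                        unxlate_dev_mem_ptr(p, ptr);
                        break;
                }
                unxlate_dev_mem_ptr(p, ptr);

                if (copy_to_user(buf, bounce, sz))
                        break;

                buf += sz;
                p += sz;
                count -= sz;
                read += sz;
        }

        kfree(bounce);
        *ppos += read;
        return read;
}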

> With Tetsuo's approach the delays are fixed, because the fatal signal is
> noticed within the 4K chunks of read_mem() immediately:

Yeah, that's the size of the temp buffer.

> Note how the first 100MB chunk took 17 seconds, the second chunk was
> interrupted within ~2 seconds after I sent a SIGKILL.

It looks closer to 1 second from that trace, but that actually is
still within the basic noise: a 4kB buffer being copied one byte at a
time would take about 4ms at ISA timings, and maybe it's not even
*quite* as bad as traditional ISA PIO in practice.

> Except that I think the regular pattern here is not 'break' but 'goto
> failed' - 'break' would just result in a partial read and 'err' gets
> ignored.

That's actually the correct signal handling pattern: either a partial
read, or -EINTR.

In the case of SIGKILL, the return value doesn't matter, of course,
but *if* we were to decide that we can make it interruptible, then it
would.
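
To spell the convention out, roughly (an illustrative sketch, not the
actual patch; copy_one_chunk() is a made-up stand-in for the per-page
probe_kernel_read()+copy_to_user() body):

#include <linux/errno.h>
#include <linux/sched/signal.h>
#include <linux/types.h>

/*
 * Sketch of the usual "partial read or -EINTR" convention, checked
 * once per chunk.
 */
static ssize_t chunked_read_sketch(char __user *buf, size_t count,
                                   ssize_t (*copy_one_chunk)(char __user *, size_t))
{
        ssize_t read = 0;

        while (count > 0) {
                ssize_t sz;

                if (signal_pending(current)) {
                        if (!read)
                                read = -EINTR;  /* nothing copied yet */
                        break;                  /* else return the partial count */
                }

                sz = copy_one_chunk(buf, count);
                if (sz <= 0) {
                        if (!read)
                                read = sz ? sz : -EFAULT;
                        break;
                }

                buf += sz;
                count -= sz;
                read += sz;
        }

        return read;
}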

> I also agree that we should probably allow regular interrupts to
> interrupt /dev/mem accesses - while the 'dd' process is killable, it's
> not Ctrl-C-able.

I'm a bit nervous about it, but there probably aren't all that many
actual /dev/mem users.

The main ones I can see are things like "dmidecode" etc.

> If I change the fatal_signal_pending() to signal_pending() it all behaves
> much better IMHO - assuming that no utility learned to rely on
> non-interruptibility ...

So I cloned the dmidecode git tree, and as far as I can tell it does
"myread()" on the /dev/mem file. And that one correctly handles
partial reads.

It still makes me a bit nervous, but just plain "signal_pending()" is
not just nicer, it's technically the right thing to do (it's not a
regular file, so partial reads are permissible, and other files like
it - eg /dev/zero - already do exactly that).
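
The /dev/zero read side does roughly this (a paraphrased sketch of the
pattern; the details of the real read_iter_zero() may differ):

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/uio.h>

/* Sketch of the chunked, signal-aware pattern /dev/zero already uses. */
static ssize_t read_iter_zero_sketch(struct iov_iter *iter)
{
        size_t written = 0;

        while (iov_iter_count(iter)) {
                size_t chunk = min_t(size_t, iov_iter_count(iter), PAGE_SIZE);
                size_t n = iov_iter_zero(chunk, iter);

                if (!n && iov_iter_count(iter))
                        return written ? written : -EFAULT;
                written += n;

                /* partial result if interrupted, error only if nothing done */
                if (signal_pending(current))
                        return written ? written : -ERESTARTSYS;
                cond_resched();
        }
        return written;
}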

I also wonder if we should just not use that crazy
"probe_kernel_read()" to begin with. The /dev/mem code uses it to
avoid faults - and that's the intent of it - but it's not meant for
IO-memory at all.

But "read()" on /dev/mem does that off "xlate_dev_mem_ptr()", which is
a bastardized "kernel address or ioremap" thing. It works, but it's
all kinds of stupid. We'd be a *lot* better off using kmap(), I think.
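
For illustration, the translation is conceptually something like this
(a simplified sketch modelled on the x86 side, not the exact code):

#include <linux/io.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* Sketch of the "kernel address or ioremap" translation for /dev/mem. */
static void *xlate_dev_mem_ptr_sketch(phys_addr_t phys)
{
        unsigned long offset = offset_in_page(phys);
        void __iomem *vaddr;

        /* Regular RAM: the kernel direct mapping already covers it. */
        if (page_is_ram(PHYS_PFN(phys)))
                return __va(phys);

        /* IO memory or a hole: set up a temporary page-sized mapping. */
        vaddr = ioremap(phys & PAGE_MASK, PAGE_SIZE);
        if (!vaddr)
                return NULL;

        return (void __force *)vaddr + offset;
}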

So my gut feeling is that this is something worth making bigger and
more fundamental changes to.

But changing it to use "signal_pending()" at least as a trial looks
ok. The only user *should* be things like dmidecode that apparently
already do the right thing.

And if we changed the bounce buffering to copy at least a 32-bit word
at a time, that would improve IO latency by a factor of 4.

And if the signal_pending() check is at the end of the copy, then we'd
guarantee that at least _small_ reads still work the way they did
before, so people who do things like "lspci" with direct hardware
accesses wouldn't be impacted - if they exist.
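
Purely as a sketch of those two points (made-up helper name, fault
handling omitted, not a patch):

#include <linux/io.h>
#include <linux/sched/signal.h>
#include <linux/types.h>

/*
 * Sketch only: fill the bounce buffer from IO memory 32 bits at a
 * time (instead of "rep movsb" byte accesses), and look at
 * signal_pending() only after the whole chunk, so a single small read
 * always completes the way it does today.
 */
static bool copy_io_chunk_and_check_signal(u32 *bounce,
                                           const void __iomem *src, size_t sz)
{
        size_t i;

        /* sz is a page-sized chunk here, so always a multiple of 4 */
        for (i = 0; i < sz / sizeof(u32); i++)
                bounce[i] = readl(src + i * sizeof(u32));  /* 4 bytes per access */

        /* the check sits after the copy, never in the middle of it */
        return signal_pending(current);
}

The caller would then return the partial count - or -EINTR if nothing
had been copied yet - whenever that returns true.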

So I'd be willing to try that (and then if somebody reports a
regression we can make it use "fatal_signal_pending()" instead).

                    Linus
