linux-kernel - Re: [patch V2 01/20] bpf: Enforce preallocation for all instrumentation programs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20200223224046.o6ynykpvg6kl75ar@ast-mbp>
Date:   Sun, 23 Feb 2020 14:40:48 -0800
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        David Miller <davem@...emloft.net>, bpf@...r.kernel.org,
        netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Sebastian Sewior <bigeasy@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Clark Williams <williams@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Ingo Molnar <mingo@...nel.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Vinicius Costa Gomes <vinicius.gomes@...el.com>,
        Jakub Kicinski <kuba@...nel.org>
Subject: Re: [patch V2 01/20] bpf: Enforce preallocation for all
 instrumentation programs

On Sat, Feb 22, 2020 at 09:40:10AM +0100, Thomas Gleixner wrote:
> Alexei,
> 
> Alexei Starovoitov <alexei.starovoitov@...il.com> writes:
> > On Thu, Feb 20, 2020 at 09:45:18PM +0100, Thomas Gleixner wrote:
> >> The assumption that only programs attached to perf NMI events can deadlock
> >> on memory allocators is wrong. Assume the following simplified callchain:
> >>  	 */
> >> -	if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
> >> +	if ((is_tracing_prog_type(prog->type)) {
> >
> > This doesn't build.
> > I assumed the typo somehow sneaked in and proceeded, but it broke
> > a bunch of tests:
> > Summary: 1526 PASSED, 0 SKIPPED, 54 FAILED
> > One can argue that the test are unsafe and broken.
> > We used to test all those tests with and without prealloc:
> > map_flags = 0;
> > run_all_tests();
> > map_flags = BPF_F_NO_PREALLOC;
> > run_all_tests();
> > Then 4 years ago commit 5aa5bd14c5f866 switched hashmap to be no_prealloc
> > always and that how it stayed since then. We can adjust the tests to use
> > prealloc with tracing progs, but this breakage shows that there could be plenty
> > of bpf users that also use BPF_F_NO_PREALLOC with tracing. It could simply
> > be because they know that their kprobes are in a safe spot (and kmalloc is ok)
> > and they want to save memory. They could be using large max_entries parameter
> > for worst case hash map usage, but typical load is low. In general hashtables
> > don't perform well after 50%, so prealloc is wasting half of the memory. Since
> > we cannot control where kprobes are placed I'm not sure what is the right fix
> > here. It feels that if we proceed with this patch somebody will complain and we
> > would have to revert, but I'm willing to take this risk if we cannot come up
> > with an alternative fix.
> 
> Having something which is known to be broken exposed is not a good option
> either.
> 
> Just assume that someone is investigating a kernel issue. BOFH who is
> stuck in the 90's uses perf, kprobes and tracepoints. Now he goes on
> vacation and the new kid in the team decides to flip that over to BPF.
> So now instead of getting information he deadlocks or crashes the
> machine.
> 
> You can't just tell him, don't do that then. It's broken by design and
> you really can't tell which probes are safe and which are not because
> the allocator calls out into whatever functions which might look
> completely unrelated.
> 
> So one way to phase this out would be:
> 
> 	if (is_tracing()) {
>         	if (is_perf() || IS_ENABLED(RT))
>                 	return -EINVAL;
>                 WARN_ONCE(.....)
>         }
> 
> And clearly write in the warning that this is dangerous, broken and
> about to be forbidden. Hmm?

Yeah. Let's start with WARN_ONCE and verbose(env, "dangerous, broken")
so the users see it in the verifier log and people who maintain
servers (like kernel-team-s in fb, goog, etc) see it as well
in their dmesg logs. So the motivation will be on both sides.
Then in few kernel releases we can flip it to disable.
Or we'll find a way to make it work without pre-allocating.