[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20200223224046.o6ynykpvg6kl75ar@ast-mbp>
Date: Sun, 23 Feb 2020 14:40:48 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: LKML <linux-kernel@...r.kernel.org>,
David Miller <davem@...emloft.net>, bpf@...r.kernel.org,
netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Sebastian Sewior <bigeasy@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Clark Williams <williams@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Juri Lelli <juri.lelli@...hat.com>,
Ingo Molnar <mingo@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Vinicius Costa Gomes <vinicius.gomes@...el.com>,
Jakub Kicinski <kuba@...nel.org>
Subject: Re: [patch V2 01/20] bpf: Enforce preallocation for all
instrumentation programs
On Sat, Feb 22, 2020 at 09:40:10AM +0100, Thomas Gleixner wrote:
> Alexei,
>
> Alexei Starovoitov <alexei.starovoitov@...il.com> writes:
> > On Thu, Feb 20, 2020 at 09:45:18PM +0100, Thomas Gleixner wrote:
> >> The assumption that only programs attached to perf NMI events can deadlock
> >> on memory allocators is wrong. Assume the following simplified callchain:
> >> */
> >> - if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
> >> + if ((is_tracing_prog_type(prog->type)) {
> >
> > This doesn't build.
> > I assumed the typo somehow sneaked in and proceeded, but it broke
> > a bunch of tests:
> > Summary: 1526 PASSED, 0 SKIPPED, 54 FAILED
> > One can argue that the test are unsafe and broken.
> > We used to test all those tests with and without prealloc:
> > map_flags = 0;
> > run_all_tests();
> > map_flags = BPF_F_NO_PREALLOC;
> > run_all_tests();
> > Then 4 years ago commit 5aa5bd14c5f866 switched hashmap to be no_prealloc
> > always and that how it stayed since then. We can adjust the tests to use
> > prealloc with tracing progs, but this breakage shows that there could be plenty
> > of bpf users that also use BPF_F_NO_PREALLOC with tracing. It could simply
> > be because they know that their kprobes are in a safe spot (and kmalloc is ok)
> > and they want to save memory. They could be using large max_entries parameter
> > for worst case hash map usage, but typical load is low. In general hashtables
> > don't perform well after 50%, so prealloc is wasting half of the memory. Since
> > we cannot control where kprobes are placed I'm not sure what is the right fix
> > here. It feels that if we proceed with this patch somebody will complain and we
> > would have to revert, but I'm willing to take this risk if we cannot come up
> > with an alternative fix.
>
> Having something which is known to be broken exposed is not a good option
> either.
>
> Just assume that someone is investigating a kernel issue. BOFH who is
> stuck in the 90's uses perf, kprobes and tracepoints. Now he goes on
> vacation and the new kid in the team decides to flip that over to BPF.
> So now instead of getting information he deadlocks or crashes the
> machine.
>
> You can't just tell him, don't do that then. It's broken by design and
> you really can't tell which probes are safe and which are not because
> the allocator calls out into whatever functions which might look
> completely unrelated.
>
> So one way to phase this out would be:
>
> if (is_tracing()) {
> if (is_perf() || IS_ENABLED(RT))
> return -EINVAL;
> WARN_ONCE(.....)
> }
>
> And clearly write in the warning that this is dangerous, broken and
> about to be forbidden. Hmm?
Yeah. Let's start with WARN_ONCE and verbose(env, "dangerous, broken")
so the users see it in the verifier log and people who maintain
servers (like kernel-team-s in fb, goog, etc) see it as well
in their dmesg logs. So the motivation will be on both sides.
Then in few kernel releases we can flip it to disable.
Or we'll find a way to make it work without pre-allocating.
Powered by blists - more mailing lists