linux-kernel - Re: [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAEf4BzYrZZH7uTuBG=feL+AORgxqtAKhG3hJ=vUJvQd1xSOe0Q@mail.gmail.com>
Date: Thu, 31 Oct 2024 09:35:12 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org, 
	Michael Jeanson <mjeanson@...icios.com>, Masami Hiramatsu <mhiramat@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Alexei Starovoitov <ast@...nel.org>, Yonghong Song <yhs@...com>, 
	"Paul E . McKenney" <paulmck@...nel.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Namhyung Kim <namhyung@...nel.org>, 
	bpf@...r.kernel.org, Joel Fernandes <joel@...lfernandes.org>, 
	Jordan Rife <jrife@...gle.com>, syzbot+b390c8062d8387b6272a@...kaller.appspotmail.com
Subject: Re: [RFC PATCH v3 3/3] tracing: Fix syscall tracepoint use-after-free

On Thu, Oct 31, 2024 at 8:44 AM Mathieu Desnoyers
<mathieu.desnoyers@...icios.com> wrote:
>
> On 2024-10-28 15:19, Mathieu Desnoyers wrote:
> > On 2024-10-27 21:22, Andrii Nakryiko wrote:
> >> On Sat, Oct 26, 2024 at 8:48 AM Mathieu Desnoyers
> >> <mathieu.desnoyers@...icios.com> wrote:
> >>>
> >>> The grace period used internally within tracepoint.c:release_probes()
> >>> uses call_rcu() to batch waiting for quiescence of old probe arrays,
> >>> rather than using the tracepoint_synchronize_unregister() which blocks
> >>> while waiting for quiescence.
> >>>
> >>> With the introduction of faultable syscall tracepoints, this causes
> >>> use-after-free issues reproduced with syzkaller.
> >>>
> >>> Fix this by using the appropriate call_rcu() or call_rcu_tasks_trace()
> >>> before invoking the rcu_free_old_probes callback. This can be chosen
> >>> using the tracepoint_is_syscall() API.
> >>>
> >>> A similar issue exists in bpf use of call_rcu(). Fixing this is left to
> >>> a separate change.
> >>>
> >>> Reported-by: syzbot+b390c8062d8387b6272a@...kaller.appspotmail.com
> >>> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to
> >>> handle page faults")
> >>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> >>> Cc: Michael Jeanson <mjeanson@...icios.com>
> >>> Cc: Steven Rostedt <rostedt@...dmis.org>
> >>> Cc: Masami Hiramatsu <mhiramat@...nel.org>
> >>> Cc: Peter Zijlstra <peterz@...radead.org>
> >>> Cc: Alexei Starovoitov <ast@...nel.org>
> >>> Cc: Yonghong Song <yhs@...com>
> >>> Cc: Paul E. McKenney <paulmck@...nel.org>
> >>> Cc: Ingo Molnar <mingo@...hat.com>
> >>> Cc: Arnaldo Carvalho de Melo <acme@...nel.org>
> >>> Cc: Mark Rutland <mark.rutland@....com>
> >>> Cc: Alexander Shishkin <alexander.shishkin@...ux.intel.com>
> >>> Cc: Namhyung Kim <namhyung@...nel.org>
> >>> Cc: Andrii Nakryiko <andrii.nakryiko@...il.com>
> >>> Cc: bpf@...r.kernel.org
> >>> Cc: Joel Fernandes <joel@...lfernandes.org>
> >>> Cc: Jordan Rife <jrife@...gle.com>
> >>> ---
> >>> Changes since v0:
> >>> - Introduce tracepoint_call_rcu(),
> >>> - Fix bpf_link_free() use of call_rcu as well.
> >>>
> >>> Changes since v1:
> >>> - Use tracepoint_call_rcu() for bpf_prog_put as well.
> >>>
> >>> Changes since v2:
> >>> - Do not cover bpf changes in the same commit, let bpf developers
> >>>    implement it.
> >>> ---
> >>>   kernel/tracepoint.c | 11 +++++++----
> >>>   1 file changed, 7 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> >>> index 5658dc92f5b5..47569fb06596 100644
> >>> --- a/kernel/tracepoint.c
> >>> +++ b/kernel/tracepoint.c
> >>> @@ -106,13 +106,16 @@ static void rcu_free_old_probes(struct rcu_head
> >>> *head)
> >>>          kfree(container_of(head, struct tp_probes, rcu));
> >>>   }
> >>>
> >>> -static inline void release_probes(struct tracepoint_func *old)
> >>> +static inline void release_probes(struct tracepoint *tp, struct
> >>> tracepoint_func *old)
> >>>   {
> >>>          if (old) {
> >>>                  struct tp_probes *tp_probes = container_of(old,
> >>>                          struct tp_probes, probes[0]);
> >>>
> >>> -               call_rcu(&tp_probes->rcu, rcu_free_old_probes);
> >>> +               if (tracepoint_is_syscall(tp))
> >>> +                       call_rcu_tasks_trace(&tp_probes->rcu,
> >>> rcu_free_old_probes);
> >>
> >> should this be call_rcu_tasks_trace() -> call_rcu() chain instead of
> >> just call_rcu_tasks_trace()? While currently call_rcu_tasks_trace()
> >> implies RCU GP (as evidenced by rcu_trace_implies_rcu_gp() being
> >> hardcoded right now to returning true), this might not always be the
> >> case in the future, so it's best to have a guarantee that regardless
> >> of sleepable or not, we'll always have have RCU GP, and for sleepable
> >> tracepoint *also* RCU Tasks Trace GP.
> >
> > Given that faultable tracepoints only use RCU tasks trace for the
> > read-side and do not rely on preempt disable, I don't see why we would
> > need to chain both grace periods there ?
>
> Hi Andrii,
>
> AFAIU, your question above is rooted in the way bpf does its sleepable
> program grace periods (chaining RCU tasks trace + RCU GP), e.g.:
>
> bpf_map_free_mult_rcu_gp
> bpf_link_defer_dealloc_mult_rcu_gp
>
> and
>
> bpf_link_free:
>                  /* schedule BPF link deallocation; if underlying BPF program
>                   * is sleepable, we need to first wait for RCU tasks trace
>                   * sync, then go through "classic" RCU grace period
>                   */
>
> This is introduced in commit 1a80dbcb2db ("bpf: support deferring bpf_link dealloc to after RCU grace period")
> which has a bit more information in the commit message, but what I'm not seeing
> is an explanation of *why* chaining RCU tasks trace and RCU grace periods is
> needed for sleepable bpf programs. What am I missing ?

At least one of the reasons are BPF maps that can be used from both
sleepable and non-sleepable BPF programs *at the same time*. So in BPF
everything is *at least* protected with rcu_read_lock(), and then
sleepable-capable things are *additionally* supported by
rcu_read_tasks_trace(). So on destruction, we chain both RCU GP kinds
to make sure that all users can't see BPF map/prog/(and soon links).

It might not be strictly necessary in general, but you are right that
I asked because of how we do this in BPF. Also, in practice, tasks
trace RCU GP implies RCU GP, so there is no overhead for how we do
this for BPF maps (and progs? didn't check).

Anyways, this might be fine as is.

>
> As far as tracepoint.c release_probes() is concerned, just waiting for
> RCU tasks trace before freeing memory of faultable tracepoints is
> sufficient.
>
> Thanks,
>
> Mathieu
>
> >
> > Thanks,
> >
> > Mathieu
> >
> >>
> >>> +               else
> >>> +                       call_rcu(&tp_probes->rcu, rcu_free_old_probes);
> >>>          }
> >>>   }
> >>>
> >>> @@ -334,7 +337,7 @@ static int tracepoint_add_func(struct tracepoint
> >>> *tp,
> >>>                  break;
> >>>          }
> >>>
> >>> -       release_probes(old);
> >>> +       release_probes(tp, old);
> >>>          return 0;
> >>>   }
> >>>
> >>> @@ -405,7 +408,7 @@ static int tracepoint_remove_func(struct
> >>> tracepoint *tp,
> >>>                  WARN_ON_ONCE(1);
> >>>                  break;
> >>>          }
> >>> -       release_probes(old);
> >>> +       release_probes(tp, old);
> >>>          return 0;
> >>>   }
> >>>
> >>> --
> >>> 2.39.5
> >>>
> >
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>