[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <acba35f6-2e29-903f-6eb8-a50dde25a147@mojatatu.com>
Date: Sun, 17 Jan 2021 10:15:45 -0500
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Cong Wang <xiyou.wangcong@...il.com>, netdev@...r.kernel.org
Cc: Cong Wang <cong.wang@...edance.com>,
syzbot+82752bc5331601cf4899@...kaller.appspotmail.com,
syzbot+b3b63b6bff456bd95294@...kaller.appspotmail.com,
syzbot+ba67b12b1ca729912834@...kaller.appspotmail.com,
Jiri Pirko <jiri@...nulli.us>,
Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
Davide Caratti <dcaratti@...hat.com>,
Vlad Buslov <vlad@...lov.dev>,
Briana Oursler <briana.oursler@...il.com>
Subject: Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by
request_module()
On 2021-01-16 7:56 p.m., Cong Wang wrote:
> From: Cong Wang <cong.wang@...edance.com>
>
> tcf_action_init_1() loads tc action modules automatically with
> request_module() after parsing the tc action names, and it drops RTNL
> lock and re-holds it before and after request_module(). This causes a
> lot of troubles, as discovered by syzbot, because we can be in the
> middle of batch initializations when we create an array of tc actions.
>
> One of the problem is deadlock:
>
> CPU 0 CPU 1
> rtnl_lock();
> for (...) {
> tcf_action_init_1();
> -> rtnl_unlock();
> -> request_module();
> rtnl_lock();
> for (...) {
> tcf_action_init_1();
> -> tcf_idr_check_alloc();
> // Insert one action into idr,
> // but it is not committed until
> // tcf_idr_insert_many(), then drop
> // the RTNL lock in the _next_
> // iteration
> -> rtnl_unlock();
> -> rtnl_lock();
> -> a_o->init();
> -> tcf_idr_check_alloc();
> // Now waiting for the same index
> // to be committed
> -> request_module();
> -> rtnl_lock()
> // Now waiting for RTNL lock
> }
> rtnl_unlock();
> }
> rtnl_unlock();
>
> This is not easy to solve, we can move the request_module() before
> this loop and pre-load all the modules we need for this netlink
> message and then do the rest initializations. So the loop breaks down
> to two now:
>
> for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
> struct tc_action_ops *a_o;
>
> a_o = tc_action_load_ops(name, tb[i]...);
> ops[i - 1] = a_o;
> }
>
> for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
> act = tcf_action_init_1(ops[i - 1]...);
> }
>
> Although this looks serious, it only has been reported by syzbot, so it
> seems hard to trigger this by humans. And given the size of this patch,
> I'd suggest to make it to net-next and not to backport to stable.
>
> This patch has been tested by syzbot and tested with tdc.py by me.
>
LGTM.
Initially i was worried about performance impact but i found nothing
observable. We need to add a tdc test for batch (I can share how i did
batch testing at next meet).
Tested-by: Jamal Hadi Salim <jhs@...atatu.com>
Acked-by: Jamal Hadi Salim <jhs@...atatu.com>
cheers,
jamal
Powered by blists - more mailing lists