Message-ID: <87im7u9v0t.fsf@buslov.dev>
Date: Mon, 18 Jan 2021 16:31:14 +0200
From: Vlad Buslov <vlad@...lov.dev>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: Cong Wang <xiyou.wangcong@...il.com>, netdev@...r.kernel.org,
Cong Wang <cong.wang@...edance.com>,
syzbot+82752bc5331601cf4899@...kaller.appspotmail.com,
syzbot+b3b63b6bff456bd95294@...kaller.appspotmail.com,
syzbot+ba67b12b1ca729912834@...kaller.appspotmail.com,
Jiri Pirko <jiri@...nulli.us>,
Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
Davide Caratti <dcaratti@...hat.com>,
Briana Oursler <briana.oursler@...il.com>
Subject: Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by
request_module()
On Sun 17 Jan 2021 at 17:15, Jamal Hadi Salim <jhs@...atatu.com> wrote:
> On 2021-01-16 7:56 p.m., Cong Wang wrote:
>> From: Cong Wang <cong.wang@...edance.com>
>> tcf_action_init_1() loads tc action modules automatically with
>> request_module() after parsing the tc action names, and it drops the
>> RTNL lock before request_module() and re-acquires it afterwards. This
>> causes a lot of trouble, as discovered by syzbot, because we can be in
>> the middle of a batch initialization when we create an array of tc
>> actions.
>>
>> One of the problems is a deadlock:
>>
>> CPU 0                                     CPU 1
>> rtnl_lock();
>> for (...) {
>>   tcf_action_init_1();
>>     -> rtnl_unlock();
>>     -> request_module();
>>                                           rtnl_lock();
>>                                           for (...) {
>>                                             tcf_action_init_1();
>>                                               -> tcf_idr_check_alloc();
>>                                               // Insert one action into idr,
>>                                               // but it is not committed until
>>                                               // tcf_idr_insert_many(), then drop
>>                                               // the RTNL lock in the _next_
>>                                               // iteration
>>                                               -> rtnl_unlock();
>>     -> rtnl_lock();
>>     -> a_o->init();
>>       -> tcf_idr_check_alloc();
>>       // Now waiting for the same index
>>       // to be committed
>>                                               -> request_module();
>>                                               -> rtnl_lock()
>>                                               // Now waiting for RTNL lock
>>                                           }
>>                                           rtnl_unlock();
>> }
>> rtnl_unlock();
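
The interleaving in the diagram above can be replayed in a tiny single-threaded model, as a rough sketch only. Everything here (`model_acquire()`, `model_deadlocked()`, the `RES_*` and `TASK_*` names) is hypothetical and not kernel code, and the uncommitted IDR index is simplified to a resource "owned" by the task that inserted it. The model only records who holds what and who is blocked on what, then looks for a cycle in that wait-for relation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code. */
enum { TASK_CPU0, TASK_CPU1, NTASKS };
enum { RES_RTNL, RES_IDR_INDEX, NRES };

static int owner[NRES];       /* task holding each resource, -1 if free */
static int waits_for[NTASKS]; /* resource a task is blocked on, -1 if running */

static void model_reset(void)
{
	int i;

	for (i = 0; i < NRES; i++)
		owner[i] = -1;
	for (i = 0; i < NTASKS; i++)
		waits_for[i] = -1;
}

/* Take the resource if it is free, otherwise record that the task blocks. */
static void model_acquire(int task, int res)
{
	if (owner[res] < 0)
		owner[res] = task;
	else
		waits_for[task] = res;
}

static void model_release(int task, int res)
{
	if (owner[res] == task)
		owner[res] = -1;
}

/* Follow task -> awaited resource -> its owner; a return to the start is a cycle. */
static bool model_deadlocked(void)
{
	int start, steps;

	for (start = 0; start < NTASKS; start++) {
		int t = start;

		for (steps = 0; steps < NTASKS; steps++) {
			int r = waits_for[t];

			if (r < 0)
				break;
			t = owner[r];
			if (t < 0)
				break;
			if (t == start)
				return true;
		}
	}
	return false;
}

/* Replays the steps of the diagram in the order they are listed. */
static bool model_replay_old_flow(void)
{
	model_reset();
	model_acquire(TASK_CPU0, RES_RTNL);      /* CPU 0: rtnl_lock() */
	model_release(TASK_CPU0, RES_RTNL);      /* CPU 0: rtnl_unlock(), request_module() */
	model_acquire(TASK_CPU1, RES_RTNL);      /* CPU 1: rtnl_lock() */
	model_acquire(TASK_CPU1, RES_IDR_INDEX); /* CPU 1: index inserted, not committed */
	model_release(TASK_CPU1, RES_RTNL);      /* CPU 1: rtnl_unlock() in next iteration */
	model_acquire(TASK_CPU0, RES_RTNL);      /* CPU 0: rtnl_lock() after module load */
	model_acquire(TASK_CPU0, RES_IDR_INDEX); /* CPU 0: waits for the same index */
	model_acquire(TASK_CPU1, RES_RTNL);      /* CPU 1: waits for RTNL */
	return model_deadlocked();
}
```

In this model `model_replay_old_flow()` reports a cycle: CPU 0 holds RTNL and waits for the index, while CPU 1 holds the uncommitted index and waits for RTNL.
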
>>
>> This is not easy to solve. We can move the request_module() calls
>> before this loop, pre-load all the modules we need for this netlink
>> message, and then do the rest of the initialization. So the loop
>> breaks down into two now:
>>
>>         for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>>                 struct tc_action_ops *a_o;
>>
>>                 a_o = tc_action_load_ops(name, tb[i]...);
>>                 ops[i - 1] = a_o;
>>         }
>>
>>         for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>>                 act = tcf_action_init_1(ops[i - 1]...);
>>         }
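
As a rough userspace illustration of the two-loop split quoted above, the sketch below resolves every ops pointer first (the only phase in which the lock may be dropped) and only then runs the init phase with the lock held throughout. All `demo_*` names, the boolean stand-in for RTNL and the small ops table are assumptions made for this example; they are not the kernel's data structures or the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_ACTIONS 4	/* stand-in for TCA_ACT_MAX_PRIO */

/* Hypothetical stand-ins; none of these are kernel APIs. */
struct demo_ops { const char *kind; };

static bool demo_lock_held;		/* models "RTNL is held" */
static bool demo_in_init_phase;		/* true while the second loop runs */
static int demo_lock_drops;		/* total number of unlocks */
static int demo_drops_during_init;	/* unlocks seen during the second loop */

static void demo_lock(void)
{
	demo_lock_held = true;
}

static void demo_unlock(void)
{
	demo_lock_held = false;
	demo_lock_drops++;
	if (demo_in_init_phase)
		demo_drops_during_init++;
}

static struct demo_ops demo_table[] = { { "gact" }, { "mirred" }, { "police" } };

/* Phase 1 helper: drops and retakes the lock, as a module load would. */
static struct demo_ops *demo_load_ops(const char *kind)
{
	size_t i;

	demo_unlock();
	/* a request_module()-style call would run here, without the lock */
	demo_lock();

	for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++)
		if (strcmp(demo_table[i].kind, kind) == 0)
			return &demo_table[i];
	return NULL;
}

/* Phase 2 helper: needs the lock and never releases it. */
static int demo_init_one(const struct demo_ops *ops)
{
	return (ops && demo_lock_held) ? 0 : -1;
}

static int demo_init_batch(const char *const kinds[], size_t n)
{
	struct demo_ops *ops[DEMO_MAX_ACTIONS] = { NULL };
	size_t i;
	int err = 0;

	if (n > DEMO_MAX_ACTIONS)
		return -1;

	demo_lock();

	/* First loop: resolve all ops up front. */
	for (i = 0; i < n; i++) {
		ops[i] = demo_load_ops(kinds[i]);
		if (!ops[i]) {
			err = -1;
			goto out;
		}
	}

	/* Second loop: initialize with the lock held the whole time. */
	demo_in_init_phase = true;
	for (i = 0; i < n; i++) {
		err = demo_init_one(ops[i]);
		if (err)
			break;
	}
	demo_in_init_phase = false;
out:
	demo_unlock();
	return err;
}
```

Calling `demo_init_batch()` with the three known kinds leaves `demo_drops_during_init` at zero: every lock drop happens in the first loop, before any action has been partially set up.
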
>>
>> Although this looks serious, it has only been reported by syzbot, so
>> it seems hard for humans to trigger. Given the size of this patch, I'd
>> suggest that it go to net-next and not be backported to stable.
>>
>> This patch has been tested by syzbot and tested with tdc.py by me.
>>
>
> LGTM.
> Initially I was worried about the performance impact, but I found
> nothing observable. We need to add a tdc test for batch (I can share
> how I did batch testing at the next meet).
>
> Tested-by: Jamal Hadi Salim <jhs@...atatu.com>
> Acked-by: Jamal Hadi Salim <jhs@...atatu.com>
>
> cheers,
> jamal

Hi,

Thanks for adding me to the thread!

I ran our performance tests with the patch applied and didn't observe
any regression.

Regards,
Vlad