netdev - Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by request

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87im7u9v0t.fsf@buslov.dev>
Date:   Mon, 18 Jan 2021 16:31:14 +0200
From:   Vlad Buslov <vlad@...lov.dev>
To:     Jamal Hadi Salim <jhs@...atatu.com>
Cc:     Cong Wang <xiyou.wangcong@...il.com>, netdev@...r.kernel.org,
        Cong Wang <cong.wang@...edance.com>,
        syzbot+82752bc5331601cf4899@...kaller.appspotmail.com,
        syzbot+b3b63b6bff456bd95294@...kaller.appspotmail.com,
        syzbot+ba67b12b1ca729912834@...kaller.appspotmail.com,
        Jiri Pirko <jiri@...nulli.us>,
        Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
        Davide Caratti <dcaratti@...hat.com>,
        Briana Oursler <briana.oursler@...il.com>
Subject: Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by
 request_module()

On Sun 17 Jan 2021 at 17:15, Jamal Hadi Salim <jhs@...atatu.com> wrote:
> On 2021-01-16 7:56 p.m., Cong Wang wrote:
>> From: Cong Wang <cong.wang@...edance.com>
>> tcf_action_init_1() loads tc action modules automatically with
>> request_module() after parsing the tc action names, and it drops RTNL
>> lock and re-holds it before and after request_module(). This causes a
>> lot of troubles, as discovered by syzbot, because we can be in the
>> middle of batch initializations when we create an array of tc actions.
>> One of the problem is deadlock:
>> CPU 0					CPU 1
>> rtnl_lock();
>> for (...) {
>>    tcf_action_init_1();
>>      -> rtnl_unlock();
>>      -> request_module();
>> 				rtnl_lock();
>> 				for (...) {
>> 				  tcf_action_init_1();
>> 				    -> tcf_idr_check_alloc();
>> 				   // Insert one action into idr,
>> 				   // but it is not committed until
>> 				   // tcf_idr_insert_many(), then drop
>> 				   // the RTNL lock in the _next_
>> 				   // iteration
>> 				   -> rtnl_unlock();
>>      -> rtnl_lock();
>>      -> a_o->init();
>>        -> tcf_idr_check_alloc();
>>        // Now waiting for the same index
>>        // to be committed
>> 				    -> request_module();
>> 				    -> rtnl_lock()
>> 				    // Now waiting for RTNL lock
>> 				}
>> 				rtnl_unlock();
>> }
>> rtnl_unlock();
>> This is not easy to solve, we can move the request_module() before
>> this loop and pre-load all the modules we need for this netlink
>> message and then do the rest initializations. So the loop breaks down
>> to two now:
>>          for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>>                  struct tc_action_ops *a_o;
>>                  a_o = tc_action_load_ops(name, tb[i]...);
>>                  ops[i - 1] = a_o;
>>          }
>>          for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>>                  act = tcf_action_init_1(ops[i - 1]...);
>>          }
>> Although this looks serious, it only has been reported by syzbot, so it
>> seems hard to trigger this by humans. And given the size of this patch,
>> I'd suggest to make it to net-next and not to backport to stable.
>> This patch has been tested by syzbot and tested with tdc.py by me.
>> 
>
> LGTM.
> Initially i was worried about performance impact but i found nothing
> observable. We need to add a tdc test for batch (I can share how i did
> batch testing at next meet).
>
> Tested-by: Jamal Hadi Salim <jhs@...atatu.com>
> Acked-by: Jamal Hadi Salim <jhs@...atatu.com>
>
> cheers,
> jamal

Hi,

Thanks for adding me to the thread!
I ran our performance tests with the patch applied and didn't observe
any regression.

Regards,
Vlad