netdev - Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by request


Message-ID: <acba35f6-2e29-903f-6eb8-a50dde25a147@mojatatu.com>
Date:   Sun, 17 Jan 2021 10:15:45 -0500
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Cong Wang <xiyou.wangcong@...il.com>, netdev@...r.kernel.org
Cc:     Cong Wang <cong.wang@...edance.com>,
        syzbot+82752bc5331601cf4899@...kaller.appspotmail.com,
        syzbot+b3b63b6bff456bd95294@...kaller.appspotmail.com,
        syzbot+ba67b12b1ca729912834@...kaller.appspotmail.com,
        Jiri Pirko <jiri@...nulli.us>,
        Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
        Davide Caratti <dcaratti@...hat.com>,
        Vlad Buslov <vlad@...lov.dev>,
        Briana Oursler <briana.oursler@...il.com>
Subject: Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by
 request_module()

On 2021-01-16 7:56 p.m., Cong Wang wrote:
> From: Cong Wang <cong.wang@...edance.com>
> 
> tcf_action_init_1() loads tc action modules automatically with
> request_module() after parsing the tc action names, dropping the RTNL
> lock before request_module() and re-acquiring it afterwards. This causes
> a lot of trouble, as discovered by syzbot, because we can be in the
> middle of a batch initialization when we create an array of tc actions.
> 
> One of the problems is a deadlock:
> 
> CPU 0					CPU 1
> rtnl_lock();
> for (...) {
>    tcf_action_init_1();
>      -> rtnl_unlock();
>      -> request_module();
> 				rtnl_lock();
> 				for (...) {
> 				  tcf_action_init_1();
> 				    -> tcf_idr_check_alloc();
> 				   // Insert one action into idr,
> 				   // but it is not committed until
> 				   // tcf_idr_insert_many(), then drop
> 				   // the RTNL lock in the _next_
> 				   // iteration
> 				   -> rtnl_unlock();
>      -> rtnl_lock();
>      -> a_o->init();
>        -> tcf_idr_check_alloc();
>        // Now waiting for the same index
>        // to be committed
> 				    -> request_module();
> 				    -> rtnl_lock()
> 				    // Now waiting for RTNL lock
> 				}
> 				rtnl_unlock();
> }
> rtnl_unlock();
> 
> This is not easy to solve. We can move the request_module() call before
> this loop, pre-load all the modules we need for this netlink
> message, and then do the rest of the initialization. So the loop is now
> split in two:
> 
>          for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>                  struct tc_action_ops *a_o;
> 
>                  a_o = tc_action_load_ops(name, tb[i]...);
>                  ops[i - 1] = a_o;
>          }
> 
>          for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
>                  act = tcf_action_init_1(ops[i - 1]...);
>          }
> 
> Although this looks serious, it has only been reported by syzbot, so it
> seems hard for humans to trigger. Given the size of this patch,
> I'd suggest targeting net-next and not backporting to stable.
> 
> This patch has been tested by syzbot and tested with tdc.py by me.
> 
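
The two-loop split quoted above can be illustrated with a small userspace
sketch. This is not the kernel patch: a pthread mutex stands in for the RTNL
lock, and load_ops(), init_action(), init_batch(), the registry table and
MAX_ACTIONS are all hypothetical names chosen for illustration. The point it
tries to show is that only the first loop may drop the lock, so nothing is
half-created when another thread gets in.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_ACTIONS 4 /* hypothetical stand-in for TCA_ACT_MAX_PRIO */

/* Stand-in for the RTNL lock (assumption: one global mutex). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

struct action_ops {
	const char *kind;
	int loaded;
};

/* Hypothetical table of loadable action kinds. */
static struct action_ops registry[] = {
	{ "gact", 0 }, { "mirred", 0 }, { "police", 0 },
};

/* Analogue of a load step: it may drop and retake the lock, as the
 * request_module() path does. Called with big_lock held. */
static struct action_ops *load_ops(const char *kind)
{
	size_t i;

	for (i = 0; i < sizeof(registry) / sizeof(registry[0]); i++) {
		if (strcmp(registry[i].kind, kind) != 0)
			continue;
		if (!registry[i].loaded) {
			pthread_mutex_unlock(&big_lock);
			/* a slow module load would run here, unlocked */
			pthread_mutex_lock(&big_lock);
			registry[i].loaded = 1;
		}
		return &registry[i];
	}
	return NULL;
}

/* Analogue of the init step: never drops the lock. */
static int init_action(const struct action_ops *ops)
{
	return ops->loaded ? 1 : 0;
}

/* Returns the number of actions created, or -1 on error. */
int init_batch(const char *kinds[], int n)
{
	struct action_ops *ops[MAX_ACTIONS];
	int i, created = 0;

	if (n < 0 || n > MAX_ACTIONS)
		return -1;

	pthread_mutex_lock(&big_lock);

	/* Phase 1: resolve every ops pointer; the lock may be dropped
	 * here, but no action has been created yet. */
	for (i = 0; i < n; i++) {
		ops[i] = load_ops(kinds[i]);
		if (!ops[i]) {
			pthread_mutex_unlock(&big_lock);
			return -1;
		}
	}

	/* Phase 2: create all actions with the lock held throughout. */
	for (i = 0; i < n; i++)
		created += init_action(ops[i]);

	pthread_mutex_unlock(&big_lock);
	return created;
}
```

Calling init_batch() with { "gact", "mirred", "gact" } and n = 3 would walk
both phases once; an unknown kind fails in phase 1 before anything is created.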

LGTM.
Initially I was worried about the performance impact, but I found nothing
observable. We need to add a tdc test for batch (I can share how I did
batch testing at the next meeting).

Tested-by: Jamal Hadi Salim <jhs@...atatu.com>
Acked-by: Jamal Hadi Salim <jhs@...atatu.com>

cheers,
jamal