netdev - Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

Message-ID: <87fs7fxov6.fsf@nvidia.com>
Date: Mon, 29 May 2023 15:58:50 +0300
From: Vlad Buslov <vladbu@...dia.com>
To: Peilin Ye <yepeilin.cs@...il.com>, Jamal Hadi Salim <jhs@...atatu.com>
CC: Jakub Kicinski <kuba@...nel.org>, Pedro Tammela <pctammela@...atatu.com>,
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
	Paolo Abeni <pabeni@...hat.com>, Cong Wang <xiyou.wangcong@...il.com>, Jiri
 Pirko <jiri@...nulli.us>, Peilin Ye <peilin.ye@...edance.com>, Daniel
 Borkmann <daniel@...earbox.net>, "John Fastabend" <john.fastabend@...il.com>,
	Hillf Danton <hdanton@...a.com>, <netdev@...r.kernel.org>, Cong Wang
	<cong.wang@...edance.com>
Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and
 clsact Qdiscs before grafting

On Mon 29 May 2023 at 14:50, Vlad Buslov <vladbu@...dia.com> wrote:
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@...atatu.com> wrote:
>> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@...il.com> wrote:
>>>
>>> Hi Jakub and all,
>>>
>>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>>> > > Thanks a lot, I'll get right on it.
>>> >
>>> > Any insights? Is it just a live-lock inherent to the retry scheme
>>> > or we actually forget to release the lock/refcnt?
>>>
>>> I think it's just a thread holding the RTNL mutex for too long (replaying
>>> too many times).  We could replay for arbitrary times in
>>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>>> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>

Follow-up: I considered two approaches for preventing the deadlock:

- Refactor cls_api to always obtain the lock before taking a reference
  to the Qdisc. I started implementing a PoC that moves the rtnl_lock()
  call in tc_new_tfilter() before __tcf_qdisc_find(), but decided it is
  not feasible: cls_api will still try to obtain rtnl_lock when
  offloading a filter to a device with a non-unlocked driver, or after
  releasing the lock when loading a classifier module.

- Account for such cls_api behavior in sch_api by dropping and
  re-taking the lock before replaying. This actually seems to be quite
  straightforward, since the 'replay' functionality that we are reusing
  for this is designed for similar behavior: it releases the rtnl lock
  before loading a sch module, takes the lock again and safely replays
  the function by re-obtaining all the necessary data.
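
To illustrate the difference between spinning under the lock and
dropping it before each replay, here is a small single-threaded
userspace model. It is only a sketch: `struct model`, `filter_step()`
and `graft()` are hypothetical names that exist neither in sch_api nor
in cls_api, the lock is a plain enum rather than a mutex, and the
"scheduler" simply gives the filter task one slot per replay.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel symbols. */
enum owner { NOBODY, GRAFT, FILTER };

struct model {
	enum owner rtnl;	/* who holds the modelled RTNL mutex */
	int refcnt;		/* 1 = device only, 2 = device + pending filter */
	bool filter_done;
};

/* One scheduling slot for the filter task. It already holds a reference
 * and needs the lock to finish; only then does it drop the reference. */
static void filter_step(struct model *m)
{
	if (m->filter_done || m->rtnl != NOBODY)
		return;			/* finished, or blocked on the lock */
	m->rtnl = FILTER;		/* lock, do the work, unlock */
	m->rtnl = NOBODY;
	m->refcnt--;
	m->filter_done = true;
}

/* Returns the number of replays the graft task needed, or -1 if it was
 * still replaying after max_replays attempts. */
static int graft(struct model *m, bool drop_lock_before_replay, int max_replays)
{
	int replays = 0;

	assert(m->rtnl == NOBODY);
	m->rtnl = GRAFT;			/* models rtnl_lock() */
	for (;;) {
		if (m->refcnt == 1) {		/* models the dec-if-one check */
			m->refcnt = 0;
			m->rtnl = NOBODY;
			return replays;
		}
		if (replays == max_replays) {
			m->rtnl = NOBODY;
			return -1;
		}
		replays++;			/* models -EAGAIN and replay */
		if (drop_lock_before_replay)
			m->rtnl = NOBODY;	/* models rtnl_unlock() */
		filter_step(m);			/* the other task gets a slot */
		if (drop_lock_before_replay)
			m->rtnl = GRAFT;	/* models rtnl_lock() */
	}
}
```

In this model, starting from refcnt == 2, graft() with
drop_lock_before_replay == false never succeeds no matter how large
max_replays is, while with drop_lock_before_replay == true the filter
task gets the lock during the first replay and the graft then
completes.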

If livelock with concurrent filter insertion is an issue, then it can
be remedied by setting a new Qdisc->flags bit
"DELETED-REJECT-NEW-FILTERS" and checking for it together with
QDISC_CLASS_OPS_DOIT_UNLOCKED, in order to force any concurrent filter
insertion that arrives after the flag is set to synchronize on the rtnl
lock.
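
A rough sketch of what such a combined check could look like, written
as standalone userspace C. The MODEL_* constants and
filter_needs_rtnl() are hypothetical: the first constant merely stands
in for QDISC_CLASS_OPS_DOIT_UNLOCKED, the second for the proposed new
bit, and the values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for QDISC_CLASS_OPS_DOIT_UNLOCKED (value is arbitrary here). */
#define MODEL_CLASS_OPS_DOIT_UNLOCKED	0x1u
/* Hypothetical new Qdisc->flags bit ("DELETED-REJECT-NEW-FILTERS"). */
#define MODEL_QDISC_REJECT_NEW_FILTERS	0x2u

/* Decide whether a filter request has to synchronize on the rtnl lock.
 * It may stay unlocked only if the class ops allow it AND the Qdisc has
 * not been marked as being deleted. */
static bool filter_needs_rtnl(unsigned int cl_ops_flags,
			      unsigned int qdisc_flags)
{
	if (!(cl_ops_flags & MODEL_CLASS_OPS_DOIT_UNLOCKED))
		return true;
	if (qdisc_flags & MODEL_QDISC_REJECT_NEW_FILTERS)
		return true;
	return false;
}
```

With a check of this shape, a request that arrives after the bit is set
would wait for the rtnl lock instead of pinning the old Qdisc while the
graft is replaying.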

Thoughts?

>>>
>>> I tested the new reproducer Pedro posted, on:
>>>
>>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>>
>>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>>    any issues (in about 30 minutes).
>>>
>>> 3. All 6 v5 patches, plus this diff:
>>>
>>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>>> index 286b7c58f5b9..988718ba5abe 100644
>>> --- a/net/sched/sch_api.c
>>> +++ b/net/sched/sch_api.c
>>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>>                          */
>>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>>> +                               rtnl_unlock();
>>> +                               rtnl_lock();
>>>                                 return -EAGAIN;
>>> +                       }
>>>                 }
>>>
>>>                 if (dev->flags & IFF_UP)
>>>
>>>    Did not trigger any issues (in about 30 minutes) either.
>>>
>>> What would you suggest?
>>
>>
>> I am more worried it is a whack-a-mole situation. We fixed the first
>> reproducer with essentially patches 1-4 but we opened a new one which
>> the second reproducer catches. One thing the current reproducer does
>> is create a lot of rtnl contention in the beginning by creating all
>> those devices, and after that it is just creating/deleting qdiscs and
>> doing updates with flower, where such contention is reduced. I.e. it
>> may just take longer for the mole to pop up.
>>
>> Why don't we push the V1 patch in and then worry about getting clever
>> with EAGAIN after? Can you test the V1 version with the repro Pedro
>> posted? It shouldn't have these issues. Also it would be interesting to
>> see how performance of the parallel updates to flower is affected.
>
> This, or at least push the first 4 patches of this series. They target
> other, older commits and fix straightforward issues with the API.