Open Source and information security mailing list archives
Message-ID: <CAM0EoM=xLkAr5EF7bty+ETmZ3GXnmB9De3fYSCrQjKPb8qDy7Q@mail.gmail.com>
Date: Sun, 28 May 2023 14:54:29 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Peilin Ye <yepeilin.cs@...il.com>
Cc: Jakub Kicinski <kuba@...nel.org>, Pedro Tammela <pctammela@...atatu.com>,
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
	Paolo Abeni <pabeni@...hat.com>, Cong Wang <xiyou.wangcong@...il.com>,
	Jiri Pirko <jiri@...nulli.us>, Peilin Ye <peilin.ye@...edance.com>,
	Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>,
	Hillf Danton <hdanton@...a.com>, netdev@...r.kernel.org,
	Cong Wang <cong.wang@...edance.com>, Vlad Buslov <vladbu@...dia.com>
Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and
 clsact Qdiscs before grafting

On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@...il.com> wrote:
>
> Hi Jakub and all,
>
> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> > > Thanks a lot, I'll get right on it.
> >
> > Any insights? Is it just a live-lock inherent to the retry scheme
> > or we actually forget to release the lock/refcnt?
>
> I think it's just a thread holding the RTNL mutex for too long (replaying
> too many times). We could replay an arbitrary number of times in
> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
> requests for the old Qdisc.
>
> I tested the new reproducer Pedro posted, on:
>
> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>
> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>    any issues (in about 30 minutes).
>
> 3. All 6 v5 patches, plus this diff:
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..988718ba5abe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		 * RTNL-unlocked filter request(s).  This is the counterpart of that
>  		 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>  		 */
> -		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +		if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> +			rtnl_unlock();
> +			rtnl_lock();
>  			return -EAGAIN;
> +		}
>  	}
>
>  	if (dev->flags & IFF_UP)
>
> Did not trigger any issues (in about 30 minutes) either.
>
> What would you suggest?

I am more worried it is a whack-a-mole situation. We fixed the first
reproducer with essentially patches 1-4, but we opened a new hole which
the second reproducer catches. One thing the current reproducer does is
create a lot of rtnl contention in the beginning by creating all those
devices; after that it is just creating/deleting qdiscs and doing updates
with flower, where such contention is reduced. I.e., it may just take
longer for the mole to pop up.

Why don't we push the v1 patch in and then worry about getting clever
with EAGAIN after?

Can you test the v1 version with the repro Pedro posted? It shouldn't
have these issues. Also it would be interesting to see how the
performance of parallel updates to flower is affected.

cheers,
jamal

> Thanks,
> Peilin Ye
>