Message-ID: <ZFv6Z7hssZ9snNAw@C02FL77VMD6R.googleapis.com>
Date: Wed, 10 May 2023 13:11:19 -0700
From: Peilin Ye <yepeilin.cs@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
Jamal Hadi Salim <jhs@...atatu.com>,
Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
Peilin Ye <peilin.ye@...edance.com>,
Daniel Borkmann <daniel@...earbox.net>,
John Fastabend <john.fastabend@...il.com>,
Vlad Buslov <vladbu@...lanox.com>,
Pedro Tammela <pctammela@...atatu.com>,
Hillf Danton <hdanton@...a.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, Cong Wang <cong.wang@...edance.com>
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and
clsact Qdiscs before grafting
On Mon, May 08, 2023 at 06:33:24PM -0700, Jakub Kicinski wrote:
> Great analysis, thanks for squashing this bug.
Thanks, happy to help!
> Have you considered creating a fix more localized to the miniq
> implementation? It seems that having per-device miniq pointers is
> incompatible with using reference counted objects. So miniq is
> a more natural place to solve the problem. Otherwise workarounds
> in the core keep piling up (here qdisc_graft()).
>
> Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
> If active qdisc is neither a1 nor a2 we should leave the dev state
> alone.
Yes, I have tried fixing this in mini_qdisc_pair_swap(), but I am afraid
it is hard:
(3rd) is called from ->destroy(), so currently it uses RCU_INIT_POINTER()
to set dev->miniq_ingress to NULL. It would need logic like:
I am A. Set dev->miniq_ingress to NULL, if and only if it is a1 or a2,
and do it atomically.
We need more than a cmpxchg() to implement this "set NULL iff a1 or a2".
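To illustrate why a single cmpxchg() is not enough, here is a minimal user-space sketch of the "set NULL iff a1 or a2" rule, written with C11 atomics instead of kernel primitives. The names struct mini_q and clear_if_ours() are hypothetical and exist only for this example. The sketch needs a load plus a compare-and-swap retry loop, and it says nothing about whether such a loop would be acceptable in mini_qdisc_pair_swap(). It also does not address the ordering of (1st) and (2nd) discussed below.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mini_Qdisc; not the kernel type. */
struct mini_q {
	int id;
};

/*
 * Set *slot to NULL if and only if it currently points at a1 or a2.
 * Returns true if this call cleared the slot, false if the slot held
 * something else (for example NULL or another Qdisc's mini Qdisc).
 */
static bool clear_if_ours(struct mini_q *_Atomic *slot,
			  struct mini_q *a1, struct mini_q *a2)
{
	struct mini_q *cur;

	assert(slot != NULL);
	cur = atomic_load(slot);

	while (cur == a1 || cur == a2) {
		/* On failure, cur is updated to the value now in *slot. */
		if (atomic_compare_exchange_weak(slot, &cur,
						 (struct mini_q *)NULL))
			return true;
	}
	return false;
}
```

If the slot already points at b1 or b2, the loop body never runs and the slot is left alone, which is the "leave the dev state alone" behaviour suggested above.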
Additionally:
On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
>     Thread 1               A's refcnt   Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)               1
>    qdisc_graft(A)                9
>
>   RTM_NEWTFILTER (X, RTNL-lockless)
>    __tcf_qdisc_find(A)          10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>             |
>             |                         RTM_NEWQDISC (B, RTNL-locked)
>            RCU                   2     qdisc_graft(B)
>             |                    1     notify_and_destroy(A)
>             |
>    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
>    qdisc_destroy(A)                    tcf_chain0_head_change(B)
>    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>    mini_qdisc_pair_swap(A) (3rd)                |
>            ...                                 ...
Looking at the code, I think there is no guarantee that (1st) cannot
happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
handlers get preempted?
If (1st) happens later than (2nd), we will need to make (1st) a no-op, by
detecting that we are the "old" Qdisc. I am not sure there is any
(clean) way to do it. I even thought about:
(1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
(2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping". If false, it
means we are the "old" Qdisc (have been replaced), and should do
nothing.
However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
egress one, so I can't container_of() during step (1) ...
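The container_of() problem in step (1) can be shown with a small user-space sketch. The struct layout below is a simplified, hypothetical stand-in for clsact's private data (two pair members of the same type), and the container_of() macro is a plain offsetof()-based copy without the kernel's type checking. With only a pointer to a pair, the caller has to already know which member it is, otherwise the computed owner is wrong.

```c
#include <assert.h>
#include <stddef.h>

/* User-space version of the container_of() idea, no type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct mini_Qdisc_pair. */
struct mini_pair {
	int dummy;
};

/* Hypothetical, simplified stand-in for clsact's private data. */
struct clsact_like {
	struct mini_pair miniqp_ingress;
	struct mini_pair miniqp_egress;
};

/* Correct only if p really is the ingress member. */
static struct clsact_like *owner_if_ingress(struct mini_pair *p)
{
	assert(p != NULL);
	return container_of(p, struct clsact_like, miniqp_ingress);
}

/* Correct only if p really is the egress member. */
static struct clsact_like *owner_if_egress(struct mini_pair *p)
{
	assert(p != NULL);
	return container_of(p, struct clsact_like, miniqp_egress);
}
```

Calling owner_if_ingress() on the egress member yields a pointer that is off by the distance between the two members, which is why step (1) would need extra information (for example a flag or back-pointer in the pair) that the current code does not carry.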
Eventually I created [5,6/6]. It is a workaround indeed, in the sense
that it changes sch_api.c to avoid a mini Qdisc issue. However I think it
makes the code correct in a relatively understandable way, without slowing
down mini_qdisc_pair_swap() or sch_handle_*gress().
Thanks,
Peilin Ye