Date: Wed, 10 May 2023 13:11:19 -0700
From: Peilin Ye <yepeilin.cs@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
	Jamal Hadi Salim <jhs@...atatu.com>,
	Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us>,
	Peilin Ye <peilin.ye@...edance.com>,
	Daniel Borkmann <daniel@...earbox.net>,
	John Fastabend <john.fastabend@...il.com>,
	Vlad Buslov <vladbu@...lanox.com>,
	Pedro Tammela <pctammela@...atatu.com>,
	Hillf Danton <hdanton@...a.com>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, Cong Wang <cong.wang@...edance.com>
Subject: Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and
 clsact Qdiscs before grafting

On Mon, May 08, 2023 at 06:33:24PM -0700, Jakub Kicinski wrote:
> Great analysis, thanks for squashing this bug.

Thanks, happy to help!

> Have you considered creating a fix more localized to the miniq
> implementation? It seems that having per-device miniq pointers is
> incompatible with using reference counted objects. So miniq is
> a more natural place to solve the problem. Otherwise workarounds
> in the core keep piling up (here qdisc_graft()).
>
> Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
> If active qdisc is neither a1 nor a2 we should leave the dev state
> alone.

Yes, I have tried fixing this in mini_qdisc_pair_swap(), but I am afraid
it is hard:

(3rd) is called from ->destroy(), so currently it uses RCU_INIT_POINTER()
to set dev->miniq_ingress to NULL.  It would need logic like:

  I am A.  Set dev->miniq_ingress to NULL, if and only if it is a1 or a2,
  and do it atomically.

A single cmpxchg() compares against only one expected value, so we need
more than that to implement this "set NULL iff a1 or a2".
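
A minimal user-space sketch of that rule, using C11 atomics in place of the
kernel's cmpxchg().  The names struct mini and clear_if_ours() are
hypothetical stand-ins, not kernel API.  Since the slot may flip between a1
and a2 concurrently, the sketch re-reads and retries:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mini_Qdisc. */
struct mini { int id; };

/*
 * Clear *slot only if it currently holds a1 or a2.
 * Returns true if this call stored NULL, false if the slot held
 * something else (for example a mini Qdisc of another Qdisc).
 */
static bool clear_if_ours(_Atomic(struct mini *) *slot,
                          struct mini *a1, struct mini *a2)
{
    struct mini *cur = atomic_load(slot);

    while (cur == a1 || cur == a2) {
        /* On failure, cur is refreshed with the value now in *slot. */
        if (atomic_compare_exchange_weak(slot, &cur, (struct mini *)NULL))
            return true;
    }
    return false;
}
```

This only illustrates the compare step; it does not model RCU grace periods
or the ordering of the swaps.
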
Additionally:

On Fri,  5 May 2023 17:16:10 -0700 Peilin Ye wrote:
>   Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>   adds a flower filter X to A.
> 
>   Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>   b2) to replace A, then adds a flower filter Y to B.
> 
>  Thread 1               A's refcnt   Thread 2
>   RTM_NEWQDISC (A, RTNL-locked)
>    qdisc_create(A)               1
>    qdisc_graft(A)                9
> 
>   RTM_NEWTFILTER (X, RTNL-lockless)
>    __tcf_qdisc_find(A)          10
>    tcf_chain0_head_change(A)
>    mini_qdisc_pair_swap(A) (1st)
>             |
>             |                         RTM_NEWQDISC (B, RTNL-locked)
>            RCU                   2     qdisc_graft(B)
>             |                    1     notify_and_destroy(A)
>             |
>    tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-lockless)
>    qdisc_destroy(A)                    tcf_chain0_head_change(B)
>    tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>    mini_qdisc_pair_swap(A) (3rd)                |
>            ...                                 ...

Looking at the code, I don't think anything guarantees that (1st) happens
before (2nd), unlikely as the reverse order may be.  Can RTNL-lockless
RTM_NEWTFILTER handlers get preempted?
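
To see why that order matters, here is a deliberately simplified,
single-threaded replay.  It assumes the publish step is an unconditional
store; struct mini, struct dev, publish() and replay_late_first() are
hypothetical stand-ins and not the kernel's definitions:

```c
#include <stddef.h>

struct mini { char owner; };                  /* stand-in for struct mini_Qdisc */
struct dev  { struct mini *miniq_ingress; };  /* only the field discussed here */

/*
 * Stand-in for the publish step of mini_qdisc_pair_swap(): an
 * unconditional store, with no check of which Qdisc is attached.
 */
static void publish(struct dev *d, struct mini *m)
{
    d->miniq_ingress = m;
}

/* Replay the unlucky order: B's (2nd) lands first, A's delayed (1st) last. */
static struct mini *replay_late_first(struct dev *d, struct mini *a1,
                                      struct mini *b1)
{
    publish(d, b1);   /* (2nd): new Qdisc B publishes b1 */
    publish(d, a1);   /* (1st): old Qdisc A publishes a1 afterwards */
    return d->miniq_ingress;
}
```

After the replay the device pointer refers to a1, a mini Qdisc of the
already-replaced Qdisc A.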

If (1st) happens later than (2nd), we would need to make (1st) a no-op by
detecting that we are the "old" Qdisc.  I am not sure there is any
(clean) way to do it.  I even thought about:

  (1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
  (2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping".  If false, it
      means we are the "old" Qdisc (have been replaced), and should do
      nothing.

However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
egress one, so I can't container_of() during step (1) ...
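
The container_of() problem can be shown in user space.  The structs below
are simplified stand-ins (assumed layouts, not the kernel's private data
structs), and the helper names are hypothetical:

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct pair { int unused; };   /* stand-in for struct mini_Qdisc_pair */

struct ingress_priv {          /* one pair: the member is unambiguous */
    int other_state;
    struct pair miniqp;
};

struct clsact_priv {           /* two pairs of the same type */
    struct pair miniqp_ingress;
    struct pair miniqp_egress;
};

/* Always correct: an ingress Qdisc has exactly one pair. */
static struct ingress_priv *ingress_owner(struct pair *p)
{
    return container_of(p, struct ingress_priv, miniqp);
}

/* Correct only if the caller already knows p is the egress pair. */
static struct clsact_priv *clsact_owner_if_egress(struct pair *p)
{
    return container_of(p, struct clsact_priv, miniqp_egress);
}

/* Correct only if p is the ingress pair; wrong for the egress pair. */
static struct clsact_priv *clsact_owner_if_ingress(struct pair *p)
{
    return container_of(p, struct clsact_priv, miniqp_ingress);
}
```

Given only a bare pair pointer, nothing says which of the two clsact
helpers applies, and picking the wrong one yields a bogus owner pointer.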

In the end I wrote [5,6/6].  It is a workaround indeed, in the sense
that it changes sch_api.c to avoid a mini Qdisc issue.  However, I think it
makes the code correct in a relatively understandable way, without slowing
down mini_qdisc_pair_swap() or sch_handle_*gress().

Thanks,
Peilin Ye

