linux-kernel - Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAM_iQpWGaTSkg+-Em6u=NSWcyswX-xN=-1p0OAdaR___U1M4rg@mail.gmail.com>
Date:   Wed, 2 Sep 2020 19:53:38 -0700
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     Yunsheng Lin <linyunsheng@...wei.com>
Cc:     Jamal Hadi Salim <jhs@...atatu.com>, Jiri Pirko <jiri@...nulli.us>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>, linuxarm@...wei.com
Subject: Re: [PATCH net-next] net: sch_generic: aviod concurrent reset and
 enqueue op for lockless qdisc

On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>
> On 2020/9/3 9:48, Cong Wang wrote:
> > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>
> >> On 2020/9/3 8:35, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>>>
> >>>> On 2020/9/2 12:41, Cong Wang wrote:
> >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>>>>>
> >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin <linyunsheng@...wei.com> wrote:
> >>>>>>>>
> >>>>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>>>
> >>>>>>> Can you be more specific here? Which call path requests a smaller
> >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>>>> we already have a synchronize_net() there.
> >>>>>>
> >>>>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>>>> do the correct work, as below:
> >>>>>>
> >>>>>> CPU 0:                                       CPU1:
> >>>>>> __dev_queue_xmit()                       netif_set_real_num_tx_queues()
> >>>>>> rcu_read_lock_bh();
> >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>>>>         .
> >>>>>>         .                               dev->real_num_tx_queues = txq;
> >>>>>>         .                                       .
> >>>>>>         .                                       .
> >>>>>>         .                               synchronize_net();
> >>>>>>         .                                       .
> >>>>>> q->enqueue()                                    .
> >>>>>>         .                                       .
> >>>>>> rcu_read_unlock_bh()                            .
> >>>>>>                                         qdisc_reset_all_tx_gt
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>>
> >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >>>>>> too.
> >>>>>>
> >>>>>> The problem we hit is as below:
> >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >>>>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >>>>>> netif_set_real_num_tx_queues() does not help here.
> >>>>>
> >>>>> How could qdisc still be running after deactivating the device?
> >>>>
> >>>> qdisc could be running during the device deactivating process.
> >>>>
> >>>> The main process of changing channel number is as below:
> >>>>
> >>>> 1. dev_deactivate()
> >>>> 2. hns3 handware related setup
> >>>> 3. netif_set_real_num_tx_queues()
> >>>> 4. netif_tx_wake_all_queues()
> >>>> 5. dev_activate()
> >>>>
> >>>> During step 1, qdisc could be running while qdisc is resetting, so
> >>>> there could be skb left in the old qdisc(which will be restored back to
> >>>> txq->qdisc during dev_activate()), as below:
> >>>>
> >>>> CPU 0:                                       CPU1:
> >>>> __dev_queue_xmit():                      dev_deactivate_many():
> >>>> rcu_read_lock_bh();                      qdisc_deactivate(qdisc);
> >>>> q = rcu_dereference_bh(txq->qdisc);             .
> >>>> netdev_core_pick_tx(dev, skb, sb_dev);          .
> >>>>         .
> >>>>         .                               rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> >>>>         .                                       .
> >>>>         .                                       .
> >>>>         .                                       .
> >>>>         .                                       .
> >>>> q->enqueue()                                    .
> >>>
> >>>
> >>> Well, like I said, if the deactivated bit were tested before ->enqueue(),
> >>> there would be no packet queued after qdisc_deactivate().
> >>
> >> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> >> otherwise there is still window between setting and testing the deactivated bit.
> >
> > Can you be more specific here? Why testing or setting a bit is not atomic?
>
> testing a bit or setting a bit separately is atomic.
> But testing a bit and setting a bit is not atomic, right?
>
>   cpu0:                   cpu1:
>                         testing A bit
> setting A bit                .
>        .                     .
>        .               qdisc enqueuing
> qdisc reset
>

Well, this was not a problem until commit d518d2ed8640c1cbbb.
Prior to that commit, qdsic can still be scheduled even with this
race condition, that is, the packet just enqueued after resetting can
still be dequeued with qdisc_run().

It is amazing to see how messy the lockless qdisc is now.

Thanks.