linux-kernel - Re: [RFC v2] net: sched: implement TCQ_F_CAN

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4eb6bb79-0122-3472-d4a6-ed41475f6a96@huawei.com>
Date:   Tue, 16 Mar 2021 20:36:50 +0800
From:   Yunsheng Lin <linyunsheng@...wei.com>
To:     Eric Dumazet <edumazet@...gle.com>
CC:     Jakub Kicinski <kuba@...nel.org>,
        David Miller <davem@...emloft.net>,
        Vladimir Oltean <olteanv@...il.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andriin@...com>,
        "Wei Wang" <weiwan@...gle.com>,
        Cong Wang <cong.wang@...edance.com>,
        Taehee Yoo <ap420073@...il.com>,
        netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>, <linuxarm@...neuler.org>,
        Marc Kleine-Budde <mkl@...gutronix.de>,
        <linux-can@...r.kernel.org>
Subject: Re: [RFC v2] net: sched: implement TCQ_F_CAN_BYPASS for lockless
 qdisc

On 2021/3/16 16:15, Eric Dumazet wrote:
> On Tue, Mar 16, 2021 at 1:35 AM Yunsheng Lin <linyunsheng@...wei.com> wrote:
>>
>> On 2021/3/16 2:53, Jakub Kicinski wrote:
>>> On Mon, 15 Mar 2021 11:10:18 +0800 Yunsheng Lin wrote:
>>>> @@ -606,6 +623,11 @@ static const u8 prio2band[TC_PRIO_MAX + 1] = {
>>>>   */
>>>>  struct pfifo_fast_priv {
>>>>      struct skb_array q[PFIFO_FAST_BANDS];
>>>> +
>>>> +    /* protect against data race between enqueue/dequeue and
>>>> +     * qdisc->empty setting
>>>> +     */
>>>> +    spinlock_t lock;
>>>>  };
>>>>
>>>>  static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
>>>> @@ -623,7 +645,10 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>>>      unsigned int pkt_len = qdisc_pkt_len(skb);
>>>>      int err;
>>>>
>>>> -    err = skb_array_produce(q, skb);
>>>> +    spin_lock(&priv->lock);
>>>> +    err = __ptr_ring_produce(&q->ring, skb);
>>>> +    WRITE_ONCE(qdisc->empty, false);
>>>> +    spin_unlock(&priv->lock);
>>>>
>>>>      if (unlikely(err)) {
>>>>              if (qdisc_is_percpu_stats(qdisc))
>>>> @@ -642,6 +667,7 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>      struct sk_buff *skb = NULL;
>>>>      int band;
>>>>
>>>> +    spin_lock(&priv->lock);
>>>>      for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>>              struct skb_array *q = band2list(priv, band);
>>>>
>>>> @@ -655,6 +681,7 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>      } else {
>>>>              WRITE_ONCE(qdisc->empty, true);
>>>>      }
>>>> +    spin_unlock(&priv->lock);
>>>>
>>>>      return skb;
>>>>  }
>>>
>>> I thought pfifo was supposed to be "lockless" and this change
>>> re-introduces a lock between producer and consumer, no?
>>
>> Yes, the lock breaks the "lockless" of the lockless qdisc for now
>> I do not how to solve the below data race locklessly:
>>
>>         CPU1:                                   CPU2:
>>       dequeue skb                                .
>>           .                                      .
>>           .                                 enqueue skb
>>           .                                      .
>>           .                      WRITE_ONCE(qdisc->empty, false);
>>           .                                      .
>>           .                                      .
>> WRITE_ONCE(qdisc->empty, true);
> 
> 
> Maybe it is time to fully document/explain how this can possibly work.

I might be able to provide some document/explain on how the lockless
qdisc work for I was looking through the code the past few days.

By "lockless", I suppose it means there is no lock between enqueuing and
dequeuing, whoever grasps the qdisc->seqlock can dequeue the skb and send
it out after enqueuing the skb it tries to send, other CPU which does not
grasp the qdisc->seqlock just enqueue the skb and return, hoping other cpu
holding the qdisc->seqlock can dequeue it's skb and send it out.

For the locked qdisck(the one without TCQ_F_NOLOCK flags set), it holds
the qdisc_lock() when doing the enqueuing/dequeuing and sch_direct_xmit(),
and in sch_direct_xmit() it releases the qdisc_lock() when doing skb validation
and calling dev_hard_start_xmit() to send skb to the driver, and hold the
qdisc_lock() again after calling dev_hard_start_xmit(), so other cpu may
grasps the qdisc_lock() and repeat the above process during that qdisc_lock()
release period.

So the main difference between lockess qdisc and locked qdisc is if
there is a lock to protect both enqueuing and dequeuing. For example, pfifo_fast
uses ptr_ring to become lockless for enqueuing or dequeuing, but it still needs
producer_lock to protect the concurrent enqueue, and consumer_lock to protect
the concurrent dequeue. for Other qdisc that can not provide the lockless between
enqueuing or dequeuing, maybe we implement the locking in the specific qdisc
implementation, so that it still can claim to be "lockless", like the locking
added for pfifo_fast in this patch.

Idealy we can claim all qdisc to be lockess qdisc as long as we make sure
all qdisc either use lockless implementation, or use internal lock to protect
between enqueuing and dequeuing, so that we can remove the locked dqisc and
will have less contention between enqueuing and dequeuing.

Below patch is the first try to remove the difference between lockess qdisc
and locked qdisc, so that we can see if we can remove the locked qdisc in the
future:

https://patchwork.kernel.org/project/netdevbpf/patch/1615800610-34700-1-git-send-email-linyunsheng@huawei.com/

I may be wrong about the above analysis, if there is any error, please
point it out.

> 
> lockless qdisc used concurrently by multiple cpus, using
> WRITE_ONCE() and READ_ONCE() ?
> 
> Just say no to this.

what do you mean "no" here? Do you mean it is impossible to implement lockless
qdisc correctly?  Or TCQ_F_CAN_BYPASS for lockless qdisc is a wrong direction?
And is there any reason?

> 
>>
>> If the above happens, the qdisc->empty is true even if the qdisc has some
>> skb, which may cuase out of order or packet stuck problem.
>>
>> It seems we may need to update ptr_ring' status(empty or not) while
>> enqueuing/dequeuing atomically in the ptr_ring implementation.
>>
>> Any better idea?
> 
> .
>