netdev - Re: packet stuck in qdisc : patch proposal

Message-ID: <b35b766c-a25a-fcf7-d329-31948e219f5d@gmail.com>
Date:   Mon, 23 May 2022 19:55:38 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Vincent Ray <vray@...rayinc.com>,
        linyunsheng <linyunsheng@...wei.com>
Cc:     davem <davem@...emloft.net>,
        方国炬 <guoju.fgj@...baba-inc.com>,
        kuba <kuba@...nel.org>, netdev <netdev@...r.kernel.org>,
        Samuel Jones <sjones@...rayinc.com>,
        vladimir oltean <vladimir.oltean@....com>,
        Guoju Fang <gjfang@...ux.alibaba.com>,
        Remy Gauguey <rgauguey@...rayinc.com>,
        Eric Dumazet <edumazet@...gle.com>
Subject: Re: packet stuck in qdisc : patch proposal


On 5/23/22 06:54, Vincent Ray wrote:
> Hi Yunsheng, all,
>
> I finally spotted the bug that caused (nvme-)tcp packets to remain stuck in the qdisc once in a while.
> It's in qdisc_run_begin within sch_generic.h :
>
> smp_mb__before_atomic();
>   
> // [comments]
>
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
>
> should be
>
> smp_mb();
>
> // [comments]
>
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
>
> I have written a more detailed explanation in the attached patch, including a race example, but in short that's because test_bit() is not an atomic operation.
> Therefore it does not give you any ordering guarantee on any architecture.
> And neither does spin_trylock() called at the beginning of qdisc_run_begin() when it does not grab the lock...
> So test_bit() may be reordered with a preceding enqueue(), leading to a possible race in the dialog with pfifo_fast_dequeue().
> We may then end up with a skbuff pushed "silently" to the qdisc (MISSED cleared, nobody aware that there is something in the backlog).
> Then the cores pushing new skbuffs to the qdisc may all bypass it for an arbitrary amount of time, leaving the enqueued skbuff stuck in the backlog.
>
> I believe the reason for which you could not reproduce the issue on ARM64 is that, on that architecture, smp_mb__before_atomic() will translate to a memory barrier.
> It does not on x86 (turned into a NOP) because you're supposed to use this function just before an atomic operation, and atomic operations themselves provide full ordering effects on x86.
>
> I think the code has been flawed for some time but the introduction of a (true) bypass policy in 5.14 made it more visible, because without this, the "victim" skbuff does not stay very long in the backlog : it is bound to be popped by the next core executing __qdisc_run().
>
> In my setup, with our use case (16 (virtual) cpus in a VM shooting 4KB buffers with fio through a -i4 nvme-tcp connection to a target), I did not notice any performance degradation using smp_mb() in place of smp_mb__before_atomic(), but of course that does not mean it cannot happen in other configs.
>
> I think Guoju's patch is also correct and necessary so that both patches, his and mine, should be applied "asap" to the kernel.
> A difference between Guoju's race and "mine" is that, in his case, the MISSED bit will be set : though no one will take care of the skbuff immediately, the next cpu pushing to the qdisc (if ever ...) will notice and dequeue it (so Guoju's race probably happens in my use case too but is not noticeable).
>
> Finally, given the necessity of these two new full barriers in the code, I wonder if the whole lockless (+ bypass) thing should be reconsidered.
> At least, I think general performance tests should be run to check that lockless qdiscs still outperform locked qdiscs, in both bypassable and not-bypassable modes.
>      
> More generally, I found this piece of code quite tricky and error-prone, as evidenced by the numerous fixes it went through in the recent history.
> I believe most of this complexity comes from the lockless qdisc handling in itself, but of course the addition of the bypass support does not really help ;-)
> I'm a linux kernel beginner however, so I'll let more experienced programmers decide about that :-)
>
> I've made sure that, with this patch, no stuck packets happened any more on both v5.15 and v5.18-rc2 (whereas without the patch, numerous occurrences of stuck packets are visible).
> I'm quite confident it will apply to any concerned version, that is from 5.14 (or before) to mainline.
>
> Can you please tell me :
>
> 1) if you agree with this ?
>
> 2) how to proceed to push this patch (and Guoju's) for quick integration into the mainline ?
>
> NB : an alternative fix (which I've tested OK too) would be to simply remove the
>
> if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
> 	return false;
>
> code path, but I have no clue if this would be better or worse than the present patch in terms of performance.
>      
> Thank you, best regards,
>
> V


We keep adding code and comments, this is quite silly.

test_and_set_bit() is exactly what we need.


diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3ba3d143de4d63f4142cff3c9b9f3e..9d1b448c0dfc3925967635f3390b884a4ef7c55a 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -187,35 +187,9 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
                 if (spin_trylock(&qdisc->seqlock))
                         return true;

-               /* Paired with smp_mb__after_atomic() to make sure
-                * STATE_MISSED checking is synchronized with clearing
-                * in pfifo_fast_dequeue().
-                */
-               smp_mb__before_atomic();
-
-               /* If the MISSED flag is set, it means other thread has
-                * set the MISSED flag before second spin_trylock(), so
-                * we can return false here to avoid multi cpus doing
-                * the set_bit() and second spin_trylock() concurrently.
-                */
-               if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
+               if (test_and_set_bit(__QDISC_STATE_MISSED, &qdisc->state))
                         return false;

-               /* Set the MISSED flag before the second spin_trylock(),
-                * if the second spin_trylock() return false, it means
-                * other cpu holding the lock will do dequeuing for us
-                * or it will see the MISSED flag set after releasing
-                * lock and reschedule the net_tx_action() to do the
-                * dequeuing.
-                */
-               set_bit(__QDISC_STATE_MISSED, &qdisc->state);
-
-               /* spin_trylock() only has load-acquire semantic, so use
-                * smp_mb__after_atomic() to ensure STATE_MISSED is set
-                * before doing the second spin_trylock().
-                */
-               smp_mb__after_atomic();
-
                 /* Retry again in case other CPU may not see the new flag
                  * after it releases the lock at the end of qdisc_run_end().
                  */