Message-ID: <CAHmME9oHFzL6CYVh8nLGkNKOkMeWi2gmxs_f7S8PATWwc6uQsw@mail.gmail.com>
Date:   Fri, 18 Mar 2022 12:19:45 -0600
From:   "Jason A. Donenfeld" <Jason@...c4.com>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc:     Netdev <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Toke Høiland-Jørgensen <toke@...hat.com>
Subject: Re: [PATCH net-next] net: Add lockdep asserts to ____napi_schedule().

Hi Sebastian,

On Fri, Mar 18, 2022 at 4:57 AM Sebastian Andrzej Siewior
<bigeasy@...utronix.de> wrote:
> > Hi Sebastian,
> Hi Jason,
>
> > I stumbled upon this commit when noticing a new failure in WireGuard's
> > test suite:
> …
> > [    1.339289] WARNING: CPU: 0 PID: 11 at ../../../../../../../../net/core/dev.c:4268 __napi_schedule+0xa1/0x300
> …
> > [    1.352417]  wg_packet_decrypt_worker+0x2ac/0x470
> …
> > Sounds like wg_packet_decrypt_worker() might be doing something wrong? I
> > vaguely recall a thread where you started looking into some things there
> > that seemed non-optimal, but I didn't realize there were correctness
> > issues. If your patch is correct, and wg_packet_decrypt_worker() is
> > wrong, do you have a concrete idea of how we should approach fixing
> > wireguard? Or do you want to send a patch for that?
>
> In your case it is "okay", since ptr_ring_consume_bh() will do a BH
> disable/enable, which forces the softirq to run. It is not obvious.

In that case, isn't the lockdep assertion you added wrong, and shouldn't
it be reverted? If correct code is hitting it, something seems wrong...
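
For reference, on the "forces the softirq to run" part: ptr_ring_consume_bh()
takes the consumer lock with the _bh spinlock variants, so every unlock path
goes through local_bh_enable() and runs whatever softirqs were raised in the
loop body. Roughly -- paraphrasing include/linux/ptr_ring.h, not quoting it:

static inline void *ptr_ring_consume_bh(struct ptr_ring *r)
{
	void *ptr;

	/* Disables BHs on this CPU while consuming from the ring. */
	spin_lock_bh(&r->consumer_lock);
	ptr = __ptr_ring_consume(r);
	/* Re-enables BHs; any pending softirq (e.g. NET_RX raised when the
	 * decrypt worker schedules the peer's NAPI) gets to run here.
	 */
	spin_unlock_bh(&r->consumer_lock);

	return ptr;
}

So the existing loop does give the softirq a chance to run on every
iteration, which is why it is "okay", just not obviously so.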

> What
> about the following:
>
> diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
> index 7b8df406c7737..26ffa3afa542e 100644
> --- a/drivers/net/wireguard/receive.c
> +++ b/drivers/net/wireguard/receive.c
> @@ -502,15 +502,21 @@ void wg_packet_decrypt_worker(struct work_struct *work)
>         struct crypt_queue *queue = container_of(work, struct multicore_worker,
>                                                  work)->ptr;
>         struct sk_buff *skb;
> +       unsigned int packets = 0;
>
> -       while ((skb = ptr_ring_consume_bh(&queue->ring)) != NULL) {
> +       local_bh_disable();
> +       while ((skb = ptr_ring_consume(&queue->ring)) != NULL) {
>                 enum packet_state state =
>                         likely(decrypt_packet(skb, PACKET_CB(skb)->keypair)) ?
>                                 PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
>                 wg_queue_enqueue_per_peer_rx(skb, state);
> -               if (need_resched())
> +               if (!(++packets % 4)) {
> +                       local_bh_enable();
>                         cond_resched();
> +                       local_bh_disable();
> +               }
>         }
> +       local_bh_enable();
>  }
>
>  static void wg_packet_consume_data(struct wg_device *wg, struct sk_buff *skb)
>
> It would decrypt 4 packets in a row; then, after local_bh_enable(), it
> would invoke wg_packet_rx_poll() (I assume, since it is the only NAPI
> handler in WireGuard), then attempt cond_resched(), and then continue
> with the next batch.
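
If I read the diff right, the per-batch sequence at the bottom of the loop
is the following (annotating the three lines from the patch above; that
wg_packet_rx_poll() is what runs is your assumption about the NAPI handler,
not something the diff itself spells out):

	local_bh_enable();   /* runs pending NET_RX softirq -> NAPI poll, i.e. wg_packet_rx_poll() */
	cond_resched();      /* BHs are on again, so yielding the CPU here is legal */
	local_bh_disable();  /* BHs back off for the next batch of 4 decrypts */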

I'm willing to consider batching and all sorts of heuristics in there,
though probably for 5.19 rather than 5.18. Indeed there's some
interesting optimization work to be done. But if you want to propose a
change like this, can you send some benchmarks with it, preferably
taken with something like flent so we can see if it negatively affects
latency?

Regards,
Jason
