netdev - Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors

Message-ID: <CAF=yD-K2xH6uGxC3P7UQab5v26kP85z3axvY8L9NfigCLrP6Hw@mail.gmail.com>
Date:   Wed, 14 Mar 2018 12:32:01 -0400
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Vinicius Costa Gomes <vinicius.gomes@...el.com>
Cc:     Network Development <netdev@...r.kernel.org>,
        randy.e.witt@...el.com, David Miller <davem@...emloft.net>,
        Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors

On Tue, Mar 13, 2018 at 4:35 PM, Vinicius Costa Gomes
<vinicius.gomes@...el.com> wrote:
> Hi,
>
> Changes from the RFC:
>  - tweaked commit messages;
>
> Original cover letter:
>
> This is actually a "bug report"-RFC instead of the more usual "new
> feature"-RFC.
>
> We are developing an application that uses TX hardware timestamping to
> make some measurements, and during development Randy Witt initially
> reported that the application poll() never unblocked when TX hardware
> timestamping was enabled.
>
> After some investigation, it turned out the problem wasn't exclusive
> to hardware timestamping, and could also be reproduced with software
> timestamping.
>
> Applying patch (1), and running txtimestamp like this, for example:
>
> $ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F
>
> ('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
> packets for each test, '-D' to remove the delay between packets, '-l
> 1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
> wait forever)
>
> will cause the application to become stuck in the poll() call most of
> the time. (Note: I couldn't reproduce the issue running against an
> address that is routed through loopback.)
>
> Another interesting fact is that if the POLLIN event is added to the
> poll() .events, poll() no longer becomes stuck,

The process has registered interest only in POLLIN, which the call to
sk_data_ready (sock_def_readable) will trigger.

> and more interestingly
> the returned event in .revents is only POLLERR.

datagram_poll will set (E)POLLERR based on non-empty sk_error_queue.

> After a few debugging sessions, we got to 'sock_queue_err_skb()' and
> how it notifies applications of the error just enqueued. Changing it
> to use 'sk->sk_error_report()' fixes the issue for both hardware and
> software timestamping. That is patch (2).
>
> The "solution" proposed in patch (2) looks like too big a hammer,

It looks fine to me. POLLERR is returned regardless of the mask a
process sets up in pollfd.events. So waking with sk_error_report
will fix this while still waking callers waiting on POLLIN.
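
To illustrate the events-mask point from userspace, here is a minimal
sketch. It is not the txtimestamp test and does not touch the socket
error queue: as a stand-in that needs no network setup, it assumes Linux
and uses the write end of a pipe whose read end has been closed, which
poll() also reports as POLLERR. The helper name is hypothetical.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll the write end of a pipe after its read end
 * has been closed, using the caller-supplied events mask.
 * Returns revents, or 0 if poll() failed or reported nothing.
 */
short revents_with_reader_closed(short events)
{
	int fds[2];
	struct pollfd pfd;
	int n;

	if (pipe(fds) < 0)
		return 0;
	close(fds[0]);			/* write end is now in an error state */

	pfd.fd = fds[1];
	pfd.events = events;		/* may be 0: nothing requested */
	pfd.revents = 0;

	n = poll(&pfd, 1, 1000);	/* 1 s timeout so the sketch cannot hang */
	close(fds[1]);

	return n == 1 ? pfd.revents : 0;
}
```

Calling revents_with_reader_closed(0) and
revents_with_reader_closed(POLLIN) should both return a value with
POLLERR set, even though neither mask asks for it. That is the same
property that lets a wakeup through sk_error_report reach a process that
only asked for POLLIN, or for nothing at all.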

Note that on sock_dequeue_err_skb, if another notification (of the
right kind) is waiting, sk_error_report is already called instead of
sk_data_ready.

This should perhaps go to net, instead of net-next (though not the test).

If resending, a small nit in the test: please keep the alphabetical
order in getopt. The filepath also looks a bit fishy, but git am applied
the mbox from patchwork without issue.