netdev - [RACE] net/unix: SIOCGOUTQ returns off-by-one data

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [day] [month] [year] [list]

Message-ID: <CADyDSO6_ybM1-j=DqcHNt_SkQj5cPNukxT8LcbckW+OMpfWe+Q@mail.gmail.com>
Date:   Wed, 16 Dec 2020 14:10:14 +0100
From:   David Rheinsberg <david.rheinsberg@...il.com>
To:     netdev@...r.kernel.org
Subject: [RACE] net/unix: SIOCGOUTQ returns off-by-one data

Hi

We currently use SIOCOUTQ on AF_UNIX+SOCK_STREAM sockets to figure out
whether data that we wrote to a socket is still pending or was
dequeued. Preferably, we would like to know whether a specific message
we queued was dequeued by the other side, but sadly SIOCOUTQ does not
report the actual payload-size, but `skb->truesize`, making it
impossible for us to predict how much data the other side has
dequeued. Hence, we simply use the value to wait for the queue to
empty.

Practically, this means under special circumstances we refrain from
writing data to a socket until SIOCOUTQ returns 0. If it does not
return 0, we wait to be woken-up by a following EPOLLOUT (which works
fine in edge-triggered mode).

This worked fine for us, until we spuriously got off-by-one errors,
where SIOCOUTQ started returning 1, but did not send a wakeup /
EPOLLOUT afterwards despite the counter at some point dropping to 0.
The culprit seems to be sock_wfree():

[slightly simplified]
void sock_wfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;
    unsigned int len = skb->truesize;

    if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
        [...]
        refcount_sub_and_test(len - 1, &sk->sk_wmem_alloc);
        sk->sk_write_space(sk);
        len = 1;
    }
    [...]
    if (refcount_sub_and_test(len, &sk->sk_wmem_alloc))
        __sk_free(sk);
}

The kernel seems to use the `sk_wmem_alloc` counter for
reference-counting. Therefore, if we check SIOCOUTQ exactly between
`sk->sk_write_space()` and the following `refcount` update, we will
see a counter of 1, but never get woken up afterwards. Some of our
users can hit this reliably on newer arm64 machines.

My question now is whether this is intended behavior? I couldn't come
up with a simple fix, as I assume this overload on sk_wmem_alloc was
done for performance reasons. We could acquire an sk_refcnt before
sk_write_space and drop it afterwards, but it would probably need
further adjustments in sk_free() and friends, and would negatively
affect performance.

As a workaround we assume a return value lower than 128 means
effectively 0, because even a single queued skb would cause the
counter to be way bigger than 128, due to the size of `struct
sk_buff`. This gives us room to deal with up to 128 temporary
references the kernel holds. However, this does feel slightly awkward,
so I wondered whether someone has a better idea to fix this? Or
whether SIOCOUTQ is just not meant to be used that way?

NOTE: Our SIOCOUTQ code is used to properly account for
inflight-file-descriptors. The kernel accounts them on the sender's
quota, so we use this to reliably track how long an inflight FD is
queued on a socket, and thus accounted on us. This way we prevent
malicious users from causing us to queue too many inflight
file-descriptors and thus exceeding our quota.

Thanks
David