Message-ID: <CABWYdi2GG3qi6ucxtyk3=Bu1eXi0N9Dow42F4gzi9DUUc3XhLw@mail.gmail.com>
Date: Tue, 10 Dec 2019 13:32:21 -0800
From: Ivan Babrou <ivan@...udflare.com>
To: linux-kernel <linux-kernel@...r.kernel.org>
Cc: "David S. Miller" <davem@...emloft.net>, hare@...e.com,
axboe@...nel.dk, allison@...utok.net, tglx@...utronix.de,
Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Lock contention around unix_gc_lock
Hello,
We're seeing very high contention on unix_gc_lock when a bug in an
application makes it stop reading incoming messages carrying inflight
unix sockets. Our system churns through a lot of unix sockets and has
96 logical CPUs, so the spinlock gets very hot.
I was able to halve overall system throughput with 1024 inflight unix
sockets, which is the default RLIMIT_NOFILE. This doesn't sound good
for isolation: one user should not be able to affect the system that
much. One might even consider this a DoS vector.
A lot of time is spent in _raw_spin_unlock_irqrestore, triggered by
wait_for_unix_gc, which in turn is unconditionally called from
unix_stream_sendmsg:
ffffffff9f64f3ea _raw_spin_unlock_irqrestore+0xa
ffffffff9eea6ab0 prepare_to_wait_event+0x70
ffffffff9f5a4ac6 wait_for_unix_gc+0x76
ffffffff9f5a182c unix_stream_sendmsg+0x3c
ffffffff9f4bb7f9 sock_sendmsg+0x39
* https://elixir.bootlin.com/linux/v4.19.80/source/net/unix/af_unix.c#L1849
Even more time is spent waiting on the spinlock because of the call to
unix_gc from unix_release_sock, where the condition is having any
inflight sockets whatsoever:
ffffffff9eeb1758 queued_spin_lock_slowpath+0x158
ffffffff9f5a4718 unix_gc+0x38
ffffffff9f5a28f3 unix_release_sock+0x2b3
ffffffff9f5a2929 unix_release+0x19
ffffffff9f4b902d __sock_release+0x3d
ffffffff9f4b90a1 sock_close+0x11
* https://elixir.bootlin.com/linux/v4.19.80/source/net/unix/af_unix.c#L586
Should this condition take the number of inflight sockets into
account, just like unix_stream_sendmsg does via wait_for_unix_gc?
The static number of inflight sockets that triggers a GC from
wait_for_unix_gc may also be something that scales with system size,
rather than a hardcoded value.
I know that our case is a pathological one, but it sounds like the
scalability of garbage collection could be better, especially on
systems with a large number of CPUs.