[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <DM5PR11MB1756E19258EABA5643011031CFFB0@DM5PR11MB1756.namprd11.prod.outlook.com>
Date: Sat, 14 Mar 2020 11:56:47 +0000
From: "Jon Rosen (jrosen)" <jrosen@...co.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC: "davem@...emloft.net" <davem@...emloft.net>,
Willem de Bruijn <willemb@...gle.com>
Subject: FW: [PATCH net] net/packet: tpacket_rcv: avoid a producer race
condition
From: Willem de Bruijn <willemb@...gle.com>
PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large, slots (e.g., GSO packets).
Synchronize kernel thread ownership of rx ring slots with a bitmap.
Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.
Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.
This is the one of a variety of discussed options (see Link below):
* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.
* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.
* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But, legacy applications may interpret all non-zero
tp_status as owned by the user. As libpcap does. So this is possible
only opt-in by newer processes. It can be added as an optional mode.
* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.
The additional locking cost on release adds contention, no different
than scaling on multicore or multiqueue h/w. In practice, below
reproducer nor small packet tcpdump showed a noticeable change in
perf report in cycles spent in spinlock. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.
Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.
Based on an earlier patchset by Jon Rosen. See links below.
I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
Suggested-by: Jon Rosen <jrosen@...co.com>
Signed-off-by: Willem de Bruijn <willemb@...gle.com>
Signed-off-by: Jon Rosen <jrosen@...co.com>
Powered by blists - more mailing lists