Message-Id: <20200624171749.11927-1-tom@herbertland.com>
Date: Wed, 24 Jun 2020 10:17:39 -0700
From: Tom Herbert <tom@...bertland.com>
To: netdev@...r.kernel.org
Cc: Tom Herbert <tom@...bertland.com>
Subject: [RFC PATCH 00/11] ptq: Per Thread Queues
Per Thread Queues allows application threads to be assigned dedicated
hardware network queues for both transmit and receive. This facility
provides a high degree of traffic isolation between applications and
can also help facilitate high performance due to fine grained packet
steering. An overview and design considerations of Per Thread Queues
have been added to Documentation/networking/scaling.rst.
This patch set provides a basic implementation of Per Thread Queues.
The patch set includes:
- Minor infrastructure changes to cgroups (just export a
couple of functions)
- netqueue.h to hold generic definitions for network queues
- Minor infrastructure in aRFS and net-sysfs to accommodate
PTQ
- Introduce the concept of "global queues". These are used
in cgroup configuration of PTQ. Global queues can be
mapped to real device queues. A per device queue sysfs
parameter is added to configure the mapping of device
queue to a global queue
- Creation of a new cgroup controller, "net_queues", that
is used to configure Per Thread Queues
- Hook up the transmit path. This has two parts: 1) In
send socket operations record the transmit queue
associated with a task in the sock structure, 2) In
netdev_pick_tx, check if the sock structure of the skb
has a valid transmit global queue set. If so, convert the
queue identifier to a device queue identifier based on the per
device mapping table. This selection precedes XPS
- Hook up the receive path. This has two parts: 1) In
rps_record_sock_flow check if a receive global queue is
assigned to the running task, if so then set it in the
sock_flow_table entry for the flow. Note this is in lieu of
setting the running CPU in the entry. 2) Change get_rps_cpu to
query the sock_flow_table to see if a queue index has been
stored (as opposed to a CPU number). If a queue index is
present, use it for steering, including as the
target of ndo_rx_flow_steer.
Related features and concepts:
- netprio and prio_tc_map: Similar to those, PTQ allows control,
via cgroups and per device maps, over mapping applications'
packets to transmit queues. However, PTQ is intended to
perform fine grained per application mapping to queues such
that each application thread, possibly thousands of them, can
have its own dedicated transmit queue.
- aRFS: On the receive side PTQ extends aRFS to steer packets
for a flow based on the assigned global queue as opposed to only
the running CPU of the processing thread. In PTQ, the queue
"follows" the thread so that when threads are scheduled to
run on a different CPU, the packets for flows of the thread
continue to be received on the right queue. This addresses
a problem in aRFS where, when a thread is rescheduled, all
of its aRFS steered flows may be moved to a different queue
(i.e. ndo_rx_flow_steer needs to be called for each flow).
- Busy polling: PTQ provides siloing of an application's packets
into queues, and busy polling of those queues can then be
applied for high performance. The first instantiation of
PTQ will likely combine it with busy polling
(moving interrupts for those queues as threads are scheduled
is most likely prohibitive). Busy polling is only practical
with a few queues, like maybe at most one per CPU, and
won't scale to thousands of per thread queues in use.
(to address that sleeping-busy-poll with completion
queues is suggested below).
- Making Networking Queues a First Class Citizen in the Kernel
https://linuxplumbersconf.org/event/4/contributions/462/
attachments/241/422/LPC_2019_kernel_queue_manager.pdf:
The concept of "global queues" should be a good complement
to this proposal. Global queues provide an abstract
representation of device queues. The abstraction is resolved
when the global queue is mapped to a real hardware queue. This
layering allows exposing queues to the user and configuration
which might be associated with general attributes (like high
priority, QoS characteristics, etc.). The mapping to a
specific device queue gives the low level queue that satisfies
the implied service of the global queue. Any attributes and
associations are configured, not hardcoded, so that the use
of queues in this manner is fully extensible and can be
driven by arbitrary user-defined policy. Since global queues
are device agnostic, they can be managed not just as a local
system resource, but also across the distributed tasks
for a job in the datacenter, such as a property of a
container in Kubernetes (similar to how we might manage
network priority as a global DC resource, but global queues
provide much more granularity and richness in what they can
convey).
There are a number of possible extensions to this work
- Queue selection could be done on a per process basis
or a per socket basis as well as a per thread basis. (A per
packet basis probably makes little sense due to out-of-order
delivery.)
- The mechanism for selecting a queue to assign to a thread
could be programmed. For instance, an eBPF hook could be
added that would allow very fine grained policies to do
queue selection.
- "Global queue groups" could be created where a global queue
identifier maps to some group of device queues and there is
a selection algorithm, possibly another eBPF hook, that
maps to a specific device queue for use.
- Another attribute in the cgroup could be added to enable
or disable aRFS on a per thread basis.
- Extend the net_queues cgroup to allow control over
busy-polling on a per cgroup basis. This could further
be enhanced by eBPF hooks to control busy-polling for
individual sockets of the cgroup per some arbitrary policy
(similar to the eBPF hook for SO_REUSEPORT).
- Elasticity in listener sockets. As described in the
Documentation, we expect that a filter can be installed to
direct an application's packets to the set of queues for the
application. The problem is that the application may
create threads on demand so that we don't know a priori
how many queues the application needs. Optimally, we
want a mechanism to dynamically enable/disable a
queue in the filter set so that at any given time the
application is receiving packets only on queues it is
actively using. This may entail a new ndo function.
- The sleeping-busy-poll with completion queue model
described in the documentation could be integrated. This
would mostly entail creating a reverse mapping from queue
to threads, and then allowing the thread processing a
device completion queue to schedule the threads of interest.
Tom Herbert (11):
cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next
net: Create netqueue.h and define NO_QUEUE
arfs: Create set_arfs_queue
net-sysfs: Create rps_create_sock_flow_table
net: Infrastructure for per queue aRFS
net: Function to check against maximum number for RPS queues
net: Introduce global queues
ptq: Per Thread Queues
ptq: Hook up transmit side of Per Thread Queues
ptq: Hook up receive side of Per Thread Queues
doc: Documentation for Per Thread Queues
Documentation/networking/scaling.rst | 195 +++++++-
include/linux/cgroup.h | 3 +
include/linux/cgroup_subsys.h | 4 +
include/linux/netdevice.h | 204 +++++++-
include/linux/netqueue.h | 25 +
include/linux/sched.h | 4 +
include/net/ptq.h | 45 ++
include/net/sock.h | 75 ++-
kernel/cgroup/cgroup.c | 9 +-
kernel/fork.c | 4 +
net/Kconfig | 18 +
net/core/Makefile | 1 +
net/core/dev.c | 177 +++++--
net/core/filter.c | 4 +-
net/core/net-sysfs.c | 201 +++++++-
net/core/ptq.c | 688 +++++++++++++++++++++++++++
net/core/sysctl_net_core.c | 152 ++++--
net/ipv4/af_inet.c | 6 +
18 files changed, 1693 insertions(+), 122 deletions(-)
create mode 100644 include/linux/netqueue.h
create mode 100644 include/net/ptq.h
create mode 100644 net/core/ptq.c
--
2.25.1