[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8b7cd260-1eb6-c8a4-e59a-337482e39a3a@intel.com>
Date: Thu, 25 Jun 2020 16:00:40 -0700
From: Jacob Keller <jacob.e.keller@...el.com>
To: Tom Herbert <tom@...bertland.com>, netdev@...r.kernel.org
Subject: Re: [RFC PATCH 11/11] doc: Documentation for Per Thread Queues
On 6/24/2020 10:17 AM, Tom Herbert wrote:
> Add a section on Per Thread Queues to scaling.rst.
> ---
> Documentation/networking/scaling.rst | 195 ++++++++++++++++++++++++++-
> 1 file changed, 194 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
> index 8f0347b9fb3d..42f1dc639ab7 100644
> --- a/Documentation/networking/scaling.rst
> +++ b/Documentation/networking/scaling.rst
> @@ -250,7 +250,7 @@ RFS: Receive Flow Steering
> While RPS steers packets solely based on hash, and thus generally
> provides good load distribution, it does not take into account
> application locality. This is accomplished by Receive Flow Steering
> -(RFS). The goal of RFS is to increase datacache hitrate by steering
> +(RFS). The goal of RFS is to increase datacache hit rate by steering
> kernel processing of packets to the CPU where the application thread
> consuming the packet is running. RFS relies on the same RPS mechanisms
> to enqueue packets onto the backlog of another CPU and to wake up that
> @@ -508,6 +508,199 @@ a max-rate attribute is supported, by setting a Mbps value to::
> A value of zero means disabled, and this is the default.
>
>
It might be helpful to expand this with a few examples that show the
user experience for setting this up via the sysfs entries or similar.
That would also aid in giving an example of what a reviewer might want
to do to try this out!
Thanks,
Jake
> +PTQ: Per Thread Queues
> +======================
> +
> +Per Thread Queues allows application threads to be assigned dedicated
> +hardware network queues for both transmit and receive. This facility
> +provides a high degree of traffic isolation between applications and
> +can also help facilitate high performance due to fine grained packet
> +steering.
> +
> +PTQ has three major design components:
> + - A method to assign transmit and receive queues to threads
> + - A means to associate packets with threads and then to steer
> + those packets to the queues assigned to the threads
> + - Mechanisms to process the per thread hardware queues
> +
> +Global network queues
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Global network queues are an abstraction of hardware networking
> +queues that can be used in generic non-device specific configuration.
> +Global queues may mapped to real device queues. The mapping is
> +performed on a per device queue basis. A device sysfs parameter
> +"global_queue_mapping" in queues/{tx,rx}-<num> indicates the mapping
> +of a device queue to a global queue. Each device maintains a table
> +that maps global queues to device queues for the device. Note that
> +for a single device, the global to device queue mapping is 1 to 1,
> +however each device may map a global queue to a different device
> +queue.
> +
> +net_queues cgroup controller
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For assigning queues to the threads, a cgroup controller named
> +"net_queues" is used. A cgroup can be configured with pools of transmit
> +and receive global queues from which individual threads are assigned
> +queues. The contents of the net_queues controller are described below in
> +the configuration section.
> +
> +Handling PTQ in the transmit path
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When a socket operation is performed that may result in sending packets
> +(i.e. listen, accept, sendmsg, sendpage), the task structure for the
> +current thread is consulted to see if there is an assigned transmit
> +queue for the thread. If there is a queue assignment, the queue index is
> +set in a field of the sock structure for the corresponding socket.
> +Subsequently, when transmit queue selection is performed, the sock
> +structure associated with packet being sent is consulted. If a transmit
> +global queue is set in the sock then that index is mapped to a device
> +queue for the output networking device. If a valid device queue is
> +discovered then that queue is used, else if a device queue is not found
> +then queue selection proceeds to XPS.
> +
> +Handling PTQ in the receive path
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The receive path uses the infrastructure of RFS which is extended
> +to steer based on the assigned received global queue for a thread in
> +addition to steering based on the CPU. The rps_sock_flow_table is
> +modified to contain either the desired CPU for flows or the desired
> +receive global queue. A queue is updated at the same time that the
> +desired CPU would updated during calls to recvmsg and sendmsg (see RFS
> +description above). The process is to consult the running task structure
> +to see if a receive queue is assigned to the task. If a queue is assigned
> +to the task then the corresponding queue index is set in the
> +rps_sock_flow_table; if no queue is assigned then the current CPU is
> +set as the desired per canonical RFS.
> +
> +When packets are received, the rps_sock_flow table is consulted to check
> +if they were received on the proper queue. If the rps_sock_flow_table
> +entry for a corresponding flow of a received packet contains a global
> +queue index, then the index is mapped to a device queue on the received
> +device. If the mapped device queue is equal to the receive queue then
> +packets are being steered properly. If there is a mismatch then the
> +local flow to queue mapping in the device is changed and
> +ndo_rx_flow_steer is invoked to set the receive queue for the flow in
> +the device as described in the aRFS section.
> +
> +Processing queues in Per Queue Threads
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When Per Queue Threads is used, the queue "follows" the thread. So when
> +a thread is rescheduled from one queue to another we expect that the
> +processing of the device queues that map to the thread are processed on
> +the CPU where the thread is currently running. This is a bit tricky
> +especially with respect to the canonical device interrupt driven model.
> +There are at least three possible approaches:
> + - Arrange for interrupts to follow threads as they are
> + rescheduled, or alternatively pin threads to CPUs and
> + statically configure the interrupt mappings for the queues for
> + each thread
> + - Use busy polling
> + - Use "sleeping busy-poll" with completion queues. The basic
> + idea is to have one CPU busy poll a device completion queue
> + that reports device queues with received or completed transmit
> + packets. When a queue is ready, the thread associated with the
> + queue (derived by reverse mapping the queue back to its
> + assigned thread) is scheduled. When the thread runs it polls
> + its queues to process any packets.
> +
> +Future work may further elaborate on solutions in this area.
> +
> +Reducing flow state in devices
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +PTQ (and aRFS as well) potentially create per flow state in a device.
> +This is costly in at least two ways: 1) State requires device memory
> +which is almost always much less than host memory can and thus the
> +number of flows that can be instantiated in a device are less than that
> +in the host. 2) State requires instantiation and synchronization
> +messages, i.e. ndo_rx_flow_steer causes a message over PCIe bus; if
> +there is a highly turnover rate of connections this messaging becomes
> +a bottleneck.
> +
> +Mitigations to reduce the amount of flow state in the device should be
> +considered.
> +
> +In PTQ (and aRFS) the device flow state is a considered cache. A flow
> +entry is only set in the device on a cache miss which occurs when the
> +receive queue for a packet doesn't match the desired receive queue. So
> +conceptually, if a packets for a flow are always received on the desired
> +queue from the beginning of the flow then a flow state might never need
> +to be instantiated in the device. This motivates a strategy to try to
> +use stateless steering mechanisms before resorting to stateful ones.
> +
> +As an example of applying this strategy, consider an application that
> +creates four threads where each threads creates a TCP listener socket
> +for some port that is shared amongst the threads via SO_REUSEPORT.
> +Four global queues can be assigned to the application (via a cgroup
> +for the application), and a filter rule can be set up in each device
> +that matches the listener port and any bound destination address. The
> +filter maps to a set of four device queues that map to the four global
> +queues for the application. When a packet is received that matches the
> +filter, one of the four queues is chosen via a hash over the packet's
> +four tuple. So in this manner, packets for the application are
> +distributed amongst the four threads. As long as processing for sockets
> +doesn't move between threads and the number of listener threads is
> +constant then packets are always received on the desired queue and no
> +flow state needs to be instantiated. In practice, we want to allow
> +elasticity in applications to create and destroy threads on demand, so
> +additional techniques, such as consistent hashing, are probably needed.
> +
> +Per Thread Queues Configuration
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Per Thread Queues is only available if the kernel is compiled with
> +CONFIG_PER_THREAD_QUEUES. For PTQ in the receive path, aRFS needs to be
> +supported and configured (see aRFS section above).
> +
> +The net_queues cgroup controller is in:
> + /sys/fs/cgroup/<cgrp>/net_queues
> +
> +The net_queues controller contains the following attributes:
> + - tx-queues, rx-queues
> + Specifies the transmit queue pool and receive queue pool
> + respectively as a range of global queue indices. The
> + format of these entries is "<base>:<extent>" where
> + <base> is the first queue index in the pool, and
> + <extent> is the number of queues in the range of pool.
> + If <extent> is zero the queue pool is empty.
> + - tx-assign,rx-assign
> + Boolean attributes ("0" or "1") that indicate unique
> + queue assignment from the respective transmit or receive
> + queue pool. When the "assign" attribute is enabled, a
> + thread is assigned a queue that is not already assigned
> + to another thread.
> + - symmetric
> + A boolean attribute ("0" or "1") that indicates the
> + receive and transmit queue assignment for a thread
> + should be the same. That is the assigned transmit queue
> + index is equal to the assigned receive queue index.
> + - task-queues
> + A read-only attribute that lists the threads of the
> + cgroup and their assigned queues.
> +
> +The mapping of global queues to device queues is in:
> +
> + /sys/class/net/<dev>/queues/tx-<n>/global_queue_mapping
> + -and -
> + /sys/class/net/<dev>/queues/rx-<n>/global_queue_mapping
> +
> +A value of "none" indicates no mapping, an integer value (up to
> +a maximum of 32,766) indicates a global queue.
> +
> +Suggested Configuration
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +Unlike aRFS, PTQ requires per application application configuration. To
> +most effectively use PTQ some understanding of the threading model of
> +the application is warranted. The section above describes one possible
> +configuration strategy for a canonical application using SO_REUSEPORT.
> +
> +
> Further Information
> ===================
> RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
>
Powered by blists - more mailing lists