Date:	Mon, 01 Aug 2011 11:49:08 -0700
From:	Rick Jones <rick.jones2@...com>
To:	Tom Herbert <therbert@...gle.com>
CC:	rdunlap@...otime.net, linux-doc@...r.kernel.org,
	davem@...emloft.net, netdev@...r.kernel.org, willemb@...gle.com
Subject: Re: [PATCH] net: add Documentation/networking/scaling.txt

On 07/31/2011 11:56 PM, Tom Herbert wrote:
> Describes RSS, RPS, RFS, accelerated RFS, and XPS.
>
> Signed-off-by: Tom Herbert<therbert@...gle.com>
> ---
>   Documentation/networking/scaling.txt |  346 ++++++++++++++++++++++++++++++++++
>   1 files changed, 346 insertions(+), 0 deletions(-)
>   create mode 100644 Documentation/networking/scaling.txt
>
> diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
> new file mode 100644
> index 0000000..aa51f0f
> --- /dev/null
> +++ b/Documentation/networking/scaling.txt
> @@ -0,0 +1,346 @@
> +Scaling in the Linux Networking Stack
> +
> +
> +Introduction
> +============
> +
> +This document describes a set of complementary techniques in the Linux
> +networking stack to increase parallelism and improve performance (in
> +throughput, latency, CPU utilization, etc.) for multi-processor systems.

Why not just leave out the parenthetical, lest some picky pedant find a 
specific example where one of those three is not improved?

> +
> +The following technologies are described:
> +
> +  RSS: Receive Side Scaling
> +  RPS: Receive Packet Steering
> +  RFS: Receive Flow Steering
> +  Accelerated Receive Flow Steering
> +  XPS: Transmit Packet Steering
> +
> +
> +RSS: Receive Side Scaling
> +=========================
> +
> +Contemporary NICs support multiple receive queues (multi-queue), which
> +can be used to distribute packets amongst CPUs for processing. The NIC
> +distributes packets by applying a filter to each packet to assign it to
> +one of a small number of logical flows.  Packets for each flow are
> +steered to a separate receive queue, which in turn can be processed by
> +separate CPUs.  This mechanism is generally known as “Receive-side
> +Scaling” (RSS).
> +
> +The filter used in RSS is typically a hash function over the network or
> +transport layer headers-- for example, a 4-tuple hash over IP addresses

Network *and* transport layer headers?  And/or?
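
It might also help to show how one can inspect the mapping from hash 
results to receive queues where the driver supports it; with a 
reasonably recent ethtool (device name purely illustrative):

   # display the RX flow hash indirection table
   ethtool --show-rxfh-indir eth0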


> +== RSS IRQ Configuration
> +
> +Each receive queue has a separate IRQ associated with it. The NIC
> +triggers this to notify a CPU when new packets arrive on the given
> +queue. The signaling path for PCIe devices uses message signaled
> +interrupts (MSI-X), which can route each interrupt to a particular CPU.
> +The active mapping of queues to IRQs can be determined from
> +/proc/interrupts. By default, all IRQs are routed to CPU0.  Because a

Really?

> +non-negligible part of packet processing takes place in receive
> +interrupt handling, it is advantageous to spread receive interrupts
> +between CPUs. To manually adjust the IRQ affinity of each interrupt see
> +Documentation/IRQ-affinity. On some systems, the irqbalance daemon is
> +running and will try to dynamically optimize this setting.

I would probably make it explicit that the irqbalance daemon will undo 
one's manual changes:

"Some systems will be running an irqbalance daemon which will be trying 
to dynamically optimize IRQ assignments and will undo manual adjustments."

I'm not sure whether one needs to go so far as to explicitly suggest 
disabling the irqbalance daemon in such cases.
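
An explicit example might also save the reader a trip to the other 
document; something along these lines, with the IRQ number and device 
name purely illustrative:

   # find the IRQ(s) raised by the NIC's receive queue(s)
   grep eth0 /proc/interrupts
   # steer, say, IRQ 30 to CPU 2 (bitmask 0x4)
   echo 4 > /proc/irq/30/smp_affinity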


> +RPS: Receive Packet Steering
> +============================
> +
> +Receive Packet Steering (RPS) is logically a software implementation of
> ...
> +
> +Each receive hardware qeueue has associated list of CPUs which can

"queue has an associated" (spelling and grammar nits)

> +process packets received on the queue for RPS.  For each received
> +packet, an index into the list is computed from the flow hash modulo the
> +size of the list.  The indexed CPU is the target for processing the
> +packet, and the packet is queued to the tail of that CPU’s backlog
> +queue. At the end of the bottom half routine, inter-processor interrupts
> +(IPIs) are sent to any CPUs for which packets have been queued to their
> +backlog queue. The IPI wakes backlog processing on the remote CPU, and
> +any queued packets are then processed up the networking stack. Note that
> +the list of CPUs can be configured separately for each hardware receive
> +queue.
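
A tiny worked example might make the indexing concrete. Say the list 
holds four CPUs and a flow hashes to 0x6b8b4567 (an arbitrary value); 
the packet then goes to list entry 0x6b8b4567 % 4 = 3:

   echo $(( 0x6b8b4567 % 4 ))   # prints 3
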
> +
> +== RPS Configuration
> +
> +RPS requires a kernel compiled with the CONFIG_RPS flag (on by default
> +for smp). Even when compiled in, it is disabled without any
> +configuration. The list of CPUs to which RPS may forward traffic can be
> +configured for each receive queue using the sysfs file entry:
> +
> + /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
> +
> +This file implements a bitmap of CPUs. RPS is disabled when it is zero
> +(the default), in which case packets are processed on the interrupting
> +CPU.  IRQ-affinity.txt explains how CPUs are assigned to the bitmap.

Earlier in the writeup (snipped) it is presented as 
"Documentation/IRQ-affinity" and here as IRQ-affinity.txt, should that 
be "Documentation/IRQ-affinity.txt" in both cases?

> +For a single queue device, a typical RPS configuration would be to set
> +the rps_cpus to the CPUs in the same cache domain of the interrupting
> +CPU for a queue. If NUMA locality is not an issue, this could also be
> +all CPUs in the system. At high interrupt rates, it might be wise to exclude
> +the interrupting CPU from the map since that already performs much work.
> +
> +For a multi-queue system, if RSS is configured so that a receive queue

"Multiple hardware queue system", to help keep the "queues" separate in 
the mind of the reader?

> +is mapped to each CPU, then RPS is probably redundant and unnecessary.
> +If there are fewer queues than CPUs, then RPS might be beneficial if the

same.

> +rps_cpus for each queue are the ones that share the same cache domain as
> +the interrupting CPU for the queue.
> +
> +RFS: Receive Flow Steering
> +==========================
> +
> +While RPS steers packets solely based on hash, and thus generally
> +provides good load distribution, it does not take into account
> +application locality. This is accomplished by Receive Flow Steering

Should it also mention how an application thread of execution might be 
processing requests on multiple connections, which themselves might not 
normally hash to the same place?


> +== RFS Configuration
> +
> +RFS is only available if the kernel flag CONFIG_RFS is enabled (on by
> +default for smp). The functionality is disabled without any
> +configuration.

Perhaps just wordsmithing, but "This functionality remains disabled 
until explicitly configured." seems clearer.
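
If the snipped portion does not already do so, it may also be worth 
showing the two RFS knobs side by side; something like the following, 
with the values and device name merely illustrative:

   # size of the global socket flow table
   echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
   # flow count for this receive queue
   echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt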

> +== Accelerated RFS Configuration
> +
> +Accelerated RFS is only available if the kernel is compiled with
> +CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
> +It also requires that ntuple filtering is enabled via ethtool.

Requires that ntuple filtering be enabled?
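
Spelling out the ethtool invocation might also help; assuming a device 
whose driver supports the ntuple feature:

   ethtool -K eth0 ntuple on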

> +XPS: Transmit Packet Steering
> +=============================
> +
> +Transmit Packet Steering is a mechanism for intelligently selecting
> +which transmit queue to use when transmitting a packet on a multi-queue
> +device.

Minor nit.  Up to this point a multi-queue device was only described as 
one with multiple receive queues.
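
When the description gets to configuration, an example parallel to the 
RPS one would be consistent; e.g. to let CPU 0 use the first transmit 
queue of an illustrative eth0:

   echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus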


> +Further Information
> +===================
> +RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
> +2.6.38. Original patches were submitted by Tom Herbert
> +(therbert@...gle.com)
> +
> +
> +Accelerated RFS was introduced in 2.6.35. Original patches were
> +submitted by Ben Hutchings (bhutchings@...arflare.com)
> +
> +Authors:
> +Tom Herbert (therbert@...gle.com)
> +Willem de Bruijn (willemb@...gle.com)
> +

While there are tidbits and indications in the descriptions of each 
mechanism, a section with explicit description of when one would use the 
different mechanisms would be goodness.

rick jones