Message-ID: <20250926171921.7106b19b@kernel.org>
Date: Fri, 26 Sep 2025 17:19:21 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Oleksij Rempel <o.rempel@...gutronix.de>
Cc: Andrew Lunn <andrew@...n.ch>, Heiner Kallweit <hkallweit1@...il.com>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>, Rob Herring
<robh@...nel.org>, Krzysztof Kozlowski <krzk+dt@...nel.org>, Florian
Fainelli <f.fainelli@...il.com>, Maxime Chevallier
<maxime.chevallier@...tlin.com>, Kory Maincent <kory.maincent@...tlin.com>,
Lukasz Majewski <lukma@...x.de>, Jonathan Corbet <corbet@....net>, Donald
Hunter <donald.hunter@...il.com>, Vadim Fedorenko
<vadim.fedorenko@...ux.dev>, Jiri Pirko <jiri@...nulli.us>, Vladimir Oltean
<vladimir.oltean@....com>, Alexei Starovoitov <ast@...nel.org>, Daniel
Borkmann <daniel@...earbox.net>, Jesper Dangaard Brouer <hawk@...nel.org>,
John Fastabend <john.fastabend@...il.com>, kernel@...gutronix.de,
linux-kernel@...r.kernel.org, netdev@...r.kernel.org, Russell King
<linux@...linux.org.uk>, Divya.Koppera@...rochip.com, Sabrina Dubroca
<sd@...asysnail.net>, Stanislav Fomichev <sdf@...ichev.me>
Subject: Re: [PATCH net-next v7 1/1] Documentation: net: add flow control
guide and document ethtool API
On Wed, 24 Sep 2025 14:02:41 +0200 Oleksij Rempel wrote:
> name: pause-stat
> + doc: Statistics counters for link-wide PAUSE frames (IEEE 802.3 Annex 31B).
> attr-cnt-name: __ethtool-a-pause-stat-cnt
> + enum-name: ethtool-a-pause-stat
Naming attribute enums is relatively rare and kinda unnecessary TBH,
because the values are almost never held as state or passed around.
99.9% of the time we use the literals.
enums for actual enum attributes (the value is the enum) - sure,
enums for attr types - 🤷️
> name: stats
> + doc: |
> + Contains the pause statistics counters. The source of these
> + statistics is determined by stats-src.
I'd skip mentioning the source here TBH. Or do we need to briefly
describe what the MM (MAC Merge) is? I don't have recent embedded
experience but I thought MM is relatively rare, so mentioning it for a
very common attribute could confuse readers.
> type: nest
> nested-attributes: pause-stat
> -
> name: stats-src
> + doc: |
> + Selects the source of the MAC statistics, values from
> + enum ethtool_mac_stats_src. This allows requesting statistics
> + from the individual components of the MAC Merge layer.
> type: u32
> -
> name: eee
> diff --git a/Documentation/networking/flow_control.rst b/Documentation/networking/flow_control.rst
> new file mode 100644
> index 000000000000..48646d54513f
> --- /dev/null
> +++ b/Documentation/networking/flow_control.rst
> @@ -0,0 +1,373 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _ethernet-flow-control:
> +
> +=====================
> +Ethernet Flow Control
> +=====================
> +
> +This document is a practical guide to Ethernet Flow Control in Linux, covering
> +what it is, how it works, and how to configure it.
> +
> +What is Flow Control?
> +=====================
> +
> +Flow control is a mechanism to prevent a fast sender from overwhelming a
> +slow receiver with data, which would cause buffer overruns and dropped packets.
> +The receiver can signal the sender to temporarily stop transmitting, giving it
> +time to process its backlog.
> +
> +Standards references
> +====================
> +
> +Ethernet flow control mechanisms are specified across consolidated IEEE base
nit: Flow Control ? we should be consistent
> +standards; some originated as amendments:
> +
> +- Collision-based flow control is part of CSMA/CD in **IEEE 802.3**
> + (half-duplex).
> +- Link-wide PAUSE is defined in **IEEE 802.3 Annex 31B**
> + (originally **802.3x**).
> +- Priority-based Flow Control (PFC) is defined in **IEEE 802.1Q Clause 36**
> + (originally **802.1Qbb**).
> +
> +In the remainder of this document, the consolidated clause numbers are used.
> +
> +How It Works: The Mechanisms
> +============================
> +
> +The method used for flow control depends on the link's duplex mode.
> +
> +.. note::
> + The user-visible ``ethtool`` pause API described in this document controls
> + **link-wide PAUSE** (IEEE 802.3 Annex 31B) only. It does not control the
> + collision-based behavior that exists on half-duplex links.
... or PFC ?
> +1. Half-Duplex: Collision-Based Flow Control
> +--------------------------------------------
> +On half-duplex links, a device cannot send and receive simultaneously, so PAUSE
> +frames are not used. Flow control is achieved by leveraging the CSMA/CD
> +(Carrier Sense Multiple Access with Collision Detection) protocol itself.
> +
> +* **How it works**: To inhibit incoming data, a receiving device can force a
> + collision on the line. When the sending station detects this collision, it
> + terminates its transmission, sends a "jam" signal, and then executes the
> + "Collision backoff and retransmission" procedure as defined in IEEE 802.3,
> + Section 4.2.3.2.5. This algorithm makes the sender wait for a random
> + period before attempting to retransmit. By repeatedly forcing collisions,
> + the receiver can effectively throttle the sender's transmission rate.
> +
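For anyone who doesn't have the referenced procedure in their head: the
"Collision backoff and retransmission" step is a truncated binary
exponential backoff. Below is a minimal standalone C sketch of it. The
function and macro names are made up for illustration, rand() stands in
for whatever random source a real MAC uses, and the two limits are the
standard's backoffLimit/attemptLimit values, not something taken from
this patch.

```c
#include <stdint.h>
#include <stdlib.h>

#define BACKOFF_LIMIT 10 /* exponent stops growing after 10 collisions */
#define ATTEMPT_LIMIT 16 /* frame is abandoned after 16 attempts */

/*
 * Number of possible delays (in slot times) before retransmission
 * number `attempt`: 2^min(attempt, BACKOFF_LIMIT).
 * Returns 0 when no backoff applies (first transmission, or the
 * attempt limit has been reached and the frame is given up).
 */
static uint32_t backoff_range(unsigned int attempt)
{
	unsigned int k;

	if (attempt == 0 || attempt >= ATTEMPT_LIMIT)
		return 0;
	k = attempt < BACKOFF_LIMIT ? attempt : BACKOFF_LIMIT;
	return 1u << k;
}

/* Pick a delay r, in slot times, with 0 <= r < backoff_range(attempt). */
static uint32_t backoff_slots(unsigned int attempt)
{
	uint32_t range = backoff_range(attempt);

	/* rand() is for illustration only, not a recommendation. */
	return range ? (uint32_t)rand() % range : 0;
}
```

Each forced collision widens the window the sender draws from, which is
why repeated collisions throttle it.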
> +.. note::
> + While this mechanism is part of the IEEE standard, there is currently no
> + generic kernel API to configure or control it. Drivers should not enable
> + this feature until a standardized interface is available.
> +
> +.. warning::
> + On shared-medium networks (e.g. 10BASE2, or twisted-pair networks using a
> + hub rather than a switch) forcing collisions inhibits traffic **across the
> + entire shared segment**, not just a single point-to-point link. Enabling
> + such behavior is generally undesirable.
> +
> +2. Full-Duplex: Link-wide PAUSE (IEEE 802.3 Annex 31B)
> +------------------------------------------------------
> +On full-duplex links, devices can send and receive at the same time. Flow
> +control is achieved by sending a special **PAUSE frame**, defined by IEEE
> +802.3 Annex 31B. This mechanism pauses all traffic on the link and is therefore
> +called *link-wide PAUSE*.
> +
> +* **What it is**: A standard Ethernet frame with a globally reserved
> + destination MAC address (``01-80-C2-00-00-01``). This address is in a range
> + that standard IEEE 802.1D-compliant bridges do not forward. However, some
> + unmanaged or misconfigured bridges have been reported to forward these
> + frames, which can disrupt flow control across a network.
> +
> +* **How it works**: The frame contains a MAC Control opcode for PAUSE
> + (``0x0001``) and a ``pause_time`` value, telling the sender how long to
> + wait before sending more data frames. This time is specified in units of
> + "pause quantum", where one quantum is the time it takes to transmit 512 bits.
> + For example, one pause quantum is 51.2 microseconds on a 10 Mbit/s link,
> + and 512 nanoseconds on a 1 Gbit/s link. A ``pause_time`` of zero indicates
> + that the transmitter can resume transmission, even if a previous non-zero
> + pause time has not yet elapsed.
> +
> +* **Who uses it**: Any full-duplex link, from 10 Mbit/s to multi-gigabit speeds.
> +
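To make the frame layout and the quantum arithmetic above concrete, here
is a small standalone C sketch. It is not kernel code and all names are
hypothetical. The MAC Control EtherType (0x8808), the 16-bit width of
pause_time and the 60-byte padded length (FCS excluded) are not stated
in the quoted text; they are filled in from the standard.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN            6
#define MAC_CONTROL_ETYPE   0x8808 /* MAC Control EtherType */
#define PAUSE_OPCODE        0x0001
#define PAUSE_QUANTUM_BITS  512
#define PAUSE_FRAME_LEN     60     /* minimum frame size, FCS excluded */

/* Reserved destination address used by PAUSE frames. */
static const uint8_t pause_dst[ETH_ALEN] = {
	0x01, 0x80, 0xc2, 0x00, 0x00, 0x01
};

/* Fill buf (at least PAUSE_FRAME_LEN bytes) with a PAUSE frame. */
static size_t build_pause_frame(uint8_t *buf, const uint8_t src[ETH_ALEN],
				uint16_t pause_time)
{
	memset(buf, 0, PAUSE_FRAME_LEN);        /* zero padding */
	memcpy(buf, pause_dst, ETH_ALEN);
	memcpy(buf + ETH_ALEN, src, ETH_ALEN);
	buf[12] = MAC_CONTROL_ETYPE >> 8;       /* big-endian on the wire */
	buf[13] = MAC_CONTROL_ETYPE & 0xff;
	buf[14] = PAUSE_OPCODE >> 8;
	buf[15] = PAUSE_OPCODE & 0xff;
	buf[16] = pause_time >> 8;
	buf[17] = pause_time & 0xff;
	return PAUSE_FRAME_LEN;
}

/* Duration of pause_time quanta, in nanoseconds, at speed_mbps. */
static uint64_t pause_time_ns(uint16_t pause_time, uint32_t speed_mbps)
{
	if (speed_mbps == 0)
		return 0;
	/* bits / (Mbit/s) gives microseconds; scale by 1000 for ns. */
	return (uint64_t)pause_time * PAUSE_QUANTUM_BITS * 1000 / speed_mbps;
}
```

With these helpers, pause_time_ns(1, 10) gives 51200 ns and
pause_time_ns(1, 1000) gives 512 ns, matching the figures in the text.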
> +3. Full-Duplex: Priority-based Flow Control (PFC) (IEEE 802.1Q Clause 36)
> +-------------------------------------------------------------------------
> +Priority-based Flow Control is an enhancement to the standard PAUSE mechanism
> +that allows flow control to be applied independently to different classes of
> +traffic, identified by their priority level.
should we add .. specified in the 802.1Q VLAN tag ?
> +
> +* **What it is**: PFC allows a receiver to pause traffic for one or more of the
> + 8 standard priority levels without stopping traffic for other priorities.
> + This is critical in data center environments for protocols that cannot
> + tolerate packet loss due to congestion (e.g., Fibre Channel over Ethernet
> + or RoCE).
nit: either
  FCoE and RoCE
or
  Fibre Channel .. and RDMA over Converged ..
?
> +* **How it works**: PFC uses a specific PAUSE frame format. It shares the same
> + globally reserved destination MAC address (``01-80-C2-00-00-01``) as legacy
> + PAUSE frames but uses a unique opcode (``0x0101``). The frame payload
> + contains two key fields:
> +Kernel Policy: "Set and Trust"
> +==============================
> +
> +The ethtool pause API is defined as a **wish policy** for
> +IEEE 802.3 link-wide PAUSE only. A user request is always accepted
> +as the preferred configuration, but it may not be possible to apply
> +it in all link states.
> +
> +Key constraints:
> +
> +- Link-wide PAUSE is not valid on half-duplex links.
> +- Link-wide PAUSE cannot be used together with Priority-based Flow Control
> + (PFC, IEEE 802.1Q Clause 36).
> +- If autonegotiation is active and the link is currently down, the future
> + mode is not yet known.
> +
> +Because of these constraints, the kernel stores the requested setting
> +and applies it only when the link is in a compatible state.
> +
> +Implications for userspace:
> +
> +1. Set once (the "wish"): the requested Rx/Tx PAUSE policy is
> + remembered even if it cannot be applied immediately.
> +2. Applied conditionally: when the link comes up, the kernel enables
> + PAUSE only if the active mode allows it.
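Read literally, the policy quoted above amounts to something like the
following standalone C sketch. All types and names are hypothetical,
this is not how any driver or phylib implements it, and pause
autonegotiation is deliberately left out to keep it short.

```c
#include <stdbool.h>

struct pause_cfg {
	bool rx;
	bool tx;
};

/* Hypothetical view of the current link. */
struct link_state {
	bool up;
	bool full_duplex;
	bool pfc_active;
};

/* The stored "wish": remembered regardless of the link state. */
static struct pause_cfg requested;

/* The request is always accepted and stored. */
static int set_pause_wish(struct pause_cfg wish)
{
	requested = wish;
	return 0;
}

/* What actually gets programmed for the given link state. */
static struct pause_cfg effective_pause(const struct link_state *link)
{
	struct pause_cfg off = { false, false };

	/* Link down, half duplex or PFC in use: PAUSE stays off. */
	if (!link->up || !link->full_duplex || link->pfc_active)
		return off;
	return requested;
}
```

The stored request never changes when the link does; only the result of
effective_pause() changes.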
IDK about this section and also ...
> Keeping Close Tabs on the PAL
> =============================
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index c869b7f8bce8..1f121108f236 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -931,9 +931,48 @@ struct kernel_ethtool_ts_info {
> * @get_pause_stats: Report pause frame statistics. Drivers must not zero
> * statistics which they don't report. The stats structure is initialized
> * to ETHTOOL_STAT_NOT_SET indicating driver does not report statistics.
> - * @get_pauseparam: Report pause parameters
> - * @set_pauseparam: Set pause parameters. Returns a negative error code
> - * or zero.
> + *
> + * @get_pauseparam: Report the configured policy for link-wide PAUSE
> + * (IEEE 802.3 Annex 31B). Drivers must fill struct ethtool_pauseparam
> + * such that:
> + * @autoneg:
> + * This refers to **Pause Autoneg** (IEEE 802.3 Annex 31B) only
> + * and is independent of generic link autonegotiation configured
> + * via ethtool -s.
> + * true -> the device follows the negotiated result of pause
> + * autonegotiation (Pause/Asym);
> + * false -> the device uses a forced MAC state independent of
> + * negotiation.
> + * @rx_pause/@tx_pause:
> + * represent the desired policy (preferred configuration).
> + * In autoneg mode they describe what is to be advertised;
... this. IDK what you guys do in the Linux-managed code but the
convention for integrated devices is spelled out here:
/**
* struct ethtool_pauseparam - Ethernet pause (flow control) parameters
* @cmd: Command number = %ETHTOOL_GPAUSEPARAM or %ETHTOOL_SPAUSEPARAM
* @autoneg: Flag to enable autonegotiation of pause frame use
* @rx_pause: Flag to enable reception of pause frames
* @tx_pause: Flag to enable transmission of pause frames
*
* Drivers should reject a non-zero setting of @autoneg when <<< [1]
* autonegotiation is disabled (or not supported) for the link. <<<
*
* If the link is autonegotiated, drivers should use
* mii_advertise_flowctrl() or similar code to set the advertised
* pause frame capabilities based on the @rx_pause and @tx_pause flags,
* even if @autoneg is zero. They should also allow the advertised
* pause frame capabilities to be controlled directly through the
* advertising field of &struct ethtool_cmd.
*
* If @autoneg is non-zero, the MAC is configured to send and/or
* receive pause frames according to the result of autonegotiation.
* Otherwise, it is configured directly based on the @rx_pause and
* @tx_pause flags.
*/
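Spelled out as code, that convention looks roughly like the standalone C
sketch below. Everything in it is hypothetical: the struct and function
names, the ADV_* bits standing in for the Pause/Asym_Pause link modes,
and -EINVAL as the rejection code (the comment only says "reject"). The
advertisement mapping is modeled on mii_advertise_flowctrl() and the
resolution on the usual Pause/Asym table.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical advertisement bits (Pause / Asym_Pause). */
#define ADV_PAUSE 0x1u
#define ADV_ASYM  0x2u

struct pauseparam {
	bool autoneg;
	bool rx_pause;
	bool tx_pause;
};

/* Hypothetical per-port state of an integrated device. */
struct port {
	bool link_autoneg;        /* link autonegotiation enabled */
	unsigned int advertising; /* ADV_* bits we advertise */
	struct pauseparam pause;  /* last accepted request */
};

/* Mapping modeled on mii_advertise_flowctrl(). */
static unsigned int advertise_flowctrl(bool rx, bool tx)
{
	unsigned int adv = 0;

	if (rx)
		adv = ADV_PAUSE | ADV_ASYM;
	if (tx)
		adv ^= ADV_ASYM;
	return adv;
}

static int set_pauseparam(struct port *p, const struct pauseparam *req)
{
	/* [1]: pause autoneg needs link autoneg. -EINVAL is an assumption. */
	if (req->autoneg && !p->link_autoneg)
		return -EINVAL;

	p->pause = *req;
	/* Advertise from the rx/tx flags even if req->autoneg is zero. */
	if (p->link_autoneg)
		p->advertising = advertise_flowctrl(req->rx_pause, req->tx_pause);
	return 0;
}

/* MAC state once the link is up; partner_adv holds the partner's ADV_* bits. */
static void mac_pause_state(const struct port *p, unsigned int partner_adv,
			    bool *rx, bool *tx)
{
	unsigned int common = p->advertising & partner_adv;

	if (!p->pause.autoneg) {
		/* Forced: configured directly from the flags. */
		*rx = p->pause.rx_pause;
		*tx = p->pause.tx_pause;
	} else if (common & ADV_PAUSE) {
		*rx = *tx = true;
	} else if (common & ADV_ASYM) {
		*rx = (p->advertising & ADV_PAUSE) != 0;
		*tx = (partner_adv & ADV_PAUSE) != 0;
	} else {
		*rx = *tx = false;
	}
}
```

Note that in this model a request with autoneg set is refused outright
when link autoneg is off; nothing is stored for later.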
Doesn't [1] contradict your description of kernel "storing the config"?
Also you're not reflecting this in the help for the set op..