[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iKPTWBdi8upoxjFok2CPFhkGB9S3crZcefZ0mRhFHGPhQ@mail.gmail.com>
Date: Thu, 21 Aug 2025 05:29:38 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: chia-yu.chang@...ia-bell-labs.com
Cc: pabeni@...hat.com, linux-doc@...r.kernel.org, corbet@....net,
horms@...nel.org, dsahern@...nel.org, kuniyu@...zon.com, bpf@...r.kernel.org,
netdev@...r.kernel.org, dave.taht@...il.com, jhs@...atatu.com,
kuba@...nel.org, stephen@...workplumber.org, xiyou.wangcong@...il.com,
jiri@...nulli.us, davem@...emloft.net, andrew+netdev@...n.ch,
donald.hunter@...il.com, ast@...erby.net, liuhangbin@...il.com,
shuah@...nel.org, linux-kselftest@...r.kernel.org, ij@...nel.org,
ncardwell@...gle.com, koen.de_schepper@...ia-bell-labs.com,
g.white@...lelabs.com, ingemar.s.johansson@...csson.com,
mirja.kuehlewind@...csson.com, cheshire@...le.com, rs.ietf@....at,
Jason_Livingood@...cast.com, vidhi_goel@...le.com
Subject: Re: [PATCH v15 net-next 10/14] tcp: accecn: AccECN option
On Fri, Aug 15, 2025 at 1:40 AM <chia-yu.chang@...ia-bell-labs.com> wrote:
>
> From: Ilpo Järvinen <ij@...nel.org>
>
> The Accurate ECN allows echoing back the sum of bytes for
> each IP ECN field value in the received packets using
> AccECN option. This change implements AccECN option tx & rx
> side processing without option send control related features
> that are added by a later change.
>
> Based on specification:
> https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
> (Some features of the spec will be added in the later changes
> rather than in this one).
>
> A full-length AccECN option is always attempted but if it does
> not fit, the minimum length is selected based on the counters
> that have changed since the last update. The AccECN option
> (with 24-bit fields) often ends in odd sizes so the option
> write code tries to take advantage of some nop used to pad
> the other TCP options.
>
> The delivered_ecn_bytes pairs with received_ecn_bytes similar
> to how delivered_ce pairs with received_ce. In contrast to
> ACE field, however, the option is not always available to update
> delivered_ecn_bytes. For ACK w/o AccECN option, the delivered
> bytes calculated based on the cumulative ACK+SACK information
> are assigned to one of the counters using an estimation
> heuristic to select the most likely ECN byte counter. Any
> estimation error is corrected when the next AccECN option
> arrives. It may occur that the heuristic gets too confused
> when there are enough different byte counter deltas between
> ACKs with the AccECN option in which case the heuristic just
> gives up on updating the counters for a while.
>
> tcp_ecn_option sysctl can be used to select option sending
> mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM,
> and TCP_ECN_OPTION_FULL.
>
> This patch increases the size of tcp_info struct, as there is
> no existing holes for new u32 variables. Below are the pahole
> outcomes before and after this patch:
>
> [BEFORE THIS PATCH]
> struct tcp_info {
> [...]
> __u32 tcpi_total_rto_time; /* 244 4 */
>
> /* size: 248, cachelines: 4, members: 61 */
> }
>
> [AFTER THIS PATCH]
> struct tcp_info {
> [...]
> __u32 tcpi_total_rto_time; /* 244 4 */
> __u32 tcpi_received_ce; /* 248 4 */
> __u32 tcpi_delivered_e1_bytes; /* 252 4 */
> __u32 tcpi_delivered_e0_bytes; /* 256 4 */
> __u32 tcpi_delivered_ce_bytes; /* 260 4 */
> __u32 tcpi_received_e1_bytes; /* 264 4 */
> __u32 tcpi_received_e0_bytes; /* 268 4 */
> __u32 tcpi_received_ce_bytes; /* 272 4 */
>
> /* size: 280, cachelines: 5, members: 68 */
> }
>
> This patch uses the existing 1-byte holes in the tcp_sock_write_txrx
> group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx
> group after the new u32 delivered_ecn_bytes[3] member. Therefore, the
> group size of tcp_sock_write_rx is increased from 96 to 112. Below
> are the pahole outcomes before and after this patch:
>
> [BEFORE THIS PATCH]
> struct tcp_sock {
> [...]
> u8 received_ce_pending:4; /* 2522: 0 1 */
> u8 unused2:4; /* 2522: 4 1 */
> /* XXX 1 byte hole, try to pack */
>
> [...]
> u32 rcv_rtt_last_tsecr; /* 2668 4 */
>
> [...]
> __cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */
>
> [...]
> /* size: 3200, cachelines: 50, members: 167 */
> }
>
> [AFTER THIS PATCH]
> struct tcp_sock {
> [...]
> u8 received_ce_pending:4;/* 2522: 0 1 */
> u8 unused2:4; /* 2522: 4 1 */
> u8 accecn_minlen:2; /* 2523: 0 1 */
> u8 est_ecnfield:2; /* 2523: 2 1 */
> u8 unused3:4; /* 2523: 4 1 */
>
> [...]
> u32 rcv_rtt_last_tsecr; /* 2668 4 */
> u32 delivered_ecn_bytes[3];/* 2672 12 */
> /* XXX 4 bytes hole, try to pack */
>
> [...]
> __cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */
>
> [...]
> /* size: 3200, cachelines: 50, members: 171 */
> }
>
> Signed-off-by: Ilpo Järvinen <ij@...nel.org>
> Signed-off-by: Neal Cardwell <ncardwell@...gle.com>
> Co-developed-by: Chia-Yu Chang <chia-yu.chang@...ia-bell-labs.com>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@...ia-bell-labs.com>
>
> ---
> v13
> - Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and
> TCP_ACCECN_CEB_INIT_OFFSET to this patch
> - Use static array lookup in tcp_accecn_optfield_to_ecnfield()
> - Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option()
> - Make synack_ecn_bytes as static const array and use const u32 pointer
> in tcp_options_write()
> - Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN
> option to dword
>
> v10
> - Add documentation of tcp_ecn_option in ip-sysctl.rst to this patch
> - Remove the global variable u32 synack_ecn_bytes[3]
> - Add READ_ONCE() over every reads of sysctl
>
> v9:
> - Restruct the code in the for loop of tcp_accecn_process_option()
> - Remove ecn_bytes and add use_synack_ecn_bytes flag in tcp_out_options
> struct to identify whether syn_ack_bytes or received_ecn_bytes is used
> - Replace leftover_bytes and leftover_size with leftover_highbyte and
> leftover_lowbyte and add comments in tcp_options_write()
>
> v8:
> - Reset leftover_size to 2 once leftover_bytes is used
> ---
> Documentation/networking/ip-sysctl.rst | 19 ++
> .../networking/net_cachelines/tcp_sock.rst | 3 +
> include/linux/tcp.h | 9 +-
> include/net/netns/ipv4.h | 1 +
> include/net/tcp.h | 13 ++
> include/net/tcp_ecn.h | 89 +++++++++-
> include/uapi/linux/tcp.h | 7 +
> net/ipv4/sysctl_net_ipv4.c | 9 +
> net/ipv4/tcp.c | 15 +-
> net/ipv4/tcp_input.c | 94 +++++++++-
> net/ipv4/tcp_ipv4.c | 1 +
> net/ipv4/tcp_output.c | 165 +++++++++++++++++-
> 12 files changed, 412 insertions(+), 13 deletions(-)
>
minlen);
> + }
> }
> }
>
> @@ -263,6 +347,9 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp)
> tp->received_ce = 0;
> tp->received_ce_pending = 0;
> __tcp_accecn_init_bytes_counters(tp->received_ecn_bytes);
> + __tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes);
> + tp->accecn_minlen = 0;
> + tp->est_ecnfield = 0;
> }
>
> /* Used for make_synack to form the ACE flags */
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index bdac8c42fa82..53e0e85b52be 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -316,6 +316,13 @@ struct tcp_info {
> * in milliseconds, including any
> * unfinished recovery.
> */
> + __u32 tcpi_received_ce; /* # of CE marks received */
> + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */
> + __u32 tcpi_delivered_e0_bytes;
> + __u32 tcpi_delivered_ce_bytes;
> + __u32 tcpi_received_e1_bytes;
> + __u32 tcpi_received_e0_bytes;
> + __u32 tcpi_received_ce_bytes;
> };
>
We do not add more fields to tcp_info, unless added fields are a
multiple of 64 bits.
Otherwise a hole is added and can not be recovered.
Powered by blists - more mailing lists