Message-ID: <ec77b414-32f5-b512-7f1a-fb939430cea6@drivenets.com>
Date: Mon, 26 Apr 2021 19:58:50 +0300
From: Leonard Crestez <lcrestez@...venets.com>
To: Neal Cardwell <ncardwell@...gle.com>,
Matt Mathis <mattmathis@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Willem de Bruijn <willemb@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
David Ahern <dsahern@...nel.org>,
John Heffner <johnwheffner@...il.com>,
Soheil Hassas Yeganeh <soheil@...gle.com>,
Roopa Prabhu <roopa@...ulusnetworks.com>,
Leonard Crestez <cdleonard@...il.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC] tcp: Consider mtu probing for tcp_xmit_size_goal
On 26.04.2021 18:40, Neal Cardwell wrote:
> Thanks for the patch. Some thoughts, in-line below:
>
> On Sun, Apr 25, 2021 at 10:22 PM Leonard Crestez <lcrestez@...venets.com> wrote:
>>
>> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
>> in order to accumulate enough data" but linux almost never does that.
>>
>> Linux checks for probe_size + (1 + retries) * mss_cache to be available
>
> I think this means to say "reordering" rather than "retries"?
>
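Yes, "reordering" is what I meant. The condition (the same one
tcp_mtu_probe() already computes) is:

	probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
					 icsk->icsk_mtup.search_low) >> 1);
	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;

I'll fix the commit message.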
>> in the send buffer and if that condition is not met it will send anyway
>> using the current MSS. The feature can be made to work by sending very
>> large chunks of data from userspace (for example 128k) but for small
>> writes on fast links tcp mtu probes almost never happen.
>>
>> This patch tries to take mtu probe into account in tcp_xmit_size_goal, a
>> function which otherwise attempts to accumulate a packet suitable for
>> TSO. No delays are introduced beyond existing autocork heuristics.
>
> I would suggest phrasing this paragraph as something like:
>
> This patch generalizes tcp_xmit_size_goal(), a
> function which is used in autocorking in order to attempt to
> accumulate suitable transmit sizes, so that it takes into account
> not only TSO and receive window, but now also transmit
> sizes needed for efficient PMTU discovery (when a flow is
> eligible for PMTU discovery).
>
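Thanks, I'll use this wording. For reference, the hook in
tcp_xmit_size_goal() is shaped roughly like this (untested sketch, the
exact form may change in the next version):

	/* When an mtu probe is pending, raise the autocork size goal so
	 * that enough data accumulates for tcp_mtu_probe():
	 */
	if (inet_csk(sk)->icsk_mtup.wait_data)
		size_goal = max_t(u32, size_goal,
				  tcp_mtu_probe_size_needed(sk));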
>>
>> Suggested-by: Matt Mathis <mattmathis@...gle.com>
>> Signed-off-by: Leonard Crestez <lcrestez@...venets.com>
>> ---
> ...
>> +
>> + /* Timer for wait_data */
>> + struct timer_list wait_data_timer;
>
> This timer_list seems left over from a previous patch version, and no
> longer used?
Yes, I'll post a cleaned-up version.
>> @@ -1375,10 +1376,12 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk)
>> s32 delta;
>>
>> if (!sock_net(sk)->ipv4.sysctl_tcp_slow_start_after_idle || tp->packets_out ||
>> ca_ops->cong_control)
>> return;
>> + if (inet_csk(sk)->icsk_mtup.wait_data)
>> + return;
>> delta = tcp_jiffies32 - tp->lsndtime;
>> if (delta > inet_csk(sk)->icsk_rto)
>> tcp_cwnd_restart(sk, delta);
>> }
>
> I would argue that the MTU probing heuristics should not override
> congestion control. If congestion control thinks it's proper to reset
> the cwnd to a lower value due to the connection being idle, then this
> should happen irrespective of MTU probing activity.
This is a hack from the previous version, removed.
>> int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
>> {
>> int mss_now;
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index bde781f46b41..c15ed548a48a 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -2311,10 +2311,23 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
>> }
>>
>> return true;
>> }
>>
>> +int tcp_mtu_probe_size_needed(struct sock *sk)
>> +{
>> + struct inet_connection_sock *icsk = inet_csk(sk);
>> + struct tcp_sock *tp = tcp_sk(sk);
>> + int probe_size;
>> + int size_needed;
>> +
>> + probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high + icsk->icsk_mtup.search_low) >> 1);
>> + size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> Adding this helper without refactoring tcp_mtu_probe() to use this
> helper would leave some significant unnecessary code duplication. To
> avoid duplication, AFAICT your tcp_mtu_probe() should call your new
> helper.
Yes. This function should also check if RACK is enabled and try to send
smaller probes in that case.
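Something like this, with an out-param so tcp_mtu_probe() can reuse the
midpoint computation (sketch only, the RACK handling would come on top):

	int tcp_mtu_probe_size_needed(struct sock *sk, int *probe_size)
	{
		struct inet_connection_sock *icsk = inet_csk(sk);
		struct tcp_sock *tp = tcp_sk(sk);
		int probe_len;

		/* Probe the midpoint of the current search range: */
		probe_len = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
						icsk->icsk_mtup.search_low) >> 1);
		if (probe_size)
			*probe_size = probe_len;

		/* The probe itself plus enough trailing segments to
		 * trigger fast retransmit if the probe is lost:
		 */
		return probe_len + (tp->reordering + 1) * tp->mss_cache;
	}

tcp_mtu_probe() would then call this instead of open-coding both lines.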
> You may also want to make this little helper static inline, since
> AFAICT it is called in some hot paths.
It's only called when icsk->icsk_mtup.wait_data is true, so only for a
brief while every ten minutes or so (once per probe interval). It is
possible that this "wait" stretches for a long time, though.
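If it helps, the test can be wrapped so the cost is explicit, along
these lines (the helper name and the exact wmem check are just for
illustration):

	/* The cheap wait_data test guards the more expensive size
	 * computation, so nothing extra is paid on the common path:
	 */
	static inline bool tcp_mtu_probe_wants_data(struct sock *sk)
	{
		if (likely(!inet_csk(sk)->icsk_mtup.wait_data))
			return false;
		return sk->sk_wmem_queued < tcp_mtu_probe_size_needed(sk);
	}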
>> static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>> int push_one, gfp_t gfp)
>> {
>> + struct inet_connection_sock *icsk = inet_csk(sk);
>> struct tcp_sock *tp = tcp_sk(sk);
>> + struct net *net = sock_net(sk);
>> struct sk_buff *skb;
>> unsigned int tso_segs, sent_pkts;
>> int cwnd_quota;
>> int result;
>> bool is_cwnd_limited = false, is_rwnd_limited = false;
>> u32 max_segs;
>>
>> sent_pkts = 0;
>>
>> tcp_mstamp_refresh(tp);
>> + /*
>> + * Waiting for tcp probe data also applies when push_one=1
>> +	 * If user does many small writes we hold them until we have enough
>> + * for a probe.
>> + */
>
> This comment seems incomplete; with this patch AFAICT small writes are
> not simply held unconditionally until there is enough data for a
> full-sized MTU probe. Rather, the autocorking mechanism is used: if a
> flow is eligible for MTU probes, autocorking attempts to wait for a
> full amount of data with which to probe (and, as with autocorking in
> general, if that much data does not become available by the time the
> local send queues are empty, the flow sends out whatever it has).
The comment is from a previous version and should be deleted.
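If any comment stays there it should describe the autocork behavior you
outline, roughly:

	/* An mtu probe may need more data than a single write provides.
	 * Data is not held unconditionally: autocorking decides how long
	 * to keep accumulating, and once the local queues drain we send
	 * whatever is available even if it is short of a full probe.
	 */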
--
Regards,
Leonard