[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <2d7557d6-6702-485a-863a-f29abeb5ed4f@linux.ibm.com>
Date: Sat, 13 Jul 2024 07:35:09 -0500
From: Nick Child <nnac123@...ux.ibm.com>
To: Jay Vosburgh <jv@...sburgh.net>
Cc: netdev@...r.kernel.org
Subject: Re: [RFC PATCH net-next 0/1] bonding: Return TX congested if no
active slave
Hi Jay, thanks for taking a look.
On 7/12/24 18:34, Jay Vosburgh wrote:
> Nick Child <nnac123@...ux.ibm.com> wrote:
>
>> 1. During a zero window probe, the socket attempts to get an updated
>> window from the peer 15 times (default of tcp_retries2). Looking at
>> tcp_send_probe0(), the timer between probes is either doubled or 0.5
>> seconds. The timer is doubled if the skb transmit returns
>> NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the
>> timer is set to a static 0.5 if NET_XMIT_DROP is returned. This means
>> the socket can ETIMEOUT after 7.5 seconds. The return code is typically
>> the return code of __dev_queue_xmit()
>
> I'm not sure that your description of the behavior of
> tcp_send_probe0() matches the code.
>
> It looks to me like the timer doubles for "if (err <= 0)",
> meaning a negative value or NET_XMIT_SUCCESS. The 0.5 timeout value (in
> TCP_RESOURCE_PROBE_INTERVAL) is set in the "else" after "if (err <= 0)",
> so NET_XMIT_DROP and NET_XMIT_CN both qualify and would presumably
> result in the 0.5 second timeout.
>
> However, since tcp_write_wakeup() -> tcp_transmit_skb() ->
> __tcp_transmit_skb() will convert NET_XMIT_CN to 0 (which is
> NET_XMIT_SUCCESS) via net_xmit_eval(), I'm not sure that it's possible
> for err to equal NET_XMIT_CN here.
Apologies, I was oversimplifying in my explanation. I was referencing
dev_queue_xmit returning either CN or DROP. On its way up the stack,
return code CN is mapped to 0 (in net_xmit_eval()). So, proper phrasing
is "when dev_queue_xmit returns CN, the timer is doubled. When
dev_queue_xmit returns DROPPED, the timer is 0.5".
> I'll note that the 0.5 second timeout logic had a relatively
> recent change in c1d5674f8313 ("tcp: less aggressive window probing on
> local congestion"). From reading the log, the intent seems to be to
> bound the maximum probe interval to 0.5 seconds in low-RTT environments.
>
I don't have any complaints with the quick timeout. I believe it has
valid applications, I just don't think this scenario (when the bond
device is waiting for dev_deactivate) falls into that.
>> 2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the
>> enqueue function is defined. In this circumstance, the qdisc enqueue
>> function return code propagates up the stack. On the other hand, if the
>> qdisc enqueue function is NULL then the drivers xmit function is called
>> directly via dev_hard_start_xmit(). In this circumstance, the drivers
>> xmit return code propagates up the stack.
>>
>> 3. The bonding device uses IFF_NO_QUEUE, this sets qdisc to
>> noqueue_qdisc_ops. noqueue_qdisc_ops has NULL enqueue
>> function. Therefore, when the bonding device is carrier UP,
>> bond_start_xmit is called directly. In this function, depending on
>> bonding mode, a slave device is assigned to the skb and
>> __dev_queue_xmit() is called again. This time the slaves qdisc enqueue
>> function (which is almost always defined) is called.
>
> Does your analysis or behavior change if the bond itself does
> have a qdisc? IFF_NO_QUEUE does not install one by default, but users
> are free to add one.
>
Good question. I did not try. Though I don't think we would see this
issue because there would not be a way for the return code of
dev_queue_xmit() to propagate all the way up to tcp_write_wakeup().
>> 4. When a device calls netif_carrier_off(), it schedules dev_deactivate
>> which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue
>> function of noop_qdisc is defined but simply returns NET_XMIT_CN.
>>
>> 5. The miimon of the bonding device periodically checks for the carrier
>> status of its slaves. If it detects that all of its slaves are down,
>> then it sets currently_active_slave to NULL and calls
>> netif_carrier_off() on itself.
>>
>> 6. In the bonding devices xmit function, if it does not have any active
>> slaves, it returns NET_XMIT_DROP.
>>
>> Given these observations. Assume a bonding devices slaves all suddenly
>> call netif_carrier_off(). Also assume that a tcp connection is in a zero
>> window probe. There is a window for a behavioral issue here:
>
>> 1. If the bonding device does not notice that its slaves are down
>> (maybe its miimon interval is too large or the miimon commit could not
>> grab rtnl), then the slaves enqueue function is invoked. This will
>> either return NET_XMIT_SUCCESS OR NET_XMIT_CN. The probe timer is
>> doubled.
>
> As I mentioned above, I'm not sure this accurately describes the
> behavior in tcp_send_probe0()
>
> -J
>
The call stack I describe here looks like this:
pfifo_fast_enqueue+12
dev_qdisc_enqueue+76
__dev_queue_xmit+1700
bond_dev_queue_xmit+68
bond_start_xmit+1092
dev_hard_start_xmit+300
__dev_queue_xmit+3124
ip_finish_output2+1020
ip_output+228
ip_local_out+104
__ip_queue_xmit+384
__tcp_transmit_skb+1644
Considering that tcp_write_wakeup calls tcp_transmit_skb and
__tcp_transmit_skb translates NET_XMIT_CN to 0 through net_xmit_eval, I
believe my statement is correct. If the slaves
enqueue function is invoked (in this case pfifo_fast) and
NET_XMIT_SUCCESS or NET_XMIT_CN is returned then the probe timer is doubled.
>> --
>> 2.43.0
>
> ---
> -Jay Vosburgh, jv@...sburgh.net
Powered by blists - more mailing lists