Message-ID: <2d7557d6-6702-485a-863a-f29abeb5ed4f@linux.ibm.com>
Date: Sat, 13 Jul 2024 07:35:09 -0500
From: Nick Child <nnac123@...ux.ibm.com>
To: Jay Vosburgh <jv@...sburgh.net>
Cc: netdev@...r.kernel.org
Subject: Re: [RFC PATCH net-next 0/1] bonding: Return TX congested if no
 active slave

Hi Jay, thanks for taking a look.

On 7/12/24 18:34, Jay Vosburgh wrote:
> Nick Child <nnac123@...ux.ibm.com> wrote:
> 
>> 1. During a zero window probe, the socket attempts to get an updated
>> window from the peer 15 times (default of tcp_retries2). Looking at
>> tcp_send_probe0(), the timer between probes is either doubled or 0.5
>> seconds. The timer is doubled if the skb transmit returns
>> NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the
>> timer is set to a static 0.5 seconds if NET_XMIT_DROP is returned. This
>> means the socket can ETIMEOUT after 7.5 seconds. The return code is
>> typically the return code of __dev_queue_xmit().
> 
> 	I'm not sure that your description of the behavior of
> tcp_send_probe0() matches the code.
> 
> 	It looks to me like the timer doubles for "if (err <= 0)",
> meaning a negative value or NET_XMIT_SUCCESS.  The 0.5 timeout value (in
> TCP_RESOURCE_PROBE_INTERVAL) is set in the "else" after "if (err <= 0)",
> so NET_XMIT_DROP and NET_XMIT_CN both qualify and would presumably
> result in the 0.5 second timeout.
> 
> 	However, since tcp_write_wakeup() -> tcp_transmit_skb() ->
> __tcp_transmit_skb() will convert NET_XMIT_CN to 0 (which is
> NET_XMIT_SUCCESS) via net_xmit_eval(), I'm not sure that it's possible
> for err to equal NET_XMIT_CN here.

Apologies, I was oversimplifying in my explanation. I was referring to
dev_queue_xmit() returning either CN or DROP. On its way up the stack,
the CN return code is mapped to 0 (in net_xmit_eval()). So the proper
phrasing is: "when dev_queue_xmit() returns CN, the timer is doubled;
when dev_queue_xmit() returns DROP, the timer is 0.5 seconds".

> 	I'll note that the 0.5 second timeout logic had a relatively
> recent change in c1d5674f8313 ("tcp: less aggressive window probing on
> local congestion").  From reading the log, the intent seems to be to
> bound the maximum probe interval to 0.5 seconds in low-RTT environments.
> 

I don't have any complaints about the quick timeout; I believe it has
valid applications. I just don't think this scenario (when the bond
device is waiting for dev_deactivate) falls into that category.

>> 2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the
>> enqueue function is defined. In this circumstance, the qdisc enqueue
>> function return code propagates up the stack. On the other hand, if the
>> qdisc enqueue function is NULL then the driver's xmit function is called
>> directly via dev_hard_start_xmit(). In this circumstance, the driver's
>> xmit return code propagates up the stack.
>>
>> 3. The bonding device uses IFF_NO_QUEUE, which sets its qdisc to
>> noqueue_qdisc_ops. noqueue_qdisc_ops has a NULL enqueue
>> function. Therefore, when the bonding device is carrier UP,
>> bond_start_xmit is called directly. In this function, depending on the
>> bonding mode, a slave device is assigned to the skb and
>> __dev_queue_xmit() is called again. This time the slave's qdisc enqueue
>> function (which is almost always defined) is called.
> 
> 	Does your analysis or behavior change if the bond itself does
> have a qdisc?  IFF_NO_QUEUE does not install one by default, but users
> are free to add one.
> 

Good question. I did not try it, though I don't think we would see this
issue, because the return code of dev_queue_xmit() would have no way to
propagate all the way up to tcp_write_wakeup().

>> 4. When a device calls netif_carrier_off(), it schedules dev_deactivate
>> which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue
>> function of noop_qdisc is defined but simply returns NET_XMIT_CN.
>>
>> 5. The miimon of the bonding device periodically checks for the carrier
>> status of its slaves. If it detects that all of its slaves are down,
>> then it sets currently_active_slave to NULL and calls
>> netif_carrier_off() on itself.
>>
>> 6. In the bonding device's xmit function, if it does not have any
>> active slaves, it returns NET_XMIT_DROP.
>>
>> Given these observations, assume a bonding device's slaves all suddenly
>> call netif_carrier_off(), and that a TCP connection is in a zero window
>> probe. There is a window for a behavioral issue here:
> 
>> 1. If the bonding device does not notice that its slaves are down
>> (maybe its miimon interval is too large or the miimon commit could not
>> grab rtnl), then the slave's enqueue function is invoked. This will
>> either return NET_XMIT_SUCCESS or NET_XMIT_CN. The probe timer is
>> doubled.
> 
> 	As I mentioned above, I'm not sure this accurately describes the
> behavior in tcp_send_probe0()
> 
> 	-J
> 
The call stack I describe here looks like this:
     pfifo_fast_enqueue+12
     dev_qdisc_enqueue+76
     __dev_queue_xmit+1700
     bond_dev_queue_xmit+68
     bond_start_xmit+1092
     dev_hard_start_xmit+300
     __dev_queue_xmit+3124
     ip_finish_output2+1020
     ip_output+228
     ip_local_out+104
     __ip_queue_xmit+384
     __tcp_transmit_skb+1644

Considering that tcp_write_wakeup() calls tcp_transmit_skb(), and
__tcp_transmit_skb() translates NET_XMIT_CN to 0 through net_xmit_eval(),
I believe my statement is correct: if the slave's enqueue function is
invoked (in this case pfifo_fast) and NET_XMIT_SUCCESS or NET_XMIT_CN is
returned, then the probe timer is doubled.
>> -- 
>> 2.43.0
> 
> ---
> 	-Jay Vosburgh, jv@...sburgh.net
