Date:	Sat, 15 Oct 2011 08:39:35 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Rick Jones <rick.jones2@...com>
Cc:	David Miller <davem@...emloft.net>, netdev@...r.kernel.org
Subject: Re: [PATCH net-next] tcp: reduce memory needs of out of order queue

On Friday, 14 October 2011 at 15:12 -0700, Rick Jones wrote:

Thanks Rick


> So, a test as above from a system running 2.6.38-11-generic to a system 
> running 3.0.0-12-generic.  On the sender we have:
> 
> raj@...dy:~/netperf2_trunk$ netstat -s > before; src/netperf -H 
> raj-8510w.americas.hpqcorp.net -t tcp_rr -- -b 256 -D -o 
> throughput,local_transport_retrans,remote_transport_retrans,lss_size_end,rsr_size_end 
> ; netstat -s > after
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET 
> to internal-host.americas.hpqcorp.net (16.89.245.115) port 0 AF_INET : 
> nodelay : first burst 256
> Throughput,Local Transport Retransmissions,Remote Transport 
> Retransmissions,Local Send Socket Size Final,Remote Recv Socket Size Final
> 76752.43,274,0,16384,98304
> 
> 274 retransmissions at the sender.  The "beforeafter" of that on the sender:
> 
> raj@...dy:~/netperf2_trunk$ cat delta.send

> Tcp:
>      2 active connections openings
>      0 passive connection openings
>      0 failed connection attempts
>      0 connection resets received
>      0 connections established
>      766727 segments received
>      734408 segments send out

>      274 segments retransmited

	That is exactly the count of frames dropped because the receiver's
sk_rmem_alloc + backlog.len hit its sk_rcvbuf:

/* true if charging this skb would push rmem_alloc + backlog past sk_rcvbuf */
static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
{
        unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);

        return qsize + skb->truesize > sk->sk_rcvbuf;
}

/* queue skb on the backlog, or refuse with -ENOBUFS when the socket is full */
static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
        if (sk_rcvqueues_full(sk, skb))
                return -ENOBUFS;

        __sk_add_backlog(sk, skb);
        sk->sk_backlog.len += skb->truesize;
        return 0;
}
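
For context, this check is hit from the softirq receive path when the
socket is owned by the user. As a rough sketch (a simplified fragment,
not a verbatim copy; details vary a bit between kernel versions),
tcp_v4_rcv() of this era feeds the backlog and bumps TCPBacklogDrop
like this:

        bh_lock_sock_nested(sk);
        ret = 0;
        if (!sock_owned_by_user(sk)) {
                /* socket not owned by user: process (or prequeue) directly */
                if (!tcp_prequeue(sk, skb))
                        ret = tcp_v4_do_rcv(sk, skb);
        } else if (unlikely(sk_add_backlog(sk, skb))) {
                /* backlog + rmem_alloc would exceed sk_rcvbuf: drop the skb */
                bh_unlock_sock(sk);
                NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
                goto discard_and_relse;
        }
        bh_unlock_sock(sk);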

In very old kernels we had no limit on the backlog, so we could queue a
lot of extra skbs in it and eventually consume all kernel memory (OOM).

refs:	commit c377411f249 (net: sk_add_backlog() take rmem_alloc into account)
	commit 6b03a53a5ab7 (tcp: use limited socket backlog)
	commit 8eae939f14003 (net: add limit for socket backlog)

Now that we enforce a limit, it is better to choose a correct limit / TCP
window combination so that normal traffic doesn't trigger drops at the
receiver.
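
To make that combination concrete: the advertised window is derived from
sk_rcvbuf via tcp_adv_win_scale, roughly like the helper below (a sketch
of kernels of this era); whatever is not advertised as window is the only
headroom left for skb truesize overhead.

/* rough sketch of tcp_win_from_space() as found in kernels of this era */
static inline int tcp_win_from_space(int space)
{
        return sysctl_tcp_adv_win_scale <= 0 ?
                (space >> (-sysctl_tcp_adv_win_scale)) :
                space - (space >> sysctl_tcp_adv_win_scale);
}

With the default tcp_adv_win_scale of 2 in these kernels and the
98304-byte receive buffer reported above, that is a 73728-byte window and
only 24576 bytes of headroom for overhead, which a burst of tiny requests
can exhaust quickly.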

>      0 bad segments received.
>      0 resets sent
> Udp:
>      7 packets received
>      0 packets to unknown port received.
>      0 packet receive errors
>      7 packets sent
> UdpLite:
> TcpExt:
>      0 packets pruned from receive queue because of socket buffer overrun
>      0 ICMP packets dropped because they were out-of-window
>      0 TCP sockets finished time wait in fast timer
>      2 delayed acks sent
>      0 delayed acks further delayed because of locked socket
>      Quick ack mode was activated 0 times
>      170856 packets directly queued to recvmsg prequeue.
>      1204 bytes directly in process context from backlog
>      170678 bytes directly received in process context from prequeue
>      592090 packet headers predicted
>      170626 packets header predicted and directly queued to user
>      1375 acknowledgments not containing data payload received
>      174911 predicted acknowledgments
>      150 times recovered from packet loss by selective acknowledgements
>      0 congestion windows recovered without slow start by DSACK
>      0 congestion windows recovered without slow start after partial ack
>      299 TCP data loss events
>      TCPLostRetransmit: 9
>      0 timeouts after reno fast retransmit
>      0 timeouts after SACK recovery
>      253 fast retransmits
>      14 forward retransmits
>      6 retransmits in slow start
>      0 other TCP timeouts
>      1 SACK retransmits failed
>      0 times receiver scheduled too late for direct processing
>      0 packets collapsed in receive queue due to low socket buffer
>      0 DSACKs sent for old packets
>      0 DSACKs received
>      0 connections reset due to unexpected data
>      0 connections reset due to early user close
>      0 connections aborted due to timeout
>      0 times unabled to send RST due to no memory
>      TCPDSACKIgnoredOld: 0
>      TCPDSACKIgnoredNoUndo: 0
>      TCPSackShifted: 0
>      TCPSackMerged: 1031
>      TCPSackShiftFallback: 240
>      TCPBacklogDrop: 0
>      IPReversePathFilter: 0
> IpExt:
>      InMcastPkts: 0
>      OutMcastPkts: 0
>      InBcastPkts: 1
>      InOctets: -1012182764
>      OutOctets: -1436530450
>      InMcastOctets: 0
>      OutMcastOctets: 0
>      InBcastOctets: 147
> 
> and then the deltas on the receiver:
> 
> raj@...-8510w:~/netperf2_trunk$ cat delta.recv
> Ip:
>      734669 total packets received
>      0 with invalid addresses
>      0 forwarded
>      0 incoming packets discarded
>      734669 incoming packets delivered
>      766696 requests sent out
>      0 dropped because of missing route
> Icmp:
>      0 ICMP messages received
>      0 input ICMP message failed.
>      ICMP input histogram:
>          destination unreachable: 0
>      0 ICMP messages sent
>      0 ICMP messages failed
>      ICMP output histogram:
> IcmpMsg:
>          InType3: 0
> Tcp:
>      0 active connections openings
>      2 passive connection openings
>      0 failed connection attempts
>      0 connection resets received
>      0 connections established
>      734651 segments received
>      766695 segments send out
>      0 segments retransmited
>      0 bad segments received.
>      0 resets sent
> Udp:
>      1 packets received
>      0 packets to unknown port received.
>      0 packet receive errors
>      1 packets sent
> UdpLite:
> TcpExt:
>      28 packets pruned from receive queue because of socket buffer overrun
>      0 delayed acks sent
>      0 delayed acks further delayed because of locked socket
>      19 packets directly queued to recvmsg prequeue.
>      0 bytes directly in process context from backlog
>      667 bytes directly received in process context from prequeue
>      727842 packet headers predicted
>      9 packets header predicted and directly queued to user
>      161 acknowledgments not containing data payload received
>      229704 predicted acknowledgments


>      6774 packets collapsed in receive queue due to low socket buffer
>      TCPBacklogDrop: 276

	Yes, these two counters explain it all.

	1) "6774 packets collapsed in receive queue due to low socket buffer"

We spend a _lot_ of CPU time in the "collapsing" process: taking several
skbs and building a compound one (using one page and trying to fill all
the available bytes in it with contiguous parts). A rough sketch of the
idea follows after point 2) below.

Doing this work is of course the last desperate attempt before the much
more painful:

	2) TCPBacklogDrop: 276

	We simply drop incoming packets because the socket is already using
too much kernel memory.
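
To illustrate the idea behind 1) (a user-space analogy, not the kernel's
actual tcp_collapse()): many tiny segments, each charged a large truesize,
get copied back to back into one contiguous page-sized buffer, trading CPU
for memory.

/* collapse_demo.c: rough user-space analogy of receive-queue collapsing */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096

struct fake_skb {
        int len;        /* payload bytes */
        int truesize;   /* memory actually charged to the socket */
        char data[64];
};

int main(void)
{
        struct fake_skb small[32];
        char *page = malloc(PAGE_SZ);
        int i, off = 0, old_truesize = 0;

        /* 32 tiny segments, each charged ~2KB of truesize for 16 bytes of data */
        for (i = 0; i < 32; i++) {
                small[i].len = 16;
                small[i].truesize = 2048;
                memset(small[i].data, 'x', small[i].len);
                old_truesize += small[i].truesize;
        }

        /* "collapse": copy all payloads back to back into a single page */
        for (i = 0; i < 32 && off + small[i].len <= PAGE_SZ; i++) {
                memcpy(page + off, small[i].data, small[i].len);
                off += small[i].len;
        }

        printf("before: %d data bytes charged as %d bytes of truesize\n",
               off, old_truesize);
        printf("after : %d data bytes charged as roughly one page (%d bytes)\n",
               off, PAGE_SZ);
        free(page);
        return 0;
}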

> IpExt:
>      InMcastPkts: 0
>      OutMcastPkts: 0
>      InBcastPkts: 17
>      OutBcastPkts: 0
>      InOctets: 38973144
>      OutOctets: 40673137
>      InMcastOctets: 0
>      OutMcastOctets: 0
>      InBcastOctets: 1816
>      OutBcastOctets: 0
> 
> this is an otherwise clean network, no errors reported by ifconfig or 
> ethtool -S, and the packet rate was well within the limits of 1 GbE and 
> the ProCurve 2724 switch between the two systems.
> 
>  From just a very quick look it looks like tcp_v[46]_rcv is called, 
> finds that the socket is owned by the user, attempts to add to the 
> backlog, but the path called by sk_add_backlog does not seem to make any 
> attempts to compress things, so when the quantity of data is << the 
> truesize it starts tossing babies out with the bathwater.
> 

Rick, could you redo the test using the following setting on the receiver:

echo 1 >/proc/sys/net/ipv4/tcp_adv_win_scale

If you still see collapses/retransmits, you could then try:

echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
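
For reference, here is a tiny user-space sketch (assuming the 98304-byte
receive buffer from the run above) of what those two settings do to the
split between advertised window and truesize headroom, mirroring the
tcp_win_from_space() sketch earlier in this mail:

/* winscale_demo.c: window vs. overhead headroom for a few adv_win_scale values */
#include <stdio.h>

static int win_from_space(int space, int scale)
{
        return scale <= 0 ? (space >> -scale) : space - (space >> scale);
}

int main(void)
{
        const int rcvbuf = 98304;          /* rsr_size_end from the netperf run */
        const int scales[] = { 2, 1, -2 }; /* old default, then the two suggestions */
        int i;

        for (i = 0; i < 3; i++) {
                int win = win_from_space(rcvbuf, scales[i]);
                printf("tcp_adv_win_scale %2d: window %6d, overhead headroom %6d\n",
                       scales[i], win, rcvbuf - win);
        }
        return 0;
}

So "echo 1" reserves half of the buffer for skb overhead and "echo -2"
reserves three quarters of it, at the cost of a smaller advertised window.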

Thanks!


