Date:	Thu, 21 May 2009 11:07:19 +0200
From:	Eric Dumazet <dada1@...mosbay.com>
To:	David Miller <davem@...emloft.net>
CC:	khc@...waw.pl, netdev@...r.kernel.org, satoru.satoh@...il.com
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt

David Miller wrote:
> From: Eric Dumazet <dada1@...mosbay.com>
> Date: Sun, 10 May 2009 12:45:56 +0200
> 
>> Patch follows for RFC only (not Signed-of...), and based on net-next-2.6 
> 
> Thanks for the analysis.
> 
>> @@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
>>  	} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
>>  		wake_up_interruptible_poll(sk->sk_sleep,
>>  					   POLLIN | POLLRDNORM | POLLRDBAND);
>> -		if (!inet_csk_ack_scheduled(sk))
>> +		if (!inet_csk_ack_scheduled(sk)) {
>> +			unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
>> +
>> +			delay = min(inet_csk(sk)->icsk_ack.ato, delay);
>>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
>> -						  (3 * tcp_rto_min(sk)) / 4,
>> -						  TCP_RTO_MAX);
>> +						  delay, TCP_RTO_MAX);
>> +		}
>>  	}
>>  	return 1;
> 
> I think this code is trying to aggressively stretch the ACK when
> prequeueing, in order to make sure there is enough time to get
> the process onto the CPU and send a response, and thus piggyback
> the ACK.
> 
> If that turns out not to really matter, or to matter less than your
> problem, then we can make your change and I'm all for it.

This change gave me about a 15% increase in bandwidth in a multiflow
TCP benchmark. But this optimization only worked because tasks could be
woken up and send their answer within the same jiffy, 'rearming'
the xmit timer with exactly the same value...

(135,000 messages received/sent per second in my benchmark, with 60 flows)

mod_timer() has a special heuristic to avoid calling __mod_timer():

int mod_timer(struct timer_list *timer, unsigned long expires)
{
        /*
         * This is a common optimization triggered by the
         * networking code - if the timer is re-modified
         * to be the same thing then just return:
         */
        if (timer->expires == expires && timer_pending(timer))
                return 1;

        return __mod_timer(timer, expires, false);
}
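
To illustrate why this fast path only triggers when the wakeup and the
answer happen within the same jiffy, here is a quick userspace model
(illustration only; fake_timer, fake_mod_timer() and the constants are
stand-ins, not kernel code):

/*
 * Userspace model: fake_mod_timer() mimics the mod_timer() fast path
 * above.  Rearming with an identical expires value is a no-op; once
 * jiffies has advanced, the slow path runs every time.
 */
#include <stdio.h>

struct fake_timer {
	unsigned long expires;
	int pending;
};

static int slow_path_calls;	/* models calls to __mod_timer() */

static int fake_mod_timer(struct fake_timer *timer, unsigned long expires)
{
	if (timer->expires == expires && timer->pending)
		return 1;	/* fast path: nothing to do */
	timer->expires = expires;
	timer->pending = 1;
	slow_path_calls++;
	return 0;
}

int main(void)
{
	struct fake_timer dack = { 0, 0 };
	unsigned long jiffies = 1000;
	unsigned long delay = 150;	/* e.g. 3 * tcp_rto_min / 4 at HZ=1000 */
	int i;

	/* task answers within the same jiffy: only the first rearm is real */
	for (i = 0; i < 4; i++)
		fake_mod_timer(&dack, jiffies + delay);
	printf("same jiffy   : %d slow path call(s)\n", slow_path_calls);

	/* each request takes 3 msec: every rearm changes expires */
	for (i = 0; i < 4; i++) {
		jiffies += 3;
		fake_mod_timer(&dack, jiffies + delay);
	}
	printf("3 msec apart : %d slow path call(s)\n", slow_path_calls);
	return 0;
}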

With HZ=1000 and real applications (which need more than 1 msec to process a request),
I suppose this kind of optimization is unlikely to trigger,
so we might extend the mod_timer() heuristic to avoid changing timer->expires
when the new value is almost the same as the previous one, rather than only when it
is "exactly the same value":

int mod_timer_unexact(struct timer_list *timer, unsigned long expires, long maxerror)
{
	/*
	 * This is a common optimization triggered by the
	 * networking code - if the timer is re-modified
	 * to be about the same thing then just return:
	 */
	if (timer_pending(timer)) {
		long delta = expires - timer->expires;

		if (delta <= maxerror && delta >= -maxerror)
			return 1;
	}
	return __mod_timer(timer, expires, false);
}
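
A hypothetical call site could then look like this (the 4 jiffies
tolerance and this exact spot are only for illustration, not a proposed
patch):

	/*
	 * Hypothetical use in the delayed-ACK rearm path: accept up to
	 * 4 jiffies (4 msec at HZ=1000) of error so that rearms issued
	 * a few msec apart still hit the fast path instead of touching
	 * the timer wheel.
	 */
	mod_timer_unexact(&inet_csk(sk)->icsk_delack_timer,
			  jiffies + delay, 4);

The trade-off is that the delayed ACK may then fire up to maxerror
jiffies away from the requested expiry, which seems acceptable for a
timer whose value is already heuristic.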



But to be effective, prequeue needs a blocked task for each flow, while
modern daemons prefer to use poll/epoll, so prequeue ends up not being
used at all.

Another possibility would be to use a separate timer dedicated to
prequeue, instead of sharing the xmit timer.
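
Something like this very rough, untested sketch (prequeue_timer and
tcp_prequeue_timer() are made-up names, and locking is simplified
compared to the real delack handler) would keep the prequeue rearm away
from icsk_delack_timer entirely:

/*
 * Rough sketch, untested: a timer owned by the prequeue path only, so
 * rearming it can never conflict with the shared delack/xmit timers.
 */
static void tcp_prequeue_timer(unsigned long data)
{
	struct sock *sk = (struct sock *)data;

	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk) &&
	    skb_queue_len(&tcp_sk(sk)->ucopy.prequeue))
		tcp_send_ack(sk);	/* nobody piggybacked the ACK in time */
	bh_unlock_sock(sk);
	sock_put(sk);
}

/* init, e.g. from tcp_v4_init_sock(): */
	setup_timer(&tp->prequeue_timer, tcp_prequeue_timer,
		    (unsigned long)sk);

/* in tcp_prequeue(), instead of inet_csk_reset_xmit_timer(): */
	if (!timer_pending(&tp->prequeue_timer)) {
		sock_hold(sk);
		mod_timer(&tp->prequeue_timer, jiffies + delay);
	}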


