Message-ID: <4A057387.4080308@cosmosbay.com>
Date: Sat, 09 May 2009 14:13:59 +0200
From: Eric Dumazet <dada1@...mosbay.com>
To: David Miller <davem@...emloft.net>
CC: khc@...waw.pl, netdev@...r.kernel.org
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt
David Miller wrote:
> From: Eric Dumazet <dada1@...mosbay.com>
> Date: Fri, 08 May 2009 17:12:39 +0200
>
>> For example, we can avoid the dst_release() cache miss if this
>> is done in start_xmit(), and not later in TX completion while freeing skb.
>> I tried various patches in the past, but unfortunately the only safe
>> way to do this seems to be in the driver xmit itself, not in the core
>> network stack. This would need many patches, one for each driver.
>
> There might be a way around having to hit every driver.
>
> The case we can't muck with is when the route will be used.
> Devices which create this kind of situation can be marked with
> a flag bit in struct netdevice. If that flag bit isn't set,
> you can drop the DST in dev_hard_start_xmit().
Yes, this is a possibility, I'll think about it, thank you.
I'll have to recall which devices would need this flag (loopback for sure)...
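A minimal userspace sketch of the flag idea above. The flag name IFF_KEEPS_DST and the helper name are assumptions for illustration, not from this thread, and kernel types are reduced to stand-ins:

```c
#include <assert.h>
#include <stddef.h>

struct dst_entry  { int refcnt; };
struct sk_buff    { struct dst_entry *dst; };
struct net_device { unsigned int priv_flags; };

/* Assumed flag: the driver still uses skb->dst in (or after) xmit. */
#define IFF_KEEPS_DST 0x1

static void dst_release(struct dst_entry *dst)
{
    if (dst)
        dst->refcnt--;   /* stand-in for the real atomic release */
}

/* The early-release logic proposed for dev_hard_start_xmit():
 * if the device does not keep the route, drop the dst here, on the
 * hot xmit path, so TX completion has one less cache miss later. */
static void maybe_drop_dst(struct net_device *dev, struct sk_buff *skb)
{
    if (!(dev->priv_flags & IFF_KEEPS_DST)) {
        dst_release(skb->dst);
        skb->dst = NULL;  /* TX completion then has no dst to touch */
    }
}
```

A device like loopback would set the flag and keep its dst untouched; ordinary NICs would leave it clear and get the early release.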
>
>> [PATCH] net: reduce number of reference taken on sk_refcnt
>>
>> The current sk_wmem_alloc scheme takes one sk_refcnt reference for
>> each packet in flight. This hurts some workloads at TX completion
>> time, because sock_wfree() has at least three cache lines to touch
>> (one for sk_wmem_alloc, one for testing sk_flags, and one
>> to decrement sk_refcnt).
>>
>> We could instead use a single reference count, taken only when
>> sk_wmem_alloc changes from or to ZERO (i.e. one reference count for
>> any number of in-flight packets).
>>
>> Not every atomic_add() must be changed to atomic_add_return(): where
>> we know the current sk_wmem_alloc is already non-zero, a plain
>> atomic_add() is enough.
>>
>> This patch reduces by one the number of cache lines dirtied in
>> sock_wfree(), and the number of atomic operations in some workloads.
>>
>> Signed-off-by: Eric Dumazet <dada1@...mosbay.com>
>
> I like this idea. Let me know when you have some at least
> basic performance numbers and wish to submit this formally.
Sure, but I am focusing right now on the opposite situation
(many TCP flows but with small in/out traffic, where this patch
has no impact, since I only have (0 <--> !0) transitions).
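The scheme from the quoted patch can be sketched in plain C. Plain ints stand in for the kernel's atomic_t, and the names wmem_charge/wmem_uncharge are illustrative, not the kernel's:

```c
#include <assert.h>

struct sock_sketch {
    int sk_wmem_alloc;   /* bytes of in-flight TX data */
    int sk_refcnt;       /* references held on the socket */
};

/* Charge an skb's truesize to the socket on transmit.  The reference
 * is taken only on the 0 -> !0 transition, so any number of in-flight
 * packets share one sk_refcnt reference. */
static void wmem_charge(struct sock_sketch *sk, int truesize)
{
    int old = sk->sk_wmem_alloc;   /* atomic_add_return() in reality */
    sk->sk_wmem_alloc = old + truesize;
    if (old == 0)                  /* 0 -> !0 transition */
        sk->sk_refcnt++;
}

/* Uncharge at TX completion (the sock_wfree() side); the single
 * reference is dropped only on the !0 -> 0 transition. */
static void wmem_uncharge(struct sock_sketch *sk, int truesize)
{
    sk->sk_wmem_alloc -= truesize; /* atomic_sub_return() in reality */
    if (sk->sk_wmem_alloc == 0)    /* !0 -> 0 transition */
        sk->sk_refcnt--;
}
```

With many packets in flight, sock_wfree() usually sees a !0 -> !0 transition and never touches sk_refcnt, which is where the saved cache line and atomic operation come from.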
BTW, oprofile for this kind of workload gives a surprising result
(the timer stuff being *very* expensive).
The CPU doing the NAPI stuff has this profile
(columns: samples, cumulative samples, %, cumulative %, symbol):
88688 88688 9.7805 9.7805 lock_timer_base
72692 161380 8.0165 17.7970 bnx2_poll_work
66958 228338 7.3842 25.1812 mod_timer
47980 276318 5.2913 30.4724 __wake_up
43312 319630 4.7765 35.2489 task_rq_lock
43193 362823 4.7633 40.0122 __slab_alloc
36388 399211 4.0129 44.0251 __alloc_skb
30285 429496 3.3398 47.3650 skb_release_data
29236 458732 3.2242 50.5891 ip_rcv
29219 487951 3.2223 53.8114 resched_task
29094 517045 3.2085 57.0199 __inet_lookup_established
28695 545740 3.1645 60.1844 tcp_v4_rcv
27479 573219 3.0304 63.2148 sock_wfree
26722 599941 2.9469 66.1617 ip_route_input
21401 621342 2.3601 68.5218 select_task_rq_fair
19390 640732 2.1383 70.6601 __kfree_skb
17763 658495 1.9589 72.6190 sched_clock_cpu
17565 676060 1.9371 74.5561 try_to_wake_up
17366 693426 1.9151 76.4712 __enqueue_entity
16174 709600 1.7837 78.2549 update_curr
14323 723923 1.5795 79.8345 __kmalloc_track_caller
14003 737926 1.5443 81.3787 enqueue_task_fair
12456 750382 1.3737 82.7524 __tcp_prequeue
12212 762594 1.3467 84.0991 __wake_up_common
11437 774031 1.2613 85.3604 kmem_cache_alloc
10927 784958 1.2050 86.5654 place_entity
10535 795493 1.1618 87.7272 netif_receive_skb
9971 805464 1.0996 88.8268 ipt_do_table
8551 814015 0.9430 89.7698 internal_add_timer
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html