Message-ID: <51B1EB7D.7060801@telenet.dn.ua>
Date: Fri, 07 Jun 2013 17:17:33 +0300
From: "Vitaly V. Bursov" <vitalyb@...enet.dn.ua>
To: Daniel Borkmann <dborkman@...hat.com>
CC: Mike Galbraith <bitbucket@...ine.de>, linux-kernel@...r.kernel.org,
netdev <netdev@...r.kernel.org>
Subject: Re: Scaling problem with a lot of AF_PACKET sockets on different
interfaces
07.06.2013 16:05, Daniel Borkmann wrote:
> On 06/07/2013 02:41 PM, Mike Galbraith wrote:
>> (CC's net-fu dojo)
>>
>> On Fri, 2013-06-07 at 14:56 +0300, Vitaly V. Bursov wrote:
>>> Hello,
>>>
>>> I have a Linux router with a lot of interfaces (hundreds or
>>> thousands of VLANs) and an application that creates AF_PACKET
>>> socket per interface and bind()s sockets to interfaces.
>>>
>>> Each socket has attached BPF filter too.
>>>
>>> The problem is observed on linux-3.8.13, but as far as I can see
>>> from the source, the latest version behaves similarly.
>>>
>>> I noticed that the box has strange performance problems, with
>>> most of the CPU time spent in __netif_receive_skb:
>>> 86.15% [k] __netif_receive_skb
>>> 1.41% [k] _raw_spin_lock
>>> 1.09% [k] fib_table_lookup
>>> 0.99% [k] local_bh_enable_ip
>>>
>>> and this is the assembly with the "hot spot":
>>> │ shr $0x8,%r15w
>>> │ and $0xf,%r15d
>>> 0.00 │ shl $0x4,%r15
>>> │ add $0xffffffff8165ec80,%r15
>>> │ mov (%r15),%rax
>>> 0.09 │ mov %rax,0x28(%rsp)
>>> │ mov 0x28(%rsp),%rbp
>>> 0.01 │ sub $0x28,%rbp
>>> │ jmp 5c7
>>> 1.72 │5b0: mov 0x28(%rbp),%rax
>>> 0.05 │ mov 0x18(%rsp),%rbx
>>> 0.00 │ mov %rax,0x28(%rsp)
>>> 0.03 │ mov 0x28(%rsp),%rbp
>>> 5.67 │ sub $0x28,%rbp
>>> 1.71 │5c7: lea 0x28(%rbp),%rax
>>> 1.73 │ cmp %r15,%rax
>>> │ je 640
>>> 1.74 │ cmp %r14w,0x0(%rbp)
>>> │ jne 5b0
>>> 81.36 │ mov 0x8(%rbp),%rax
>>> 2.74 │ cmp %rax,%r8
>>> │ je 5eb
>>> 1.37 │ cmp 0x20(%rbx),%rax
>>> │ je 5eb
>>> 1.39 │ cmp %r13,%rax
>>> │ jne 5b0
>>> 0.04 │5eb: test %r12,%r12
>>> 0.04 │ je 6f4
>>> │ mov 0xc0(%rbx),%eax
>>> │ mov 0xc8(%rbx),%rdx
>>> │ testb $0x8,0x1(%rdx,%rax,1)
>>> │ jne 6d5
>>>
>>> This corresponds to:
>>>
>>> net/core/dev.c:
>>> 	type = skb->protocol;
>>> 	list_for_each_entry_rcu(ptype,
>>> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>>> 		if (ptype->type == type &&
>>> 		    (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
>>> 		     ptype->dev == orig_dev)) {
>>> 			if (pt_prev)
>>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>>> 			pt_prev = ptype;
>>> 		}
>>> 	}
>>>
>>> This works perfectly well until there are a lot of AF_PACKET sockets, since
>>> each socket adds an entry to the ptype list:
>>>
>>> # cat /proc/net/ptype
>>> Type Device Function
>>> 0800 eth2.1989 packet_rcv+0x0/0x400
>>> 0800 eth2.1987 packet_rcv+0x0/0x400
>>> 0800 eth2.1986 packet_rcv+0x0/0x400
>>> 0800 eth2.1990 packet_rcv+0x0/0x400
>>> 0800 eth2.1995 packet_rcv+0x0/0x400
>>> 0800 eth2.1997 packet_rcv+0x0/0x400
>>> .......
>>> 0800 eth2.1004 packet_rcv+0x0/0x400
>>> 0800 ip_rcv+0x0/0x310
>>> 0011 llc_rcv+0x0/0x3a0
>>> 0004 llc_rcv+0x0/0x3a0
>>> 0806 arp_rcv+0x0/0x150
>>>
>>> And this obviously results in a huge performance penalty.
>>>
>>> ptype_all, by the looks of it, should behave the same.
>>>
>>> Probably one way to fix this is to perform interface matching in the
>>> af_packet handler, but there could be other cases, other protocols.
>>>
>>> Ideas are welcome :)
>
> Probably, that depends on _your scenario_ and/or BPF filter, but would it be
> an alternative if you have only a few packet sockets (maybe one pinned to each
> cpu) and cluster/load-balance them together via packet fanout? (Where you
> bind the socket to ifindex 0, so that you get traffic from all devs...) That
> would at least avoid that "hot spot", and you could post-process the interface
> via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
> observed. ;-)
I wasn't aware of the ifindex 0 thing; it can help, thanks! Of course, if it
works for me (the application is a custom DHCP server), it will surely
increase the BPF overhead (I don't need to tap the traffic from all
interfaces). There are VLANs, bridges and bonds, so the server will likely
receive the same packets multiple times, and replies must be sent too...
but it should still be faster.
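
For reference, a minimal userspace sketch of the suggestion above: one
AF_PACKET socket per CPU, bound to ifindex 0, joined into a fanout group,
with the arrival interface recovered from sockaddr_ll. The helper names
(any_dev_addr, fanout_arg, open_fanout_member, recv_with_ifindex) are
hypothetical and not taken from any existing application; PACKET_FANOUT_CPU
is just one possible fanout mode. Opening the socket needs CAP_NET_RAW.

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_IP */
#include <linux/if_packet.h>  /* struct sockaddr_ll, PACKET_FANOUT* */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind() address with ifindex 0, i.e. "all devices". */
static struct sockaddr_ll any_dev_addr(unsigned short proto)
{
	struct sockaddr_ll sll;

	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(proto);
	sll.sll_ifindex = 0;
	return sll;
}

/* PACKET_FANOUT value: group id in the low 16 bits, mode in the high 16. */
static int fanout_arg(unsigned short group_id, unsigned short mode)
{
	return (int)group_id | ((int)mode << 16);
}

/* Open one member of a fanout group. Returns the fd, or -1 on error. */
static int open_fanout_member(unsigned short proto, unsigned short group_id)
{
	struct sockaddr_ll sll = any_dev_addr(proto);
	int arg = fanout_arg(group_id, PACKET_FANOUT_CPU);
	int fd = socket(AF_PACKET, SOCK_RAW, htons(proto));

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0 ||
	    setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Receive one frame and report which interface it arrived on. */
static ssize_t recv_with_ifindex(int fd, void *buf, size_t len,
				 int *ifindex_out)
{
	struct sockaddr_ll from;
	socklen_t fromlen = sizeof(from);
	ssize_t n = recvfrom(fd, buf, len, 0,
			     (struct sockaddr *)&from, &fromlen);

	if (n >= 0)
		*ifindex_out = from.sll_ifindex;
	return n;
}
```

With this layout the ptype list holds only a handful of entries no matter
how many VLANs exist; the per-interface decision moves into the application
(or into the BPF filter), which is the extra overhead mentioned above.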
I just checked isc-dhcpd-V3.1.3 running on multiple interfaces
(another system with 2.6.32):
$ cat /proc/net/ptype
Type Device Function
ALL eth0 packet_rcv_spkt+0x0/0x190
ALL eth0.10 packet_rcv_spkt+0x0/0x190
ALL eth0.11 packet_rcv_spkt+0x0/0x190
....
As I understand, it'll hit this code:
	list_for_each_entry_rcu(ptype, &ptype_all, list) {
		if (!ptype->dev || ptype->dev == skb->dev) {
			if (pt_prev)
				ret = deliver_skb(skb, pt_prev, orig_dev);
			pt_prev = ptype;
		}
	}
which scales the same.
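
To make the scaling concrete, here is a toy userspace model of that loop.
It is not kernel code: struct toy_ptype, toy_build and toy_scan are
hypothetical names, and the model keeps only the two fields the scan
compares. It counts how many list entries are visited for a single packet
when every handler is bound to a different interface.

```c
#include <stddef.h>

/* Toy stand-in for struct packet_type: only what the scan looks at. */
struct toy_ptype {
	unsigned short type;      /* protocol, e.g. 0x0800 */
	int dev;                  /* bound ifindex; 0 means any device */
	struct toy_ptype *next;
};

struct scan_result {
	size_t visited;           /* list entries examined */
	size_t matched;           /* handlers that would get the packet */
};

/* Chain n handlers in caller-provided storage, one per ifindex 1..n. */
static struct toy_ptype *toy_build(struct toy_ptype *storage, size_t n,
				   unsigned short type)
{
	for (size_t i = 0; i < n; i++) {
		storage[i].type = type;
		storage[i].dev = (int)(i + 1);
		storage[i].next = (i + 1 < n) ? &storage[i + 1] : NULL;
	}
	return n ? &storage[0] : NULL;
}

/* Walk the whole list for one packet, as the receive path does. */
static struct scan_result toy_scan(const struct toy_ptype *head,
				   unsigned short type, int skb_dev)
{
	struct scan_result r = { 0, 0 };

	for (const struct toy_ptype *p = head; p; p = p->next) {
		r.visited++;
		if (p->type == type && (p->dev == 0 || p->dev == skb_dev))
			r.matched++;
	}
	return r;
}
```

With 2000 handlers built by toy_build, a packet arriving on any one
interface visits all 2000 entries to find its single match, so the
per-packet cost grows linearly with the number of bound sockets.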
Thanks.