Message-ID: <201c7e4d-83c2-e251-bbaa-4c6fceadc5d1@itcare.pl>
Date: Thu, 17 Aug 2017 14:52:16 +0200
From: Paweł Staszewski <pstaszewski@...are.pl>
To: Julian Anastasov <ja@....bg>, Eric Dumazet <eric.dumazet@...il.com>
Cc: Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: Re: 100% CPU load when generating traffic to destination network that nexthop is not reachable
Hi
Wondering if someone has an idea how to optimise this?

From a real-life perspective it is really important to optimise this
behaviour, because imagine the situation - just a normal situation with
Linux acting as a router:
Let's say we have a Linux router with three customers and one upstream
connected, and we run into a DDoS:

1. DDoS traffic comes in from the upstream (and let's say our forwarding
router handles it at 50% CPU load) and the router forwards it to some
customer (remember, we have three customers - one of them is now
receiving a DDoS directed at some of his IPs).

2. Customer X is getting DDoSed - his router is at 100% CPU load
(because of low-end hardware, or because his uplink is 100% loaded).

3. Customer X's router stops responding, or some watchdog restarted it,
or customer X disabled the router for security reasons.

4. After the neighbour (ARP) entry expires on our router, the two other
customers start to have problems, because our router goes from 50% to
100% on all cores, and everybody now experiences packet drops and
bandwidth drops.
And this doesn't need to be a DDoS - it can be many servers connected
through a Linux router, where one server pushes a UDP stream (IPTV, or
some filesystem-syncing protocol) to another via the forwarding Linux
router. If the receiving server goes down and disappears from ARP, all
the other streams forwarded by that Linux router will suffer.
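
To make this concrete: once the nexthop never resolves, every forwarded
packet falls out of the neigh_event_send() fast path and into
__neigh_event_send(), which starts by taking the per-neighbour write
lock - that is the queued_write_lock_slowpath() all cores are spinning
on in the perf output below. A simplified sketch of the relevant code
(see also the context lines in Julian's patch quoted further down; not
a literal copy of the source):

static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
{
	unsigned long now = jiffies;

	if (neigh->used != now)
		neigh->used = now;
	/* For an unreachable nexthop the state never becomes CONNECTED,
	 * DELAY or PROBE, so this test is true for every single packet
	 * and every packet takes the slow path.
	 */
	if (!(neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE)))
		return __neigh_event_send(neigh, skb);
	return 0;
}

/* __neigh_event_send() then begins with write_lock_bh(&neigh->lock),
 * so all RX cores serialise on the same per-neighbour lock - which is
 * exactly what shows up as _raw_write_lock_bh ->
 * queued_write_lock_slowpath in the perf traces below.
 */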
On 2017-08-16 at 12:07, Paweł Staszewski wrote:
> Hi
>
>
> Patch applied - but no big change: from 0.7 Mpps per vlan to 1.2 Mpps per vlan.
>
> Previously (without the patch), 100% CPU load:
>
> bwm-ng v0.6.1 (probing every 0.500s), press 'h' for help
> input: /proc/net/dev type: rate
>   iface                       Rx                  Tx               Total
> ==============================================================================
>   vlan1002:              0.00 P/s            1.99 P/s            1.99 P/s
>   vlan1001:              0.00 P/s       717227.12 P/s       717227.12 P/s
>   enp175s0f0:      2713679.25 P/s            0.00 P/s      2713679.25 P/s
>   vlan1000:              0.00 P/s       716145.44 P/s       716145.44 P/s
> ------------------------------------------------------------------------------
>   total:           2713679.25 P/s      1433374.50 P/s      4147054.00 P/s
>
>
> With the patch (still 100% CPU load, but slightly better pps performance):
>
> bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> input: /proc/net/dev type: rate
>   iface                       Rx                  Tx               Total
> ==============================================================================
>   vlan1002:              0.00 P/s            1.00 P/s            1.00 P/s
>   vlan1001:              0.00 P/s      1202161.50 P/s      1202161.50 P/s
>   enp175s0f0:      3699864.50 P/s            0.00 P/s      3699864.50 P/s
>   vlan1000:              0.00 P/s      1196870.38 P/s      1196870.38 P/s
> ------------------------------------------------------------------------------
>   total:           3699864.50 P/s      2399033.00 P/s      6098897.50 P/s
>
>
> perf top attached below:
>
> 1.90% 0.00% ksoftirqd/39 [kernel.vmlinux] [k] run_ksoftirqd
> |
> --1.90%--run_ksoftirqd
> |
> --1.90%--__softirqentry_text_start
> |
> --1.90%--net_rx_action
> |
> --1.90%--mlx5e_napi_poll
> |
> --1.89%--mlx5e_poll_rx_cq
> |
> --1.88%--mlx5e_handle_rx_cqe
> |
> --1.85%--napi_gro_receive
> |
> --1.85%--netif_receive_skb_internal
> |
> --1.85%--__netif_receive_skb
> |
> --1.85%--__netif_receive_skb_core
> |
> --1.85%--ip_rcv
> |
> --1.85%--ip_rcv_finish
> |
> --1.83%--ip_forward
> |
> --1.82%--ip_forward_finish
> |
> --1.82%--ip_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.77%--neigh_event_send
> |
> --1.77%--__neigh_event_send
> |
> --1.74%--_raw_write_lock_bh
> |
> --1.74%--queued_write_lock
> queued_write_lock_slowpath
> |
> --1.70%--queued_spin_lock_slowpath
>
>
> 1.90%  0.00%  ksoftirqd/34  [kernel.vmlinux]  [k] __softirqentry_text_start
> |
> ---__softirqentry_text_start
> |
> --1.90%--net_rx_action
> |
> --1.90%--mlx5e_napi_poll
> |
> --1.89%--mlx5e_poll_rx_cq
> |
> --1.88%--mlx5e_handle_rx_cqe
> |
> --1.86%--napi_gro_receive
> |
> --1.85%--netif_receive_skb_internal
> |
> --1.85%--__netif_receive_skb
> |
> --1.85%--__netif_receive_skb_core
> |
> --1.85%--ip_rcv
> |
> --1.85%--ip_rcv_finish
> |
> --1.83%--ip_forward
> |
> --1.82%--ip_forward_finish
> |
> --1.82%--ip_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.77%--neigh_event_send
> |
> --1.77%--__neigh_event_send
> |
> --1.74%--_raw_write_lock_bh
> queued_write_lock
> queued_write_lock_slowpath
> |
> --1.71%--queued_spin_lock_slowpath
>
> 1.85%  0.00%  ksoftirqd/38  [kernel.vmlinux]  [k] ip_rcv_finish
> |
> --1.85%--ip_rcv_finish
> |
> --1.83%--ip_forward
> |
> --1.82%--ip_forward_finish
> |
> --1.82%--ip_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.77%--neigh_event_send
> |
> --1.77%--__neigh_event_send
> |
> --1.74%--_raw_write_lock_bh
> queued_write_lock
> queued_write_lock_slowpath
> |
> --1.71%--queued_spin_lock_slowpath
>
> 1.85% 0.00% ksoftirqd/22 [kernel.vmlinux] [k] ip_rcv
> |
> --1.85%--ip_rcv
> |
> --1.85%--ip_rcv_finish
> |
> --1.83%--ip_forward
> |
> --1.82%--ip_forward_finish
> |
> --1.82%--ip_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.77%--neigh_event_send
> |
> --1.77%--__neigh_event_send
> |
> --1.73%--_raw_write_lock_bh
> queued_write_lock
> queued_write_lock_slowpath
> |
> --1.70%--queued_spin_lock_slowpath
>
> 1.83%  0.00%  ksoftirqd/9   [kernel.vmlinux]  [k] ip_forward
> |
> --1.83%--ip_forward
> |
> --1.82%--ip_forward_finish
> |
> --1.82%--ip_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.77%--neigh_event_send
> |
> --1.77%--__neigh_event_send
> |
> --1.74%--_raw_write_lock_bh
> queued_write_lock
> queued_write_lock_slowpath
> |
> --1.70%--queued_spin_lock_slowpath
>
>
> 1.82% 0.00% ksoftirqd/35 [kernel.vmlinux] [k] ip_output
> |
> --1.82%--ip_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.77%--neigh_event_send
> |
> --1.77%--__neigh_event_send
> |
> --1.74%--_raw_write_lock_bh
> queued_write_lock
> queued_write_lock_slowpath
> |
> --1.71%--queued_spin_lock_slowpath
>
> 1.82%  0.00%  ksoftirqd/38  [kernel.vmlinux]  [k] ip_finish_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.77%--neigh_event_send
> |
> --1.77%--__neigh_event_send
> |
> --1.74%--_raw_write_lock_bh
> queued_write_lock
> queued_write_lock_slowpath
> |
> --1.71%--queued_spin_lock_slowpath
>
> 1.82%  0.00%  ksoftirqd/37  [kernel.vmlinux]  [k] ip_forward_finish
> |
> --1.82%--ip_forward_finish
> ip_output
> |
> --1.82%--ip_finish_output
> |
> --1.82%--ip_finish_output2
> |
> --1.79%--neigh_resolve_output
> |
> --1.76%--neigh_event_send
> __neigh_event_send
> |
> --1.73%--_raw_write_lock_bh
> queued_write_lock
> queued_write_lock_slowpath
> |
> --1.70%--queued_spin_lock_slowpath
>
>
> On 2017-08-16 at 09:42, Julian Anastasov wrote:
>> Hello,
>>
>> On Tue, 15 Aug 2017, Eric Dumazet wrote:
>>
>>> It must be possible to add a fast path without locks.
>>>
>>> (say if jiffies has not changed before last state change)
>> New day - new idea. Something like this? But it has a bug:
>> without checking neigh->dead under the lock we don't have the right
>> to access neigh->parms; it can be destroyed immediately by
>> neigh_release->neigh_destroy->neigh_parms_put->
>> neigh_parms_destroy->kfree. Not sure, maybe kfree_rcu can help
>> with this...
>>
>> diff --git a/include/net/neighbour.h b/include/net/neighbour.h
>> index 9816df2..f52763c 100644
>> --- a/include/net/neighbour.h
>> +++ b/include/net/neighbour.h
>> @@ -428,10 +428,10 @@ static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>>  {
>>  	unsigned long now = jiffies;
>>  
>> -	if (neigh->used != now)
>> -		neigh->used = now;
>>  	if (!(neigh->nud_state&(NUD_CONNECTED|NUD_DELAY|NUD_PROBE)))
>>  		return __neigh_event_send(neigh, skb);
>> +	if (neigh->used != now)
>> +		neigh->used = now;
>>  	return 0;
>>  }
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index 16a1a4c..52a8718 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -991,8 +991,18 @@ static void neigh_timer_handler(unsigned long arg)
>>  
>>  int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>>  {
>> -	int rc;
>>  	bool immediate_probe = false;
>> +	unsigned long now = jiffies;
>> +	int rc;
>> +
>> +	if (neigh->used != now) {
>> +		neigh->used = now;
>> +	} else if (neigh->nud_state == NUD_INCOMPLETE &&
>> +		   (!skb || neigh->arp_queue_len_bytes + skb->truesize >
>> +		    NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES))) {
>> +		kfree_skb(skb);
>> +		return 1;
>> +	}
>>  
>>  	write_lock_bh(&neigh->lock);
>> @@ -1005,7 +1015,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>>  	if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {
>>  		if (NEIGH_VAR(neigh->parms, MCAST_PROBES) +
>>  		    NEIGH_VAR(neigh->parms, APP_PROBES)) {
>> -			unsigned long next, now = jiffies;
>> +			unsigned long next;
>>  			atomic_set(&neigh->probes,
>>  				   NEIGH_VAR(neigh->parms, UCAST_PROBES));
>>
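
Regarding the neigh->parms lifetime problem mentioned above: if the
final kfree() of neigh_parms were deferred by one RCU grace period, the
lockless fast-path read of NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES)
could probably be done safely from the forwarding path, which (if I
read ip_finish_output2() correctly) already runs under
rcu_read_lock_bh(). A rough, untested sketch of that idea only -
assuming the rcu_head already present in struct neigh_parms can be
reused here, since neigh_parms_release()'s call_rcu has completed by
the time the refcount can drop to zero:

static void neigh_parms_destroy(struct neigh_parms *parms)
{
	/* Untested sketch: defer the final free by an RCU grace period so
	 * a lockless reader cannot see the parms memory vanish under it.
	 */
	kfree_rcu(parms, rcu_head);
}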
>> Regards
>>
>> --
>> Julian Anastasov <ja@....bg>
>>
>
>