Message-Id: <A16DCD3E-43AA-4D50-97FC-EBB776481840@gmail.com>
Date: Sat, 11 Sep 2021 09:26:45 +0300
From: Martin Zaharinov <micron10@...il.com>
To: Guillaume Nault <gnault@...hat.com>
Cc: Pali Rohár <pali@...nel.org>,
Greg KH <gregkh@...uxfoundation.org>,
netdev <netdev@...r.kernel.org>,
Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport
endpoint is not connected
Hi Guillaume
The main problem is that the service gets overloaded when many PPP (customer) sessions finish at once. In the last two days, anywhere from 40-50 up to 200 users have gone down at a time, and while this is happening, typing "ip a" takes 10-20 seconds before it even starts listing interfaces.
But how can I find where the problem is, whether it is locking or something else?
And is there an option to make the kernel remove PPP interfaces faster, to reduce this load?
Martin
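
Context for the quoted profile below: the top entries, nf_ct_iterate_cleanup from nf_conntrack and device_cmp from nf_nat, are consistent with the masquerade code walking the entire conntrack table from a netdevice notifier each time an interface goes down. With N conntrack entries and M PPP interfaces being torn down, that is roughly O(N * M) work serialized behind a mutex, which would also explain mutex_spin_on_owner near the top. Here is a simplified sketch of that pattern, loosely following net/netfilter/nf_nat_masquerade.c (illustrative only, not the exact upstream source):

#include <linux/netdevice.h>
#include <linux/notifier.h>
#include <net/netfilter/nf_conntrack.h>
#include <net/netfilter/nf_nat.h>

/* Match only conntrack entries that were masqueraded via the dying device. */
static int device_cmp(struct nf_conn *ct, void *ifindex)
{
	const struct nf_conn_nat *nat = nfct_nat(ct);

	if (!nat)
		return 0;
	return nat->masq_index == (int)(long)ifindex;
}

static int masq_device_event(struct notifier_block *this,
			     unsigned long event, void *ptr)
{
	const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct net *net = dev_net(dev);

	if (event == NETDEV_DOWN)
		/* Scans the whole conntrack table, once per dying interface. */
		nf_ct_iterate_cleanup_net(net, device_cmp,
					  (void *)(long)dev->ifindex, 0, 0);

	return NOTIFY_DONE;
}

If that is the bottleneck, the cost scales with the conntrack table size times the number of interfaces removed, which matches "ip a" stalling only while many sessions are being torn down.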
> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@...il.com> wrote:
>
> perf top output, pasted as text:
>
>
> PerfTop: 28391 irqs/sec kernel:98.0% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 17.01% [nf_conntrack] [k] nf_ct_iterate_cleanup
> 9.73% [kernel] [k] mutex_spin_on_owner
> 9.07% [pppoe] [k] pppoe_rcv
> 2.77% [nf_nat] [k] device_cmp
> 1.66% [kernel] [k] osq_lock
> 1.65% [kernel] [k] _raw_spin_lock
> 1.61% [kernel] [k] __local_bh_enable_ip
> 1.35% [nf_nat] [k] inet_cmp
> 1.30% [kernel] [k] __netif_receive_skb_core.constprop.0
> 1.16% [kernel] [k] menu_select
> 0.99% [kernel] [k] cpuidle_enter_state
> 0.96% [ixgbe] [k] ixgbe_clean_rx_irq
> 0.86% [kernel] [k] __dev_queue_xmit
> 0.70% [kernel] [k] __cond_resched
> 0.69% [sch_cake] [k] cake_dequeue
> 0.67% [nf_tables] [k] nft_do_chain
> 0.63% [kernel] [k] rcu_all_qs
> 0.61% [kernel] [k] fib_table_lookup
> 0.57% [kernel] [k] __schedule
> 0.57% [kernel] [k] skb_release_data
> 0.54% [kernel] [k] sched_clock
> 0.54% [kernel] [k] __copy_skb_header
> 0.53% [kernel] [k] dev_queue_xmit_nit
> 0.53% [kernel] [k] _raw_spin_lock_irqsave
> 0.50% [kernel] [k] kmem_cache_free
> 0.48% libfrr.so.0.0.0 [.] 0x00000000000ce970
> 0.47% [ixgbe] [k] ixgbe_clean_tx_irq
> 0.45% [kernel] [k] timerqueue_add
> 0.45% [kernel] [k] lapic_next_deadline
> 0.45% [kernel] [k] csum_partial_copy_generic
> 0.44% [nf_flow_table] [k] nf_flow_offload_ip_hook
> 0.44% [kernel] [k] kmem_cache_alloc
> 0.44% [nf_conntrack] [k] nf_conntrack_lock
>
>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@...il.com> wrote:
>>
>> Hi
>> Sorry for the delay, but it was not easy to catch the right moment.
>>
>>
>> Here is the output of "mpstat 1":
>>
>> Linux 5.14.1 (demobng) 09/07/21 _x86_64_ (12 CPU)
>>
>> 11:12:16 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 11:12:17 all 0.17 0.00 6.66 0.00 0.00 4.13 0.00 0.00 0.00 89.05
>> 11:12:18 all 0.25 0.00 8.36 0.00 0.00 4.88 0.00 0.00 0.00 86.51
>> 11:12:19 all 0.26 0.00 9.62 0.00 0.00 3.91 0.00 0.00 0.00 86.21
>> 11:12:20 all 0.85 0.00 6.00 0.00 0.00 4.31 0.00 0.00 0.00 88.84
>> 11:12:21 all 0.08 0.00 4.45 0.00 0.00 4.79 0.00 0.00 0.00 90.67
>> 11:12:22 all 0.17 0.00 9.50 0.00 0.00 4.58 0.00 0.00 0.00 85.75
>> 11:12:23 all 0.00 0.00 6.92 0.00 0.00 2.48 0.00 0.00 0.00 90.61
>> 11:12:24 all 0.17 0.00 5.45 0.00 0.00 4.27 0.00 0.00 0.00 90.11
>> 11:12:25 all 0.25 0.00 5.38 0.00 0.00 4.79 0.00 0.00 0.00 89.58
>> 11:12:26 all 0.60 0.00 1.45 0.00 0.00 2.65 0.00 0.00 0.00 95.30
>> 11:12:27 all 0.42 0.00 6.91 0.00 0.00 4.47 0.00 0.00 0.00 88.20
>> 11:12:28 all 0.00 0.00 6.75 0.00 0.00 4.18 0.00 0.00 0.00 89.07
>> 11:12:29 all 0.17 0.00 3.52 0.00 0.00 5.11 0.00 0.00 0.00 91.20
>> 11:12:30 all 1.45 0.00 10.14 0.00 0.00 3.49 0.00 0.00 0.00 84.92
>> 11:12:31 all 0.09 0.00 5.11 0.00 0.00 4.77 0.00 0.00 0.00 90.03
>> 11:12:32 all 0.25 0.00 3.11 0.00 0.00 4.46 0.00 0.00 0.00 92.17
>> Average: all 0.32 0.00 6.21 0.00 0.00 4.21 0.00 0.00 0.00 89.26
>>
>>
>> I also attached a screenshot from perf top (the screenshot was sent in the previous mail).
>>
>> And I see in lsmod:
>>
>> pppoe 20480 8198
>> pppox 16384 1 pppoe
>> ppp_generic 45056 16364 pppox,pppoe
>> slhc 16384 1 ppp_generic
>>
>> The PPPoE sessions are removed too slowly.
>>
>> And from the log:
>>
>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>
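
A note on the PPPIOCCONNECT errors quoted above: ppp_generic refuses to attach a channel whose underlying session has already been unregistered, so when a PPPoE session dies (PADT, interface down) before the daemon's PPPIOCCONNECT reaches the kernel, the ioctl fails with exactly this message. An abridged, hedged sketch of that check, loosely following ppp_connect_channel() in drivers/net/ppp/ppp_generic.c (illustrative only, not the exact upstream source):

/*
 * When a PPPoE session is destroyed, its channel is unregistered and
 * pch->chan becomes NULL.  A PPPIOCCONNECT racing with that teardown
 * then fails with -ENOTCONN, which userspace prints as
 * "Transport endpoint is not connected".
 */
static int ppp_connect_channel(struct channel *pch, int unit)
{
	int ret = -ENXIO;

	/* ... look up the ppp unit 'unit' and take the needed locks ... */

	spin_lock_bh(&pch->downl);
	if (!pch->chan) {
		/* Don't connect unregistered channels */
		spin_unlock_bh(&pch->downl);
		ret = -ENOTCONN;
		goto out;
	}
	/* ... link the channel into the unit's channel list ... */
	spin_unlock_bh(&pch->downl);
	ret = 0;
 out:
	return ret;
}

Under this reading, the errors are a symptom of the slow teardown (sessions dying faster than the daemon can connect them) rather than an independent bug.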
>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@...hat.com> wrote:
>>>
>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>>> And one more thing I see.
>>>>
>>>> The problem comes when accel-ppp starts finishing sessions.
>>>> The server currently has 2k users; when 3 OLTs with 400 users restart on one of the VLANs, the other VLANs are affected too.
>>>> The problem starts when the dead sessions from the VLAN with the 3 OLTs begin to be destroyed, and this drags down all the other VLANs.
>>>> Maybe the kernel destroys old sessions slowly and holds up other users by locking their sessions.
>>>> Is there a way to speed up the closing of stopped/dead sessions?
>>>
>>> What are the CPU stats when that happens? Is it user space or kernel
>>> space that keeps it busy?
>>>
>>> One easy way to check is to run "mpstat 1" for a few seconds when the
>>> problem occurs.
>>>
>>
>