netdev - Re: Urgent Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport endpoint is not connected

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <E95FDB1D-488B-4780-96A1-A2D5C9616A7A@gmail.com>
Date:   Tue, 14 Sep 2021 09:16:55 +0300
From:   Martin Zaharinov <micron10@...il.com>
To:     Guillaume Nault <gnault@...hat.com>
Cc:     Pali Rohár <pali@...nel.org>,
        Greg KH <gregkh@...uxfoundation.org>,
        netdev <netdev@...r.kernel.org>,
        Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: Urgent  Bug report: PPPoE ioctl(PPPIOCCONNECT): Transport
 endpoint is not connected

Hi Nault

See this stats :

Linux 5.14.2 (testb)   09/14/21        _x86_64_        (12 CPU)

11:33:44     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:33:45     all    1.75    0.00   18.85    0.00    0.00    5.00    0.00    0.00    0.00   74.40
11:33:46     all    1.74    0.00   17.88    0.00    0.00    4.72    0.00    0.00    0.00   75.66
11:33:47     all    2.23    0.00   17.62    0.00    0.00    5.05    0.00    0.00    0.00   75.10
11:33:48     all    1.82    0.00   13.64    0.00    0.00    5.70    0.00    0.00    0.00   78.84
11:33:49     all    1.50    0.00   13.46    0.00    0.00    5.15    0.00    0.00    0.00   79.90
11:33:50     all    3.06    0.00   13.96    0.00    0.00    4.79    0.00    0.00    0.00   78.20
11:33:51     all    1.40    0.00   16.53    0.00    0.00    5.21    0.00    0.00    0.00   76.86
11:33:52     all    4.43    0.00   19.44    0.00    0.00    6.56    0.00    0.00    0.00   69.57
11:33:53     all    1.51    0.00   16.40    0.00    0.00    4.77    0.00    0.00    0.00   77.32
11:33:54     all    1.51    0.00   16.55    0.00    0.00    4.71    0.00    0.00    0.00   77.23
11:33:55     all    1.00    0.00   13.21    0.00    0.00    5.90    0.00    0.00    0.00   79.90
Average:     all    2.00    0.00   16.14    0.00    0.00    5.23    0.00    0.00    0.00   76.63


  PerfTop:   28046 irqs/sec  kernel:96.3%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    23.37%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
    17.76%  [kernel]                 [k] mutex_spin_on_owner
     9.47%  [pppoe]                  [k] pppoe_rcv
     7.71%  [kernel]                 [k] osq_lock
     2.77%  [nf_nat]                 [k] inet_cmp
     2.59%  [nf_nat]                 [k] device_cmp
     2.55%  [kernel]                 [k] __local_bh_enable_ip
     2.04%  [kernel]                 [k] _raw_spin_lock
     1.23%  [kernel]                 [k] __cond_resched
     1.16%  [kernel]                 [k] rcu_all_qs
     1.13%  libfrr.so.0.0.0          [.] 0x00000000000ce970
     0.79%  [nf_conntrack]           [k] nf_conntrack_lock
     0.75%  libfrr.so.0.0.0          [.] 0x00000000000ce94e
     0.53%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
     0.46%  [kernel]                 [k] fib_table_lookup
     0.46%  [ip_tables]              [k] ipt_do_table
     0.45%  [ixgbe]                  [k] ixgbe_clean_rx_irq
     0.37%  [kernel]                 [k] __dev_queue_xmit
     0.34%  [nf_conntrack]           [k] __nf_conntrack_find_get.isra.0
     0.33%  [ixgbe]                  [k] ixgbe_clean_tx_irq
     0.30%  [kernel]                 [k] menu_select
     0.25%  [kernel]                 [k] vlan_do_receive
     0.21%  [kernel]                 [k] ip_finish_output2
     0.21%  [ixgbe]                  [k] ixgbe_poll
     0.20%  [kernel]                 [k] _raw_spin_lock_irqsave
     0.19%  [kernel]                 [k] get_rps_cpu
     0.19%  libc.so.6                [.] 0x0000000000186afa
     0.19%  [kernel]                 [k] queued_read_lock_slowpath
     0.19%  [kernel]                 [k] do_poll.constprop.0
     0.19%  [kernel]                 [k] cpuidle_enter_state
     0.18%  [kernel]                 [k] dev_hard_start_xmit
     0.18%  [kernel]                 [k] ___slab_alloc.constprop.0
     0.17%  zebra                    [.] 0x00000000000b9271
     0.16%  [kernel]                 [k] csum_partial_copy_generic
     0.16%  zebra                    [.] 0x00000000000b91f1
     0.16%  [kernel]                 [k] page_frag_free
     0.16%  [kernel]                 [k] kmem_cache_alloc
     0.15%  [kernel]                 [k] __skb_flow_dissect
     0.15%  [kernel]                 [k] sched_clock
     0.15%  libc.so.6                [.] 0x00000000000965a2
     0.15%  [kernel]                 [k] kmem_cache_free_bulk.part.0
     0.15%  [pppoe]                  [k] pppoe_flush_dev
     0.15%  [ixgbe]                  [k] ixgbe_tx_map
     0.14%  [kernel]                 [k] _raw_spin_lock_bh
     0.14%  [kernel]                 [k] fib_table_flush
     0.14%  [kernel]                 [k] native_irq_return_iret
     0.14%  [kernel]                 [k] __dev_xmit_skb
     0.13%  [kernel]                 [k] nf_hook_slow
     0.13%  [kernel]                 [k] fib_lookup_good_nhc
     0.12%  [kernel]                 [k] __fget_files
     0.12%  [kernel]                 [k] process_backlog
     0.12%  [xt_dtvqos]              [k] 0x00000000000008d1
     0.12%  [kernel]                 [k] __list_del_entry_valid
     0.12%  [kernel]                 [k] skb_release_data
     0.12%  [kernel]                 [k] ip_route_input_slow
     0.11%  [kernel]                 [k] netif_skb_features
     0.11%  [kernel]                 [k] sock_poll
     0.11%  [kernel]                 [k] __schedule
     0.11%  [kernel]                 [k] __softirqentry_text_start


And on time of problem when try to write : ip a 
to list interface wait 15-20 sec i finaly have options to simulate but users is angry when down internet.

In case need to know why system is overloaded when deconfig ppp interface.


Best regards,
Martin




> On 11 Sep 2021, at 9:26, Martin Zaharinov <micron10@...il.com> wrote:
> 
> Hi Guillaume
> 
> Main problem is overload of service because have many finishing ppp (customer) last two day down from 40-50 to 100-200 users and make problem when is happen if try to type : ip a wait 10-20 sec to start list interface .
> But how to find where is a problem any locking or other.
> And is there options to make fast remove ppp interface from kernel to reduce this load.
> 
> 
> Martin
> 
>> On 7 Sep 2021, at 9:42, Martin Zaharinov <micron10@...il.com> wrote:
>> 
>> Perf top from text
>> 
>> 
>> PerfTop:   28391 irqs/sec  kernel:98.0%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 12 CPUs)
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 
>>   17.01%  [nf_conntrack]           [k] nf_ct_iterate_cleanup
>>    9.73%  [kernel]                 [k] mutex_spin_on_owner
>>    9.07%  [pppoe]                  [k] pppoe_rcv
>>    2.77%  [nf_nat]                 [k] device_cmp
>>    1.66%  [kernel]                 [k] osq_lock
>>    1.65%  [kernel]                 [k] _raw_spin_lock
>>    1.61%  [kernel]                 [k] __local_bh_enable_ip
>>    1.35%  [nf_nat]                 [k] inet_cmp
>>    1.30%  [kernel]                 [k] __netif_receive_skb_core.constprop.0
>>    1.16%  [kernel]                 [k] menu_select
>>    0.99%  [kernel]                 [k] cpuidle_enter_state
>>    0.96%  [ixgbe]                  [k] ixgbe_clean_rx_irq
>>    0.86%  [kernel]                 [k] __dev_queue_xmit
>>    0.70%  [kernel]                 [k] __cond_resched
>>    0.69%  [sch_cake]               [k] cake_dequeue
>>    0.67%  [nf_tables]              [k] nft_do_chain
>>    0.63%  [kernel]                 [k] rcu_all_qs
>>    0.61%  [kernel]                 [k] fib_table_lookup
>>    0.57%  [kernel]                 [k] __schedule
>>    0.57%  [kernel]                 [k] skb_release_data
>>    0.54%  [kernel]                 [k] sched_clock
>>    0.54%  [kernel]                 [k] __copy_skb_header
>>    0.53%  [kernel]                 [k] dev_queue_xmit_nit
>>    0.53%  [kernel]                 [k] _raw_spin_lock_irqsave
>>    0.50%  [kernel]                 [k] kmem_cache_free
>>    0.48%  libfrr.so.0.0.0          [.] 0x00000000000ce970
>>    0.47%  [ixgbe]                  [k] ixgbe_clean_tx_irq
>>    0.45%  [kernel]                 [k] timerqueue_add
>>    0.45%  [kernel]                 [k] lapic_next_deadline
>>    0.45%  [kernel]                 [k] csum_partial_copy_generic
>>    0.44%  [nf_flow_table]          [k] nf_flow_offload_ip_hook
>>    0.44%  [kernel]                 [k] kmem_cache_alloc
>>    0.44%  [nf_conntrack]           [k] nf_conntrack_lock
>> 
>>> On 7 Sep 2021, at 9:16, Martin Zaharinov <micron10@...il.com> wrote:
>>> 
>>> Hi 
>>> Sorry for delay but not easy to catch moment .
>>> 
>>> 
>>> See this is mpstatl 1 :
>>> 
>>> Linux 5.14.1 (demobng) 	09/07/21 	_x86_64_	(12 CPU)
>>> 
>>> 11:12:16     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>> 11:12:17     all    0.17    0.00    6.66    0.00    0.00    4.13    0.00    0.00    0.00   89.05
>>> 11:12:18     all    0.25    0.00    8.36    0.00    0.00    4.88    0.00    0.00    0.00   86.51
>>> 11:12:19     all    0.26    0.00    9.62    0.00    0.00    3.91    0.00    0.00    0.00   86.21
>>> 11:12:20     all    0.85    0.00    6.00    0.00    0.00    4.31    0.00    0.00    0.00   88.84
>>> 11:12:21     all    0.08    0.00    4.45    0.00    0.00    4.79    0.00    0.00    0.00   90.67
>>> 11:12:22     all    0.17    0.00    9.50    0.00    0.00    4.58    0.00    0.00    0.00   85.75
>>> 11:12:23     all    0.00    0.00    6.92    0.00    0.00    2.48    0.00    0.00    0.00   90.61
>>> 11:12:24     all    0.17    0.00    5.45    0.00    0.00    4.27    0.00    0.00    0.00   90.11
>>> 11:12:25     all    0.25    0.00    5.38    0.00    0.00    4.79    0.00    0.00    0.00   89.58
>>> 11:12:26     all    0.60    0.00    1.45    0.00    0.00    2.65    0.00    0.00    0.00   95.30
>>> 11:12:27     all    0.42    0.00    6.91    0.00    0.00    4.47    0.00    0.00    0.00   88.20
>>> 11:12:28     all    0.00    0.00    6.75    0.00    0.00    4.18    0.00    0.00    0.00   89.07
>>> 11:12:29     all    0.17    0.00    3.52    0.00    0.00    5.11    0.00    0.00    0.00   91.20
>>> 11:12:30     all    1.45    0.00   10.14    0.00    0.00    3.49    0.00    0.00    0.00   84.92
>>> 11:12:31     all    0.09    0.00    5.11    0.00    0.00    4.77    0.00    0.00    0.00   90.03
>>> 11:12:32     all    0.25    0.00    3.11    0.00    0.00    4.46    0.00    0.00    0.00   92.17
>>> Average:     all    0.32    0.00    6.21    0.00    0.00    4.21    0.00    0.00    0.00   89.26
>>> 
>>> 
>>> I attache and one screenshot from perf top (Screenshot is send on preview mail)
>>> 
>>> And I see in lsmod 
>>> 
>>> pppoe                  20480  8198
>>> pppox                  16384  1 pppoe
>>> ppp_generic            45056  16364 pppox,pppoe
>>> slhc                   16384  1 ppp_generic
>>> 
>>> To slow remove pppoe session .
>>> 
>>> And from log : 
>>> 
>>> [2021-09-07 11:01:11.129] vlan3020: ebdd1c5d8b5900f6: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:01:53.621] vlan643: ebdd1c5d8b59014e: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:00.359] vlan1616: ebdd1c5d8b590195: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:05.859] vlan3020: ebdd1c5d8b5900d8: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:08.258] vlan3005: ebdd1c5d8b590190: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:13.820] vlan643: ebdd1c5d8b590152: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:15.839] vlan727: ebdd1c5d8b590144: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> [2021-09-07 11:02:20.139] vlan1693: ebdd1c5d8b59019f: ioctl(PPPIOCCONNECT): Transport endpoint is not connected
>>> 
>>>> On 11 Aug 2021, at 19:48, Guillaume Nault <gnault@...hat.com> wrote:
>>>> 
>>>> On Wed, Aug 11, 2021 at 02:10:32PM +0300, Martin Zaharinov wrote:
>>>>> And one more that see.
>>>>> 
>>>>> Problem is come when accel start finishing sessions,
>>>>> Now in server have 2k users and restart on one of vlans 3 Olt with 400 users and affect other vlans ,
>>>>> And problem is start when start destroying dead sessions from vlan with 3 Olt and this affect all other vlans.
>>>>> May be kernel destroy old session slow and entrained other users by locking other sessions.
>>>>> is there a way to speed up the closing of stopped/dead sessions.
>>>> 
>>>> What are the CPU stats when that happen? Is it users space or kernel
>>>> space that keeps it busy?
>>>> 
>>>> One easy way to check is to run "mpstat 1" for a few seconds when the
>>>> problem occurs.
>>>> 
>>> 
>> 
>