lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250617090846.cxnYD_fg@linutronix.de>
Date: Tue, 17 Jun 2025 11:08:46 +0200
From: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To: Marc Strämke <marc.straemke@...ropuls.de>
Cc: linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org
Subject: Re: Latency spikes on V6.15.1 Preempt RT and maybe related to intel?
 IGB

On 2025-06-14 10:52:36 [+0200], Marc Strämke wrote:
> Sebastian,
Hi,

> i tried that in the past (rtla top auto analysis). But i do not really
> understand the result:
> 
> rtla timerlat hit stop tracing
> ## CPU 1 hit stop tracing, analyzing it ##
>   IRQ handler delay:                                         0.00 us (0.00
> %)
>   IRQ latency:                                             709.25 us
>   Blocking thread:
>                               ip:3567
>     Blocking thread stack trace
>                 -> timerlat_irq
>                 -> __hrtimer_run_queues
>                 -> hrtimer_interrupt
>                 -> __sysvec_apic_timer_interrupt
>                 -> sysvec_apic_timer_interrupt
>                 -> asm_sysvec_apic_timer_interrupt
>                 -> igb_update_mc_addr_list
>                 -> igb_set_rx_mode
>                 -> __dev_change_flags
>                 -> netif_change_flags
>                 -> do_setlink.constprop.0
>                 -> rtnl_newlink
>                 -> rtnetlink_rcv_msg
>                 -> netlink_rcv_skb
>                 -> netlink_unicast
>                 -> netlink_sendmsg
>                 -> ____sys_sendmsg
>                 -> ___sys_sendmsg
>                 -> __sys_sendmsg
>                 -> do_syscall_64
>                 -> entry_SYSCALL_64_after_hwframe
> ------------------------------------------------------------------------
>      IRQ latency:                                          709.25 us (100%)
> 
> 
> I do not really understand where the IRQ/Preemption disabling is happening.
> What would the next thing be to do? Function (graph?) tracing on all the
> functions visible in the backtrace?

So somewhere between entry_SYSCALL_64_after_hwframe() and timerlat_irq()
something happened that led to the spike. Up to __dev_change_flags() it
is mostly regular and asm_sysvec_apic_timer_interrupt() is already the
interrupt. 
There is only one calling site for igb_update_mc_addr_list() and
everything before that looks harmless.
igb_update_mc_addr_list() has these loops where it writes to FIFO
register. This is followed by wrfl() which reads the status register.
This read forces the writes to go through the bus. What might happen is
that all the writes are "cached" and scheduled for a better time. That
read will force the writes to actually happen and the CPU will stall
until this done.
My guess would be one the writes cause the stall here.

> I tried to look at the event race output starting with the call to
> igb_set_rx_mode. I have attached the trace with all events and a function
> filter on igb on only the cpu executing ip.  I cannot understand what is
> happening between timestasmp 700.149995 and the IRQ disable event on
> 700.150795....
> 
>               ip-4931    [001] b...3   700.149994: igb_set_rx_mode <-__dev_change_flags
>               ip-4931    [001] b...3   700.149995: kmalloc: call_site=igb_set_rx_mode+0x4f8/0x5a0 ptr=00000000b0b4e5c8 bytes_req=12 bytes_alloc=16 gfp_flags=GFP_ATOMIC|__GFP_ZERO node=-1 accounted=false
>               ip-4931    [001] b...3   700.149995: igb_update_mc_addr_list <-igb_set_rx_mode
>               ip-4931    [001] Dn..3   700.150795: irq_disable: caller=irqentry_enter+0x2a/0x60 parent=0x0

The CPU is somewhere in igb_update_mc_addr_list(). Based on the
timestamps it was there for 800us before the irq_disable() occurred.
This did not happen within igb_update_mc_addr_list() but due to an
interrupt.

> Thanks for your help,
> 
> Marc

Sebastian

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ