linux-kernel - Re: [PATCH net-next v10 2/2] net: af_packet: Use hrtimer to do the retire operation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20250905040021.1893488-1-jackzxcui1989@163.com>
Date: Fri,  5 Sep 2025 12:00:21 +0800
From: Xin Zhao <jackzxcui1989@....com>
To: kerneljasonxing@...il.com,
	willemdebruijn.kernel@...il.com,
	edumazet@...gle.com,
	ferenc@...es.dev
Cc: davem@...emloft.net,
	kuba@...nel.org,
	pabeni@...hat.com,
	horms@...nel.org,
	netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v10 2/2] net: af_packet: Use hrtimer to do the retire operation

On Thu, Sep 4, 2025 at 11:26 +0800 Jason Xing <kerneljasonxing@...il.com> wrote:

> > In the description of [PATCH net-next v10 0/2] net: af_packet: optimize retire operation:
> >
> > Changes in v8:
> > - Delete delete_blk_timer field, as suggested by Willem de Bruijn,
> >   hrtimer_cancel will check and wait until the timer callback return and ensure
> >   enter enter callback again;
> 
> I see the reason now :)
> 
> Please know that the history changes through versions will finally be
> removed, only the official message that will be kept in the git. So
> this kind of change, I think, should be clarified officially since
> you're removing a structure member. Adding more descriptions will be
> helpful to readers in the future. Thank you.

I will add some more information to the commit message of this 2/2 PATCH.



> > Consider the following timing sequence:
> > timer   cpu0 (softirq context, hrtimer timeout)                cpu1 (process context)
> > 0       hrtimer_run_softirq
> > 1         __hrtimer_run_queues
> > 2           __run_hrtimer
> > 3             prb_retire_rx_blk_timer_expired
> > 4               spin_lock(&po->sk.sk_receive_queue.lock);
> > 5               _prb_refresh_rx_retire_blk_timer
> > 6                 hrtimer_forward_now
> > 7               spin_unlock(&po->sk.sk_receive_queue.lock)
> > 8             raw_spin_lock_irq(&cpu_base->lock);              tpacket_rcv
> > 9             enqueue_hrtimer                                    spin_lock(&sk->sk_receive_queue.lock);
> > 10                                                               packet_current_rx_frame
> > 11                                                                 __packet_lookup_frame_in_block
> > 12            finish enqueue_hrtimer                                 prb_open_block
> > 13                                                                     _prb_refresh_rx_retire_blk_timer
> > 14                                                                       hrtimer_is_queued(&pkc->retire_blk_timer) == true
> > 15                                                                       hrtimer_forward_now
> > 16                                                                         WARN_ON
> > On cpu0 in the timing sequence above, enqueue_hrtimer is not protected by sk_receive_queue.lock,
> > while the hrtimer_forward_now is not protected by raw_spin_lock_irq(&cpu_base->lock).
> >
> > In my previous email, I provided an explanation. As a supplement, I would
> > like to reiterate a paragraph from my earlier response to Willem.
> > The point is that when the hrtimer is in the enqueued state, you cannot
> 
> How about tring hrtimer_is_queued() beforehand?
> 
> IIUC, with this patch applied, we will lose the opportunity to refresh
> the timer when the lookup function (in the above path I mentioned)
> gets called compared to before. If the packet socket tries to look up
> a new block and it doesn't update its expiry time, the timer will soon
> wake up. Does it sound unreasonable?


I actually pointed out the issue with the timeout setting in a previous email:
https://lore.kernel.org/netdev/20250826030328.878001-1-jackzxcui1989@163.com/.

Regarding the method you mentioned, using hrtimer_is_queued to assist in judgment, I had
discussed this extensively with Willem in previous emails, and the conclusion was that
it is not feasible. The reason is that in our scenario, the hrtimer always returns
HRTIMER_RESTART, unlike the places you pointed out, such as tcp_pacing_check, where the
corresponding hrtimer callbacks all return HRTIMER_NORESTART. Since our scenario returns
HRTIMER_RESTART, this can lead to many troublesome issues. The fundamental reason is that
if HRTIMER_RESTART is returned, the hrtimer module will enqueue the hrtimer after the
callback returns, which leads to exiting the protection of our sk_receive_queue lock.

Returning to the functionality here, if we really want to update the hrtimer's timeout
outside of the timer callback, there are two key points to note:

1. Accurately knowing whether the current context is a timer callback or tpacket_rcv.
2. How to update the hrtimer's timeout in a non-timer callback scenario.

To start with the first point, it has already been explained in previous emails that
executing hrtimer_forward outside of a timer callback is not allowed. Therefore, we
must accurately determine whether we are in a timer callback; only in that context can
we use the hrtimer_forward function to update.
In the original code, since the same _prb_refresh_rx_retire_blk_timer function was called,
distinguishing between contexts required code restructuring. Now that this patch removes
the _prb_refresh_rx_retire_blk_timer function, achieving this accurate distinction is not
too difficult.
The key issue is the second point. If we are not inside the hrtimer's callback, we cannot
use hrtimer_forward to update the timeout. So what other interface can we use? You might
suggest using hrtimer_start, but fundamentally, hrtimer_start cannot be called if it has
already been started previously. Therefore, wouldn’t you need to add hrtimer_cancel to
confirm that the hrtimer has been canceled? Once hrtimer_cancel is added, there will also
be scenarios where it is restarted, which means we need to consider the concurrent
scenario when the socket exits and also calls hrtimer_cancel. This might require adding
logic for that concurrency scenario, and you might even need to reintroduce the
delete_blk_timer variable to indicate whether the packet_release operation has been
triggered so that the hrtimer does not restart in the tpacket_rcv scenario.

In fact, in a previous v7 version, I proposed a change that I personally thought was
quite good, which can be seen here:
https://lore.kernel.org/netdev/20250822132051.266787-1-jackzxcui1989@163.com/. However,
this change introduced an additional variable and more logic. Willem also pointed out
that the added complexity to avoid a non-problematic issue was unnecessary.

As mentioned in Changes in v8:
  The only special case is when prb_open_block is called from tpacket_rcv.
  That would set the timeout further into the future than the already queued
  timer. An earlier timeout is not problematic. No need to add complexity to
  avoid that.

It is not problematic, as Willem point it out in
https://lore.kernel.org/netdev/willemdebruijn.kernel.2d7599ee951fd@gmail.com/


In the end:

So, if you agree with the current changes in v10 and do not wish to add the timeout
setting under tpacket_rcv, that’s fine.
If you do not agree, then the only alternative I can think of is to use a combination
of hrtimer_cancel and hrtimer_start under prb_open_block, and we would also need to
reintroduce the delete_blk_timer variable to help determine whether the hrtimer was
canceled due to the packet_release behavior. If we really want to make this change,
it does add quite a bit of logic, and we would also need Willem's agreement.


Thanks
Xin Zhao