Message-ID: <willemdebruijn.kernel.6a80e2e45f24@gmail.com>
Date: Fri, 05 Sep 2025 12:16:17 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Jason Xing <kerneljasonxing@...il.com>,
Xin Zhao <jackzxcui1989@....com>
Cc: willemdebruijn.kernel@...il.com,
edumazet@...gle.com,
ferenc@...es.dev,
davem@...emloft.net,
kuba@...nel.org,
pabeni@...hat.com,
horms@...nel.org,
netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next v10 2/2] net: af_packet: Use hrtimer to do the
retire operation
Jason Xing wrote:
> On Fri, Sep 5, 2025 at 2:03 PM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Fri, Sep 5, 2025 at 12:01 PM Xin Zhao <jackzxcui1989@....com> wrote:
> > >
> > > On Thu, Sep 4, 2025 at 11:26 +0800 Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > > > In the description of [PATCH net-next v10 0/2] net: af_packet: optimize retire operation:
> > > > >
> > > > > Changes in v8:
> > > > > - Delete the delete_blk_timer field, as suggested by Willem de Bruijn;
> > > > > hrtimer_cancel will check and wait until the timer callback returns, and
> > > > > ensures the callback is not entered again;
> > > >
> > > > I see the reason now :)
> > > >
> > > > Please note that the per-version change history will eventually be
> > > > dropped; only the official commit message is kept in git. So a change
> > > > like this, I think, should be documented officially, since you're
> > > > removing a structure member. Adding more description will be helpful
> > > > to future readers. Thank you.
> > >
> > > I will add some more information to the commit message of this 2/2 PATCH.
> > >
> > >
> > >
> > > > > Consider the following timing sequence:
> > > > >
> > > > > time  cpu0 (softirq context, hrtimer timeout)          cpu1 (process context)
> > > > >  0    hrtimer_run_softirq
> > > > >  1    __hrtimer_run_queues
> > > > >  2    __run_hrtimer
> > > > >  3    prb_retire_rx_blk_timer_expired
> > > > >  4    spin_lock(&po->sk.sk_receive_queue.lock)
> > > > >  5    _prb_refresh_rx_retire_blk_timer
> > > > >  6    hrtimer_forward_now
> > > > >  7    spin_unlock(&po->sk.sk_receive_queue.lock)
> > > > >  8    raw_spin_lock_irq(&cpu_base->lock)               tpacket_rcv
> > > > >  9    enqueue_hrtimer                                  spin_lock(&sk->sk_receive_queue.lock)
> > > > > 10                                                     packet_current_rx_frame
> > > > > 11                                                     __packet_lookup_frame_in_block
> > > > > 12    finish enqueue_hrtimer                           prb_open_block
> > > > > 13                                                     _prb_refresh_rx_retire_blk_timer
> > > > > 14                                                     hrtimer_is_queued(&pkc->retire_blk_timer) == true
> > > > > 15                                                     hrtimer_forward_now
> > > > > 16                                                     WARN_ON
> > > > >
> > > > > In the sequence above, enqueue_hrtimer on cpu0 is not protected by
> > > > > sk_receive_queue.lock, while hrtimer_forward_now on cpu1 (step 15) is not
> > > > > protected by raw_spin_lock_irq(&cpu_base->lock).
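The WARN_ON at step 16 is the guard inside hrtimer_forward(), which refuses to forward a timer that is still enqueued. As a toy userspace sketch of that check, with simplified names and a -1 error return standing in for the kernel's WARN_ON-and-return-0 behavior (not the real implementation):

```c
#include <stdbool.h>

/* Toy userspace model of the guard in hrtimer_forward(): forwarding the
 * expiry is only legal while the timer is NOT enqueued, i.e. from inside
 * its own callback, where the hrtimer core has already dequeued it. */

struct toy_timer {
    bool enqueued;      /* stands in for HRTIMER_STATE_ENQUEUED */
    long long expires;  /* stands in for timer->node.expires */
};

/* Returns the number of intervals skipped, or -1 on the WARN_ON path. */
static long toy_hrtimer_forward(struct toy_timer *t, long long now,
                                long long interval)
{
    long overruns = 0;

    if (t->enqueued)    /* WARN_ON(enqueued) in the real hrtimer_forward */
        return -1;

    /* advance the expiry past 'now' in whole intervals */
    while (t->expires <= now) {
        t->expires += interval;
        overruns++;
    }
    return overruns;
}
```

In the race above, cpu1 reaches this check at step 15 while cpu0 has already re-enqueued the timer at step 12, so the enqueued branch is taken.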
> > > > >
> > > > > In my previous email, I provided an explanation. As a supplement, I would
> > > > > like to reiterate a paragraph from my earlier response to Willem.
> > > > > The point is that when the hrtimer is in the enqueued state, you cannot
> > > > > call hrtimer_forward on it from outside the timer callback.
> > > >
> > > > How about trying hrtimer_is_queued() beforehand?
> > > >
> > > > IIUC, with this patch applied, we lose the opportunity to refresh the
> > > > timer when the lookup function (in the path I mentioned above) gets
> > > > called, compared to before. If the packet socket looks up a new block
> > > > but doesn't update the expiry time, the timer will fire soon after.
> > > > Does that sound unreasonable?
> > >
> > >
> > > I actually pointed out the issue with the timeout setting in a previous email:
> > > https://lore.kernel.org/netdev/20250826030328.878001-1-jackzxcui1989@163.com/.
> > >
> > > Regarding the method you mentioned, using hrtimer_is_queued() to assist the
> > > decision: I discussed this extensively with Willem in previous emails, and
> > > the conclusion was that it is not feasible. The reason is that in our
> > > scenario the hrtimer callback always returns HRTIMER_RESTART, unlike the
> > > places you pointed out, such as tcp_pacing_check, where the corresponding
> > > hrtimer callbacks all return HRTIMER_NORESTART. Returning HRTIMER_RESTART
> > > leads to troublesome issues, fundamentally because the hrtimer core
> > > enqueues the hrtimer after the callback returns, i.e. after we have
> > > dropped the protection of our sk_receive_queue lock.
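
This ordering can be modeled with a toy userspace sketch (simplified names, not kernel code): the core dequeues the timer before invoking the callback, and on HRTIMER_RESTART re-enqueues it only after the callback, and therefore any lock taken inside it, has finished:

```c
#include <stdbool.h>

/* Toy model of how the hrtimer core treats the callback's return value. */

enum toy_restart { TOY_NORESTART, TOY_RESTART };

struct toy_hrtimer {
    bool queued;
    enum toy_restart (*function)(struct toy_hrtimer *t);
};

static void toy_run_hrtimer(struct toy_hrtimer *t)
{
    enum toy_restart ret;

    t->queued = false;       /* dequeued before invoking the callback */
    ret = t->function(t);    /* callback does its own locking internally */
    if (ret == TOY_RESTART)
        t->queued = true;    /* re-enqueue happens outside the callback's
                              * locks: the window discussed above */
}

/* Models the retire callback's style: always asks to be restarted. */
static enum toy_restart toy_retire_cb(struct toy_hrtimer *t)
{
    (void)t;
    return TOY_RESTART;
}

/* Models a one-shot, NORESTART-style callback. */
static enum toy_restart toy_oneshot_cb(struct toy_hrtimer *t)
{
    (void)t;
    return TOY_NORESTART;
}
```

With a RESTART-style callback the timer is queued again by the time any other path can observe it, so a later hrtimer_is_queued() check cannot distinguish the safe window.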
> > >
> > > Returning to the functionality here, if we really want to update the hrtimer's timeout
> > > outside of the timer callback, there are two key points to note:
> > >
> > > 1. Accurately knowing whether the current context is a timer callback or tpacket_rcv.
> > > 2. How to update the hrtimer's timeout in a non-timer callback scenario.
> > >
> > > To start with the first point, it has already been explained in previous emails that
> > > executing hrtimer_forward outside of a timer callback is not allowed. Therefore, we
> > > must accurately determine whether we are in a timer callback; only in that context can
> > > we use the hrtimer_forward function to update.
> > > In the original code, since the same _prb_refresh_rx_retire_blk_timer function was called,
> > > distinguishing between contexts required code restructuring. Now that this patch removes
> > > the _prb_refresh_rx_retire_blk_timer function, achieving this accurate distinction is not
> > > too difficult.
> > > The key issue is the second point. If we are not inside the hrtimer's callback, we cannot
> > > use hrtimer_forward to update the timeout.
> > > So what other interface could we use? You might suggest hrtimer_start,
> > > but fundamentally hrtimer_start cannot be called while the timer may
> > > already be started. Wouldn't you then need hrtimer_cancel first, to
> > > confirm the hrtimer has been canceled? Once hrtimer_cancel is added,
> > > there are also scenarios where the timer is restarted, which means we
> > > must consider the concurrent case where the socket exits and also calls
> > > hrtimer_cancel. That might require adding logic for that concurrency,
> > > perhaps even reintroducing the delete_blk_timer variable to indicate
> > > that packet_release has been triggered, so that the hrtimer is not
> > > restarted from the tpacket_rcv path.
> > >
> > > In fact, in a previous v7 version, I proposed a change that I personally thought was
> > > quite good, which can be seen here:
> > > https://lore.kernel.org/netdev/20250822132051.266787-1-jackzxcui1989@163.com/. However,
> > > this change introduced an additional variable and more logic. Willem also pointed out
> > > that the added complexity to avoid a non-problematic issue was unnecessary.
> >
> > Admittedly it's a bit complex.
> >
> > >
> > > As mentioned in Changes in v8:
> > > The only special case is when prb_open_block is called from tpacket_rcv.
> > > That would set the timeout further into the future than the already queued
> > > timer. An earlier timeout is not problematic. No need to add complexity to
> > > avoid that.
> >
> > It'd be better to highlight this in the commit message as well, to
> > avoid repeated questions from others. It's a notable change in this
> > patch :)
>
> BTW, I have to emphasize that after this patch, the hrtimer runs
> periodically and unconditionally. As far as I know, it's unlikely that
> hundreds or thousands of packet sockets run in production, so it might
> not be a huge problem.
In tcp each sk has its own hrtimers (tcp_init_xmit_timers).
> Otherwise, numerous timers are likely to cause spikes/jitter, especially
> when the timeout is very small (which can be a 1ms timeout on an HZ=1000
> system). It would be great if you stated the possible side effects in
> the next version.
>
> Willem, any thoughts on this?
Essentially that is what the existing timer also did. There was no
path where the timer was being disabled.
It could, however, be continuously delayed from tpacket_rcv, if that
happened to come before the timer fired each time. That is no longer the
case with this change. It is probably good to call that out explicitly.
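
The behavioral difference can be illustrated with a toy simulation (abstract time units and a hypothetical helper, not real kernel behavior): if every received packet refreshes the expiry, a busy socket keeps postponing the retire timer; a free-running periodic timer fires every interval regardless of traffic:

```c
/* Count how often the retire timer fires over 'duration' time units when
 * a packet arrives every 'pkt_gap' units.  If 'refresh_on_rcv' is set,
 * each packet pushes the expiry a full interval into the future (the old
 * behavior); otherwise the expiry only advances when the timer fires
 * (the new, unconditionally periodic behavior). */
static int count_retire_fires(int duration, int interval, int pkt_gap,
                              int refresh_on_rcv)
{
    int fires = 0;
    int expires = interval;

    for (int now = 1; now <= duration; now++) {
        if (now >= expires) {           /* timer fires, block retired */
            fires++;
            expires = now + interval;   /* rearm one interval ahead */
        }
        if (refresh_on_rcv && now % pkt_gap == 0)
            expires = now + interval;   /* old: refresh from tpacket_rcv */
    }
    return fires;
}
```

With packets arriving faster than the interval, the old scheme never lets the timer fire, while the new scheme fires once per interval.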