Message-ID: <willemdebruijn.kernel.3b3e4091c4da@gmail.com>
Date: Sun, 24 Aug 2025 14:08:10 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Xin Zhao <jackzxcui1989@....com>,
willemdebruijn.kernel@...il.com,
edumazet@...gle.com,
ferenc@...es.dev
Cc: davem@...emloft.net,
kuba@...nel.org,
pabeni@...hat.com,
horms@...nel.org,
netdev@...r.kernel.org,
linux-kernel@...r.kernel.org,
Xin Zhao <jackzxcui1989@....com>
Subject: Re: [PATCH net-next v7] net: af_packet: Use hrtimer to do the retire
operation
Xin Zhao wrote:
> In systems with high real-time requirements, the jiffies-granularity
> timeout of an ordinary timer cannot meet the real-time demands, while
> the CPU-usage benefit of af_packet is significant. Use an hrtimer
> instead of a timer to compensate for this shortcoming in real-time
> behavior.
> On a system with HZ=100 or HZ=250, the update of TP_STATUS_USER is not
> timely enough, with fluctuations reaching over 8ms (on the HZ=250
> system). This is unacceptable in some high real-time systems that
> require timely processing of network packets. With an hrtimer and a
> timeout of 2ms, the update of TP_STATUS_USER can be stabilized to
> within 3 ms.
>
> Signed-off-by: Xin Zhao <jackzxcui1989@....com>
>
> ---
> Changes in v7:
> - Only update the hrtimer expire time within the hrtimer callback.
> When the callback returns HRTIMER_RESTART, __run_hrtimer re-enqueues the timer
> without sk_buff_head lock protection, so setting the hrtimer expiry while the
> timer is being enqueued could corrupt the hrtimer red-black tree.
> The expire time only moves forward, so not propagating it to retire_blk_timer
> outside the callback causes no problem: the timeout event is simply skipped and
> the timer is updated once the callback finds that expire_ktime is later than the
> timer's current expiry.
> - Use hrtimer_set_expires here instead of hrtimer_forward_now.
> The retire deadline of each block is not fixed: when network packets arrive
> quickly, blocks are retired rapidly and the retire time of the new block must be
> recalculated, whereas hrtimer_forward_now only advances the previous timeout by a
> fixed interval, which is not what we want.
> - Adding the 'bool callback' parameter back is intended to more accurately determine
> whether we are inside the hrtimer callback when executing
> _prb_refresh_rx_retire_blk_timer. This ensures that we only update the hrtimer's
> timeout value within the hrtimer callback.
This is getting more complex than needed.
Essentially the lifecycle is that packet_set_ring calls hrtimer_setup
on ring setup and hrtimer_cancel on teardown.
In between, while the ring is configured, the timer is either (see the sketch below)
- scheduled from tpacket_rcv and !is_scheduled
-> call hrtimer_start
- scheduled from tpacket_rcv and is_scheduled
-> call hrtimer_set_expires
- rescheduled from the timer callback
-> call hrtimer_set_expires and return HRTIMER_RESTART
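
For concreteness, a minimal sketch of the first two cases on the
tpacket_rcv side, with is_scheduled standing in for however the
scheduled state ends up being tracked (illustrative only, not the
patch as posted):

	/* tpacket_rcv path, sk_receive_queue.lock held */
	if (!pkc->is_scheduled) {
		/* first arm after setup (or after a NORESTART) */
		pkc->is_scheduled = true;
		hrtimer_start(&pkc->retire_blk_timer, pkc->interval_ktime,
			      HRTIMER_MODE_REL_SOFT);
	} else {
		/* timer already pending: only push out the expiry */
		hrtimer_set_expires(&pkc->retire_blk_timer,
				    ktime_add_safe(ktime_get(),
						   pkc->interval_ktime));
	}
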
The only complication is that the is_scheduled check can race with the
HRTIMER_RESTART restart, as that happens outside the sk_receive_queue
critical section.
One option that I suggested before is to convert pkc->delete_blk_timer
to pkc->blk_timer_scheduled, to record whether the timer is scheduled
without relying on hrtimer_is_queued. Set it on first open_block and
clear it in the callback when returning HRTIMER_NORESTART, roughly as
sketched below.
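
A rough sketch of that option against the structures in this patch
(pkc->blk_timer_scheduled is hypothetical, and teardown handling is
elided):

	/* first open_block, sk_receive_queue.lock held */
	if (!pkc->blk_timer_scheduled) {
		pkc->blk_timer_scheduled = true;
		hrtimer_start(&pkc->retire_blk_timer, pkc->interval_ktime,
			      HRTIMER_MODE_REL_SOFT);
	}

	/* in prb_retire_rx_blk_timer_expired(), queue lock held */
	if (unlikely(stop_requested)) {	/* whatever replaces the delete_blk_timer test */
		pkc->blk_timer_scheduled = false;
		return HRTIMER_NORESTART;
	}
	hrtimer_set_expires(&pkc->retire_blk_timer, pkc->expire_ktime);
	return HRTIMER_RESTART;

tpacket_rcv can then test pkc->blk_timer_scheduled instead of
hrtimer_is_queued() when choosing between hrtimer_start and
hrtimer_set_expires.
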
> Changes in v6:
> - Use hrtimer_is_queued to check whether we are within the callback function, so
> there is no need to add a 'bool callback' parameter to _prb_refresh_rx_retire_blk_timer,
> as suggested by Willem de Bruijn;
> - No need for local_irq_save and local_irq_restore to protect against the race between
> the timer callback running in softirq context and open_block called from tpacket_rcv
> in process context,
> as suggested by Willem de Bruijn;
> - Link to v6: https://lore.kernel.org/all/20250820092925.2115372-1-jackzxcui1989@163.com/
>
> Changes in v5:
> - Remove the unnecessary comments at the top of the _prb_refresh_rx_retire_blk_timer,
> branch is self-explanatory enough
> as suggested by Willem de Bruijn;
> - Indentation of _prb_refresh_rx_retire_blk_timer, align with first argument on
> previous line
> as suggested by Willem de Bruijn;
> - Do not call hrtimer_start within the hrtimer callback
> as suggested by Willem de Bruijn
> So add 'bool callback' parameter to _prb_refresh_rx_retire_blk_timer to indicate
> whether it is within the callback function. Use hrtimer_forward_now instead of
> hrtimer_start when it is in the callback function and is doing prb_open_block.
> - Link to v5: https://lore.kernel.org/all/20250819091447.1199980-1-jackzxcui1989@163.com/
>
> Changes in v4:
> - Add 'bool start' to distinguish whether the call to _prb_refresh_rx_retire_blk_timer
> is for prb_open_block. When it is for prb_open_block, execute hrtimer_start to
> (re)start the hrtimer; otherwise, use hrtimer_forward_now to set the expiration
> time as it is more commonly used compared to hrtimer_set_expires.
> as suggested by Willem de Bruijn;
> - Delete the comments to explain why hrtimer_set_expires(not hrtimer_forward_now)
> is used, as we do not use hrtimer_set_expires any more;
> - Link to v4: https://lore.kernel.org/all/20250818050233.155344-1-jackzxcui1989@163.com/
>
> Changes in v3:
> - return HRTIMER_NORESTART when pkc->delete_blk_timer is true
> as suggested by Willem de Bruijn;
> - Drop the retire_blk_tov field of tpacket_kbdq_core, add interval_ktime instead
> as suggested by Willem de Bruijn;
> - Add comments to explain why hrtimer_set_expires(not hrtimer_forward_now) is used in
> _prb_refresh_rx_retire_blk_timer
> as suggested by Willem de Bruijn;
> - Link to v3: https://lore.kernel.org/all/20250816170130.3969354-1-jackzxcui1989@163.com/
>
> Changes in v2:
> - Drop the tov_in_msecs field of tpacket_kbdq_core added by the patch
> as suggested by Willem de Bruijn;
> - Link to v2: https://lore.kernel.org/all/20250815044141.1374446-1-jackzxcui1989@163.com/
>
> Changes in v1:
> - Do not add another config for the current changes
> as suggested by Eric Dumazet;
> - Mention the beneficial cases 'HZ=100 or HZ=250' in the changelog
> as suggested by Eric Dumazet;
> - Add some performance details to the changelog
> as suggested by Ferenc Fejes;
> - Delete the 'pkc->tov_in_msecs == 0' bounds check which is not necessary
> as suggested by Willem de Bruijn;
> - Use hrtimer_set_expires instead of hrtimer_start_range_ns when retire timer needs update
> as suggested by Willem de Bruijn. Start the hrtimer in prb_setup_retire_blk_timer;
> - Just return HRTIMER_RESTART directly as all cases return the same value
> as suggested by Willem de Bruijn;
> - Link to v1: https://lore.kernel.org/all/20250813165201.1492779-1-jackzxcui1989@163.com/
> - Link to v0: https://lore.kernel.org/all/20250806055210.1530081-1-jackzxcui1989@163.com/
> ---
> net/packet/af_packet.c | 93 +++++++++++++++++++++++++++++-------------
> net/packet/diag.c | 2 +-
> net/packet/internal.h | 6 +--
> 3 files changed, 69 insertions(+), 32 deletions(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index a7017d7f0..654ae65ea 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -197,14 +197,14 @@ static void *packet_previous_frame(struct packet_sock *po,
> static void packet_increment_head(struct packet_ring_buffer *buff);
> static int prb_curr_blk_in_use(struct tpacket_block_desc *);
> static void *prb_dispatch_next_block(struct tpacket_kbdq_core *,
> - struct packet_sock *);
> + struct packet_sock *, bool);
> static void prb_retire_current_block(struct tpacket_kbdq_core *,
> struct packet_sock *, unsigned int status);
> static int prb_queue_frozen(struct tpacket_kbdq_core *);
> static void prb_open_block(struct tpacket_kbdq_core *,
> - struct tpacket_block_desc *);
> -static void prb_retire_rx_blk_timer_expired(struct timer_list *);
> -static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *);
> + struct tpacket_block_desc *, bool);
> +static enum hrtimer_restart prb_retire_rx_blk_timer_expired(struct hrtimer *);
> +static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *, bool);
> static void prb_fill_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *);
> static void prb_clear_rxhash(struct tpacket_kbdq_core *,
> struct tpacket3_hdr *);
> @@ -581,7 +581,7 @@ static __be16 vlan_get_protocol_dgram(const struct sk_buff *skb)
>
> static void prb_del_retire_blk_timer(struct tpacket_kbdq_core *pkc)
> {
> - timer_delete_sync(&pkc->retire_blk_timer);
> + hrtimer_cancel(&pkc->retire_blk_timer);
> }
>
> static void prb_shutdown_retire_blk_timer(struct packet_sock *po,
> @@ -603,9 +603,10 @@ static void prb_setup_retire_blk_timer(struct packet_sock *po)
> struct tpacket_kbdq_core *pkc;
>
> pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
> - timer_setup(&pkc->retire_blk_timer, prb_retire_rx_blk_timer_expired,
> - 0);
> - pkc->retire_blk_timer.expires = jiffies;
> + hrtimer_setup(&pkc->retire_blk_timer, prb_retire_rx_blk_timer_expired,
> + CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
> + hrtimer_start(&pkc->retire_blk_timer, pkc->interval_ktime,
> + HRTIMER_MODE_REL_SOFT);
> }
>
> static int prb_calc_retire_blk_tmo(struct packet_sock *po,
> @@ -672,27 +673,52 @@ static void init_prb_bdqc(struct packet_sock *po,
> p1->last_kactive_blk_num = 0;
> po->stats.stats3.tp_freeze_q_cnt = 0;
> if (req_u->req3.tp_retire_blk_tov)
> - p1->retire_blk_tov = req_u->req3.tp_retire_blk_tov;
> + p1->interval_ktime = ms_to_ktime(req_u->req3.tp_retire_blk_tov);
> else
> - p1->retire_blk_tov = prb_calc_retire_blk_tmo(po,
> - req_u->req3.tp_block_size);
> - p1->tov_in_jiffies = msecs_to_jiffies(p1->retire_blk_tov);
> + p1->interval_ktime = ms_to_ktime(prb_calc_retire_blk_tmo(po,
> + req_u->req3.tp_block_size));
> p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;
> rwlock_init(&p1->blk_fill_in_prog_lock);
>
> p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
> prb_init_ft_ops(p1, req_u);
> prb_setup_retire_blk_timer(po);
> - prb_open_block(p1, pbd);
> + prb_open_block(p1, pbd, false);
> }
>
> /* Do NOT update the last_blk_num first.
> * Assumes sk_buff_head lock is held.
> */
> -static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *pkc)
> -{
> - mod_timer(&pkc->retire_blk_timer,
> - jiffies + pkc->tov_in_jiffies);
> +static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *pkc,
> + bool callback)
> +{
> + if (likely(ktime_to_ns(pkc->expire_ktime))) {
> + pkc->expire_ktime = ktime_add_safe(ktime_get(), pkc->interval_ktime);
> + if (callback) {
> +	/* Three notes:
> +	 * 1) Only update the hrtimer expire time within the callback.
> +	 * When the callback returns HRTIMER_RESTART, __run_hrtimer
> +	 * re-enqueues the timer without sk_buff_head lock protection,
> +	 * and setting the hrtimer expiry while the timer is being
> +	 * enqueued could corrupt the hrtimer red-black tree.
> +	 * 2) Use hrtimer_set_expires here instead of hrtimer_forward_now:
> +	 * the retire deadline of each block is not fixed, because when
> +	 * network packets are received quickly, blocks are retired
> +	 * rapidly and the retire time of the new block must be
> +	 * recalculated, whereas hrtimer_forward_now only advances the
> +	 * previous timeout by a fixed interval.
> +	 * 3) The expire time only moves forward, so not updating
> +	 * retire_blk_timer outside the callback causes no problem: the
> +	 * timeout event is simply skipped and the timer is updated once
> +	 * the callback finds that expire_ktime is later than the timer's
> +	 * current expiry.
> +	 */
> + hrtimer_set_expires(&pkc->retire_blk_timer, pkc->expire_ktime);
> + }
> + } else {
> + /* Just after prb_setup_retire_blk_timer. */
> + pkc->expire_ktime = hrtimer_get_expires(&pkc->retire_blk_timer);
> + }
> pkc->last_kactive_blk_num = pkc->kactive_blk_num;
> }
>
> @@ -719,8 +745,9 @@ static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *pkc)
> * prb_calc_retire_blk_tmo() calculates the tmo.
> *
> */
> -static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
> +static enum hrtimer_restart prb_retire_rx_blk_timer_expired(struct hrtimer *t)
> {
> + enum hrtimer_restart ret = HRTIMER_RESTART;
> struct packet_sock *po =
> timer_container_of(po, t, rx_ring.prb_bdqc.retire_blk_timer);
> struct tpacket_kbdq_core *pkc = GET_PBDQC_FROM_RB(&po->rx_ring);
> @@ -732,8 +759,17 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
> frozen = prb_queue_frozen(pkc);
> pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
>
> - if (unlikely(pkc->delete_blk_timer))
> + if (unlikely(pkc->delete_blk_timer)) {
> + ret = HRTIMER_NORESTART;
> goto out;
> + }
> +
> + /* See comments in _prb_refresh_rx_retire_blk_timer. */
> + if (ktime_compare(pkc->expire_ktime,
> + hrtimer_get_expires(&pkc->retire_blk_timer)) > 0) {
> + hrtimer_set_expires(&pkc->retire_blk_timer, pkc->expire_ktime);
> + goto out;
> + }
>
> /* We only need to plug the race when the block is partially filled.
> * tpacket_rcv:
> @@ -757,7 +793,7 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
> goto refresh_timer;
> }
> prb_retire_current_block(pkc, po, TP_STATUS_BLK_TMO);
> - if (!prb_dispatch_next_block(pkc, po))
> + if (!prb_dispatch_next_block(pkc, po, true))
> goto refresh_timer;
> else
> goto out;
> @@ -779,17 +815,18 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t)
> * opening a block thaws the queue,restarts timer
> * Thawing/timer-refresh is a side effect.
> */
> - prb_open_block(pkc, pbd);
> + prb_open_block(pkc, pbd, true);
> goto out;
> }
> }
> }
>
> refresh_timer:
> - _prb_refresh_rx_retire_blk_timer(pkc);
> + _prb_refresh_rx_retire_blk_timer(pkc, true);
>
> out:
> spin_unlock(&po->sk.sk_receive_queue.lock);
> + return ret;
> }
>
> static void prb_flush_block(struct tpacket_kbdq_core *pkc1,
> @@ -890,7 +927,7 @@ static void prb_thaw_queue(struct tpacket_kbdq_core *pkc)
> *
> */
> static void prb_open_block(struct tpacket_kbdq_core *pkc1,
> - struct tpacket_block_desc *pbd1)
> + struct tpacket_block_desc *pbd1, bool callback)
> {
> struct timespec64 ts;
> struct tpacket_hdr_v1 *h1 = &pbd1->hdr.bh1;
> @@ -921,7 +958,7 @@ static void prb_open_block(struct tpacket_kbdq_core *pkc1,
> pkc1->pkblk_end = pkc1->pkblk_start + pkc1->kblk_size;
>
> prb_thaw_queue(pkc1);
> - _prb_refresh_rx_retire_blk_timer(pkc1);
> + _prb_refresh_rx_retire_blk_timer(pkc1, callback);
>
> smp_wmb();
> }
> @@ -965,7 +1002,7 @@ static void prb_freeze_queue(struct tpacket_kbdq_core *pkc,
> * So, caller must check the return value.
> */
> static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc,
> - struct packet_sock *po)
> + struct packet_sock *po, bool callback)
> {
> struct tpacket_block_desc *pbd;
>
> @@ -985,7 +1022,7 @@ static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc,
> * open this block and return the offset where the first packet
> * needs to get stored.
> */
> - prb_open_block(pkc, pbd);
> + prb_open_block(pkc, pbd, callback);
> return (void *)pkc->nxt_offset;
> }
>
> @@ -1124,7 +1161,7 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
> * opening a block also thaws the queue.
> * Thawing is a side effect.
> */
> - prb_open_block(pkc, pbd);
> + prb_open_block(pkc, pbd, false);
> }
> }
>
> @@ -1143,7 +1180,7 @@ static void *__packet_lookup_frame_in_block(struct packet_sock *po,
> prb_retire_current_block(pkc, po, 0);
>
> /* Now, try to dispatch the next block */
> - curr = (char *)prb_dispatch_next_block(pkc, po);
> + curr = (char *)prb_dispatch_next_block(pkc, po, false);
> if (curr) {
> pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc);
> prb_fill_curr_block(curr, pkc, pbd, len);
> diff --git a/net/packet/diag.c b/net/packet/diag.c
> index 6ce1dcc28..c8f43e0c1 100644
> --- a/net/packet/diag.c
> +++ b/net/packet/diag.c
> @@ -83,7 +83,7 @@ static int pdiag_put_ring(struct packet_ring_buffer *ring, int ver, int nl_type,
> pdr.pdr_frame_nr = ring->frame_max + 1;
>
> if (ver > TPACKET_V2) {
> - pdr.pdr_retire_tmo = ring->prb_bdqc.retire_blk_tov;
> + pdr.pdr_retire_tmo = ktime_to_ms(ring->prb_bdqc.interval_ktime);
> pdr.pdr_sizeof_priv = ring->prb_bdqc.blk_sizeof_priv;
> pdr.pdr_features = ring->prb_bdqc.feature_req_word;
> } else {
> diff --git a/net/packet/internal.h b/net/packet/internal.h
> index 1e743d031..f2df563c7 100644
> --- a/net/packet/internal.h
> +++ b/net/packet/internal.h
> @@ -45,12 +45,12 @@ struct tpacket_kbdq_core {
> /* Default is set to 8ms */
> #define DEFAULT_PRB_RETIRE_TOV (8)
>
> - unsigned short retire_blk_tov;
> + ktime_t interval_ktime;
> unsigned short version;
> - unsigned long tov_in_jiffies;
>
> /* timer to retire an outstanding block */
> - struct timer_list retire_blk_timer;
> + struct hrtimer retire_blk_timer;
> + ktime_t expire_ktime;
> };
>
> struct pgv {
> --
> 2.34.1
>