linux-kernel - Re: [RFC vfs/for-next 1/1] eventpoll: Set epoll timeout if it's in the future

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <qg2whv57hpyiw66ocb6zuhcus5yajqm3gypau3p655hp53pwnj@vxdhp2m7d5qg>
Date: Wed, 16 Apr 2025 11:12:10 +0200
From: Jan Kara <jack@...e.cz>
To: Joe Damato <jdamato@...tly.com>
Cc: linux-fsdevel@...r.kernel.org, 
	Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	open list <linux-kernel@...r.kernel.org>
Subject: Re: [RFC vfs/for-next 1/1] eventpoll: Set epoll timeout if it's in
 the future

On Tue 15-04-25 18:43:46, Joe Damato wrote:
> Avoid an edge case where epoll_wait arms a timer and calls schedule()
> even if the timer will expire immediately.
> 
> For example: if the user has specified an epoll busy poll usecs which is
> equal or larger than the epoll_wait/epoll_pwait2 timeout, it is
> unnecessary to call schedule_hrtimeout_range; the busy poll usecs have
> consumed the entire timeout duration so it is unnecessary to induce
> scheduling latency by calling schedule() (via schedule_hrtimeout_range).
> 
> This can be measured using a simple bpftrace script:
> 
> tracepoint:sched:sched_switch
> / args->prev_pid == $1 /
> {
>   print(kstack());
>   print(ustack());
> }
> 
> Before this patch is applied:
> 
>   Testing an epoll_wait app with busy poll usecs set to 1000, and
>   epoll_wait timeout set to 1ms using the script above shows:
> 
>      __traceiter_sched_switch+69
>      __schedule+1495
>      schedule+32
>      schedule_hrtimeout_range+159
>      do_epoll_wait+1424
>      __x64_sys_epoll_wait+97
>      do_syscall_64+95
>      entry_SYSCALL_64_after_hwframe+118
> 
>      epoll_wait+82
> 
>   Which is unexpected; the busy poll usecs should have consumed the
>   entire timeout and there should be no reason to arm a timer.
> 
> After this patch is applied: the same test scenario does not generate a
> call to schedule() in the above edge case. If the busy poll usecs are
> reduced (for example usecs: 100, epoll_wait timeout 1ms) the timer is
> armed as expected.
> 
> Signed-off-by: Joe Damato <jdamato@...tly.com>

Hum, I guess this is about the interpretation of the 'timeout' value of
epoll_pwait2() and friends. Does the runtime of the system call itself
(including possible polling) count into the timeout or does timeout mean
how long we should sleep after we've done all our work? The manpage says
"The timeout argument specifies the number of milliseconds that
epoll_wait() will block." which I guess can be understood both ways. Seeing
the epoll code it seems the author's intention was indeed rather the former.
So I guess feel free to add:

Reviewed-by: Jan Kara <jack@...e.cz>

								Honza
> ---
>  fs/eventpoll.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index f9898e60dd8b..ca0c7e843da7 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1980,6 +1980,14 @@ static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
>  	return ret;
>  }
>  
> +static int ep_schedule_timeout(ktime_t *to)
> +{
> +	if (to)
> +		return ktime_after(*to, ktime_get());
> +	else
> +		return 1;
> +}
> +
>  /**
>   * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
>   *           event buffer.
> @@ -2095,7 +2103,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  
>  		write_unlock_irq(&ep->lock);
>  
> -		if (!eavail)
> +		if (!eavail && ep_schedule_timeout(to))
>  			timed_out = !schedule_hrtimeout_range(to, slack,
>  							      HRTIMER_MODE_ABS);
>  		__set_current_state(TASK_RUNNING);
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR