linux-kernel - Re: [RESEND PATCH v2] eventfd: introduce ratelimited wakeup for non-semaphore eventfd


Message-ID: <w7ldxi4jcdizkefv7musjwxblwu66pg3rfteprfymqoxaev6by@ikvzlsncihbr>
Date: Sun, 11 Aug 2024 12:26:02 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Wen Yang <wen.yang@...ux.dev>
Cc: Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Jens Axboe <axboe@...nel.dk>, Dylan Yudaken <dylany@...com>, 
	David Woodhouse <dwmw@...zon.co.uk>, Paolo Bonzini <pbonzini@...hat.com>, 
	Dave Young <dyoung@...hat.com>, kernel test robot <lkp@...el.com>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [RESEND PATCH v2] eventfd: introduce ratelimited wakeup for
 non-semaphore eventfd

On Sun, Aug 11, 2024 at 04:59:54PM +0800, Wen Yang wrote:
> For the NON-SEMAPHORE eventfd, a write (2) call adds the 8-byte integer
> value provided in its buffer to the counter, while a read (2) returns the
> 8-byte value containing the value and resetting the counter value to 0.
> Therefore, the accumulated value of multiple writes can be retrieved by a
> single read.
> 
> However, the current situation is to immediately wake up the read thread
> after writing the NON-SEMAPHORE eventfd, which increases unnecessary CPU
> overhead. By introducing a configurable rate limiting mechanism in
> eventfd_write, these unnecessary wake-up operations are reduced.
> 
> 
[snip]

> 	# ./a.out  -p 2 -s 3
> 	The original cpu usage is as follows:
> 09:53:38 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 09:53:40 PM    2   47.26    0.00   52.74    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 09:53:40 PM    3   44.72    0.00   55.28    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 
> 09:53:40 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 09:53:42 PM    2   45.73    0.00   54.27    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 09:53:42 PM    3   46.00    0.00   54.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 
> 09:53:42 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 09:53:44 PM    2   48.00    0.00   52.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 09:53:44 PM    3   45.50    0.00   54.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 
> Then enable the ratelimited wakeup, eg:
> 	# ./a.out  -p 2 -s 3  -r1000 -c2
> 
> Observing a decrease of over 20% in CPU utilization (CPU # 3, 54% ->30%), as shown below:
> 10:02:32 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 10:02:34 PM    2   53.00    0.00   47.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 10:02:34 PM    3   30.81    0.00   30.81    0.00    0.00    0.00    0.00    0.00    0.00   38.38
> 
> 10:02:34 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 10:02:36 PM    2   48.50    0.00   51.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 10:02:36 PM    3   30.20    0.00   30.69    0.00    0.00    0.00    0.00    0.00    0.00   39.11
> 
> 10:02:36 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 10:02:38 PM    2   45.00    0.00   55.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
> 10:02:38 PM    3   27.08    0.00   30.21    0.00    0.00    0.00    0.00    0.00    0.00   42.71
> 
> 

Where are these stats from? Are they from the actual program you wrote
the feature for?

The program you inlined here does next to nothing in userspace and
unsurprisingly the entire thing is dominated by kernel time, regardless
of what event rate can be achieved.

For example, I got:

	./a.out -p 2 -s 3  5.34s user 60.85s system 99% cpu 66.19s (1:06.19) total

Even so, perf top shows me that a significant chunk of the time is
contention stemming from calls to poll -- perhaps the overhead will
go down sufficiently if you use epoll instead?
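
A minimal sketch of what an epoll-based consumer could look like. The
eventfd is registered once, and each wait then goes through epoll_wait
rather than a fresh poll call. The helper names consumer_setup and
consumer_wait are made up for illustration and are not taken from the
test program; whether this actually reduces the contention would have to
be measured.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: register efd with a new epoll instance, once.
 * Returns the epoll fd, or -1 on error. */
static int consumer_setup(int efd)
{
	int epfd = epoll_create1(EPOLL_CLOEXEC);
	if (epfd < 0)
		return -1;

	struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0) {
		close(epfd);
		return -1;
	}
	return epfd;
}

/* Hypothetical helper: block until the eventfd is readable, then read
 * the accumulated counter (the read resets it to 0 for a non-semaphore
 * eventfd). Returns 0 on success, -1 on error or interruption. */
static int consumer_wait(int epfd, int efd, uint64_t *total)
{
	assert(total != NULL);

	struct epoll_event ev;
	if (epoll_wait(epfd, &ev, 1, -1) != 1)
		return -1;

	ssize_t n = read(efd, total, sizeof(*total));
	return n == (ssize_t)sizeof(*total) ? 0 : -1;
}
```

The setup cost is paid once, so the per-event path is a single
epoll_wait followed by a read.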

I think the idea is pretty dodgy. If the consumer program can tolerate
some delay in event processing, this probably can be massaged entirely in
userspace.
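
One way this might look, as a rough sketch only: the consumer sleeps for
a minimum gap before each blocking read, so writes that arrive in the
meantime just add to the counter and are picked up by a single read.
The helper name read_coalesced and the gap_ns parameter are hypothetical,
and the added latency (up to gap_ns per batch) is assumed to be
acceptable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait at least gap_ns nanoseconds, then do a
 * blocking read of the eventfd counter. Everything written during the
 * sleep is returned as one accumulated value in *total.
 * Returns 0 on success, -1 if the read fails. */
static int read_coalesced(int efd, long gap_ns, uint64_t *total)
{
	assert(total != NULL && gap_ns >= 0 && gap_ns < 1000000000L);

	struct timespec ts = { .tv_sec = 0, .tv_nsec = gap_ns };

	/* Restart the sleep with the remaining time if a signal interrupts it. */
	while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
		;

	ssize_t n = read(efd, total, sizeof(*total));
	return n == (ssize_t)sizeof(*total) ? 0 : -1;
}
```

A consumer loop would call read_coalesced(efd, 1000000, &total) to see
at most roughly one batch per millisecond, with no kernel changes.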

If the wakeup rate in your real program is so high that it constitutes a
tangible problem, I wonder if eventfd is even the right primitive to use
-- perhaps something built around shared memory and futexes would do the
trick significantly better?
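
To make the futex suggestion concrete, here is a minimal sketch, assuming
a single consumer and a 32-bit counter that is never allowed to overflow.
The struct evcount and its helpers are hypothetical names. The producer
only enters the kernel when it moves the counter from zero to non-zero,
so a burst of posts costs one wakeup at most. For use across processes
the struct would have to live in a shared mapping; that part is not
shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical event counter; 'pending' doubles as the futex word. */
struct evcount {
	_Atomic uint32_t pending;	/* events posted but not yet taken */
};

/* Thin wrapper around the raw futex syscall. Assumes _Atomic uint32_t
 * has the same representation as uint32_t, as it does on Linux. */
static long futex_op(_Atomic uint32_t *addr, int op, uint32_t val)
{
	return syscall(SYS_futex, (uint32_t *)addr, op, val, NULL, NULL, 0);
}

/* Producer: add n events; wake the consumer only on a 0 -> non-zero
 * transition, since otherwise it cannot be sleeping on the word. */
static void evcount_post(struct evcount *ec, uint32_t n)
{
	assert(n > 0);
	if (atomic_fetch_add(&ec->pending, n) == 0)
		futex_op(&ec->pending, FUTEX_WAKE, 1);
}

/* Consumer: take all pending events, sleeping while there are none.
 * FUTEX_WAIT returns immediately if the word is no longer 0, so a post
 * racing with the check is not lost. */
static uint32_t evcount_take(struct evcount *ec)
{
	uint32_t v;

	while ((v = atomic_exchange(&ec->pending, 0)) == 0)
		futex_op(&ec->pending, FUTEX_WAIT, 0);
	return v;
}
```

As with the non-semaphore eventfd, one take returns the sum of all
earlier posts, but posts that land on a non-zero counter never make a
syscall.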