Message-ID: <7ace642f-e269-49ad-9d32-53c7ffa00681@amd.com>
Date: Mon, 3 Feb 2025 14:35:11 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Oleg Nesterov <oleg@...hat.com>, Linus Torvalds
<torvalds@...ux-foundation.org>
CC: Manfred Spraul <manfred@...orfullife.com>, Christian Brauner
<brauner@...nel.org>, David Howells <dhowells@...hat.com>, WangYuli
<wangyuli@...ontech.com>, <linux-fsdevel@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, "Gautham R. Shenoy" <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>, Neeraj Upadhyay
<Neeraj.Upadhyay@....com>
Subject: Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still
full
Hello Oleg,
Thank you for pointing me to the regression reports and the relevant
upstream discussions on the parallel thread.
On 2/2/2025 10:31 PM, Oleg Nesterov wrote:
> On 01/31, Linus Torvalds wrote:
>>
>> On Fri, 31 Jan 2025 at 01:50, K Prateek Nayak <kprateek.nayak@....com> wrote:
>>>
>>> On my 3rd Generation EPYC system (2 x 64C/128T), I see that on reverting
>>> the changes on the above mentioned commit, sched-messaging sees a
>>> regression up until the 8 group case which contains 320 tasks, however
>>> with 16 groups (640 tasks), the revert helps with performance.
>>
>> I suspect that the extra wakeups just end up perturbing timing, and
>> then you just randomly get better performance on that particular
>> test-case and machine.
>>
>> I'm not sure this is worth worrying about, unless there's a real load
>> somewhere that shows this regression.
>
> Well yes, but the problem is that people seem to believe that hackbench
> is the "real" workload, even in the "overloaded" case...
>
> And if we do care about performance... Could you look at the trivial patch
> at the end? I don't think {a,c,m}time make any sense in the !fifo case, but
> as you explained before they are visible to fstat() so we probably shouldn't
> remove file_accessed/file_update_time unconditionally.
>
> This patch does help if I change hackbench to use pipe2(O_NOATIME) instead
> of pipe(). And in fact it helps even in the simplest case:
>
> #define _GNU_SOURCE	/* for pipe2(), O_NOATIME, timeradd()/timersub() */
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <assert.h>
> #include <sys/ioctl.h>
> #include <sys/time.h>
>
> static char buf[17 * 4096];
>
> static struct timeval TW, TR;
>
> int wr(int fd, int size)
> {
> int c, r;
> struct timeval t0, t1;
>
> gettimeofday(&t0, NULL);
> for (c = 0; (r = write(fd, buf, size)) > 0; c += r);
> gettimeofday(&t1, NULL);
> timeradd(&TW, &t1, &TW);
> timersub(&TW, &t0, &TW);
>
> return c;
> }
>
> int rd(int fd, int size)
> {
> int c, r;
> struct timeval t0, t1;
>
> gettimeofday(&t0, NULL);
> for (c = 0; (r = read(fd, buf, size)) > 0; c += r);
> gettimeofday(&t1, NULL);
> timeradd(&TR, &t1, &TR);
> timersub(&TR, &t0, &TR);
>
> return c;
> }
>
> int main(int argc, const char *argv[])
> {
> int fd[2], nb = 1, noat, loop, size;
>
> assert(argc == 4);
> noat = atoi(argv[1]) ? O_NOATIME : 0;
> loop = atoi(argv[2]);
> size = atoi(argv[3]);
>
> assert(pipe2(fd, noat) == 0);
> assert(ioctl(fd[0], FIONBIO, &nb) == 0);
> assert(ioctl(fd[1], FIONBIO, &nb) == 0);
>
> assert(size <= sizeof(buf));
> while (loop--)
> assert(wr(fd[1], size) == rd(fd[0], size));
>
> printf("TW = %lu.%03lu\n", TW.tv_sec, TW.tv_usec/1000);
> printf("TR = %lu.%03lu\n", TR.tv_sec, TR.tv_usec/1000);
>
> return 0;
> }
>
>
> Now,
>
> /# for i in 1 2 3; do /host/tmp/test 0 10000 100; done
> TW = 7.692
> TR = 5.704
> TW = 7.930
> TR = 5.858
> TW = 7.685
> TR = 5.697
> /#
> /# for i in 1 2 3; do /host/tmp/test 1 10000 100; done
> TW = 6.432
> TR = 4.533
> TW = 6.612
> TR = 4.638
> TW = 6.409
> TR = 4.523
>
> Oleg.
> ---
>
With the below patch on mainline, I see further improvements for a
modified version of sched-messaging (sched-messaging is the same as
hackbench, as you noted on the parallel thread) that uses
pipe2(O_NOATIME); a sketch of that tweak follows.
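The benchmark-side change is just the pipe creation call. A minimal
sketch of the tweak (the helper name and its placement are illustrative,
not the actual perf sources):

#define _GNU_SOURCE		/* pipe2() and O_NOATIME */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: the unmodified benchmark calls plain pipe(fds). */
static int create_msg_pipe(int fds[2])
{
	/*
	 * Ask the kernel to skip the {a,c,m}time updates on this pipe;
	 * only effective with the proposed create_pipe_files() change.
	 */
	return pipe2(fds, O_NOATIME);
}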
The original regression is still noticeable despite the improvements,
but if folks believe the case exhibited by sched-messaging is a corner
case of the original changes, I'll just continue further testing with
the new baseline.
That said, the following are the results for the below patch:
==================================================================
Test : sched-messaging
cmdline : perf bench sched messaging -p -t -l 100000 -g <groups>
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case:          mainline[pct imp](CV)     patched[pct imp](CV)
 1-groups      1.00 [ -0.00](12.29)      0.94 [  6.76](12.59)
 2-groups      1.00 [ -0.00]( 3.64)      0.95 [  5.16]( 5.99)
 4-groups      1.00 [ -0.00]( 3.33)      0.99 [  1.03]( 1.89)
 8-groups      1.00 [ -0.00]( 2.90)      1.00 [  0.16]( 1.23)
16-groups      1.00 [ -0.00]( 1.46)      0.98 [  2.01]( 0.98)
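As a side note, the user-visible effect the patch targets can be checked
with fstat(): without O_NOATIME the pipe inode's times move on every
read/write, while with pipe2(O_NOATIME) on a patched kernel they should
stay put. A rough, untested sketch (illustrative only):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <assert.h>

int main(void)
{
	int fd[2];
	struct stat before, after;
	char c = 'x';

	/* Assumes a kernel with the proposed O_NOATIME support in pipe2(). */
	assert(pipe2(fd, O_NOATIME) == 0);
	assert(fstat(fd[1], &before) == 0);

	sleep(1);			/* let at least one second pass */
	assert(write(fd[1], &c, 1) == 1);
	assert(read(fd[0], &c, 1) == 1);

	assert(fstat(fd[1], &after) == 0);
	printf("mtime unchanged: %s\n",
	       before.st_mtime == after.st_mtime ? "yes" : "no");
	return 0;
}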
Please feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@....com>
I'll give your generalized optimization a spin too when it comes out.
Meanwhile, I'll go run a bunch of benchmarks to see if the original
change has affected any other workload in my test bed.
Thank you for looking into this.
--
Thanks and Regards,
Prateek
> diff --git a/fs/pipe.c b/fs/pipe.c
> index a3f5fd7256e9..14b2c0f8b616 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1122,6 +1122,9 @@ int create_pipe_files(struct file **res, int flags)
> }
> }
>
> + if (flags & O_NOATIME)
> + inode->i_flags |= S_NOCMTIME;
> +
> f = alloc_file_pseudo(inode, pipe_mnt, "",
> O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT)),
> &pipefifo_fops);
> @@ -1134,7 +1137,7 @@ int create_pipe_files(struct file **res, int flags)
> f->private_data = inode->i_pipe;
> f->f_pipe = 0;
>
> - res[0] = alloc_file_clone(f, O_RDONLY | (flags & O_NONBLOCK),
> + res[0] = alloc_file_clone(f, O_RDONLY | (flags & (O_NONBLOCK | O_NOATIME)),
> &pipefifo_fops);
> if (IS_ERR(res[0])) {
> put_pipe_info(inode, inode->i_pipe);
> @@ -1154,7 +1157,7 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
> int error;
> int fdw, fdr;
>
> - if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_NOTIFICATION_PIPE))
> + if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_NOATIME | O_NOTIFICATION_PIPE))
> return -EINVAL;
>
> error = create_pipe_files(files, flags);
>