lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0d17fc70-01a8-43b4-aec6-5cede5c8f7ba@amd.com>
Date: Tue, 4 Mar 2025 11:05:57 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Hillf Danton <hdanton@...a.com>, "Sapkal, Swapnil"
	<swapnil.sapkal@....com>
CC: Oleg Nesterov <oleg@...hat.com>, Mateusz Guzik <mjguzik@...il.com>, "Linus
 Torvalds" <torvalds@...ux-foundation.org>, <linux-fsdevel@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still
 full

Hello Hillf,

On 3/4/2025 10:36 AM, Hillf Danton wrote:
> On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@....com>
>> On 2/28/2025 10:03 PM, Oleg Nesterov wrote:
>>> And... I know, I know you already hate me ;)
>>>
>>
>> Not at all :)
>>
>>> but if you have time, could you check if this patch (with or without the
>>> previous debugging patch) makes any difference? Just to be sure.
>>>
>>
>> Sure, I will give this a try.
>>
>> But in the meanwhile me and Prateek tried some of the experiments in the weekend.
>> We were able to reproduce this issue on a third generation EPYC system as well as
>> on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).
>>
>> We tried heavy hammered tracing approach over the weekend on top of your debug patch.
>> I have attached the debug patch below. With tracing we found the following case for
>> pipe_writable():
>>
>>     hackbench-118768  [206] .....  1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
>>
>> Here,
>>
>> head = 37
>> tail = 38
>> max_usage = 16
>> pipe_full() returns 1.
>>
>> Between reading of head and later the tail, the tail seems to have moved ahead of the
>> head leading to wraparound. Applying the following changes I have not yet run into a
>> hang on the original machine where I first saw it:
>>
>> diff --git a/fs/pipe.c b/fs/pipe.c
>> index ce1af7592780..a1931c817822 100644
>> --- a/fs/pipe.c
>> +++ b/fs/pipe.c
>> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
>>    /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>>    static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>>    {
>> -	unsigned int head = READ_ONCE(pipe->head);
>> -	unsigned int tail = READ_ONCE(pipe->tail);
>>    	unsigned int max_usage = READ_ONCE(pipe->max_usage);
>> +	unsigned int head, tail;
>> +
>> +	tail = READ_ONCE(pipe->tail);
>> +	/*
>> +	 * Since the unsigned arithmetic in this lockless preemptible context
>> +	 * relies on the fact that the tail can never be ahead of head, read
>> +	 * the head after the tail to ensure we've not missed any updates to
>> +	 * the head. Reordering the reads can cause wraparounds and give the
>> +	 * illusion that the pipe is full.
>> +	 */
>> +	smp_rmb();
>> +	head = READ_ONCE(pipe->head);
>>    
>>    	return !pipe_full(head, tail, max_usage) ||
>>    		!READ_ONCE(pipe->readers);
>> ---
>>
>> smp_rmb() on x86 is a nop and even without the barrier we were not able to
>> reproduce the hang even after 10000 iterations.
>>
> My $.02 that changes the wait condition.
> Not sure it makes sense for you.
> 
> --- x/fs/pipe.c
> +++ y/fs/pipe.c
> @@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io
>   {
>   	struct file *filp = iocb->ki_filp;
>   	struct pipe_inode_info *pipe = filp->private_data;
> -	unsigned int head;
> +	unsigned int head, tail;
>   	ssize_t ret = 0;
>   	size_t total_len = iov_iter_count(from);
>   	ssize_t chars;
> @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io
>   		 * after waiting we need to re-check whether the pipe
>   		 * become empty while we dropped the lock.
>   		 */
> +		tail = pipe->tail;
>   		mutex_unlock(&pipe->mutex);
>   		if (was_empty)
>   			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>   		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> -		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
> +		wait_event_interruptible_exclusive(pipe->wr_wait,
> +				!READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail));

That could work too for the case highlighted but in case the head too
has moved by the time the writer wakes up, it'll lead to an extra
wakeup.

Linus' diff seems cleaner and seems to cover all racy scenarios.

>   		mutex_lock(&pipe->mutex);
>   		was_empty = pipe_empty(pipe->head, pipe->tail);
>   		wake_next_writer = true;
> --

-- 
Thanks and Regards,
Prateek


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ