linux-kernel - Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250304102934.2999-1-hdanton@sina.com>
Date: Tue,  4 Mar 2025 18:29:33 +0800
From: Hillf Danton <hdanton@...a.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Oleg Nesterov <oleg@...hat.com>,
	Mateusz Guzik <mjguzik@...il.com>,
	"Sapkal, Swapnil" <swapnil.sapkal@....com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full

On Tue, 4 Mar 2025 11:05:57 +0530 K Prateek Nayak <kprateek.nayak@....com>
>On 3/4/2025 10:36 AM, Hillf Danton wrote:
>> On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@....com>
>>> On 2/28/2025 10:03 PM, Oleg Nesterov wrote:
>>>> And... I know, I know you already hate me ;)
>>>>
>>>
>>> Not at all :)
>>>
>>>> but if you have time, could you check if this patch (with or without the
>>>> previous debugging patch) makes any difference? Just to be sure.
>>>>
>>>
>>> Sure, I will give this a try.
>>>
>>> But in the meanwhile me and Prateek tried some of the experiments in the weekend.
>>> We were able to reproduce this issue on a third generation EPYC system as well as
>>> on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).
>>>
>>> We tried heavy hammered tracing approach over the weekend on top of your debug patch.
>>> I have attached the debug patch below. With tracing we found the following case for
>>> pipe_writable():
>>>
>>>     hackbench-118768  [206] .....  1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
>>>
>>> Here,
>>>
>>> head = 37
>>> tail = 38
>>> max_usage = 16
>>> pipe_full() returns 1.
>>>
>>> Between reading of head and later the tail, the tail seems to have moved ahead of the
>>> head leading to wraparound. Applying the following changes I have not yet run into a
>>> hang on the original machine where I first saw it:
>>>
>>> diff --git a/fs/pipe.c b/fs/pipe.c
>>> index ce1af7592780..a1931c817822 100644
>>> --- a/fs/pipe.c
>>> +++ b/fs/pipe.c
>>> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
>>>    /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>>>    static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>>>    {
>>> -	unsigned int head = READ_ONCE(pipe->head);
>>> -	unsigned int tail = READ_ONCE(pipe->tail);
>>>    	unsigned int max_usage = READ_ONCE(pipe->max_usage);
>>> +	unsigned int head, tail;
>>> +
>>> +	tail = READ_ONCE(pipe->tail);
>>> +	/*
>>> +	 * Since the unsigned arithmetic in this lockless preemptible context
>>> +	 * relies on the fact that the tail can never be ahead of head, read
>>> +	 * the head after the tail to ensure we've not missed any updates to
>>> +	 * the head. Reordering the reads can cause wraparounds and give the
>>> +	 * illusion that the pipe is full.
>>> +	 */
>>> +	smp_rmb();
>>> +	head = READ_ONCE(pipe->head);
>>>    
>>>    	return !pipe_full(head, tail, max_usage) ||
>>>    		!READ_ONCE(pipe->readers);
>>> ---
>>>
>>> smp_rmb() on x86 is a nop and even without the barrier we were not able to
>>> reproduce the hang even after 10000 iterations.
>>>
>> My $.02 that changes the wait condition.
>> Not sure it makes sense for you.
>> 
>> --- x/fs/pipe.c
>> +++ y/fs/pipe.c
>> @@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io
>>   {
>>   	struct file *filp = iocb->ki_filp;
>>   	struct pipe_inode_info *pipe = filp->private_data;
>> -	unsigned int head;
>> +	unsigned int head, tail;
>>   	ssize_t ret = 0;
>>   	size_t total_len = iov_iter_count(from);
>>   	ssize_t chars;
>> @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io
>>   		 * after waiting we need to re-check whether the pipe
>>   		 * become empty while we dropped the lock.
>>   		 */
>> +		tail = pipe->tail;
>>   		mutex_unlock(&pipe->mutex);
>>   		if (was_empty)
>>   			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>>   		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>> -		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
>> +		wait_event_interruptible_exclusive(pipe->wr_wait,
>> +				!READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail));
>
>That could work too for the case highlighted but in case the head too
>has moved by the time the writer wakes up, it'll lead to an extra
>wakeup.
>
Note wakeup can occur even if pipe is full, and more important, taking
the pipe lock after wakeup is the price paid for curing the hang in
question.

		 * So we still need to wake up any pending writers in the
		 * _very_ unlikely case that the pipe was full, but we got
		 * no data.
		 */

>Linus' diff seems cleaner and seems to cover all racy scenarios.
>
>>   		mutex_lock(&pipe->mutex);
>>   		was_empty = pipe_empty(pipe->head, pipe->tail);
>>   		wake_next_writer = true;
>> --
>
>-- 
>Thanks and Regards,
>Prateek