linux-kernel - Re: Reasoning about memory ordering

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Date:   Fri, 23 Feb 2018 19:45:55 +0200
From:   Nikolay Borisov <nborisov@...e.com>
To:     Andrea Parri <parri.andrea@...il.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        mathieu.desnoyers@...icios.com,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: Reasoning about memory ordering



On 23.02.2018 19:31, Andrea Parri wrote:
> On Fri, Feb 23, 2018 at 02:30:22PM +0200, Nikolay Borisov wrote:
>> Hello, 
>>
>> I'm cc'ing a bunch of people I know are well-versed in 
>> the black arts of memory ordering! 
>>
>> Currently in btrfs we have roughly the following sequence: 
>>
>> T1:													T2:
>> i_size_write(inode, newsize);                                                         						
>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);          				atomic_inc(&inode->i_dio_count);
>> smp_mb(); 												if (iov_iter_rw(iter) == READ) {
>> 														if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags)) {
>> if (atomic_read(&inode->i_dio_count)) {											if (atomic_dec_and_test(&inode->i_dio_count))   
>> 	wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP);                 				wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
>>         DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);                    				}
>>                                                                                 				if (offset >= i_size_read(inode))
>>         do {                                                                    			               return;
>>                 prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE);                                 }      
>>                 if (atomic_read(&inode->i_dio_count))                           
>>                         schedule();                                             
>>         } while (atomic_read(&inode->i_dio_count));                             
>>         finish_wait(wq, &q.wq_entry);           
>> }
>>
>> smp_mb__before_atomic();                                                
>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);     
>>
>> The semantics I'm after are: 
>>
>> 1. If T1 goes to sleep, then T2 would see the 
>> BTRFS_INODE_READDIO_NEED_LOCK and hence will execute the 
>> atomic_dec_and_test and possibly wake up T1. This flag serves as a way 
>> to indicate to possibly multiple T2 (dio readers) that T1 is blocked
>> and they should unblock it and resort to acquiring some locks (this is not 
>> visible in this excerpt of code for brevity). It's sort of a back-off 
>> mechanism.
> 
> I don't see how this could be guaranteed, even in a sequentially consistent
> world (disclaimer: I'm certainly not familiar with btrfs): what is wrong in
> 
> T1				T2
> 
> 				atomic_inc(i_dio_count)
> 				test_bit(NEED_LOCK, flags)  // unset
> set_bit(NEED_LOCK, flags)
> atomic_read(i_dio_count)  // >1
> 	--> go to sleep

You are correct, so looking at btrfs_direct_IO again it seems this kind
of execution is fine since we also do inode_dio_end (i.e. the
atomic_dec_andtest/wake_up) sequence even outside of the test_bit()
conditional. So I guess the first requirement is really
unsatisfiable/not required.

> 
> Thanks,
>   Andrea
> 
> 
>>
>> 2. BTRFS_INODE_READDIO_NEED_LOCK bit must be set _before_ going to sleep 
>>
>> 3. BTRFS_INODE_READDIO_NEED_LOCK must be cleared _after_ the thread has 
>> been woken up. 
>>
>> 4. After T1 is woken up, it's possible that a new T2 comes and doesn't see 
>> the  BTRFS_INODE_READDIO_NEED_LOCK flag set but this is fine, since the check 
>> for i_size should cause T2 to just return (it will also execute atomic_dec_and_test)
>>
>> Given this is the current state of the code (it's part of btrfs) I believe
>> the following could/should be done: 
>>
>> 1. The smp_mb after the set_bit in T1 could be removed, since there is 
>> already an implied full mm in prepare_to_wait. That is if we go to sleep, 
>> then T2 is guaranteed to see the flag/i_size_write happening by merit of 
>> the implied memory barrier in prepare_to_wait/schedule. But what if it doesn't 
>> go to sleep? I still would like the i_size_write to be visible to T2
>>
>> 2. The bit clearing code in T1 should be possible to be replaced by 
>> clear_bit_unlock (this was suggested by PeterZ on IRC).
>>
>> 3. I suspect there is a memory barrier in T2 that is missing. Perhaps
>> there should be an smp_mb__before_atomic right before the test_bit so that 
>> it's ordered with the implied smp_mb in T1's prepare_to_wait. 
>