netdev - Re: [RFC PATCH v1 0/2] virtio/vsock: fix mutual rx/tx hungup

Message-ID: <e60fb580-01ea-baf6-3635-8fc8933f817f@sberdevices.ru>
Date:   Tue, 20 Dec 2022 12:09:36 +0000
From:   Arseniy Krasnov <AVKrasnov@...rdevices.ru>
To:     Stefano Garzarella <sgarzare@...hat.com>
CC:     Stefan Hajnoczi <stefanha@...hat.com>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        "Jakub Kicinski" <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "virtualization@...ts.linux-foundation.org" 
        <virtualization@...ts.linux-foundation.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        kernel <kernel@...rdevices.ru>,
        Krasnov Arseniy <oxffffaa@...il.com>
Subject: Re: [RFC PATCH v1 0/2] virtio/vsock: fix mutual rx/tx hungup

On 20.12.2022 13:43, Stefano Garzarella wrote:
> On Tue, Dec 20, 2022 at 09:23:17AM +0000, Arseniy Krasnov wrote:
>> On 20.12.2022 11:33, Stefano Garzarella wrote:
>>> On Tue, Dec 20, 2022 at 07:14:27AM +0000, Arseniy Krasnov wrote:
>>>> On 19.12.2022 18:41, Stefano Garzarella wrote:
>>>>
>>>> Hello!
>>>>
>>>>> Hi Arseniy,
>>>>>
>>>>> On Sat, Dec 17, 2022 at 8:42 PM Arseniy Krasnov <AVKrasnov@...rdevices.ru> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> it seems I found a strange thing (maybe a bug) where the sender ('tx' below) and
>>>>>> receiver ('rx' below) can get stuck forever. A potential fix is in the first
>>>>>> patch; the second patch contains a reproducer based on the vsock test suite.
>>>>>> The reproducer is simple: tx just sends data to rx with the 'write()' syscall, rx
>>>>>> dequeues it using the 'read()' syscall and uses 'poll()' for waiting. I run
>>>>>> the server in the host and the client in the guest.
>>>>>>
>>>>>> rx side params:
>>>>>> 1) SO_VM_SOCKETS_BUFFER_SIZE is 256Kb (i.e. the default).
>>>>>> 2) SO_RCVLOWAT is 128Kb.
>>>>>>
>>>>>> What happens in the reproducer step by step:
>>>>>>
>>>>>
>>>>> I put the values of the variables involved to facilitate understanding:
>>>>>
>>>>> RX: buf_alloc = 256 KB; fwd_cnt = 0; last_fwd_cnt = 0;
>>>>>     free_space = buf_alloc - (fwd_cnt - last_fwd_cnt) = 256 KB
>>>>>
>>>>> The credit update is sent if
>>>>> free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE [64 KB]
>>>>>
>>>>>> 1) tx tries to send 256Kb + 1 byte (in a single 'write()')
>>>>>> 2) tx sends 256Kb, data reaches rx (rx_bytes == 256Kb)
>>>>>> 3) tx waits for space in 'write()' to send last 1 byte
>>>>>> 4) rx does poll(), (rx_bytes >= rcvlowat) 256Kb >= 128Kb, POLLIN is set
>>>>>> 5) rx reads 64Kb, credit update is not sent due to *
>>>>>
>>>>> RX: buf_alloc = 256 KB; fwd_cnt = 64 KB; last_fwd_cnt = 0;
>>>>>     free_space = 192 KB
>>>>>
>>>>>> 6) rx does poll(), (rx_bytes >= rcvlowat) 192Kb >= 128Kb, POLLIN is set
>>>>>> 7) rx reads 64Kb, credit update is not sent due to *
>>>>>
>>>>> RX: buf_alloc = 256 KB; fwd_cnt = 128 KB; last_fwd_cnt = 0;
>>>>>     free_space = 128 KB
>>>>>
>>>>>> 8) rx does poll(), (rx_bytes >= rcvlowat) 128Kb >= 128Kb, POLLIN is set
>>>>>> 9) rx reads 64Kb, credit update is not sent due to *
>>>>>
>>>>> Right, (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE) is still false.
>>>>>
>>>>> RX: buf_alloc = 256 KB; fwd_cnt = 192 KB; last_fwd_cnt = 0;
>>>>>     free_space = 64 KB
>>>>>
>>>>>> 10) rx does poll(), (rx_bytes < rcvlowat) 64Kb < 128Kb, rx waits in poll()
>>>>>
>>>>> I agree that the TX is stuck because we are not sending the credit
>>>>> update, but also if RX sends the credit update at step 9, RX won't be
>>>>> woken up at step 10, right?
>>>>
>>>> Yes, RX will sleep, but TX will wake up, and since we inform TX how much
>>>> free space we have, there are now two cases for TX:
>>>> 1) send a "small" rest of the data (i.e. without blocking again), leave 'write()'
>>>>   and continue execution. RX still waits in 'poll()'. Later TX will
>>>>   send enough data to wake up RX.
>>>> 2) send a "big" rest of the data - if the rest is too big to leave 'write()' and TX
>>>>   has to wait again for free space - it will still be able to send enough data
>>>>   to wake up RX, as we compared 'rx_bytes' with the rcvlowat value in RX.
>>>
>>> Right, so I'd update the test to behave like this.
>> Sorry, You mean vsock_test? To cover TX waiting for free space at RX, thus checking
>> this kernel patch logic?
> 
> Yep, I mean the test that you added in this series.
Ok
> 
>>> And I'd explain better the problem we are going to fix in the commit message.
>> Ok
>>>
>>>>>
>>>>>>
>>>>>> * is an optimization in 'virtio_transport_stream_do_dequeue()' which
>>>>>>   sends OP_CREDIT_UPDATE only when little space is left -
>>>>>>   less than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE.
>>>>>>
>>>>>> Now the tx side waits for space inside write() and rx waits in poll() for
>>>>>> 'rx_bytes' to reach the SO_RCVLOWAT value. Both sides will wait forever. I
>>>>>> think a possible fix is to send a credit update not only when too little
>>>>>> space is left, but also when the number of bytes in the receive queue is smaller
>>>>>> than SO_RCVLOWAT and thus not enough to wake up the sleeping reader. I'm not
>>>>>> sure about the correctness of this idea, but anyway - I think that the problem
>>>>>> above exists. What do you think?
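
The ten steps above can be replayed with a small model. This is only a sketch under stated assumptions, not the kernel code: it uses the free_space formula and the 64 KB threshold exactly as quoted in this thread, assumes tx learns a new fwd_cnt only from a credit update, and the names `simulate` and `lowat_check` are hypothetical. `lowat_check=True` models the proposed extra condition (send a credit update when rx_bytes drops below SO_RCVLOWAT).

```python
KB = 1024
BUF_ALLOC = 256 * KB         # SO_VM_SOCKETS_BUFFER_SIZE value used in the thread
MAX_PKT_BUF_SIZE = 64 * KB   # VIRTIO_VSOCK_MAX_PKT_BUF_SIZE value quoted in the thread
READ_SIZE = 64 * KB          # size of each rx read() in the reproducer


def simulate(rcvlowat, lowat_check):
    """Replay the reproducer; return (outcome, number_of_credit_updates).

    outcome is "stuck" when tx has no credit and rx sleeps in poll(),
    or "tx_done" when tx managed to send all 256 KB + 1 bytes.
    Hypothetical model, not the kernel implementation.
    """
    to_send = BUF_ALLOC + 1  # tx wants to write 256 KB + 1 byte
    tx_cnt = 0               # bytes tx has sent so far
    peer_fwd_cnt = 0         # last fwd_cnt that tx was told about
    rx_bytes = 0             # bytes queued at rx
    fwd_cnt = 0              # bytes rx has consumed
    last_fwd_cnt = 0         # fwd_cnt at the last credit update
    updates = 0

    while True:
        # tx sends as much as its view of the rx buffer allows
        credit = BUF_ALLOC - (tx_cnt - peer_fwd_cnt)
        n = min(credit, to_send)
        tx_cnt += n
        rx_bytes += n
        to_send -= n
        if to_send == 0:
            return "tx_done", updates

        # rx polls: POLLIN only if rx_bytes >= rcvlowat (at least 1 byte)
        if rx_bytes < max(rcvlowat, 1):
            return "stuck", updates  # tx has no credit, rx sleeps forever

        # rx reads up to 64 KB
        n = min(READ_SIZE, rx_bytes)
        rx_bytes -= n
        fwd_cnt += n

        # credit update decision, as described in the thread
        free_space = BUF_ALLOC - (fwd_cnt - last_fwd_cnt)
        if free_space < MAX_PKT_BUF_SIZE or (lowat_check and rx_bytes < rcvlowat):
            last_fwd_cnt = fwd_cnt
            peer_fwd_cnt = fwd_cnt  # tx learns about the freed space
            updates += 1


if __name__ == "__main__":
    print(simulate(128 * KB, lowat_check=False))
    print(simulate(128 * KB, lowat_check=True))
    print(simulate(64 * KB, lowat_check=False))
```

In this model, SO_RCVLOWAT = 128 KB without the extra check ends with both sides waiting and no credit update sent. With the extra check, tx gets one update and finishes its write() while rx keeps waiting in poll() for more data. With SO_RCVLOWAT = 64 KB, rx wakes up, drains the queue, and the existing check alone sends the update.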
>>>>>
>>>>> I'm not sure, I have to think more about it, but if RX reads less than
>>>>> SO_RCVLOWAT, I expect it's normal to end up stuck.
>>>>>
>>>>> In this case we are only unblocking TX, but even if it sends that single
>>>>> byte, RX is still stuck and not consuming it, so it was useless to wake
>>>>> up TX if RX won't consume it anyway, right?
>>>>
>>>> 1) I think it is not useless, because we inform (not just wake up) TX that
>>>> there is free space at the RX side - as I mentioned above.
>>>> 2) Anyway, I think that this situation is a little bit strange: TX thinks that
>>>> there is no free space at RX and waits for it, but there is free space at RX!
>>>> At the same time, RX waits in poll() forever - it is ready to get a new portion
>>>> of data and return POLLIN, but TX "thinks" exactly the opposite - that RX is full
>>>> of data. Of course, if there were just stalls in TX data handling, it would
>>>> be ok - just a performance degradation, but TX is stuck forever.
>>>
>>> We did it to avoid a lot of credit update messages.
>> Yes, i see
>>> Anyway I think here the main point is why RX is setting SO_RCVLOWAT to 128 KB and then reads only half of it?
>>>
>>> So I think if the users set SO_RCVLOWAT to a value and then RX reads less than it, it is expected to get stuck.
>> That's a really interesting question, I've found nothing about this case in Google (not 100% sure) or POSIX. But
>> I can modify the reproducer: it sets SO_RCVLOWAT to 128Kb BEFORE entering its last poll, where it will get stuck. In this
>> case the behaviour looks more legal: it uses the default SO_RCVLOWAT of 1 and reads 64Kb each time. Finally it sets SO_RCVLOWAT
>> to 128Kb (imagine that it prepares a 128Kb 'read()' buffer) and enters poll() - we will get the same effect: TX will wait
>> for space, RX waits in 'poll()'.
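
The modified reproducer described above can be checked with the same kind of model. This is a rough sketch, not the test from the patch series: it assumes tx has already filled the 256 KB rx buffer, reuses the free_space formula and 64 KB threshold quoted earlier in this thread, and models only the existing credit update condition. The function name `late_rcvlowat` is hypothetical.

```python
KB = 1024
BUF_ALLOC = 256 * KB         # SO_VM_SOCKETS_BUFFER_SIZE value used in the thread
MAX_PKT_BUF_SIZE = 64 * KB   # VIRTIO_VSOCK_MAX_PKT_BUF_SIZE value quoted in the thread
READ_SIZE = 64 * KB          # size of each rx read()


def late_rcvlowat(num_reads, final_rcvlowat):
    """rx reads with the default SO_RCVLOWAT of 1, then raises it before the last poll().

    Returns (rx_waits, tx_credit): whether rx would sleep in the final poll(),
    and how many bytes tx believes it may still send. Hypothetical model only.
    """
    rx_bytes = BUF_ALLOC     # tx already sent 256 KB and waits to send 1 more byte
    fwd_cnt = 0
    last_fwd_cnt = 0
    for _ in range(num_reads):
        # with SO_RCVLOWAT = 1, poll() reports POLLIN while any data is queued
        n = min(READ_SIZE, rx_bytes)
        rx_bytes -= n
        fwd_cnt += n
        free_space = BUF_ALLOC - (fwd_cnt - last_fwd_cnt)
        if free_space < MAX_PKT_BUF_SIZE:
            last_fwd_cnt = fwd_cnt  # credit update sent to tx

    # rx now sets SO_RCVLOWAT to final_rcvlowat and enters poll()
    rx_waits = rx_bytes < final_rcvlowat
    # tx sent BUF_ALLOC bytes, so its credit is BUF_ALLOC - (BUF_ALLOC - last_fwd_cnt)
    tx_credit = last_fwd_cnt
    return rx_waits, tx_credit


if __name__ == "__main__":
    print(late_rcvlowat(3, 128 * KB))
```

With three 64 KB reads followed by SO_RCVLOWAT = 128 KB, the model reports that rx waits in poll() while tx has zero credit, which is the same mutual wait as in the original ten steps.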
> 
> Good point!
> 
>>>
>>> Anyway, since the change will not impact the default behaviour (SO_RCVLOWAT = 1) we can merge this patch, but IMHO we need to explain the case better and improve the test.
>> I see, of course I'm not sure about this change, just want to ask someone who knows this code better
> 
> Yes, it's an RFC, so you did well! :-)
So ok, I'll prepare an RFC version of this patchset (e.g. CV with explanation, kernel patch and test for it)
> 
>>>
>>>>
>>>>>
>>>>> If RX woke up (e.g. SO_RCVLOWAT = 64KB) and read the remaining 64KB,
>>>>> then it would still send the credit update even without this patch and
>>>>> TX will send the 1 byte.
>>>>
>>>> But how will RX wake up in this case? E.g. it calls poll() without a timeout,
>>>> the connection is established, RX ignores signals.
>>>
>>> RX will wake up because SO_RCVLOWAT is 64KB and there are 64 KB in the buffer. Then RX will read it and send the credit update to TX because
>>> free_space is 0.
>> IIUC, I'm talking about the 10 steps above, i.e. RX will never wake up, because TX is waiting for space.
> 
> Yep, but if RX uses SO_RCVLOWAT = 64 KB instead of 128 KB (I mean if RX reads all the bytes it is waiting for, as specified in SO_RCVLOWAT), then RX will send the credit message.
> 
> But there is the case that you mentioned, when SO_RCVLOWAT is changed while executing.
I'll use this case for the test
> 
> Thanks,
> Stefano
> 
Thanks, Arseniy