netdev - Re: [PATCH V2 net-next 1/7] net: ena: avoid unnecessary rearming of interrupt vector when busy-polling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <pj41zlk0z7rypx.fsf@ua97a68a4e7db56.ant.amazon.com>
Date:   Mon, 13 Jul 2020 22:39:38 +0300
From:   Shay Agroskin <shayagr@...zon.com>
To:     Eric Dumazet <eric.dumazet@...il.com>
CC:     <akiyano@...zon.com>, <davem@...emloft.net>,
        <netdev@...r.kernel.org>, <dwmw@...zon.com>, <zorik@...zon.com>,
        <matua@...zon.com>, <saeedb@...zon.com>, <msw@...zon.com>,
        <aliguori@...zon.com>, <nafea@...zon.com>, <gtzalik@...zon.com>,
        <netanel@...zon.com>, <alisaidi@...zon.com>, <benh@...zon.com>,
        <ndagan@...zon.com>, <sameehj@...zon.com>
Subject: Re: [PATCH V2 net-next 1/7] net: ena: avoid unnecessary rearming of
 interrupt vector when busy-polling


Eric Dumazet <eric.dumazet@...il.com> writes:

> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On 7/12/20 3:36 PM, akiyano@...zon.com wrote:
>> From: Arthur Kiyanovski <akiyano@...zon.com>
>>
>> In napi busy-poll mode, the kernel invokes the napi handler of the
>> device repeatedly to poll the NIC's receive queues. This process
>> repeats until a timeout, specific for each connection, is up.
>> By polling packets in busy-poll mode the user may gain lower latency
>> and higher throughput (since the kernel no longer waits for interrupts
>> to poll the queues) in expense of CPU usage.
>>
>> Upon completing a napi routine, the driver checks whether
>> the routine was called by an interrupt handler. If so, the driver
>> re-enables interrupts for the device. This is needed since an
>> interrupt routine invocation disables future invocations until
>> explicitly re-enabled.
>>
>> The driver avoids re-enabling the interrupts if they were not disabled
>> in the first place (e.g. if driver in busy mode).
>> Originally, the driver checked whether interrupt re-enabling is needed
>> by reading the 'ena_napi->unmask_interrupt' variable. This atomic
>> variable was set upon interrupt and cleared after re-enabling it.
>>
>> In the 4.10 Linux version, the 'napi_complete_done' call was changed
>> so that it returns 'false' when device should not re-enable
>> interrupts, and 'true' otherwise. The change includes reading the
>> "NAPIF_STATE_IN_BUSY_POLL" flag to check if the napi call is in
>> busy-poll mode, and if so, return 'false'.
>> The driver was changed to re-enable interrupts according to this
>> routine's return value.
>> The Linux community rejected the use of the
>> 'ena_napi->unmaunmask_interrupt' variable to determine whether
>> unmasking is needed, and urged to use napi_napi_complete_done()
>> return value solely.
>> See https://lore.kernel.org/patchwork/patch/741149/ for more details
>
> Yeah, and I see you did not bother to CC me on this new submission.
>

I apologize for this. We should have been more careful. We'll take
better notice next time

>>
>> As explained, a busy-poll session exists for a specified timeout
>> value, after which it exits the busy-poll mode and re-enters it later.
>> This leads to many invocations of the napi handler where
>> napi_complete_done() false indicates that interrupts should be
>> re-enabled.
>> This creates a bug in which the interrupts are re-enabled
>> unnecessarily.
>> To reproduce this bug:
>>     1) echo 50 | sudo tee /proc/sys/net/core/busy_poll
>>     2) echo 50 | sudo tee /proc/sys/net/core/busy_read
>>     3) Add counters that check whether
>>     'ena_unmask_interrupt(tx_ring, rx_ring);'
>>     is called without disabling the interrupts in the first
>>     place (i.e. with calling the interrupt routine
>>     ena_intr_msix_io())
>>
>> Steps 1+2 enable busy-poll as the default mode for new connections.
>>
>> The busy poll routine rearms the interrupts after every session by
>> design, and so we need to add an extra check that the interrupts were
>> masked in the first place.
>>
>> Signed-off-by: Shay Agroskin <shayagr@...zon.com>
>> Signed-off-by: Arthur Kiyanovski <akiyano@...zon.com>
>> ---
>>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 7 ++++++-
>>  drivers/net/ethernet/amazon/ena/ena_netdev.h | 1 +
>>  2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
>> index 91be3ffa1c5c..90c0fe15cd23 100644
>> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
>> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
>> @@ -1913,7 +1913,9 @@ static int ena_io_poll(struct napi_struct *napi, int budget)
>>               /* Update numa and unmask the interrupt only when schedule
>>                * from the interrupt context (vs from sk_busy_loop)
>>                */
>> -             if (napi_complete_done(napi, rx_work_done)) {
>> +             if (napi_complete_done(napi, rx_work_done) &&
>> +                 READ_ONCE(ena_napi->interrupts_masked)) {
>> +                     WRITE_ONCE(ena_napi->interrupts_masked, false);
>>                       /* We apply adaptive moderation on Rx path only.
>>                        * Tx uses static interrupt moderation.
>>                        */
>> @@ -1961,6 +1963,9 @@ static irqreturn_t ena_intr_msix_io(int irq, void *data)
>>
>>       ena_napi->first_interrupt = true;
>>
>> +     WRITE_ONCE(ena_napi->interrupts_masked, true);
>> +     smp_wmb(); /* write interrupts_masked before calling napi */
>
> It is not clear where is the paired smp_wmb()
>
Can you please explain what you mean ? The idea of adding the store barrier here is to ensure that the WRITE_ONCE(…) invocation is executed before
invoking the napi soft irq. From what I gathered using this command would result in compiler barrier (which would prevent it from executing the bool store after napi scheduling) on x86
and a memory barrier on ARM64 machines which have a weaker consistency model.
>> +
>>       napi_schedule_irqoff(&ena_napi->napi);
>>
>>       return IRQ_HANDLED;
>> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
>> index ba030d260940..89304b403995 100644
>> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
>> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
>> @@ -167,6 +167,7 @@ struct ena_napi {
>>       struct ena_ring *rx_ring;
>>       struct ena_ring *xdp_ring;
>>       bool first_interrupt;
>> +     bool interrupts_masked;
>>       u32 qid;
>>       struct dim dim;
>>  };
>>
>
> Note that writing/reading over bool fields from hard irq context without proper sync is not generally allowed.
>
> Compiler could perform RMW over plain 32bit words.

Doesn't the READ/WRITE_ONCE macros prevent the compiler from reordering these instructions ? Also from what I researched (please correct me if I'm wrong here)
both x86 and ARM don't allow reordering LOAD and STORE when they access
the same variable (register or memory address).
>
> Sometimes we accept the races, but in this context this might be bad.

As long a the writing of the flag is atomic in the sense that the value in memory isn't some hybrid value of two parallel stores, the existing race cannot result in a dead lock or
leaving the interrupt vector masked. Am I missing something ?

Thank you for taking the time to look at this patch.

Shay