netdev - Re: Realtek 8139 problem on 486.

Message-ID: <CAK8P3a3vnnaYf6+v9N1WmH0N7uG55DrC=Hy71mYi4Kt+FXBRuw@mail.gmail.com>
Date:   Sun, 13 Jun 2021 00:41:37 +0200
From:   Arnd Bergmann <arnd@...nel.org>
To:     Nikolai Zhubr <zhubr.2@...il.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Re: Realtek 8139 problem on 486.

On Sat, Jun 12, 2021 at 7:40 PM Nikolai Zhubr <zhubr.2@...il.com> wrote:
> 09.06.2021 10:09, Arnd Bergmann:
> [...]
> > If it's only a bit slower, that is not surprising, I'd expect it to
> > use fewer CPU
> > cycles though, as it avoids the expensive polling.
> >
> > There are a couple of things you could do to make it faster without reducing
> > reliability, but I wouldn't recommend major surgery on this driver, I was just
> > going for the simplest change that would make it work right with broken
> > IRQ settings.
> >
> > You could play around a little with the order in which you process events:
> > doing RX first would help free up buffer space in the card earlier, possibly
> > alternating between TX and RX one buffer at a time, or processing both
> > in a loop until the budget runs out would also help.
>
> I've modified your patch so as to quickly test several approaches within
> a single file by just switching some conditional defines.
> My diff against 4.14 is here:
> https://pastebin.com/mgpLPciE
>
> The tests were performed using a simple shell script:
> https://pastebin.com/Vfr8JC3X
>
> Each cell in the resulting table shows:
> - tcp sender/receiver (Mbit/s) as reported by iperf3 (total)
> - udp sender/receiver (Mbit/s) as reported by iperf3 (total)
> - accumulated cpu utilization during tcp+udp test.
>
> The first line in the table essentially corresponds to a standard
> unmodified kernel. The second line corresponds to your initially
> proposed approach.
>
> All tests run with the same physical instance of 8139D card against the
> same server.
>
> (The table best viewed in monospace font)
> +-------------------+-------------+-----------+-----------+
> | #Defines          ; i486dx2/66  ; Pentium3/ ; PentiumE/ |
> |                   ; (Edge IRQ)  ;  1200     ; Dual 2600 |
> +-------------------+-------------+-----------+-----------+
> | TX_WORK_IN_IRQ 1  ;             ; tcp 86/86 ; tcp 94/94 |
> | TX_WORK_IN_POLL 0 ;  (fails)    ; udp 96/96 ; udp 96/96 |
> | LOOP_IN_IRQ 0     ;             ; cpu 59%   ; cpu 15%   |
> | LOOP_IN_POLL 0    ;             ;           ;           |
> +-------------------+-------------+-----------+-----------+
> | TX_WORK_IN_IRQ 0  ; tcp 9.4/9.1 ; tcp 88/88 ; tcp 95/94 |
> | TX_WORK_IN_POLL 1 ; udp 5.5/5.5 ; udp 96/96 ; udp 96/96 |
> | LOOP_IN_IRQ 0     ; cpu 98%     ; cpu 55%   ; cpu 19%   |
> | LOOP_IN_POLL 0    ;             ;           ;           |
> +-------------------+-------------+-----------+-----------+
> | TX_WORK_IN_IRQ 0  ; tcp 9.0/8.7 ; tcp 87/87 ; tcp 95/94 |
> | TX_WORK_IN_POLL 1 ; udp 5.8/5.8 ; udp 96/96 ; udp 96/96 |
> | LOOP_IN_IRQ 0     ; cpu 98%     ; cpu 58%   ; cpu 20%   |
> | LOOP_IN_POLL 1    ;             ;           ;           |
> +-------------------+-------------+-----------+-----------+
> | TX_WORK_IN_IRQ 1  ; tcp 7.3/7.3 ; tcp 87/86 ; tcp 94/94 |
> | TX_WORK_IN_POLL 0 ; udp 6.2/6.2 ; udp 96/96 ; udp 96/96 |
> | LOOP_IN_IRQ 1     ; cpu 99%     ; cpu 57%   ; cpu 17%   |
> | LOOP_IN_POLL 0    ;             ;           ;           |
> +-------------------+-------------+-----------+-----------+
> | TX_WORK_IN_IRQ 1  ; tcp 6.5/6.5 ; tcp 88/88 ; tcp 94/94 |
> | TX_WORK_IN_POLL 1 ; udp 6.1/6.1 ; udp 96/96 ; udp 96/96 |
> | LOOP_IN_IRQ 1     ; cpu 99%     ; cpu 55%   ; cpu 16%   |
> | LOOP_IN_POLL 1    ;             ;           ;           |
> +-------------------+-------------+-----------+-----------+
> | TX_WORK_IN_IRQ 1  ; tcp 5.7/5.7 ; tcp 87/87 ; tcp 95/94 |
> | TX_WORK_IN_POLL 1 ; udp 6.1/6.1 ; udp 96/96 ; udp 96/96 |
> | LOOP_IN_IRQ 1     ; cpu 98%     ; cpu 56%   ; cpu 15%   |
> | LOOP_IN_POLL 0    ;             ;           ;           |
> +-------------------+-------------+-----------+-----------+
>
> Hopefully this helps to choose the most beneficial approach.

I think several variants can just be eliminated without looking
at the numbers:

- doing the TX work in the irq handler (with the loop) but not in
  the poll function is incorrect with edge interrupts: it has
  the same race as before, you just make it much harder to hit

- doing the TX work in both the irq handler and the poll function
  is probably not helpful, you just do extra work

- calling the tx cleanup loop in a second loop is not helpful
  if you don't do anything interesting after finding that all
  TX frames are done.

For best performance I would suggest restructuring the poll
function from your current

  while (boguscnt--) {
       handle_rare_events();
       while (tx_pending())
             handle_one_tx();
  }
  while (rx_pending() && work_done < budget)
         work_done += handle_one_rx();

to something like

   handle_rare_events();
   do {
      if (rx_pending())
          work_done += handle_one_rx();
      if (tx_pending())
          work_done += handle_one_tx();
   } while ((tx_pending() || rx_pending()) && work_done < budget);

This way, a single call to the poll function catches the most
events, including new work that comes in while you are
processing the pending events.
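
As a rough illustration of that interleaved loop, here is a minimal
user-space sketch in C. It is not driver code: struct sim_nic, its
counters and sim_poll() are hypothetical stand-ins for the card state
and the poll function. The extra budget check before the TX step is
an assumption added here so that work_done never exceeds the budget;
the pseudocode above does not have it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the card state, not the real driver structures. */
struct sim_nic {
    int rx_queued;   /* received frames waiting to be processed */
    int tx_done;     /* transmitted frames waiting for cleanup */
    int rare_events; /* error/link events, handled once per poll */
};

static bool rx_pending(const struct sim_nic *nic) { return nic->rx_queued > 0; }
static bool tx_pending(const struct sim_nic *nic) { return nic->tx_done > 0; }

static void handle_rare_events(struct sim_nic *nic) { nic->rare_events = 0; }

/* Each handler processes exactly one frame and reports one unit of work. */
static int handle_one_rx(struct sim_nic *nic) { nic->rx_queued--; return 1; }
static int handle_one_tx(struct sim_nic *nic) { nic->tx_done--; return 1; }

/* Alternate between one RX and one TX frame until idle or out of budget. */
static int sim_poll(struct sim_nic *nic, int budget)
{
    int work_done = 0;

    assert(budget > 0);
    handle_rare_events(nic);
    do {
        if (rx_pending(nic))
            work_done += handle_one_rx(nic);
        /* Assumption: re-check the budget so we never overshoot it. */
        if (tx_pending(nic) && work_done < budget)
            work_done += handle_one_tx(nic);
    } while ((tx_pending(nic) || rx_pending(nic)) && work_done < budget);

    return work_done;
}
```

With 3 RX and 5 TX frames queued and a budget of 64, sim_poll()
drains everything and returns 8. With 100 of each and a budget of 9,
it stops after 5 RX and 4 TX frames.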

Or, to keep the change simpler, keep the inner loop in the tx
and rx processing, doing all rx events before moving on
to processing all tx events, but then looping back to try both
again, until either the budget runs out or no further events
are pending.
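
The simpler variant could look roughly like the following user-space
sketch in C. Again this is only an illustration under assumptions:
struct toy_card, drain_rx(), drain_tx() and toy_poll_batched() are
hypothetical names, and the late_rx field merely simulates frames
that arrive while TX cleanup is running, to show why looping back
is useful.

```c
#include <assert.h>

/* Hypothetical stand-in for the card state, not the real driver structures. */
struct toy_card {
    int rx_queued; /* received frames waiting to be processed */
    int tx_done;   /* transmitted frames waiting for cleanup */
    int late_rx;   /* simulated frames that arrive during TX cleanup */
};

/* Inner RX loop: process up to 'limit' received frames. */
static int drain_rx(struct toy_card *c, int limit)
{
    int n = 0;

    while (c->rx_queued > 0 && n < limit) {
        c->rx_queued--;
        n++;
    }
    return n;
}

/* Inner TX loop: clean up to 'limit' completed transmissions. */
static int drain_tx(struct toy_card *c, int limit)
{
    int n = 0;

    /* Simulate new RX work showing up while we are busy with TX. */
    c->rx_queued += c->late_rx;
    c->late_rx = 0;

    while (c->tx_done > 0 && n < limit) {
        c->tx_done--;
        n++;
    }
    return n;
}

/* All RX, then all TX, then try both again until idle or out of budget. */
static int toy_poll_batched(struct toy_card *c, int budget)
{
    int work_done = 0;

    assert(budget > 0);
    while (work_done < budget && (c->rx_queued > 0 || c->tx_done > 0)) {
        work_done += drain_rx(c, budget - work_done);
        work_done += drain_tx(c, budget - work_done);
    }
    return work_done;
}
```

Starting with 2 RX frames, 3 TX frames and 4 late RX frames, the
first pass handles 5 frames and the second pass picks up the 4 late
ones, so a single call returns 9 instead of leaving work behind.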

      Arnd