netdev - Fw: [Bug 217712] New: AF-XDP program in multi-process/multi-threaded configuration IO

Message-ID: <20230726082515.709b87eb@hermes.local>
Date: Wed, 26 Jul 2023 08:25:15 -0700
From: Stephen Hemminger <stephen@...workplumber.org>
To: netdev@...r.kernel.org
Subject: Fw: [Bug 217712] New: AF-XDP program in
 multi-process/multi-threaded configuration IO_PAGEFAULT

Begin forwarded message:

Date: Wed, 26 Jul 2023 13:00:28 +0000
From: bugzilla-daemon@...nel.org
To: stephen@...workplumber.org
Subject: [Bug 217712] New: AF-XDP program in multi-process/multi-threaded configuration IO_PAGEFAULT

https://bugzilla.kernel.org/show_bug.cgi?id=217712

            Bug ID: 217712
           Summary: AF-XDP program in multi-process/multi-threaded
                    configuration IO_PAGEFAULT
           Product: Networking
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: high
          Priority: P3
         Component: Other
          Assignee: stephen@...workplumber.org
          Reporter: joseph.reilly@....edu
        Regression: No

Created attachment 304701
  --> https://bugzilla.kernel.org/attachment.cgi?id=304701&action=edit  
code to reproduce bug

Hello,

I am currently doing research on AF_XDP and I encountered a bug that is present
in multi-process and multi-threaded configurations of AF_XDP programs. I
believe there is a race condition that causes an IO_PAGEFAULT and the entire OS
to crash when it is encountered. This bug can be reproduced using Suricata
release 7.0.0-rc1, or another program where multiple user space processes each
with an AF_XDP socket are created.

I have attached some sample code that should be able to reproduce the bug.
This code creates n processes where n is the number of RX queues specified by
the user. In my experience the higher the number of processes/RX queues used,
the higher the likelihood of triggering the crash. 
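
For readers without the attachment, the one-process-per-RX-queue layout described above can be sketched roughly as follows. This is a minimal illustration and not the attached reproducer: queue_worker() and spawn_queue_workers() are hypothetical names, and the worker body is a placeholder for where a real program would create its UMEM and bind an AF_XDP socket to the given queue.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_QUEUES 64

/* Hypothetical per-queue worker. A real worker would set up a UMEM and
 * an AF_XDP socket bound to queue_id; this placeholder only checks it. */
static int queue_worker(int queue_id)
{
    return queue_id >= 0 ? 0 : 1;
}

/* Fork one child per RX queue and wait for all of them.
 * Returns the number of children that exited with status 0,
 * or -1 if n_queues is out of range or a fork() failed. */
int spawn_queue_workers(int n_queues)
{
    pid_t pids[MAX_QUEUES];
    int started = 0;
    int clean = 0;

    if (n_queues < 0 || n_queues > MAX_QUEUES)
        return -1;

    for (int q = 0; q < n_queues; q++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                      /* stop forking, reap what we started */
        if (pid == 0)
            _exit(queue_worker(q));     /* child: run the worker, then exit */
        pids[started++] = pid;
    }

    for (int i = 0; i < started; i++) {
        int status = 0;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            clean++;
    }

    return started == n_queues ? clean : -1;
}
```

Because all children are forked from the same parent and stay in its process group, a Ctrl-C in the terminal reaches every one of them together.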

To change the number of RX queues, use ethtool to set the number of combined
queues (the supported range may vary depending on the network card):
sudo ethtool -L <interface> combined <number of RX queues>

Compile the code with make and run it as follows:
sudo -E ./xdp_main.o <interface> <number of child processes> consec

To get the crash to show up, lots of traffic needs to be sent to the network
interface. In our experimental setup, a machine using Pktgen is sending traffic
to the machine running the AF_XDP code at max line rate. Using Pktgen, vary the
IP/MAC addresses of each packet to make sure the packets are somewhat evenly
distributed across each RX queue. This may help with reproducing the bug. Also
be sure the interface is set to promiscuous mode.

While sending traffic at max line rate, send a SIGINT to the AF_XDP program
receiving the traffic to terminate it. More often than not, an IO_PAGEFAULT
will occur. Also attached are some screenshots of the terminal and of the
output our server gives.

The bug occurs because each process has the same STDIN file descriptor and, as
a result, each child process gets the same SIGINT signal at the same time,
causing them all to terminate at once. During this shutdown, I believe a race
condition is reached where the XDP program is still receiving packets and is
trying to write them to a UMEM that no longer exists. The order of operations
to cause this would be:
1. XDP program looks up AF_XDP socket in XSKS_MAP 
2. User space program deletes UMEM and/or AF_XDP socket 
3. XDP program tries to write packet to UMEM
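
The ordering above can be modelled in plain user-space C to make the suspected window explicit. This is only a toy model of the hypothesis, not kernel or driver code: the types and the function interleaving_faults() are hypothetical, "mapped" stands in for the UMEM still being valid, and the model says nothing about what the kernel actually does at teardown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three events from the list above. */
enum step { STEP_LOOKUP, STEP_TEARDOWN, STEP_WRITE };

struct umem_model { bool mapped; };             /* stand-in for a UMEM */
struct xsk_model  { struct umem_model *umem; }; /* stand-in for a socket */

/* Replays the steps in the given order and returns true if the write
 * step targeted a UMEM that had already been torn down. */
bool interleaving_faults(const enum step *order, size_t n)
{
    struct umem_model umem = { .mapped = true };
    struct xsk_model sock = { .umem = &umem };
    struct xsk_model *map_slot = &sock;  /* models the XSKS_MAP entry */
    struct xsk_model *cached = NULL;     /* result of the XDP-side lookup */
    bool fault = false;

    assert(order != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        switch (order[i]) {
        case STEP_LOOKUP:
            cached = map_slot;           /* 1. look up socket in the map */
            break;
        case STEP_TEARDOWN:
            map_slot = NULL;             /* 2. user space removes socket */
            umem.mapped = false;         /*    and releases the UMEM */
            break;
        case STEP_WRITE:
            /* 3. write via the cached pointer; stale if torn down */
            if (cached != NULL && !cached->umem->mapped)
                fault = true;
            break;
        }
    }
    return fault;
}
```

In this model only the lookup, teardown, write ordering reports a fault. If teardown happens before the lookup, the lookup finds no socket and the packet is simply not delivered; if the write completes before teardown, nothing stale is touched.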

This can also be reproduced with Suricata as stated earlier with a similar
traffic load as described for my personal program.

If more clarification is needed, please reach out to me. I would also like to
know whether this behavior is by design or, if not, what is causing the bug. I
look forward to hearing from you!

Best,
Joseph Reilly

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are the assignee for the bug.