Message-ID: <20230726082515.709b87eb@hermes.local>
Date: Wed, 26 Jul 2023 08:25:15 -0700
From: Stephen Hemminger <stephen@...workplumber.org>
To: netdev@...r.kernel.org
Subject: Fw: [Bug 217712] New: AF-XDP program in
multi-process/multi-threaded configuration IO_PAGEFAULT
Begin forwarded message:
Date: Wed, 26 Jul 2023 13:00:28 +0000
From: bugzilla-daemon@...nel.org
To: stephen@...workplumber.org
Subject: [Bug 217712] New: AF-XDP program in multi-process/multi-threaded configuration IO_PAGEFAULT
https://bugzilla.kernel.org/show_bug.cgi?id=217712
Bug ID: 217712
Summary: AF-XDP program in multi-process/multi-threaded
configuration IO_PAGEFAULT
Product: Networking
Version: 2.5
Hardware: All
OS: Linux
Status: NEW
Severity: high
Priority: P3
Component: Other
Assignee: stephen@...workplumber.org
Reporter: joseph.reilly@....edu
Regression: No
Created attachment 304701
--> https://bugzilla.kernel.org/attachment.cgi?id=304701&action=edit
code to reproduce bug
Hello,
I am currently doing research on AF_XDP and I have encountered a bug that affects
multi-process and multi-threaded configurations of AF_XDP programs. I believe a
race condition causes an IO_PAGEFAULT that crashes the entire OS. The bug can be
reproduced with Suricata release 7.0.0-rc1, or with any other program that
creates multiple user-space processes, each with its own AF_XDP socket.
I have attached sample code that should be able to reproduce the bug. The code
creates n processes, where n is the number of RX queues specified by the user.
In my experience, the higher the number of processes/RX queues used, the higher
the likelihood of triggering the crash.
To change the number of RX queues, use ethtool to set the number of combined RX
queues (the exact command may vary depending on the network card):
sudo ethtool -L <interface> combined <number of RX queues>
Compile the code using make and run the code as such:
sudo -E ./xdp_main.o <interface> <number of child processes> consec
To get the crash to show up, a large volume of traffic needs to be sent to the
network interface. In our experimental setup, a machine running Pktgen sends
traffic to the machine running the AF_XDP code at maximum line rate. With
Pktgen, vary the IP/MAC addresses of each packet so that the packets are spread
roughly evenly across the RX queues; this helps with reproducing the bug. Also
be sure the interface is set to promiscuous mode.
While traffic is arriving at maximum line rate, send a SIGINT to the AF_XDP
program receiving it to terminate the program. An IO_PAGEFAULT will sometimes
occur; in our experience it happens more often than not. Also attached are some
screenshots of the terminal and of the output our server gives.
The bug is triggered because all of the child processes inherit the parent's
controlling terminal and run in the same foreground process group, so a
terminal-generated SIGINT is delivered to every child at once, causing them all
to terminate simultaneously. During this teardown, I believe a race condition
is reached where the kernel-side XDP program is still receiving packets and
tries to write them into a UMEM that no longer exists. The order of operations
to cause this would be:
1. XDP program looks up AF_XDP socket in XSKS_MAP
2. User space program deletes UMEM and/or AF_XDP socket
3. XDP program tries to write packet to UMEM
As stated earlier, this can also be reproduced with Suricata under a traffic
load similar to the one described above for my own program.
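If the ordering above is right, a userspace mitigation would be to detach the
socket from the XSKS_MAP before releasing the UMEM, so the kernel-side XDP
program can no longer look the socket up while its memory is being freed. A
rough, untested sketch of that teardown order, assuming libbpf/libxdp's
bpf_map_delete_elem(), xsk_socket__delete(), and xsk_umem__delete()
(xsks_map_fd and queue_id are illustrative names):

```c
/* Hypothetical SIGINT-driven teardown; error handling omitted. */
static void shutdown_xsk(int xsks_map_fd, __u32 queue_id,
                         struct xsk_socket *xsk, struct xsk_umem *umem)
{
    /* 1. Unlink the socket from the XSKS_MAP so the kernel XDP program
     *    stops redirecting packets to it. */
    bpf_map_delete_elem(xsks_map_fd, &queue_id);

    /* 2. Ideally, drain or wait for in-flight descriptors here. */

    /* 3. Only now tear down the socket and its backing UMEM. */
    xsk_socket__delete(xsk);
    xsk_umem__delete(umem);
}
```

Whether this fully closes the window depends on how the kernel synchronizes map
lookups with socket teardown; the report suggests it may not, which is why this
looks like a kernel-side race rather than purely an application ordering issue.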
If more clarification is needed, please reach out to me. I would also like to
know whether this behavior is an intended design or is indeed a bug. I look
forward to hearing from you!
Best,
Joseph Reilly
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are the assignee for the bug.