linux-kernel - Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

Message-ID: <a3c15dff-c165-57c7-16f6-072e251a9368@linux.intel.com>
Date:   Fri, 23 Jun 2023 14:18:38 +0800
From:   Baolu Lu <baolu.lu@...ux.intel.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
Cc:     baolu.lu@...ux.intel.com, Kevin Tian <kevin.tian@...el.com>,
        Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
        Robin Murphy <robin.murphy@....com>,
        Jean-Philippe Brucker <jean-philippe@...aro.org>,
        Nicolin Chen <nicolinc@...dia.com>,
        Yi Liu <yi.l.liu@...el.com>,
        Jacob Pan <jacob.jun.pan@...ux.intel.com>,
        iommu@...ts.linux.dev, linux-kselftest@...r.kernel.org,
        virtualization@...ts.linux-foundation.org,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCHES 00/17] IOMMUFD: Deliver IO page faults to user space

On 5/31/23 8:33 AM, Jason Gunthorpe wrote:
> On Tue, May 30, 2023 at 01:37:07PM +0800, Lu Baolu wrote:
>> Hi folks,
>>
>> This series implements the functionality of delivering IO page faults to
>> user space through the IOMMUFD framework. The use case is nested
>> translation, where modern IOMMU hardware supports two-stage translation
>> tables. The second-stage translation table is managed by the host VMM
>> while the first-stage translation table is owned by the user space.
>> Hence, any IO page fault that occurs on the first-stage page table
>> should be delivered to the user space and handled there. The user space
>> should then send the handling result back down to the device through
>> the IOMMUFD response uAPI.
>>
>> User space indicates its capability of handling IO page faults by setting
>> a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
>> will then set up its infrastructure for page fault delivery. Together
>> with the iopf-capable flag, user space should also provide an eventfd
>> on which it will listen for page fault messages coming up from the kernel.
>>
>> On a successful return of the allocation of iopf-capable HWPT, a fault
>> fd will be returned. User space can open and read fault messages from it
>> once the eventfd is signaled.
> This is a performance path so we really need to think about this more,
> polling on an eventfd and then reading a different fd is not a good
> design.
> 
> What I would like is to have a design from the start that fits into
> io_uring, so we can have pre-posted 'recvs' in io_uring that just get
> completed at high speed when PRIs come in.
> 
> This suggests that the PRI should be delivered via read() on a single
> FD and pollability on the single FD without any eventfd.

I will remove the eventfd and provide a single FD for both read() and
write(). User space reads the FD to retrieve fault messages and writes
to it to respond to the faults it has handled. User space could use
io_uring for asynchronous I/O. A sample user space design could look
like this:

[pseudo code for discussion only]

	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *fault = malloc(IOPF_SIZE);
	void *resp = malloc(IOPF_RESPONSE_SIZE);

	io_uring_queue_init(IOPF_ENTRIES, &ring, 0);

	while (1) {
		// Post a read for the next fault message
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, iopf_fd, fault, IOPF_SIZE, 0);
		io_uring_sqe_set_data(sqe, fault);
		io_uring_submit(&ring);

		// Wait for the read to complete
		if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);

		// Handle the page fault and fill in the response
		handle_page_fault(fault, resp);

		// Respond to the fault through the same FD
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, iopf_fd, resp, IOPF_RESPONSE_SIZE, 0);
		io_uring_sqe_set_data(sqe, resp);
		io_uring_submit(&ring);

		// Reap the write completion
		if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);
	}
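
For user space that does not use io_uring, the same single-FD model
should also work with plain poll(), read() and write(). Below is a
minimal sketch under stated assumptions: struct fault_msg, struct
fault_resp and the response code value are hypothetical layouts (the
uAPI is not defined yet), and an AF_UNIX socketpair stands in for the
fault FD so the round trip can be exercised without any kernel support.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical message layouts; the real uAPI is not defined in this thread. */
struct fault_msg  { uint64_t addr; uint32_t pasid; uint32_t grpid; };
struct fault_resp { uint32_t pasid; uint32_t grpid; uint32_t code; };

/*
 * Wait until fd is readable, read one fault message and write one
 * response to the same fd. Returns 0 on success, -1 on error.
 */
int serve_one_fault(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	struct fault_msg msg;
	struct fault_resp resp;

	if (poll(&pfd, 1, -1) != 1 || !(pfd.revents & POLLIN))
		return -1;
	if (read(fd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		return -1;

	/* A real handler would fix up the first-stage page table here. */
	resp.pasid = msg.pasid;
	resp.grpid = msg.grpid;
	resp.code = 0;	/* hypothetical "success" code */

	if (write(fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
		return -1;
	return 0;
}

/* Stand-in for the kernel side: one end of a socketpair plays the fault FD. */
int demo_roundtrip(void)
{
	int sv[2];
	struct fault_msg msg = { .addr = 0x1000, .pasid = 1, .grpid = 7 };
	struct fault_resp resp = { 0 };
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	/* Inject one fault, serve it, then check the response that comes back. */
	if (write(sv[0], &msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
	    serve_one_fault(sv[1]) == 0 &&
	    read(sv[0], &resp, sizeof(resp)) == (ssize_t)sizeof(resp) &&
	    resp.grpid == msg.grpid && resp.code == 0)
		ret = 0;

	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Calling demo_roundtrip() from a main() runs one fault through the
loop; in a real VMM, serve_one_fault() would be called on the fault FD
returned by the HWPT allocation instead of on a socket.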

Did I understand you correctly?

Best regards,
baolu