Message-ID: <4a7bf2d5-9f5d-4618-b91a-a71a8917a79d@fujitsu.com>
Date: Mon, 5 Jan 2026 07:39:12 +0000
From: "Zhijian Li (Fujitsu)" <lizhijian@...itsu.com>
To: "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"zyjzyj2000@...il.com" <zyjzyj2000@...il.com>, "jgg@...pe.ca" <jgg@...pe.ca>,
	"leon@...nel.org" <leon@...nel.org>, Yi Zhang <yi.zhang@...hat.com>, Bob
 Pearson <rpearsonhpe@...il.com>
Subject: Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes !=
 PAGE_SIZE



On 05/01/2026 14:55, Zhijian Li (Fujitsu) wrote:
> +Bob
> 
> 
> On 26/12/2025 17:52, Li Zhijian wrote:
>> The current implementation incorrectly handles memory regions (MRs) with
>> page sizes different from the system PAGE_SIZE. The core issue is that
>> rxe_set_page() is called in steps of mr->page_size, but the
>> page_list stores individual struct page pointers, each representing
>> PAGE_SIZE of memory.
>>
>> Problem scenarios with concrete examples:
>>
>> 1. mr->page_size > PAGE_SIZE (e.g., 64K MR with 4K system pages):
>>
>>      Suppose: PAGE_SIZE=4K, mr->page_size=64K, mr->ibmr.iova=0x13010
>>      When rxe_set_page() is called with dma_addr=0x10000 (64K-aligned),
>>      it stores only one 4K page at page_list[0], but should store 16 pages.
>>
>>      Accessing iova=0x13034:
>>      - Current code: index = (0x13034 >> 16) - (0x13010 >> 16) = 0
>>                     offset = 0x13034 & (64K-1) = 0x3034
>>      This calculates an offset within a 64K page, but page_list[0] covers only 4K.
>>
>>      - Expected: The iova=0x13034 should map to:
>>                     base_align = 0x13010 & ~(64K-1) = 0x10000
>>                     index = (0x13034 - 0x10000) / 4K = 0x3034 / 0x1000 = 3
>>                     offset = 0x13034 & (4K-1) = 0x034
>>                     So: page_list[3] with offset 0x34
>>
>> 2. mr->page_size < PAGE_SIZE (e.g., 2K MR with 4K system pages):
>>
>>      Suppose: PAGE_SIZE=4K, mr->page_size=2K, mr->ibmr.iova=0x1100
>>      When rxe_set_page() is called with dma_addr=0x1000, it stores a 4K page
>>      at page_list[0]. Another call with dma_addr=0x1800 should not store
>>      a new page since 0x1800 is within the same 4K page as 0x1000.
>>
>>      Accessing iova=0x1890:
>>      - Current code: index = (0x1890 >> 11) - (0x1100 >> 11) = 3 - 2 = 1
>>                     offset = 0x1890 & (2K-1) = 0x90
>>                     This assumes page_list[1] exists and is a 2K page.
>>
>>      - Expected: Both 0x1000 and 0x1800 are within the same 4K page:
>>                     index = (0x1890 >> 12) - (0x1100 >> 12) = 0
>>                     offset = 0x1890 & (4K-1) = 0x890
>>                     So: page_list[0] with offset 0x890
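>>
>> For illustration only (not part of the patch), a minimal userspace
>> sketch that replays the "Expected" calculations from the two
>> scenarios above; locate() is just an illustrative helper and all
>> constants are the example values, not driver state:
>>
>>      #include <inttypes.h>
>>      #include <stdio.h>
>>      #include <stdint.h>
>>
>>      /* locate the PAGE_SIZE page and offset backing an iova */
>>      static void locate(uint64_t mr_page_size, uint64_t page_size,
>>                         uint64_t iova_base, uint64_t iova)
>>      {
>>              uint64_t index, offset;
>>
>>              if (mr_page_size > page_size)   /* scenario 1 */
>>                      index = (iova - (iova_base & ~(mr_page_size - 1)))
>>                                      / page_size;
>>              else                            /* scenario 2 */
>>                      index = (iova / page_size) - (iova_base / page_size);
>>              offset = iova & (page_size - 1);
>>
>>              printf("page_list[%" PRIu64 "] + 0x%" PRIx64 "\n", index, offset);
>>      }
>>
>>      int main(void)
>>      {
>>              locate(0x10000, 0x1000, 0x13010, 0x13034); /* page_list[3] + 0x34 */
>>              locate(0x800, 0x1000, 0x1100, 0x1890);     /* page_list[0] + 0x890 */
>>              return 0;
>>      }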
>>
>> Yi Zhang reported a kernel panic[1] years ago related to this defect.
>>
>> The fix introduces:
>>
>> 1. Enhanced iova-to-index conversion with proper alignment handling:
>>      - For mr->page_size > PAGE_SIZE: Uses the aligned base address
>>        (mr->ibmr.iova & mr->page_mask) as the reference, correctly
>>        calculating which PAGE_SIZE sub-page contains the iova
>>      - For mr->page_size <= PAGE_SIZE: Uses PAGE_SIZE shift arithmetic
>>        to ensure each page_list entry corresponds to a PAGE_SIZE page
>>
>> 2. Page splitting in rxe_set_page():
>>      - When mr->page_size > PAGE_SIZE, set_pages_per_mr() splits each
>>        MR page into multiple PAGE_SIZE pages stored consecutively
>>      - For mr->page_size <= PAGE_SIZE, maintains original behavior
>>
>> 3. Always use PAGE_SIZE for offset calculation:
>>      - Since page_list stores PAGE_SIZE pages, offsets must be within
>>        [0, PAGE_SIZE-1]
>>
>> 4. Compatibility checks:
>>      - Ensures mr->page_size and PAGE_SIZE have a compatible relationship
>>        (one is multiple of the other) for correct mapping
>>
>> The solution ensures correct iova-to-va conversion for all MR page sizes
>> while maintaining the existing IB verbs semantics for MR registration
>> and memory access.
>>
>> This patch enables srp, rnbd and nvme over rxe in a 64K system page
>> environment.
>>
>> Tests (on both 4K and 64K PAGE_SIZE):
>> - rdma-core/pytests
>>     $ ./build/bin/run_tests.py  --dev eth0_rxe
>> - blktest:
>>     $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd
>>
>> In the 64K environment, srp/012 fails, while it passes in the 4K environment.
> 
> 
>>    static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>> +	if (mr_page_size(mr) > PAGE_SIZE)
>> +		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
>> +	else
>> +		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>>    }
>>    
>> +/*
>> + * Always return offset within a PAGE_SIZE page since page_list
>> + * stores individual struct page pointers, each representing
>> + * PAGE_SIZE of memory.
>> + */
>>    static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return iova & (mr_page_size(mr) - 1);
>> +	return iova & (PAGE_SIZE - 1);
>>    }
> 
> 
> After digging into the behavior during the srp/012 test again, it turns out this fix is incomplete.
> The current xarray page_list approach cannot correctly map memory regions composed of two or more scatter-gather segments.
> 
> 
> Consider the following scenarios:
> (I'm assuming a 1:1 mapping from dma_addr to va for simplicity in the examples)
> 
> =========================
> I) page_size < PAGE_SIZE
> Assuming the dma_addr to va mapping is 1:1
> MR page_size = 4K
> PAGE_SIZE = 64K
> ibmr->iova = 0x181000;
> 
> With this patch:
> sg_dma_address(sg[0]): 0x181000(va: 0x181000), dma_len: 0x1000, save to page_list[0]=page(0x180000)
> sg_dma_address(sg[1]): 0x173000(va: 0x173000), dma_len: 0x1000, save to page_list[1]=page(0x170000)
> 
> Now accessing iova: 0x182000, expectation: va: 0x173000
> In current patch,
>> (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> index: (0x182000 >> 16) - (0x181000 >> 16)=0x18-0x18=0
> 
>> iova & (PAGE_SIZE - 1)
> offset: 0x182000 & (PAGE_SIZE -1) = 0x2000
> 
> va: va(page_list[0]) + 0x2000 = 0x180000 + 0x2000 = 0x182000(incorrect)

mr->page_shift should be 12 (it is derived from the MR page_size of 4K, not from PAGE_SIZE), so

re-edit:

> (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
index: (0x182000 >> 12) - (0x181000 >> 12)=0x182-0x181=1

> iova & (PAGE_SIZE - 1)
offset: 0x182000 & (PAGE_SIZE -1) = 0x2000
  
va: va(page_list[1]) + 0x2000 = 0x170000 + 0x2000 = 0x172000(incorrect)
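
For illustration, this failure replays in a few lines of userspace C
(the constants are the example values above; kpage() stands in for the
64K page base that kmap_local_page() would expose, assuming the 1:1
dma_addr-to-va mapping):

	#include <stdio.h>
	#include <stdint.h>

	#define SYS_PAGE_SIZE 0x10000ULL	/* PAGE_SIZE = 64K */

	/* base of the system page containing a (1:1-mapped) dma_addr */
	static uint64_t kpage(uint64_t dma_addr)
	{
		return dma_addr & ~(SYS_PAGE_SIZE - 1);
	}

	int main(void)
	{
		/* page_list as rxe_set_page() builds it for sg[0], sg[1] */
		uint64_t page_list[] = { kpage(0x181000), kpage(0x173000) };
		uint64_t iova = 0x182000, iova_base = 0x181000;
		unsigned int mr_page_shift = 12;	/* MR page_size = 4K */
		uint64_t index, offset;

		index = (iova >> mr_page_shift) - (iova_base >> mr_page_shift);
		offset = iova & (SYS_PAGE_SIZE - 1);

		/* prints va = 0x172000, but the expected va is 0x173000:
		 * sg[1]'s 0x3000 offset inside its 64K page is lost because
		 * page_list stores only the struct page pointer */
		printf("va = 0x%llx\n",
		       (unsigned long long)(page_list[index] + offset));
		return 0;
	}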

Thanks
Zhijian


> 
> (srp/012 hit this case).
> 
> 
> II) page_size > PAGE_SIZE
> MR page_size = 64K
> PAGE_SIZE = 4K
> ibmr->iova = 0x181000;
> 
> sg_dma_address(sg[0]): 0x181000(va: 0x181000), dma_len: 0x1000, save to page_list[0-15], page_list[0]=page(0x180000) page_list[1]=page(0x181000)...
> sg_dma_address(sg[1]): 0x173000(va: 0x173000), dma_len: 0x1000, save to page_list[16-31], page_list[16]=page(0x170000) page_list[17]=page(0x171000)...
> 
> Now accessing iova: 0x182000, expectation: va: 0x173000
>> (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
> index: (0x182000 - 0x180000)/4K = 2
> 
>> iova & (PAGE_SIZE - 1);
> offset: 0x182000 & (PAGE_SIZE -1) = 0
> 
> va: va(page_list[2]) + 0 = 0x182000(incorrect)
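>
> A compact replay of this case (illustrative only; page_list holds the
> 4K page bases produced by the splitting described above):
>
> 	#include <stdio.h>
> 	#include <stdint.h>
>
> 	int main(void)
> 	{
> 		uint64_t page_list[32];
> 		uint64_t iova = 0x182000, iova_base = 0x181000;
> 		uint64_t index, offset;
> 		int i;
>
> 		for (i = 0; i < 16; i++) {
> 			page_list[i] = 0x180000 + i * 0x1000ULL;	/* sg[0] */
> 			page_list[16 + i] = 0x170000 + i * 0x1000ULL;	/* sg[1] */
> 		}
>
> 		index = (iova - (iova_base & ~0xffffULL)) / 0x1000;	/* 2 */
> 		offset = iova & 0xfff;					/* 0 */
>
> 		/* prints va = 0x182000; the expected va is 0x173000
> 		 * (page_list[19]): the index arithmetic assumes the whole
> 		 * iova range is dma-contiguous, which sg[1] breaks */
> 		printf("va = 0x%llx\n",
> 		       (unsigned long long)(page_list[index] + offset));
> 		return 0;
> 	}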
> 
> =========================
> 
> Case I) could be solved by introducing another xarray that stores an extra base_offset for each
> page_list entry. For case II), however, I have not figured out a *good* solution that doesn't
> require adding more parameters to rxe_set_page() to convey the dma_len...
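>
> For case I), that idea would look roughly like this (a hypothetical
> sketch only; rxe_mr has no page_offsets xarray today):
>
> 	/* at map time, next to the existing xa_store() of the page:
> 	 * remember where this MR page starts inside its system page */
> 	unsigned long base_off = dma_addr & ~PAGE_MASK; /* 0x3000 for sg[1] */
> 	err = xa_err(xa_store(&mr->page_offsets, mr->nbuf,
> 			      xa_mk_value(base_off), GFP_KERNEL));
>
> 	/* at access time, instead of iova & (PAGE_SIZE - 1) */
> 	page_offset = xa_to_value(xa_load(&mr->page_offsets, index)) +
> 		      (iova & (mr_page_size(mr) - 1));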
> 
> Before I go further down this path, I'd like to raise a concern: is storing struct page pointers directly in
> the page_list still the right approach for this problem?
> 
> Perhaps we need to rethink how we store the mapping information entirely.
> 
> Any thoughts or suggestions would be greatly appreciated.
> 
> Thanks
> Zhijian
> 
> 
>>
>> [1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
>> Fixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray")
>> Signed-off-by: Li Zhijian <lizhijian@...itsu.com>
>> ---
>>    drivers/infiniband/sw/rxe/rxe_mr.c | 83 +++++++++++++++++++++++++++---
>>    1 file changed, 76 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>> index b28b56db725a..8ad7d163b418 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>> @@ -72,14 +72,40 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
>>    	mr->ibmr.type = IB_MR_TYPE_DMA;
>>    }
>>    
>> +/*
>> + * Convert iova to page_list index. The page_list stores pages of size
>> + * PAGE_SIZE, but MRs can have different page sizes. This function
>> + * handles the conversion for all cases:
>> + *
>> + * 1. mr->page_size > PAGE_SIZE:
>> + *    The MR's iova may not be aligned to mr->page_size. We use the
>> + *    aligned base (iova & page_mask) as reference, then calculate
>> + *    which PAGE_SIZE sub-page the iova falls into.
>> + *
>> + * 2. mr->page_size <= PAGE_SIZE:
>> + *    Use simple shift arithmetic since each page_list entry corresponds
>> + *    to one or more MR pages.
>> + *
>> + * Example: mr->page_size=64K, PAGE_SIZE=4K, iova=0x11034, and
>> + *   base = mr->ibmr.iova & ~(64K-1) = 0x10000, so
>> + *   index = (0x11034 - 0x10000) / 4K = 0x1034 / 0x1000 = 1
>> + */
>>    static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>> +	if (mr_page_size(mr) > PAGE_SIZE)
>> +		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
>> +	else
>> +		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>>    }
>>    
>> +/*
>> + * Always return offset within a PAGE_SIZE page since page_list
>> + * stores individual struct page pointers, each representing
>> + * PAGE_SIZE of memory.
>> + */
>>    static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return iova & (mr_page_size(mr) - 1);
>> +	return iova & (PAGE_SIZE - 1);
>>    }
>>    
>>    static bool is_pmem_page(struct page *pg)
>> @@ -205,11 +231,40 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
>>    	return err;
>>    }
>>    
>> +/*
>> + * Split a large MR page (mr->page_size) into multiple PAGE_SIZE
>> + * sub-pages and store them in page_list.
>> + *
>> + * Called when mr->page_size > PAGE_SIZE. Each call to rxe_set_page()
>> + * represents one mr->page_size region, which we must split into
>> + * (mr->page_size / PAGE_SIZE) individual pages.
>> + */
>> +static int set_pages_per_mr(struct ib_mr *ibmr, u64 dma_addr)
>> +{
>> +	struct rxe_mr *mr = to_rmr(ibmr);
>> +	u32 page_size = mr_page_size(mr);
>> +	u64 addr = dma_addr & ~(u64)(page_size - 1);
>> +	u32 i, pages_per_mr = page_size / PAGE_SIZE;
>> +
>> +	for (i = 0; i < pages_per_mr; i++) {
>> +		struct page *sub_page =
>> +			ib_virt_dma_to_page(addr + i * PAGE_SIZE);
>> +		int err = xa_err(xa_store(&mr->page_list, mr->nbuf, sub_page,
>> +					  GFP_KERNEL));
>> +		if (err)
>> +			return err;
>> +
>> +		mr->nbuf++;
>> +	}
>> +	return 0;
>> +}
>> +
>>    static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>>    {
>>    	struct rxe_mr *mr = to_rmr(ibmr);
>>    	struct page *page = ib_virt_dma_to_page(dma_addr);
>>    	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
>> +	u32 page_size = mr_page_size(mr);
>>    	int err;
>>    
>>    	if (persistent && !is_pmem_page(page)) {
>> @@ -217,9 +272,13 @@ static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>>    		return -EINVAL;
>>    	}
>>    
>> -	if (unlikely(mr->nbuf == mr->num_buf))
>> +	if (unlikely(mr->nbuf >= mr->num_buf))
>>    		return -ENOMEM;
>>    
>> +	if (page_size > PAGE_SIZE)
>> +		return set_pages_per_mr(ibmr, dma_addr);
>> +
>> +	/* page_size <= PAGE_SIZE */
>>    	err = xa_err(xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL));
>>    	if (err)
>>    		return err;
>> @@ -234,6 +293,18 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
>>    	struct rxe_mr *mr = to_rmr(ibmr);
>>    	unsigned int page_size = mr_page_size(mr);
>>    
>> +	/*
>> +	 * Ensure page_size and PAGE_SIZE are compatible for mapping.
>> +	 * We require one to be a multiple of the other for correct
>> +	 * iova-to-page conversion.
>> +	 */
>> +	if (!IS_ALIGNED(page_size, PAGE_SIZE) &&
>> +	    !IS_ALIGNED(PAGE_SIZE, page_size)) {
>> +		rxe_err_mr(mr, "MR page size %u must be compatible with PAGE_SIZE %lu\n",
>> +			   page_size, PAGE_SIZE);
>> +		return -EINVAL;
>> +	}
>> +
>>    	mr->nbuf = 0;
>>    	mr->page_shift = ilog2(page_size);
>>    	mr->page_mask = ~((u64)page_size - 1);
>> @@ -255,8 +326,7 @@ static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
>>    		if (!page)
>>    			return -EFAULT;
>>    
>> -		bytes = min_t(unsigned int, length,
>> -				mr_page_size(mr) - page_offset);
>> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>>    		va = kmap_local_page(page);
>>    		if (dir == RXE_FROM_MR_OBJ)
>>    			memcpy(addr, va + page_offset, bytes);
>> @@ -442,8 +512,7 @@ static int rxe_mr_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int leng
>>    		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
>>    		if (!page)
>>    			return -EFAULT;
>> -		bytes = min_t(unsigned int, length,
>> -			      mr_page_size(mr) - page_offset);
>> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>>    
>>    		va = kmap_local_page(page);
>>    		arch_wb_cache_pmem(va + page_offset, bytes);
