Message-ID: <f48ed64f-22fe-c366-6a0e-1433e72b9359@mellanox.com>
Date: Wed, 6 Feb 2019 08:44:26 +0000
From: Haggai Eran <haggaie@...lanox.com>
To: "jglisse@...hat.com" <jglisse@...hat.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
Jason Gunthorpe <jgg@...lanox.com>,
Leon Romanovsky <leonro@...lanox.com>,
Doug Ledford <dledford@...hat.com>,
Artemy Kovalyov <artemyko@...lanox.com>,
Moni Shoua <monis@...lanox.com>,
Mike Marciniszyn <mike.marciniszyn@...el.com>,
Kaike Wan <kaike.wan@...el.com>,
Dennis Dalessandro <dennis.dalessandro@...el.com>,
Aviad Yehezkel <aviadye@...lanox.com>
Subject: Re: [PATCH 1/1] RDMA/odp: convert to use HMM for ODP
On 1/29/2019 6:58 PM, jglisse@...hat.com wrote:
> Convert ODP to use HMM so that we can build on common infrastructure
> for different class of devices that want to mirror a process address
> space into a device. There is no functional changes.
Thanks for sending this patch. I think in general it is a good idea to
use a common infrastructure for ODP.
I have a couple of questions below.
> -static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
> - const struct mmu_notifier_range *range)
> -{
> - struct ib_ucontext_per_mm *per_mm =
> - container_of(mn, struct ib_ucontext_per_mm, mn);
> -
> - if (unlikely(!per_mm->active))
> - return;
> -
> - rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
> - range->end,
> - invalidate_range_end_trampoline, true, NULL);
> up_read(&per_mm->umem_rwsem);
> + return ret;
> }
Previously the code held umem_rwsem between the range_start and
range_end calls. I guess that was in order to guarantee that no device
page fault takes a reference to the pages being invalidated while the
invalidation is in progress. I assume this is now handled by HMM
instead, correct?
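For my own understanding, is the idea that the device fault path now
follows something like the pattern below, so that a racing invalidation
simply forces a retry instead of being blocked on umem_rwsem? This is
only a sketch based on my reading of the HMM documentation; the use of
umem_mutex as the driver lock and the retry point are my assumptions,
not necessarily what the patch ends up doing.

	ret = hmm_range_register(&range, mm, start, end, page_shift);
	if (ret)
		return ret;
again:
	down_read(&mm->mmap_sem);
	ret = hmm_range_fault(&range, true);
	up_read(&mm->mmap_sem);
	if (ret <= 0)
		goto out;

	mutex_lock(&umem_odp->umem_mutex);
	if (!hmm_range_valid(&range)) {
		/* An invalidation ran since hmm_range_fault(); retry. */
		mutex_unlock(&umem_odp->umem_mutex);
		goto again;
	}
	/* ... update the device page tables (MTTs) from range.pfns ... */
	mutex_unlock(&umem_odp->umem_mutex);
out:
	hmm_range_unregister(&range);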
> +
> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> + ODP_READ_BIT, /* HMM_PFN_VALID */
> + ODP_WRITE_BIT, /* HMM_PFN_WRITE */
> + ODP_DEVICE_BIT, /* HMM_PFN_DEVICE_PRIVATE */
It seems that the mlx5_ib code in this patch currently ignores the
ODP_DEVICE_BIT (e.g., in umem_dma_to_mtt). Is that okay? Or is it
handled implicitly by the HMM_PFN_SPECIAL case?
> @@ -327,9 +287,10 @@ void put_per_mm(struct ib_umem_odp *umem_odp)
> up_write(&per_mm->umem_rwsem);
>
> WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> - mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> + hmm_mirror_unregister(&per_mm->mirror);
> put_pid(per_mm->tgid);
> - mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> +
> + kfree(per_mm);
> }
Previously the per_mm struct was released through call_srcu, but now it
is released immediately. Is that safe? I saw that hmm_mirror_unregister
calls mmu_notifier_unregister_no_release, so I don't understand what
prevents concurrently running invalidations from accessing the released
per_mm struct.
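To illustrate the interleaving I am worried about (hypothetical, and I
may well be missing a synchronization point inside
hmm_mirror_unregister):

	CPU0: put_per_mm()                      CPU1: invalidation
	  hmm_mirror_unregister()                 srcu_read_lock()
	    mmu_notifier_unregister_no_release()
	                                          walks the notifier list,
	                                          calls into the mirror ops
	  kfree(per_mm)
	                                          mirror callback still
	                                          dereferences per_mm
	                                          -> use after free?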
> @@ -578,11 +578,27 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
>
> next_mr:
> size = min_t(size_t, bcnt, ib_umem_end(&odp->umem) - io_virt);
> -
> page_shift = mr->umem->page_shift;
> page_mask = ~(BIT(page_shift) - 1);
> + off = (io_virt & (~page_mask));
> + size += (io_virt & (~page_mask));
> + io_virt = io_virt & page_mask;
> + off += (size & (~page_mask));
> + size = ALIGN(size, 1UL << page_shift);
> +
> + if (io_virt < ib_umem_start(&odp->umem))
> + return -EINVAL;
> +
> start_idx = (io_virt - (mr->mmkey.iova & page_mask)) >> page_shift;
>
> + if (odp_mr->per_mm == NULL || odp_mr->per_mm->mm == NULL)
> + return -ENOENT;
> +
> + ret = hmm_range_register(&range, odp_mr->per_mm->mm,
> + io_virt, io_virt + size, page_shift);
> + if (ret)
> + return ret;
> +
> if (prefetch && !downgrade && !mr->umem->writable) {
> /* prefetch with write-access must
> * be supported by the MR
Isn't there a mistake in the calculation of the variable size? It is
first set to the size of the page fault range, but then you add the
virtual address, so I guess it is actually the range end. Then you pass
io_virt + size to hmm_range_register. Doesn't that double the size of
the range?
> -void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
> - u64 bound)
> +void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp,
> + u64 virt, u64 bound)
> {
> + struct device *device = umem_odp->umem.context->device->dma_device;
> struct ib_umem *umem = &umem_odp->umem;
> - int idx;
> - u64 addr;
> - struct ib_device *dev = umem->context->device;
> + unsigned long idx, page_mask;
> + struct hmm_range range;
> + long ret;
> +
> + if (!umem->npages)
> + return;
> +
> + bound = ALIGN(bound, 1UL << umem->page_shift);
> + page_mask = ~(BIT(umem->page_shift) - 1);
> + virt &= page_mask;
>
> virt = max_t(u64, virt, ib_umem_start(umem));
> bound = min_t(u64, bound, ib_umem_end(umem));
> - /* Note that during the run of this function, the
> - * notifiers_count of the MR is > 0, preventing any racing
> - * faults from completion. We might be racing with other
> - * invalidations, so we must make sure we free each page only
> - * once. */
> +
> + idx = ((unsigned long)virt - ib_umem_start(umem)) >> PAGE_SHIFT;
> +
> + range.page_shift = umem->page_shift;
> + range.pfns = &umem_odp->pfns[idx];
> + range.pfn_shift = ODP_FLAGS_BITS;
> + range.values = odp_hmm_values;
> + range.flags = odp_hmm_flags;
> + range.start = virt;
> + range.end = bound;
> +
> mutex_lock(&umem_odp->umem_mutex);
> - for (addr = virt; addr < bound; addr += BIT(umem->page_shift)) {
> - idx = (addr - ib_umem_start(umem)) >> umem->page_shift;
> - if (umem_odp->page_list[idx]) {
> - struct page *page = umem_odp->page_list[idx];
> - dma_addr_t dma = umem_odp->dma_list[idx];
> - dma_addr_t dma_addr = dma & ODP_DMA_ADDR_MASK;
> -
> - WARN_ON(!dma_addr);
> -
> - ib_dma_unmap_page(dev, dma_addr, PAGE_SIZE,
> - DMA_BIDIRECTIONAL);
> - if (dma & ODP_WRITE_ALLOWED_BIT) {
> - struct page *head_page = compound_head(page);
> - /*
> - * set_page_dirty prefers being called with
> - * the page lock. However, MMU notifiers are
> - * called sometimes with and sometimes without
> - * the lock. We rely on the umem_mutex instead
> - * to prevent other mmu notifiers from
> - * continuing and allowing the page mapping to
> - * be removed.
> - */
> - set_page_dirty(head_page);
> - }
> - /* on demand pinning support */
> - if (!umem->context->invalidate_range)
> - put_page(page);
> - umem_odp->page_list[idx] = NULL;
> - umem_odp->dma_list[idx] = 0;
> - umem->npages--;
> - }
> - }
> + ret = hmm_range_dma_unmap(&range, NULL, device,
> + &umem_odp->dma_list[idx], true);
> + if (ret > 0)
> + umem->npages -= ret;
Can hmm_range_dma_unmap fail? If it does, do we simply leak the DMA
mappings? (A small sketch of what I had in mind follows the quoted hunk
below.)
> mutex_unlock(&umem_odp->umem_mutex);
> }
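Something along these lines (just a sketch; I am assuming
hmm_range_dma_unmap() returns a negative errno on failure, and I have
not checked which errors it can actually return) would at least make a
leak visible:

	ret = hmm_range_dma_unmap(&range, NULL, device,
				  &umem_odp->dma_list[idx], true);
	if (ret > 0)
		umem->npages -= ret;
	else if (ret < 0)
		/* Hypothetical: don't silently drop the DMA mappings. */
		WARN_ONCE(1, "hmm_range_dma_unmap() failed: %ld\n", ret);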
Regards,
Haggai