linux-kernel - Re: [Question] Missing data after DMA read transfer

Message-ID: <alpine.LSU.2.11.1605022020560.5004@eggly.anvils>
Date:	Mon, 2 May 2016 21:04:02 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	Nicolas Morey Chaisemartin <devel@...ey-chaisemartin.com>
cc:	Mel Gorman <mgorman@...hsingularity.net>,
	Andrea Arcangeli <aarcange@...hat.com>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Jerome Glisse <j.glisse@...il.com>,
	Alex Williamson <alex.williamson@...hat.com>,
	One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [Question] Missing data after DMA read transfer - mm issue with
 transparent huge page?

On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:

> Hi everyone,
> 
> This is a repost from a different address, as it seems the previous one ended up in Gmail junk due to a domain error.

linux-kernel is a very high volume list which few are reading:
that also will account for your lack of response so far
(apart from the indefatigable Alan).

I've added linux-mm, and some people from another thread regarding
THP and get_user_pages() pins which has been discussed in recent days.

Make no mistake, the issue you're raising here is definitely not the
same as that one (which is specifically about the new THP refcounting
in v4.5+, whereas you're reporting a problem you've seen in both a
v3.10-based kernel and in v4.5).  But I think their heads are in
gear, much more so than mine, and likely to spot something.

> I added more info found while blindly debugging the issue.
> 
> Short version:
> I'm having an issue with direct DMA transfer from a device to host memory.
> It seems some of the data is not transferring to the appropriate page.
> 
> Some more details:
> I'm debugging a home made PCI driver for our board (Kalray), attached to a x86_64 host running centos7 (3.10.0-327.el7.x86_64)
> 
> In the current case, a userland application transfers back and forth data through read/write operations on a file.
> On the kernel side, it triggers DMA transfers through the PCI to/from our board memory.
> 
> We followed what pretty much all docs said about direct I/O to user buffers:
> 
> 1) get_user_pages() (in the current case, it's at most 16 pages at once)
> 2) convert to a scatterlist
> 3) pci_map_sg
> 4) eventually coalesce sg (Intel IOMMU is enabled, so it's usually possible)
> 5) A lot of DMA engine handling code, using the dmaengine layer and virt-dma
> 6) wait for transfer complete, in the mean time, go back to (1) to schedule more work, if any
> 7) pci_unmap_sg
> 8) for read (card2host) transfer, set_page_dirty_lock
> 9) page_cache_release
> 
> In 99.9999% of cases it works perfectly.
> However, I have one userland application where a few pages are not written by a read (card2host) transfer.
> The buffer is memset to a different value beforehand so I can check that nothing has overwritten those pages.
> 
> I know (PCI protocol analyser) that the data left our board for the "right" address (the one set in the sg by pci_map_sg).
> I tried reading the data between the pci_unmap_sg and the set_page_dirty, using
>         uint32_t *addr = page_address(trans->pages[0]);
>         dev_warn(&pdata->pdev->dev, "val = %x\n", *addr);
> and it has the expected value.
> But if I try to copy_from_user (using the address coming from userland, the one passed to get_user_pages), the data has not been written and I see the memset value.
> 
> New info:
> 
> The issue happens with IOMMU on or off.
> I compiled a kernel with DMA_API_DEBUG enabled and got no warnings or errors.
> 
> I dug a little deeper with my very small understanding of Linux mm and discovered that:
>  * we are using transparent huge pages
>  * the pages 'not transferred' are the last few of a huge page
> More precisely:
> - We have several transfers in flight from the same user buffer
> - Each transfer is 16 pages long
> - At one point in time, we start transferring from another huge page (transfers are still in flight from the previous one)
> - When a transfer from the previous huge page completes, I dumped the mapcount of the pages from the previous transfers:
>   they are all 0. The pages are still mapped for DMA at this point.
> - A get_user_pages() call on the address of the completed transfer returns a different struct page * than the one I had.
> But this is before I have unmapped/put_page them back. From my understanding this should not have happened.
> 
> I tried the same code with a 4.5 kernel and encountered the same issue.
> 
> Disabling transparent huge pages makes the issue disappear.
> 
> Thanks in advance

It does look to me as if pages are being migrated, despite being pinned
by get_user_pages(): and that would be wrong.  Originally I intended
to suggest that THP is probably merely the cause of compaction, with
compaction causing the page migration.  But you posted very interesting
details in an earlier mail on 27th April from <nmorey@...ray.eu>:

> I ran some more tests:
> 
> * Test is OK if transparent huge tlb are disabled
> 
> * For all the pages where data is not transferred, and only those pages, a call to get_user_page(user vaddr) just before dma_unmap_sg returns a different page from the original one.
> [436477.927279] mppa 0000:03:00.0: org_page= ffffea0009f60080 cur page = ffffea00074e0080
> [436477.927298] page:ffffea0009f60080 count:0 mapcount:1 mapping:          (null) index:0x2
> [436477.927314] page flags: 0x2fffff00008000(tail)
> [436477.927354] page dumped because: org_page
> [436477.927369] page:ffffea00074e0080 count:0 mapcount:1 mapping:          (null) index:0x2
> [436477.927382] page flags: 0x2fffff00008000(tail)
> [436477.927421] page dumped because: cur_page
> 
> I'm not sure what to make of this...

That (on the older kernel I think) seems clearly to show that a THP
itself has been migrated: which makes me suspect NUMA migration of
misplaced THPs - migrate_misplaced_transhuge_page().  I'd hoped to
find something obviously wrong there, but haven't quite managed
to bring my brain fully to bear on it, and hope the others Cc'ed
will do so more quickly (or spot the error of your ways instead).
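
One cheap way to test the NUMA-migration suspicion from userland is
to look at whether automatic NUMA balancing and THP are enabled on
the host.  The sketch below is only an illustration: the helper names
(read_first_line, bracketed_choice, report_migration_settings) are
made up for this example, and it assumes the usual procfs/sysfs paths
/proc/sys/kernel/numa_balancing and
/sys/kernel/mm/transparent_hugepage/enabled exist on the kernel under
test (the first needs CONFIG_NUMA_BALANCING).

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a procfs/sysfs file into buf, without the
 * trailing newline.  Returns 0 on success, -1 on any error. */
int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* sysfs shows the active THP mode in brackets, for example
 * "always [madvise] never".  Copy the bracketed token into out.
 * Returns 0 on success, -1 if there is no token or it does not fit. */
int bracketed_choice(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t n;

    if (!close)
        return -1;
    n = (size_t)(close - open - 1);
    if (n + 1 > outlen)
        return -1;
    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/* Print both settings, or "unavailable" when a file cannot be read. */
void report_migration_settings(void)
{
    char line[128], mode[32];

    if (read_first_line("/proc/sys/kernel/numa_balancing",
                        line, sizeof line) == 0)
        printf("numa_balancing: %s\n", line);
    else
        printf("numa_balancing: unavailable\n");

    if (read_first_line("/sys/kernel/mm/transparent_hugepage/enabled",
                        line, sizeof line) == 0 &&
        bracketed_choice(line, mode, sizeof mode) == 0)
        printf("THP enabled mode: %s\n", mode);
    else
        printf("THP enabled mode: unavailable\n");
}
```

If numa_balancing reads non-zero, writing 0 to it as root and
rerunning the failing test would be a quick experiment: if the
missing data goes away with THP still enabled, that would point
towards NUMA migration rather than compaction.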

I do find it suspect, how the migrate_page_copy() is done rather
early, while the old page is still mapped in the pagetable.  And
odd how it inserts the new pmd for a moment, before checking old
page_count and backing out.  But I don't see how either of those
would cause the trouble you see, where the migration goes ahead.

But I may be mistaken to suspect migration at all: perhaps this is
about Copy-On-Write: there's no concurrent fork()ing, is there?
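
If fork() does turn out to be involved, one way to take Copy-On-Write
out of the picture is to mark the I/O buffer with
madvise(MADV_DONTFORK), so that a child never shares those pages.
The following is a minimal userland sketch under that assumption,
not a claim about what your application does today: the helper names
alloc_dontfork_buffer and free_dontfork_buffer are hypothetical, and
the buffer is mapped anonymously only to keep the example
self-contained.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for MAP_ANONYMOUS and MADV_DONTFORK on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map an anonymous, page-aligned buffer and ask
 * the kernel not to make it available to children after fork().
 * Returns NULL on failure. */
void *alloc_dontfork_buffer(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return NULL;
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}

/* Release a buffer obtained from alloc_dontfork_buffer(). */
void free_dontfork_buffer(void *buf, size_t len)
{
    if (buf)
        munmap(buf, len);
}
```

The application would then pass such a buffer to its read()/write()
calls on the device file.  If the problem persists with the buffer
marked this way, Copy-On-Write after fork() is an unlikely culprit.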

And I think your driver is using get_user_pages() (under mmap_sem),
not short-cutting with the trickier get_user_pages_fast().

Over to more clued-in Cc's.

Hugh