linux-kernel - Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 10 Jun 2015 18:15:08 -0700
From:	Mark Hairgrove <mhairgrove@...dia.com>
To:	Jerome Glisse <j.glisse@...il.com>
CC:	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"joro@...tes.org" <joro@...tes.org>, Mel Gorman <mgorman@...e.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Johannes Weiner <jweiner@...hat.com>,
	Larry Woodman <lwoodman@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Dave Airlie <airlied@...hat.com>,
	Brendan Conoboy <blc@...hat.com>,
	Joe Donohue <jdonohue@...hat.com>,
	Duncan Poole <dpoole@...dia.com>,
	Sherry Cheung <SCheung@...dia.com>,
	Subhash Gutti <sgutti@...dia.com>,
	John Hubbard <jhubbard@...dia.com>,
	Lucien Dunning <ldunning@...dia.com>,
	Cameron Buschardt <cabuschardt@...dia.com>,
	Arvind Gopalakrishnan <arvindg@...dia.com>,
	Haggai Eran <haggaie@...lanox.com>,
	Shachar Raindel <raindel@...lanox.com>,
	Liran Liss <liranl@...lanox.com>,
	Roland Dreier <roland@...estorage.com>,
	Ben Sander <ben.sander@....com>,
	Greg Stoner <Greg.Stoner@....com>,
	John Bridgman <John.Bridgman@....com>,
	Michael Mantor <Michael.Mantor@....com>,
	Paul Blinzer <Paul.Blinzer@....com>,
	Laurent Morichetti <Laurent.Morichetti@....com>,
	Alexander Deucher <Alexander.Deucher@....com>,
	Oded Gabbay <Oded.Gabbay@....com>,
	Jérôme Glisse <jglisse@...hat.com>,
	Jatin Kumar <jakumar@...dia.com>,
	"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>
Subject: Re: [PATCH 05/36] HMM: introduce heterogeneous memory management
 v3.



On Wed, 10 Jun 2015, Jerome Glisse wrote:

> [...]
> 
> Like said, just ignore current code it is utterly broken in so many way
> when it comes to lifetime. I screw that part badly when reworking the
> patchset, i was focusing on other part.
> 
> I fixed that in my tree, i am waiting for more review on other part as
> anyway the lifetime thing is easy to rework/fix.
> 
> http://cgit.freedesktop.org/~glisse/linux/log/?h=hmm
> 

Ok, I'm working through the other patches so I'll check the updates out 
once I've made it through. My primary interest in this discussion is 
making sure we know the plan for mirror and device lifetimes.


> > So to confirm, on all file operations from user space the driver is 
> > expected to check that current->mm matches the mm associated with the 
> > struct file's hmm_mirror?
> 
> Well you might have a valid usecase for that, just be aware that
> anything your driver do with the hmm_mirror will actually impact
> the mm of the parent. Which i assume is not what you want.
> 
> I would actualy thought that what you want is having a way to find
> hmm_mirror using both device file & mm as a key. Otherwise you can
> not really use HMM with process that like to fork themself. Which
> is a valid usecase to me. For instance process start using HMM
> through your driver, decide to fork itself and to also use HMM
> through your driver inside its child.

Agreed, that sounds reasonable, and the use case is valid. I was digging 
into this to make sure we don't prevent that.


> > 
> > On file->release the driver still ought to call hmm_mirror_unregister 
> > regardless of whether the mms match, otherwise we'll never tear down the 
> > mirror. That means we're not saved from the race condition because 
> > hmm_mirror_unregister can happen in one thread while hmm_notifier_release 
> > might be happening in another thread.
> 
> Again there is no race the mirror list is the synchronization point and
> it is protected by a lock. So either hmm_mirror_unregister() wins or the
> other thread hmm_notifier_release()

Yes, I agree. That's not the race I'm worried about. I'm worried about a 
race on the device lifetime, but in order to hit that one first 
hmm_notifier_release must take the lock and remove the mirror from the 
list before hmm_mirror_unregister does it. That's why I brought it up.


> 
> You unregister as soon as you want, it is up to your driver to do it,
> i do not enforce anything. The only thing i enforce is that you can
> not unregister the hmm device driver before all mirror are unregistered
> and free.
> 
> So yes for device driver you want to unregister when device file is
> close (which happens when the process exit).

Sounds good.


> 
> There is no race here, the mirror struct will only be freed once as again
> the list is a synchronization point. Whoever remove the mirror from the
> list is responsible to drop the list reference.
> 
> In the fixed code the only thing that will happen twice is the ->release()
> callback. Even that can be work around to garanty it is call only once.
> 
> Anyway i do not see anyrace here.
> 

The mirror lifetime is fine. The problem I see is with the device lifetime 
on a multi-core system. Imagine this sequence:

- On CPU1 the mm associated with the mirror is going down
- On CPU2 the driver unregisters the mirror then the device

When this happens, the last device mutex_unlock on CPU1 is the only thing 
preventing the free of the device in CPU2. That doesn't work, as described 
in this thread: https://lkml.org/lkml/2013/12/2/997

Here's the full sequence again with mutex_unlock split apart. Hopefully 
this shows the device_unregister problem more clearly:

CPU1 (mm release)                   CPU2 (driver)
----------------------              ----------------------
hmm_notifier_release
  down_write(&hmm->rwsem);
  hlist_del_init(&mirror->mlist);
  up_write(&hmm->rwsem);

  // CPU1 thread is preempted or 
  // something
                                    hmm_mirror_unregister
                                      hmm_mirror_kill
                                        down_write(&hmm->rwsem);
                                        // mirror removed by CPU1 already
                                        // so hlist_unhashed returns 1
                                        up_write(&hmm->rwsem);

                                      hmm_mirror_unref(&mirror);
                                      // Mirror ref now 1

                                      // CPU2 thread is preempted or
                                      // something
// CPU1 thread is scheduled

hmm_mirror_unref(&mirror);
  // Mirror ref now 0, cleanup
  hmm_mirror_destroy(&mirror)
    mutex_lock(&device->mutex);
    list_del_init(&mirror->dlist);
    device->ops->release(mirror);
      kfree(mirror);
                                      // CPU2 thread is scheduled, now
                                      // both CPU1 and CPU2 are running

                                    hmm_device_unregister
                                      mutex_lock(&device->mutex);
                                        mutex_optimistic_spin()
    mutex_unlock(&device->mutex);
      [...]
      __mutex_unlock_common_slowpath
        // CPU2 releases lock
        atomic_set(&lock->count, 1);
                                          // Spinning CPU2 acquires now-
                                          // free lock
                                      // mutex_lock returns
                                      // Device list empty
                                      mutex_unlock(&device->mutex);
                                      return 0;
                                    kfree(hmm_device);
        // CPU1 still accessing 
        // hmm_device->mutex in 
        //__mutex_unlock_common_slowpath
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/