Message-ID: <20200214184514.GA165785@google.com>
Date: Fri, 14 Feb 2020 10:45:14 -0800
From: Minchan Kim <minchan@...nel.org>
To: Jens Axboe <axboe@...nel.dk>
Cc: Jann Horn <jannh@...gle.com>, io-uring <io-uring@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>,
Linux API <linux-api@...r.kernel.org>,
Oleksandr Natalenko <oleksandr@...hat.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Tim Murray <timmurray@...gle.com>,
Daniel Colascione <dancol@...gle.com>,
Sandeep Patil <sspatil@...gle.com>,
Sonny Rao <sonnyrao@...gle.com>,
Brian Geffon <bgeffon@...gle.com>,
Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Shakeel Butt <shakeelb@...gle.com>,
John Dias <joaodias@...gle.com>,
Joel Fernandes <joel@...lfernandes.org>, sj38.park@...il.com,
Alexander Duyck <alexander.h.duyck@...ux.intel.com>
Subject: Re: [PATCH v5 1/7] mm: pass task and mm to do_madvise
On Fri, Feb 14, 2020 at 11:22:08AM -0700, Jens Axboe wrote:
> On 2/14/20 10:25 AM, Jann Horn wrote:
> > +Jens and io-uring list
> >
> > On Fri, Feb 14, 2020 at 6:06 PM Minchan Kim <minchan@...nel.org> wrote:
> >> In upcoming patches, do_madvise will be called from external process
> >> context so we shouldn't assume "current" is always hinted process's
> >> task_struct.
> > [...]
> >> [1] http://lore.kernel.org/r/CAG48ez27=pwm5m_N_988xT1huO7g7h6arTQL44zev6TD-h-7Tg@mail.gmail.com
> > [...]
> >> diff --git a/fs/io_uring.c b/fs/io_uring.c
> > [...]
> >> @@ -2736,7 +2736,7 @@ static int io_madvise(struct io_kiocb *req, struct io_kiocb **nxt,
> >> if (force_nonblock)
> >> return -EAGAIN;
> >>
> >> - ret = do_madvise(ma->addr, ma->len, ma->advice);
> >> + ret = do_madvise(current, current->mm, ma->addr, ma->len, ma->advice);
> >> if (ret < 0)
> >> req_set_fail_links(req);
> >> io_cqring_add_event(req, ret);
> >
> > Jens, can you have a look at this change and the following patch
> > <https://lore.kernel.org/linux-mm/20200214170520.160271-4-minchan@kernel.org/>
> > ("[PATCH v5 3/7] mm: check fatal signal pending of target process")?
> > Basically Minchan's patch tries to plumb through the identity of the
> > target task so that if that task gets killed in the middle of the
> > operation, the (potentially long-running and costly) madvise operation
> > can be cancelled. Just passing in "current" instead (which in this
> > case is the uring worker thread AFAIK) doesn't really break anything,
> > other than making the optimization not work, but I wonder whether this
> > couldn't be done more cleanly - maybe by passing in NULL to mean "we
> > don't know who the target task is", since I think we don't know that
> > here?
>
> Thanks for bringing this to my attention, patches that touch io_uring
> (or anything else) really should be CC'ed to the maintainer(s) of those
> areas...
Hi Jens, it was my mistake. Sorry for that.
>
> Yeah, the change above won't do the right thing for io_uring, in fact
> it'll always be the wrong task. So I'd second Jann's question, and ask
> if we really need the actual task, or if NULL could be used? For
> cancelation purposes, I'm guessing you want the task that's actually
> doing the operation, even if it's on behalf of someone else. That makes
> the interface a bit weird, as you'd assume the task/mm passed in would
> be related to the madvise itself, not just for cancelation.
>
> Would be nice with some clarification, so we can figure out an approach
> that would actually work.
MADV_(COLD|PAGEOUT) checks both the caller and the callee, and the part in
question targets the callee (i.e., the target task). Thus, we could pass NULL
from io_madvise, since it can't know who the target is, and add a NULL check
before fatal_signal_pending. I will put the following check in [3/7].
	if (private->target_task &&
	    fatal_signal_pending(private->target_task))
		return -EINTR;
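
For illustration only, here is a user-space sketch of how that check is meant
to behave (this is not the [3/7] code; struct mock_task,
mock_fatal_signal_pending and check_target are made-up names standing in for
the kernel ones):

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct task_struct. */
struct mock_task {
	bool fatal_signal;	/* models a pending fatal signal */
};

/* Hypothetical stand-in for the kernel's fatal_signal_pending(). */
static bool mock_fatal_signal_pending(const struct mock_task *t)
{
	return t->fatal_signal;
}

/*
 * Return -EINTR only when a target is known and it is being killed.
 * A NULL target means "the caller does not know the target", so the
 * early-exit optimization is simply skipped.
 */
static int check_target(const struct mock_task *target_task)
{
	if (target_task && mock_fatal_signal_pending(target_task))
		return -EINTR;
	return 0;
}
```

With this shape, io_madvise passing NULL just loses the early exit; it never
dereferences a task it doesn't own.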
From d008a5a1049b03b3e0eeef7121faead2b6555f49 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@...nel.org>
Date: Fri, 14 Feb 2020 07:29:58 -0800
Subject: [PATCH] mm: pass task and mm to do_madvise
In upcoming patches, do_madvise will be called from external process
context so we shouldn't assume "current" is always the hinted process's
task_struct. Furthermore, we can't access mm_struct via task->mm
once it's verified by access_mm, which will be introduced in the next
patch[1]. So let's pass *current* and current->mm as arguments of
do_madvise; this shouldn't change existing behavior but prepares for
the next patch to make review easy.
Note: io_madvise passes NULL as the target_task argument of do_madvise
because it can't know who the target is.
[1] http://lore.kernel.org/r/CAG48ez27=pwm5m_N_988xT1huO7g7h6arTQL44zev6TD-h-7Tg@mail.gmail.com
Cc: Jens Axboe <axboe@...nel.dk>
Cc: Jann Horn <jannh@...gle.com>
Signed-off-by: Minchan Kim <minchan@...nel.org>
---
fs/io_uring.c | 2 +-
include/linux/mm.h | 3 ++-
mm/madvise.c | 34 +++++++++++++++++++---------------
3 files changed, 22 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 63beda9bafc5..1c7e9cd6c8ce 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2736,7 +2736,7 @@ static int io_madvise(struct io_kiocb *req, struct io_kiocb **nxt,
if (force_nonblock)
return -EAGAIN;
- ret = do_madvise(ma->addr, ma->len, ma->advice);
+ ret = do_madvise(NULL, current->mm, ma->addr, ma->len, ma->advice);
if (ret < 0)
req_set_fail_links(req);
io_cqring_add_event(req, ret);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52269e56c514..beb9259f9ed1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2323,7 +2323,8 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf, bool downgrade);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf);
-extern int do_madvise(unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct task_struct *task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior);
static inline unsigned long
do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/mm/madvise.c b/mm/madvise.c
index 43b47d3fae02..f75c86b6c463 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -254,6 +254,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
+ struct mm_struct *mm = vma->vm_mm;
struct file *file = vma->vm_file;
loff_t offset;
@@ -288,12 +289,12 @@ static long madvise_willneed(struct vm_area_struct *vma,
*/
*prev = NULL; /* tell sys_madvise we drop mmap_sem */
get_file(file);
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
fput(file);
- down_read(&current->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
return 0;
}
@@ -676,7 +677,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (nr_swap) {
if (current->mm == mm)
sync_mm_rss(mm);
-
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
}
arch_leave_lazy_mmu_mode();
@@ -756,6 +756,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
int behavior)
{
+ struct mm_struct *mm = vma->vm_mm;
+
*prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -763,8 +765,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
if (!userfaultfd_remove(vma, start, end)) {
*prev = NULL; /* mmap_sem has been dropped, prev is stale */
- down_read(&current->mm->mmap_sem);
- vma = find_vma(current->mm, start);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
if (!vma)
return -ENOMEM;
if (start < vma->vm_start) {
@@ -818,6 +820,7 @@ static long madvise_remove(struct vm_area_struct *vma,
loff_t offset;
int error;
struct file *f;
+ struct mm_struct *mm = vma->vm_mm;
*prev = NULL; /* tell sys_madvise we drop mmap_sem */
@@ -845,13 +848,13 @@ static long madvise_remove(struct vm_area_struct *vma,
get_file(f);
if (userfaultfd_remove(vma, start, end)) {
/* mmap_sem was not released by userfaultfd_remove() */
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
}
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
- down_read(&current->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
return error;
}
@@ -1044,7 +1047,8 @@ madvise_behavior_valid(int behavior)
* -EBADF - map exists, but area maps something that isn't a file.
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
-int do_madvise(unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -1082,10 +1086,10 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
write = madvise_need_mmap_write(behavior);
if (write) {
- if (down_write_killable(&current->mm->mmap_sem))
+ if (down_write_killable(&mm->mmap_sem))
return -EINTR;
} else {
- down_read(&current->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
}
/*
@@ -1093,7 +1097,7 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(current->mm, start, &prev);
+ vma = find_vma_prev(mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;
@@ -1130,19 +1134,19 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_sem */
- vma = find_vma(current->mm, start);
+ vma = find_vma(mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
- up_write(&current->mm->mmap_sem);
+ up_write(&mm->mmap_sem);
else
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
return error;
}
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
- return do_madvise(start, len_in, behavior);
+ return do_madvise(current, current->mm, start, len_in, behavior);
}
--
2.25.0.265.gbab2e86ba0-goog