Message-ID: <20250825181434.3340805-6-sashal@kernel.org>
Date: Mon, 25 Aug 2025 14:14:32 -0400
From: Sasha Levin <sashal@...nel.org>
To: linux-api@...r.kernel.org,
	linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	tools@...nel.org
Cc: Sasha Levin <sashal@...nel.org>
Subject: [RFC PATCH v4 5/7] mm/mlock: add API specification for mlock

Add kernel API specification for the mlock() system call.

Signed-off-by: Sasha Levin <sashal@...nel.org>
---
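
Illustration (not part of the patch): the start and len constraints in the specification below interact, because the offset of start within its page is added to len before len is rounded up. The following is a minimal userspace sketch of that arithmetic, assuming a 4096-byte page. `SKETCH_PAGE_SIZE`, `struct locked_range` and `mlock_effective_range()` are hypothetical names used only here; they are not kernel or libc APIs.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages. Real code should query sysconf(_SC_PAGESIZE). */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical result type: the page-granular range that ends up locked. */
struct locked_range {
	unsigned long start;
	unsigned long len;
};

/*
 * Hypothetical helper that mirrors the rounding applied to an mlock()
 * request: len grows by the offset of start within its page and is then
 * rounded up to a page multiple; start is rounded down to a page boundary.
 */
static struct locked_range mlock_effective_range(unsigned long start,
						 unsigned long len)
{
	struct locked_range r;
	unsigned long offset = start & ~SKETCH_PAGE_MASK;

	r.len = (len + offset + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
	r.start = start & SKETCH_PAGE_MASK;
	return r;
}
```

Under these assumptions, `mlock_effective_range(0x10ff0, 32)` yields start 0x10000 and len 8192: a 32-byte request that straddles a page boundary covers two whole pages.
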
 mm/mlock.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 134 insertions(+)

diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..36eac7fec17d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -656,6 +656,140 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 	return 0;
 }
 
+/**
+ * sys_mlock - Lock pages in memory
+ * @start: Starting address of memory range to lock
+ * @len: Length of memory range to lock in bytes
+ *
+ * long-desc: Locks pages in the specified address range into RAM, preventing
+ *   them from being paged to swap. Callers without CAP_IPC_LOCK are bound
+ *   by the RLIMIT_MEMLOCK resource limit.
+ *
+ * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
+ *
+ * param: start, KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_NONE
+ *   constraint: Automatically page-aligned down by kernel (PAGE_ALIGN_DOWN)
+ *
+ * param: len, KAPI_TYPE_UINT
+ *   flags: KAPI_PARAM_IN
+ *   constraint-type: KAPI_CONSTRAINT_RANGE
+ *   range: 0, LONG_MAX
+ *   constraint: Extended by offset_in_page(start), then page-aligned up by kernel (PAGE_ALIGN)
+ *
+ * return:
+ *   type: KAPI_TYPE_INT
+ *   check-type: KAPI_RETURN_ERROR_CHECK
+ *   success: 0
+ *
+ * error: ENOMEM, Address range or limit issue
+ *   desc: Some of the specified range is unmapped, the lock would push the number of
+ *   mapped regions past the limit, or the caller lacks CAP_IPC_LOCK and would exceed RLIMIT_MEMLOCK.
+ *
+ * error: EPERM, Insufficient privileges
+ *   desc: The caller is not privileged (no CAP_IPC_LOCK) and RLIMIT_MEMLOCK is 0.
+ *
+ * error: EINVAL, Address overflow
+ *   desc: The result of the addition start+len was less than start (arithmetic overflow).
+ *
+ * error: EAGAIN, Some or all memory could not be locked
+ *   desc: Some or all of the specified address range could not be locked.
+ *
+ * error: EINTR, Interrupted by signal
+ *   desc: The operation was interrupted by a fatal signal before completion.
+ *
+ * error: EFAULT, Bad address
+ *   desc: The specified address range contains invalid addresses that cannot be accessed.
+ *
+ * since-version: 2.0
+ *
+ * lock: mmap_lock, KAPI_LOCK_RWLOCK
+ *   acquired: true
+ *   released: true
+ *   desc: Process memory map write lock
+ *
+ * signal: FATAL
+ *   direction: KAPI_SIGNAL_RECEIVE
+ *   action: KAPI_SIGNAL_ACTION_RETURN
+ *   condition: Fatal signal pending
+ *   desc: Fatal signals (SIGKILL) can interrupt the operation at two points:
+ *   when acquiring mmap_write_lock_killable() and during page population
+ *   in __mm_populate(). Returns -EINTR. Non-fatal signals do NOT interrupt
+ *   mlock - the operation continues even if SIGINT/SIGTERM are received.
+ *   error: -EINTR
+ *   timing: KAPI_SIGNAL_TIME_DURING
+ *   priority: 0
+ *   interruptible: yes
+ *   state-req: KAPI_SIGNAL_STATE_RUNNING
+ *
+ * examples: mlock(addr, 4096);  // Lock one page
+ *   mlock(addr, len);   // Lock range of pages
+ *
+ * notes: Memory locks do not stack - multiple calls on the same range can be
+ *   undone by a single munlock. Locks are not inherited by child processes.
+ *   Pages are locked on whole page boundaries. Commonly used by real-time
+ *   applications to prevent page faults during time-critical operations.
+ *   Also used for security to prevent sensitive data (e.g., cryptographic keys)
+ *   from being written to swap. Note: locked pages may still be saved to
+ *   swap during system suspend/hibernate.
+ *
+ *   Tagged addresses are automatically handled via untagged_addr(). The operation
+ *   occurs in two phases: first VMAs are marked with VM_LOCKED, then pages are
+ *   populated into memory. If the RLIMIT_MEMLOCK check would otherwise fail, the
+ *   kernel recounts the request so that already-locked overlapping pages are not counted twice.
+ * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_ALLOC_MEMORY
+ *   target: process memory
+ *   desc: Locks pages into physical memory, preventing swapping
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: mm->locked_vm
+ *   desc: Increases process locked memory counter
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_ALLOC_MEMORY
+ *   target: physical pages
+ *   desc: May allocate and populate page table entries
+ *   condition: Pages not already present
+ *   reversible: yes
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_ALLOC_MEMORY
+ *   target: page faults
+ *   desc: Triggers page faults to bring pages into memory
+ *   condition: Pages not already resident
+ *
+ * side-effect: KAPI_EFFECT_MODIFY_STATE
+ *   target: VMA splitting
+ *   desc: May split existing VMAs at lock boundaries
+ *   condition: Lock range partially overlaps existing VMA
+ *
+ * state-trans: memory pages
+ *   from: swappable
+ *   to: locked in RAM
+ *   desc: Pages become non-swappable and pinned in physical memory
+ *
+ * state-trans: VMA flags
+ *   from: unlocked
+ *   to: VM_LOCKED set
+ *   desc: Virtual memory area marked as locked
+ *
+ * capability: CAP_IPC_LOCK, KAPI_CAP_BYPASS_CHECK, CAP_IPC_LOCK capability
+ *   allows: Lock unlimited amount of memory (no RLIMIT_MEMLOCK enforcement)
+ *   without: Must respect RLIMIT_MEMLOCK resource limit
+ *   condition: Checked when RLIMIT_MEMLOCK is 0 or locking would exceed limit
+ *   priority: 0
+ *
+ * constraint: RLIMIT_MEMLOCK Resource Limit
+ *   desc: The RLIMIT_MEMLOCK soft resource limit specifies the maximum bytes of memory that may be locked into RAM. Unprivileged processes are restricted to this limit. CAP_IPC_LOCK capability allows bypassing this limit entirely. The limit is enforced per-process, not per-user.
+ *   expr: locked_memory + request_size <= RLIMIT_MEMLOCK || CAP_IPC_LOCK
+ *
+ * constraint: Memory Pressure and OOM
+ *   desc: Locking large amounts of memory can cause system-wide memory pressure and potentially trigger the OOM killer. The kernel does not prevent locking memory that would destabilize the system.
+ *
+ * constraint: Special Memory Areas
+ *   desc: Some memory types cannot be locked or are silently skipped: VM_IO/VM_PFNMAP areas (device mappings) are skipped; Hugetlb pages are inherently pinned and skipped; DAX mappings are always present in memory and skipped; Secret memory (memfd_secret) mappings are skipped; VM_DROPPABLE memory cannot be locked and is skipped; Gate VMA (kernel entry point) is skipped; VM_LOCKED areas are already locked. These special areas are silently excluded without error.
+ */
 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 {
 	return do_mlock(start, len, VM_LOCKED);
-- 
2.50.1
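
Illustration (not part of the patch): the RLIMIT_MEMLOCK constraint expression in the specification above can be approximated from userspace before calling mlock(). This is only a sketch under stated assumptions: it works in bytes, whereas the kernel compares page counts, and the caller must supply its own estimate of already-locked memory and of whether it holds CAP_IPC_LOCK. `memlock_request_fits()` and `memlock_soft_limit()` are hypothetical helper names; `getrlimit()` and `RLIMIT_MEMLOCK` are the standard interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: evaluate
 *   locked_memory + request_size <= RLIMIT_MEMLOCK || CAP_IPC_LOCK
 * in bytes. The comparison is written as a subtraction so that
 * locked + request cannot overflow.
 */
static bool memlock_request_fits(rlim_t locked, rlim_t request,
				 rlim_t soft_limit, bool has_cap_ipc_lock)
{
	if (has_cap_ipc_lock)
		return true;
	if (locked > soft_limit)
		return false;
	return request <= soft_limit - locked;
}

/*
 * Hypothetical helper: return the RLIMIT_MEMLOCK soft limit in bytes,
 * or 0 if it cannot be queried (treated as "nothing may be locked").
 */
static rlim_t memlock_soft_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return 0;
	return rl.rlim_cur;
}
```

A caller might check `memlock_request_fits(locked_so_far, len, memlock_soft_limit(), false)` before mlock(addr, len). The check is advisory only: mlock() can still fail for the other reasons listed in the specification.
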

