linux-kernel - Re: [RFC v2 01/22] kernel/api: introduce kernel API specification framework

Message-ID: <aGRKIuR6hgW0YLc_@lappy>
Date: Tue, 1 Jul 2025 16:50:42 -0400
From: Sasha Levin <sashal@...nel.org>
To: Jonathan Corbet <corbet@....net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>,
	linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
	linux-api@...r.kernel.org, workflows@...r.kernel.org,
	tools@...nel.org, Kate Stewart <kstewart@...uxfoundation.org>,
	Gabriele Paoloni <gpaoloni@...hat.com>,
	Chuck Wolber <chuckwolber@...il.com>
Subject: Re: [RFC v2 01/22] kernel/api: introduce kernel API specification
 framework

On Tue, Jul 01, 2025 at 01:01:27PM -0600, Jonathan Corbet wrote:
>[Adding some of the ELISA folks, who are working in a related area and
>might have thoughts on this.  You can find the patch series under
>discussion at:
>
>  https://lore.kernel.org/all/20250624180742.5795-1-sashal@kernel.org

Yup, we all met at OSS and reached the conclusion that we should lean
towards a machine readable spec, which we thought was closer to my
proposal than the kerneldoc work.

However, with your suggestion, I think it makes more sense to go back to
kerneldoc as that can be made machine readable.

>> In theory, all of that will let us have something like the following in
>> kerneldoc:
>>
>> - @api-type: syscall
>> - @api-version: 1
>> - @context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
>> - @param-type: family, KAPI_TYPE_INT
>> - @param-flags: family, KAPI_PARAM_IN
>> - @param-range: family, 0, 45
>> - @param-mask: type, SOCK_TYPE_MASK | SOCK_CLOEXEC | SOCK_NONBLOCK
>> - @error-code: -EAFNOSUPPORT, "Address family not supported"
>> - @error-condition: -EAFNOSUPPORT, "family < 0 || family >= NPROTO"
>> - @capability: CAP_NET_RAW, KAPI_CAP_GRANT_PERMISSION
>> - @capability-allows: CAP_NET_RAW, "Create SOCK_RAW sockets"
>> - @since: 2.0
>> - @return-type: KAPI_TYPE_FD
>> - @return-check: KAPI_RETURN_ERROR_CHECK
>>
>> How does it sound? I'm pretty excited about the possibility to align this
>> with kerneldoc. Please poke holes in the plan :)
>
>I think we could do it without all the @signs.  We'd also want to see
>how well we could integrate that information with the minimal structure
>we already have: getting the return-value information into the Returns:
>section, for example, and tying the parameter constraints to the
>parameter descriptions we already have.

Right!

So I have a proof of concept which, during the build process, creates
.apispec.h files. These are generated from kerneldoc and contain macros
identical to the ones in my RFC.
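
As a rough illustration of that generation step (this is not the
proof-of-concept code), the sketch below turns a single "key: value"
kerneldoc line into macro-style text. The function name
kdoc_line_to_macro() and the KAPI_<KEY>(...) naming are assumptions made
for the example. The real generated macros are the ones from the RFC,
and continuation lines are not handled here.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Convert one kerneldoc annotation line such as
 *   "  * param-range: len, 0, LONG_MAX"
 * into macro-style text such as
 *   "KAPI_PARAM_RANGE(len, 0, LONG_MAX)".
 *
 * The KAPI_<KEY>(...) naming is hypothetical, not the RFC's macro set.
 * Returns 0 on success, -1 if the line is not a lowercase "key: value"
 * annotation or if the output buffer is too small.
 */
static int kdoc_line_to_macro(const char *line, char *out, size_t outlen)
{
	const char *p = line;
	const char *colon;
	const char *val;
	size_t keylen, vallen, i;

	assert(line && out);

	/* Skip the comment decoration: blanks, one '*', more blanks. */
	while (*p == ' ' || *p == '\t')
		p++;
	if (*p == '*')
		p++;
	while (*p == ' ' || *p == '\t')
		p++;

	colon = strchr(p, ':');
	if (!colon || colon == p || colon[1] != ' ')
		return -1;

	/*
	 * Keys are lowercase words joined by '-'. This rejects "@param:"
	 * lines and existing sections such as "Context:" or "Return:".
	 */
	keylen = (size_t)(colon - p);
	for (i = 0; i < keylen; i++)
		if (!islower((unsigned char)p[i]) && p[i] != '-')
			return -1;

	/* The value runs to the end of the line, minus trailing blanks. */
	val = colon + 2;
	vallen = strlen(val);
	while (vallen > 0 && isspace((unsigned char)val[vallen - 1]))
		vallen--;

	/* "KAPI_" + key + "(" + value + ")" + NUL */
	if (5 + keylen + 1 + vallen + 1 + 1 > outlen)
		return -1;

	memcpy(out, "KAPI_", 5);
	for (i = 0; i < keylen; i++)
		out[5 + i] = (p[i] == '-') ? '_' :
			     (char)toupper((unsigned char)p[i]);
	out[5 + keylen] = '(';
	memcpy(out + 5 + keylen + 1, val, vallen);
	out[5 + keylen + 1 + vallen] = ')';
	out[5 + keylen + 1 + vallen + 1] = '\0';
	return 0;
}
```

Calling it on "  * param-range: len, 0, LONG_MAX" fills the buffer with
"KAPI_PARAM_RANGE(len, 0, LONG_MAX)", while "@len:" and "Context:" lines
are rejected so the existing kerneldoc structure is left alone.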

Here's an example of sys_mlock() spec:

/**
  * sys_mlock - Lock pages in memory
  * @start: Starting address of memory range to lock
  * @len: Length of memory range to lock in bytes
  *
  * Locks pages in the specified address range into RAM, preventing them from
  * being paged to swap. Requires CAP_IPC_LOCK capability or RLIMIT_MEMLOCK
  * resource limit.
  *
  * long-desc: Locks pages in the specified address range into RAM, preventing
  *   them from being paged to swap. Requires CAP_IPC_LOCK capability
  *   or RLIMIT_MEMLOCK resource limit.
  * context-flags: KAPI_CTX_PROCESS | KAPI_CTX_SLEEPABLE
  * param-type: start, KAPI_TYPE_UINT
  * param-flags: start, KAPI_PARAM_IN
  * param-constraint-type: start, KAPI_CONSTRAINT_NONE
  * param-constraint: start, Rounded down to page boundary
  * param-type: len, KAPI_TYPE_UINT
  * param-flags: len, KAPI_PARAM_IN
  * param-constraint-type: len, KAPI_CONSTRAINT_RANGE
  * param-range: len, 0, LONG_MAX
  * param-constraint: len, Rounded up to page boundary
  * return-type: KAPI_TYPE_INT
  * return-check-type: KAPI_RETURN_ERROR_CHECK
  * return-success: 0
  * error-code: -ENOMEM, ENOMEM, Address range issue,
  *   Some of the specified range is not mapped, has unmapped gaps,
  *   or the lock would cause the number of mapped regions to exceed the limit.
  * error-code: -EPERM, EPERM, Insufficient privileges,
  *   The caller is not privileged (no CAP_IPC_LOCK) and RLIMIT_MEMLOCK is 0.
  * error-code: -EINVAL, EINVAL, Address overflow,
  *   The result of the addition start+len was less than start (arithmetic overflow).
  * error-code: -EAGAIN, EAGAIN, Some or all memory could not be locked,
  *   Some or all of the specified address range could not be locked.
  * error-code: -EINTR, EINTR, Interrupted by signal,
  *   The operation was interrupted by a fatal signal before completion.
  * error-code: -EFAULT, EFAULT, Bad address,
  *   The specified address range contains invalid addresses that cannot be accessed.
  * since-version: 2.0
  * lock: mmap_lock, KAPI_LOCK_RWLOCK
  * lock-acquired: true
  * lock-released: true
  * lock-desc: Process memory map write lock
  * signal: FATAL
  * signal-direction: KAPI_SIGNAL_RECEIVE
  * signal-action: KAPI_SIGNAL_ACTION_RETURN
  * signal-condition: Fatal signal pending
  * signal-desc: Fatal signals (SIGKILL) can interrupt the operation at two points:
  *   when acquiring mmap_write_lock_killable() and during page population
  *   in __mm_populate(). Returns -EINTR. Non-fatal signals do NOT interrupt
  *   mlock - the operation continues even if SIGINT/SIGTERM are received.
  * signal-error: -EINTR
  * signal-timing: KAPI_SIGNAL_TIME_DURING
  * signal-priority: 0
  * signal-interruptible: yes
  * signal-state-req: KAPI_SIGNAL_STATE_RUNNING
  * examples: mlock(addr, 4096);  // Lock one page
  *   mlock(addr, len);   // Lock range of pages
  * notes: Memory locks do not stack - multiple calls on the same range can be
  *   undone by a single munlock. Locks are not inherited by child processes.
  *   Pages are locked on whole page boundaries. Commonly used by real-time
  *   applications to prevent page faults during time-critical operations.
  *   Also used for security to prevent sensitive data (e.g., cryptographic keys)
  *   from being written to swap. Note: locked pages may still be saved to
  *   swap during system suspend/hibernate.
  *
  *   Tagged addresses are automatically handled via untagged_addr(). The operation
  *   occurs in two phases: first VMAs are marked with VM_LOCKED, then pages are
  *   populated into memory. When checking RLIMIT_MEMLOCK, the kernel optimizes
  *   by recounting locked memory to avoid double-counting overlapping regions.
  * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_ALLOC_MEMORY, process memory, Locks pages into physical memory, preventing swapping, reversible=yes
  * side-effect: KAPI_EFFECT_MODIFY_STATE, mm->locked_vm, Increases process locked memory counter, reversible=yes
  * side-effect: KAPI_EFFECT_ALLOC_MEMORY, physical pages, May allocate and populate page table entries, condition=Pages not already present, reversible=yes
  * side-effect: KAPI_EFFECT_MODIFY_STATE | KAPI_EFFECT_ALLOC_MEMORY, page faults, Triggers page faults to bring pages into memory, condition=Pages not already resident
  * side-effect: KAPI_EFFECT_MODIFY_STATE, VMA splitting, May split existing VMAs at lock boundaries, condition=Lock range partially overlaps existing VMA
  * state-trans: memory pages, swappable, locked in RAM, Pages become non-swappable and pinned in physical memory
  * state-trans: VMA flags, unlocked, VM_LOCKED set, Virtual memory area marked as locked
  * capability: CAP_IPC_LOCK, KAPI_CAP_BYPASS_CHECK, CAP_IPC_LOCK capability
  * capability-allows: Lock unlimited amount of memory (no RLIMIT_MEMLOCK enforcement)
  * capability-without: Must respect RLIMIT_MEMLOCK resource limit
  * capability-condition: Checked when RLIMIT_MEMLOCK is 0 or locking would exceed limit
  * capability-priority: 0
  * constraint: RLIMIT_MEMLOCK Resource Limit, The RLIMIT_MEMLOCK soft resource limit specifies the maximum bytes of memory that may be locked into RAM. Unprivileged processes are restricted to this limit. CAP_IPC_LOCK capability allows bypassing this limit entirely. The limit is enforced per-process, not per-user.
  * constraint-expr: RLIMIT_MEMLOCK Resource Limit, locked_memory + request_size <= RLIMIT_MEMLOCK || CAP_IPC_LOCK
  * constraint: Memory Pressure and OOM, Locking large amounts of memory can cause system-wide memory pressure and potentially trigger the OOM killer. The kernel does not prevent locking memory that would destabilize the system.
  * constraint: Special Memory Areas, Some memory types cannot be locked or are silently skipped: VM_IO/VM_PFNMAP areas (device mappings) are skipped; Hugetlb pages are inherently pinned and skipped; DAX mappings are always present in memory and skipped; Secret memory (memfd_secret) mappings are skipped; VM_DROPPABLE memory cannot be locked and is skipped; Gate VMA (kernel entry point) is skipped; VM_LOCKED areas are already locked. These special areas are silently excluded without error.
  *
  * Context: Process context. May sleep. Takes mmap_lock for write.
  *
  * Return: 0 on success, negative error code on failure
  */
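
To give a feel for what a machine-readable form of the above makes
possible, here is a minimal userspace sketch that encodes three of the
entries as plain C checks: the param-range on len, the -EINVAL overflow
condition, and the RLIMIT_MEMLOCK constraint-expr together with the
-EPERM condition. struct mlock_caller and all function names are
hypothetical, made up for this example. This is neither the RFC's
validation code nor what mm/mlock.c does.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical snapshot of the caller state the spec entries refer to. */
struct mlock_caller {
	unsigned long locked_memory;	/* bytes already locked */
	unsigned long rlimit_memlock;	/* RLIMIT_MEMLOCK soft limit, bytes */
	bool cap_ipc_lock;		/* caller holds CAP_IPC_LOCK */
};

/* param-range: len, 0, LONG_MAX */
static bool len_in_range(unsigned long len)
{
	return len <= (unsigned long)LONG_MAX;
}

/*
 * constraint-expr:
 *   locked_memory + request_size <= RLIMIT_MEMLOCK || CAP_IPC_LOCK
 */
static bool memlock_constraint_ok(const struct mlock_caller *c,
				  unsigned long request_size)
{
	assert(c);
	if (c->cap_ipc_lock)
		return true;
	/* Written as a subtraction so the sum cannot wrap around. */
	return c->locked_memory <= c->rlimit_memlock &&
	       request_size <= c->rlimit_memlock - c->locked_memory;
}

/*
 * Returns 0 if the arguments pass the two error conditions encoded here,
 * or the negative errno listed in the corresponding error-code entry.
 */
static int mlock_spec_precheck(const struct mlock_caller *c,
			       unsigned long start, unsigned long len)
{
	assert(c);

	/* error-code: -EINVAL, start+len was less than start */
	if (start + len < start)
		return -EINVAL;

	/* error-code: -EPERM, no CAP_IPC_LOCK and RLIMIT_MEMLOCK is 0 */
	if (!c->cap_ipc_lock && c->rlimit_memlock == 0)
		return -EPERM;

	return 0;
}
```

A test harness or fuzzer could run checks like these against the real
syscall and flag any case where the observed return value disagrees with
the documented one.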

>The other thing I would really like to see, to the extent we can, is
>that a bunch of patches adding all this data to the source will actually
>be accepted by the relevant maintainers.  It would be a shame to get all
>this infrastructure into place, then have things stall out due to
>maintainer pushback.  Maybe you should start by annotating the
>scheduler-related system calls; if that works the rest should be a piece
>of cake :)

In the RFC I've sent out I've specced out APIs from different subsystems
to solicit some feedback on those, but so far it's been quiet.

I'll resend a "lean" RFC v3 with just the base macro spec infra +
kerneldoc support + "trickier" sched API + "trickier" mm API.

I'm thinking that if it's still quiet in a month or two I'll propose a
talk at LPC around it, or maybe try to get feedback/consensus during
the maintainer's summit.

But yes, it doesn't make sense to take it in until we have an ack from a
few larger subsystems.

-- 
Thanks,
Sasha