commit 4a403d7aa9 upstream.
The m->index isn't updated when m->show() overflows and retains its
value before the current mount causing a restart to start at the same
value. If that happens in short order to due a quickly expanding mount
table this would cause the same mount to be shown again and again.
Ensure that *pos always equals the mount id of the mount that was
returned by start/next. On restart after overflow mnt_find_id_at(*pos)
finds the exact mount. This should avoid duplicates, avoid skips and
should handle concurrent modification just fine.
Cc: <stable@vger.kernel.org>
Fixed: 2eea9ce431 ("mounts: keep list of mounts in an rbtree")
Link: https://patch.msgid.link/20260129-geleckt-treuhand-4bb940acacd9@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 78f0e33cd6 ]
grab_requested_mnt_ns was changed to return error codes on failure, but
its callers were not updated to check for error pointers, still checking
only for a NULL return value.
This commit updates the callers to use IS_ERR() or IS_ERR_OR_NULL() and
PTR_ERR() to correctly check for and propagate errors.
This also makes sure that the logic actually works and mount namespace
file descriptors can be used to refere to mounts.
Christian Brauner <brauner@kernel.org> says:
Rework the patch to be more ergonomic and in line with our overall error
handling patterns.
Fixes: 7b9d14af87 ("fs: allow mount namespace fd")
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20251111062815.2546189-1-avagin@google.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 278033a225 ]
When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
By default, a tmpfs mount is limited to using 50% of the available RAM
for its content. This can be problematic in memory-constrained
environments, particularly during a kdump capture.
In a kdump scenario, the capture kernel boots with a limited amount of
memory specified by the 'crashkernel' parameter. If the initramfs is
large, it may fail to unpack into the tmpfs rootfs due to insufficient
space. This is because to get X MB of usable space in tmpfs, 2*X MB of
memory must be available for the mount. This leads to an OOM failure
during the early boot process, preventing a successful crash dump.
This patch introduces a new kernel command-line parameter,
initramfs_options, which allows passing specific mount options directly
to the rootfs when it is first mounted. This gives users control over
the rootfs behavior.
For example, a user can now specify initramfs_options=size=75% to allow
the tmpfs to use up to 75% of the available memory. This can
significantly reduce the memory pressure for kdump.
Consider a practical example:
To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
the default 50% limit, this requires a memory pool of 96MB to be
available for the tmpfs mount. The total memory requirement is therefore
approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
By using initramfs_options=size=75%, the memory pool required for the
48MB tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total
memory requirement by 32MB (96MB - 64MB), allowing the kdump to succeed
with a smaller crashkernel size, such as 192MB.
An alternative approach of reusing the existing rootflags parameter was
considered. However, a new, dedicated initramfs_options parameter was
chosen to avoid altering the current behavior of rootflags (which
applies to the final root filesystem) and to prevent any potential
regressions.
Also add documentation for the new kernel parameter "initramfs_options"
This approach is inspired by prior discussions and patches on the topic.
Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
Ref: https://landley.net/notes-2015.html#01-01-2015
Ref: https://lkml.org/lkml/2021/6/29/783
Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
Signed-off-by: Lichen Liu <lichliu@redhat.com>
Link: https://lore.kernel.org/20250815121459.3391223-1-lichliu@redhat.com
Tested-by: Rob Landley <rob@landley.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c1f86d0ac3 upstream.
Massage listmount() and make sure we don't call path_put() under the
namespace semaphore. If we put the last reference we're fscked.
Fixes: b4c2bea8ce ("add listmount(2) syscall")
Cc: stable@vger.kernel.org # v6.8+
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1f282cdc1d upstream.
may_decode_fh() is calling has_locked_children() while holding no locks.
That's an oopsable race...
The rest of the callers are safe since they are holding namespace_sem and
are guaranteed a positive refcount on the mount in question.
Rename the current has_locked_children() to __has_locked_children(), make
it static and switch the fs/namespace.c users to it.
Make has_locked_children() a wrapper for __has_locked_children(), calling
the latter under read_seqlock_excl(&mount_lock).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: 620c266f39 ("fhandle: relax open_by_handle_at() permission checks")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ Harshit: Resolved conflicts due to missing commit:
db04662e2f ("fs: allow detached mounts in clone_private_mount()") in
linux-6.12.y ]
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cffd044187 ]
do_change_type() and do_set_group() are operating on different
aspects of the same thing - propagation graph. The latter
asks for mounts involved to be mounted in namespace(s) the caller
has CAP_SYS_ADMIN for. The former is a mess - originally it
didn't even check that mount *is* mounted. That got fixed,
but the resulting check turns out to be too strict for userland -
in effect, we check that mount is in our namespace, having already
checked that we have CAP_SYS_ADMIN there.
What we really need (in both cases) is
* only touch mounts that are mounted. That's a must-have
constraint - data corruption happens if it get violated.
* don't allow to mess with a namespace unless you already
have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns).
That's an equivalent of what do_set_group() does; let's extract that
into a helper (may_change_propagation()) and use it in both
do_set_group() and do_change_type().
Fixes: 12f147ddd6 "do_change_type(): refuse to operate on unmounted/not ours mounts"
Acked-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Tested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c28f922c9d upstream.
What we want is to verify there is that clone won't expose something
hidden by a mount we wouldn't be able to undo. "Wouldn't be able to undo"
may be a result of MNT_LOCKED on a child, but it may also come from
lacking admin rights in the userns of the namespace mount belongs to.
clone_private_mnt() checks the former, but not the latter.
There's a number of rather confusing CAP_SYS_ADMIN checks in various
userns during the mount, especially with the new mount API; they serve
different purposes and in case of clone_private_mnt() they usually,
but not always end up covering the missing check mentioned above.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Fixes: 427215d85e ("ovl: prevent private clone if bind mount is not allowed")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ merge conflict resolution: clone_private_mount() was reworked in
db04662e2f ("fs: allow detached mounts in clone_private_mount()").
Tweak the relevant ns_capable check so that it works on older kernels ]
Signed-off-by: Noah Orlando <Noah.Orlando@deshaw.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ce7df19686 ]
If we are propagating across the userns boundary, we need to lock the
mounts added there. However, in case when something has already
been mounted there and we end up sliding a new tree under that,
the stuff that had been there before should not get locked.
IOW, lock_mnt_tree() should be called before we reparent the
preexisting tree on top of what we are adding.
Fixes: 3bd045cc9c ("separate copying and locking mount tree on cross-userns copies")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 12f147ddd6 ]
Ensure that propagation settings can only be changed for mounts located
in the caller's mount namespace. This change aligns permission checking
with the rest of mount(2).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 07b20889e3 ("beginning of the shared-subtree proper")
Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d8cc0362f9 ]
9ffb14ef61 "move_mount: allow to add a mount into an existing group"
breaks assertions on ->mnt_share/->mnt_slave. For once, the data structures
in question are actually documented.
Documentation/filesystem/sharedsubtree.rst:
All vfsmounts in a peer group have the same ->mnt_master. If it is
non-NULL, they form a contiguous (ordered) segment of slave list.
do_set_group() puts a mount into the same place in propagation graph
as the old one. As the result, if old mount gets events from somewhere
and is not a pure event sink, new one needs to be placed next to the
old one in the slave list the old one's on. If it is a pure event
sink, we only need to make sure the new one doesn't end up in the
middle of some peer group.
"move_mount: allow to add a mount into an existing group" ends up putting
the new one in the beginning of list; that's definitely not going to be
in the middle of anything, so that's fine for case when old is not marked
shared. In case when old one _is_ marked shared (i.e. is not a pure event
sink), that breaks the assumptions of propagation graph iterators.
Put the new mount next to the old one on the list - that does the right thing
in "old is marked shared" case and is just as correct as the current behaviour
if old is not marked shared (kudos to Pavel for pointing that out - my original
suggested fix changed behaviour in the "nor marked" case, which complicated
things for no good reason).
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 9ffb14ef61 ("move_mount: allow to add a mount into an existing group")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5f31c54938 ]
Holding namespace_sem is enough to make sure that result remains valid.
It is *not* enough to avoid false negatives from __lookup_mnt(). Mounts
can be unhashed outside of namespace_sem (stuck children getting detached
on final mntput() of lazy-umounted mount) and having an unrelated mount
removed from the hash chain while we traverse it may end up with false
negative from __lookup_mnt(). We need to sample and recheck the seqlock
component of mount_lock...
Bug predates the introduction of path_overmount() - it had come from
the code in finish_automount() that got abstracted into that helper.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 26df6034fd ("fix automount/automount race properly")
Fixes: 6ac3928156 ("fs: allow to mount beneath top mount")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 250cf36930 ]
... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().
Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 65781e19dc ]
do_umount() analogue of the race fixed in 119e1ef80e "fix
__legitimize_mnt()/mntput() race". Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment. Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit). Still a bug...
Fixes: 48a066e72d ("RCU'd vfsmounts")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d039eac6e ]
Normally do_lock_mount(path, _) is locking a mountpoint pinned by
*path and at the time when matching unlock_mount() unlocks that
location it is still pinned by the same thing.
Unfortunately, for 'beneath' case it's no longer that simple -
the object being locked is not the one *path points to. It's the
mountpoint of path->mnt. The thing is, without sufficient locking
->mnt_parent may change under us and none of the locks are held
at that point. The rules are
* mount_lock stabilizes m->mnt_parent for any mount m.
* namespace_sem stabilizes m->mnt_parent, provided that
m is mounted.
* if either of the above holds and refcount of m is positive,
we are guaranteed the same for refcount of m->mnt_parent.
namespace_sem nests inside inode_lock(), so do_lock_mount() has
to take inode_lock() before grabbing namespace_sem. It does
recheck that path->mnt is still mounted in the same place after
getting namespace_sem, and it does take care to pin the dentry.
It is needed, since otherwise we might end up with racing mount --move
(or umount) happening while we were getting locks; in that case
dentry would no longer be a mountpoint and could've been evicted
on memory pressure along with its inode - not something you want
when grabbing lock on that inode.
However, pinning a dentry is not enough - the matching mount is
also pinned only by the fact that path->mnt is mounted on top it
and at that point we are not holding any locks whatsoever, so
the same kind of races could end up with all references to
that mount gone just as we are about to enter inode_lock().
If that happens, we are left with filesystem being shut down while
we are holding a dentry reference on it; results are not pretty.
What we need to do is grab both dentry and mount at the same time;
that makes inode_lock() safe *and* avoids the problem with fs getting
shut down under us. After taking namespace_sem we verify that
path->mnt is still mounted (which stabilizes its ->mnt_parent) and
check that it's still mounted at the same place. From that point
on to the matching namespace_unlock() we are guaranteed that
mount/dentry pair we'd grabbed are also pinned by being the mountpoint
of path->mnt, so we can quietly drop both the dentry reference (as
the current code does) and mnt one - it's OK to do under namespace_sem,
since we are not dropping the final refs.
That solves the problem on do_lock_mount() side; unlock_mount()
also has one, since dentry is guaranteed to stay pinned only until
the namespace_unlock(). That's easy to fix - just have inode_unlock()
done earlier, while it's still pinned by mp->m_dentry.
Fixes: 6ac3928156 "fs: allow to mount beneath top mount" # v6.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 056d33137b upstream.
Currently these mount options aren't accessible via statmount().
The read handler for /proc/#/mountinfo calls security_sb_show_options()
to emit the security options after emitting superblock flag options, but
before calling sb->s_op->show_options.
Have statmount_mnt_opts() call security_sb_show_options() before
calling ->show_options.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241115-statmount-v2-2-cd29aeff9cbb@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stable-dep-of: 5eb9871053 ("fs: fix adding security options to statmount.mnt_opt")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 344bac8f0d upstream.
Move mnt->mnt_node into the union with mnt->mnt_rcu and mnt->mnt_llist
instead of keeping it with mnt->mnt_list. This allows us to use
RB_CLEAR_NODE(&mnt->mnt_node) in umount_tree() as well as
list_empty(&mnt->mnt_node). That in turn allows us to remove MNT_ONRB.
This also fixes the bug reported in [1] where seemingly MNT_ONRB wasn't
set in @mnt->mnt_flags even though the mount was present in the mount
rbtree of the mount namespace.
The root cause is the following race. When a btrfs subvolume is mounted
a temporary mount is created:
btrfs_get_tree_subvol()
{
mnt = fc_mount()
// Register the newly allocated mount with sb->mounts:
lock_mount_hash();
list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
unlock_mount_hash();
}
and registered on sb->s_mounts. Later it is added to an anonymous mount
namespace via mount_subvol():
-> mount_subvol()
-> mount_subtree()
-> alloc_mnt_ns()
mnt_add_to_ns()
vfs_path_lookup()
put_mnt_ns()
The mnt_add_to_ns() call raises MNT_ONRB in @mnt->mnt_flags. If someone
concurrently does a ro remount:
reconfigure_super()
-> sb_prepare_remount_readonly()
{
list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
}
all mounts registered in sb->s_mounts are visited and first
MNT_WRITE_HOLD is raised, then MNT_READONLY is raised, and finally
MNT_WRITE_HOLD is removed again.
The flag modification for MNT_WRITE_HOLD/MNT_READONLY and MNT_ONRB race
so MNT_ONRB might be lost.
Fixes: 2eea9ce431 ("mounts: keep list of mounts in an rbtree")
Cc: <stable@kernel.org> # v6.8+
Link: https://lore.kernel.org/r/20241215-vfs-6-14-mount-work-v1-1-fd55922c4af8@kernel.org
Link: https://lore.kernel.org/r/ec6784ed-8722-4695-980a-4400d4e7bd1a@gmx.com [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull fuse updates from Miklos Szeredi:
- Add support for idmapped fuse mounts (Alexander Mikhalitsyn)
- Add optimization when checking for writeback (yangyun)
- Add tracepoints (Josef Bacik)
- Clean up writeback code (Joanne Koong)
- Clean up request queuing (me)
- Misc fixes
* tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (32 commits)
fuse: use exclusive lock when FUSE_I_CACHE_IO_MODE is set
fuse: clear FR_PENDING if abort is detected when sending request
fs/fuse: convert to use invalid_mnt_idmap
fs/mnt_idmapping: introduce an invalid_mnt_idmap
fs/fuse: introduce and use fuse_simple_idmap_request() helper
fs/fuse: fix null-ptr-deref when checking SB_I_NOIDMAP flag
fuse: allow O_PATH fd for FUSE_DEV_IOC_BACKING_OPEN
virtio_fs: allow idmapped mounts
fuse: allow idmapped mounts
fuse: warn if fuse_access is called when idmapped mounts are allowed
fuse: handle idmappings properly in ->write_iter()
fuse: support idmapped ->rename op
fuse: support idmapped ->set_acl
fuse: drop idmap argument from __fuse_get_acl
fuse: support idmapped ->setattr op
fuse: support idmapped ->permission inode op
fuse: support idmapped getattr inode op
fuse: support idmap for mkdir/mknod/symlink/create/tmpfile
fuse: support idmapped FUSE_EXT_GROUPS
fuse: add an idmap argument to fuse_simple_request
...
Pull 'struct fd' updates from Al Viro:
"Just the 'struct fd' layout change, with conversion to accessor
helpers"
* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add struct fd constructors, get rid of __to_fd()
struct fd: representation change
introduce fd_file(), convert all accessors to it.
Pull vfs mount updates from Christian Brauner:
"Recently, we added the ability to list mounts in other mount
namespaces and the ability to retrieve namespace file descriptors
without having to go through procfs by deriving them from pidfds.
This extends nsfs in two ways:
(1) Add the ability to retrieve information about a mount namespace
via NS_MNT_GET_INFO.
This will return the mount namespace id and the number of mounts
currently in the mount namespace. The number of mounts can be
used to size the buffer that needs to be used for listmount() and
is in general useful without having to actually iterate through
all the mounts.
The structure is extensible.
(2) Add the ability to iterate through all mount namespaces over
which the caller holds privilege returning the file descriptor
for the next or previous mount namespace.
To retrieve a mount namespace the caller must be privileged wrt
to it's owning user namespace. This means that PID 1 on the host
can list all mounts in all mount namespaces or that a container
can list all mounts of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
namespace in one go.
(1) and (2) can be implemented for other namespace types easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.
The commit message in 49224a345c ('Merge patch series "nsfs: iterate
through mount namespaces"') contains example code how to do this"
* tag 'vfs-6.12.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nsfs: iterate through mount namespaces
file: add fput() cleanup helper
fs: add put_mnt_ns() cleanup helper
fs: allow mount namespace fd
Right now we determine if filesystem support vfs idmappings or not basing
on the FS_ALLOW_IDMAP flag presence. This "static" way works perfecly well
for local filesystems like ext4, xfs, btrfs, etc. But for network-like
filesystems like fuse, cephfs this approach is not ideal, because sometimes
proper support of vfs idmaps requires some extensions for the on-wire
protocol, which implies that changes have to be made not only in the Linux
kernel code but also in the 3rd party components like libfuse, cephfs MDS
server and so on.
We have seen that issue during our work on cephfs idmapped mounts [1] with
Christian, but right now I'm working on the idmapped mounts support for
fuse/virtiofs and I think that it is a right time for this extension.
[1] 5ccd8530dd ("ceph: handle idmapped mounts in create_request_message()")
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If no page could be allocated, an error pointer was used as format
string in pr_warn.
Rearrange the code to return early in case of OOM. Also add a check
for the return value of d_path.
Fixes: f8b92ba67c ("mount: Add mount warning for impending timestamp expiry")
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Link: https://lore.kernel.org/r/20240730085856.32385-1-olaf@aepfle.de
[brauner: rewrite commit and commit message]
Signed-off-by: Christian Brauner <brauner@kernel.org>
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Christian Brauner <brauner@kernel.org> says:
Recently, we added the ability to list mounts in other mount namespaces
and the ability to retrieve namespace file descriptors without having to
go through procfs by deriving them from pidfds.
This extends nsfs in two ways:
(1) Add the ability to retrieve information about a mount namespace via
NS_MNT_GET_INFO. This will return the mount namespace id and the
number of mounts currently in the mount namespace. The number of
mounts can be used to size the buffer that needs to be used for
listmount() and is in general useful without having to actually
iterate through all the mounts.
The structure is extensible.
(2) Add the ability to iterate through all mount namespaces over which
the caller holds privilege returning the file descriptor for the
next or previous mount namespace.
To retrieve a mount namespace the caller must be privileged wrt to
it's owning user namespace. This means that PID 1 on the host can
list all mounts in all mount namespaces or that a container can list
all mounts of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
namespace in one go.
(1) and (2) can be implemented for other namespace types easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs. Here's
a sample program list_all_mounts_everywhere.c:
// SPDX-License-Identifier: GPL-2.0-or-later
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <getopt.h>
#include <linux/stat.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/param.h>
#include <sys/pidfd.h>
#include <sys/stat.h>
#include <sys/statfs.h>
#define die_errno(format, ...) \
do { \
fprintf(stderr, "%m | %s: %d: %s: " format "\n", __FILE__, \
__LINE__, __func__, ##__VA_ARGS__); \
exit(EXIT_FAILURE); \
} while (0)
/* Get the id for a mount namespace */
#define NS_GET_MNTNS_ID _IO(0xb7, 0x5)
/* Get next mount namespace. */
struct mnt_ns_info {
__u32 size;
__u32 nr_mounts;
__u64 mnt_ns_id;
};
#define MNT_NS_INFO_SIZE_VER0 16 /* size of first published struct */
/* Get information about namespace. */
#define NS_MNT_GET_INFO _IOR(0xb7, 10, struct mnt_ns_info)
/* Get next namespace. */
#define NS_MNT_GET_NEXT _IOR(0xb7, 11, struct mnt_ns_info)
/* Get previous namespace. */
#define NS_MNT_GET_PREV _IOR(0xb7, 12, struct mnt_ns_info)
#define PIDFD_GET_MNT_NAMESPACE _IO(0xFF, 3)
#define STATX_MNT_ID_UNIQUE 0x00004000U /* Want/got extended stx_mount_id */
#define __NR_listmount 458
#define __NR_statmount 457
/*
* @mask bits for statmount(2)
*/
#define STATMOUNT_SB_BASIC 0x00000001U /* Want/got sb_... */
#define STATMOUNT_MNT_BASIC 0x00000002U /* Want/got mnt_... */
#define STATMOUNT_PROPAGATE_FROM 0x00000004U /* Want/got propagate_from */
#define STATMOUNT_MNT_ROOT 0x00000008U /* Want/got mnt_root */
#define STATMOUNT_MNT_POINT 0x00000010U /* Want/got mnt_point */
#define STATMOUNT_FS_TYPE 0x00000020U /* Want/got fs_type */
#define STATMOUNT_MNT_NS_ID 0x00000040U /* Want/got mnt_ns_id */
#define STATMOUNT_MNT_OPTS 0x00000080U /* Want/got mnt_opts */
struct statmount {
__u32 size; /* Total size, including strings */
__u32 mnt_opts;
__u64 mask; /* What results were written */
__u32 sb_dev_major; /* Device ID */
__u32 sb_dev_minor;
__u64 sb_magic; /* ..._SUPER_MAGIC */
__u32 sb_flags; /* SB_{RDONLY,SYNCHRONOUS,DIRSYNC,LAZYTIME} */
__u32 fs_type; /* [str] Filesystem type */
__u64 mnt_id; /* Unique ID of mount */
__u64 mnt_parent_id; /* Unique ID of parent (for root == mnt_id) */
__u32 mnt_id_old; /* Reused IDs used in proc/.../mountinfo */
__u32 mnt_parent_id_old;
__u64 mnt_attr; /* MOUNT_ATTR_... */
__u64 mnt_propagation; /* MS_{SHARED,SLAVE,PRIVATE,UNBINDABLE} */
__u64 mnt_peer_group; /* ID of shared peer group */
__u64 mnt_master; /* Mount receives propagation from this ID */
__u64 propagate_from; /* Propagation from in current namespace */
__u32 mnt_root; /* [str] Root of mount relative to root of fs */
__u32 mnt_point; /* [str] Mountpoint relative to current root */
__u64 mnt_ns_id;
__u64 __spare2[49];
char str[]; /* Variable size part containing strings */
};
struct mnt_id_req {
__u32 size;
__u32 spare;
__u64 mnt_id;
__u64 param;
__u64 mnt_ns_id;
};
#define MNT_ID_REQ_SIZE_VER1 32 /* sizeof second published struct */
#define LSMT_ROOT 0xffffffffffffffff /* root mount */
static int __statmount(__u64 mnt_id, __u64 mnt_ns_id, __u64 mask,
struct statmount *stmnt, size_t bufsize, unsigned int flags)
{
struct mnt_id_req req = {
.size = MNT_ID_REQ_SIZE_VER1,
.mnt_id = mnt_id,
.param = mask,
.mnt_ns_id = mnt_ns_id,
};
return syscall(__NR_statmount, &req, stmnt, bufsize, flags);
}
static struct statmount *sys_statmount(__u64 mnt_id, __u64 mnt_ns_id,
__u64 mask, unsigned int flags)
{
size_t bufsize = 1 << 15;
struct statmount *stmnt = NULL, *tmp = NULL;
int ret;
for (;;) {
tmp = realloc(stmnt, bufsize);
if (!tmp)
goto out;
stmnt = tmp;
ret = __statmount(mnt_id, mnt_ns_id, mask, stmnt, bufsize, flags);
if (!ret)
return stmnt;
if (errno != EOVERFLOW)
goto out;
bufsize <<= 1;
if (bufsize >= UINT_MAX / 2)
goto out;
}
out:
free(stmnt);
printf("statmount failed");
return NULL;
}
static ssize_t sys_listmount(__u64 mnt_id, __u64 last_mnt_id, __u64 mnt_ns_id,
__u64 list[], size_t num, unsigned int flags)
{
struct mnt_id_req req = {
.size = MNT_ID_REQ_SIZE_VER1,
.mnt_id = mnt_id,
.param = last_mnt_id,
.mnt_ns_id = mnt_ns_id,
};
return syscall(__NR_listmount, &req, list, num, flags);
}
int main(int argc, char *argv[])
{
#define LISTMNT_BUFFER 10
__u64 list[LISTMNT_BUFFER], last_mnt_id = 0;
int ret, pidfd, fd_mntns;
struct mnt_ns_info info = {};
pidfd = pidfd_open(getpid(), 0);
if (pidfd < 0)
die_errno("pidfd_open failed");
fd_mntns = ioctl(pidfd, PIDFD_GET_MNT_NAMESPACE, 0);
if (fd_mntns < 0)
die_errno("ioctl(PIDFD_GET_MNT_NAMESPACE) failed");
ret = ioctl(fd_mntns, NS_MNT_GET_INFO, &info);
if (ret < 0)
die_errno("ioctl(NS_GET_MNTNS_ID) failed");
printf("Listing %u mounts for mount namespace %d:%llu\n", info.nr_mounts, fd_mntns, info.mnt_ns_id);
for (;;) {
ssize_t nr_mounts;
next:
nr_mounts = sys_listmount(LSMT_ROOT, last_mnt_id, info.mnt_ns_id, list, LISTMNT_BUFFER, 0);
if (nr_mounts <= 0) {
printf("Finished listing mounts for mount namespace %d:%llu\n\n", fd_mntns, info.mnt_ns_id);
ret = ioctl(fd_mntns, NS_MNT_GET_NEXT, 0);
if (ret < 0)
die_errno("ioctl(NS_MNT_GET_NEXT) failed");
close(ret);
ret = ioctl(fd_mntns, NS_MNT_GET_NEXT, &info);
if (ret < 0) {
if (errno == ENOENT) {
printf("Finished listing all mount namespaces\n");
exit(0);
}
die_errno("ioctl(NS_MNT_GET_NEXT) failed");
}
close(fd_mntns);
fd_mntns = ret;
last_mnt_id = 0;
printf("Listing %u mounts for mount namespace %d:%llu\n", info.nr_mounts, fd_mntns, info.mnt_ns_id);
goto next;
}
for (size_t cur = 0; cur < nr_mounts; cur++) {
struct statmount *stmnt;
last_mnt_id = list[cur];
stmnt = sys_statmount(last_mnt_id, info.mnt_ns_id,
STATMOUNT_SB_BASIC |
STATMOUNT_MNT_BASIC |
STATMOUNT_MNT_ROOT |
STATMOUNT_MNT_POINT |
STATMOUNT_MNT_NS_ID |
STATMOUNT_MNT_OPTS |
STATMOUNT_FS_TYPE,
0);
if (!stmnt) {
printf("Failed to statmount(%llu) in mount namespace(%llu)\n", last_mnt_id, info.mnt_ns_id);
continue;
}
printf("mnt_id(%u/%llu) | mnt_parent_id(%u/%llu): %s @ %s ==> %s with options: %s\n",
stmnt->mnt_id_old, stmnt->mnt_id,
stmnt->mnt_parent_id_old, stmnt->mnt_parent_id,
stmnt->str + stmnt->fs_type,
stmnt->str + stmnt->mnt_root,
stmnt->str + stmnt->mnt_point,
stmnt->str + stmnt->mnt_opts);
free(stmnt);
}
}
exit(0);
}
* patches from https://lore.kernel.org/r/20240719-work-mount-namespace-v1-0-834113cab0d2@kernel.org:
nsfs: iterate through mount namespaces
file: add fput() cleanup helper
fs: add put_mnt_ns() cleanup helper
fs: allow mount namespace fd
Signed-off-by: Christian Brauner <brauner@kernel.org>
It is already possible to list mounts in other mount namespaces and to
retrieve namespace file descriptors without having to go through procfs
by deriving them from pidfds.
Augment these abilities by adding the ability to retrieve information
about a mount namespace via NS_MNT_GET_INFO. This will return the mount
namespace id and the number of mounts currently in the mount namespace.
The number of mounts can be used to size the buffer that needs to be
used for listmount() and is in general useful without having to actually
iterate through all the mounts. The structure is extensible.
And add the ability to iterate through all mount namespaces over which
the caller holds privilege returning the file descriptor for the next or
previous mount namespace.
To retrieve a mount namespace the caller must be privileged wrt to it's
owning user namespace. This means that PID 1 on the host can list all
mounts in all mount namespaces or that a container can list all mounts
of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount namespace
in one go. Both ioctls can be implemented for other namespace types
easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs.
Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-5-834113cab0d2@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull vfs mount query updates from Christian Brauner:
"This contains work to extend the abilities of listmount() and
statmount() and various fixes and cleanups.
Features:
- Allow iterating through mounts via listmount() from newest to
oldest. This makes it possible for mount(8) to keep iterating the
mount table in reverse order so it gets newest mounts first.
- Relax permissions on listmount() and statmount().
It's not necessary to have capabilities in the initial namespace:
it is sufficient to have capabilities in the owning namespace of
the mount namespace we're located in to list unreachable mounts in
that namespace.
- Extend both listmount() and statmount() to list and stat mounts in
foreign mount namespaces.
Currently the only way to iterate over mount entries in mount
namespaces that aren't in the caller's mount namespace is by
crawling through /proc in order to find /proc/<pid>/mountinfo for
the relevant mount namespace.
This is both very clumsy and hugely inefficient. So extend struct
mnt_id_req with a new member that allows to specify the mount
namespace id of the mount namespace we want to look at.
Luckily internally we already have most of the infrastructure for
this so we just need to expose it to userspace. Give userspace a
way to retrieve the id of a mount namespace via statmount() and
through a new nsfs ioctl() on mount namespace file descriptor.
This comes with appropriate selftests.
- Expose mount options through statmount().
Currently if userspace wants to get mount options for a mount and
with statmount(), they still have to open /proc/<pid>/mountinfo to
parse mount options. Simply the information through statmount()
directly.
Afterwards it's possible to only rely on statmount() and
listmount() to retrieve all and more information than
/proc/<pid>/mountinfo provides.
This comes with appropriate selftests.
Fixes:
- Avoid copying to userspace under the namespace semaphore in
listmount.
Cleanups:
- Simplify the error handling in listmount by relying on our newly
added cleanup infrastructure.
- Refuse invalid mount ids early for both listmount and statmount"
* tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reject invalid last mount id early
fs: refuse mnt id requests with invalid ids early
fs: find rootfs mount of the mount namespace
fs: only copy to userspace on success in listmount()
sefltests: extend the statmount test for mount options
fs: use guard for namespace_sem in statmount()
fs: export mount options via statmount()
fs: rename show_mnt_opts -> show_vfsmnt_opts
selftests: add a test for the foreign mnt ns extensions
fs: add an ioctl to get the mnt ns id from nsfs
fs: Allow statmount() in foreign mount namespace
fs: Allow listmount() in foreign mount namespace
fs: export the mount ns id via statmount
fs: keep an index of current mount namespaces
fs: relax permissions for statmount()
listmount: allow listing in reverse order
fs: relax permissions for listmount()
fs: simplify error handling
fs: don't copy to userspace under namespace semaphore
path: add cleanup helper
The method we used was predicated on the assumption that the mount
immediately following the root mount of the mount namespace would be the
rootfs mount of the namespace. That's not always the case though. For
example:
ID PARENT ID
408 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/hosts /etc/hosts rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
409 414 0:61 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=64000k,uid=1000,gid=1000,inode64
410 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/.containerenv /run/.containerenv rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
411 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/hostname /etc/hostname rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
412 363 0:65 / / rw,relatime - overlay overlay rw,lowerdir=/home/user1/.local/share/containers/storage/overlay/l/JS65SUCGTPCP2EEBHLRP4UCFI5:/home/user1/.local/share/containers/storage/overlay/l/DLW22KVDWUNI4242D6SDJ5GKCL [...]
413 412 0:68 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
414 412 0:69 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,uid=1000,gid=1000,inode64
415 412 0:70 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
416 414 0:71 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=100004,mode=620,ptmxmode=666
417 414 0:67 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
418 415 0:27 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
419 414 0:6 /null /dev/null rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
420 414 0:6 /zero /dev/zero rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
422 414 0:6 /full /dev/full rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
423 414 0:6 /tty /dev/tty rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
430 414 0:6 /random /dev/random rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
431 414 0:6 /urandom /dev/urandom rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
433 413 0:72 / /proc/acpi ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
440 413 0:6 /null /proc/kcore ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
441 413 0:6 /null /proc/keys ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
442 413 0:6 /null /proc/timer_list ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
443 413 0:73 / /proc/scsi ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
444 415 0:74 / /sys/firmware ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
445 415 0:75 / /sys/dev/block ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
446 413 0:68 /bus /proc/bus ro,nosuid,nodev,noexec,relatime - proc proc rw
447 413 0:68 /fs /proc/fs ro,nosuid,nodev,noexec,relatime - proc proc rw
448 413 0:68 /irq /proc/irq ro,nosuid,nodev,noexec,relatime - proc proc rw
449 413 0:68 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
450 413 0:68 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
364 414 0:71 /0 /dev/console rw,relatime - devpts devpts rw,gid=100004,mode=620,ptmxmode=666
In this mount table the root mount of the mount namespace is the mount
with id 363 (It isn't visible because it's literally just what the
rootfs mount is mounted upon and usually it's just a copy of the real
rootfs).
The rootfs mount that's mounted on the root mount of the mount namespace
is the mount with id 412. But the mount namespace contains mounts that
were created before the rootfs mount and thus have earlier mount ids. So
the first call to listmnt_next() would return the mount with the mount
id 408 and not the rootfs mount.
So we need to find the actual rootfs mount mounted on the root mount of
the mount namespace. This logic is also present in mntns_install() where
vfs_path_lookup() is used. We can't use this though as we're holding the
namespace semaphore. We could look at the children of the root mount of
the mount namespace directly but that also seems a bit out of place
while we have the rbtree. So let's just iterate through the rbtree
starting from the root mount of the mount namespace and find the mount
whose parent is the root mount of the mount namespace. That mount will
usually appear very early in the rbtree and afaik there can only be one.
IOW, it would be very strange if we ended up with a root mount of a
mount namespace that has shadow mounts.
Fixes: 0a3deb1185 ("fs: Allow listmount() in foreign mount namespace") # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
Avoid copying when we failed to, or didn't have any mounts to list.
Fixes: cb54ef4f05 ("fs: don't copy to userspace under namespace semaphore") # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
statmount() can export arbitrary strings, so utilize the __spare1 slot
for a mnt_opts string pointer, and then support asking for and setting
the mount options during statmount(). This calls into the helper for
showing mount options, which already uses a seq_file, so fits in nicely
with our existing mechanism for exporting strings via statmount().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/3aa6bf8bd5d0a21df9ebd63813af8ab532c18276.1719257716.git.josef@toxicpanda.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
[brauner: only call sb->s_op->show_options()]
Signed-off-by: Christian Brauner <brauner@kernel.org>
util-linux is about to implement listmount() and statmount() support.
Karel requested the ability to scan the mount table in backwards order
because that's what libmount currently does in order to get the latest
mount first. We currently don't support this in listmount(). Add a new
LISTMOUNT_REVERSE flag to allow listing mounts in reverse order. For
example, listing all child mounts of /sys without LISTMOUNT_REVERSE
gives:
/sys/kernel/security @ mnt_id: 4294968369
/sys/fs/cgroup @ mnt_id: 4294968370
/sys/firmware/efi/efivars @ mnt_id: 4294968371
/sys/fs/bpf @ mnt_id: 4294968372
/sys/kernel/tracing @ mnt_id: 4294968373
/sys/kernel/debug @ mnt_id: 4294968374
/sys/fs/fuse/connections @ mnt_id: 4294968375
/sys/kernel/config @ mnt_id: 4294968376
whereas with LISTMOUNT_REVERSE it gives:
/sys/kernel/config @ mnt_id: 4294968376
/sys/fs/fuse/connections @ mnt_id: 4294968375
/sys/kernel/debug @ mnt_id: 4294968374
/sys/kernel/tracing @ mnt_id: 4294968373
/sys/fs/bpf @ mnt_id: 4294968372
/sys/firmware/efi/efivars @ mnt_id: 4294968371
/sys/fs/cgroup @ mnt_id: 4294968370
/sys/kernel/security @ mnt_id: 4294968369
Link: https://lore.kernel.org/r/20240607-vfs-listmount-reverse-v1-4-7877a2bfa5e5@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>