mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2026-06-21 15:43:21 +02:00
7e0e7bd60d
Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe->mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe->mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size > PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 << n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to ->poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
backing-file: fix backing_file_open() kerneldoc parameter
iomap: pass the correct len to fserror_report_io in __iomap_write_begin
vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
bpf: add bpf_real_inode() kfunc
fs/read_write: Do not export __kernel_write() to the entire world
libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
mount: honour SB_NOUSER in the new mount API
fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
selftests/pipe: add pipe_bench microbenchmark
fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
fs: retire stale comment in fget_task_next()
fs: fix spelling mistakes in comment
bfs: replace get_zeroed_page() with kzalloc()
binfmt_misc: replace __get_free_page() with kmalloc()
configfs: replace __get_free_pages() with kzalloc()
fs/namespace: use __getname() to allocate mntpath buffer
fs/select: replace __get_free_page() with kmalloc()
...
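
The locking pattern described above for anon_pipe_write() can be sketched in userspace C. This is an illustration only, not the kernel code: `struct ring`, `ring_init()`, `ring_write()`, `RING_SLOTS` and `CACHE_SLOTS` are hypothetical names, malloc() stands in for alloc_page(), and a pthread mutex stands in for pipe->mutex. Only the limit of 8 pre-allocated pages is taken from the pull message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ      4096 /* assumed page size for this sketch */
#define MAX_PREALLOC 8    /* "up to 8 pages" from the pull message */
#define RING_SLOTS   16   /* hypothetical buffer capacity, in pages */
#define CACHE_SLOTS  2    /* hypothetical spare-page cache size */

struct ring {
	pthread_mutex_t lock;
	void *slot[RING_SLOTS];   /* pages queued for the reader */
	size_t used;
	void *spare[CACHE_SLOTS]; /* stand-in for the per-pipe tmp_page[] */
};

static void ring_init(struct ring *r)
{
	int rc;

	memset(r, 0, sizeof(*r));
	rc = pthread_mutex_init(&r->lock, NULL);
	assert(rc == 0);
	(void)rc;
}

/* Queue up to MAX_PREALLOC pages of data; returns the bytes queued. */
static size_t ring_write(struct ring *r, const char *data, size_t len)
{
	void *pages[MAX_PREALLOC];
	size_t want = (len + PAGE_SZ - 1) / PAGE_SZ;
	size_t got = 0, taken = 0, written = 0;

	if (want > MAX_PREALLOC)
		want = MAX_PREALLOC;

	/* 1. Allocate before taking the lock, so a slow allocation stalls nobody. */
	while (got < want && (pages[got] = malloc(PAGE_SZ)) != NULL)
		got++;

	pthread_mutex_lock(&r->lock);

	/* 2. The critical section only copies data and links pages. */
	while (taken < got && r->used < RING_SLOTS && written < len) {
		size_t n = len - written < PAGE_SZ ? len - written : PAGE_SZ;

		memcpy(pages[taken], data + written, n);
		r->slot[r->used++] = pages[taken++];
		written += n;
	}

	/* 3. Recycle unused pages into the spare cache while still locked. */
	for (size_t i = 0; i < CACHE_SLOTS && taken < got; i++)
		if (!r->spare[i])
			r->spare[i] = pages[taken++];

	pthread_mutex_unlock(&r->lock);

	/* 4. Free whatever is left only after the lock is dropped. */
	while (taken < got)
		free(pages[taken++]);

	return written;
}
```

The sketch only fills the spare cache; a fuller version would also consume cached pages on later writes. The point is that neither malloc() nor free() runs while the mutex is held.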
218 lines
8.3 KiB
C
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_NAMEI_H
#define _LINUX_NAMEI_H

#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/path.h>
#include <linux/fcntl.h>
#include <linux/errno.h>
#include <linux/fs_struct.h>

enum { MAX_NESTED_LINKS = 8 };

#define MAXSYMLINKS 40

/* pathwalk mode */
#define LOOKUP_FOLLOW		BIT(0)	/* follow links at the end */
#define LOOKUP_DIRECTORY	BIT(1)	/* require a directory */
#define LOOKUP_AUTOMOUNT	BIT(2)	/* force terminal automount */
#define LOOKUP_EMPTY		BIT(3)	/* accept empty path [user_... only] */
#define LOOKUP_LINKAT_EMPTY	BIT(4)	/* Linkat request with empty path. */
#define LOOKUP_DOWN		BIT(5)	/* follow mounts in the starting point */
#define LOOKUP_MOUNTPOINT	BIT(6)	/* follow mounts in the end */
#define LOOKUP_REVAL		BIT(7)	/* tell ->d_revalidate() to trust no cache */
#define LOOKUP_RCU		BIT(8)	/* RCU pathwalk mode; semi-internal */
#define LOOKUP_CACHED		BIT(9)	/* Only do cached lookup */
#define LOOKUP_PARENT		BIT(10)	/* Looking up final parent in path */
/* 5 spare bits for pathwalk */

/* These tell filesystem methods that we are dealing with the final component... */
#define LOOKUP_OPEN		BIT(16)	/* ... in open */
#define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
#define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
#define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */

/* 4 spare bits for intent */

/* Scoping flags for lookup. */
#define LOOKUP_NO_SYMLINKS	BIT(24)	/* No symlink crossing. */
#define LOOKUP_NO_MAGICLINKS	BIT(25)	/* No nd_jump_link() crossing. */
#define LOOKUP_NO_XDEV		BIT(26)	/* No mountpoint crossing. */
#define LOOKUP_BENEATH		BIT(27)	/* No escaping from starting point. */
#define LOOKUP_IN_ROOT		BIT(28)	/* Treat dirfd as fs root. */
/* LOOKUP_* flags which do scope-related checks based on the dirfd. */
#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
/* 3 spare bits for scoping */

extern int path_pts(struct path *path);

extern int user_path_at(int, const char __user *, unsigned, struct path *);

extern int kern_path(const char *, unsigned, struct path *);
struct dentry *kern_path_parent(const char *name, struct path *parent);

extern struct dentry *start_creating_path(int, const char *, struct path *, unsigned int);
extern struct dentry *start_creating_user_path(int, const char __user *, struct path *, unsigned int);
extern void end_creating_path(const struct path *, struct dentry *);
extern struct dentry *start_removing_path(const char *, struct path *);
static inline void end_removing_path(const struct path *path, struct dentry *dentry)
{
	end_creating_path(path, dentry);
}
int vfs_path_parent_lookup(struct filename *filename, unsigned int flags,
			   struct path *parent, struct qstr *last,
			   const struct path *root);
int vfs_path_lookup(struct dentry *, struct vfsmount *, const char *,
		    unsigned int, struct path *);

extern struct dentry *try_lookup_noperm(struct qstr *, struct dentry *);
extern struct dentry *lookup_noperm(struct qstr *, struct dentry *);
extern struct dentry *lookup_noperm_unlocked(struct qstr *, struct dentry *);
extern struct dentry *lookup_noperm_positive_unlocked(struct qstr *, struct dentry *);
struct dentry *lookup_one(struct mnt_idmap *, struct qstr *, struct dentry *);
struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap,
				   struct qstr *name, struct dentry *base);
struct dentry *lookup_one_positive_unlocked(struct mnt_idmap *idmap,
					    struct qstr *name,
					    struct dentry *base);
struct dentry *lookup_one_positive_killable(struct mnt_idmap *idmap,
					    struct qstr *name,
					    struct dentry *base);

struct dentry *start_creating(struct mnt_idmap *idmap, struct dentry *parent,
			      struct qstr *name);
struct dentry *start_removing(struct mnt_idmap *idmap, struct dentry *parent,
			      struct qstr *name);
struct dentry *start_creating_killable(struct mnt_idmap *idmap,
				       struct dentry *parent,
				       struct qstr *name);
struct dentry *start_removing_killable(struct mnt_idmap *idmap,
				       struct dentry *parent,
				       struct qstr *name);
struct dentry *start_creating_noperm(struct dentry *parent, struct qstr *name);
struct dentry *start_removing_noperm(struct dentry *parent, struct qstr *name);
struct dentry *start_creating_dentry(struct dentry *parent,
				     struct dentry *child);
struct dentry *start_removing_dentry(struct dentry *parent,
				     struct dentry *child);

/* end_creating - finish action started with start_creating
 * @child: dentry returned by start_creating() or vfs_mkdir()
 *
 * Unlock and release the child. This can be called after
 * start_creating() whether that function succeeded or not,
 * but it is not needed on failure.
 *
 * If vfs_mkdir() was called then the value returned from that function
 * should be given for @child rather than the original dentry, as vfs_mkdir()
 * may have provided a new dentry.
 *
 * If vfs_mkdir() was not called, then @child will be a valid dentry.
 */
static inline void end_creating(struct dentry *child)
{
	end_dirop(child);
}

/* end_creating_keep - finish action started with start_creating() and return result
 * @child: dentry returned by start_creating() or vfs_mkdir()
 *
 * Unlock and return the child. This can be called after
 * start_creating() whether that function succeeded or not,
 * but it is not needed on failure.
 *
 * If vfs_mkdir() was called then the value returned from that function
 * should be given for @child rather than the original dentry, as vfs_mkdir()
 * may have provided a new dentry.
 *
 * Returns: @child, which may be a dentry or an error.
 */
static inline struct dentry *end_creating_keep(struct dentry *child)
{
	if (!IS_ERR(child))
		dget(child);
	end_dirop(child);
	return child;
}

/**
 * end_removing - finish action started with start_removing
 * @child: dentry returned by start_removing()
 *
 * Unlock and release the child.
 *
 * This is identical to end_dirop(). It can be passed the result of
 * start_removing() whether that was successful or not, but it is not
 * needed if start_removing() failed.
 */
static inline void end_removing(struct dentry *child)
{
	end_dirop(child);
}

extern int follow_down_one(struct path *);
extern int follow_down(struct path *path, unsigned int flags);
extern int follow_up(struct path *);

int start_renaming(struct renamedata *rd, int lookup_flags,
		   struct qstr *old_last, struct qstr *new_last);
int start_renaming_dentry(struct renamedata *rd, int lookup_flags,
			  struct dentry *old_dentry, struct qstr *new_last);
int start_renaming_two_dentries(struct renamedata *rd,
				struct dentry *old_dentry, struct dentry *new_dentry);
void end_renaming(struct renamedata *rd);

/**
 * mode_strip_umask - handle vfs umask stripping
 * @dir:	parent directory of the new inode
 * @mode:	mode of the new inode to be created in @dir
 *
 * In most filesystems, umask stripping depends on whether or not the
 * filesystem supports POSIX ACLs. If the filesystem doesn't support them,
 * umask stripping is done directly in here. If the filesystem does support
 * POSIX ACLs, umask stripping is deferred until the filesystem calls
 * posix_acl_create().
 *
 * Some filesystems (like NFSv4) also want to avoid umask stripping by the
 * VFS, but don't support POSIX ACLs. Those filesystems can set SB_I_NOUMASK
 * to get this effect without declaring that they support POSIX ACLs.
 *
 * Returns: mode
 */
static inline umode_t __must_check mode_strip_umask(const struct inode *dir, umode_t mode)
{
	if (!IS_POSIXACL(dir) && !(dir->i_sb->s_iflags & SB_I_NOUMASK))
		mode &= ~current_umask();
	return mode;
}

extern int __must_check nd_jump_link(const struct path *path);

static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
{
	((char *) name)[min(len, maxlen)] = '\0';
}

/**
 * retry_estale - determine whether the caller should retry an operation
 * @error: the error that would currently be returned
 * @flags: flags being used for next lookup attempt
 *
 * Check to see if the error code was -ESTALE, and then determine whether
 * to retry the call based on whether "flags" already has LOOKUP_REVAL set.
 *
 * Returns true if the caller should try the operation again.
 */
static inline bool
retry_estale(const long error, const unsigned int flags)
{
	return unlikely(error == -ESTALE && !(flags & LOOKUP_REVAL));
}

#endif /* _LINUX_NAMEI_H */
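
As an illustration of how a caller might use retry_estale(), the userspace sketch below retries a lookup once with LOOKUP_REVAL added. It is not part of the header: retry_estale() and LOOKUP_REVAL are re-declared locally (without unlikely() and BIT()) so the example compiles on its own, and `fake_lookup()`, `lookup_with_retry()` and `attempts` are hypothetical names invented for the demonstration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define LOOKUP_REVAL (1u << 7) /* same bit position as BIT(7) in the header */

/* Local copy of the helper, minus the kernel-only unlikely() annotation. */
static bool retry_estale(const long error, const unsigned int flags)
{
	return error == -ESTALE && !(flags & LOOKUP_REVAL);
}

static int attempts; /* counts calls to the hypothetical operation */

/* Hypothetical lookup: reports -ESTALE until the caller asks to revalidate. */
static long fake_lookup(unsigned int flags)
{
	attempts++;
	return (flags & LOOKUP_REVAL) ? 0 : -ESTALE;
}

/* Run the operation, retrying at most once with LOOKUP_REVAL set. */
static long lookup_with_retry(unsigned int flags)
{
	long error;

retry:
	error = fake_lookup(flags);
	if (retry_estale(error, flags)) {
		flags |= LOOKUP_REVAL;
		goto retry;
	}
	return error;
}
```

Because retry_estale() returns false once LOOKUP_REVAL is set, the loop cannot run more than twice, even if the second attempt also fails with -ESTALE.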