quotemstr 1 days ago [-]
Linux is unusual among OS kernels in that direct system calls from arbitrary userspace code are supported and ABI-stable. This model has always been a terrible idea. It robs the system of the ability to intercept system calls in userspace before doing an expensive privilege-mode transition.
If, instead, as on OpenBSD, the kernel enforced the rule that all system calls had to go through libc (or perhaps a big ntdll.dll-like VDSO), then the whole problem the linked article tries in vain to solve would disappear. If you wanted to hook a system call, you'd just change the libc/VDSO dispatch. No need to rewrite any instructions.
If I were Linus, I'd make a new rule: starting today, all new system calls must go through VDSO. No exceptions. SYSCALL from anywhere else? SIGKILL.
This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.
yjftsjthsd-h 1 days ago [-]
> This model has always been a terrible idea. It robs the system of an ability to intercept system calls in userspace before doing an expensive privilege-mode transition.
This model has always been a trade-off. It has downsides, but it also has upsides, including an immense boost in flexibility; decoupling from any particular userspace is useful.
> This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.
Can you LD_PRELOAD in front of the vDSO? I was under the (possibly mistaken) impression that the kernel injects it directly.
matheusmoreira 20 hours ago [-]
The vDSO is just a normal ELF shared object that Linux maps somewhere in the address space of the process. The kernel passes a pointer to the ELF header to the process via the auxiliary vector.
That's the end of Linux's involvement. It's up to the program itself to do something useful with that pointer, namely by parsing the ELF header, and then resolving its symbols to function pointer addresses.
There's no doubt that all the various libc implementations out there do this, but I don't know if they do it in a way that lets LD_PRELOAD override the vDSO. They could be hard linking the vDSO system calls into their system call stubs or something.
Usually programs intercept system calls by overriding the libc stubs, which also indirectly intercepts the vDSO. However, it's not actually a requirement that the system be structured like this. Theoretically, the program could do anything. System calls can be done directly, without any stubs. Compilers could just generate the code directly without any functions at all.
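A minimal sketch of the first step described above, locating the vDSO through the auxiliary vector. It assumes a libc that provides `getauxval` (glibc and musl both do); the helper names `vdso_base` and `vdso_looks_like_elf` are made up for illustration. Resolving symbols from the image would still require walking its dynamic section, which is not shown.

```c
#include <elf.h>       /* ELFMAG, SELFMAG */
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>  /* getauxval, AT_SYSINFO_EHDR */

/* Hypothetical helper: address of the vDSO's ELF header as reported by
 * the kernel in the auxiliary vector, or NULL if no vDSO was mapped. */
static const unsigned char *vdso_base(void)
{
    return (const unsigned char *)(uintptr_t)getauxval(AT_SYSINFO_EHDR);
}

/* Hypothetical helper: 1 if the mapping starts with the ELF magic bytes,
 * i.e. it really is an ordinary ELF image sitting in our address space. */
static int vdso_looks_like_elf(void)
{
    const unsigned char *base = vdso_base();
    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

Nothing here goes through ld.so's symbol search, which is why whether LD_PRELOAD can get in front of the vDSO depends entirely on how a given libc consumes this pointer.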
mananaysiempre 1 days ago [-]
> Can you LD_PRELOAD in front of the vDSO? I was under the (possibly mistaken) impression that the kernel injects it directly.
The kernel puts the vDSO in memory and tells ld.so where it is, but where if anywhere ld.so will put it in the search order it implements is its own concern. (TBH I don’t actually know whether ld.so will actually allow LD_PRELOAD to override the vDSO, but there’s no reason for it not to, except I guess for the syscalls that are needed to perform the dynamic linking itself.)
razighter777 24 hours ago [-]
Direct system calls are an amazing idea. The ntdll and BSD models are worse. The whole libc becomes a security boundary without the protection of kernel space. So much Windows malware and process tampering happens because now you have a library (ntdll) fully in userspace that is given special privileges, which now becomes a huge attack surface. Then you have to deal with breakage between the built-in libc versions and the kernel.
This syscall overhead isn't as much as you suppose it is; for workloads where the syscall overhead actually makes a difference there are robust low-syscall paths for io/latency sensitive operations with DPDK, io_uring, and futex being a few examples.
And there are robust performant methods on linux for syscall interception/tracing, see seccomp unotify, bpf tracepoints, ftrace.
eqvinox 10 hours ago [-]
Your argument about libc/ntdll having "special privileges" is a bit weird in that the alternate option is everything having those privileges. The ntdll tampering doesn't exist on Linux because it's not necessary. It's not better due to this.
matheusmoreira 4 hours ago [-]
Yeah. On Linux it's just an optimization. What user space really wanted was a way to memory map some kernel data into the process address space in order to avoid switching to kernel mode while accessing it. Instead Linux memory mapped an entire ELF whose only purpose is to wrap the data. Newer system calls like io_uring are doing it right.
eqvinox 2 hours ago [-]
Strongly disagree that providing the vDSO in ELF file format is somehow harmful or inefficient. You'll need a compatibility mechanism in any case since the exposed features will change over time, and doing that through normal symbol resolution avoids a whole bunch of extra effort. And after ld.so is done with relocations on executable startup, it makes no difference in performance either.
Look at the Linux architectures that have a vDSO in non-ELF format. It's seriously ugly.
(I don't think the comparison with io_uring is valid either, very different kind of API.)
quotemstr 2 hours ago [-]
Well, no more harmful or inefficient than ELF itself. :-) I really wish we'd ended up with PE or something with a two-level namespace.
And yeah, nothing wrong with using ELF for the vDSO. People have strange intuitions about what's expensive and what's cheap.
matheusmoreira 21 hours ago [-]
> This model has always been a terrible idea.
I disagree. It's an amazing idea. It allows me to write freestanding programs without any C libraries. It allows compilers to have Linux system call builtins that directly generate the calling convention. I created an entire lisp interpreter with nothing but Linux system calls, completely freestanding.
As a kernel, Linux is completely independent from its user space. The instruction set is the correct abstraction for the system call entry point. There should be no "required C libraries". User space should be free to reinvent everything in Rust if it wants.
There are various kernel mechanisms for system call interception if that's what you want. Tools like strace work just fine on my lisp interpreter, so libc is clearly not needed.
LD_PRELOAD is a feature of the dynamic linker (ld.so). The dynamic linker is the exact sort of user space component that's supposed to be completely replaceable. None of this is any of Linux's business.
Use of the vDSO is not even mandatory. All system calls in the vDSO are also available via the kernel entry point. The vDSO is just an optimization for frequently called system calls like gettimeofday. Forcing all programs to use the vDSO would force them all to not only implement the ELF spec but also to implement a small ELF linker. This is a significant blow if you want to create minimal freestanding Linux programs.
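As a rough illustration of the "not mandatory" point, the same clock can be read through libc (which typically routes to the vDSO) or by forcing a real kernel entry. This is a sketch assuming a 64-bit Linux target, where `SYS_clock_gettime` takes the same `struct timespec` as the libc wrapper; the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_clock_gettime */
#include <time.h>         /* clock_gettime, struct timespec */
#include <unistd.h>       /* syscall */

/* Reads CLOCK_MONOTONIC through libc; on most setups this is served
 * by the vDSO without entering the kernel. Returns 0 on success. */
static int now_via_libc(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Reads the same clock through the generic syscall() wrapper, which
 * always performs a real kernel entry. Returns 0 on success. */
static int now_via_kernel(struct timespec *ts)
{
    return (int)syscall(SYS_clock_gettime, CLOCK_MONOTONIC, ts);
}
```

Both paths return the same kind of answer; the vDSO path is only expected to be cheaper.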
Joker_vD 11 hours ago [-]
> It allows me to write freestanding programs without any C libraries.
KERNEL32.dll is not a C library (for one, its exported functions don't even use any of the default C calling conventions on x86).
> I created an entire lisp interpreter with nothing but Linux system calls, completely freestanding.
"Freestanding", as in "standing on top of an OS but nothing else"? Then using the OS-provided shared object that is the documented interface between the userspace and the kernel doesn't violate your free stand.
I mean, I too had written small interpreters that had only LoadLibraryW/GetProcAddress from kernel32.dll as their imports and nothing else.
> The instruction set is the correct abstraction for the system call entry point.
Why? A function call seems a much more appropriate abstraction for the system call entry point.
> There should be no "required C libraries".
There is no required C library on Windows, yet it doesn't use direct system calls.
> Forcing all programs to use the vDSO would force them all to not only implement the ELF spec but also to implement a small ELF linker.
Not really. Neither Windows nor UEFI require you to reimplement any linking functionality. The OS can simply give your program a pointer to a table of function pointers at your entry point... which it already can do, see the aux vector on Linux.
matheusmoreira 4 hours ago [-]
> "Freestanding", as in "standing on top of an OS but nothing else"?
Freestanding as in freestanding C.
> Then using the OS-provided shared object that is the documented interface between the userspace and the kernel doesn't violate your free stand.
Correct. I'm just saying it shouldn't be required.
> I mean, I too had written small interpreters that had only LoadLibraryW/GetProcAddress from kernel32.dll as their imports and nothing else.
And where are LoadLibraryW and GetProcAddress coming from? What if you had to implement those functions yourself?
> A function call seems a much more appropriate abstraction for the system call entry point.
The system call entry point is essentially its own calling convention. It pretty much is a function call. The function just happens to be identified by a stable number rather than function address.
> The OS can simply give your program a pointer to a table of function pointers at your entry point
It's not "a table of function pointers", it's a complete ELF object which you have to parse and resolve symbols from. That's a lot more work than putting the system call number and arguments in specific registers, executing one instruction and retrieving the return value from a specific register.
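For a sense of scale, here is a sketch of the register-based entry described above, assuming x86-64 Linux and GCC/Clang inline assembly. `raw_syscall3` is a made-up name; on other architectures the sketch falls back to libc's `syscall()`, which reports errors through `errno` instead of a negative return value.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_* numbers */
#include <unistd.h>       /* syscall (fallback only) */

/* Hypothetical helper: issue a system call with up to three arguments.
 * x86-64 convention: number in rax, arguments in rdi, rsi, rdx;
 * the kernel clobbers rcx and r11 and returns the result in rax. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;  /* negative values are -errno */
#else
    return syscall(nr, a1, a2, a3);  /* assumption: libc available here */
#endif
}
```

A call such as `raw_syscall3(SYS_getpid, 0, 0, 0)` involves no ELF parsing and no symbol lookup.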
Joker_vD 2 hours ago [-]
> I'm just saying it shouldn't be required.
I'm still not entirely sure why you want to use instructions from the privileged subset of your ISA instead of plain old "call fun_addr".
> And where are LoadLibraryW and GetProcAddress coming from?
They're provided by the OS. Their addresses are patched into your executable's image during the loading. It's a very ancient technology, one of the very first software technologies invented, in fact — predates FORTRAN.
> What if you had to implement those functions yourself?
What if you had to implement exec(2) yourself? As a matter of fact, why is exec even provided as a syscall? Almost all of it (except for locking the text segment IIRC) can be done in the user space, including the parsing of the program headers and relocating stuff. Which, again, I've done once and I appreciate the OS giving it to me already implemented.
> The function just happens to be identified by a stable number rather than function address.
Or you can identify it as a stable offset into a large table of function addresses; or even as a stable character string!
> it's a complete ELF object which you have to parse and resolve symbols from.
You don't have to parse it. And the UEFI environment in fact does give your program's entry point a table of function pointers: you put the arguments into the registers, take an offset into this table of functions, load the address, and call it, with one instruction, "call"/"branch-and-link", and it will give you the return value in a specific register. No need to parse anything by yourself.
I personally think this kind of dependency injection is pretty neat; you can intercept your own syscalls by passing a pointer to a modified table down your call stack. Trapping the "sysenter" instruction in userspace is way harder.
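The table-passing idea can be sketched in plain C. This is an illustration only, not the UEFI interface: `struct sys_table`, `counting_write`, and `say_hello` are hypothetical names, and the "real" entry simply forwards to libc's `write`.

```c
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* write */

/* Hypothetical service table handed to the program at its entry point. */
struct sys_table {
    ssize_t (*write)(int fd, const void *buf, size_t len);
};

/* Default entry: forwards to the real OS service. */
static ssize_t real_write(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len);
}

static const struct sys_table real_table = { real_write };

/* Interceptor: records the byte count and reports success without
 * ever reaching the kernel. */
static size_t intercepted_bytes;
static ssize_t counting_write(int fd, const void *buf, size_t len)
{
    (void)fd;
    (void)buf;
    intercepted_bytes += len;
    return (ssize_t)len;
}

/* Code under interception only ever calls through the table it is given. */
static ssize_t say_hello(const struct sys_table *sys)
{
    return sys->write(1, "hello\n", 6);
}
```

Passing `&real_table` gives normal behaviour; passing a table whose `write` slot points at `counting_write` intercepts the call with an ordinary indirect call and no instruction rewriting.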
throwaway7356 1 days ago [-]
> all system calls had to go through libc (or perhaps a big ntdll.dll-like
Which makes containers crap on Windows and *BSD as they have to run the correct libc or equivalent. Thus you need to build a different container per OS version which sucks compared to Linux.
Joker_vD 1 days ago [-]
Windows doesn't even have its own libc.
orangesilk 22 hours ago [-]
Windows does have three libcs, likely as a compatibility layer.
Their names are:
* <forgotten something Windows 3.1>
* msvcrt.dll, 2014
* ucrt.dll (universal c runtime, since Windows 10)
Joker_vD 20 hours ago [-]
Those are not a compatibility layer with the OS. Heck, they all barely even provide proper access to the file system, ffs! The "msvcrt.dll" in the System32 folder is an ancient leftover from Microsoft-internal version of MSVC 6.0 or so, not intended for 3rd-party consumption.
At some point Microsoft got tired of maintaining binary-incompatible versions of its C runtime for different Visual Studios, so they started shipping UCRT with Windows itself... but you still don't need to touch that garbage for anything whatsoever.
yjftsjthsd-h 1 days ago [-]
They said "or equivalent", so ntdll
quotemstr 19 hours ago [-]
In Windows, the last-userspace-before-kernel-mode layer is called ntdll.dll. Unlike msvcrt or any other libc, ntdll is universal and loaded into every process.
quotemstr 19 hours ago [-]
You understand that your container is using the VDSO today, right? A UAPI requirement to issue system calls through it wouldn't hurt your deployment story at all.
But sure, keep using SYSCALL, THE DEPENDENCY MUTILATOR. It's got what containers crave!
freestanding 1 days ago [-]
that's why OpenBSD is inconvenient for development: it binds you to libc bloatware
razighter777 24 hours ago [-]
yep, and it forces every application to deal with the C FFI. It's beautiful in Linux that I can access the full kernel API from an int 0x80/syscall instruction + a few register loads without having to link against crap. I can write a simple cat utility in a dozen or so lines of assembly.
freestanding 21 hours ago [-]
FFI is a different term. i called LIBC bloatware because it comes with a lot of stuff that is not needed and things that are not appropriate for the system API layer, like a memory allocator, string primitives etc. it also has old-style naming, like_this_one_supposed_to_be_nice or whtabthis1?
Windows's NTDLL (at least early versions) naming is much better and the layer is much thinner, the problem is that it is "undocumented". also its rigid portability, while libc binding makes NIX software non-portable. NT also has syscalls through the interrupt btw.
matheusmoreira 20 hours ago [-]
You might enjoy my work on the lone lisp language. I got rid of the libc and implemented an entire interpreter with nothing but Linux system calls. Been working on it and blogging about it for about 3 years now.
ye, i see.. strange choice though "lisp". i would organize memory allocator and syscalls into a separate library, like here https://codeberg.org/determin1st/sm-c-base so you kinda have libc-substitute. also your memory allocator is kinda primitive for the runtime, better to steal that one, its heap-based with groups. syscalls look almost the same, not lone enough, kek.
PunchyHamster 22 hours ago [-]
The number of times we ran LD_PRELOAD in prod was vanishingly small and limited to debugging, so the OpenBSD solution seems to be just a waste of CPU cycles
Gualdrapo 1 days ago [-]
> If I were Linus, I'd make a new rule
Or, you know, just propose your idea to him
yjftsjthsd-h 1 days ago [-]
Based on https://www.phoronix.com/news/Linus-Torvalds-No-Random-vDSO , I had been under the impression that he wasn't fond of adding more use of vDSO. On rereading, I can't tell if that's a vDSO thing or a preference against fast randomness being provided by the kernel.
I've written a sort of manifesto around this:
https://www.matheusmoreira.com/articles/linux-system-calls
> If I were Linus
Good thing you aren't.
http://github.com/lone-lang/lone/