Waiting for many things at once with io_uring

122 points by ashvardanian 10 days ago

jchw 6 days ago

io_uring and Linux's many different types of file descriptors are great. I mean, I personally think that the explicit large API surface of WinNT is kinda nicer than jamming a bunch of weird functionality into files and file descriptors like Linux, but when things work, they do show some nice advantages of unifying everything to some framework, ill-fitting as it may sometimes be (Though now that I say this, it's not like WinNT Objects are really any different here, they just offer more advanced baseline functionality like ACLs). io_uring and its ability to tie together a lot of pre-existing things in new ways is pretty cool. UNIX never really had a story for async operations, something I will not fault an OS designed 50 years ago for. However, still not having a decent story for async operations today is harder to excuse. I've been excited to learn about io_uring. I've learned a lot listening to conference talk recordings about it. While it has its issues (like the many times it (semi-?)accidentally bypassed security subsystems...) it has some really cool and substantial benefits.

I'll tell you what I would love to see next: a successor to inotify that does not involve opening one zillion file descriptors to watch a recursive subtree. I'm sure there are valid reasons why it's not easy to just make it happen, but it feels like it will be a major improvement in a lot of use cases. And in many cases, it would probably fix the dreaded problem of users needing to fight against ulimits, especially in text editors like VSCode.
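
To make the cost concrete, here is a minimal sketch of recursive watching with inotify, calling libc through ctypes because Python's standard library has no inotify binding. It assumes Linux; the helper names (`watch_tree`, `read_events`, `demo`) are made up for illustration. Note that the per-directory cost here is one watch on a single inotify descriptor, and the limit usually hit is the `fs.inotify.max_user_watches` sysctl. The sketch also does not handle directories created after the initial walk, which a real watcher would have to add itself.

```python
import ctypes
import os
import select
import struct
import tempfile

IN_CREATE = 0x00000100  # from <sys/inotify.h>

# dlopen(NULL): resolve inotify_* from the libc already loaded into the process.
libc = ctypes.CDLL(None, use_errno=True)
libc.inotify_init1.argtypes = [ctypes.c_int]
libc.inotify_init1.restype = ctypes.c_int
libc.inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]
libc.inotify_add_watch.restype = ctypes.c_int


def watch_tree(inotify_fd, root, mask=IN_CREATE):
    """Add one watch per directory under root; return {watch descriptor: path}."""
    watches = {}
    for dirpath, _dirnames, _filenames in os.walk(root):
        wd = libc.inotify_add_watch(inotify_fd, os.fsencode(dirpath), mask)
        if wd < 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), dirpath)
        watches[wd] = dirpath
    return watches


def read_events(inotify_fd, watches, timeout=1.0):
    """Return [(path, mask)] for queued events, or [] if none arrive in time."""
    ready, _, _ = select.select([inotify_fd], [], [], timeout)
    if not ready:
        return []
    data = os.read(inotify_fd, 4096)
    events, offset = [], 0
    while offset < len(data):
        # struct inotify_event: int wd; uint32 mask, cookie, len; char name[len]
        wd, mask, _cookie, name_len = struct.unpack_from("iIII", data, offset)
        offset += 16
        name = data[offset:offset + name_len].rstrip(b"\0")
        offset += name_len
        events.append((os.path.join(watches.get(wd, "?"), os.fsdecode(name)), mask))
    return events


def demo():
    """Watch a small temporary tree, create one file, report what was seen."""
    fd = libc.inotify_init1(os.O_CLOEXEC)  # IN_CLOEXEC has the same value as O_CLOEXEC
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        with tempfile.TemporaryDirectory() as root:
            os.makedirs(os.path.join(root, "a", "b"))
            os.makedirs(os.path.join(root, "c"))
            watches = watch_tree(fd, root)  # root, a, a/b, c
            open(os.path.join(root, "a", "b", "new.txt"), "w").close()
            events = read_events(fd, watches)
    finally:
        os.close(fd)
    return len(watches), events
```

Calling `demo()` returns the number of watches that had to be registered together with the events read back; on a real source tree the watch count grows with the number of directories.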

I don't have anything of great substance to say about the actual subject of the article. It feels a bit late to finally get this functionality proper in Linux after NT had it basically forever, but any improvement is welcome. Next time I'm doing something where I want to wait on a bunch of FDs I will have to try this approach.

PhilipRoman 5 days ago

Regarding your point about inotify: it's hard to put this in words, but I feel like there is a need to be able to manipulate file trees more explicitly. Right now the directory abstraction is mostly just a list of lists, and it works fine, but it has problems w.r.t. atomicity and sometimes performance.
iknowstuff 6 days ago

fanotify?
https://man7.org/linux/man-pages/man7/fanotify.7.html
- jchw 6 days ago
  
  It feels like last time I looked into this, fanotify was for some reason not suitable for most inotify use cases. Maybe this has changed. Would be great news if so.
  
  claudex 5 days ago
  
  Depends on when you last looked at it:
  > In the original fanotify API, only a limited set of events was supported. In particular, there was no support for create, delete, and move events. The support for those events was added in Linux 5.1. (See inotify(7) for details of an API that did notify those events pre Linux 5.1.)
hansvm 6 days ago

> inotify
A hack that should be performant enough if properly implemented would be a custom FUSE implementation over the directory. As a one-off it could just do the callbacks you want done, or as a reusable component it could implement the inotify behavior you want.
trws 6 days ago

An inotify replacement that can work at whole FS level (and doesn’t require root/admin like the existing option) would be amazing. To be honest, I don’t see a reason it would be hard at the whole filesystem or perhaps mount level unless there are security ramifications. Restricting it to a subdirectory might be tricky though.
- o11c 5 days ago
  
  The problem with notifications is always "deadlocks".
  Even without deadlocks it's common to accidentally create a loop that quickly starts using 100% CPU until you kill it. This isn't theoretical, e.g. I ran across it after encountering a program that used `bash -l` to run scripts (fundamentally wrong, but not the bug here), when my `~/.bash_login` happened to touch the part of the filesystem it was watching.
  
  throwaway127482 5 days ago
  
  The problems you're describing sound rather niche - I can see how a deadlock or a 100% CPU issue could happen, but the biggest problem in my experience is the resource usage associated with the file watches, at least on Linux.
pjmlp 5 days ago

Some UNIXes did, but they could never agree on something that would work across the whole ecosystem.

KerrAvon 6 days ago

Wikipedia:

> In June 2023, Google's security team reported that 60% of the exploits submitted to their bug bounty program in 2022 were exploits of the Linux kernel's io_uring vulnerabilities. As a result, io_uring was disabled for apps in Android, and disabled entirely in ChromeOS as well as Google servers.[11] Docker also consequently disabled io_uring from their default seccomp profile.[12]

Root privilege CVE from earlier this year (2024): https://nvd.nist.gov/vuln/detail/CVE-2024-0582

refulgentis 6 days ago

It took me many io_uring hello world articles to find out it's not really used in production (e.g. Android and ChromeOS both disable it) because it was, and continues to be, the source of an absolutely bonkers, outsized number of security issues.

I don't remember much more than that*, but just dropping it here because I learned a ton more from reading about that, than my Nth io_uring article.

* for example, the article mentioning relevant buffers are shared with the system made me want to say "aHA, yes, that's what the security articles said was a core issue!" -- but I can't actually remember with 100% confidence

loeg 6 days ago

Well, it's not true that it isn't used in production. Google has been burned and at least historically did not use it. But I know some services at Facebook use it in production.
Yes, historically it was a big source of security bugs. I think that has tapered off somewhat as the rate of change slows down.
- junon 5 days ago
  
  Jens Axboe, io_uring creator, works at Facebook if memory serves, so I'd imagine that's why it's used in prod at Facebook.
  
  loeg 5 days ago
  
  He does, and these things are somewhat related, though it's not like services are compelled to use io_uring on his behalf.
MathMonkeyMan 6 days ago

Someone linked to a kernel mailing list recently, I don't know if it was in a submission or in a comment.
The security issue with io_uring, as I understand it, is that it bypasses a lot of Linux's security auditing mechanisms. The problem is that, like with ioctl, if the kernel called out to a security subsystem with "here's something that the user wants to do with this file," the security subsystem would have to know what "something" means for every driver. Impossible; so do you allow most things? Deny them? If you choose the former, now there are gaping security holes. If you choose the latter, then enabling security will break too many things.
- vacuity 5 days ago
  
  We should have something like capability-based security, since object capabilities are amenable to more expressive interfaces than "read bytes" and "write bytes". Capability-based security also favors minimizing privilege by default instead of providing too much privilege and restricting/auditing later.

hosh 6 days ago

Discussion thread in the Erlang community proposing implementing io_uring for BEAM, security issues, and a digression comparing it to FreeBSD's kqueues

https://erlangforums.com/t/erlang-io-uring-support/765/18?pa...

jeffbee 6 days ago

Some of the things that you cannot wait on using io_uring are your kernel actually supporting the feature mentioned in the article, io_uring actually working properly, and io_uring solving its seemingly bottomless supply of local user exploits. In the early days of this feature I was bullish but the way its implementation has emitted CVEs has not been a source of joy, and now many major Linux operators have banned the API internally. Maybe what is needed is a moment of reflection and a scratch reimplementation that learns the lessons of io_uring?

loeg 6 days ago

A new from-scratch implementation would suffer from a similar problem as early io_uring did (high rate of code change, which seems to be what drives security bug rates).
- KerrAvon 6 days ago
  
  Isn't something fundamentally broken with either the kernel or the adoption process if that's true? It seems like you should be able to do fast async I/O without the kind of privilege escalation vulnerabilities that are still happening.

ashvardanian 6 days ago

Surprisingly, I only came across Francesco's blog this month. I stumbled upon the 2021 post "Speeding up atan2f by 50x" while searching for others who have to reimplement trigonometry in SIMD every other year. I've also enjoyed "Beating the L1 cache with value speculation" from the same year, as well as the 2013 Agda sorting example.

Highly recommend checking it out: https://mazzo.li/archive.html

Const-me 5 days ago

When I needed something similar to that for older Linux kernels, I have used primitives based on file descriptors (eventfd for manually reset events, pidfd_open to wait for completion of processes, mq_open for sending messages), and poll() to wait for multiple things with one system call.

4hg4ufxhy 6 days ago

Very interesting, but it's unfortunate there is no example program. I guess that is left as an exercise for the reader, but it's a bit daunting for a non-systems programmer.

tux1968 6 days ago

For C there is:
https://git.kernel.dk/cgit/liburing/tree/examples
There's also a minimal Rust example for tokio:
https://github.com/tokio-rs/io-uring/tree/master/examples

User23 6 days ago

It's a shame io_uring is proving to be such a disappointment. It's been over two decades now that Linux has been trying to catch up with the NT Kernel's IO Completion Ports and we're still not there.

On the plus side, this submission somehow reminded me about ACE[1], which is where I first came across the Proactor[2]/Reactor distinction. Good times!

[1] https://www.dre.vanderbilt.edu/~schmidt/ACE.html

[2] https://www.dre.vanderbilt.edu/~schmidt/PDF/Proactor.pdf

tveita 4 days ago

For all people say things like "Linux has been trying to catch up with the NT Kernel's IO Completion Ports", is there anywhere that translates to Windows being faster than Linux at disk or networking?
Or is it just a matter of conceptual elegance or programmer convenience?
- loeg 4 days ago
  
  Yes, NT had legitimately better async disk IO with completion ports prior to io_uring. Some things are much worse on Windows (anything to do with file enumeration / directories) but once you've got the file open, NT async IO is high performance. Prior to io_uring, Linux applications could only approximate asynchronous disk IO via a threadpool, and those threadpools could not integrate with fd-based event loops well.
  I can't find the comments now, but HN users have written at length about NT IO performance in the past.
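
For readers who have not seen the threadpool approximation, a minimal sketch follows. Blocking reads run in worker threads, and completion reaches the fd-based event loop only indirectly, through a wake-up pipe. All names here (`read_files_via_threadpool`, `demo`) are illustrative, not from any particular library, and error handling is deliberately thin: a failed read would just leave the loop waiting until the timeout.

```python
import os
import queue
import selectors
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_files_via_threadpool(paths, timeout=5.0):
    """Read files in worker threads; learn about completions via a pipe."""
    wake_r, wake_w = os.pipe()
    done = queue.SimpleQueue()

    def blocking_read(path):
        with open(path, "rb") as f:          # ordinary blocking disk read
            return path, f.read()

    def on_done(future):
        done.put(future.result())            # hand the result over first...
        os.write(wake_w, b"\0")              # ...then wake the event loop

    results = {}
    pending = len(paths)
    sel = selectors.DefaultSelector()
    sel.register(wake_r, selectors.EVENT_READ)
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            for path in paths:
                pool.submit(blocking_read, path).add_done_callback(on_done)
            while pending:
                if not sel.select(timeout):
                    break                    # timed out
                # One byte was written per completed read.
                for _ in os.read(wake_r, 4096):
                    path, data = done.get_nowait()
                    results[path] = data
                    pending -= 1
    finally:
        sel.close()
        os.close(wake_r)
        os.close(wake_w)
    return results


def demo():
    """Write three small files, read them back, return their sizes in order."""
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(3):
            p = os.path.join(d, f"f{i}")
            with open(p, "wb") as f:
                f.write(b"x" * (i + 1))
            paths.append(p)
        results = read_files_via_threadpool(paths)
        return [len(results[p]) for p in paths]
```

The extra pipe, queue, and thread hand-off are the integration overhead being described: the event loop never waits on the disk IO itself, only on a side channel that says some worker has finished.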
loeg 5 days ago

You're the only person I've ever heard call io_uring "such a disappointment." It's hard to even tell what you mean by that without further detail.
klooney 5 days ago

Windows has adopted io_uring, it can't be that bad.
