NVIDIA does not want CUDA development (e.g. flash attention) to move to Triton, because Triton also supports AMD, and if the ecosystem moves from pure CUDA to Triton, that's bad for NVIDIA's lock-in. That's why there is so much focus on CUDA Python (lower level) and Tilus (higher level, more similar to Triton).
What’s bad for Nvidia is good for everyone else. The CUDA lock-in needs to die.
It is on AMD and Intel to deliver.
And they continue to fumble with it. AMD has had time to catch up--a decade in fact. They simply don’t understand: robust software support requires a significant investment from their side. Simply providing small amounts of funding for academic research doesn’t suffice.
Meanwhile Nvidia keeps building more and more libraries..
It’s not AMD, it’s their board. Unless the board approves billions of dollars in stock rewards to motivate good engineers, nobody is going to join.
It’s not rocket science. They can identify many key personnel at Nvidia and make them offers which would be significantly better for them. Cycle three years and repeat. Two or three cycles and you will have replicated the most important parts.
It wasn't me that missed the deadline, it was my brain.
Or one of the cloud providers who doesn't want to pay lock-in prices when they'd rather pay commodity prices
Not sure cloud providers will care; all the costs get passed on to the customers. There's already far more demand for GPUs than can be met by the supply chain, too.
If they were sitting on excess stock, or struggling to sell, sure.
The cloud providers all have their own Nvidia alternatives. Having worked with more than one, I would rate them not much better than AMD when it comes to software.
How feasible is this for a cloud operation? I imagine this work requires close collaboration with the architects and proprietary knowledge about the design.
It seems feasible; it's more a matter of how much of a priority it is.
I follow Google most closely. They design and manufacture their own accelerators. AWS I know manufactures its own CPUs, but I don't know if they're working on or already have an AI accelerator.
Several of the big players are working on OpenXLA, which is designed to abstract and commoditize the GPU layer: https://openxla.org/xla
OpenXLA mentions:
> Alibaba, Amazon Web Services, AMD, Apple, Arm, Google, Intel, Meta, and NVIDIA
> AWS I know manufactures its own CPUs, but I don't know if they're working on or already have an AI accelerator
I believe those are the Inferentia: https://aws.amazon.com/ai/machine-learning/inferentia/
> AWS Inferentia chips are designed by AWS to deliver high performance at the lowest cost in Amazon EC2 for your deep learning (DL) and generative AI inference applications
but I don't know this second if they're supported by the major frameworks, or what
I also didn't recall https://aws.amazon.com/ai/machine-learning/trainium/ until I was looking up that page, so it seems they're trying to have a competitor to the TPUs just naming them dumb, because AWS
> AWS Trainium chips are a family of AI chips purpose built by AWS for AI training and inference to deliver high performance while reducing costs.
thanks this is useful!
> have a competitor to the TPUs just naming them dumb, because AWS
I kind of like "trainium" although "inferentia" I could take or leave. At least it's nice that the names tell you the intended use case.
With what software though?
It's my guess that they don't want to. A decade ago when AMD's CPUs had trouble competing at the high end, they ceded that market segment almost entirely, and now they're doing the same with nVidia. And Intel is basically a dead company. Neither of them are going to risk the capital and internal shake-up necessary to actually compete with nVidia. And anyways, there's only so much TSMC capacity for high-end chips, and Apple and nVidia have already spent infinity dollars reserving most of it.
Ummm, are you forgetting EPYC and Ryzen are now unmatched?
If AMD wants to, they can compete.
Unmatched by Intel, who have been failing for a decade, so there was no competition? They mostly won by default. And Apple's chips are giving them a run for their money. If Apple sold plain CPUs that weren't locked to their software (they never will, but hypothetically) then AMD would let themselves slide into 2nd place again.
That really makes three companies that are happy to concede to nVidia, because Apple could definitely challenge nVidia if they wanted to.
Note: I'm not saying that AMD sucks, just that their corporate culture prevents them from being very ambitious.
Apple's chips don't even come close. Their benchmarks compete in specific tasks and then measure by metrics like performance per watt. These are benchmarks AMD's CPUs aren't optimizing for, and yet they're still close. Once you remove power consumption from the tests and broaden the tests, AMD's CPUs come out ahead. Apple had something impressive with the M1; then within a year the other mobile CPU manufacturers came out with something on par. A year after that they had surpassed Apple.
Apple's closest CPU competition is Qualcomm, and they don't win that.
What's crazy to me is that ROCm and SYCL are open-source, but somehow more difficult to install and support less hardware for their respective brands than CUDA does for Nvidia...
After looking up Triton this seems quite different - Triton is a high-level CUDA competitor with Python-like syntax, and this seems to be a library aimed at generating GPU assembly for micro-optimizing kernels.
My experience searching for "nvidia triton" coughed up an oppressive number of results for the similarly named inference server, but I think the Triton discussed here is https://github.com/triton-lang/triton (although the commits that I spot-checked were all from @openai.com emails, not NVIDIA).
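For anyone who hasn't used it, this is roughly what the "Python-like syntax, tile-level" part looks like; a minimal sketch along the lines of Triton's vector-add tutorial, with made-up names like add_kernel, not code from either repo:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each "program" owns a whole tile of BLOCK_SIZE elements,
        # not a single lane the way a CUDA thread would.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # explicit mask for the ragged last tile
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)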
Looks pretty similar to me[1]
[1]: https://nvidia.github.io/tilus/getting-started/tutorials/mat...
The recent interest from NVIDIA in making Python first class in the CUDA ecosystem is what makes me wonder how successful Mojo might become, if they aren't faster than NVIDIA with Python.
All the attempts to attack CUDA fail to understand why most researchers flock to it instead of enduring the pain of the competition's tooling, and they tend to focus on a single aspect of CUDA, be it C++ or something else, but never the polyglot support, the libraries, the IDE integration, the graphical debuggers, or the compiler backends for other developers to target CUDA.
What’s the relationship between Tilus and cuTile (and maybe Warp)?
A slight tangent, but I really wish Nvidia would release more details on Tile IR. Specifically on what it enables vs PTX.
Is it just about moving towards more MLIR based infra? Maybe it’s higher level and thus can enable better codegen across generations?
My impression of NVIDIA is that internally the teams are quite independent and there's not much in terms of broad strategy. So it could be these two efforts are not related.
This is accurate. A lot of projects that come out are individual teams' research projects, unless they're tied to a specific product, which is when they start working closer together and integrating further.
However, this repo was specifically part of their acquisition of CentML.
Okay, I am not a systems-level programmer, but I am currently learning C with the aim of doing some GPGPU programming using CUDA etc. What is a tile-level GPU kernel programming language, and how is it different from something like CUDA?
I know I can ask an LLM or search on Google, but I was hoping someone in the community could explain it in a way I could understand.
I'd say the main difference is that in traditional GPU languages, the thread of execution is a single lane of a warp or wave. You typically work with ~fp32-sized values, and those are mapped by the compiler to one lane of a 32-wide vector register in a wave (or 16- to 128-wide depending on the architecture). Control flow often has to be implemented through implicit masking as different threads mapped to lanes of the same vector can make different control flow decisions (that is, an if statement in the source program gets compiled to an instruction sequence that uses masking in some way - the details vary by vendor).
In tile languages, the thread of execution is an entire workgroup (or block in CUDA-speak). You typically work with large vector/matrix-sized values. The compiler decides how to distribute those values onto vector registers across waves of the workgroup. (Example: if your program has a value that is a 32x32 matrix of fp32 elements and a workgroup has 8 32-wide waves, the value will be implemented as 4 standard-sized vector registers in each wave of the workgroup.) All control flow affects the entire workgroup equally since the ToE is the entire workgroup, and so the compiler does not have to do implicit masking. Instead, tile languages usually have provisions for explicit masking using boolean vectors/matrices.
Tile languages are a new phenomenon and clearly disagree on what the exact level of abstraction should be. For example, Triton mostly hides the details of shared memory from the programmer and lets the compiler take care of software-pipelined loads, while in Tilus it looks like the programmer has to manage shared memory explicitly.
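To make the 32x32 example above concrete, the register count is just elements divided by lanes; a trivial back-of-envelope sketch, not tied to any particular compiler:

    # 32x32 fp32 tile owned by a workgroup of 8 waves x 32 lanes
    tile_elems = 32 * 32        # 1024 values held collectively by the workgroup
    lanes = 8 * 32              # 256 lanes across all waves
    print(tile_elems // lanes)  # -> 4 elements per lane, i.e. 4 vector registers per wave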
Programming with tiles means working with contiguous-ish blocks/squares of data to minimise cache misses (often the major bottleneck in GPU programming), so it's just a nod to the fact that optimisations like these are included in the language, or some such...
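Here is a rough sketch of the blocking idea in plain NumPy (tile size and function name are arbitrary, just to show the access pattern; real GPU kernels do the equivalent with shared memory and registers):

    import numpy as np

    def tiled_matmul(A, B, tile=32):
        # Walk the output in tile x tile blocks so each block of A and B
        # is reused while it is still hot in cache, instead of streaming
        # entire rows/columns for every output element.
        n = A.shape[0]
        C = np.zeros((n, n), dtype=A.dtype)
        for i in range(0, n, tile):
            for j in range(0, n, tile):
                for k in range(0, n, tile):
                    C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
        return C

    A = np.random.rand(128, 128).astype(np.float32)
    B = np.random.rand(128, 128).astype(np.float32)
    assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-2)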
It's like writing code directly for the GPU's DSP-like SIMD cores in assembly, instead of taking the CUDA model of targeting a single SIMD thread, from which the compiler figures out how to write assembly for the core itself.
Maybe this is my dad speaking through me when he got tired of answering questions and said "look it up" and pointed to our bookshelf but....
Copying and pasting your exact words above into an LLM (Gemini/ChatGPT) provided an answer arguably better than any of the human answers at the time of this post.
While that's true, it sets a dangerous precedent. LLMs get their answers from being trained on a corpus of human knowledge created by humans. What is going to happen when we all collectively tell each other to copy-paste into ChatGPT and no one is actually answering any questions?
The linked paper in the GitHub repository doesn’t contain an NVIDIA email. There is an Amazon email and a bunch of university emails.
How did this paper become an NVIDIA project?
Paper: https://arxiv.org/pdf/2504.12984
Nvidia lock-in attempt.
I'm hoping that something like MoYe.jl moves from Nvidia-only to a vendor-agnostic tile DSL.
So, SIMD but in Python and on the GPU?
Seriously!!! There is so much more Nvidia and its billionaire managers could do to improve the developer experience with CUDA/nvcc/PTX, instead of yet another barely functional, sparsely documented, rarely tested DSL.
"...made incremental improvements to CUDA" just doesn't get you promoted like "creator of GPU Kernel Programming language Tilus"
New thing to experiment with.
Great :).
But why use “~float16” for something that the authors describe in their own documentation as a float16*?
I'm a little out of the loop these days, but I thought WebGPU obviated the use of platform-specific languages (say, CUDA)?
WebGPU is an API that sits above Vulkan, DirectX, and Metal. For CUDA, ROCm, and oneAPI there is SYCL, but it's not used much and it seems mostly just Intel is invested.
Additionally, it was designed for 2015 hardware capabilities, with JavaScript in mind, in a sandboxed environment.
Some folks are reaching for WebGPU outside the browser because Vulkan is a pain to program for, and they misuse WebGPU as middleware, albeit a less capable one, due to its original design and who is driving its standardisation process.
Regarding SYCL, Intel basically bought the only company that was shipping a good developer experience for it, CodePlay, which used to do specialized compilers for game consoles, and pivoted into GPGPU.
There are a lot of hardware features inaccessible from WebGPU because its devs still have a lot of work to do on the current implementations; no browser is even shipping it on Linux by default.
Vulkan is closer, but CUDA still exposes more features.
Not quite right, Chrome is shipping with support on ChromeOS, Android and WebOS.
It is GNU/Linux that Google/Chrome sees as low priority.
WebGPU is the lowest common denominator over ancient mobile phones and modern hardware, as well as over Vulkan, DirectX, and Metal, effectively making it an ancient-mobile-phone graphics API. At the time of its release it was as outdated as WebGL was when it came out.
Additionally, Chrome sabotaged the WebGL 2.0 compute effort in the name of WebGPU, which 5 years later still isn't fully available.
However, despite all of this, there is hardly any Web 3D experience or game at the level of iPhone games from the OpenGL ES 3.0 glory days, like Infinity Blade from 2011!