Zyphra, AMD, and IBM have spent the past year answering a question that has hovered over the AI industry: can AMD’s hardware support the training of a major foundation model at scale? Their answer is ZAYA1, a Mixture-of-Experts model built entirely on AMD GPUs, networking, and software.
The companies describe ZAYA1 as the first significant MoE foundation model trained end to end on AMD’s Instinct MI300X chips, Pensando networking, and ROCm software, all hosted on IBM Cloud. What stands out is how familiar the system design looks. Zyphra didn’t rely on exotic hardware or niche configurations. Instead, it built a cluster that resembles a standard enterprise setup, only without NVIDIA in the loop.
Early benchmarks suggest ZAYA1 performs competitively with established open models across reasoning, math, and coding tasks. For businesses stuck waiting for GPUs or dealing with rising supply costs, the result offers something that has been missing: a viable alternative with real performance behind it.
Why AMD’s Setup Helped Cut Costs While Keeping Performance Steady
Many engineering teams value memory capacity, stable iteration times, and strong communication throughput more than theoretical peak compute. AMD’s MI300X offers 192GB of high-bandwidth memory per GPU, which gives researchers room to run early experiments without the complexity of heavy parallelism from the start.
Zyphra designed each node with eight MI300X GPUs connected over InfinityFabric and paired each GPU with its own Pollara network card. A separate network handled dataset reads and checkpointing. The architecture is simple on purpose. Fewer moving parts help keep switching costs down and iteration times predictable.
What Makes ZAYA1 Stand Out
ZAYA1-base activates 760 million parameters out of a total pool of 8.3 billion and was trained on 12 trillion tokens through a three-stage process. The model leans on compressed attention, improved routing for expert selection, and lighter residual scaling to support stability in deeper layers.
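To put those headline figures in proportion, a quick back-of-the-envelope calculation helps. The sketch below uses only the numbers quoted above; the function names are illustrative and do not come from Zyphra's report.

```python
# Figures quoted in the article for ZAYA1-base.
ACTIVE_PARAMS = 760e6   # parameters activated per token
TOTAL_PARAMS = 8.3e9    # total parameter pool
TRAIN_TOKENS = 12e12    # training tokens across all three stages


def active_fraction(active: float, total: float) -> float:
    """Share of the parameter pool that is used for any single token."""
    return active / total


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per parameter."""
    return tokens / params


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    ratio = tokens_per_param(TRAIN_TOKENS, TOTAL_PARAMS)
    print(f"active share of parameters: {share:.1%}")
    print(f"training tokens per total parameter: {ratio:.0f}")
```

In other words, roughly 9% of the model's weights take part in processing any given token.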
Zyphra used both Muon and AdamW as optimizers. To make Muon efficient on AMD hardware, the team fused kernels and reduced memory overhead so the optimizer wouldn’t slow training. Batch sizes increased throughout the process, supported by storage pipelines tuned to deliver tokens at speed.
The model competes with popular peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. Thanks to the MoE structure, only a small portion of the model runs for each token, which lowers per-token compute and memory traffic during inference and cuts serving costs. The full set of expert weights still has to be stored, but for companies like banks this makes it more realistic to fine-tune domain-specific models without complex parallelism from day one.
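The idea that only part of an MoE model runs at once can be illustrated with a generic top-k router. This is a minimal sketch of textbook top-k routing in plain Python, not Zyphra's router, which the article describes only as "improved"; the toy experts and logits are assumptions made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=1):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)  # the unselected experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    # Toy experts (hypothetical): each just scales its input by a constant.
    experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
    logits = [0.1, 2.0, -1.0, 0.5]  # made-up router scores for one token
    print(route_top_k(logits, k=2))
    print(moe_forward([1.0, 2.0], experts, logits, k=1))
```

With k=1, only the expert with the highest router score is evaluated for the token; the other three contribute no compute at all.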
Getting ROCm to Run Smoothly
Zyphra was clear that shifting from a mature NVIDIA workflow to ROCm required time and tuning. The team adapted model dimensions, GEMM patterns, and microbatch sizes to fit MI300X’s strengths. InfinityFabric performs best when all eight GPUs in a node participate in collective operations, and Pollara reaches peak throughput with larger message sizes. Zyphra adjusted buffer sizes and routing to match these preferences.
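One common way to favor larger messages is to coalesce many small gradient tensors into fixed-size buckets before each collective call. The sketch below shows that general idea only; it is not taken from Zyphra's report, and the function name, tensor sizes, and bucket cap are assumptions chosen for illustration.

```python
def bucket_by_size(tensor_sizes_bytes, bucket_cap_bytes):
    """Greedily pack consecutive tensors into buckets of at most bucket_cap_bytes.

    Returns a list of buckets, each a list of tensor indices. A tensor larger
    than the cap ends up alone in its own bucket.
    """
    buckets = []
    current = []
    current_bytes = 0
    for idx, size in enumerate(tensor_sizes_bytes):
        # Close the current bucket if adding this tensor would exceed the cap.
        if current and current_bytes + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    # Hypothetical gradient sizes in MiB and a hypothetical 64 MiB bucket cap.
    sizes = [4, 4, 16, 32, 8, 64, 2]
    for bucket in bucket_by_size(sizes, bucket_cap_bytes=64):
        total = sum(sizes[i] for i in bucket)
        print(f"tensors {bucket} -> one collective of {total} MiB")
```

Each bucket would then be sent as a single collective operation, trading a little latency for fewer and larger transfers.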
Handling longer contexts, up to 32k tokens, required a mix of ring attention for sharded sequences and tree attention for decoding. Storage systems were optimized too, with bundled dataset shards and larger page caches to reduce scattered reads and speed checkpoint recovery.
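The core trick behind ring attention is that softmax attention can be accumulated one key/value block at a time, so each device only ever holds a slice of the sequence. The following is a single-process sketch of that accumulation for one query vector, under stated simplifications: no causal masking, no real device-to-device communication, and no tree attention. All function names are illustrative.

```python
import math


def full_attention(q, keys, values):
    """Reference: softmax attention over all keys at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]


def blockwise_attention(q, kv_blocks):
    """Accumulate attention block by block, as blocks would arrive around a ring."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = None
    for keys, values in kv_blocks:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        if acc is None:
            acc = [0.0] * len(values[0])
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
    blocks = [(keys[:2], values[:2]), (keys[2:], values[2:])]  # two "devices"
    print(full_attention(q, keys, values))
    print(blockwise_attention(q, blocks))
```

The two results agree up to floating-point rounding, which is what lets the sequence be sharded across GPUs without changing the output.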
Keeping Long Training Runs Stable
Weeks-long training jobs rarely run without hiccups. Zyphra built Aegis, a monitoring service that watches logs and system metrics and responds automatically to common issues such as NIC faults or memory errors. The team also raised RCCL timeouts to prevent minor network glitches from derailing entire jobs.
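The article does not describe how Aegis works internally, but the general pattern of matching log lines against known failure signatures and mapping them to automatic responses can be sketched briefly. Everything below is hypothetical: the rule names, regular expressions, action labels, and sample log lines are invented for illustration and are not Aegis's actual rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    action: str


# Hypothetical failure signatures and responses.
RULES = [
    Rule("nic_fault",
         re.compile(r"link down|nic .*error", re.I),
         "drain_node_and_restart_from_checkpoint"),
    Rule("memory_error",
         re.compile(r"uncorrectable ecc|hbm .*error", re.I),
         "cordon_gpu_and_restart_from_checkpoint"),
    Rule("collective_timeout",
         re.compile(r"collective .*timed? ?out", re.I),
         "retry_then_escalate"),
]


def triage(log_line):
    """Return (rule_name, action) for the first matching rule, or None."""
    for rule in RULES:
        if rule.pattern.search(log_line):
            return rule.name, rule.action
    return None


if __name__ == "__main__":
    # Made-up log lines, not real driver or RCCL output.
    sample = [
        "step 1200 loss 2.31",
        "eth3: Link down detected",
        "GPU 5: Uncorrectable ECC error",
        "collective operation timed out after 600s",
    ]
    for line in sample:
        print(f"{line!r} -> {triage(line)}")
```

The point of such a service is that a recognized fault triggers a recovery step immediately, instead of waiting for an operator to read the logs.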
Checkpointing was redesigned to use all GPUs instead of forcing data through a single node. Zyphra says this approach delivered more than ten times faster save times, improving overall resilience and reducing operator load.
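The difference between funneling a checkpoint through one node and letting every GPU write its own piece can be shown with a small sharded-save sketch. Here threads stand in for GPU ranks and JSON files stand in for tensor shards; the file naming and function names are assumptions, not Zyphra's implementation.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def save_shard(directory, rank, shard):
    """Write one rank's shard, renaming at the end so a partial file never looks complete."""
    path = os.path.join(directory, f"shard_{rank:05d}.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(shard, f)
    os.replace(tmp_path, path)
    return path


def save_sharded(directory, shards):
    """Every rank writes its own shard in parallel instead of sending it to one writer."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(save_shard, directory, rank, shard)
                   for rank, shard in enumerate(shards)]
        return [f.result() for f in futures]


def load_sharded(directory, world_size):
    """Read the shards back in rank order."""
    loaded = []
    for rank in range(world_size):
        with open(os.path.join(directory, f"shard_{rank:05d}.json")) as f:
            loaded.append(json.load(f))
    return loaded


if __name__ == "__main__":
    # Eight hypothetical ranks, matching the eight GPUs per node described earlier.
    shards = [{"rank": r, "weights": [r, r + 1]} for r in range(8)]
    with tempfile.TemporaryDirectory() as ckpt_dir:
        paths = save_sharded(ckpt_dir, shards)
        restored = load_sharded(ckpt_dir, world_size=len(shards))
        print(f"wrote {len(paths)} shards; round trip ok: {restored == shards}")
```

Because no single writer has to receive and serialize the whole model state, save time scales with the size of one shard instead of the size of the full checkpoint.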
What the ZAYA1 Milestone Means for AI Buyers
The report highlights clear parallels between NVIDIA’s ecosystem and AMD’s alternatives, such as NVLink versus InfinityFabric and NCCL versus RCCL. The takeaway is that AMD’s stack is now mature enough to support large-scale, industry-grade model development.
This doesn’t mean companies must abandon their existing NVIDIA clusters. A more realistic approach is a hybrid setup: use NVIDIA for production environments while relying on AMD for stages that benefit from MI300X’s memory capacity or ROCm’s openness. This reduces supplier risk and expands available training capacity without major infrastructure changes.
The broader lesson from Zyphra’s experience is practical. Treat model shapes as adjustable. Build networks to match the collective operations training will use. Add fault tolerance that protects GPU hours rather than merely logging errors. And modernize checkpointing so it no longer drags down training rhythm.