Ryzen AI 300 Explained: What This Chip Changes

Q: How fast does the Ryzen AI 300 run a local LLM?

It runs about **28 tokens/sec on Llama 3.2-3B** via the NPU and **20–80 tokens/sec** on small models, according to [RunAI Home](https://runaihome.com/blog/amd-lemonade-local-llm-server-npu-gpu-guide-2026) — interactive speeds for drafting, extraction, and classification, though slower than a discrete GPU on a single small model.

Q: How much does a Ryzen AI 300 machine cost?

As documented by [Of Zen and Computing](https://www.ofzenandcomputing.com/amd-ryzen-ai-300-series/), complete systems **start near $899**, while according to [XDA Developers](https://www.xda-developers.com/amd-ai-halo-mini-pc-now-available/), a high-memory 128GB Ryzen AI Max+ 395 mini-PC **launched at $3,999**.

Q: What are the limits of running models on the Ryzen AI 300?

Standard parts are limited to roughly **18B-parameter models at Q4** quantization, as documented by [RunAI Home](https://runaihome.com/blog/amd-lemonade-local-llm-server-npu-gpu-guide-2026), and most popular LLM runtimes still route work to the iGPU rather than the NPU, per [Zen van Riel](https://zenvanriel.com/ai-engineer-blog/ryzen-ai-300-vs-rtx-3060-local-llm-inference/).

Jun 14, 2026

The Ryzen AI 300 is AMD's family of laptop and mini-PC processors that combines a CPU, an integrated GPU, and a dedicated neural processing unit (NPU) that share the same system memory — letting a machine run language models on-device instead of calling a cloud API for every request.

That one-sentence definition is the anchor, because "Ryzen AI 300" is one of those terms that jumped from spec-sheet jargon to a real purchasing decision in a single hardware cycle. The chips shipped through mid-2026, and the practical claim attached to them — a small-business laptop or desktop running a useful model without a monthly API bill — is now testable rather than theoretical. This page is the plain-English explanation of what the Ryzen AI 300 is, what actually changed, why it changed now, who built it, and — fenced off in its own section — where we think it lands for small and mid-size operators over the next few years.

TL;DR

The Ryzen AI 300 pairs a Zen 5 CPU, an RDNA 3.5 integrated GPU, and an XDNA 2 NPU rated at up to 50 TOPS, all sharing one pool of system memory.
That shared memory is the real story: the NPU and GPU read the same RAM, so a laptop can hold a model that would normally need a discrete graphics card.
In practice it runs small-to-mid models locally — roughly 45 tokens per second on Llama 3 8B (Of Zen and Computing), fast enough to feel interactive for drafting, extraction, and classification.
The cost shift is from a recurring cloud bill to a one-time hardware purchase, with complete systems starting near $899 and high-memory mini-PCs reaching $3,999 (XDA Developers).
Honest limits: it is not a replacement for a data-center GPU on large models, NPU software is still maturing, and "50 TOPS" is a peak number, not a guarantee.

What actually happened

For years, "AI on a laptop" meant one of two things: a thin client calling a cloud model, or a gaming machine with a power-hungry discrete GPU. The Ryzen AI 300 collapses that choice by putting a purpose-built AI accelerator on the same die as the CPU and GPU, then letting all three draw from unified system memory.

According to Of Zen and Computing, the flagship Ryzen AI 9 HX 370 carries a 50 TOPS NPU paired with 12 CPU cores (4 Zen 5 + 8 Zen 5c) boosting up to 5.1 GHz. The same source notes the integrated Radeon 890M graphics ships with 16 compute units running up to 2.9 GHz — meaning the "graphics card" and the AI accelerator now both live on the processor itself, not on a separate add-in board.

The flagship NPU is rated at up to 50 TOPS of AI compute. That figure, per Of Zen and Computing, is what qualifies these chips for the on-device LLM workloads that previously required cloud infrastructure.

The headline benchmark people actually care about is throughput on a real model. According to RunAI Home, the platform runs Llama 3.2-3B at 28 tokens/sec on the NPU and GPT-OSS-20B at 19 tokens/sec, with small models reaching 20–80 tokens/sec depending on the configuration. Those are the kinds of numbers that decide whether on-device AI is a demo or a tool you'd actually put in front of staff.

| Workload | Result | Where it runs | Source |
| --- | --- | --- | --- |
| Llama 3 8B | 45 tokens/sec | iGPU/NPU | Of Zen and Computing |
| Llama 3.2-3B | 28 tokens/sec | NPU | RunAI Home |
| GPT-OSS-20B | 19 tokens/sec | NPU | RunAI Home |
| Mistral 7B | 10–15 tokens/sec | iGPU | Zen van Riel |
| 512×512 image | 3.2 seconds | iGPU | Of Zen and Computing |
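To make those throughput figures tangible, here is a minimal sketch that converts tokens per second into wait time for a reply. The 300-token reply length is a hypothetical value chosen for illustration, and the calculation ignores prompt processing (time to first token), so treat the results as rough lower bounds rather than measurements.

```python
# Throughput figures quoted in this article (tokens per second).
THROUGHPUT_TOK_PER_SEC = {
    "Llama 3 8B": 45.0,
    "Llama 3.2-3B (NPU)": 28.0,
    "GPT-OSS-20B": 19.0,
}


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Seconds spent generating output, ignoring prompt processing time."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


if __name__ == "__main__":
    reply_tokens = 300  # hypothetical reply length, not from the sources
    for model, rate in THROUGHPUT_TOK_PER_SEC.items():
        print(f"{model}: {generation_seconds(reply_tokens, rate):.1f} s")
```

At 28 tokens/sec a 300-token reply takes a little under 11 seconds of generation, which is why these rates suit drafting and extraction better than long-form output.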

The mechanism, in plain language

There are no equations here — just three ideas.

1. Three engines, one memory pool. A CPU is a generalist, a GPU is good at parallel math, and an NPU is built specifically for the matrix multiplication that neural networks do. The Ryzen AI 300 puts all three on one chip and — crucially — lets them share the same RAM instead of copying data between separate memory banks. On the larger "Strix Halo" variants, as documented by RunAI Home, that pool reaches 128GB of LPDDR5X unified memory with up to 96GB usable as VRAM — enough to hold models that a typical gaming GPU's 8–24GB cannot.

That unified-memory design is the single most important difference from a desktop graphics card. According to Zen van Riel, a 12GB RTX 3060 cannot fit a 32B model and is limited to 13B-class models, while a 32GB unified Ryzen AI 300 machine fits a 32B Qwen 2.5 Coder at 4-bit (~20GB) with room for context — and a 64GB configuration can run 70B models at 4-bit. The trade-off is raw speed, covered below.
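The capacity claims above follow from simple arithmetic: parameters times bits per weight, divided by eight, gives the size of the weights in bytes. The sketch below assumes a 1.25× overhead factor for context and runtime buffers. That factor is our assumption, picked because it reproduces the ~20GB figure quoted for a 32B model at 4-bit, and real overhead varies with context length and runtime.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead: float = 1.25) -> bool:
    """True if weights times an assumed overhead factor fit in memory_gb.

    `overhead` is an illustrative assumption, not a measured value.
    """
    return weights_gb(params_billion, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    for params, memory in [(13, 12), (32, 12), (32, 32), (70, 64)]:
        verdict = "fits" if fits(params, 4, memory) else "does not fit"
        print(f"{params}B at 4-bit in {memory}GB: {verdict}")
```

Note that on a unified-memory machine only part of the installed RAM is available to the GPU, so the `memory_gb` argument should be the usable figure, not the total on the spec sheet.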

2. The NPU does the boring, repetitive part efficiently. A model's "decode" step, generating one token at a time, is repetitive and power-sensitive. Routing it to the NPU instead of the GPU saves energy. As measured by RunAI Home, small models in NPU+CPU mode run at under 2W of power draw, and the NPU path delivers a time-to-first-token 2.3× faster than GPU-only on supported workloads.

3. Software turns the silicon into something usable. Hardware alone does nothing without a runtime. AMD's Lemonade project and tools like FastFlowLM map model layers onto the NPU and GPU. The honest state of that software is mixed: as documented by Zen van Riel, the 50 TOPS NPU is "not currently doing meaningful work for large language model inference" in popular runtimes like llama.cpp, Ollama, and LM Studio, which still route LLM work to the iGPU. The NPU's clearest wins today are on sub-1B-parameter models (transcription, classification, background blur), while emerging runtimes are beginning to pull LLM decode onto it.

| Component | What it is | Spec (Ryzen AI 9 HX 370 unless noted) | Source |
| --- | --- | --- | --- |
| NPU (XDNA 2) | AI accelerator | up to 50 TOPS | Of Zen and Computing |
| CPU (Zen 5) | General compute | 12 cores, 5.1 GHz | Of Zen and Computing |
| iGPU (RDNA 3.5) | Parallel math | Radeon 890M, 16 CUs, 2.9 GHz | Of Zen and Computing |
| Memory | Shared pool | up to 128GB LPDDR5X (Strix Halo) | RunAI Home |
| 70B model support | Large local model | 64GB config, 4-bit | Zen van Riel |

Why now — what constraint broke

The constraint that broke was memory bandwidth and capacity at the edge. A model's weights have to live somewhere fast; on a normal laptop the GPU's memory was both small and walled off from the CPU. By unifying the memory and adding an NPU tuned for low-power decode, AMD made it practical to keep a useful model resident on a portable machine. The capacity advantage is concrete: a unified-memory Ryzen part can hold far larger models than a same-priced discrete GPU, even if it generates tokens more slowly.

The second constraint was cost predictability. Cloud inference is a metered, recurring expense that scales with usage. On-device inference converts that into a fixed hardware purchase. According to XDA Developers, a Ryzen AI Max+ 395 mini-PC with 128GB of unified memory and a 16-core/32-thread CPU launched at $3,999, with a monthly power-cost estimate of about $16 at $0.15/kWh — numbers a small operator can put on a spreadsheet against a cloud API bill.
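The spreadsheet comparison described above reduces to a break-even calculation. The sketch below uses the $3,999 price and ~$16/month power estimate quoted from XDA Developers. The cloud bill values are hypothetical placeholders, so substitute your own invoice figures.

```python
import math
from typing import Optional


def breakeven_months(hardware_usd: float,
                     power_usd_per_month: float,
                     cloud_usd_per_month: float) -> Optional[int]:
    """Months until cumulative cloud spend exceeds hardware plus power.

    Returns None when the cloud bill is no higher than the local
    running cost, in which case the hardware never pays for itself.
    """
    monthly_saving = cloud_usd_per_month - power_usd_per_month
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_usd / monthly_saving)


if __name__ == "__main__":
    for cloud_bill in (100, 350, 1000):  # hypothetical monthly cloud bills
        months = breakeven_months(3999, 16, cloud_bill)
        print(f"${cloud_bill}/month cloud bill: break-even in {months} months")
```

The sketch deliberately leaves out setup time, maintenance, and hardware depreciation, all of which push the real break-even point later.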

There is a third, quieter constraint: the cost of model memory itself. According to Zen van Riel, usable model memory on a Ryzen AI 300 machine works out to roughly $20 per gigabyte versus about $50 per gigabyte on an RTX 3060 — and the mini-PC sustains AI load at 28 to 54 watts against a discrete-GPU system drawing "north of 200 watts." For an always-on inference box, that power and capacity math is what makes local viable.
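The wattage gap translates into dollars with one multiplication. This sketch assumes the machine runs under load around the clock for a 30-day month, which is a worst-case simplification, and reuses the $0.15/kWh rate quoted earlier. The 200W value stands in for the "north of 200 watts" discrete-GPU system.

```python
HOURS_PER_MONTH = 24 * 30  # assumes continuous operation for a 30-day month


def monthly_energy_cost(avg_watts: float,
                        usd_per_kwh: float = 0.15,
                        hours: float = HOURS_PER_MONTH) -> float:
    """Electricity cost in USD for a device drawing avg_watts for `hours`."""
    kwh = avg_watts * hours / 1000
    return kwh * usd_per_kwh


if __name__ == "__main__":
    for label, watts in [("mini-PC, low end", 28),
                         ("mini-PC, high end", 54),
                         ("discrete-GPU system", 200)]:
        print(f"{label} ({watts} W): ${monthly_energy_cost(watts):.2f}/month")
```

Under these assumptions the mini-PC costs roughly $3 to $6 a month to run at load, against about $21.60 for a 200W system.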

| Approach | Cost shape | Privacy | Best fit |
| --- | --- | --- | --- |
| Cloud API | Recurring, per-token | Data leaves the building | Spiky, large-model needs |
| Ryzen AI 300 laptop | One-time ~$899+ | Fully on-device | Steady small-model workloads |
| Ryzen AI Max+ mini-PC | One-time $3,999 | Fully on-device | Shared team inference box |

A Ryzen AI Max+ 395 mini-PC launched at $3,999 with 128GB unified memory, as reported by XDA Developers — positioning it as a shared on-device inference box for a small team.

Teams that already route documents and tickets through US Tech Automations workflows can treat a Ryzen AI 300 box as a model swap at the inference step, not a rebuild — the orchestration layer stays the same; only where the model runs changes.

Who shipped it

AMD designed and shipped the Ryzen AI 300 silicon. The on-device software ecosystem around it, the part that makes the NPU usable for LLMs, is a mix of AMD's own Lemonade effort and community runtimes. The FastFlowLM runtime produces the figure of 28 tokens/sec on Llama 3.2-3B on Strix Point parts, as documented by RunAI Home, and AMD Lemonade is positioned as the server layer for routing decode to the NPU. Mini-PC and laptop OEMs package the chip into shipping hardware, with systems available for pre-order in June 2026, as reported by XDA Developers.

For a team running an automated document or ticket pipeline, the practical integration point is narrow: the model is one step in a longer workflow that ingests an input, classifies or extracts, and routes the result for review. Teams that already run that pipeline through US Tech Automations workflows point the classify-and-extract step at a local Ryzen machine and leave the routing, approval, and logging steps untouched — which is why the change reads as a model swap rather than a migration. The hardware vendor ships the silicon; the runtime vendor maps the model onto it; the workflow layer decides when to call it.
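The "model swap, not a migration" idea can be sketched as a pipeline that depends only on an inference interface. Every name below (`InferenceBackend`, `EchoBackend`, `Pipeline`, `route_for_review`) is hypothetical and invented for illustration. This is not the API of US Tech Automations, AMD Lemonade, or any other product, and the stand-in backend returns a canned answer instead of calling a model.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class InferenceBackend(Protocol):
    """Anything that turns a prompt into text: a cloud API or a local runtime."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoBackend:
    """Stand-in backend for illustration; returns a canned classification."""

    label: str

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] category=invoice"


def route_for_review(result: str) -> str:
    """Placeholder for the routing, approval, and logging steps."""
    return f"queued for review: {result}"


@dataclass
class Pipeline:
    backend: InferenceBackend
    route: Callable[[str], str]

    def handle(self, document: str) -> str:
        prompt = f"Classify this document:\n{document}"
        return self.route(self.backend.complete(prompt))


if __name__ == "__main__":
    cloud = Pipeline(EchoBackend("cloud"), route_for_review)
    local = Pipeline(EchoBackend("local"), route_for_review)  # only the backend changes
    print(cloud.handle("Invoice #1"))
    print(local.handle("Invoice #1"))
```

Because `Pipeline` never references a specific vendor, moving inference on-device means constructing it with a different backend while the routing function stays untouched.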

The honest limits

This is not a data-center GPU, and on small models it is not even as fast as a discrete gaming GPU. As benchmarked by Zen van Riel, an RTX 3060 generates 30–50 tokens/sec on Mistral 7B while a Ryzen AI 300 iGPU manages only 10–15 tokens/sec on the same model: the Ryzen wins on capacity and efficiency, not peak speed. The 50 TOPS figure is also a peak NPU rating, and as noted, most popular LLM runtimes do not yet use the NPU at all. As documented by RunAI Home, a standard Strix Point part with ~32GB is limited to roughly 18B-parameter models at Q4 quantization, so the very large models still belong in the cloud or on the high-memory Strix Halo parts.

Signal vs Speculation

Everything above this line is sourced fact. This section is our forecast.

Our read: if the per-token economics hold, the Ryzen AI 300 turns "AI" from a line item that scales with usage into a fixed asset on the balance sheet. For a small operator running steady, bounded workloads such as summarizing inbound email, extracting fields from invoices, or classifying tickets, a one-time $899-and-up machine (per Of Zen and Computing) that generates 45 tokens/sec on an 8B model (same source) can quietly replace a recurring cloud bill within a year.

Our read: the bigger shift over 12–36 months is privacy-driven. Industries that handle regulated or confidential data — accounting, legal, healthcare back-offices — have a real reason to keep inference on-device. The unified-memory capacity advantage (a 64GB machine running 70B models per Zen van Riel) and the 2.3× faster time-to-first-token on NPU per RunAI Home make local-only deployment plausible for the first time on commodity hardware.

Our read: the limiter will be software, not silicon. The hardware is ready; the runtime maturity is the gating factor — today most LLM runtimes ignore the NPU entirely. We expect 2027 to be the year on-device runtimes catch up enough that "swap the cloud call for a local one" is a one-line config change rather than a project, at which point the idle 50 TOPS finally does LLM work.

For deeper, industry-specific implications, see the spoke pages in this cluster: what Ryzen AI 300 means for small businesses, what it means for accounting firms, and what it means for home services companies.

Key Takeaways

The Ryzen AI 300 unifies a Zen 5 CPU, RDNA 3.5 GPU, and a 50 TOPS XDNA 2 NPU over shared memory, per Of Zen and Computing.
It runs real models locally — 28 tokens/sec on Llama 3.2-3B via the NPU and small models at 20–80 t/s, per RunAI Home.
The economic shift is recurring cloud cost to a one-time purchase: a $3,999 128GB mini-PC running at ~$16/month, per XDA Developers.
Capacity beats speed: usable model memory costs ~$20/GB vs ~$50/GB on an RTX 3060, per Zen van Riel.
The limit is real: standard parts cap around 18B-parameter models and most LLM runtimes still ignore the NPU.

Frequently Asked Questions

What is the Ryzen AI 300?

It is a family of AMD processors that combines a Zen 5 CPU, an RDNA 3.5 integrated GPU, and an XDNA 2 NPU rated at up to 50 TOPS, all sharing system memory, according to Of Zen and Computing. The design lets a laptop or mini-PC run language models on-device.

How fast does the Ryzen AI 300 run a local LLM?

It runs about 28 tokens/sec on Llama 3.2-3B via the NPU and 20–80 tokens/sec on small models, according to RunAI Home — interactive speeds for drafting, extraction, and classification, though slower than a discrete GPU on a single small model.

What does "50 TOPS" actually mean?

It means the NPU can perform up to 50 trillion operations per second on AI workloads, a peak rating reported by Of Zen and Computing. In practice most LLM runtimes do not yet use the NPU, so real LLM throughput comes from the iGPU today.
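A back-of-the-envelope sketch shows why a peak rating is not a throughput promise. It relies on a common rule of thumb, which is our assumption and not from the cited sources: generating one token costs roughly two operations per model parameter. It also rounds Llama 3.2-3B to 3 billion parameters.

```python
def compute_bound_tokens_per_sec(peak_tops: float,
                                 params_billion: float,
                                 ops_per_param: float = 2.0) -> float:
    """Theoretical ceiling if every rated operation went into decoding.

    ops_per_param=2.0 is a rule-of-thumb assumption (one multiply and
    one add per weight per generated token).
    """
    ops_per_token = ops_per_param * params_billion * 1e9
    return peak_tops * 1e12 / ops_per_token


if __name__ == "__main__":
    ceiling = compute_bound_tokens_per_sec(50, 3)
    observed = 28.0  # Llama 3.2-3B on the NPU, per RunAI Home
    print(f"compute-bound ceiling: {ceiling:.0f} tokens/sec")
    print(f"observed share of ceiling: {observed / ceiling:.2%}")
```

The ceiling comes out in the thousands of tokens per second while the measured figure is 28, so factors other than raw operation count, such as memory traffic and runtime support, dominate real-world speed.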

How much does a Ryzen AI 300 machine cost?

As documented by Of Zen and Computing, complete systems start near $899, while according to XDA Developers, a high-memory 128GB Ryzen AI Max+ 395 mini-PC launched at $3,999.

Is it better than a gaming GPU for local AI?

It depends on the goal. A Ryzen AI 300 fits much larger models (a 64GB config runs 70B models at 4-bit, versus a 13B ceiling on a 12GB RTX 3060), but the discrete GPU is faster on a single small model, per Zen van Riel. Capacity and power efficiency favor the Ryzen.

What are the limits of running models on the Ryzen AI 300?

Standard parts are limited to roughly 18B-parameter models at Q4 quantization, as documented by RunAI Home, and most popular LLM runtimes still route work to the iGPU rather than the NPU, per Zen van Riel.

Freshness: written as of June 2026, based on the Ryzen AI 300 launch (announced 2026-05-01) and shipping hardware available for pre-order in June 2026.

Ready to run local models inside your existing pipelines? See how an agentic workflow platform handles a model swap without a rebuild.

About the Author

US Tech Automations Team

AI Automation Specialists

We design agentic automation workflows for small and mid-size operators, back-office teams, and on-device AI deployments.
