I think this democratizes access to massive models but risks turning personal devices into unmonitored compute farms. For creators, frictionless joining could lead to unexpected battery drain and thermal throttling for casual users. On licensing, the shift from Apple-only to Android support broadens the ecosystem but complicates unified workflow standards.
No H100s needed: Three Apple computers can power a 400B-parameter large language model.
The key behind this achievement is an open-source distributed AI inference framework on GitHub, which has already garnered 2.5k stars.

Using this framework, users can build their own AI computing clusters in minutes using everyday devices like iPhones and iPads.

The framework is named exo. Unlike other distributed inference frameworks, it employs a peer-to-peer (P2P) connection method, allowing devices to automatically join the cluster once connected to the network.
A developer used the exo framework to connect two MacBook Pros and one Mac Studio, achieving a computing speed of 110 TFLOPS.
The same developer stated that they are ready for the upcoming Llama-3-405B model.

The exo official team has also announced that support for Llama-3-405B will be provided on day zero (the first day of release).

Moreover, it is not just computers; exo allows devices such as iPhones and iPads to join the local computing network, with even Apple Watches being capable of contributing.

With version iterations, the exo framework is no longer limited to Apple devices (initially supporting only MLX). Users have successfully integrated Android phones and RTX 4090 graphics cards into the cluster.

Setup Takes Under a Minute
I followed the release of exo, and what stood out to me is its departure from traditional master-worker architectures. Instead, it leverages peer-to-peer (P2P) networking to connect devices directly.
As long as your hardware shares a local area network (LAN), these machines can automatically join exo’s computing mesh to run large models together. This democratizes access to heavy compute without needing a dedicated server rack.
When splitting models across this distributed pool, exo employs various sharding strategies. The default is ring memory-weighted partitioning, which organizes inference in a ring topology. Here, each device executes multiple model layers, with the count proportional to its available memory capacity.

The friction here is low: the process requires almost no manual configuration. After installation and startup, the system auto-discovers devices on the LAN, with Bluetooth connectivity slated for future updates. In a video demonstration by the author, setup on two new MacBook Pros took about 60 seconds. By that mark, the program was already running in the background.
I think seamless P2P discovery reduces the technical barrier for indie developers and hobbyists.

From the interface shown, exo supports Tiny Chat, a graphical user interface, and an API compatible with OpenAI’s standards. However, these controls are accessible only on the tail node within the cluster, which might complicate multi-user workflows in shared home labs.
For creators, aPI compatibility is great for integration but limits direct control to a single device in the chain.

Under the hood, exo currently supports Apple’s MLX framework and the open-source tinygrad. Adaptation for llama.cpp is also underway, expanding its potential hardware compatibility.
On licensing, expanding backend support could eventually allow creators to leverage older or diverse hardware more effectively.
However, there are growing pains. Due to iOS implementation updates lagging behind Python, several issues have arisen with the mobile app. The author has temporarily taken down the iPhone and iPad versions of exo; those who wish to try it can contact the author via email. This highlights the instability of relying on bleeding-edge OS integrations for production-like tasks.
I think mobile instability forces creators to rely on desktops, limiting true on-the-go AI experimentation.

Home AI Cluster of PCs and Tablets Runs 400B Model, Garnering 2.5K GitHub Stars
Netizens: Is It Really That Useful?
I followed the release of this project, and what stands out immediately is who holds the power in this new distributed computing stack. While the promise of democratizing access to a 400B parameter model sounds empowering for independent creators, it currently favors those with existing high-end hardware or reliable local networks. For most, the friction of orchestration outweighs the privacy benefits.
The concept of running large language models on local devices sparked widespread discussion on Hacker News. The advantages of localized operation include better privacy protection, offline access to models, and support for personalized customization.

Some pointed out that building a cluster using existing devices for large model computation has lower long-term costs compared to cloud services.

However, regarding the specific project exo, many expressed doubts. First, some netizens noted that the computing power of existing older devices is orders of magnitude lower than that of professional service providers. While it might be fun for curiosity’s sake, achieving top-tier performance at a comparable cost to large platforms is impossible.
For creators, this approach adds significant workflow friction for creators who just want reliable inference without managing distributed nodes.

Others pointed out that the devices used in the author’s demonstration were high-end hardware. A Mac device with 32GB of memory might cost over $2,000, a price tag that could instead buy two RTX 3090 graphics cards. They even argued that since Apple is involved, the term “affordable” hardly applies.
On licensing, high entry costs for compatible hardware exclude many independent creators from participating in this local-first movement.

This raises another question: Which devices does the exo framework support? Is it exclusive to Apple? Netizens asked more directly, inquiring about Raspberry Pi compatibility. The author replied that theoretically, it is possible, though not yet tested; testing will be the next step.
I think unclear hardware support limits commercial viability for creators relying on diverse or legacy equipment.

Beyond the computing power of the devices themselves, some added that network transmission speed bottlenecks could also limit cluster performance. The framework’s author personally addressed this concern:
The data transmitted within exo consists of small activation vectors, not entire model weights.
For the Llama-3-8B model, activation vectors are approximately 10KB; for Llama-3-70B, they are about 32KB.
Local network latency is typically very low (<5ms) and does not significantly impact performance.

The author stated that since the framework currently supports tinygrad, it theoretically supports all devices capable of running tinygrad, even though testing has primarily been conducted on Macs. The framework is still in an experimental stage, with the future goal of making it as simple to use as Dropbox (a cloud storage service).
For creators, experimental status means creators risk broken workflows when dependencies shift during this early development phase.

By the way, the exo official team has listed some current shortcomings they plan to address and has offered public bounties. Those who resolve these issues will receive rewards ranging from $100 to $500.

Comments
Sign in to join the discussion and leave a comment.
Sign in with Google