webFrame Achieves 5.8 Tokens/Second Throughput for On-Prem LLM Inference on Mac Clusters

webFrame, a solution developed by webAI, enables high-performance distributed inference of large language models (LLMs) on local Apple silicon devices. As detailed in a recent tweet by AI expert Rohan Paul, the technology allows models like Llama 3 70B to reach a first-token latency of approximately 1.8 seconds on a cluster of four Mac minis, demonstrating significant efficiency for on-premise LLM operations. This approach addresses the growing demand for local LLM deployment, preserving data privacy and reducing reliance on cloud infrastructure.

The core of webFrame's capability lies in its model-splitting techniques. A transformer model, viewed as "a long stack of math layers packed into weight matrices," is sliced into "shards" on disk, and each Mac loads only the portion it needs. According to Paul's tweet, this leverages "the same tensor-parallel idea first popularized in Megatron-LM" and involves "matrix multiplications... spread across machines at each step," with lightweight all-reduce calls reassembling the results.
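To make the tensor-parallel split concrete, the sketch below simulates a Megatron-style row-parallel matrix multiplication in NumPy: the weight matrix is sharded across four simulated "machines," each computes a partial product from its slice, and a sum stands in for the all-reduce. The shapes, the shard count, and the single-process simulation are illustrative assumptions; webFrame's actual sharding code is not public.

```python
import numpy as np

# Minimal single-process sketch of row-parallel tensor parallelism.
# Each "machine" holds one shard of the weight matrix; an all-reduce
# (here, a simple sum) reassembles the full result.

n_machines = 4
d_in, d_out = 1024, 4096

rng = np.random.default_rng(0)
x = rng.standard_normal((1, d_in))        # one token's activations
W = rng.standard_normal((d_in, d_out))    # full weight matrix

# Shard W row-wise and x column-wise so shard i only needs its own slice.
W_shards = np.split(W, n_machines, axis=0)
x_shards = np.split(x, n_machines, axis=1)

# Each machine multiplies its activation slice by its weight shard...
partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]

# ...and a lightweight all-reduce sums the partial products.
y = np.sum(partials, axis=0)

assert np.allclose(y, x @ W)  # matches the unsharded matmul
```

Because each partial product already has the full output shape, only these small result tensors need to cross the network, which is what keeps the all-reduce "lightweight."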

Further optimizing performance, webFrame integrates pipeline parallelism, staging "one chunk of consecutive layers per machine." The system mixes tensor and pipeline parallelism so that a "tiny micro-batch of tokens" flows "through stage 1, then hops to stage 2," overlapping compute and network time. During autoregressive generation, only "small activation tensors and single-token logits cross the wire," while the other machines cache their local activations to reduce latency, an approach similar to DeepSpeed's inference engine.
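A toy schedule helps show how micro-batches overlap across pipeline stages. In the sketch below, each "stage" owns a chunk of consecutive layers (plain tanh-linear maps stand in for transformer blocks), and at each tick a later stage works on one micro-batch while the earlier stage starts on the next. The sizes, the `run_stage` helper, and the scheduling loop are all hypothetical; they illustrate the pattern, not webFrame's scheduler.

```python
import numpy as np

# Toy pipeline-parallel schedule: micro-batches flow stage to stage,
# so compute on one machine overlaps with compute on the next.

n_stages, layers_per_stage, d = 2, 2, 64
rng = np.random.default_rng(1)

# Each stage holds its own chunk of consecutive layers.
stages = [
    [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(layers_per_stage)]
    for _ in range(n_stages)
]

def run_stage(stage, acts):
    for W in stage:
        acts = np.tanh(acts @ W)   # stand-in for a transformer layer
    return acts

micro_batches = [rng.standard_normal((4, d)) for _ in range(3)]

# Software-pipelined schedule: at tick t, stage s works on micro-batch t - s,
# so two stages are busy on different micro-batches at the same tick.
in_flight = {}
for t in range(len(micro_batches) + n_stages - 1):
    for s in reversed(range(n_stages)):        # later stages consume first
        mb = t - s
        if 0 <= mb < len(micro_batches):
            acts = micro_batches[mb] if s == 0 else in_flight[(s - 1, mb)]
            in_flight[(s, mb)] = run_stage(stages[s], acts)
            print(f"tick {t}: stage {s} ran micro-batch {mb}")
```

The printed schedule shows the overlap the tweet describes: after the first tick, both stages are busy every tick, and the only data handed between them is the small activation tensor for one micro-batch.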

A key efficiency gain comes from "entropy-weighted quantization," which reduces low-information layers from 16-bit to 4-bit precision while maintaining over 99% accuracy. This quantization means "less RAM, faster math, and smaller network packets," yielding a throughput of "≈5.8 tokens / s versus 2.3 tokens / s for an unoptimized baseline." The system's "Navigator" daemon further simplifies deployment by auto-detecting optimal network paths and setting up collective calls, rerouting traffic mid-stream if links slow down.
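The tweet does not spell out how "entropy-weighted quantization" selects its layers, but one plausible reading is sketched below: estimate the Shannon entropy of each layer's weight histogram and drop low-entropy (low-information) layers to 4-bit while leaving the rest at higher precision. The entropy estimator, the threshold value, and the uniform int4 scheme are all our own assumptions, not webAI's published criterion.

```python
import numpy as np

# Sketch of an entropy-weighted quantization pass: layers whose weight
# distribution carries little information are dropped from 16-bit to 4-bit.

def weight_entropy(W, n_bins=256):
    """Shannon entropy (in bits) of the layer's weight histogram."""
    hist, _ = np.histogram(W, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))

def quantize_4bit(W):
    """Uniform symmetric 4-bit quantization: 16 levels plus a scale factor."""
    scale = np.abs(W).max() / 7.0     # int4 range is -8..7
    q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
layers = {
    # Dense Gaussian weights: high-entropy histogram.
    "attn_out": rng.standard_normal((512, 512)),
    # Mostly-zero weights: one bin dominates, so entropy is low.
    "mlp_gate": np.where(rng.random((512, 512)) < 0.9,
                         0.0, rng.standard_normal((512, 512))),
}

THRESHOLD = 6.0  # bits; hypothetical cutoff
for name, W in layers.items():
    H = weight_entropy(W)
    if H < THRESHOLD:
        q, scale = quantize_4bit(W)
        print(f"{name}: entropy {H:.2f} bits -> quantize to 4-bit")
    else:
        print(f"{name}: entropy {H:.2f} bits -> keep 16-bit")
```

The intuition is that a low-entropy weight distribution can be represented with fewer discrete levels at little cost, which is exactly where the RAM, compute, and network-packet savings come from.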

This innovative approach allows "several modest Macs [to] behave like one giant GPU without leaking a single token to the cloud," as stated in the tweet. webFrame's ability to run large models efficiently on commodity hardware positions it as a leading solution for organizations seeking to deploy powerful LLMs while maintaining strict control over their data and infrastructure.