According to Ullal, two additional key features of Arista’s Etherlink platforms are:
- Predictable latency: “Rapid and reliable bulk transfer from source to destination is key to all AI job completion. Per-packet latency is important, but the AI workload is most dependent on the timely completion of an entire processing step. In other words, the latency of the whole message is critical. Flexible ordering mechanisms use all Etherlink paths from the NIC to the switch to guarantee end-to-end predictable communication.”
- Congestion management: “Managing AI network congestion is a common ‘incast’ problem. It can occur on the last link of the AI receiver when multiple uncoordinated senders simultaneously send traffic to it. To avoid hotspots or flow collisions across expensive GPU clusters, algorithms are being defined to throttle, notify, and evenly spread the load across multipaths, improving the utilization and TCO of these expensive GPUs with a VoQ fabric,” Ullal wrote. The Arista Virtual Output Queuing (VoQ) fabric features a distributed scheduling mechanism that guarantees traffic flow delivery in congested switch ports.
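To see why incast strains the receiver's last link, a back-of-the-envelope sketch helps. It is not Arista's scheduling algorithm. The function name, the buffer size, and the link speeds are hypothetical values chosen for illustration, and a megabyte is assumed to be 10^6 bytes.

```python
def time_to_fill_buffer_us(senders, sender_gbps, egress_gbps, buffer_mb):
    """Microseconds until the egress buffer overflows, or None if it never does.

    Illustrative model only: every sender transmits at full rate toward
    one receiver port, and nothing throttles or spreads the load.
    """
    excess_gbps = senders * sender_gbps - egress_gbps
    if excess_gbps <= 0:
        return None  # the egress link can drain everything it receives
    buffer_bits = buffer_mb * 8e6  # assumes 1 MB = 10**6 bytes
    return buffer_bits / (excess_gbps * 1e9) * 1e6


# Hypothetical case: 8 senders at 100 Gbps converge on one 100 Gbps port
# that has 16 MB of buffer available.
print(time_to_fill_buffer_us(8, 100, 100, 16))
```

Under these assumed numbers the buffer lasts well under a millisecond, which is why the quote stresses throttling, notifying, and spreading load rather than relying on buffering alone.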
Arista AI networking also depends on a combination of the vendor’s core EOS operating system and its natural-language, generative AI-based Autonomous Virtual Assist (AVA) system for delivering network insights, Ullal wrote.
“Arista AVA imitates human expertise at cloud scale through an AI-based expert system that automates complex tasks like troubleshooting, root cause analysis, and securing from cyber threats,” Ullal wrote. “It starts with real-time, ground-truth data about the network devices’ state and, if required, the raw packets. AVA combines our vast expertise in networking with an ensemble of AI/ML techniques, including supervised and unsupervised ML and NLP (Natural Language Processing). Applying AVA to AI networking increases the fidelity and security of the network with autonomous network detection and response and real-time observability.”
Regarding Arista’s EOS software stack, Ullal said it can help customers build resilient AI clusters. “EOS offers improved load balancing algorithms and hashing mechanisms that map traffic from ingress host ports to the uplinks so that flows are automatically re-balanced when a link fails,” Ullal wrote. “Our customers can now pick and choose packet header fields for better entropy and efficient load-balancing of AI workloads.”
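The general idea of hashing selected header fields to pick an uplink can be sketched as follows. This is a generic illustration, not EOS code or configuration. The function `pick_uplink`, the field names, and the uplink labels are all hypothetical, and SHA-256 stands in for whatever hash a switch would actually use.

```python
import hashlib


def pick_uplink(flow, fields, uplinks):
    """Map a flow to one live uplink by hashing the chosen header fields."""
    if not uplinks:
        raise ValueError("no live uplinks")
    key = "|".join(str(flow[f]) for f in fields).encode()
    digest = hashlib.sha256(key).digest()
    return uplinks[int.from_bytes(digest[:8], "big") % len(uplinks)]


# Hypothetical flows between the same two hosts, differing only in source port.
flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.1.1", "src_port": 49152 + i, "dst_port": 4791}
    for i in range(8)
]
uplinks = ["uplink1", "uplink2", "uplink3", "uplink4"]

# Hashing on IP addresses alone gives no entropy here: every flow lands on one uplink.
ip_only = {pick_uplink(f, ["src_ip", "dst_ip"], uplinks) for f in flows}
print("IP-only hashing uses", len(ip_only), "uplink(s)")

# Adding the source port to the hash key lets these flows spread out.
with_port = {pick_uplink(f, ["src_ip", "dst_ip", "src_port"], uplinks) for f in flows}
print("With src_port hashing uses", len(with_port), "uplink(s)")

# When a link fails, re-hashing over the surviving uplinks re-balances the flows.
live = [u for u in uplinks if u != "uplink2"]
print({f["src_port"]: pick_uplink(f, ["src_ip", "dst_ip", "src_port"], live) for f in flows})
```

A plain modulo re-hash like this one moves many flows when the uplink set changes. Real switches often use more careful schemes to limit that disruption, which this sketch does not attempt to model.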
Network visibility is also critical during training on the large datasets used to improve the accuracy of LLMs, according to Ullal. “In addition to the EOS-based Latency Analyzer that monitors buffer utilization, Arista’s AI Analyzer monitors and reports traffic counters at microsecond-level windows. This is instrumental in detecting and addressing microbursts which are difficult to catch at intervals of seconds,” Ullal wrote.
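A small sketch shows why the size of the measurement window matters for microbursts. It is not how Arista's AI Analyzer works. The function `peak_utilization`, the synthetic packet trace, and the 100 Gbps link speed are assumptions made purely for illustration.

```python
from collections import defaultdict


def peak_utilization(packets, window_us, link_gbps):
    """Highest link utilization seen in any window of window_us microseconds.

    packets is an iterable of (timestamp_us, size_bytes) pairs.
    """
    bytes_per_window = defaultdict(int)
    for ts_us, size_bytes in packets:
        bytes_per_window[ts_us // window_us] += size_bytes
    # Bytes the link can carry in one window at line rate.
    capacity_bytes = link_gbps * 1e9 / 8 * window_us / 1e6
    return max(bytes_per_window.values(), default=0) / capacity_bytes


# Synthetic trace: 250 packets of 4,000 bytes arriving within about 62 microseconds,
# then silence for the rest of the second.
burst = [(i // 4, 4000) for i in range(250)]

print(peak_utilization(burst, window_us=100, link_gbps=100))        # 100-microsecond windows
print(peak_utilization(burst, window_us=1_000_000, link_gbps=100))  # one-second window
```

With these assumed numbers the 100-microsecond view shows the link about 80% busy during the burst, while the one-second average rounds to almost nothing, so a counter sampled every second would not reveal the event.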
In general, AI training clusters require a fundamentally new approach to building networks, “given the massively parallelized workloads” that can cause congestion, according to Ullal. “Traffic congestion in any single flow can lead to a ripple effect slowing down the entire AI cluster, as the workload must wait for that delayed transmission to complete. AI clusters must be architected with massive capacity to accommodate these traffic patterns from distributed GPUs, with deterministic latency and lossless deep buffer fabrics designed to eliminate unwanted congestion,” she wrote.