Network Innovation in the AI Era: Challenges in Large Model Training and Three Development Directions

The Importance of the Network and Its Innovation Directions in the AI Era

The network plays a key role in the era of large AI models. As model scale grows rapidly, multi-server clusters have become the main way to train these models, and this is what underpins the network's "rise" in the AI era. In the past the network was used mainly for data transmission; it is now used increasingly to synchronize model parameters between GPUs, which places higher demands on network density and capacity.
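To give a feel for why parameter synchronization stresses the network, here is a minimal sketch of the traffic involved. It assumes data-parallel training that synchronizes with a ring all-reduce, which the text above does not specify; the function names and the example numbers are hypothetical and chosen only for illustration.

```python
def ring_allreduce_bytes_per_gpu(param_count: int, bytes_per_param: int, n_gpus: int) -> float:
    """Bytes each GPU sends during one ring all-reduce over all parameters."""
    if n_gpus < 2:
        return 0.0  # a single GPU has nothing to synchronize
    payload = param_count * bytes_per_param
    # Reduce-scatter plus all-gather: 2 * (N - 1) steps, each moving payload / N bytes.
    return 2 * (n_gpus - 1) / n_gpus * payload


def sync_seconds(param_count: int, bytes_per_param: int, n_gpus: int,
                 link_bytes_per_s: float) -> float:
    """Lower bound on one synchronization, ignoring latency and compute overlap."""
    sent = ring_allreduce_bytes_per_gpu(param_count, bytes_per_param, n_gpus)
    return sent / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical numbers: 7e9 parameters, 2 bytes each, 8 GPUs, 50 GB/s per link.
    sent = ring_allreduce_bytes_per_gpu(7_000_000_000, 2, 8)
    print(f"{sent / 1e9:.1f} GB sent per GPU per synchronization")
    print(f"{sync_seconds(7_000_000_000, 2, 8, 50e9):.2f} s at 50 GB/s")
```

Under these assumptions the per-GPU traffic approaches twice the model size as the number of GPUs grows, and it recurs on every synchronization, so link bandwidth quickly becomes a limiting factor.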

Large model training faces three major challenges:

  1. Ever-larger models: Training time grows with the number of model parameters and the amount of training data, and shrinks as computing speed increases. Improving computational efficiency is therefore key to shortening training time, and the available computing power is directly determined by the number of devices and their parallel efficiency.

  2. Complex multi-GPU synchronization: Once the model is split across individual GPUs, their results must be aligned after every computation step. Collective operations such as All-to-All place higher demands on network transmission and switching.

  3. Increasingly expensive failures: Training a large model often takes months, and a single interruption can mean several days of retraining, resulting in significant losses. Modern AI networks have become a showcase of systems engineering, comparable in complexity to aircraft, aircraft carriers, and other complex systems.
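The relationship described in the first challenge can be sketched as a back-of-the-envelope estimate. The factor of 6 FLOPs per parameter per training token is a widely used rule of thumb for dense models and is an assumption here, not something the text states; the function and parameter names are likewise hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate_training_days(params: float, tokens: float, n_devices: int,
                           peak_flops_per_device: float,
                           parallel_efficiency: float,
                           flops_per_param_token: float = 6.0) -> float:
    """Rough training time: total work divided by effective cluster throughput."""
    if not 0 < parallel_efficiency <= 1:
        raise ValueError("parallel_efficiency must be in (0, 1]")
    # Work grows with both model size and data size.
    total_flops = flops_per_param_token * params * tokens
    # Effective compute is set by device count and how well the devices run in parallel.
    effective_flops_per_s = n_devices * peak_flops_per_device * parallel_efficiency
    return total_flops / effective_flops_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical cluster: 1,024 devices at 1e15 FLOP/s peak each.
    for efficiency in (0.3, 0.5):
        days = estimate_training_days(70e9, 2e12, 1024, 1e15, efficiency)
        print(f"efficiency {efficiency:.0%}: about {days:.0f} days")
```

In this simplified model, doubling the device count at constant efficiency halves the training time, while any loss of parallel efficiency (for example, GPUs idling while they wait on the network) lengthens it proportionally.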

Network innovation mainly revolves around three directions:

  1. Evolution of communication media: Optical modules, copper cables, and silicon-based interconnects each have their own advantages, and all are being explored for ways to reduce cost and improve performance.

  2. Competition among network protocols: Chip-to-chip communication protocols are tightly bound to the GPU itself, while in node-to-node communication the competition is mainly between InfiniBand (IB) and Ethernet.

  3. Changes in network architecture: The Leaf-Spine architecture faces challenges, and newer architectures such as Dragonfly and rail-only are expected to become the direction of evolution for ultra-large clusters.
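One way to see the pressure on Leaf-Spine is to count how many endpoints it can attach. The sketch below assumes a non-blocking fabric built from identical switches, with half of each leaf's ports facing hosts; these are simplifying assumptions for illustration, and the function names are hypothetical. Dragonfly and rail-only topologies are not modeled.

```python
def leaf_spine_max_hosts(ports_per_switch: int) -> int:
    """Max hosts in a two-tier non-blocking leaf-spine of identical switches."""
    down = ports_per_switch // 2      # host-facing ports on each leaf
    max_leaves = ports_per_switch     # each spine port connects to one leaf
    return max_leaves * down


def three_tier_fat_tree_max_hosts(ports_per_switch: int) -> int:
    """Max hosts in a three-tier fat-tree of identical k-port switches (k^3 / 4)."""
    return ports_per_switch ** 3 // 4


if __name__ == "__main__":
    for k in (32, 64, 128):
        print(f"{k}-port switches: "
              f"two-tier {leaf_spine_max_hosts(k):,} hosts, "
              f"three-tier {three_tier_fat_tree_max_hosts(k):,} hosts")
```

With 64-port switches, two tiers top out at 2,048 endpoints under these assumptions. Going beyond that means adding a third tier, with more switches, more optics, and more hops per path, which is the kind of cost that motivates alternative topologies for ultra-large clusters.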

Investment recommendation: focus on companies in the core and innovative segments of communication systems. Overall, network innovation in the AI era will revolve around "cost reduction", "openness", and keeping the network in balance with the scale of computing power, continuing to drive communication technology forward.
