arxiv:2405.07719

A Unified Sequence Parallelism Approach for Long Context Generative AI

Published on May 13, 2024

Abstract

A unified sequence parallelism approach for generative AI models is proposed, enhancing robustness across transformer architectures and hardware, and achieving high memory and communication efficiency.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, namely DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topologies. It compares the communication and memory costs of SP with those of existing parallelism strategies, including data, tensor, ZeRO, expert, and pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism involving SP. We achieved 86% MFU on two 8xA800 nodes using SP for a sequence length of 208K with the LLAMA3-8B model. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
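To make the idea of dividing the sequence dimension concrete, the following toy sketch tracks only which token and head indices each device would hold. It is an illustration under stated assumptions, not the paper's implementation and not the API of the linked repository: the function names `shard_sequence` and `ulysses_all_to_all` are hypothetical, no tensors or real communication are involved, and the layout change shown (sequence-sharded to head-sharded) is the commonly described behavior of a DeepSpeed-Ulysses-style all-to-all.

```python
def shard_sequence(seq_len, num_devices):
    """Split token indices 0..seq_len-1 into contiguous, equal shards.

    Hypothetical helper: device r receives shard r.
    """
    if seq_len % num_devices != 0:
        raise ValueError("seq_len must be divisible by num_devices")
    chunk = seq_len // num_devices
    return [list(range(r * chunk, (r + 1) * chunk)) for r in range(num_devices)]


def ulysses_all_to_all(seq_shards, num_heads):
    """Simulate the layout change of a Ulysses-style all-to-all.

    Before: device r holds its sequence shard for all attention heads.
    After:  device r holds the full sequence for num_heads // P heads,
            where P is the SP degree (number of shards).
    Only index bookkeeping is simulated; no data is exchanged.
    """
    p = len(seq_shards)
    if num_heads % p != 0:
        # Assumption of this sketch: heads must split evenly across devices.
        raise ValueError("num_heads must be divisible by the SP degree")
    heads_per_dev = num_heads // p
    all_tokens = [t for shard in seq_shards for t in shard]
    return [
        {
            "tokens": list(all_tokens),
            "heads": list(range(r * heads_per_dev, (r + 1) * heads_per_dev)),
        }
        for r in range(p)
    ]


# Toy sizes chosen for readability, not taken from the paper.
shards = shard_sequence(seq_len=16, num_devices=4)
layout = ulysses_all_to_all(shards, num_heads=8)
print(shards[1])           # [4, 5, 6, 7]
print(layout[1]["heads"])  # [2, 3]
```

In this sketch the head-divisibility check reflects one way an SP scheme can be sensitive to model architecture: a model with fewer heads than devices cannot be split this way. A ring-style scheme instead keeps the sequence sharded and passes key/value blocks between neighboring devices, which is not modeled here.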