arxiv:2405.07719

A Unified Sequence Parallelism Approach for Long Context Generative AI

Published on May 13, 2024

Abstract

A unified sequence parallelism approach for generative AI models is proposed, enhancing robustness across transformer architectures and hardware, and achieving high memory and communication efficiency.

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e., DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topology. It compares the communication and memory cost of SP with that of existing parallelism methods, including data/tensor/ZeRO/expert/pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism involving SP. We achieved 86% MFU on two 8xA800 nodes using SP for sequence length 208K for the LLAMA3-8B model. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
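
To make the idea of dividing the sequence dimension concrete, here is a minimal single-process sketch. It is not the paper's code and does not use the API of DeepSpeed-Ulysses, Ring-Attention, or the linked repository. It assumes one query vector, omits the usual 1/sqrt(d) scaling, and simulates "devices" as list shards. The function names (`shard_sequence`, `full_attention`, `sharded_attention`) are hypothetical. It shows that key/value shards can be visited one at a time and merged with a running softmax, which is the kind of blockwise accumulation that ring-style attention relies on.

```python
import math


def shard_sequence(seq, num_devices):
    """Split a list of per-token items into contiguous, equal-sized shards."""
    if len(seq) % num_devices != 0:
        raise ValueError("sequence length must be divisible by num_devices")
    size = len(seq) // num_devices
    return [seq[i * size:(i + 1) * size] for i in range(num_devices)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax(q . K) V for one query over the whole sequence."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def sharded_attention(q, key_shards, value_shards):
    """Process one key/value shard at a time, merging with a running softmax."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(value_shards[0][0])
    for keys, values in zip(key_shards, value_shards):
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        # On the first shard this factor is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    keys = [[0.1 * i, -0.05 * i] for i in range(8)]
    values = [[float(i), float(i * i)] for i in range(8)]
    q = [0.3, 0.7]
    ref = full_attention(q, keys, values)
    out = sharded_attention(q, shard_sequence(keys, 4), shard_sequence(values, 4))
    print(all(abs(a - b) < 1e-9 for a, b in zip(ref, out)))
```

In a real multi-device setting each shard would live on a different device and the exchange of shards would be a communication step; this sketch only illustrates the arithmetic that makes such a split possible.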
