Pipeline Parallel (PP) partitions layers of a model across multiple devices to form a pipelined execution of the training.
PP takes as input a list of microbatches of data per iteration and performs pipelined training execution (forward, backward, and optimizer update) on each microbatch, while overlapping communication with computation on each device.
Existing PP systems suffer from multiple drawbacks, listed below, which prevent productization within a company:
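To make the role of microbatches concrete, here is a minimal pure-Python sketch (not veScale code) of splitting one batch into microbatches and accumulating gradients across them. The helper names `split_microbatches` and `grad_sum_squared_error`, and the toy model `y = w * x`, are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def split_microbatches(batch: Sequence[Sample], num_microbatches: int) -> List[List[Sample]]:
    """Split a batch into equally sized, contiguous microbatches."""
    if num_microbatches <= 0 or len(batch) % num_microbatches != 0:
        raise ValueError("batch size must be divisible by num_microbatches")
    size = len(batch) // num_microbatches
    return [list(batch[i * size:(i + 1) * size]) for i in range(num_microbatches)]


def grad_sum_squared_error(w: float, samples: Sequence[Sample]) -> float:
    """Gradient w.r.t. w of sum((w * x - y) ** 2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in samples)


def accumulated_grad(w: float, batch: Sequence[Sample], num_microbatches: int) -> float:
    """Accumulate per-microbatch gradients, as a pipeline does before the optimizer update."""
    total = 0.0
    for microbatch in split_microbatches(batch, num_microbatches):
        total += grad_sum_squared_error(w, microbatch)
    return total


if __name__ == "__main__":
    batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.0), (4.0, 9.0)]
    print(grad_sum_squared_error(1.5, batch), accumulated_grad(1.5, batch, 2))
```

Because the loss here is a sum over samples, accumulating over microbatches gives the same gradient as processing the full batch at once; with a mean-reduced loss, each microbatch gradient would need to be scaled accordingly.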
Complex API: assuming that model developers are also systems experts in PP
Hacking model code: requiring manual rewrites of the model code to run PP
Lacking single device abstraction: requiring manual rewrites of the training script to be PP device-specific
Lacking options of pipeline construction: relying on a single option of graph tracing, or perfect graph tracing, or solely manual construction of the pipeline.
Lacking customizability of pipeline schedule: deeply coupling the entire runtime (e.g., compute, communication) with a specific PP schedule (e.g., 1F1B)
Lacking diverse model support: supporting only sequential model architectures without branching, or supporting only pipeline stages with a single input or a single output rather than multiple inputs/outputs.
veScale PP offers a new PP framework that is both Easy-to-Use and Easy-to-Customize, and it is therefore used internally in our production.
Specifically, veScale PP provides:
Easy API: hiding the complexity of PP systems and runtimes from model developers
Zero model code change: keeping the original torch model code as it is for transparent pipelined models
Single device abstraction: keeping the single device training script as it is for transparent pipelined training on multiple devices
Multiple options of pipeline construction: users can flexibly choose between modes:
GRAPH_EAGER mode automatically traces and parses the model into a graph, splits the graph into pipeline stages, and constructs each stage for pipeline execution
MANUAL_EAGER mode lets users manually construct each pipeline stage for pipeline execution, without graph tracing, parsing, and splitting.
Customizable pipeline schedule: empowering users to define their own custom pipeline schedules, beyond our built-in schedules listed below:
1F1B
Interleaved 1F1B
Zero Bubble
Support for diverse models: supporting a comprehensive range of model architectures, including non-sequential models, multiple-input-multiple-output stages, etc.
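To illustrate what a pipeline schedule is, the 1F1B schedule named above can be sketched as an ordered list of forward ("F") and backward ("B") steps for each stage. This follows the commonly described non-interleaved 1F1B ordering (warm-up forwards, alternating steady state, cool-down backwards) and is not veScale's implementation; the names `one_f_one_b_schedule` and `max_in_flight` are hypothetical.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # ("F" or "B", microbatch index)


def one_f_one_b_schedule(stage: int, num_stages: int, num_microbatches: int) -> List[Op]:
    """Return the per-stage op order of a non-interleaved 1F1B schedule."""
    # Earlier stages run more warm-up forwards before their first backward arrives.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup

    ops: List[Op] = [("F", i) for i in range(warmup)]
    # Steady state: one forward followed by one backward.
    for i in range(steady):
        ops.append(("F", warmup + i))
        ops.append(("B", i))
    # Cool-down: drain the backwards of the warm-up microbatches.
    ops.extend(("B", i) for i in range(steady, num_microbatches))
    return ops


def max_in_flight(ops: List[Op]) -> int:
    """Peak number of microbatches whose forward ran but whose backward has not."""
    live, peak = 0, 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    for s in range(4):
        print(s, one_f_one_b_schedule(s, num_stages=4, num_microbatches=8))
```

In this sketch, stage `s` holds at most `num_stages - s` microbatches' worth of activations at once, which is the memory property usually cited for 1F1B. A custom schedule would amount to emitting a different op order per stage.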
Compared with Megatron-LM's PP, veScale PP offers not only better Ease-of-Use in all aspects (easy API, zero model code change, single device abstraction, multiple options of pipeline construction) but also Customizability, allowing users to conveniently define new pipeline schedules.
Compared with DeepSpeed, veScale PP requires no modification of model code. It further supports multi-stage scheduling for non-sequential multimodal architectures and multi-input settings, instead of being constrained by nn.Sequential's syntax.
Compared with the pre-release torchtitan, veScale PP provides: i) single device abstraction of the training script, ii) wider graph tracer support, iii) wider model architecture support, and iv) guaranteed bitwise accuracy alignment between PP and single device code.
Spinning up a PP job typically requires three steps: i) trace and parse the model graph, ii) construct the pipeline stages, and iii) execute the pipeline schedule. These steps are handled by PipeParser, PipeModule, and PipeEngine, respectively. Upon receiving the model definition, PipeParser (GRAPH_EAGER mode) breaks down the model code into an intermediate representation of low-level modules and operators, down to the granularity of your choice. Under MANUAL_EAGER mode, users only need to assign stage modules and their communication relationships. PipeModule collects the parameters, operators, and optimizer states belonging to the same stage, and resolves the communication topology among devices. PipeEngine then schedules the steps that execute training according to the chosen pipeline schedule.
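The three steps above can be mimicked with a toy, single-process sketch in plain Python. It does not use PipeParser, PipeModule, or PipeEngine and does not reflect their APIs; the helpers `split_into_stages`, `run_stage`, and `run_pipeline` are hypothetical, and layers are modeled as simple functions on floats.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def split_into_stages(layers: Sequence[Layer], num_stages: int) -> List[List[Layer]]:
    """Steps i/ii (toy): partition an ordered layer list into contiguous stages."""
    if not 0 < num_stages <= len(layers):
        raise ValueError("num_stages must be between 1 and the number of layers")
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # spread the remainder over early stages
        stages.append(list(layers[start:start + size]))
        start += size
    return stages


def run_stage(stage: Sequence[Layer], x: float) -> float:
    """Run every layer owned by one stage on a single activation."""
    for layer in stage:
        x = layer(x)
    return x


def run_pipeline(stages: Sequence[Sequence[Layer]], microbatches: Sequence[float]) -> List[float]:
    """Step iii (toy): forward each microbatch through all stages in order.

    A real PP runtime would place each stage on its own device, replace the
    hand-off below with send/recv, and overlap the work of different microbatches.
    """
    outputs = []
    for x in microbatches:
        for stage in stages:
            x = run_stage(stage, x)
        outputs.append(x)
    return outputs


if __name__ == "__main__":
    layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3,
              lambda x: x * x, lambda x: x + 10]
    stages = split_into_stages(layers, num_stages=2)
    print([len(s) for s in stages], run_pipeline(stages, [1.0, 2.0]))
```

The point of the sketch is only the separation of concerns: how the model is split into stages is independent of how the stages are later scheduled and executed.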
Example of using GRAPH_EAGER mode:
Example of using MANUAL_EAGER mode: Coming Soon.
APIs can be found in <repo>/vescale/pipe/pipe_stage.py and <repo>/vescale/pipe/pipe.py
More examples can be found in <repo>/test/parallel/pipeline/api/test_simple_api.py