Some concepts in ML/DL/LLM

April 28, 2025, last update: April 12, 2026

As a promising field, DL/ML/LLM keeps producing "new concepts" that are re-invented every day. But most of them are just aliases of existing old concepts, renamed to draw attention from investors/researchers. In this post, I'm going to record my two-cents understanding of some fancy terms/concepts with simple (but maybe not accurate enough) explanations.

LLM training/serving

PP and vPP

Pipeline parallel size (PP) and virtual pipeline parallel size (vPP) are two knobs that control the model-layer layout when applying pipeline parallelism. When vPP is set, the model layers are scattered across the PP ranks in an interleaved way: each rank hosts several non-contiguous layer chunks (virtual stages), and each virtual stage processes a micro-batch in turn to minimize the pipeline bubbles during training. This was proposed in the Megatron-LM paper 1.
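The interleaved layout can be sketched as a layer-to-rank mapping (a toy illustration, not Megatron's actual API; function and variable names are mine):

```python
def layer_assignment(num_layers: int, pp: int, vpp: int = 1):
    """Map each pipeline rank to the model layers it hosts.

    Without vPP, each rank holds one contiguous block of layers.
    With vPP, the model is cut into pp * vpp chunks and rank p hosts
    chunks p, p + pp, p + 2*pp, ... (the interleaved layout).
    """
    chunks = pp * vpp
    assert num_layers % chunks == 0, "layers must divide evenly into chunks"
    per_chunk = num_layers // chunks
    assignment = {}
    for rank in range(pp):
        layers = []
        for v in range(vpp):
            chunk = rank + v * pp          # interleaved chunk index
            start = chunk * per_chunk
            layers.append(list(range(start, start + per_chunk)))
        assignment[rank] = layers
    return assignment

# 16 layers, PP=4: plain PP gives rank 0 layers 0-3; with vPP=2,
# rank 0 instead hosts two virtual stages, layers [0, 1] and [8, 9].
print(layer_assignment(16, pp=4))         # {0: [[0, 1, 2, 3]], ...}
print(layer_assignment(16, pp=4, vpp=2))  # {0: [[0, 1], [8, 9]], ...}
```

Because each rank now owns smaller chunks, a micro-batch finishes a stage sooner and the next one can start earlier, which is where the bubble reduction comes from.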

ETP and EDP 2

Conventionally, TP means we split the weights of the attention modules across different ranks. By extension, we can apply TP to MoE modules in a similar way: the weights of the same expert are split across different ranks.

🧐

Q: Why does traditional EP have to be a subgroup of DP?

A: To ensure that all tokens in a DP replica can be routed to any possible expert.

With the proposed expert parallel folding, expert TP/DP can be decoupled from the attention modules, offering more flexible parallel configurations for LLM layer orchestration.
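The conventional constraint from the Q&A above can be sketched as a rank partition (a toy illustration of the grouping, not any framework's API):

```python
def expert_parallel_groups(dp_ranks, ep_size):
    """Partition the ranks of one DP group into EP groups.

    Conventionally EP is a subgroup of DP: each EP group is a slice of
    the DP ranks, so any token held by a DP replica can reach any
    expert via an all-to-all within its EP group.
    """
    assert len(dp_ranks) % ep_size == 0, "EP size must divide DP size"
    return [dp_ranks[i:i + ep_size] for i in range(0, len(dp_ranks), ep_size)]

# 8 DP ranks with 4-way expert parallelism -> two EP groups
print(expert_parallel_groups(list(range(8)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Parallel folding relaxes exactly this nesting: the expert groups no longer have to be carved out of the attention modules' DP groups.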

Radix tree / radix attention

In current LLM applications, multiple parts of a prompt can be reused across requests, e.g., the system prompt in code completion or chatbots. The inference engine usually organizes the in-flight requests in a radix tree and manages the KV cache by exploiting opportunities to reuse prefixes across requests. For request scheduling and batching, the system may prioritize requests that share a large ratio of cached tokens to avoid frequent KV cache eviction and thrashing. Note that the longest-prefix-length traversal order is equivalent to a DFS on the radix tree 3.
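A minimal sketch of the prefix-matching idea, using a per-token trie (a real radix tree would compress runs of single-child nodes into edges; names are illustrative):

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}

class PrefixCache:
    """Toy prefix cache over token sequences.

    `match` returns how many leading tokens of a new request are
    already cached, i.e., how much prefill work can be skipped.
    """
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = PrefixCache()
cache.insert(["<sys>", "You", "are", "helpful", "Hi"])
# A new request sharing the system prompt reuses 4 cached tokens.
print(cache.match(["<sys>", "You", "are", "helpful", "Hello"]))  # 4
```

A scheduler can sort pending requests by this matched length; visiting them in that order walks the tree depth-first, keeping hot prefixes resident.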

RL in LLM

Mapping RL concepts onto LLM training: the LLM with the latest parameters is regarded as the actor model. The prompt, together with the autoregressively generated tokens (prefill + decode for a sentence), is treated as a sample. The reward model (another model) is used to evaluate the quality of RL samples under some criteria, e.g., code completion or math problem solving. Besides, extra metrics like the KL divergence are included in the final loss to prevent the RL generation from deviating too much from the corpus.
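One common way to wire the KL term in (as in RLHF-style reward shaping; this is a sketch under that assumption, with illustrative names and shapes, not any framework's API): the reward model's scalar lands on the last token, and every token pays a per-token KL penalty toward a frozen reference model.

```python
def kl_shaped_rewards(logp_actor, logp_ref, final_reward, beta=0.1):
    """Per-token rewards with a KL penalty toward the reference model.

    logp_actor / logp_ref: log-probs of each generated token under the
    actor and the frozen reference model; beta scales the penalty.
    """
    # per-token penalty: beta * (log pi_actor - log pi_ref)
    rewards = [-beta * (a, r).__getitem__(0) + beta * r for a, r in zip(logp_actor, logp_ref)]
    rewards = [-beta * (a - r) for a, r in zip(logp_actor, logp_ref)]
    rewards[-1] += final_reward  # reward model score at sequence end
    return rewards

# Actor slightly more confident than the reference on each token.
r = kl_shaped_rewards([-1.0, -0.5], [-1.2, -0.9], final_reward=1.0, beta=0.1)
print(r)
```

The larger beta is, the harder the actor is pulled back toward the reference distribution, trading reward maximization for staying close to the original corpus.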

Off-policy

In synchronous RL training, the weights used in generation and evaluation are identical. That is to say, generation must stop until the parameters are updated from evaluation in each iteration. The off-policy RL scheme refers to the asynchronous setting, which decouples the synchronization of parameters between generation and evaluation.

Footnotes

  1. Efficient large-scale language model training on GPU clusters using megatron-LM @ SC '21

  2. MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

  3. SGLang: Efficient Execution of Structured Language Model Programs