Some concepts in ML/DL/LLM

April 28, 2025

As a promising field, DL/ML/LLM sees "new concepts" re-invented every day. But most of them are just aliases of existing old concepts, renamed to draw attention from investors/researchers. In this post, I'm going to record my two cents' understanding of some fancy terms/concepts with simple (but maybe not perfectly accurate) explanations.

LLM serving

Radix tree / radix attention

In the current form of LLM applications, several parts of a prompt can be reused across different requests, e.g., the system prompt in code completion or chatbots. The inference engine usually organizes in-flight requests in a radix tree and manages the KV cache by exploiting opportunities to reuse shared prefixes across requests. For request scheduling and batching, the system may prefer requests that share a large ratio of cached tokens, to avoid frequent KV cache eviction and thrashing. Note that scheduling requests in longest-matched-prefix order is equivalent to a DFS over the radix tree 1.
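
Below is a minimal sketch (in plain Python, with made-up node/method names, not SGLang's actual implementation) of a radix tree over token IDs, showing how an engine could look up the longest cached prefix of a new request. A real system would also attach KV-cache blocks and eviction metadata to each edge.

```python
class RadixNode:
    def __init__(self):
        # first token of an outgoing edge -> (edge token list, child node)
        self.children = {}

class RadixTree:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already stored."""
        node, matched = self.root, 0
        while matched < len(tokens):
            entry = node.children.get(tokens[matched])
            if entry is None:
                break
            edge, child = entry
            k = 0
            while (k < len(edge) and matched + k < len(tokens)
                   and edge[k] == tokens[matched + k]):
                k += 1
            matched += k
            if k < len(edge):          # diverged in the middle of an edge
                break
            node = child
        return matched

    def insert(self, tokens):
        """Insert a token sequence, splitting edges at divergence points."""
        node, i = self.root, 0
        while i < len(tokens):
            entry = node.children.get(tokens[i])
            if entry is None:
                node.children[tokens[i]] = (list(tokens[i:]), RadixNode())
                return
            edge, child = entry
            k = 0
            while (k < len(edge) and i + k < len(tokens)
                   and edge[k] == tokens[i + k]):
                k += 1
            if k < len(edge):          # split the edge at the divergence point
                mid = RadixNode()
                mid.children[edge[k]] = (edge[k:], child)
                node.children[tokens[i]] = (edge[:k], mid)
                child = mid
            node, i = child, i + k

# Two requests sharing a system prompt reuse the same cached prefix.
tree = RadixTree()
tree.insert([1, 2, 3, 4, 5])            # system prompt + first user turn
print(tree.match_prefix([1, 2, 3, 9]))  # -> 3 tokens of KV cache are reusable
```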

RL in LLM

Mapping RL concepts onto LLM training: the LLM with the latest parameters is regarded as the actor model. The prompt, together with the autoregressively generated tokens (prefill + decode for a sentence), is treated as a sample. A reward model (another model) is used to evaluate the quality of RL samples under some criteria, e.g., code completion or math problem solving. Besides, an extra term like the KL divergence against a frozen reference model is included in the final loss to prevent the RL generation from deviating too much from the original corpus/distribution.
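
As a rough illustration, here is a sketch of how a KL-penalized reward might be computed for one sample. The coefficient beta, the log-prob values, and the scalar reward-model score are all made up; real frameworks typically apply the penalty per token inside a PPO/GRPO-style objective rather than as a single scalar.

```python
def kl_penalized_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Scalar reward for one sampled response (prompt + generated tokens)."""
    # Monte Carlo estimate of KL(policy || reference) over the sampled tokens.
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) / len(policy_logprobs)
    return reward_score - beta * kl

# Example: the reward model scored the completion 0.8; the policy drifted
# slightly from the reference, so the effective reward is discounted.
policy_lp = [-0.2, -0.5, -0.1]   # log-probs of sampled tokens under the actor
ref_lp    = [-0.3, -0.9, -0.2]   # log-probs of the same tokens under the reference
print(kl_penalized_reward(0.8, policy_lp, ref_lp))   # 0.8 - 0.1 * 0.2 = 0.78
```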

Off-policy

In synchronous RL training, the weights used for generation and evaluation are identical. That is to say, generation has to stop until the parameters are updated by the evaluation/training step in each iteration. The off-policy RL scheme refers to the asynchronous setting, which decouples the parameter synchronization between generation and evaluation.
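
A toy sketch of the two schedules (function names and the staleness bound are illustrative, not from any particular framework): the on-policy loop always waits for the freshest weights, while the off-policy loop lets generation run on a possibly stale snapshot and only resyncs after a few updates.

```python
def generate(weights_version):
    # Stand-in for an autoregressive rollout with the weights at `weights_version`.
    return {"weights_version": weights_version, "tokens": [0, 1, 2]}

def train_step(sample, weights_version):
    # Stand-in for one optimizer update; returns the new weights version.
    return weights_version + 1

def on_policy_loop(steps):
    # Generation always blocks on the latest weights before the next rollout.
    version = 0
    for _ in range(steps):
        sample = generate(version)
        version = train_step(sample, version)

def off_policy_loop(steps, max_staleness=2):
    # Generation keeps using a (possibly stale) snapshot while training proceeds,
    # and only resyncs once the gap exceeds `max_staleness` versions.
    trainer_version, generator_version = 0, 0
    for _ in range(steps):
        sample = generate(generator_version)
        trainer_version = train_step(sample, trainer_version)
        print(f"trained on a v{sample['weights_version']} sample, trainer at v{trainer_version}")
        if trainer_version - generator_version >= max_staleness:
            generator_version = trainer_version

on_policy_loop(4)
off_policy_loop(4)
```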

Footnotes

  1. SGLang: Efficient Execution of Structured Language Model Programs