KV Streaming
Fast Long-Context LLM Serving via Streaming Layerwise-Compressed KV Cache
Project summary:
- Led a project to overlap model inference, KV cache streaming, and decoding in a layerwise manner to reduce time to first token (TTFT).
- Developed a layerwise inference engine, encoding and decoding tools, and a streaming server for pipelined execution.
- Achieved a 5–15% reduction in TTFT compared to the non-overlapped baseline.
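The layerwise overlap described above can be sketched as a producer-consumer pipeline: as soon as a layer's forward pass finishes, its KV block is handed to a background streamer while the next layer computes, instead of waiting for the full prefill to complete. The sketch below is a minimal illustration of that scheduling idea only; `compute_layer` and `stream_worker` are hypothetical placeholders, not the project's actual engine or compression codec.

```python
import queue
import threading

NUM_LAYERS = 4

def compute_layer(layer_idx, hidden):
    # Placeholder per-layer forward pass: returns updated activations
    # and the KV block produced by this layer.
    return hidden + 1, f"kv_layer_{layer_idx}"

def stream_worker(kv_queue, sent):
    # Drains compressed KV blocks and "sends" them while compute continues.
    while True:
        block = kv_queue.get()
        if block is None:  # sentinel: all layers streamed
            break
        sent.append(block)  # stand-in for network send / decoder handoff

def prefill_with_overlap(hidden=0):
    kv_queue = queue.Queue()
    sent = []
    streamer = threading.Thread(target=stream_worker, args=(kv_queue, sent))
    streamer.start()
    for layer in range(NUM_LAYERS):
        hidden, kv = compute_layer(layer, hidden)
        # Overlap: layer i's KV streams in the background
        # while layer i+1's compute proceeds on this thread.
        kv_queue.put(kv)
    kv_queue.put(None)
    streamer.join()
    return hidden, sent
```

Because streaming of layer *i* no longer blocks the compute of layer *i + 1*, the decoder can begin work before the whole cache arrives, which is the source of the TTFT reduction.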