KV Streaming

Fast Long-Context LLM Serving via Streaming Layerwise-Compressed KV Cache

Project summary:

  • Led a project to overlap model inference, KV cache streaming, and decoding in a layerwise manner to reduce time-to-first-token (TTFT).
  • Developed a layerwise inference engine, encoding and decoding tools, and a streaming server for pipelined execution.
  • Achieved a 5–15% reduction in TTFT compared to the non-overlapped baseline.
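The layerwise overlap described above can be sketched as follows. This is an illustrative toy, not the project's actual code: the layer count, function names, and timings are assumptions. The key idea is that while layer i is being computed, layer i-1's KV cache is compressed and streamed on a separate I/O worker, so transfer time hides behind compute instead of adding to TTFT.

```python
import time
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4  # illustrative; real models have many more layers

def compute_layer(i):
    # Placeholder for per-layer prefill compute on the accelerator.
    time.sleep(0.01)
    return f"kv_{i}"

def stream_kv(kv):
    # Placeholder for compressing and streaming one layer's KV cache
    # to the decoding side.
    time.sleep(0.01)
    return f"sent_{kv}"

def pipelined_prefill():
    """Overlap layer i's compute with streaming of layer i-1's KV cache."""
    sent = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = None
        for i in range(NUM_LAYERS):
            kv = compute_layer(i)              # compute layer i
            if pending is not None:
                sent.append(pending.result())  # previous layer's stream done
            pending = io.submit(stream_kv, kv)  # stream layer i in background
        sent.append(pending.result())          # drain the last in-flight stream
    return sent
```

In the non-overlapped baseline, compute and streaming run back to back, so end-to-end prefill time is roughly their sum; the pipeline above hides most of the streaming cost behind the next layer's compute, which is the source of the TTFT reduction reported above.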