
Accelerating the Future of AI, One Cache at a Time

The first open-source Knowledge Delivery Network (KDN), serving LLM applications up to 8x faster and at up to 8x lower cost.

Prompt Caching
Enable fast, uninterrupted interactions with AI chatbots and document-processing tools by caching the KV caches of long conversation histories so they can be reused instead of recomputed (see the sketch below).

Without LMCache: Slow Response


With LMCache: 8-10x Faster Response
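
At a high level, prompt caching stores the KV cache of a conversation's token prefix so a follow-up request can reuse it instead of recomputing prefill from scratch. Below is a minimal conceptual sketch of prefix-keyed KV reuse; the names are hypothetical and this is not LMCache's actual API, which operates on GPU tensors inside the serving engine.

```python
import hashlib
from typing import Optional

# Conceptual sketch only: PrefixKVStore and its methods are hypothetical names,
# not LMCache's actual API.
class PrefixKVStore:
    """Maps a hash of a prompt's token prefix to its precomputed KV cache."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    @staticmethod
    def _key(token_ids: list[int]) -> str:
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()

    def put(self, token_ids: list[int], kv_cache: bytes) -> None:
        self._store[self._key(token_ids)] = kv_cache

    def get(self, token_ids: list[int]) -> Optional[bytes]:
        return self._store.get(self._key(token_ids))

    def longest_cached_prefix(self, token_ids: list[int]) -> tuple[int, Optional[bytes]]:
        """Find the longest prefix whose KV cache is already stored.

        On a hit, the engine only needs to prefill the remaining suffix tokens,
        which is where the latency savings on long chat histories come from.
        """
        for end in range(len(token_ids), 0, -1):
            kv = self.get(token_ids[:end])
            if kv is not None:
                return end, kv
        return 0, None
```

In a chat workload the entire prior conversation is the shared prefix, so each new turn only needs to prefill the newly appended message.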

Fast RAG
Enhance the speed and accuracy of RAG queries by dynamically combining stored KV caches from multiple text chunks, a natural fit for enterprise search engines and AI-powered document processing (see the sketch below).

Without LMCache: Slow Response


With LMCache: 4-10x Faster Response
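
The "dynamically combining stored KV caches" step is the subject of the CacheBlend paper listed under Research Papers below: per-chunk KV caches are computed once, concatenated at query time, and only a small fraction of positions is recomputed to restore cross-chunk attention. The toy sketch below is illustrative only; all names in it are hypothetical, and the real system works on tensors inside the serving engine.

```python
from dataclasses import dataclass

# Toy illustration of CacheBlend-style KV fusion; all names are hypothetical.

@dataclass
class ChunkKV:
    chunk_id: str
    kv: list[str]  # stand-in for the chunk's key/value tensors

class ChunkKVStore:
    """Caches one KV entry per retrieved text chunk."""

    def __init__(self) -> None:
        self._store: dict[str, ChunkKV] = {}

    def get_or_compute(self, chunk_id: str, text: str) -> ChunkKV:
        if chunk_id not in self._store:
            # Stand-in for running prefill once per chunk (offline or on first miss).
            self._store[chunk_id] = ChunkKV(chunk_id, [f"kv({tok})" for tok in text.split()])
        return self._store[chunk_id]

def fuse_for_query(store: ChunkKVStore,
                   retrieved: list[tuple[str, str]],
                   recompute_ratio: float = 0.15) -> tuple[list[str], list[int]]:
    """Concatenate per-chunk KV caches and mark a small fraction of positions
    for recomputation, so cross-chunk attention is restored without a full prefill."""
    fused: list[str] = []
    for chunk_id, text in retrieved:
        fused.extend(store.get_or_compute(chunk_id, text).kv)
    n_fix = max(1, int(len(fused) * recompute_ratio))
    # Placeholder selection rule; CacheBlend instead picks roughly the positions
    # whose KV values deviate most once cross-chunk attention is accounted for.
    positions_to_recompute = list(range(len(fused) - n_fix, len(fused)))
    return fused, positions_to_recompute
```

Because only a small fraction of positions is recomputed, query latency grows with that fraction rather than with the full length of the concatenated context.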

Research Papers

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Authors: Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

Authors: Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang
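
CacheGen, listed above, keeps storage and delivery cheap by encoding KV caches into compact bitstreams instead of shipping raw 16-bit tensors. The snippet below illustrates only the basic storage-cost intuition with naive uniform quantization; it is not CacheGen's codec, which is considerably more sophisticated and bandwidth-adaptive.

```python
import numpy as np

# Toy illustration of the storage-cost intuition only; this is NOT CacheGen's codec.

def quantize_kv(kv: np.ndarray, bits: int = 8) -> tuple[np.ndarray, float, float]:
    """Uniformly quantize a float KV tensor to `bits` bits per value."""
    lo, hi = float(kv.min()), float(kv.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((kv - lo) / scale), 0, levels).astype(np.uint8)
    return q, lo, scale

def dequantize_kv(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return q.astype(np.float16) * scale + np.float16(lo)

# Toy KV cache shaped (layers, tokens, hidden), stored in fp16.
kv = np.random.randn(32, 1024, 128).astype(np.float16)
q, lo, scale = quantize_kv(kv)
print(f"raw: {kv.nbytes / 1e6:.1f} MB, quantized: {q.nbytes / 1e6:.1f} MB")  # roughly 2x smaller
```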

Advantages of LMCache

Scalability

LMCache scales effortlessly, eliminating the need for complex GPU request routing.

Cost Efficiency

Our novel compression techniques reduce the cost of storing and delivering KV caches.

Speed

Our unique streaming and decompression methods minimize latency, ensuring fast responses.

Cross-Platform

Seamless integration with popular LLM serving engines such as vLLM and TGI (see the configuration sketch at the end of this page).

Quality

LMCache enhances the quality of LLM inference through offline content upgrades.
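
For completeness, LMCache is typically enabled through the serving engine's configuration rather than application code. The sketch below shows roughly what that looks like with vLLM's KV-connector interface; the exact class, connector, and environment-variable names are assumptions that may differ across vLLM and LMCache versions, so consult the LMCache documentation for the current steps.

```python
# Hedged sketch of enabling LMCache inside vLLM via the KV-connector interface.
# Connector and environment-variable names are assumptions; check the LMCache
# docs for the exact spelling in your vLLM / LMCache versions.
import os

from vllm import LLM
from vllm.config import KVTransferConfig

# LMCache reads its own settings (storage backend, cache size, etc.) from a
# YAML file pointed to by this environment variable (assumed name).
os.environ["LMCACHE_CONFIG_FILE"] = "lmcache_config.yaml"

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",  # assumed connector name
        kv_role="kv_both",                  # store and retrieve KV caches
    ),
)

print(llm.generate("Summarize the cached document...")[0].outputs[0].text)
```

The intent of this design is that KV reuse happens inside the serving engine, so application prompt logic does not need to change.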