LMCache – Building the foundation of AI memory tensor with KV Cache Infrastructure

GitHub Stars: 8,200+

Contributors: 350+

PyPI downloads: 115K+

Community members: 900+

Building the Foundation of AI Memory Tensor with KV Cache Infrastructure

LMCache pioneers KV cache infrastructure for LLM inference, turning KV cache into AI-native memory that can be stored, compressed, searched, and reused across your entire cluster.

Could AI dream with an electric LMCache?

Compatibility

Supports major inference engines, hardware platforms, and pluggable external storage backends.

Supported Deployment Patterns

No architectural changes required. Choose the deployment mode that matches your setup.

In-Process

Simplest integration.
LMCache runs inside the inference engine process.

[Diagram: Serving Engines 1, 2, ..., N, each with its own KV Library]

LMCache Multiprocess

Run LMCache as a standalone server, separate from the inference engine. The engine handles model execution, while LMCache manages KV cache storage, reuse, and recovery across workers, providing process isolation and cache that survives worker restarts or failures.

[Diagram: Serving Engines 1, 2, ..., N and one standalone LMCache Server]

Multiprocess (MP) mode is the recommended deployment path and the focus of future LMCache development.
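
To make the difference between the two deployment modes concrete, here is a toy Python sketch. It is not LMCache's API: `InProcessEngine`, `SharedCacheServer`, and `MultiprocessEngine` are hypothetical names, the "KV cache" is a plain string, and a restart is simulated by constructing a new engine object. It only illustrates why a cache held outside the engine can survive an engine restart.

```python
class InProcessEngine:
    """Hypothetical engine that owns its KV cache (in-process mode)."""

    def __init__(self):
        self.cache = {}  # lives and dies with the engine object

    def prefill(self, prompt):
        """Return True if the prompt's KV cache was already available."""
        hit = prompt in self.cache
        self.cache[prompt] = f"kv({prompt})"
        return hit


class SharedCacheServer:
    """Hypothetical standalone cache that outlives individual engines."""

    def __init__(self):
        self.cache = {}

    def get(self, key):
        return self.cache.get(key)

    def put(self, key, value):
        self.cache[key] = value


class MultiprocessEngine:
    """Hypothetical engine that delegates KV storage to a shared server."""

    def __init__(self, server):
        self.server = server

    def prefill(self, prompt):
        hit = self.server.get(prompt) is not None
        if not hit:
            self.server.put(prompt, f"kv({prompt})")
        return hit


def demo():
    # In-process: the cache disappears when the engine is restarted.
    engine = InProcessEngine()
    engine.prefill("system prompt")
    engine = InProcessEngine()  # simulated restart
    in_process_hit = engine.prefill("system prompt")

    # Multiprocess: the server keeps the cache across engine restarts.
    server = SharedCacheServer()
    MultiprocessEngine(server).prefill("system prompt")
    restarted = MultiprocessEngine(server)  # simulated restart
    multiprocess_hit = restarted.prefill("system prompt")
    return in_process_hit, multiprocess_hit
```

In this toy model, `demo()` reports a miss after the in-process restart and a hit after the multiprocess restart.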

LMCache Capabilities

LMCache is a modular KV cache layer for LLM inference, designed for workloads where long prompts, conversation history, and retrieved content are reused across requests. It helps teams store, reuse, compress, search, move, and observe KV cache across GPU, CPU memory, and external backends.

Store

Persist KV cache beyond GPU memory using CPU memory, local disk, or external backends.
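
As a rough illustration of tiered storage, the sketch below keeps a bounded number of entries in memory and spills the least recently used ones to local disk. `TieredKVStore` and its parameters are hypothetical and do not reflect LMCache's actual storage backends; the "KV chunk" is an ordinary Python list, and keys are assumed to be safe to use as file names.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier store: a bounded in-memory LRU that spills to disk."""

    def __init__(self, max_memory_entries, spill_dir):
        self.max_memory_entries = max_memory_entries
        self.spill_dir = spill_dir
        self.memory = OrderedDict()  # insertion order doubles as LRU order

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, kv_chunk):
        self.memory[key] = kv_chunk
        self.memory.move_to_end(key)
        # Evict least recently used entries to disk when over capacity.
        while len(self.memory) > self.max_memory_entries:
            old_key, old_chunk = self.memory.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_chunk, f)

    def get(self, key):
        """Return (chunk, tier), where tier is 'memory', 'disk', or 'miss'."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f), "disk"
        return None, "miss"


def demo():
    with tempfile.TemporaryDirectory() as spill_dir:
        store = TieredKVStore(max_memory_entries=2, spill_dir=spill_dir)
        for key in ("chunk-a", "chunk-b", "chunk-c"):
            store.put(key, [0.1, 0.2, 0.3])
        return [store.get(k)[1] for k in ("chunk-a", "chunk-c", "chunk-x")]
```

With room for two entries in memory, the oldest chunk is served from disk, the newest from memory, and an unknown key is a miss.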

Reuse

Load previously computed KV cache across requests to reduce repeated prefill work.
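
One common way to find reusable KV cache across requests is to split the token sequence into fixed-size chunks and key each chunk by a hash of the entire prefix up to and including it. The sketch below assumes that scheme; the function names, chunk size, and use of SHA-256 are illustrative choices, not a description of LMCache's internals.

```python
import hashlib


def chunk_keys(token_ids, chunk_size):
    """Return one key per full chunk; each key covers the prefix so far."""
    keys = []
    running = hashlib.sha256()
    full_length = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full_length, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Feeding chunks into one running hash ties every key to its prefix.
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


def reusable_prefix_tokens(token_ids, cache, chunk_size):
    """Count the leading tokens whose KV cache is already stored."""
    reused = 0
    for key in chunk_keys(token_ids, chunk_size):
        if key not in cache:
            break
        reused += chunk_size
    return reused


def demo():
    chunk_size = 4
    cache = set()
    first = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    cache.update(chunk_keys(first, chunk_size))  # stores two full chunks

    same_prefix = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]
    different_start = [9, 2, 3, 4, 5, 6, 7, 8]
    return (
        reusable_prefix_tokens(same_prefix, cache, chunk_size),
        reusable_prefix_tokens(different_start, cache, chunk_size),
    )
```

A request that shares the first eight tokens can skip prefill for those eight, while a request whose first token differs reuses nothing under exact prefix matching.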

Search

Find reusable KV cache chunks beyond exact prefix matches with CacheBlend.

Compress

Reduce KV cache memory footprint to support longer contexts and higher concurrency.
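
To give a feel for how compression trades precision for footprint, here is a minimal sketch of uniform 8-bit quantization over a list of floats. This is a deliberately simple stand-in, not the encoding LMCache or CacheGen uses; `quantize` and `dequantize` are hypothetical helpers. Storing each value in one byte instead of a 4-byte float would cut the payload roughly fourfold, at the cost of a bounded rounding error.

```python
def quantize(values, bits=8):
    """Map floats to signed integer levels that share one scale factor."""
    max_level = 2 ** (bits - 1) - 1  # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / max_level if peak else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(levels, scale):
    """Recover approximate floats from integer levels."""
    return [q * scale for q in levels]


def demo():
    kv_values = [0.5, -1.27, 0.03, 1.0]
    levels, scale = quantize(kv_values)
    restored = dequantize(levels, scale)
    max_error = max(abs(a - b) for a, b in zip(kv_values, restored))
    # Rounding to the nearest level bounds the error by half a step.
    return levels, max_error <= scale / 2 + 1e-12
```

The second value returned by `demo()` checks that no restored value is off by more than half a quantization step.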

Move

Transfer KV cache across workers, engines, and deployment modes for distributed inference.

Observe

Track cache behavior, storage movement, and reuse patterns across the serving stack.

Broad Ecosystem Collaboration

LMCache is used in production by infrastructure teams, cloud providers, and open-source projects worldwide.

Nvidia

Dynamo integrates seamlessly with popular inference engines like vLLM and open source tools like LMCache, enabling efficient cache reuse, reduced recomputation, and better support for long-context and high-concurrency workloads.

Google Cloud

For Google Kubernetes Engine users, LMCache’s tiered storage solution improves inference performance by using node-local storage, especially for long system prompts that generate large KV Caches.

CoreWeave

Together, LMCache and CoreWeave AI Object Storage form a tightly integrated system: LMCache handles cache serialization and coordination, while CoreWeave AI Object Storage provides the distributed performance backbone that makes external caching seamless.

Redis

LMCache reduces redundant computation by caching and reusing key-value (KV) pairs for repeated token chunks. Redis provides the real-time infrastructure to store and retrieve those chunks at scale. Together, they enable faster inference.

AMD

When integrated with vLLM, LMCache delivers 3–10× improvements on AMD Instinct MI300X GPUs for a wide range of community models, including Qwen3, Llama3, and Qwen-VL.

PyTorch Foundation

LMCache is the first and most efficient open source Key-Value caching solution.

Tensormesh

Every improvement to LMCache’s architecture means more efficient caching, faster inference, and lower bills for teams running AI workloads at scale.

Built on Best Research

LMCache grew out of systems research at the University of Chicago and continues to evolve as an open-source project. Join us in building the most efficient LLM serving infrastructure.

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

CacheGen: KV Cache Compression and Streaming for Fast LLM Serving

CacheBlend: Fast LLM Serving for RAG with Cached Knowledge Fusion

Resources

Practical guides, community tools, and everything you need to deploy, contribute to, and stay current with LMCache.

Recipes

Deployment guides for running LMCache across different model architectures, serving engines, and environments.

Roadmap

Follow LMCache’s quarterly priorities and upcoming development milestones.

Contribution Guidelines

Fix bugs, improve docs, add model support, or help other users. Here’s how to get started.

Fresh from the Community

The latest benchmarks, release notes, and technical deep-dives from the LMCache team and contributors.

Tools

Calculators and observability tools to plan, deploy, and optimize your caching infrastructure.

Get Started

Dive In

Read the docs, install in minutes

Join the community

Slack, GitHub, Office Hours

Read the blog

Benchmarks, tutorials, release notes
