
Hands-On LLM Serving and Optimization: Hosting LLMs at Scale
Author(s): Chi Wang, Peiheng Hu
- Publisher: O'Reilly Media
- Publication Date: June 2, 2026
- Edition: 1st
- Language: English
- Print length: 371 pages
- ASIN: B0G48JRRMF
- ISBN-13: 9798341621497
Book Description
Large language models (LLMs) are the reasoning engines of modern AI. Today, a major inflection point has arrived: as the world races to deploy AI at scale, model inference has moved to the center of the stack. Welcome to the inference era.
Without proper optimization, however, LLMs can be expensive and slow to serve. Hands-On LLM Serving and Optimization is a comprehensive guide to the complexities of deploying and optimizing LLMs at scale.
In this hands-on, engineering-focused book, authors Chi Wang and Peiheng Hu combine practical examples, code, and strategies for building robust, performant, and cost-efficient AI token factories. Whether you're building the LLM inference infrastructure or the applications that consume it, a deep understanding of LLM serving will make you a more effective, future-ready engineer as AI transforms how we work and build.
- Learn the foundations of model serving with core concepts, design paradigms, and industry best practices
- Understand the common challenges of hosting LLMs at scale
- Balance latency and throughput to meet the demands of AI applications and business requirements
- Host LLMs cost-effectively with practical, code-backed techniques
Editorial Reviews
Review
— Caiming Xiong, Co-founder of Recursive AI Startup and ex-SVP of AI Research & Applied Research, Salesforce
"The missing manual for LLM serving and inference — comprehensive coverage of LLM serving challenges and optimization techniques such as scaling attention, multi-node inferencing, and disaggregation, with real-world examples. Essential reading for anyone scaling AI infrastructure."
— Winnie Kwon, Engineering Manager, Broadcom
"This book bridges the gap between LLM theory and production reality—from semantic routing to Multi-LoRA serving, it equips any ML engineer with the mental models needed to build and optimize real-world inference systems."
— Ming-Chia (Marcus) Tsai, Senior Principal Engineer, Saviynt
"This book delivers real-world insight into the model serving architectures and optimization techniques required to build scalable, efficient LLM inference systems. Its hands-on approach makes complex LLM serving concepts accessible for anyone."
— Patrice Castonguay, Engineering Leader in LLM Inference
About the Author
Peiheng Hu is an accomplished machine learning engineer with over 10 years of industry experience and expertise in building large-scale AI systems. He currently works at NVIDIA, where he focuses on cutting-edge distributed LLM inference, pushing the boundaries of high-performance inference engines on the latest NVIDIA GPUs. He holds a master of science in computational science and engineering from Harvard University and a bachelor of science in industrial engineering operations research from Georgia Institute of Technology. Previously, Peiheng served as a principal member of technical staff at Salesforce, where he led the development of the company's only unified serving platform, which handled thousands of per-tenant models, and delivered LLM optimizations for Agentforce that saved millions in AI infrastructure expenses. Prior to that, he was a senior ML engineer at Microsoft Azure, where he architected distributed ML processing solutions for cloud security detection and analytics, handling billions of transactions per hour.