LLMs are driving major advances in research and development today, and research objectives and methodologies have shifted markedly toward LLM-centric approaches. However, serving these models is expensive, which puts large-scale LLM deployment out of reach for many. Reducing inference latency is therefore a significant challenge, especially in dynamic applications that demand responsiveness.
The KV cache is central to autoregressive decoding in LLMs. During the prefill phase of inference, it stores the key and value activations produced by multi-head attention; during decoding, new KV pairs are appended for each generated token. By caching these intermediate activations instead of recomputing them, the per-step attention cost drops from quadratic to linear in sequence length. The trade-off is that the cache grows linearly with batch size, sequence length, and model size. Once it exceeds the GPU's memory capacity, it must be offloaded to the CPU, and transferring it back introduces bottlenecks that increase latency and reduce throughput.
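As a rough illustration of how quickly this memory grows, the sketch below estimates KV cache size from generic model dimensions; the parameter values are illustrative ballpark figures, not taken from the paper.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_heads, head_dim, dtype_bytes=2):
    """Estimate KV cache size: 2 tensors (K and V) per layer, per token, per head."""
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * dtype_bytes

# Illustrative numbers for a 7B-class model (32 layers, 32 heads, head_dim 128) in fp16
size = kv_cache_bytes(batch=32, seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # ~68.7 GB; with ~14 GB of fp16 weights this already exceeds an 80 GB A100
```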
When the cache is offloaded, the PCIe interface becomes the limiting factor, especially when transferring the cache from the CPU back to the GPU for computation. A slow PCIe link can inflate latency by an order of magnitude and leave the GPU idle for much of each step.
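A back-of-the-envelope comparison makes the imbalance concrete; the bandwidth figure below is a nominal effective value for PCIe 4.0 x16, not a measurement from the paper.

```python
# Rough comparison of KV cache transfer time vs. per-step compute time (illustrative values)
cache_gb = 68.7            # KV cache size from the estimate above
pcie_gb_per_s = 25.0       # effective PCIe 4.0 x16 bandwidth (~32 GB/s theoretical peak)
transfer_s = cache_gb / pcie_gb_per_s
print(f"Transfer: {transfer_s:.2f} s")  # ~2.7 s just to move the cache across PCIe

# A single decoding step's attention over the cached tokens typically takes only
# milliseconds on the GPU, so waiting on the transfer leaves the GPU mostly idle.
```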
Previous work has attempted to mitigate slow PCIe transfers, but these approaches often fall short because data-transfer time and GPU computation time are mismatched, particularly at large batch and context sizes. Other approaches lean on CPU compute, which then becomes the limiting factor itself. This article discusses a novel approach to PCIe and GPU optimization.
Researchers at the University of Southern California propose an efficient CPU-GPU I/O-aware LLM inference method that optimizes PCIe utilization. It leverages partial KV cache recomputation and asynchronous overlapping to address the system bottleneck of loading large KV caches. Instead of transferring the entire KV cache, the method transfers only a smaller segment of activations to the GPU, which reconstructs the corresponding portion of the cache from them. The key lies in computing attention scores over the full cache with minimal information loss.
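The core idea, recomputing part of the cache on the GPU from transferred activations, can be sketched as follows. This is a minimal, hypothetical sketch: the function names, shapes, and projection weights are illustrative assumptions, not the authors' implementation.

```python
import torch

def rebuild_partial_kv(hidden_states, w_k, w_v):
    """Recompute K/V for a token segment from its hidden activations on the GPU.

    hidden_states: (batch, split_len, d_model) activations transferred over PCIe
    w_k, w_v:      (d_model, d_kv) projection weights already resident on the GPU
    """
    k = hidden_states @ w_k   # (batch, split_len, d_kv)
    v = hidden_states @ w_v
    return k, v

# The remaining cached K/V entries are loaded from the CPU asynchronously and
# concatenated with the recomputed segment before attention is applied.
```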
The authors propose a fully automated method for determining the recomputation and communication split. The framework consists of three modules that together minimize GPU latency:
- Profiler Module: Collects system hardware information, such as PCIe bandwidth and GPU processing speed.
- Scheduler Module: Formulates the problem as a linear programming task to determine the optimal KV split point, using the hardware information and user configuration. The objective is to maximize the overlap between computation and communication (see the sketch after this list).
- Runtime Module: Coordinates data transfer between the two devices and manages memory allocations.
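To make the scheduler's objective concrete, the toy sketch below balances recomputation time against transfer time to pick a split point. The linear cost model and rate parameters are assumptions for illustration, not the paper's actual formulation; in the real system the rates would come from the Profiler Module's measurements.

```python
def choose_split(seq_len, recompute_tokens_per_s, transfer_tokens_per_s):
    """Pick how many tokens to recompute on the GPU so that recomputation time
    roughly matches the time to transfer the remaining cached tokens over PCIe,
    maximizing compute/communication overlap under a simple linear cost model."""
    best_split, best_finish = 0, float("inf")
    for split in range(seq_len + 1):
        t_recompute = split / recompute_tokens_per_s
        t_transfer = (seq_len - split) / transfer_tokens_per_s
        finish = max(t_recompute, t_transfer)  # the two run concurrently
        if finish < best_finish:
            best_split, best_finish = split, finish
    return best_split

# Example: the GPU recomputes much faster than PCIe can transfer,
# so most of the cache ends up being recomputed rather than shipped.
print(choose_split(seq_len=4096, recompute_tokens_per_s=200_000, transfer_tokens_per_s=50_000))
```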
The Scheduler Module, which is responsible for finding the optimal KV split, works in two ways:
- Row-by-Row Schedule: Reduces latency with a row-by-row execution plan in which the GPU begins reconstructing the KV cache while the remaining activations load asynchronously (a simplified sketch of this overlap follows below).
- Column-by-Column Schedule: Maximizes throughput and accommodates large-batch inference by reusing model weights across batches. It overlaps the transmission of the KV cache and activations with multi-head attention (MHA) computation across multiple batches, instead of processing each layer sequentially within a batch.

Building on this, the Runtime Module uses a six-process communication parallelism strategy to enable concurrent GPU computation and CPU-GPU communication.
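As a rough illustration of the row-by-row overlap, the following sketch uses PyTorch CUDA streams to load the cached segment asynchronously while the GPU recomputes the rest. The stream setup, tensor shapes, and function names are illustrative assumptions, not the authors' runtime implementation.

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for CPU->GPU transfers

def overlapped_kv_load(cached_kv_cpu, activations_gpu, w_k, w_v):
    """Overlap the PCIe transfer of the cached KV segment with on-GPU recomputation.

    cached_kv_cpu should live in pinned (page-locked) host memory so that the
    non_blocking copy is truly asynchronous.
    """
    # Kick off the asynchronous host-to-device copy on a separate CUDA stream.
    with torch.cuda.stream(copy_stream):
        cached_kv_gpu = cached_kv_cpu.to("cuda", non_blocking=True)

    # Meanwhile, the default stream recomputes K/V for the split segment.
    k_new = activations_gpu @ w_k
    v_new = activations_gpu @ w_v

    # Wait for the transfer to finish before stitching the two cache parts together.
    torch.cuda.current_stream().wait_stream(copy_stream)
    k = torch.cat([cached_kv_gpu[0], k_new], dim=1)
    v = torch.cat([cached_kv_gpu[1], v_new], dim=1)
    return k, v
```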
The authors tested the proposed framework for efficient LLM inference using an NVIDIA A100 GPU connected to a CPU via a PCIe 4.0 x16 interface. Experiments were conducted with two objectives to assess the framework’s performance:
- Latency-Oriented Workload: The proposed method outperformed baselines, reducing latency by 35.8%.
- Throughput-Oriented Workload: The method achieved up to a 29% improvement relative to the baseline.
Conclusion:
The CPU-GPU I/O-aware method reduces latency while increasing throughput in LLM inference. It leverages partial KV cache recomputation and overlaps it with data transmission to minimize GPU idle time and enhance efficiency.
Check out the Paper. All credit for this research goes to the researchers of this project.
Adeeba Alam Ansari is currently pursuing her Dual Degree at the Indian Institute of Technology (IIT) Kharagpur, earning a B.Tech in Industrial Engineering and an M.Tech in Financial Engineering. With a keen interest in machine learning and artificial intelligence, she is an avid reader and an inquisitive individual. Adeeba firmly believes in the power of technology to empower society and promote welfare through innovative solutions driven by empathy and a deep understanding of real-world challenges.