Distributed Systems¶
Distributed algorithms¶
Focusing on distributed algorithms such as consensus and replication, e.g., Raft.
Challenge: concurrency, synchronization, and communication complexity across independent nodes.
Solution: distributed algorithms that coordinate computation and data management across multiple independent computer systems.
Computing Framework¶
Solution: Developing distributed algorithms requires a clear understanding of the computing framework, which scales out small computing units to process data at large scale. Common computing frameworks include MapReduce, Spark, etc.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2004 | OSDI | Google | MapReduce: simplified data processing on large clusters | divide the data processing into map and reduce stages; use master-worker architecture | 4 | 5 | 5 |
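To make the map/reduce split above concrete, here is a minimal in-process sketch (the word-count example and function names are illustrative, not taken from any particular framework; a real deployment runs the map and reduce tasks on separate workers coordinated by a master):

```python
from collections import defaultdict

def map_phase(documents):
    """Map stage: each worker emits (key, value) pairs independently."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce stage: aggregate all values that share a key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))  # {'the': 3, 'fox': 2, ...}
```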
Domain Specific Computing Framework¶
Challenge: different application scenarios impose their own specific constraints and bounds.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | PPoPP | NUDT | GraphCube: Interconnection Hierarchy-aware Graph Processing | interconnection hierarchy-aware; topology-aware graph partitioning; extreme-scale graph processing | 4 | 5 | 5 |
Parallel Strategies¶
Solution: using the computation and memory resources of multiple processors to solve a problem.
Challenge: communication overhead and load balancing
Data Parallelism ¶
Solution: Data parallelism addresses scenarios where a single GPU can accommodate the model, but the dataset's size necessitates distribution across multiple GPUs for efficient processing and accelerated training.
Modern DNN acceleration systems commonly use the combination of data parallelism and model parallelism.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2012 | Nips | Google | Large Scale Distributed Deep Networks | data parallelism; multiple model replicas jointly optimize the same model on different data shards; distributed model training | 3 | 4 | 3 |
2014 | OSDI | CMU | Scaling Distributed Machine Learning with the Parameter Server | the foundation of data-parallel distributed training; parameter server; pull-based data transfer | 4 | 5 | 3 |
2020 | SC | Microsoft | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | fixes the problem that data parallelism alone cannot reduce per-GPU memory usage | 3 | 4 | 3 |
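A minimal sketch of the data-parallel idea behind these systems, assuming a toy linear-regression workload with NumPy arrays standing in for per-worker shards; real systems replace the gradient averaging with an all-reduce or a parameter server:

```python
import numpy as np

# Toy data-parallel training: every worker holds a full copy of the weights,
# computes gradients on its own data shard, and the gradients are averaged
# before an identical update is applied everywhere.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(128, 2))
y = X @ w_true

num_workers = 4
shards = np.array_split(np.arange(len(X)), num_workers)
w = np.zeros(2)
lr = 0.1

for step in range(50):
    grads = []
    for shard in shards:                      # each iteration = one worker's local compute
        Xs, ys = X[shard], y[shard]
        grads.append(2 * Xs.T @ (Xs @ w - ys) / len(shard))
    w -= lr * np.mean(grads, axis=0)          # "all-reduce": average gradients across workers

print(w)  # close to w_true
```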
Model Parallelism ¶
Solution: Model parallelism addresses scenarios where the model's size exceeds the processing and memory capacity of a single GPU. There are two types of model parallelism:
- Pipeline parallelism: divide the model into pipeline stages; each GPU processes one or more stages.
- Tensor parallelism: split individual tensors across different GPUs.
Usually, pipeline parallelism and tensor parallelism are used together.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | arXiv | NVIDIA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | transformer-based intra-layer (tensor) model parallelism; divide the model across GPUs | 3 | 4 | 3 |
2021 | SC | NVIDIA | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Megatron 2; dives deep into tensor parallelism; how to train an LLM on thousands of GPUs | 4 | 4 | 3 |
2022 | arXiv | NVIDIA | Reducing Activation Recomputation in Large Transformer Models | Megatron3; sequence parallel; selective activation recomputation; reduce the amount of recomputed activation | 3 | 4 | 3 |
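A minimal NumPy sketch of the tensor-parallel split used by Megatron-style systems; the two halves of the weight matrix stand in for two GPUs, and the comment contrasts it with pipeline parallelism, which splits whole layers rather than tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))        # activations, replicated on every "device"
W = rng.normal(size=(16, 32))       # full weight matrix of one linear layer

# Tensor parallelism: split W by output columns across two devices; each device
# computes its partial output, and the shards are concatenated (an all-gather).
W0, W1 = np.hsplit(W, 2)
y_tp = np.concatenate([x @ W0, x @ W1], axis=1)

# Pipeline parallelism, by contrast, splits *layers* across devices: device 0 runs
# layer 0, sends activations to device 1, which runs layer 1, and so on.
assert np.allclose(y_tp, x @ W)     # sharded result matches the single-device result
```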
LLM-specific Parallel Strategies¶
Focusing on the parallel strategies for LLM-specific deep learning systems.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2022 | ACL | NUS | Sequence Parallelism: Long Sequence Training from System Perspective | splits input sequences into chunks; Ring Self-Attention; sparse attention | 3 | 4 | 3 |
Cloud computing platforms and architectures¶
Challenge: providing services to users requires scalability, resource management, fault tolerance, and cost-effectiveness when building and deploying large-scale distributed applications and services.
Cloud Platform LLM Scheduling¶
Challenge: meeting SLOs when providing LLM services on a cloud platform.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Azure | TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms | thermal/power property characterization; dynamically adjusts in response to power or cooling failures; thermal- and power-aware manner |
Microservices¶
Focusing on microservices.
Memory Management¶
Challenge: coordinating memory access and maintaining data consistency across multiple independent nodes with their own local memories, especially when dealing with shared data.
Remote Memory¶
Challenge: efficiently providing access to memory on a remote node while minimizing latency and overhead, and ensuring consistency and reliability despite network communication complexities and potential failures.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2020 | TC | Georgia Tech | Hierarchical Orchestration of Disaggregated Memory | XMemPod architecture for hierarchical memory orchestration; compressed swap page table (CSPT) for metadata management; hybrid swap-out algorithm for memory utilization; proactive swap-in optimization for performance; RDMA-based remote memory sharing for low-latency access | |||
2025 | ATC | HUST | Fast Distributed Transactions for RDMA-based Disaggregated Memory | fast commit protocol by coalescing validation and commit phases; RDMA-enabled offloading for data synchronization; priority-based locking for mission-critical transactions | 2 | 3 | 4 |
Scratchpad Memory¶
Challenge: efficiently allocating and coordinating limited fast memory across distributed nodes to minimize access latency and contention, while ensuring data consistency and scalability.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | ASPLOS | Cornell | Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories | work-stealing based dynamic task parallelism; stack/task queue in SPM; read-only data duplication | 3 | 3 | 3 |
Memory Optimization for Graph Processing¶
Challenge: efficiently handling the huge memory requirements of graph processing.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | PPoPP | KAIST | INFINEL: An efficient GPU-based processing method for unpredictable large output graph queries | unpredictable large output queries; one-phase GPU graph processing; kernel stop/restart | 4 | 4 | 3 |
LLM Memory Management¶
Solution: efficient memory management reduces memory usage, enabling larger batch sizes and higher throughput.
Memory Management Algorithms¶
Solution: efficient memory management algorithms, such as virtual memory and page tables.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | SOSP | UCB | Efficient Memory Management for Large Language Model Serving with PagedAttention | Paged KV-Cache management; Better memory management for larger batch size; Preemptive memory scheduling | 4 | 5 | 3 |
2025 | ASPLOS | Microsoft | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | uses the GPU hardware page table instead of vLLM's software paging; hooks the CUDA driver to support page-table modification | 2 | 3 | 3 |
2025 | arXiv | SJTU | eLLM: Elastic Memory Management Framework for Efficient LLM Serving | paged activations and weights; virtual memory across all scenarios; CPU memory swapping | 3 | 2 | 2 |
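A toy sketch of paged KV-cache bookkeeping in the spirit of PagedAttention; the class and method names are illustrative, and a real system stores actual key/value tensors in the physical blocks and layers eviction/preemption policies on top:

```python
class PagedKVCache:
    """Toy block allocator: logical token positions map to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block free list
        self.block_tables = {}                       # request id -> list of physical blocks
        self.lengths = {}                            # request id -> tokens written

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        used = self.lengths.get(req_id, 0)
        if used % self.block_size == 0:              # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; evict or preempt a request")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = used + 1
        block = table[-1]
        return block, used % self.block_size         # physical (block, offset) for this token

    def release(self, req_id):
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
print([cache.append_token("req-A") for _ in range(3)])  # [(0, 0), (0, 1), (1, 0)]
cache.release("req-A")
```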
Tradeoff between compute and memory¶
Solution: Transformer workloads are largely compute-bound. To improve performance, recomputation can sometimes be used to trade extra compute for reduced memory usage.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2022 | NIPS | Stanford | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Generalized Acceleration of Attention Mechanisms; Change attention to utilize the SRAM on GPU; use recompute to reduce IO burden | 4 | 5 | 4 |
2023 | ICLR | Stanford | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | optimize the thread block parallelization of attention; parallel memory access; reduce non-matmul operations | 4 | 4 | 3 |
2024 | Nips | Stanford | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | Hopper architecture based optimization; fp8 quantization; backward support | 3 | 3 | 3 |
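A toy sketch of the recomputation (activation checkpointing) trade-off: keep activations only at segment boundaries and re-run the forward pass within a segment when the backward pass needs them. The layer function is a stand-in, not FlashAttention's kernel-level recomputation:

```python
import numpy as np

def layer(x):                      # stand-in for one transformer layer's forward pass
    return np.tanh(x)

def forward_checkpointed(x, layers, segment):
    """Store activations only at segment boundaries instead of after every layer."""
    saved = [x]
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % segment == 0:
            saved.append(x)
    return x, saved                # memory: O(num_layers / segment) activations

def recompute_segment(saved_input, layers_in_segment):
    """During backward, re-run the forward pass of a segment from its saved input."""
    acts = [saved_input]
    for f in layers_in_segment:
        acts.append(f(acts[-1]))
    return acts                    # extra compute in exchange for the memory saved

layers = [layer] * 8
out, saved = forward_checkpointed(np.ones((2, 4)), layers, segment=4)
print(len(saved), "checkpoints kept instead of", len(layers), "activations")
```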
General LLM Memory Management¶
Challenge: LLM memory management faces challenges like limited HBM memory, efficient KV Cache management, memory sharing between multiple GPUs, multi-level memory management.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2022 | SC | Microsoft | DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | kernel fusion; GPU-CPU-NVMe heterogeneous memory; PCIe-based memory prefetch | 4 | 4 | 3 |
2025 | arXiv | THU | Jenga: Effective Memory Management for Serving LLM with Heterogeneity | fixed-size embeddings; full-prefix dependency; two-level memory allocator | 4 | 4 | 3 |
2025 | FAST | THU | Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot | PD-disaggregate system; kv-cache centered; global kv-cache pool; dynamic SLO scheduler; paged KV-Cache storage | 3 | 4 | 2 |
Application specific memory management¶
Solution: Memory management is at the core of request scheduling. Application-specific memory management uses application information to manage memory.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UCSD | KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows | agent graph; prefetch KV Cache from CPU for next agent; agent-aware prefix cache management | 2 | 2 | 2 |
KV Cache Reuse Systems¶
Solution: reduce redundant computation and high memory consumption during inference by allowing the reuse of previously computed key-value pairs for shared or repeated parts of input sequences.
Prefix Sharing¶
Solution: reuse the KV Cache when input sequences have shared or repeated parts; use a prefix tree to manage the KV Cache.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | Nips | Stanford | SGLang: Efficient Execution of Structured Language Model Programs | KV-Cache share; python-like DSL; compute graph; LRU cache management strategy | 4 | 4 | 3 |
2024 | ACL | Microsoft | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | prefix aware attention compute; manage kv-cache chunks as prefix tree; reduce kv-cache redundancy | 3 | 4 | 2 |
2024 | arXiv | Microsoft | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | global prefix tree ahead-of-time; request reorder; horizontally fused prefix-shared attention kernel |||
2024 | arXiv | Berkeley | BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching | offline batch inference; resource-aware prefix tree; compute-intensive / memory-intensive requests | |||
2024 | arXiv | UChicago | DroidSpeak: Enhancing Cross-LLM Communication | selectively layer reuse; communication protocol for inter-agent exchanges; LLMs that share a common foundational model |
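A minimal sketch of the prefix-tree bookkeeping used for KV-cache reuse (illustrative only; systems such as SGLang build on this idea with a radix tree plus reference counting and LRU eviction):

```python
class PrefixNode:
    def __init__(self):
        self.children = {}     # token id -> PrefixNode
        self.kv_handle = None  # e.g., the KV-cache blocks for the prefix ending here

class PrefixTree:
    """Toy prefix tree: insert token sequences and look up the longest cached prefix."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens, kv_handle):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens):
        node, best_len, best_handle = self.root, 0, None
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.kv_handle is not None:
                best_len, best_handle = i + 1, node.kv_handle
        return best_len, best_handle   # reuse this many tokens' KV; recompute the rest

tree = PrefixTree()
tree.insert([1, 2, 3], kv_handle="blocks-for-system-prompt")
print(tree.longest_prefix([1, 2, 3, 9, 9]))  # (3, 'blocks-for-system-prompt')
```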
KV cache store¶
Solution: store the KV cache in the memory or other storage device, supporting multi-level storage.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | ATC | Huawei | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | store KV cache in the memory; multi level KV cache management; position mask modified | 3 | 3 | 3 |
2024 | SIGCOMM | UChicago | CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | efficient KV Cache streaming; KV Cache compression; knowledge delivery network; The transfer part of LMCache | 3 | 4 | 3 |
2024 | EuroSys | UChicago | CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | multiple precomputed text chunks; selective KV recompute; sparsity of attention matrices; The system intro of LMCache | 3 | 4 | 3 |
Other Techniques¶
Solution: KV cache reuse techniques beyond prefix sharing, since an exact shared prefix is a strong requirement and is not always available.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | Berkeley | Optimizing LLM Queries in Relational Workloads | prefix sharing maximization; KV cache hit rate; deduplication and cost estimation techniques |
KV Cache Storage Systems¶
Solution: efficiently storing and retrieving the key-value cache so it can be reused when needed.
Challenge: the prefetch and eviction of the KV cache, the balance between saving GPU memory and refetching time from the storage device.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | NVIDIA | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | block-sparse format; customizable attention template; dynamic load-balanced scheduling framework | |||
2025 | arXiv | PKU | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference | imbalanced KV cache compression mitigation; fair-copying for load balancing; best-effort assignment |
KV Cache Evict Systems¶
Challenge: selectively discarding the least important key-value pairs to free memory for longer contexts or larger batch sizes, without significantly degrading the model's generation quality or adding computational overhead for the eviction process itself.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NIPS | UT-Austin | H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | sparsity for small cache size; heavy-hitters; greedy algorithm for low-cost policy | |||
2024 | arXiv | Fujitsu | CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models | long measurement step; decay of the accumulated attention score; adjusting FIFO cache size |
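A toy score-based eviction policy in the spirit of heavy-hitter approaches such as H2O (a hedged sketch: keep the tokens with the largest accumulated attention mass plus a recent window; the scores and thresholds are illustrative, not from either paper):

```python
import numpy as np

def evict_kv(accumulated_scores, keep_heavy, keep_recent):
    """Return indices of tokens to keep: top-scoring 'heavy hitters' plus a recent window."""
    n = len(accumulated_scores)
    recent = set(range(max(0, n - keep_recent), n))
    older = [i for i in range(n) if i not in recent]
    # among older tokens, keep the ones with the largest accumulated attention scores
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:keep_heavy]
    return sorted(set(heavy) | recent)

scores = np.array([0.9, 0.1, 0.05, 0.7, 0.02, 0.03, 0.4, 0.1])
print(evict_kv(scores, keep_heavy=2, keep_recent=3))  # [0, 3, 5, 6, 7]
```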
Systems with Other Caches¶
Solution: use other caches (not just KV cache) to improve the performance of LLM inference.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | KAIST | Efficient LLM Inference with Activation Checkpointing and Hybrid Caching | activation checkpointing; KV-activation hybrid caching; balanced approach to determine the best ratio |
LLM Prefetching¶
Solution: prefetch data to hide memory transfers between devices and reduce memory access latency.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Huawei Zurich | PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving | computational graph-based prefetching; prefetch KV cache to L2 cache |
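A toy sketch of weight prefetching: a background thread fetches the next layer's weights while the current layer computes, so the transfer latency is hidden. Real systems use CUDA streams or DMA engines rather than Python threads; the function names and timings here are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_weights(layer_id):
    time.sleep(0.05)                  # stand-in for a CPU->GPU (or remote) transfer
    return f"weights[{layer_id}]"

def compute(layer_id, weights):
    time.sleep(0.05)                  # stand-in for running the layer
    return f"out[{layer_id}] using {weights}"

def run(num_layers):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_weights, 0)              # prefetch layer 0
        for i in range(num_layers):
            weights = pending.result()                       # wait only if transfer lags compute
            if i + 1 < num_layers:
                pending = pool.submit(fetch_weights, i + 1)  # overlap next transfer with compute
            compute(i, weights)

start = time.time()
run(num_layers=8)
print(f"elapsed ~{time.time() - start:.2f}s (vs ~0.80s if transfers were not overlapped)")
```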
Communication-Centric Optimization¶
Challenge: communication is a bottleneck in many distributed systems; these works aim to reduce it.
I/O Characterization and Optimization¶
Challenge: minimize data movement and maximize resource utilization across heterogeneous distributed environments.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2020 | ASPLOS | CMU | Livia: Data-Centric Computing Throughout the Memory Hierarchy | Memory service programming model; task graphs linked to data location; dynamic task/data scheduling for minimal movement | 2 | 4 | 3 |
2025 | arXiv | UOregon | Parallel I/O Characterization and Optimization on Large-Scale HPC Systems: A 360-Degree Survey | different HPC I/O stack layers; profiling and tracing tools; tuning techniques |
GPU-GPU Communication¶
Challenge: limited interconnect bandwidth between GPUs over NVLink or PCIe, synchronization delays in parallel workloads, and load imbalance across GPUs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Apple | SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models | sync-point drop; block-wise sensitivity analysis; attention output synchronization reduction | |||
2025 | arXiv | Microsoft | Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters | heterogeneous pipeline stages with flexible GPU counts and types; CPU offloading of both parameters and activations | 4 | 4 | 2 |
Many-Core Systems¶
Challenge: the heterogeneity of cores, the load imbalance, and the communication overhead.
Workload Characterization¶
Challenge: dynamic workloads across numerous cores, resource contention for shared hardware.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2015 | VLDB | Intel | GraphMat: High performance graph analytics made productive | vertex program to sparse matrix mapping; generalized SPMV for graph analytics; single-node multicore framework | 4 | 4 | 4 |
2018 | SC | Intel | Many-Core Graph Workload Analysis | multicore simulator sniper; selective caching and prefetching; heterogeneous high-performance low-power cores | |||
2018 | DATE | UGA | Parallel Code Generation of Synchronous Programs for a Many-core Architecture | banked memory mapping; worst-case response time analysis | |||
2025 | IPDPS | UChicago | Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems | lock-less and concurrent task queue xqueue; distributed tree barrier; NUMA-aware redirect push/work stealing |
Fault Propagation¶
Challenge: a fault in one core or component can easily spread to others due to shared resources, leading to system-wide reliability issues. Growing core counts make it hard to predict, detect, and contain errors effectively.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2008 | ASPLOS | UIUC | Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design | stuck-at fault; bridging fault; software failure detection | |||
2010 | PRDC | UBC | Modeling the Propagation of Intermittent Hardware Faults in Programs | instruction based intermittent fault; dynamic dependency graph (DDG) based propagation modeling |||
2015 | SC | IBM | Understanding the Propagation of Transient Errors in HPC Applications | fault propagation in MPI application; fault classification: V, ONA, WO, PEX, C; fault propagation speed factors |||
2023 | ISCA | UChicago | Understanding and Mitigating Hardware Failures in Deep Learning Training Accelerator Systems | NVDLA based fault injection framework; re-execution based light-weight recovery technique; failure effects: SlowDegrade, SharpSlowDegrade, SharpDegrade, LowTestAccuracy |
Fault Injection Technique¶
Challenge: It is difficult to target specific components, reproduce realistic fault scenarios, and observe system behavior without disturbing normal operation, especially as system scale and complexity increase.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2008 | VLSI | DISCA | Enhancement of Fault Injection Techniques Based on the Modification of VHDL Code | saboteurs and mutants technique based fault injection; VHDL level fault-tolerance mechanism | |||
2014 | DSN | UBC | Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults | fault injection quantification; assembly level fault injection; LLVM compiler based fault injector |
Communication¶
Challenge: efficiently managing data exchange between a large number of cores, due to limited bandwidth, high latency, and contention in shared resources like interconnects and memory.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UCLM | Understanding intra-node communication in HPC systems and Datacenters | intra- and inter-node simulation model; intra-node network interface bottleneck; impacts of communication pattern |
Heterogeneous Systems¶
Heterogeneous systems are systems that contain different types of processors, such as CPUs and GPUs.
Solution: utilize the heterogeneous resources to improve performance.
General Applications¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2013 | SOSP | MSR Silicon Valley | Dandelion: a Compiler and Runtime for Heterogeneous Systems | unified programming model; “single machine” abstraction; a rich object-oriented programming language for data-parallel computing | |||
2025 | EuroSys | SJTU | Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing | Bubble-less spatial-temporal sharing; kernel squad scheduling; fine-grained concurrent kernel management | 4 | 3 | 2 |
2025 | ISPASS | CMU | Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures | Effective regions for balanced utilization of PUs; Proximity-based kernel fusion recommendation; operator-kernel dependency graphs from PyTorch Profiler traces | 3 | 4 | 2 |
Decentralized Serving¶
Challenge: managing diverse hardware and software environments, balancing workloads across uneven resources, minimizing communication overhead, ensuring consistency without centralized control.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | ASPLOS | USC | Hop: Heterogeneity-aware Decentralized Training | iteration gap; queue-based synchronization; backup workers and bounded staleness | |||
2020 | ASPLOS | USC | Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training | Partial All-Reduce to reduce synchronization cost; group scheduling to avoid conflicts | |||
2025 | arXiv | Berkeley | DeServe: Towards Affordable Offline LLM Inference via Decentralization | decentralized LLM inference; high-latency optimization; idle GPU utilization; modular on-chain integration | |||
2025 | arXiv | HKUST | DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization | partial synchronization based local SGD; DFS algorithm with pruned search space; enables the opportunity of overlapping communication and computation |
ML Training Systems¶
Solution: balance between faster training and high precision.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | SOSP | CMU | Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling | heterogeneity-aware and adaptivity-aware; ILP formulation for scheduling; bootstrapped from observing just a few mini-batches |
LLM Inference Heterogeneous Systems ¶
Solution: managing diverse hardware and software environments, balancing workloads across uneven resources, meeting the SLO.
Mobile & Edge-Network Serving¶
Challenge: limited computation, memory, power coupled with intermittent and unreliable network connectivity, making it difficult to perform computationally intensive training tasks, manage large datasets, and ensure efficient communication and synchronization across distributed edge nodes.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | UIC | Priority-Aware Model-Distributed Inference at Edge Networks | priority-aware model distributed inference algorithm; prioritization of ML inference tasks; model-distributed inferencing mechanism | |||
2024 | arXiv | Yonsei | Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models | hybrid language model; selectively skip uplink transmissions; uncertainty-aware | |||
2024 | arXiv | UMD | Distributed Mixture-of-Agents for Edge Inference with Large Language Models | Mixture-of-Agents; semantics of the data being gossiped and its timeliness; queuing stability | |||
2025 | arXiv | PKU | SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network | hierarchical split learning; edge-cloud collaboration; LoRA adapter update | |||
2025 | arXiv | SJTU | HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators | both layer-level and tensor-level GPU-NPU parallelism; different tensor partition strategies; fast synchronization mechanism based on predictable kernel waiting times; tensor partition solver |
GPU-GPU Heterogeneous System¶
Solution: the system is composed of heterogeneous GPUs, without running inference on the CPU. The system needs to manage communication and memory across the heterogeneous GPUs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | CMU | Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | LLM model placement as a max-flow problem; per-request pipeline; mixed integer linear programming | |||
2025 | ICLR | HKUST | HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment | a combination of graph partitioning and max-flow algorithm; TP and PP with disaggregation; bottleneck and underutilized edges; swap edges |
XPU-GPU Heterogeneous System¶
Challenge: effectively managing and coordinating diverse hardware (CPUs, TPUs, etc.), interconnects, and memory hierarchies
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | ICML | Stanford | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | dynamic offload tensor; quantize the weights to 4-bits; linear aggregation of the store and load operations | 4 | 4 | 3 |
2025 | arXiv | CMU | Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures | SKIP profiling tool; TKLQT metric for CPU/GPU boundedness; proximity score kernel fusion | 2 | 3 | 2 |
2025 | SPAA | Huawei | WindVE: Collaborative CPU-NPU Vector Embedding | seamless CPU-NPU collaboration for vector embedding; linear regression based estimator; high-throughput offloading vector embedding | 2 | 4 | 3 |
2025 | arXiv | Huawei | High-Throughput LLM inference on Heterogeneous Clusters | lightweight profiling while avoiding resource-intensive throughput benchmarks; a scheduler that accounts for both instance computational capacity and memory usage; exhaustive search method | 2 | 4 | 2 |
2025 | ISCA | KAIST | EOD: Enabling Low Latency GNN Inference via Near-Memory Concatenate Aggregation | concatenated ZVC compression; precomputation for neighborhood explosion problem | 2 | 3 | 2 |
Heterogeneous Device Task Scheduling¶
Solution: assigning different parts of the LLM serving workload to the most suitable heterogeneous devices to maximize throughput and minimize latency.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | PACT | Yonsei | Virtual PIM: Resource-aware Dynamic DPU Allocation and Workload Scheduling Framework for Multi-DPU PIM Architecture | dynamic DPU allocation for multitasking; fine-grained scheduling | 3 | 2 | 2 |
2025 | arXiv | NUS | Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems | lightweight and input-aware framework; multiobjective and multi-constraint design space; dynamically creating optimal schedules | |||
2025 | HPCA | Samsung | PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM | task scheduling algorithm across host and PIM; interleave-batched GEMM; data layout adjustment | 2 | 3 | 3 |
Task Scheduling for specific tasks¶
Solution: in specific scenarios the scheduling goal differs; assigning tasks to different devices can close the gap between device characteristics and task characteristics.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | HPCA | Princeton | Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications | distributed data-local tiled architecture; task-based programming for pointer indirection; traffic-aware task scheduling with headerless NoC | 3 | 3 | 3 |
2025 | arXiv | Georgia Tech | HARP: A Taxonomy for Heterogeneous and Hierarchical Processors for Mixed-reuse Workloads | a taxonomy to classify the heterogeneous and hierarchical accelerators; characterize hardware organization of different accelerators; classify based on relative location of sub-accelerators | |||
2025 | arXiv | PKU | Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC | agent application-specific scheduling on heterogeneous SoC; heterogeneous execution graph with elastic kernels; bandwidth-aware dispatch for NPU-iGPU contention mitigation | 3 | 2 | 3 |
LLM Training Heterogeneous Systems¶
Solution: compared to LLM inference heterogeneous systems, these additionally need to handle the backward pass and heterogeneity issues.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | PKU | Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences | data sampling imbalance; data packing imbalance; subgraph abstraction | |||
2024 | arXiv | Ant Group | EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models | Local Stochastic Gradient Descent (Local SGD); consistent stragglers within heterogeneous devices; hierarchical distribution strategy on a two-dimensional device mesh; layer by layer forward syncing; pseudo-gradient penalty method | |||
2024 | arXiv | ZJU | Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters | efficient and low-overhead task-to-cluster scheduling; bin-packing algorithms; seamless and user-friendly | |||
2025 | arXiv | OSU | Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning | low-bandwidth interconnects; three-level hierarchical partitioning strategy; improved hierarchical partitioning on top of ZeRO++ | |||
2025 | arXiv | PKU | Split Fine-Tuning for Large Language Models in Wireless Networks | split fine-tuning; device and server partition; novel compression scheme and resource management algorithm | |||
2025 | arXiv | Neuchatel | SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks | partial pipeline parallelism; stage skipping; path scheduling algorithm |
Schedule Optimization¶
Solution: develop task scheduling algorithms to achieve efficient overall system performance despite incomplete and evolving system state information.
General Task Scheduling¶
Solution: optimizing the allocation and execution of diverse and dynamic workloads.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | NSDI | MIT | Shinjuku: Preemptive Scheduling for µsecond-scale Tail Latency | preemptive scheduling; single-address space OS; hardware-supported virtualization | |||
2021 | SOSP | UPenn | When Idling is Ideal: Optimizing Tail-Latency for Heavy-Tailed Datacenter Workloads with Perséphone | reserve cores; non-conserving; request dispatching algorithm | |||
2017 | HPCA | UGent | Reliability-Aware Scheduling on Heterogeneous Multicore Processors | core reliability characteristics difference; system soft error rate; sampling-based reliability-aware scheduling algorithm | |||
2020 | TCAD | ASU | Runtime Task Scheduling Using Imitation Learning for Heterogeneous Many-Core Systems | offline Oracle optimization strategy; hierarchical imitation learning based scheduling; two-level scheduling |
Speculative Execution (Non-LLM) ¶
Solution: balancing the potential performance gains from speculative executions, including accurately predicting outcomes, handling incorrect speculations and their side effects across multiple nodes.
Refer to LLM Speculative Inference for the speculative execution algorithms for LLMs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | MSR | Forerunner: Constraint-based Speculative Transaction Execution for Ethereum | constraint-based speculative transaction execution; many-future nature; specialized fast-path program | |||
2024 | arXiv | Politecnico di Milano | Minimizing speculation overhead in a parallel recognizer for regular texts | speculation overhead; chunk automaton; reduced-interface DFA |
LLM-Related Scheduling ¶
Challenge: efficiently managing the immense computational and memory demands of training and inference across numerous interconnected devices, requiring sophisticated strategies to partition massive models.
LLM Request Scheduling¶
Solution: develop intelligent strategies to route requests, prioritize urgent or critical tasks, handle varying input lengths and complexities, and manage resource contention to meet SLO requirements.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | UCSB | Multi-Bin Batching for Increasing LLM Inference Throughput | binning-based scheduling strategy; queueing-theoretical analysis; asymptotical throughput optimality | |||
2024 | arXiv | Yale | TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications | segmented generation; time-sensitive scheduling; latency-guided batch size selection | |||
2025 | arXiv | MSRI | Niyama : Breaking the Silos of LLM Inference Serving | QoS-driven LLM inference serving system; co-scheduling requests with diverse QoS targets on a shared rather than siloed infrastructure; allows graceful service degradation during overload conditions; deadline slack; a hybrid prioritization and an eager relegation policy | 4 | 4 | 3 |
2025 | arXiv | MIT | Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints | fluid dynamics approximation; Waiting for Accumulated Inference Threshold; a hierarchical framework comprising multiple segments | 3 | 4 | 2 |
2025 | arXiv | PKU | SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference | service-aware and latency-optimized scheduling algorithm; doubling budget (DB) scheduling algorithm; search-based placement algorithm | 3 | 4 | 2 |
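A toy illustration of SLO-aware request admission: order waiting requests by deadline slack and admit them until a KV-cache budget is exhausted. This is a generic sketch, not the algorithm of any paper above; the request tuples and budget unit are assumptions:

```python
def schedule_batch(waiting, now, kv_budget):
    """Pick requests for the next batch by smallest deadline slack, subject to a KV budget.
    Each request is (req_id, deadline, est_kv_blocks); purely illustrative."""
    by_slack = sorted(waiting, key=lambda r: r[1] - now)   # least slack first
    batch, used = [], 0
    for req_id, deadline, kv_blocks in by_slack:
        if used + kv_blocks <= kv_budget:
            batch.append(req_id)
            used += kv_blocks
    return batch

waiting = [("chat-1", 105, 8), ("batch-7", 500, 32), ("chat-2", 102, 6), ("agent-3", 120, 20)]
print(schedule_batch(waiting, now=100, kv_budget=40))  # tight-SLO requests admitted first
```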
Info Predict Scheduling¶
Challenge: general scheduling aims at better batching and meeting SLO requirements. By predicting information about requests (e.g., output length), scheduling can be made more efficient.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | Nips | Harvard | S3: Increasing GPU Utilization during Generative Inference for Higher Throughput | predict LLM request output lengths into fixed length buckets; Orca-based dynamic batching | 3 | 2 | 3 |
2024 | ASPLOS | UIUC | Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | length prediction; left time prediction; bert-based proxy model | 4 | 3 | 2 |
LLM Application-Level Scheduling¶
Solution: optimize the end-to-end latency of the application, including the scheduling of LLM instances.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | SJTU | Parrot: Efficient Serving of LLM-based Applications with Semantic Variable | Semantic Variable; application-level information; LLM applications as first-class citizens | |||
2024 | OSDI | CUHK | Teola: Towards End-to-End Optimization of LLM-based Applications | mismatch between request-level scheduling and end-to-end application performance; primitive-level dataflow graph; two-tier scheduling mechanism | |||
2024 | arXiv | Yext | SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering | constantly changing and sometimes adverse conditions; Dynamically Reconfigurable Horizontal Scaling Framework; dynamically adjust resource allocation based on query requirements | |||
2025 | arXiv | Berkeley | Autellix: An Efficient Serving Engine for LLM Agents as General Programs | formalize agentic programs as dynamic, non-deterministic DAGs; non-clairvoyant scheduler; simple load-balancing policy to balance data locality and KV-cache recomputation | |||
2025 | ICDCS | SJTU | LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications | a DAG with regular stage, LLM stage, dynamic stage; bayesian network-based profiler; identify uncertainty-reducing stages | 4 | 4 | 3 |
2025 | arXiv | SJTU | Efficient Serving of LLM Applications with Probabilistic Demand Modeling | DAG-based scheduling; dynamic execution; CPU executor warmup | 3 | 1 | 1 |
LLM Speculative Inference ¶
Refer to non-LLM speculative execution.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | F&M College | AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration | simultaneous and independent predictions; asynchronous speculative decoding; rollback mechanism | |||
2024 | arXiv | Purdue | Constrained Decoding with Speculative Lookaheads | computational expense of generating lookaheads; speculated lookaheads; task specific reward function | |||
2024 | arXiv | Rutgers | Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface | active user intervention; speculative planning algorithm; UI-level rescheduling algorithm | |||
2024 | arXiv | USTC | Parallel Speculative Decoding with Adaptive Draft Length | adaptive draft length; pre-verify and post-verify; draft-then-verify framework; mutual waiting problem | |||
2024 | arXiv | SEU | SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding | reasoning tree construction; parallel drafting with speculative decoding; FCFS queue verification |
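A minimal draft-then-verify loop illustrating speculative decoding. The two toy "models" are placeholder functions; a real implementation verifies all draft tokens in a single batched target forward pass and uses rejection sampling to preserve the target distribution:

```python
def draft_model(tokens, k):
    """Cheap drafter: propose k tokens (a toy rule standing in for a small LLM)."""
    out = list(tokens)
    for _ in range(k):
        out.append((out[-1] + 1) % 50)
    return out[len(tokens):]

def target_next(tokens):
    """Expensive target model's greedy next token (toy rule; occasionally disagrees)."""
    nxt = (tokens[-1] + 1) % 50
    return 0 if len(tokens) % 7 == 0 else nxt

def speculative_decode(prompt, steps, k=4):
    tokens = list(prompt)
    for _ in range(steps):
        draft = draft_model(tokens, k)
        accepted = []
        for d in draft:
            t = target_next(tokens + accepted)      # verify the next draft token against the target
            if t == d:
                accepted.append(d)                  # match: accept and keep going
            else:
                accepted.append(t)                  # mismatch: take the target's token, stop
                break
        tokens += accepted                          # append verified tokens (plus a correction on mismatch)
    return tokens

print(speculative_decode([3, 4], steps=3))
```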
Spec + Others¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Huawei | Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling | speculative MoE; speculative token shuffling; speculative expert pre-grouping | |||
2025 | INFOCOM | UoA | SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models | internal neurons sparsification; model-agnostic acceleration framework; dynamic early-exit thresholds; multi-layered feature fusion | |||
2025 | arXiv | SUST | FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference | speculative decoding on memory-limited devices; efficient draft management with tree pruning and early stop to reduce redundancy and maintain causal relationships | 3 | 3 | 3 |
LLM Serving Outages and Incidents¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Vrije Universiteit Amsterdam | An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models | empirical characterization of outages; failure recovery optimization; public LLM service reliability |
Energy-Optimized LLM Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UvA | GREEN-CODE: Optimizing Energy Efficiency in Large Language Models for Code Generation | dynamic early exit; energy-aware code generation; reinforcement learning for llms |
Multi-LLM Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UCLA | Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving | Long-tail model popularity; Frequent idle periods; Rapid workload fluctuations | 3 | 4 | 2 |
DNN Scheduling¶
Solution: optimizing data parallelism and model parallelism while minimizing communication overhead between nodes, effectively managing limited GPU memory and other resources to achieve scalability and high throughput.
Refer to LLM-Related Scheduling for the LLM-related scheduling algorithms.
Task Offloading¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | USTC | Collaborative Inference for Large Models with Task Offloading and Early Exiting | early exit mechanism; jointly optimize its offloading strategy and the confidence threshold; distributed task offloading algorithm | |||
2025 | ISCA | ETHZ | OptiPIM: Optimizing Processing-in-Memory Acceleration Using Integer Linear Programming | integer linear programming for offload optimization; PIM-friendly mapping representation; accurate cost modeling for data layout | 4 | 2 | 3 |
General optimizations for Deep Learning Systems¶
Solution: general optimizations for deep learning systems.
If the paper is focusing on an above-mentioned specific scene (e.g., memory, scheduling, IO, etc.), it will be put in the corresponding section.
LLM Training Systems¶
Solution: arrange model parameters and data across multiple devices, reduce the time spent communicating, scale up smoothly as models and data keep growing—all while staying efficient and speeding up training.
General Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | THU | Enhancing Memory Efficiency in Large Language Model Training Through Chronos-aware Pipeline Parallelism | chronos-aware pipeline parallelism; temporal locality optimization; activation balancing | |||
2025 | arXiv | NUS | PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization | selective offload strategy; memory offload optimization; pipeline parallelism scalability; lifespan-based offloading | |||
2025 | arXiv | UCSD | WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training | workload-aware variable-length document packing; per-document sharding strategy; adaptive sharding selection mechanism; delay execution of extremely long documents | 4 | 5 | 2 |
2025 | EuroSys | UToronto | Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization | fine-grained overlap-centric scheduling; symbolic-based performance analysis; imbalance-aware hierarchical tuning | 4 | 4 | 2 |
Optimizations on Special Scene¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HKU | Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism | Fully Sharded Sparse Data Parallelism (FSSDP); sparsely materializes MoE parameters; two sparse collective communications | |||
2025 | arXiv | SJTU | PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline | dynamic interleaved pipeline; hierarchical schedule space for rapid pipeline schedule search; spatial-temporal subgraph reuse | 3 | 4 | 2 |
Experiments¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | JSC | Memory and Bandwidth are All You Need for Fully Sharded Data Parallel | an extensive analysis of the FSDP training distribution strategy; a grid search methodology; both simulation and empirical results | 2 | 4 | 1 |
Multi-Modal Optimizations¶
Challenge: multimodal data is more complex, and training on it requires more resources.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training | multimodal mini-batch imbalance; batch post-balancing algorithm; node-wise all-to-all communicator for practical rearrangement of mini-batches | 4 | 4 | 3 |
2025 | arXiv | ICT | ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism | unified prefix cache fusing vision and text tokens; modality-aware load balancer for bursty vision traffic | 2 | 3 | 2 |
Kernel-Level Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HUST | CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures | model segment profile-based cost model; communication-free tensor partition propagation property; extracting a set of unique model segments; Communication-Free Preserve | 4 | 5 | 3 |
LLM Inference Systems¶
Focusing on the optimizations for LLM inference systems.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | ISCA | DeepSeek | Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | software-hardware co-design for deepseek-v3; insight into hardware for ai architectures | 5 | 5 | 4 |
2024 | MLSys | SJTU | FlashDecoding++: Faster Large Language Model Inference on GPUs | asynchronized softmax with unified max value; flat GEMM optimization with double buffering; heuristic dataflow with hardware resource adaptation | 4 | 4 | 3 |
SLO-Aware Systems¶
Challenge: providing service for users to meet specific latency requirements with limited resources.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Berkeley | AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding | fine-grained speculative decoding; token tree verification; slo customization | |||
2025 | arXiv | UIUC | HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location | online-offline request co-location; interference-aware profiler; latency predictor; adaptive scheduler | |||
2025 | arXiv | PKU | Memory Offloading for Large Language Model Inference with Latency SLO Guarantees | effectively captures the tension between meeting SLOs and maximizing host memory usage; dynamic offloading interval; per-bus coordinator | |||
2025 | arXiv | Huawei | Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization | hybrid offline-online scheduling; preemptive scheduling for hardware utilization; lagrangian method for cost efficiency evaluation | |||
2025 | ASPLOS | BUAA | Past-Future Scheduler for LLM Serving under SLA Guarantees | LightLLM; predicts future system memory usage; reduces evictions via better request scheduling | 3 | 2 | 3 |
Surveys¶
System Optimization Surveys¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | NEU | LLM Inference Serving: Survey of Recent Advances and Opportunities | KV cache and memory management; LLM computation optimization; Cloud LLM deployment; focus on system-level enhancements | |||
2024 | arXiv | CUHK | A Survey on Inference Optimization Techniques for Mixture of Experts Models | model compression; expert skip; expert merge; sparse to dense; expert parallel; expert offloading | |||
2024 | arXiv | PolyU | A Survey on Large Language Model Acceleration based on KV Cache Management | cache selection; budget allocation; cache merging; cache quantization; cache low-rank decomposition; attention grouping and sharing; memory management; hardware-aware design | |||
2025 | arXiv | THU | Beyond A Single AI Cluster: A Survey of Decentralized LLM Training | resource-driven paradigm; community-driven decentralization; organizational decentralization; decentralized LLM training taxonomy | |||
2025 | arXiv | FIU | Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions | distributed solutions for LMs; workload imbalance in LLM training; M-ICL; model security enhancement |
Application Surveys¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | PKU | Retrieval-Augmented Generation for AI-Generated Content: A Survey | Query Transformation; Data Augmentation; Recursive Retrieval; Chunk Optimization; Retriever Finetuning; Hybrid Retrieval; Re-ranking; Retrieval Transformation; Prompt Engineering; Decoding Tuning; Generator Finetuning; Output Rewrite; Adaptive Retrieval; Iterative RAG | |||
2024 | arXiv | WHU | A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges | personalized characteristics; perceive environmental information; utilize memory mechanisms; mutual interaction; agent self-reflection | |||
2024 | arXiv | PolyU | Deploying Foundation Model Powered Agent Services: A Survey | FM-powered agent services within the edge-cloud environment; low-level hardware perspective; high-level software perspective |
Multimodal Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UW–Madison | LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | query-block distributed exchange; shared visual token recomputation; sequence-parallelism with minimal communication overhead | |||
2025 | arXiv | Microsoft | Towards Efficient Large Multimodal Model Serving | fine-grained stage-aware resource management; multimodal workload-specific scheduling; model architecture-specific optimizations | |||
2025 | arXiv | Huawei | Efficiently Serving Large Multimedia Models Using EPD Disaggregation | encode-prefill-decode disaggregation; multimodal cache; intra-request parallel | |||
2025 | arXiv | TU/e | Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach | Multimodal Parallel Split Learning; computation-efficient training; server-side loss aggregation mechanism | |||
2025 | arXiv | HUST | FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework | resource-aware KV-cache memory pool; multimodal KV-cache compression; modality-specific compression |
Mixture-of-Experts LLM Systems¶
Challenge: efficiently coordinating and scaling expert models across multiple nodes, leading to issues like uneven workload distribution, high communication overhead, and difficulty in fault tolerance.
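A toy top-k router showing where the workload-imbalance problem comes from: each token picks its highest-scoring experts, and the resulting per-expert token counts are what placement, offloading, and scheduling policies try to balance (the names and shapes are illustrative, not from any specific system):

```python
import numpy as np

def route_tokens(token_states, gate_weights, top_k=2):
    """Toy MoE router: each token picks its top-k experts by gating score.
    Returns per-token expert assignments and the per-expert token counts that
    drive placement/offloading decisions."""
    logits = token_states @ gate_weights                 # (num_tokens, num_experts)
    top_experts = np.argsort(-logits, axis=1)[:, :top_k]
    num_experts = gate_weights.shape[1]
    load = np.bincount(top_experts.ravel(), minlength=num_experts)
    return top_experts, load

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))       # 16 tokens, hidden size 32
gates = rng.normal(size=(32, 8))         # 8 experts
assignments, load = route_tokens(tokens, gates)
print("per-expert load:", load)          # uneven load motivates placement/offloading policies
```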
Expert Offloading and Placement¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | DATE | Berkeley | DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference | data-aware offloading; predictive pre-calculation; sequence-specific expert allocation | |||
2025 | arXiv | Stevens Tech | fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving | expert map; iteration-level probability distributions; track fine-grained input semantic embeddings; semantic-based and trajectory-based |||
2025 | arXiv | Georgia Tech | MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing | ILP for expert placement; cross-layer dependencies; minimizing total dispatched token number | |||
2025 | EuroMLSys | EPFL | Accelerating MoE Model Inference with Expert Sharding | expert sharding for load balancing; tensor sharding for moe experts; fused expert computations for reduced kernel launches | |||
2025 | DAC | PKU | HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference | dynamically balances workloads across GPUs and CPUs; impact-driven prefetching; MoE-specialized cache management | 3 | 4 | 2 |
Batching and Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Alibaba | Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference | statically batching irregular workloads; batch-task-tile partition; decompress the mapping and dispatch the workload | |||
2025 | arXiv | Edinburgh | MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching | module-based batching; high-throughput MoE inference; full KV-cache offloading | |||
2025 | arXiv | KTH | Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference | fine-grained preemption; priority-aware scheduling; per-expert queues; expert-level preemption | |||
2025 | arXiv | UMich | MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints | two-stage performance modeling; analyzes the theoretical performance upper bound; captures how system execution mechanisms | 4 | 4 | 2 |
2025 | arXiv | NVIDIA | MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core | decouples parallelization strategies for attention and MoE layers; flexible and efficient token-level dispatcher; 5-D hybrid parallelism | 4 | 5 | 2 |
Memory and Communication Efficiency¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | fine-grained communication-computation overlapping for efficient MoE execution; dependency resolving method; adaptive workload assignment method; shared data buffers between communication and computation operations | |||
2025 | arXiv | UVA | eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference | expert prediction; task-aware expert loading; task-aware request scheduling | |||
2025 | MobiCom | HKUST | D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving | dually sparsely-gated Mixture-of-Experts; token-adaptive bit-width selection; matryoshka weight quantization; bit-width-aware I/O-compute pipeline | 3 | 4 | 4 |
2025 | OSDI | SJTU | Fast and Live Model Auto Scaling with O(1) Host Caching | auto-scaling with minimal caching; optimize parameter loading; enabling fine-grained layer-level scaling | 3 | 3 | 2 |
2023 | ASPLOS | Google | TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators | hybrid heuristic-solver memory allocator for ML accelerators; contention-aware phased allocation strategy | 4 | 4 | 3 |
Architectural Innovations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Shanghai AI | Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts | linear sequence modeling with MoE; sparse activation via moe layers; hybrid models combining linear-moe and transformer-moe layers | |||
2025 | arXiv | Berkeley | HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs | zebra parallelism; attention-expert disaggregation; asymmetric expert assignment mechanism; gather and squeeze strategy | 4 | 5 | 3 |
Compute-Kernel-Level Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | SJTU | Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | dual-side structured sparsity; sparse-sparse matrix multiplication kernel; vector-wise + 2:4 hybrid sparsity; token-aware activation compression |
Long Sequence LLM Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | SJTU & Alibaba | Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache | inefficient model parallelism intra-instance; inefficient resource management inter-instance; KV cache scheduling | |||
2025 | arXiv | PKU | ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs | hybrid data parallelism; data-aware sharding; a heuristic algorithm that reorganizes data assignment based on the characteristics of data and pipeline parallelism | |||
2025 | ICML | ByteDance | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | offload value cache to CPU and keep outliers on GPU; landmark-guided sparse KV selection per chunk | 3 | 3 | 3 |
Sparse Attention¶
Solution: handling long prompts with full attention introduces high latency; sparse attention reduces the computation and memory burden by computing only a subset of the attention matrix rather than all of its entries.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | CWRU | Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques | sparse attention with graph computing perspective; work-optimal graph algorithms; achieve true sparsity | |||
2025 | MLSys | MIT | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | unified sparse attention; hybrid static and dynamic sparsity; hierarchical kv cache management with query-centric pruning |
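A NumPy sketch of one simple sparsity pattern (a local causal window plus a few always-visible global tokens); it is meant only to illustrate computing a subset of the attention matrix, not any specific paper's kernel:

```python
import numpy as np

def sparse_attention(q, k, v, window=2, num_global=1):
    """Causal attention where each query attends only to a local window plus a few
    leading 'global' tokens, instead of the full causal attention matrix."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((n, n), -np.inf)
    for i in range(n):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = 0.0                  # local causal window
        mask[i, :min(num_global, i + 1)] = 0.0   # always-visible global tokens
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(6, 8))
print(sparse_attention(q, k, v).shape)  # (6, 8)
```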
Ring Computation¶
Solution: use the device layout to reduce communication overhead. The key idea is to overlap computation with communication.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | Nips | UCB | Ring Attention with Blockwise Transformers for Near-Infinite Context | divide the input into blocks and each block is processed by a single GPU; ring-type device layout | 4 | 3 | 3 |
2024 | arXiv | SJTU | TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication | communication-oriented parallelism framework; inter-node P2P bidirectional communication bandwidth; optimization of attention block communication |
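A simplified simulation of the ring idea: each "device" keeps its query block while key/value blocks rotate around the ring. For clarity this version gathers all score chunks before the softmax; real ring attention uses an online softmax so full score rows are never materialized, and the rotation overlaps with compute:

```python
import numpy as np

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Each 'device' owns one query block; KV blocks rotate around the ring so every
    device eventually sees all keys and values."""
    world = len(q_blocks)
    outputs = []
    for rank, q in enumerate(q_blocks):
        score_chunks, value_chunks = [], []
        for step in range(world):
            peer = (rank + step) % world            # the KV block that "arrives" at this step
            k, v = k_blocks[peer], v_blocks[peer]
            score_chunks.append(q @ k.T / np.sqrt(q.shape[-1]))
            value_chunks.append(v)
        scores = np.concatenate(score_chunks, axis=1)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        outputs.append(weights @ np.concatenate(value_chunks, axis=0))
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
blocks = np.split(x, 4)                             # 4 "devices", each with a block of the sequence
print(ring_attention_sim(blocks, blocks, blocks).shape)  # (8, 4)
```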
P-D Disaggregated Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | PKU | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | goodput-optimized; prefill-decoding interference; novel placement algorithm for the p-d schema |||
2024 | ISCA | UW | Splitwise: Efficient Generative LLM Inference Using Phase Splitting | optimized cache context transfer; performance per dollar; performance per watt; exploration of homogeneous and heterogeneous cluster deployments | |||
2024 | arXiv | CMU | A System for Microserving of LLMs | fine-grained sub-request level actions; dynamic reconfiguration according to workloads; unified KV cache abstraction | |||
2025 | arXiv | PKU | ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments | two-level hierarchical optimization; tabu search algorithm for GPU partition; a lightweight re-scheduling mechanism |
P-D Disaggregated System Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | KVDirect: Distributed Disaggregated LLM Inference | tensor-centric communication mechanism; pull-based KV cache transfer; dynamic GPU resource scheduling via RDMA | |||
2025 | arXiv | SYSU | Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation | attention disaggregation and offloading mechanism; low-latency decoding synchronization; resource-efficient prefill colocation; load-aware offloading scheduling | 4 | 4 | 3 |
2025 | arXiv | Alibaba | FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling | analyze the communication patterns; KV cache structure adjustment method; load-aware scheduling | 4 | 4 | 2 |
2025 | arXiv | NUS & USTC | DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving | a novel Tandem Serving execution model; two virtual subrequests; explicitly permit the two subrequests to execute on either GPU instance | 3 | 4 | 2 |
Throughput-Optimized Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HKUST | Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation | sampling-then-simulation cost model; model-level pipeline parallelism; minimum-total-latency application scheduling | 4 | 4 | 3 |
Fair Serving Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | Virginia Tech | Ensuring Fair LLM Serving Amid Diverse Applications | multi-tenant LLM platform; overload and interaction-driven throttling; weighted service counter | |||
2025 | arXiv | UIUC | Hierarchical Autoscaling for Large Language Model Serving with Chiron | hierarchical backpressure; interactive requests and batch requests; mixed instances | |||
2025 | arXiv | Berkeley | Locality-aware Fair Scheduling in LLM Serving | deficit-based longest prefix matching; distributed deficit-round coordination; prefix-aware fairness bound analysis |
RLHF System¶
Challenge: an RLHF system includes both training and inference. On top of that, multiple models (LLMs) run in parallel, which makes the data flow more complex.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | EuroSys | HKU | HybridFlow: A Flexible and Efficient RLHF Framework | auto-mapping model placement; 3D-HybridEngine to reduce the communication overhead; hybrid programming | 4 | 4 | 3 |
2025 | arXiv | Alibaba | Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library | bind many LLMs in one device cluster; fix the batch problem of long tail requests; reuse many utils in HybridFlow | 4 | 4 | 2 |
Communication-Computation Overlap¶
Challenge: effectively hiding communication latency by overlapping it with computation, which requires careful scheduling and resource management to avoid bottlenecks and ensure that both communication and computation proceed efficiently without stalling each other.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NSDI | KAIST | ARK: GPU-driven Code Execution for Distributed Deep Learning | communication-motivated DL system; pipeline DMA engine; GPU-direct-controlled DMA | |||
2024 | ASPLOS | PKU | Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | communication partition abstraction; hybrid LLM training tasks; 3-level decompose | |||
2024 | ASPLOS | UW–Madison | T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives | lightweight track and trigger; pre-programmed DMA commands; atomic memory update | |||
2024 | ASPLOS | UIUC | Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM | distributed SpMM; sparsity-aware partition; Synchronous Stripes and Asynchronous Stripes | |||
2024 | arXiv | AMD | Optimizing ML Concurrent Computation and Communication with GPU DMA Engines | concurrent computation and communication; compute and memory interference among concurrent kernels; schedule prioritization and careful resource partitioning |
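A toy illustration of overlapping gradient communication with backward computation: each layer's "all-reduce" is submitted to a background worker while the next layer's gradients are being computed. Threads and sleeps stand in for collectives and kernels; the timings are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backward_layer(layer_id):
    time.sleep(0.03)                       # stand-in for computing this layer's gradients
    return f"grad[{layer_id}]"

def all_reduce(grad):
    time.sleep(0.03)                       # stand-in for the collective communication
    return f"synced({grad})"

def backward_with_overlap(num_layers):
    handles = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for layer_id in reversed(range(num_layers)):
            grad = backward_layer(layer_id)                # compute layer L's gradients...
            handles.append(comm.submit(all_reduce, grad))  # ...and reduce them while L-1 computes
        return [h.result() for h in handles]

start = time.time()
backward_with_overlap(num_layers=6)
print(f"elapsed ~{time.time() - start:.2f}s (vs ~0.36s if compute and comm were serialized)")
```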
Configuration Optimization¶
Challenge: the configuration space is too large to be searched manually.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | OSDI | PKU | Mirage: A Multi-Level Superoptimizer for Tensor Programs | automatic algebraic transformation of tensor programs; DAG-based search of the configuration space; auto-generates kernel functions | 4 | 4 | 3 |
2020 | ASPLOS | PKU | FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System | TVM auto-scheduling; RL-based strategy search; automatic optimization over a large configuration space | 4 | 4 | 3 |