
Distributed Systems

Distributed algorithms

Focusing on distributed algorithms such as consensus and replication, e.g., Raft.

Challenge: concurrency, synchronization, and communication complexities across independent nodes.

Solution: algorithms that coordinate computation and data management across multiple independent computer systems.

Computing Framework

Solution: Developing distributed algorithms requires a clear understanding of the computing framework, which scales out small computing units to process data more efficiently. Common computing frameworks include MapReduce and Spark.

Year Venue Authors Title Tags P E N
2004 OSDI Google MapReduce: simplified data processing on large clusters divide the data processing into map and reduce stages; use a master-worker architecture 4 5 5
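To make the map/reduce split concrete, here is a minimal, hypothetical in-memory sketch of the MapReduce programming model (word count as the classic example). The real system adds the master-worker architecture, partitioning, and fault tolerance, all omitted here.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical single-process sketch of the MapReduce programming model.
# Real MapReduce distributes map/reduce tasks over workers via a master.

def map_phase(doc_id, text):
    # Emit (key, value) pairs: one (word, 1) per token.
    for word in text.split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values for one key.
    return key, sum(values)

def mapreduce(docs):
    mapped = chain.from_iterable(map_phase(i, d) for i, d in enumerate(docs))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

if __name__ == "__main__":
    print(mapreduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```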

Parallel Strategies

Solution: use the computation and memory resources of multiple processors to solve a single problem.

Challenge: communication overhead and load balancing

Data Parallelism

Solution: Data parallelism addresses scenarios where a single GPU can accommodate the model, but the dataset's size necessitates distribution across multiple GPUs for efficient processing and accelerated training.

Modern DNN acceleration systems commonly use the combination of data parallelism and model parallelism.

Year Venue Authors Title Tags P E N
2012 NIPS Google Large Scale Distributed Deep Networks data parallel; use many model replicas to optimize on the same data; distributed model training 3 4 3
2014 OSDI CMU Scaling Distributed Machine Learning with the Parameter Server the foundation of tensor parallel; parameter server; pull-based data transfer 4 5 3
2020 SC Microsoft ZeRO: Memory Optimizations Toward Training Trillion Parameter Models addresses the problem that data parallelism alone cannot reduce per-GPU memory usage 3 4 3
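A minimal sketch of the core data-parallel step, assuming a toy NumPy linear model: each replica computes gradients on its own shard of the batch, and an all-reduce-style average keeps the replicas' weights identical. Parameter servers (push/pull) and ZeRO (sharding optimizer state, gradients, and weights) change where these tensors live, which is not modeled here.

```python
import numpy as np

# Toy data-parallel SGD step: replicas hold identical weights, compute
# gradients on different data shards, then average the gradients
# (the effect of an all-reduce) before every replica applies the update.

def local_gradient(w, x_shard, y_shard):
    # Gradient of mean squared error for a linear model y = x @ w.
    pred = x_shard @ w
    return 2 * x_shard.T @ (pred - y_shard) / len(x_shard)

def data_parallel_step(w, x, y, num_replicas=4, lr=0.1):
    x_shards = np.array_split(x, num_replicas)
    y_shards = np.array_split(y, num_replicas)
    grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    avg_grad = np.mean(grads, axis=0)   # all-reduce (average) across replicas
    return w - lr * avg_grad            # identical update on every replica

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, true_w = rng.normal(size=(64, 3)), np.array([1.0, -2.0, 0.5])
    y = x @ true_w
    w = np.zeros(3)
    for _ in range(200):
        w = data_parallel_step(w, x, y)
    print(np.round(w, 3))  # converges toward true_w
```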

Model Parallelism

Solution: Model parallelism addresses scenarios where the model's size exceeds the processing and memory capacity of a single GPU. There are two types of model parallelism:

  1. Pipeline parallelism: divide the model into pipeline stages; each GPU processes one or more stages.

  2. Tensor parallelism: split individual tensors (e.g., weight matrices) across different GPUs.

Usually, pipeline parallelism and tensor parallelism are used together.

Year Venue Authors Title Tags P E N
2019 arXiv NVIDIA Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism transformer based model parallel; pipeline parallel; divide model into different GPUs 3 4 3
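A hedged NumPy sketch of the tensor-parallel idea used in Megatron-LM-style systems: a weight matrix is split column-wise across devices, each device multiplies the full input by its slice, and the partial outputs are concatenated (an all-gather in a real multi-GPU system). Pipeline parallelism would instead assign whole layers to stages; both are heavily simplified here.

```python
import numpy as np

# Column-parallel linear layer: W is split along the output dimension,
# each "device" computes x @ W_i, and the results are concatenated
# (the role of an all-gather in a real multi-GPU system).

def column_parallel_linear(x, w, num_devices=2):
    w_slices = np.array_split(w, num_devices, axis=1)  # one slice per device
    partial_outputs = [x @ w_i for w_i in w_slices]    # local matmuls
    return np.concatenate(partial_outputs, axis=1)     # all-gather

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    w = rng.normal(size=(8, 16))
    assert np.allclose(column_parallel_linear(x, w), x @ w)
    print("tensor-parallel output matches single-device matmul")
```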

LLM-specific Parallel Strategies

Focusing on the parallel strategies for LLM-specific deep learning systems.

Year Venue Authors Title Tags P E N
2022 ACL NUS Sequence Parallelism: Long Sequence Training from System Perspective splits input sequences into chunks; Ring Self-Attention; sparse attention 3 4 3

Cloud computing platforms and architectures

Challenge: when providing services to users, handling scalability, resource management, fault tolerance, and cost-effectiveness while building and deploying large-scale distributed applications and services.

Cloud Platform LLM Scheduling

Challenge: meeting SLOs when providing LLM services on cloud platforms.

Year Venue Authors Title Tags P E N
2025 arXiv Azure TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms thermal/power property characterization; dynamically adjust in response to power or cooling failures; thermal- and power-aware manner

Microservices

Focusing on microservices.

Memory Management

Challenge: coordinating memory access and maintaining data consistency across multiple independent nodes with their own local memories, especially when dealing with shared data.

Remote Memory

Challenge: efficiently providing access to memory on a remote node while minimizing latency and overhead, and ensuring consistency and reliability despite network communication complexities and potential failures.

Year Venue Authors Title Tags P E N
2020 TC Georgia Tech Hierarchical Orchestration of Disaggregated Memory XMemPod architecture for hierarchical memory orchestration; compressed swap page table (CSPT) for metadata management; hybrid swap-out algorithm for memory utilization; proactive swap-in optimization for performance; RDMA-based remote memory sharing for low-latency access

Scratchpad Memory

Challenge: efficiently allocating and coordinating limited fast memory across distributed nodes to minimize access latency and contention, while ensuring data consistency and scalability.

Year Venue Authors Title Tags P E N
2023 ASPLOS Cornell Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories work-stealing based dynamic task parallelism; stack/task queue in SPM; read-only data duplication 3 3 3

LLM Memory Management

Solution: efficient memory management reduces memory usage, enabling larger batch sizes and higher throughput.

Memory Management Algorithms

Solution: efficient memory management algorithms, like virtual memory, page table, etc.

Year Venue Authors Title Tags P E N
2023 SOSP UCB Efficient Memory Management for Large Language Model Serving with PagedAttention Paged KV-Cache management; Better memory management for larger batch size; Preemptive memory scheduling 4 5 3
2022 NeurIPS Stanford FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness generalized acceleration of attention mechanisms; restructure attention to utilize GPU SRAM; use recomputation to reduce IO burden 4 5 4
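A minimal, hypothetical sketch of the paged KV-cache idea behind PagedAttention: KV memory is carved into fixed-size blocks, and each sequence holds a block table mapping its logical positions to physical blocks, so memory is allocated on demand and freed without fragmentation. The real vLLM allocator also supports copy-on-write sharing and preemption, which are omitted; all names below are illustrative.

```python
# Hypothetical paged KV-cache block allocator (names are illustrative).

BLOCK_SIZE = 16  # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; evict or preempt")
            table.append(self.free_blocks.pop())    # allocate a new block
        self.lengths[seq_id] = length + 1

    def free_sequence(self, seq_id):
        # Return all of the sequence's blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4)
    for _ in range(40):
        cache.append_token("req-0")                 # 40 tokens -> 3 blocks
    print(cache.block_tables["req-0"], len(cache.free_blocks))
    cache.free_sequence("req-0")
    print(len(cache.free_blocks))                   # blocks reclaimed
```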

General LLM Memory Management

Challenge: LLM memory management faces challenges like limited HBM memory, efficient KV Cache management, memory sharing between multiple GPUs, multi-level memory management.

Year Venue Authors Title Tags P E N
2025 arXiv THU Jenga: Effective Memory Management for Serving LLM with Heterogeneity fixed-size embeddings; full-prefix dependency; two-level memory allocator 4 4 3
2025 FAST THU Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot PD-disaggregate system; kv-cache centered; global kv-cache pool; dynamic SLO scheduler; paged KV-Cache storage 3 4 2

KV Cache Reuse Systems

Solution: reduce redundant computation and high memory consumption during inference by allowing the reuse of previously computed key-value pairs for shared or repeated parts of input sequences.

Prefix Sharing

Solution: reuse the KV cache when input sequences have shared or repeated parts; a prefix tree is used to manage the KV cache.

Year Venue Authors Title Tags P E N
2023 NeurIPS Stanford SGLang: Efficient Execution of Structured Language Model Programs KV-Cache sharing; python-like DSL; compute graph; LRU cache management strategy 4 4 3
2024 ACL Microsoft ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition prefix aware attention compute; manage kv-cache chunks as prefix tree; reduce kv-cache redundancy 3 4 2
2024 arXiv Microsoft BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching global prefix tree built ahead-of-time; request reordering; horizontally fused prefix-shared attention kernel
2024 arXiv Berkeley BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching offline batch inference; resource-aware prefix tree; compute-intensive / memory-intensive requests
2024 arXiv UChicago DroidSpeak: Enhancing Cross-LLM Communication selectively layer reuse; communication protocol for inter-agent exchanges; LLMs that share a common foundational model
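A small sketch, under assumed simplifications, of prefix-tree (radix-tree style) KV-cache reuse as used by SGLang-like systems: token prefixes index previously computed KV blocks, so a new request only needs to prefill the part of its prompt beyond the longest cached prefix. Eviction policies (e.g., LRU) and block granularity are left out, and the handle names are hypothetical.

```python
# Hypothetical token-level prefix tree for KV-cache reuse.

class PrefixNode:
    def __init__(self):
        self.children = {}     # token -> PrefixNode
        self.kv_handle = None  # placeholder for cached KV tensors

class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens, kv_handle):
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
        node.kv_handle = kv_handle

    def longest_cached_prefix(self, tokens):
        # Returns (matched length, kv handle for that prefix, if any).
        node, matched, handle = self.root, 0, None
        for i, tok in enumerate(tokens):
            if tok not in node.children:
                break
            node = node.children[tok]
            if node.kv_handle is not None:
                matched, handle = i + 1, node.kv_handle
        return matched, handle

if __name__ == "__main__":
    tree = PrefixTree()
    tree.insert([1, 2, 3, 4], kv_handle="kv-blocks-A")   # cached system prompt
    n, kv = tree.longest_cached_prefix([1, 2, 3, 4, 9, 9])
    print(n, kv)  # 4 kv-blocks-A -> only tokens beyond position 4 need prefill
```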
KV Cache Store

Solution: store the KV cache in the memory or other storage device, supporting multi-level storage.

Year Venue Authors Title Tags P E N
2024 ATC Huawei Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention store KV cache in the memory; multi level KV cache management; position mask modified 3 3 3
Other Techniques

Solution: KV cache reuse techniques beyond prefix sharing, since an exact shared prefix is a strong requirement and is not always available.

Year Venue Authors Title Tags P E N
2024 arXiv Berkeley Optimizing LLM Queries in Relational Workloads prefix sharing maximization; KV cache hit rate; deduplication and cost estimation techniques
2024 arXiv UChicago CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion multiple precomputed text chunks; selective KV recompute; sparsity of attention matrices

KV Cache Storage Systems

Solution: efficiently store and retrieve the key-value cache so it can be reused when needed.

Challenge: the prefetch and eviction of the KV cache, the balance between saving GPU memory and refetching time from the storage device.

Year Venue Authors Title Tags P E N
2025 arXiv NVIDIA FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving block-sparse format; customizable attention template; dynamic load-balanced scheduling framework
2025 arXiv PKU FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference imbalanced KV cache compression mitigation; fair-copying for load balancing; best-effort assignment

KV Cache Evict Systems

Challenge: selectively discard the least important key-value pairs to free up memory for longer contexts or larger batch sizes without significantly degrading the model's generation quality or increasing computational overhead for the eviction process itself.

Year Venue Authors Title Tags P E N
2023 NeurIPS UT-Austin H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models sparsity for small cache size; heavy-hitters; greedy algorithm for low-cost policy
2024 arXiv Fujitsu CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models long measurement step; decay of the accumulated attention score; adjusting FIFO cache size
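A hedged sketch of an H2O-style eviction policy: accumulated attention scores identify "heavy-hitter" tokens, and the cache keeps those plus the most recent tokens while evicting the rest. The actual papers integrate this per head and per layer inside the attention computation; here it is reduced to a scoring-and-selection step with made-up parameter names.

```python
import numpy as np

# Simplified heavy-hitter KV eviction: keep the top-k tokens by accumulated
# attention score plus a window of recent tokens; evict everything else.

def select_kv_to_keep(accumulated_scores, num_heavy, num_recent):
    seq_len = len(accumulated_scores)
    recent = set(range(max(0, seq_len - num_recent), seq_len))
    by_score = np.argsort(accumulated_scores)[::-1].tolist()  # high to low
    heavy = [i for i in by_score if i not in recent][:num_heavy]
    return sorted(recent | set(heavy))                        # kept positions

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random(20)
    scores[[2, 7]] += 5.0          # two heavy-hitter tokens
    keep = select_kv_to_keep(scores, num_heavy=2, num_recent=4)
    print(keep)                    # positions 2 and 7 plus the last 4 tokens
```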

Systems with Other Caches

Solution: use other caches (not just KV cache) to improve the performance of LLM inference.

Year Venue Authors Title Tags P E N
2025 arXiv KAIST Efficient LLM Inference with Activation Checkpointing and Hybrid Caching activation checkpointing; KV-activation hybrid caching; balanced approach to determine the best ratio
LLM Prefetching

Solution: prefetch data ahead of time to hide memory transfers between devices and reduce memory access latency.

Year Venue Authors Title Tags P E N
2025 arXiv Huawei Zurich PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving computational graph-based prefetching; prefetch KV cache to L2 cache

Communication-Centric Optimization

Challenge: communication is a bottleneck in many distributed systems; these works aim to reduce or hide it.

I/O Characterization and Optimization

Challenge: minimize data movement and maximize resource utilization across heterogeneous distributed environments.

Year Venue Authors Title Tags P E N
2020 ASPLOS CMU Livia: Data-Centric Computing Throughout the Memory Hierarchy Memory service programming model; task graphs linked to data location; dynamic task/data scheduling for minimal movement 2 4 3
2025 arXiv UOregon Parallel I/O Characterization and Optimization on Large-Scale HPC Systems: A 360-Degree Survey different HPC I/O stack layers; profiling and tracing tools; tuning techniques

GPU-GPU Communication

Challenge: limited interconnect bandwidth between GPUs (NVLink, PCIe), synchronization delays in parallel workloads, and load imbalance across GPUs.

Year Venue Authors Title Tags P E N
2025 arXiv Apple SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models sync-point drop; block-wise sensitivity analysis; attention output synchronization reduction

Many-Core Systems

Challenge: the heterogeneity of cores, the load imbalance, and the communication overhead.

Workload Characterization

Challenge: dynamic workloads across numerous cores, resource contention for shared hardware.

Year Venue Authors Title Tags P E N
2018 SC Intel Many-Core Graph Workload Analysis multicore simulator sniper; selective caching and prefetching; heterogeneous high-performance low-power cores
2018 DATE UGA Parallel Code Generation of Synchronous Programs for a Many-core Architecture banked memory mapping; worst-case response time analysis
2025 IPDPS UChicago Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems lock-less and concurrent task queue xqueue; distributed tree barrier; NUMA-aware redirect push/work stealing

Fault Propagation

Challenge: faults in one core or component can easily spread to others through shared resources, leading to system-wide reliability issues. Growing core counts make errors hard to predict, detect, and contain.

Year Venue Authors Title Tags P E N
2008 ASPLOS UIUC Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design stuck-at fault; bridging fault; software failure detection
2010 PRDC UBC Modeling the Propagation of Intermittent Hardware Faults in Programs instruction based intermittent fault; dynamic dependency graph (DDG) based propagation modeling
2015 SC IBM Understanding the Propagation of Transient Errors in HPC Applications fault propagation in MPI applications; fault classification: V, ONA, WO, PEX, C; fault propagation speed factors
2023 ISCA UChicago Understanding and Mitigating Hardware Failures in Deep Learning Training Accelerator Systems NVDLA based fault injection framework; re-execution based light-weight recovery technique; failure effects: SlowDegrade, SharpSlowDegrade, SharpDegrade, LowTestAccuracy

Fault Injection Technique

Challenge: It is difficult to target specific components, reproduce realistic fault scenarios, and observe system behavior without disturbing normal operation, especially as system scale and complexity increase.

Year Venue Authors Title Tags P E N
2008 VLSI DISCA Enhancement of Fault Injection Techniques Based on the Modification of VHDL Code saboteurs and mutants technique based fault injection; VHDL level fault-tolerance mechanism
2014 DSN UBC Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults fault injection quantification; assembly level fault injection; LLVM compiler based fault injector

Communication

Challenge: efficiently managing data exchange between a large number of cores, due to limited bandwidth, high latency, and contention in shared resources like interconnects and memory.

Year Venue Authors Title Tags P E N
2025 arXiv UCLM Understanding intra-node communication in HPC systems and Datacenters intra- and inter-node simulation model; intra-node network interface bottleneck; impacts of communication pattern

Heterogeneous Systems

Heterogeneous systems are systems that have different types of processors, such as CPUs and GPUs.

Solution: utilize heterogeneous resources to improve performance.

General Applications

Year Venue Authors Title Tags P E N
2013 SOSP MSR Silicon Valley Dandelion: a Compiler and Runtime for Heterogeneous Systems unified programming model; “single machine” abstraction; a rich object-oriented programming language for data-parallel computing
2025 EuroSys SJTU Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing Bubble-less spatial-temporal sharing; kernel squad scheduling; fine-grained concurrent kernel management 4 3 2
2025 ISPASS CMU Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures Effective regions for balanced utilization of PUs; Proximity-based kernel fusion recommendation; operator-kernel dependency graphs from PyTorch Profiler traces 3 4 2

Decentralized Serving

Challenge: managing diverse hardware and software environments, balancing workloads across uneven resources, minimizing communication overhead, ensuring consistency without centralized control.

Year Venue Authors Title Tags P E N
2019 ASPLOS USC Hop: Heterogeneity-aware Decentralized Training iteration gap; queue-based synchronization; backup workers and bounded staleness
2020 ASPLOS USC Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training Partial All-Reduce to reduce synchronization cost; group scheduling to avoid conflicts
2025 arXiv Berkeley DeServe: Towards Affordable Offline LLM Inference via Decentralization decentralized LLM inference; high-latency optimization; idle GPU utilization; modular on-chain integration
2025 arXiv HKUST DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization partial synchronization based local SGD; DFS algorithm with pruned search space; enables the opportunity of overlapping communication and computation

ML Training Systems

Solution: balance between faster training and high precision.

Year Venue Authors Title Tags P E N
2023 SOSP CMU Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling heterogeneity-aware and adaptivity-aware; ILP formulation for scheduling; bootstrapped from observing just a few mini-batches

LLM Inference Heterogeneous Systems

Solution: manage diverse hardware and software environments, balance workloads across uneven resources, and meet the SLO.

Mobile & Edge-Network Serving

Challenge: limited computation, memory, power coupled with intermittent and unreliable network connectivity, making it difficult to perform computationally intensive training tasks, manage large datasets, and ensure efficient communication and synchronization across distributed edge nodes.

Year Venue Authors Title Tags P E N
2024 arXiv UIC Priority-Aware Model-Distributed Inference at Edge Networks priority-aware model distributed inference algorithm; prioritization of ML inference tasks; model-distributed inferencing mechanism
2024 arXiv Yonsei Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models hybrid language model; selectively skip uplink transmissions; uncertainty-aware
2024 arXiv UMD Distributed Mixture-of-Agents for Edge Inference with Large Language Models Mixture-of-Agents; semantics of the data being gossiped and its timeliness; queuing stability
2025 arXiv PKU SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network hierarchical split learning; edge-cloud collaboration; LoRA adapter update
2025 arXiv SJTU HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators both layer-level and tensor-level GPU-NPU parallelism; different tensor partition strategies; fast synchronization mechanism based on predictable kernel waiting times; tensor partition solver

GPU-GPU Heterogeneous System

Solution: the system is composed of heterogeneous GPUs and does not run inference on the CPU; it needs to manage communication and memory across the heterogeneous GPUs.

Year Venue Authors Title Tags P E N
2024 arXiv CMU Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs LLM model placement as a max-flow problem; per-request pipeline; mixed integer linear programming
2025 ICLR HKUST HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment a combination of graph partitioning and max-flow algorithm; TP and PP with disaggregation; bottleneck and underutilized edges; swap edges

XPU-GPU Heterogeneous System

Challenge: effectively managing and coordinating diverse hardware (CPUs, TPUs, etc.), interconnects, and memory hierarchies

Year Venue Authors Title Tags P E N
2023 ICML Stanford FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU dynamic offload tensor; quantize the weights to 4-bits; linear aggregation of the store and load operations 4 4 3
2025 arXiv CMU Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures SKIP profiling tool; TKLQT metric for CPU/GPU boundedness; proximity score kernel fusion 2 3 2
2025 SPAA Huawei WindVE: Collaborative CPU-NPU Vector Embedding seamless CPU-NPU collaboration for vector embedding; linear regression based estimator; high-throughput offloading vector embedding 2 4 3
2025 arXiv Huawei High-Throughput LLM inference on Heterogeneous Clusters lightweight profiling while avoiding resource-intensive throughput benchmarks; a scheduler that accounts for both instance computational capacity and memory usage; exhaustive search method 2 4 2

Heterogeneous Device Task Scheduling

Solution: assigning different parts of the LLM serving workload to the most suitable heterogeneous devices to maximize throughput and minimize latency.

Year Venue Authors Title Tags P E N
2025 arXiv NUS Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems lightweight and input-aware framework; multiobjective and multi-constraint design space; dynamically creating optimal schedules
2025 arXiv Georgia Tech HARP: A Taxonomy for Heterogeneous and Hierarchical Processors for Mixed-reuse Workloads a taxonomy to classify the heterogeneous and hierarchical accelerators; characterize hardware organization of different accelerators; classify based on relative location of sub-accelerators

LLM Training Heterogeneous Systems

Solution: compared to LLM inference heterogeneous systems, training systems must additionally handle the backward pass and heterogeneity issues.

Year Venue Authors Title Tags P E N
2024 arXiv PKU Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences data sampling imbalance; data packing imbalance; subgraph abstraction
2024 arXiv Ant Group EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models Local Stochastic Gradient Descent (Local SGD); consistent stragglers within heterogeneous devices; hierarchical distribution strategy on a two-dimensional device mesh; layer by layer forward syncing; pseudo-gradient penalty method
2024 arXiv ZJU Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters efficient and low-overhead task-to-cluster scheduling; bin-packing algorithms; seamless and user-friendly
2025 arXiv OSU Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning low-bandwidth interconnects; three-level hierarchical partitioning strategy; improved hierarchical partitioning on top of ZeRO++
2025 arXiv PKU Split Fine-Tuning for Large Language Models in Wireless Networks split fine-tuning; device and server partition; novel compression scheme and resource management algorithm
2025 arXiv Neuchatel SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks partial pipeline parallelism; stage skipping; path scheduling algorithm

Schedule Optimization

Solution: develop task scheduling algorithms to achieve efficient overall system performance despite incomplete and evolving system state information.

General Task Scheduling

Solution: optimizing the allocation and execution of diverse and dynamic workloads.

Year Venue Authors Title Tags P E N
2019 NSDI MIT Shinjuku: Preemptive Scheduling for µsecond-scale Tail Latency preemptive scheduling; single-address space OS; hardware-supported virtualization
2021 SOSP UPenn When Idling is Ideal: Optimizing Tail-Latency for Heavy-Tailed Datacenter Workloads with Perséphone reserve cores; non-conserving; request dispatching algorithm
2017 HPCA UGent Reliability-Aware Scheduling on Heterogeneous Multicore Processors core reliability characteristics difference; system soft error rate; sampling-based reliability-aware scheduling algorithm
2020 TCAD ASU Runtime Task Scheduling Using Imitation Learning for Heterogeneous Many-Core Systems offline Oracle optimization strategy; hierarchical imitation learning based scheduling; two-level scheduling
2023 PACT Yonsei Virtual PIM: Resource-aware Dynamic DPU Allocation and Workload Scheduling Framework for Multi-DPU PIM Architecture Virtual PIM framework; dynamic DPU allocation for multitasking; fine-grained scheduling 3 2 2

Speculative Execution (Non-LLM)

Solution: speculatively execute work ahead of time while balancing the potential performance gains against the costs of predicting outcomes and handling incorrect speculations and their side effects across multiple nodes.

Refer to LLM Speculative Inference for speculative execution algorithms targeting LLMs.

Year Venue Authors Title Tags P E N
2024 arXiv MSR Forerunner: Constraint-based Speculative Transaction Execution for Ethereum constraint-based speculative transaction execution; many-future nature; specialized fast-path program
2024 arXiv Politecnico di Milano Minimizing speculation overhead in a parallel recognizer for regular texts speculation overhead; chunk automaton; reduced-interface DFA

Challenge: efficiently managing the immense computational and memory demands of training and inference across numerous interconnected devices, requiring sophisticated strategies to partition massive models.

LLM Request Scheduling

Solution: develop intelligent strategies to route requests, prioritize urgent or critical tasks, handle varying input lengths and complexities, manage resource contention to meet the SLO requirements.

Year Venue Authors Title Tags P E N
2024 arXiv UCSB Multi-Bin Batching for Increasing LLM Inference Throughput binning-based scheduling strategy; queueing-theoretical analysis; asymptotical throughput optimality
2024 arXiv Yale TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications segmented generation; time-sensitive scheduling; latency-guided batch size selection
2025 arXiv MSRI Niyama : Breaking the Silos of LLM Inference Serving QoS-driven LLM inference serving system; co-scheduling requests with diverse QoS targets on a shared rather than siloed infrastructure; allows graceful service degradation during overload conditions; deadline slack; a hybrid prioritization and an eager relegation policy 4 4 3
2025 arXiv MIT Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints fluid dynamics approximation; Waiting for Accumulated Inference Threshold; a hierarchical framework comprising multiple segments 3 4 2
2025 arXiv PKU SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference service-aware and latency-optimized scheduling algorithm; doubling budget (DB) scheduling algorithm; search-based placement algorithm 3 4 2
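A minimal, assumption-laden sketch of SLO-aware request scheduling: requests carry deadlines, the scheduler orders the queue by slack (deadline minus estimated service time), and fills a batch up to a token budget. Real systems add preemption, output-length prediction, and memory-aware admission, none of which are modeled; the class and field names are hypothetical.

```python
import time

# Hypothetical slack-based batch scheduler: smaller slack = more urgent.

class Request:
    def __init__(self, req_id, prompt_tokens, deadline_s, est_service_s):
        self.req_id = req_id
        self.prompt_tokens = prompt_tokens
        self.deadline_s = deadline_s
        self.est_service_s = est_service_s

    def slack(self, now):
        return (self.deadline_s - now) - self.est_service_s

def schedule_batch(queue, now, token_budget):
    # Order by slack, then greedily pack requests under the token budget.
    ordered = sorted(queue, key=lambda r: r.slack(now))
    batch, used = [], 0
    for req in ordered:
        if used + req.prompt_tokens <= token_budget:
            batch.append(req)
            used += req.prompt_tokens
    return batch

if __name__ == "__main__":
    now = time.time()
    queue = [
        Request("a", 512, now + 0.5, 0.3),   # tight deadline
        Request("b", 2048, now + 5.0, 0.8),  # relaxed deadline
        Request("c", 256, now + 1.0, 0.2),
    ]
    print([r.req_id for r in schedule_batch(queue, now, token_budget=1024)])
```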

LLM Application-Level Scheduling

Solution: optimize the end-to-end latency of the application, including the scheduling of LLM instances.

Year Venue Authors Title Tags P E N
2024 OSDI SJTU Parrot: Efficient Serving of LLM-based Applications with Semantic Variable Semantic Variable; application-level information; LLM applications as first-class citizens
2024 OSDI CUHK Teola: Towards End-to-End Optimization of LLM-based Applications mismatch between request-level scheduling and end-to-end application performance; primitive-level dataflow graph; two-tier scheduling mechanism
2024 arXiv Yext SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering constantly changing and sometimes adverse conditions; Dynamically Reconfigurable Horizontal Scaling Framework; dynamically adjust resource allocation based on query requirements
2025 arXiv Berkeley Autellix: An Efficient Serving Engine for LLM Agents as General Programs formalize agentic programs as dynamic, non-deterministic DAGs; non-clairvoyant scheduler; simple load-balancing policy to balance data locality and KV-cache recomputation
2025 ICDCS SJTU LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications a DAG with regular stage, LLM stage, dynamic stage; bayesian network-based profiler; identify uncertainty-reducing stages 4 4 3

LLM Speculative Inference

Refer to Speculative Execution (Non-LLM) for general speculative execution algorithms.

Year Venue Authors Title Tags P E N
2024 arXiv F&M College AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration simultaneous and independent predictions; asynchronous speculative decoding; rollback mechanism
2024 arXiv Purdue Constrained Decoding with Speculative Lookaheads computational expense of generating lookaheads; speculated lookaheads; task specific reward function
2024 arXiv Rutgers Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface active user intervention; speculative planning algorithm; UI-level rescheduling algorithm
2024 arXiv USTC Parallel Speculative Decoding with Adaptive Draft Length adaptive draft length; pre-verify and post-verify; draft-then-verify framework; mutual waiting problem
2024 arXiv SEU SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding reasoning tree construction; parallel drafting with speculative decoding; FCFS queue verification
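A hedged sketch of the basic draft-then-verify loop these papers build on: a cheap draft model proposes k tokens, the target model checks them (in one batched pass in real systems), and the prefix of draft tokens the target agrees with is accepted while the rest are rolled back. Greedy agreement stands in for the probabilistic acceptance rule used in practice, and the toy "models" below are purely illustrative.

```python
# Simplified greedy speculative decoding: accept the longest prefix of draft
# tokens that matches the target model's own greedy choices.

def speculative_step(prefix, draft_model, target_model, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2) Verify: the target model re-predicts each position given the
    #    accepted context; a mismatch triggers a rollback at that position.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        target_tok = target_model(ctx)
        if target_tok != tok:
            accepted.append(target_tok)   # take the target's token instead
            break
        accepted.append(tok)
        ctx.append(tok)
    return prefix + accepted

if __name__ == "__main__":
    # Toy "models": target counts up by 1, draft sometimes guesses wrong.
    target_model = lambda ctx: ctx[-1] + 1
    draft_model = lambda ctx: ctx[-1] + (1 if ctx[-1] % 5 else 2)
    seq = [0]
    for _ in range(3):
        seq = speculative_step(seq, draft_model, target_model, k=4)
    print(seq)  # multiple tokens accepted per step when the draft agrees
```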
Spec + Others
Year Venue Authors Title Tags P E N
2025 arXiv Huawei Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling speculative MoE; speculative token shuffling; speculative expert pre-grouping
2025 INFOCOM UoA SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models internal neurons sparsification; model-agnostic acceleration framework; dynamic early-exit thresholds; multi-layered feature fusion

LLM Serving Outages and Incidents

Year Venue Authors Title Tags P E N
2025 arXiv Vrije Universiteit Amsterdam An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models empirical characterization of outages; failure recovery optimization; public LLM service reliability

Energy-Optimized LLM Scheduling

Year Venue Authors Title Tags P E N
2025 arXiv UvA GREEN-CODE: Optimizing Energy Efficiency in Large Language Models for Code Generation dynamic early exit; energy-aware code generation; reinforcement learning for LLMs

DNN Scheduling

Solution: optimizing data parallelism and model parallelism while minimizing communication overhead between nodes, effectively managing limited GPU memory and other resources to achieve scalability and high throughput.

Refer to the LLM-related scheduling sections above for LLM-specific scheduling algorithms.

Task Offloading

Year Venue Authors Title Tags P E N
2024 arXiv USTC Collaborative Inference for Large Models with Task Offloading and Early Exiting early exit mechanism; jointly optimize its offloading strategy and the confidence threshold; distributed task offloading algorithm

General Optimizations for Deep Learning Systems

Solution: general optimizations for deep learning systems.

If the paper is focusing on an above-mentioned specific scene (e.g., memory, scheduling, IO, etc.), it will be put in the corresponding section.

LLM Training Systems

Solution: arrange model parameters and data across multiple devices, reduce time spent on communication, and scale smoothly as models and data grow, all while staying efficient and speeding up training.

General Optimizations

Year Venue Authors Title Tags P E N
2025 arXiv THU Enhancing Memory Efficiency in Large Language Model Training Through Chronos-aware Pipeline Parallelism chronos-aware pipeline parallelism; temporal locality optimization; activation balancing
2025 arXiv NUS PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization selective offload strategy; memory offload optimization; pipeline parallelism scalability; lifespan-based offloading
2025 arXiv UCSD WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training workload-aware variable-length document packing; per-document sharding strategy; adaptive sharding selection mechanism; delay execution of extremely long documents 4 5 2
2025 EuroSys UToronto Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization fine-grained overlap-centric scheduling; symbolic-based performance analysis; imbalance-aware hierarchical tuning 4 4 2

Optimizations on Special Scene

Year Venue Authors Title Tags P E N
2025 arXiv HKU Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism Fully Sharded Sparse Data Parallelism (FSSDP); sparsely materializes MoE parameters; two sparse collective communications
2025 arXiv SJTU PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline dynamic interleaved pipeline; hierarchical schedule space for rapid pipeline schedule search; spatial-temporal subgraph reuse 3 4 2

Experiments

Year Venue Authors Title Tags P E N
2025 arXiv JSC Memory and Bandwidth are All You Need for Fully Sharded Data Parallel an extensive analysis of the FSDP training distribution strategy; a grid search methodology; both simulation and empirical results 2 4 1

Multi-Modal Optimizations

Challenge: multimodal data is more complex, and training on it requires more resources.

Year Venue Authors Title Tags P E N
2025 arXiv ByteDance OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training multimodal mini-batch imbalance; batch post-balancing algorithm; node-wise all-to-all communicator for practical rearrangement of mini-batches 4 4 3

Kernel-Level Optimizations

Year Venue Authors Title Tags P E N
2025 arXiv HUST CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures model segment profile-based cost model; communication-free tensor partition propagation property; extracting a set of unique model segments; Communication-Free Preserve 4 5 3

LLM Inference Systems

Focusing on the optimizations for LLM inference systems.

Year Venue Authors Title Tags P E N
2025 ISCA DeepSeek Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures software-hardware co-design for deepseek-v3; insight into hardware for ai architectures 5 5 4

SLO-Aware Systems

Challenge: providing service for users to meet specific latency requirements with limited resources.

Year Venue Authors Title Tags P E N
2025 arXiv Berkeley AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding fine-grained speculative decoding; token tree verification; slo customization
2025 arXiv UIUC HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location online-offline request co-location; interference-aware profiler; latency predictor; adaptive scheduler
2025 arXiv PKU Memory Offloading for Large Language Model Inference with Latency SLO Guarantees effectively captures the tension between meeting SLOs and maximizing host memory usage; dynamic offloading interval; per-bus coordinator
2025 arXiv Huawei Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization hybrid offline-online scheduling; preemptive scheduling for hardware utilization; lagrangian method for cost efficiency evaluation
2025 ASPLOS BUAA Past-Future Scheduler for LLM Serving under SLA Guarantees LightLLM; predict future system memory usage; reduce eviction through better request scheduling 3 2 3

Surveys

System Optimization Surveys
Year Venue Authors Title Tags P E N
2024 arXiv NEU LLM Inference Serving: Survey of Recent Advances and Opportunities KV cache and memory management; LLM computation optimization; Cloud LLM deployment; focus on system-level enhancements
2024 arXiv CUHK A Survey on Inference Optimization Techniques for Mixture of Experts Models model compression; expert skip; expert merge; sparse to dense; expert parallel; expert offloading
2024 arXiv PolyU A Survey on Large Language Model Acceleration based on KV Cache Management cache selection; budget allocation; cache merging; cache quantization; cache low-rank decomposition; attention grouping and sharing; memory management; hardware-aware design
2025 arXiv THU Beyond A Single AI Cluster: A Survey of Decentralized LLM Training resource-driven paradigm; community-driven decentralization; organizational decentralization; decentralized LLM training taxonomy
2025 arXiv FIU Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions distributed solutions for LMs; workload imbalance in LLM training; M-ICL; model security enhancement
Application Surveys
Year Venue Authors Title Tags P E N
2024 arXiv PKU Retrieval-Augmented Generation for AI-Generated Content: A Survey Query Transformation; Data Augmentation; Recursive Retrieval; Chunk Optimization; Retriever Finetuning; Hybrid Retrieval; Re-ranking; Retrieval Transformation; Prompt Engineering; Decoding Tuning; Generator Finetuning; Output Rewrite; Adaptive Retrieval; Iterative RAG
2024 arXiv WHU A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges personalized characteristics; perceive environmental information; utilize memory mechanisms; mutual interaction; agent self-reflection
2024 arXiv PolyU Deploying Foundation Model Powered Agent Services: A Survey FM-powered agent services within the edge-cloud environment; low-level hardware perspective; high-level software perspective

Multimodal Systems

Year Venue Authors Title Tags P E N
2025 arXiv UW–Madison LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models query-block distributed exchange; shared visual token recomputation; sequence-parallelism with minimal communication overhead
2025 arXiv Microsoft Towards Efficient Large Multimodal Model Serving fine-grained stage-aware resource management; multimodal workload-specific scheduling; model architecture-specific optimizations
2025 arXiv Huawei Efficiently Serving Large Multimedia Models Using EPD Disaggregation encode-prefill-decode disaggregation; multimodal cache; intra-request parallel
2025 arXiv TU/e Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach Multimodal Parallel Split Learning; computation-efficient training; server-side loss aggregation mechanism
2025 arXiv HUST FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework resource-aware KV-cache memory pool; multimodal KV-cache compression; modality-specific compression

Mixture-of-Experts LLM Systems

Challenge: efficiently coordinating and scaling expert models across multiple nodes, leading to issues like uneven workload distribution, high communication overhead, and difficulty in fault tolerance.

Expert Offloading and Placement
Year Venue Authors Title Tags P E N
2025 DATE Berkeley DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference data-aware offloading; predictive pre-calculation; sequence-specific expert allocation
2025 arXiv Stevens Tech fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving expert map; iteration-level probability distributions; track fine-grained input semantic embeddings; semantic-based and trajectory-based
2025 arXiv Georgia Tech MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing ILP for expert placement; cross-layer dependencies; minimizing total dispatched token number
2025 EuroMLSys EPFL Accelerating MoE Model Inference with Expert Sharding expert sharding for load balancing; tensor sharding for moe experts; fused expert computations for reduced kernel launches
2025 DAC PKU HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference dynamically balances workloads across GPUs and CPUs; impact-driven prefetching; MoE-specialized cache management 3 4 2
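To ground the expert-offloading discussion, here is a small, assumption-heavy sketch of top-k MoE routing with a fixed budget of "GPU-resident" experts: tokens are routed to their top-k experts, and any routed-to expert that is not currently resident is counted as requiring a slow load from host memory, which is the cost that offloading and prediction schemes try to minimize. All sizes and names are illustrative.

```python
import numpy as np

# Toy top-k MoE router plus a naive resident-expert set, to illustrate
# why expert placement/offloading matters: misses imply host-to-GPU loads.

def route_tokens(gate_logits, top_k=2):
    # gate_logits: [num_tokens, num_experts]; returns top-k expert ids per token.
    return np.argsort(gate_logits, axis=1)[:, -top_k:]

def count_expert_loads(routed, resident_experts):
    # Each routed-to expert not already on the GPU must be loaded once.
    needed = set(routed.flatten().tolist())
    return len(needed - set(resident_experts))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gate_logits = rng.normal(size=(8, 16))       # 8 tokens, 16 experts
    routed = route_tokens(gate_logits, top_k=2)
    resident = [0, 1, 2, 3]                      # only 4 experts fit on GPU
    print(routed)
    print("expert loads needed:", count_expert_loads(routed, resident))
```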
Batching and Scheduling
Year Venue Authors Title Tags P E N
2025 arXiv Alibaba Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference statically batching irregular workloads; batch-task-tile partition; decompress the mapping and dispatch the workload
2025 arXiv Edinburgh MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching module-based batching; high-throughput MoE inference; full KV-cache offloading
2025 arXiv KTH Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference fine-grained preemption; priority-aware scheduling; per-expert queues; expert-level preemption
2025 arXiv UMich MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints two-stage performance modeling; analyzes the theoretical performance upper bound; captures how system execution mechanisms 4 4 2
2025 arXiv NVIDIA MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core decouples parallelization strategies for attention and MoE layers; flexible and efficient token-level dispatcher; 5-D hybrid parallelism 4 5 2
Memory and Communication Efficiency
Year Venue Authors Title Tags P E N
2025 arXiv ByteDance Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts fine-grained communication-computation overlapping for efficient MoE execution; dependency resolving method; adaptive workload assignment method; shared data buffers between communication and computation operations
2025 arXiv UVA eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference expert prediction; task-aware expert loading; task-aware request scheduling
2025 MobiCom HKUST D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving dually sparsely-gated Mixture-of-Experts; token-adaptive bit-width selection; matryoshka weight quantization; bit-width-aware I/O-compute pipeline 3 4 4
2025 OSDI SJTU Fast and Live Model Auto Scaling with O(1) Host Caching auto-scaling with minimal caching; optimize parameter loading; enabling fine-grained layer-level scaling 3 3 2
Architectural Innovations
Year Venue Authors Title Tags P E N
2025 arXiv Shanghai AI Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts linear sequence modeling with MoE; sparse activation via moe layers; hybrid models combining linear-moe and transformer-moe layers
2025 arXiv Berkeley HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs zebra parallelism; attention-expert disaggregation; asymmetric expert assignment mechanism; gather and squeeze strategy 4 5 3
Compute-Kernel-Level Optimizations
Year Venue Authors Title Tags P E N
2025 arXiv SJTU Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores dual-side structured sparsity; sparse-sparse matrix multiplication kernel; vector-wise + 2:4 hybrid sparsity; token-aware activation compression

Long Sequence LLM Systems

Year Venue Authors Title Tags P E N
2024 OSDI SJTU & Alibaba Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache inefficient model parallelism intra-instance; inefficient resource management inter-instance; KV cache scheduling
2025 arXiv PKU ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs hybrid data parallelism; data-aware sharding; a heuristic algorithm that reorganizes data assignment based on the characteristics of data and pipeline parallelism
Sparse Attention

Solution: handling long prompts with full attention introduces high computation and memory cost. Sparse attention reduces this burden by computing only a subset of the attention matrix (e.g., local windows or selected blocks) rather than all pairwise scores.

Year Venue Authors Title Tags P E N
2025 arXiv CWRU Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques sparse attention with graph computing perspective; work-optimal graph algorithms; achieve true sparsity
2025 MLSys MIT LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention unified sparse attention; hybrid static and dynamic sparsity; hierarchical kv cache management with query-centric pruning
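A minimal NumPy sketch of one common sparse-attention pattern (a causal sliding window): each query attends only to the previous w keys instead of the full causal prefix, shrinking both compute and KV memory. The systems above use richer patterns (blocks, graphs, query-centric selection), which this deliberately does not model.

```python
import numpy as np

# Causal sliding-window attention: query i attends to keys in
# [i - window + 1, i], a simple instance of sparse attention.

def sliding_window_attention(q, k, v, window=4):
    seq_len, dim = q.shape
    out = np.zeros_like(v)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(dim)
        weights = np.exp(scores - scores.max())   # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:i + 1]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
    print(sliding_window_attention(q, k, v, window=4).shape)  # (10, 8)
```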
Ring Computation

Solution: use the device layout to reduce communication overhead. The key idea is to overlap computation with communication.

Year Venue Authors Title Tags P E N
2023 NeurIPS UCB Ring Attention with Blockwise Transformers for Near-Infinite Context divide the input into blocks and each block is processed by a single GPU; ring-type device layout 4 3 3
2024 arXiv SJTU TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication communication-oriented parallelism framework; inter-node P2P bidirectional communication bandwidth; optimization of attention block communication
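A sketch of the ring idea in plain Python, under heavy simplification: each worker holds one block of the key/value data, and over P-1 steps every worker passes its current block to the next worker in the ring while, in a real system, overlapping that transfer with blockwise attention. Here the "computation" is just recording which blocks each worker has seen.

```python
# Toy ring exchange: P workers each start with one block and, after P-1
# rotation steps, every worker has processed all P blocks. Real ring
# attention overlaps each send/recv with blockwise attention compute.

def ring_schedule(num_workers):
    blocks = list(range(num_workers))          # block i starts on worker i
    seen = [[blocks[w]] for w in range(num_workers)]
    for _ in range(num_workers - 1):
        # Each worker passes its current block to the next worker (a ring).
        blocks = [blocks[(w - 1) % num_workers] for w in range(num_workers)]
        for w in range(num_workers):
            seen[w].append(blocks[w])          # "compute" on the new block
    return seen

if __name__ == "__main__":
    for worker, order in enumerate(ring_schedule(4)):
        print(f"worker {worker} processes blocks in order {order}")
```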

P-D Disaggregated Systems

Year Venue Authors Title Tags P E N
2024 OSDI PKU DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving goodput-optimized; prefill-decoding interference; novel placement algorithm for the P-D schema
2024 ISCA UW Splitwise: Efficient Generative LLM Inference Using Phase Splitting optimized cache context transfer; performance per dollar; performance per watt; exploration of homogeneous and heterogeneous cluster deployments
2024 arXiv CMU A System for Microserving of LLMs fine-grained sub-request level actions; dynamic reconfiguration according to workloads; unified KV cache abstraction
2025 arXiv PKU ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments two-level hierarchical optimization; tabu search algorithm for GPU partition; a lightweight re-scheduling mechanism
P-D Disaggregated System Optimizations
Year Venue Authors Title Tags P E N
2025 arXiv ByteDance KVDirect: Distributed Disaggregated LLM Inference tensor-centric communication mechanism; pull-based KV cache transfer; dynamic GPU resource scheduling via RDMA
2025 arXiv SYSU Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation attention disaggregation and offloading mechanism; low-latency decoding synchronization; resource-efficient prefill colocation; load-aware offloading scheduling 4 4 3
2025 arXiv Alibaba FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling analyze the communication patterns; KV cache structure adjustment method; load-aware scheduling 4 4 2
2025 arXiv NUS & USTC DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving a novel Tandem Serving execution model; two virtual subrequests; explicitly permit the two subrequests to execute on either GPU instance 3 4 2

Throughput-Optimized Systems

Year Venue Authors Title Tags P E N
2025 arXiv HKUST Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation sampling-then-simulation cost model; model-level pipeline parallelism; minimum-total-latency application scheduling 4 4 3

Fair Serving Systems

Year Venue Authors Title Tags P E N
2024 arXiv Virginia Tech Ensuring Fair LLM Serving Amid Diverse Applications multi-tenant LLM platform; overload and interaction-driven throttling; weighted service counter
2025 arXiv UIUC Hierarchical Autoscaling for Large Language Model Serving with Chiron hierarchical backpressure; interactive requests and batch requests; mixed instances
2025 arXiv Berkeley Locality-aware Fair Scheduling in LLM Serving deficit-based longest prefix matching; distributed deficit-round coordination; prefix-aware fairness bound analysis

RLHF System

Challenge: an RLHF system includes both training and inference. On top of that, multiple agents (LLMs) run in parallel, which makes the data flow more complex.

Year Venue Authors Title Tags P E N
2025 EuroSys HKU HybridFlow: A Flexible and Efficient RLHF Framework auto-mapping model placement; 3D-HybridEngine to reduce the communication overhead; hybrid programming 4 4 3

Communication-Computation Overlap

Challenge: effectively hiding communication latency by overlapping it with computation, which requires careful scheduling and resource management to avoid bottlenecks and ensure that both communication and computation proceed efficiently without stalling each other.

Year Venue Authors Title Tags P E N
2023 NSDI KAIST ARK: GPU-driven Code Execution for Distributed Deep Learning communication-motivated DL system; pipeline DMA engine; GPU-direct-controlled DMA
2024 ASPLOS PKU Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning communication partition abstraction; hybrid LLM training tasks; 3-level decompose
2024 ASPLOS UW–Madison T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives lightweight track and trigger; pre-programmed DMA commands; atomic memory update
2024 ASPLOS UIUC Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM distributed SpMM; sparsity-aware partition; Synchronous Stripes and Asynchronous Stripes
2024 arXiv AMD Optimizing ML Concurrent Computation and Communication with GPU DMA Engines concurrent computation and communication; compute and memory interference among concurrent kernels; schedule prioritization and careful resource partitioning
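A hedged, CPU-only sketch of the overlap principle: the next chunk's "communication" (simulated with a background thread) is launched before computing on the current chunk, so transfer latency is hidden behind compute. Real systems use asynchronous collectives, DMA engines, or separate CUDA streams rather than Python threads; the timings below are arbitrary stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulated pipeline: prefetch (communicate) chunk i+1 while computing on
# chunk i, so communication latency is hidden behind computation.

def communicate(chunk_id):
    time.sleep(0.05)           # stand-in for a network/PCIe transfer
    return f"data-{chunk_id}"

def compute(data):
    time.sleep(0.05)           # stand-in for a GPU kernel
    return f"result-for-{data}"

def overlapped_pipeline(num_chunks):
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        future = comm.submit(communicate, 0)              # prefetch first chunk
        for i in range(num_chunks):
            data = future.result()                        # wait for chunk i
            if i + 1 < num_chunks:
                future = comm.submit(communicate, i + 1)  # overlap next transfer
            results.append(compute(data))                 # compute on chunk i
    return results

if __name__ == "__main__":
    start = time.time()
    overlapped_pipeline(8)
    print(f"overlapped: {time.time() - start:.2f}s "
          f"(a fully serial run of this simulation takes ~0.80s)")
```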

Configuration Optimization

Challenge: the configuration space is too large to be searched manually.

Year Venue Authors Title Tags P E N
2025 OSDI PKU Mirage: A Multi-Level Superoptimizer for Tensor Programs automatic algebraic tensor program transformations; uses a DAG to search the configuration space; auto-generates kernel functions 4 4 3
2020 ASPLOS PKU FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System TVM auto-scheduling; RL-based strategy search; automatic optimization over a large configuration space 4 4 3