Distributed Systems¶
Distributed Algorithms¶
Focusing on distributed algorithms such as consensus and replication, e.g., Raft.
Challenge: concurrency, synchronization, and communication complexity across independent nodes.
Solution: algorithms that provide coordination, computation, and data management across multiple independent computer systems.
Computing Framework¶
Solution: Developing distributed algorithms requires a clear understanding of the computing framework, which composes many small computing units into a larger data-processing pipeline. Common computing frameworks include MapReduce and Spark.
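To make the map/reduce split concrete, here is a minimal, framework-free word-count sketch in plain Python; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative and not any framework's actual API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key (handled by the framework's master/workers)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```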
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2004 | OSDI | Google | MapReduce: simplified data processing on large clusters | divide the data processing into map and reduce stages; use master-worker architecture | 4 | 5 | 5 |
Parallel Strategies¶
Solution: use the computation and memory resources of multiple processors to solve a problem.
Challenge: communication overhead and load balancing
Data Parallelism ¶
Solution: Data parallelism addresses scenarios where a single GPU can accommodate the model, but the dataset's size necessitates distribution across multiple GPUs for efficient processing and accelerated training.
Modern DNN acceleration systems commonly use the combination of data parallelism and model parallelism.
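A minimal single-process sketch of the data-parallel idea, simulating replicas with NumPy (real systems use NCCL collectives or parameter servers; the function names here are illustrative):

```python
import numpy as np

def local_gradient(weights, x_shard, y_shard):
    """Each replica computes a gradient on its own data shard (least-squares example)."""
    pred = x_shard @ weights
    return x_shard.T @ (pred - y_shard) / len(x_shard)

def all_reduce_mean(grads):
    """Stand-in for an all-reduce: average the replicas' gradients."""
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 4)), rng.normal(size=64)
weights = np.zeros(4)
shards = np.array_split(np.arange(64), 4)        # 4 "GPUs", each holding a data shard

for _ in range(100):
    grads = [local_gradient(weights, X[s], y[s]) for s in shards]
    weights -= 0.1 * all_reduce_mean(grads)       # identical update on every replica
```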
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2012 | NIPS | Google | Large Scale Distributed Deep Networks | data parallel; many model replicas optimize the same model; distributed model training | 3 | 4 | 3 |
2014 | OSDI | CMU | Scaling Distributed Machine Learning with the Parameter Server | the foundation of data-parallel training; parameter server; pull-based data transfer | 4 | 5 | 3 |
2020 | SC | Microsoft | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | addresses the problem that data parallelism alone cannot reduce per-GPU memory usage | 3 | 4 | 3 |
Model Parallelism ¶
Solution: Model parallelism addresses scenarios where the model's size exceeds the processing and memory capacity of a single GPU. There are two types of model parallelism:
- Pipeline parallelism: divide the model into pipeline stages; each GPU processes one or more stages.
- Tensor parallelism: split individual tensors (layers) across different GPUs.
Usually, pipeline parallelism and tensor parallelism are used together; a minimal sketch of both follows.
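A toy NumPy sketch of the two splits (single process, "devices" are just variables; real systems move activations and partial results over NVLink/NCCL):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                      # a micro-batch of activations

# Pipeline parallelism: consecutive layers live on different "devices".
W1, W2 = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
def stage0(a): return np.maximum(a @ W1, 0)       # device 0 runs layer 1
def stage1(a): return a @ W2                      # device 1 runs layer 2
out_pipeline = stage1(stage0(x))                  # activations flow between stages

# Tensor parallelism: one layer's weight matrix is split column-wise across devices.
W2_a, W2_b = np.split(W2, 2, axis=1)              # each device holds half the columns
partial_a = stage0(x) @ W2_a                      # computed on device 0
partial_b = stage0(x) @ W2_b                      # computed on device 1
out_tensor = np.concatenate([partial_a, partial_b], axis=1)  # all-gather of partials

assert np.allclose(out_pipeline, out_tensor)      # both layouts produce the same result
```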
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | arXiv | NVIDIA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | transformer-based model parallelism; intra-layer (tensor) parallelism; divides model layers across GPUs | 3 | 4 | 3 |
LLM-specific Parallel Strategies¶
Focusing on the parallel strategies for LLM-specific deep learning systems.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2022 | ACL | NUS | Sequence Parallelism: Long Sequence Training from System Perspective | splits input sequences into chunks; Ring Self-Attention; sparse attention | 3 | 4 | 3 |
Cloud computing platforms and architectures¶
Challenge: when providing services to users, achieving scalability, resource management, fault tolerance, and cost-effectiveness while building and deploying large-scale distributed applications and services.
Cloud Platform LLM Scheduling¶
Challenge: meeting SLOs when providing LLM services on cloud platforms.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Azure | TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms | thermal/power property characterization; dynamically adjust in response to power or cooling failures; thermal- and power-aware manner |
Microservices¶
Focusing on microservices.
Memory Management¶
Challenge: coordinating memory access and maintaining data consistency across multiple independent nodes with their own local memories, especially when dealing with shared data.
Remote Memory¶
Challenge: efficiently providing access to memory on a remote node while minimizing latency and overhead, and ensuring consistency and reliability despite network communication complexities and potential failures.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2020 | TC | Georgia Tech | Hierarchical Orchestration of Disaggregated Memory | XMemPod architecture for hierarchical memory orchestration; compressed swap page table (CSPT) for metadata management; hybrid swap-out algorithm for memory utilization; proactive swap-in optimization for performance; RDMA-based remote memory sharing for low-latency access |
Scratchpad Memory¶
Challenge: efficiently allocating and coordinating limited fast memory across distributed nodes to minimize access latency and contention, while ensuring data consistency and scalability.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | ASPLOS | Cornell | Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories | work-stealing based dynamic task parallelism; stack/task queue in SPM; read-only data duplication | 3 | 3 | 3 |
LLM Memory Management¶
Solution: efficient memory management can reduce memory usage, thus enabling larger batch sizes and higher throughput.
Memory Management Algorithms¶
Solution: efficient memory management algorithms, such as virtual memory and page tables.
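A toy sketch of the paged bookkeeping behind PagedAttention-style KV-cache management (block size, class, and method names are illustrative, not the paper's implementation): logical token positions map to fixed-size physical blocks through a per-sequence block table, so free memory is managed at block granularity rather than as one contiguous region.

```python
class PagedKVCache:
    """Toy block allocator in the spirit of paged KV-cache management."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                           # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one new token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:                # current block is full (or first token)
            table.append(self.free_blocks.pop())    # grab a free physical block
        self.lengths[seq_id] = n + 1
        block, offset = table[n // self.block_size], n % self.block_size
        return block, offset                        # where this token's K/V would be written

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("req-0")                     # 6 tokens span 2 physical blocks
print(cache.block_tables["req-0"], len(cache.free_blocks))  # e.g. [7, 6] 6
```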
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | SOSP | UCB | Efficient Memory Management for Large Language Model Serving with PagedAttention | Paged KV-Cache management; Better memory management for larger batch size; Preemptive memory scheduling | 4 | 5 | 3 |
2022 | NIPS | Stanford | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | generalized acceleration of attention mechanisms; restructures attention to exploit on-chip GPU SRAM; uses recomputation to reduce the I/O burden | 4 | 5 | 4 |
General LLM Memory Management¶
Challenge: LLM memory management faces challenges such as limited HBM capacity, efficient KV cache management, memory sharing across multiple GPUs, and multi-level memory management.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | THU | Jenga: Effective Memory Management for Serving LLM with Heterogeneity | fixed-size embeddings; full-prefix dependency; two-level memory allocator | 4 | 4 | 3 |
2025 | FAST | THU | Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot | PD-disaggregate system; kv-cache centered; global kv-cache pool; dynamic SLO scheduler; paged KV-Cache storage | 3 | 4 | 2 |
KV Cache Reuse Systems¶
Solution: reduce redundant computation and high memory consumption during inference by allowing the reuse of previously computed key-value pairs for shared or repeated parts of input sequences.
Prefix Sharing¶
Solution: reuse the KV cache when input sequences share or repeat a prefix; a prefix tree is commonly used to manage the cached entries.
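A toy token-level prefix tree, in the spirit of radix-tree KV-cache reuse (class names and the string block handles are illustrative): a new request walks the tree to find how many leading tokens already have cached KV.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}        # token -> PrefixNode
        self.kv_block = None      # handle to the cached KV for the path ending here

class PrefixCache:
    """Toy prefix tree: find how many leading tokens of a request already have cached KV."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens, kv_blocks):
        node = self.root
        for tok, blk in zip(tokens, kv_blocks):
            node = node.children.setdefault(tok, PrefixNode())
            node.kv_block = blk   # remember where this prefix's KV lives

    def match(self, tokens):
        """Return (#reusable prefix tokens, their KV handles)."""
        node, reused = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            reused.append(node.kv_block)
        return len(reused), reused

cache = PrefixCache()
cache.insert([1, 2, 3, 4], ["b0", "b1", "b2", "b3"])   # e.g. a system prompt already served
hits, blocks = cache.match([1, 2, 3, 9, 9])            # new request shares a 3-token prefix
print(hits, blocks)                                    # 3 ['b0', 'b1', 'b2']
```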
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NIPS | Stanford | SGLang: Efficient Execution of Structured Language Model Programs | KV-Cache share; python-like DSL; compute graph; LRU cache management strategy | 4 | 4 | 3 |
2024 | ACL | Microsoft | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | prefix aware attention compute; manage kv-cache chunks as prefix tree; reduce kv-cache redundancy | 3 | 4 | 2 |
2024 | arXiv | Microsoft | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | global prefix tree ahead-of-time; request reorder; horizontally fused prefix-shared attention kernel | |||
2024 | arXiv | Berkeley | BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching | offline batch inference; resource-aware prefix tree; compute-intensive / memory-intensive requests | |||
2024 | arXiv | UChicago | DroidSpeak: Enhancing Cross-LLM Communication | selectively layer reuse; communication protocol for inter-agent exchanges; LLMs that share a common foundational model |
KV Cache Store¶
Solution: store the KV cache in host memory or other storage devices, supporting multi-level storage.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | ATC | Huawei | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | store KV cache in the memory; multi level KV cache management; position mask modified | 3 | 3 | 3 |
Other Techniques¶
Solution: KV cache reuse techniques beyond prefix sharing, since an exactly shared prefix is a strong requirement and is not always available.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | Berkeley | Optimizing LLM Queries in Relational Workloads | prefix sharing maximization; KV cache hit rate; deduplication and cost estimation techniques | |||
2024 | arXiv | UChicago | CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | multiple precomputed text chunks; selective KV recompute; sparsity of attention matrices |
KV Cache Storage Systems¶
Solution: efficiently storing and retrieving the key-value cache so it can be reused when needed.
Challenge: prefetching and evicting the KV cache, and balancing GPU memory savings against the time to refetch from the storage device.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | NVIDIA | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | block-sparse format; customizable attention template; dynamic load-balanced scheduling framework | |||
2025 | arXiv | PKU | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference | imbalanced KV cache compression mitigation; fair-copying for load balancing; best-effort assignment |
KV Cache Eviction Systems¶
Challenge: selectively discarding the least important key-value pairs to free memory for longer contexts or larger batch sizes, without significantly degrading generation quality or adding computational overhead for the eviction process itself.
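A toy score-based eviction policy in the spirit of heavy-hitter approaches (the budget, recent window, and scoring are illustrative, not any paper's exact policy): keep the most recent tokens plus the tokens with the highest accumulated attention score, and evict the rest.

```python
import numpy as np

def evict_kv(accumulated_scores, budget, recent_window=4):
    """Return indices of KV entries to keep: the recent window plus top-scoring 'heavy hitters'."""
    n = len(accumulated_scores)
    if n <= budget:
        return list(range(n))
    recent = list(range(n - recent_window, n))                  # always keep the newest tokens
    candidates = list(range(n - recent_window))
    heavy = sorted(candidates, key=lambda i: accumulated_scores[i], reverse=True)
    keep = heavy[: budget - recent_window] + recent
    return sorted(keep)

scores = np.array([0.9, 0.1, 0.05, 0.7, 0.02, 0.03, 0.2, 0.1])  # accumulated attention per token
print(evict_kv(scores, budget=6))   # keeps heavy hitters 0 and 3 plus the 4 most recent tokens
```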
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NIPS | UT-Austin | H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | sparsity for small cache size; heavy-hitters; greedy algorithm for low-cost policy | |||
2024 | arXiv | Fujitsu | CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models | long measurement step; decay of the accumulated attention score; adjusting FIFO cache size |
Systems with Other Caches¶
Solution: use other caches (not just KV cache) to improve the performance of LLM inference.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | KAIST | Efficient LLM Inference with Activation Checkpointing and Hybrid Caching | activation checkpointing; KV-activation hybrid caching; balanced approach to determine the best ratio |
LLM Prefetching¶
Solution: prefetch data ahead of use to avoid stalling on memory transfers between devices and to reduce memory-access latency.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Huawei Zurich | PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving | computational graph-based prefetching; prefetch KV cache to L2 cache |
Communication-Centric Optimization¶
Challenge: communication is a bottleneck in many distributed systems; these works aim to reduce or hide it.
I/O Characterization and Optimization¶
Challenge: minimize data movement and maximize resource utilization across heterogeneous distributed environments.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2020 | ASPLOS | CMU | Livia: Data-Centric Computing Throughout the Memory Hierarchy | Memory service programming model; task graphs linked to data location; dynamic task/data scheduling for minimal movement | 2 | 4 | 3 |
2025 | arXiv | UOregon | Parallel I/O Characterization and Optimization on Large-Scale HPC Systems: A 360-Degree Survey | different HPC I/O stack layers; profiling and tracing tools; tuning techniques |
GPU-GPU Communication¶
Challenge: limited interconnect bandwidth between GPUs over NVLink or PCIe, synchronization delays in parallel workloads, and load imbalance across GPUs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Apple | SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models | sync-point drop; block-wise sensitivity analysis; attention output synchronization reduction |
Many-Core Systems¶
Challenge: the heterogeneity of cores, the load imbalance, and the communication overhead.
Workload Characterization¶
Challenge: dynamic workloads across numerous cores, resource contention for shared hardware.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2018 | SC | Intel | Many-Core Graph Workload Analysis | multicore simulator sniper; selective caching and prefetching; heterogeneous high-performance low-power cores | |||
2018 | DATE | UGA | Parallel Code Generation of Synchronous Programs for a Many-core Architecture | banked memory mapping; worst-case response time analysis | |||
2025 | IPDPS | UChicago | Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems | lock-less and concurrent task queue xqueue; distributed tree barrier; NUMA-aware redirect push/work stealing |
Fault Propagation¶
Challenge: a fault in one core or component can easily spread to others through shared resources, leading to system-wide reliability issues. Growing core counts make it hard to predict, detect, and contain errors effectively.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2008 | ASPLOS | UIUC | Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design | stuck-at fault; bridging fault; software failure detection | |||
2010 | PRDC | UBC | Modeling the Propagation of Intermittent Hardware Faults in Programs | instruction based intermittent fault; dynamic dependency graph (DDG) based propagation modeling | |||
2015 | SC | IBM | Understanding the Propagation of Transient Errors in HPC Applications | fault propagation in MPI application; fault classification: V, ONA, WO, PEX, C; fault propagation speed factors | |||
2023 | ISCA | UChicago | Understanding and Mitigating Hardware Failures in Deep Learning Training Accelerator Systems | NVDLA based fault injection framework; re-execution based light-weight recovery technique; failure effects: SlowDegrade, SharpSlowDegrade, SharpDegrade, LowTestAccuracy |
Fault Injection Techniques¶
Challenge: It is difficult to target specific components, reproduce realistic fault scenarios, and observe system behavior without disturbing normal operation, especially as system scale and complexity increase.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2008 | VLSI | DISCA | Enhancement of Fault Injection Techniques Based on the Modification of VHDL Code | saboteurs and mutants technique based fault injection; VHDL level fault-tolerance mechanism | |||
2014 | DSN | UBC | Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults | fault injection quantification; assembly level fault injection; LLVM compiler based fault injector |
Communication¶
Challenge: efficiently managing data exchange between a large number of cores, due to limited bandwidth, high latency, and contention in shared resources like interconnects and memory.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UCLM | Understanding intra-node communication in HPC systems and Datacenters | intra- and inter-node simulation model; intra-node network interface bottleneck; impacts of communication pattern |
Heterogeneous Systems¶
Heterogeneous systems are systems that have different types of processors, such as CPUs and GPUs.
Solution: utilize the heterogeneous resources to improve performance.
General Applications¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2013 | SOSP | MSR Silicon Valley | Dandelion: a Compiler and Runtime for Heterogeneous Systems | unified programming model; “single machine” abstraction; a rich object-oriented programming language for data-parallel computing | |||
2025 | EuroSys | SJTU | Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing | Bubble-less spatial-temporal sharing; kernel squad scheduling; fine-grained concurrent kernel management | 4 | 3 | 2 |
2025 | ISPASS | CMU | Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures | Effective regions for balanced utilization of PUs; Proximity-based kernel fusion recommendation; operator-kernel dependency graphs from PyTorch Profiler traces | 3 | 4 | 2 |
Decentralized Serving¶
Challenge: managing diverse hardware and software environments, balancing workloads across uneven resources, minimizing communication overhead, ensuring consistency without centralized control.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | ASPLOS | USC | Hop: Heterogeneity-aware Decentralized Training | iteration gap; queue-based synchronization; backup workers and bounded staleness | |||
2020 | ASPLOS | USC | Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training | Partial All-Reduce to reduce synchronization cost; group scheduling to avoid conflicts | |||
2025 | arXiv | Berkeley | DeServe: Towards Affordable Offline LLM Inference via Decentralization | decentralized LLM inference; high-latency optimization; idle GPU utilization; modular on-chain integration | |||
2025 | arXiv | HKUST | DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization | partial synchronization based local SGD; DFS algorithm with pruned search space; enables the opportunity of overlapping communication and computation |
ML Training Systems¶
Solution: balance between faster training and high precision.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | SOSP | CMU | Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling | heterogeneity-aware and adaptivity-aware; ILP formulation for scheduling; bootstrapped from observing just a few mini-batches |
LLM Inference Heterogeneous Systems ¶
Solution: managing diverse hardware and software environments, balancing workloads across uneven resources, and meeting SLOs.
Mobile & Edge-Network Serving¶
Challenge: limited computation, memory, power coupled with intermittent and unreliable network connectivity, making it difficult to perform computationally intensive training tasks, manage large datasets, and ensure efficient communication and synchronization across distributed edge nodes.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | UIC | Priority-Aware Model-Distributed Inference at Edge Networks | priority-aware model distributed inference algorithm; prioritization of ML inference tasks; model-distributed inferencing mechanism | |||
2024 | arXiv | Yonsei | Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models | hybrid language model; selectively skip uplink transmissions; uncertainty-aware | |||
2024 | arXiv | UMD | Distributed Mixture-of-Agents for Edge Inference with Large Language Models | Mixture-of-Agents; semantics of the data being gossiped and its timeliness; queuing stability | |||
2025 | arXiv | PKU | SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network | hierarchical split learning; edge-cloud collaboration; LoRA adapter update | |||
2025 | arXiv | SJTU | HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators | both layer-level and tensor-level GPU-NPU parallelism; different tensor partition strategies; fast synchronization mechanism based on predictable kernel waiting times; tensor partition solver |
GPU-GPU Heterogeneous System¶
Solution: the system is composed of heterogeneous GPUs and does not run inference on the CPU. The system needs to manage communication and memory across the heterogeneous GPUs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | CMU | Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | LLM model placement as a max-flow problem; per-request pipeline; mixed integer linear programming | |||
2025 | ICLR | HKUST | HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment | a combination of graph partitioning and max-flow algorithm; TP and PP with disaggregation; bottleneck and underutilized edges; swap edges |
XPU-GPU Heterogeneous System¶
Challenge: effectively managing and coordinating diverse hardware (CPUs, TPUs, etc.), interconnects, and memory hierarchies
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | ICML | Stanford | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | dynamic offload tensor; quantize the weights to 4-bits; linear aggregation of the store and load operations | 4 | 4 | 3 |
2025 | arXiv | CMU | Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures | SKIP profiling tool; TKLQT metric for CPU/GPU boundedness; proximity score kernel fusion | 2 | 3 | 2 |
2025 | SPAA | Huawei | WindVE: Collaborative CPU-NPU Vector Embedding | seamless CPU-NPU collaboration for vector embedding; linear regression based estimator; high-throughput offloading vector embedding | 2 | 4 | 3 |
2025 | arXiv | Huawei | High-Throughput LLM inference on Heterogeneous Clusters | lightweight profiling while avoiding resource-intensive throughput benchmarks; a scheduler that accounts for both instance computational capacity and memory usage; exhaustive search method | 2 | 4 | 2 |
Heterogeneous Device Task Scheduling¶
Solution: assigning different parts of the LLM serving workload to the most suitable heterogeneous devices to maximize throughput and minimize latency.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | NUS | Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems | lightweight and input-aware framework; multiobjective and multi-constraint design space; dynamically creating optimal schedules | |||
2025 | arXiv | Georgia Tech | HARP: A Taxonomy for Heterogeneous and Hierarchical Processors for Mixed-reuse Workloads | a taxonomy to classify the heterogeneous and hierarchical accelerators; characterize hardware organization of different accelerators; classify based on relative location of sub-accelerators |
LLM Training Heterogeneous Systems¶
Solution: compared with LLM inference heterogeneous systems, training additionally has to handle the backward pass and device heterogeneity.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | PKU | Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences | data sampling imbalance; data packing imbalance; subgraph abstraction | |||
2024 | arXiv | Ant Group | EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models | Local Stochastic Gradient Descent (Local SGD); consistent stragglers within heterogeneous devices; hierarchical distribution strategy on a two-dimensional device mesh; layer by layer forward syncing; pseudo-gradient penalty method | |||
2024 | arXiv | ZJU | Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters | efficient and low-overhead task-to-cluster scheduling; bin-packing algorithms; seamless and user-friendly | |||
2025 | arXiv | OSU | Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning | low-bandwidth interconnects; three-level hierarchical partitioning strategy; improved hierarchical partitioning on top of ZeRO++ | |||
2025 | arXiv | PKU | Split Fine-Tuning for Large Language Models in Wireless Networks | split fine-tuning; device and server partition; novel compression scheme and resource management algorithm | |||
2025 | arXiv | Neuchatel | SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks | partial pipeline parallelism; stage skipping; path scheduling algorithm |
Schedule Optimization¶
Solution: develop task scheduling algorithms that achieve efficient overall system performance despite incomplete and evolving system state information.
General Task Scheduling¶
Solution: optimizing the allocation and execution of diverse and dynamic workloads.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | NSDI | MIT | Shinjuku: Preemptive Scheduling for µsecond-scale Tail Latency | preemptive scheduling; single-address space OS; hardware-supported virtualization | |||
2021 | SOSP | UPenn | When Idling is Ideal: Optimizing Tail-Latency for Heavy-Tailed Datacenter Workloads with Perséphone | reserve cores; non-work-conserving scheduling; request dispatching algorithm | |||
2017 | HPCA | UGent | Reliability-Aware Scheduling on Heterogeneous Multicore Processors | core reliability characteristics difference; system soft error rate; sampling-based reliability-aware scheduling algorithm | |||
2020 | TCAD | ASU | Runtime Task Scheduling Using Imitation Learning for Heterogeneous Many-Core Systems | offline Oracle optimization strategy; hierarchical imitation learning based scheduling; two-level scheduling | |||
2023 | PACT | Yonsei | Virtual PIM: Resource-aware Dynamic DPU Allocation and Workload Scheduling Framework for Multi-DPU PIM Architecture | Virtual PIM framework; dynamic DPU allocation for multitasking; fine-grained scheduling | 3 | 2 | 2 |
Speculative Execution (Non-LLM) ¶
Challenge: balancing the potential performance gains of speculative execution against its costs, including accurately predicting outcomes and handling incorrect speculations and their side effects across multiple nodes.
Refer to LLM Speculative Inference for LLM-specific speculative execution algorithms.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | MSR | Forerunner: Constraint-based Speculative Transaction Execution for Ethereum | constraint-based speculative transaction execution; many-future nature; specialized fast-path program | |||
2024 | arXiv | Politecnico di Milano | Minimizing speculation overhead in a parallel recognizer for regular texts | speculation overhead; chunk automaton; reduced-interface DFA |
LLM-Related Scheduling ¶
Challenge: efficiently managing the immense computational and memory demands of training and inference across numerous interconnected devices, requiring sophisticated strategies to partition massive models.
LLM Request Scheduling¶
Solution: develop intelligent strategies to route requests, prioritize urgent or critical tasks, handle varying input lengths and complexities, and manage resource contention to meet SLO requirements.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | UCSB | Multi-Bin Batching for Increasing LLM Inference Throughput | binning-based scheduling strategy; queueing-theoretical analysis; asymptotical throughput optimality | |||
2024 | arXiv | Yale | TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications | segmented generation; time-sensitive scheduling; latency-guided batch size selection | |||
2025 | arXiv | MSRI | Niyama: Breaking the Silos of LLM Inference Serving | QoS-driven LLM inference serving system; co-scheduling requests with diverse QoS targets on a shared rather than siloed infrastructure; allows graceful service degradation during overload conditions; deadline slack; a hybrid prioritization and an eager relegation policy | 4 | 4 | 3 |
2025 | arXiv | MIT | Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints | fluid dynamics approximation; Waiting for Accumulated Inference Threshold; a hierarchical framework comprising multiple segments | 3 | 4 | 2 |
2025 | arXiv | PKU | SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference | service-aware and latency-optimized scheduling algorithm; doubling budget (DB) scheduling algorithm; search-based placement algorithm | 3 | 4 | 2 |
LLM Application-Level Scheduling¶
Solution: optimize the end-to-end latency of the application, including the scheduling of LLM instances.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | SJTU | Parrot: Efficient Serving of LLM-based Applications with Semantic Variable | Semantic Variable; application-level information; LLM applications as first-class citizens | |||
2024 | OSDI | CUHK | Teola: Towards End-to-End Optimization of LLM-based Applications | mismatch between request-level scheduling and end-to-end application performance; primitive-level dataflow graph; two-tier scheduling mechanism | |||
2024 | arXiv | Yext | SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering | constantly changing and sometimes adverse conditions; Dynamically Reconfigurable Horizontal Scaling Framework; dynamically adjust resource allocation based on query requirements | |||
2025 | arXiv | Berkeley | Autellix: An Efficient Serving Engine for LLM Agents as General Programs | formalize agentic programs as dynamic, non-deterministic DAGs; non-clairvoyant scheduler; simple load-balancing policy to balance data locality and KV-cache recomputation | |||
2025 | ICDCS | SJTU | LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications | a DAG with regular stage, LLM stage, dynamic stage; bayesian network-based profiler; identify uncertainty-reducing stages | 4 | 4 | 3 |
LLM Speculative Inference ¶
Refer to Speculative Execution (Non-LLM) for general speculative execution techniques; a minimal draft-then-verify sketch follows.
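A minimal greedy draft-then-verify loop illustrating the core idea of speculative decoding (real systems verify probabilistically and batch the target model's forward pass; the toy "models" below are stand-ins, not any paper's implementation):

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=12):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model checks each proposed position (in practice: one batched pass).
        accepted = []
        for t in proposal:
            expected = target_next(out + accepted)
            if expected != t:
                accepted.append(expected)   # take the target's token and stop verifying
                break
            accepted.append(t)
        out.extend(accepted)                # always makes progress by >= 1 token
    return out

# Toy stand-ins: the "target" counts upward, the "draft" is right most of the time.
target_next = lambda ctx: ctx[-1] + 1
draft_next = lambda ctx: ctx[-1] + (1 if ctx[-1] % 5 else 2)
print(speculative_decode(draft_next, target_next, [0], k=4, max_new=8))
```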
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | F&M College | AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration | simultaneous and independent predictions; asynchronous speculative decoding; rollback mechanism | |||
2024 | arXiv | Purdue | Constrained Decoding with Speculative Lookaheads | computational expense of generating lookaheads; speculated lookaheads; task specific reward function | |||
2024 | arXiv | Rutgers | Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface | active user intervention; speculative planning algorithm; UI-level rescheduling algorithm | |||
2024 | arXiv | USTC | Parallel Speculative Decoding with Adaptive Draft Length | adaptive draft length; pre-verify and post-verify; draft-then-verify framework; mutual waiting problem | |||
2024 | arXiv | SEU | SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding | reasoning tree construction; parallel drafting with speculative decoding; FCFS queue verification |
Speculative Decoding + Other Techniques¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Huawei | Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling | speculative MoE; speculative token shuffling; speculative expert pre-grouping | |||
2025 | INFOCOM | UoA | SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models | internal neurons sparsification; model-agnostic acceleration framework; dynamic early-exit thresholds; multi-layered feature fusion |
LLM Serving Outages and Incidents¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Vrije Universiteit Amsterdam | An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models | empirical characterization of outages; failure recovery optimization; public LLM service reliability |
Energy-Optimized LLM Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UvA | GREEN-CODE: Optimizing Energy Efficiency in Large Language Models for Code Generation | dynamic early exit; energy-aware code generation; reinforcement learning for llms |
DNN Scheduling¶
Solution: optimizing data parallelism and model parallelism while minimizing communication overhead between nodes, effectively managing limited GPU memory and other resources to achieve scalability and high throughput.
Refer to LLM-Related Scheduling for the LLM-related scheduling algorithms.
Task Offloading¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | USTC | Collaborative Inference for Large Models with Task Offloading and Early Exiting | early exit mechanism; jointly optimize its offloading strategy and the confidence threshold; distributed task offloading algorithm |
General Optimizations for Deep Learning Systems¶
Solution: general optimizations for deep learning systems.
If a paper focuses on one of the specific scenarios above (e.g., memory, scheduling, I/O), it is placed in the corresponding section.
LLM Training Systems¶
Solution: arrange model parameters and data across multiple devices, reduce the time spent communicating, and scale up smoothly as models and data keep growing, all while staying efficient and speeding up training.
General Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | THU | Enhancing Memory Efficiency in Large Language Model Training Through Chronos-aware Pipeline Parallelism | chronos-aware pipeline parallelism; temporal locality optimization; activation balancing | |||
2025 | arXiv | NUS | PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization | selective offload strategy; memory offload optimization; pipeline parallelism scalability; lifespan-based offloading | |||
2025 | arXiv | UCSD | WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training | workload-aware variable-length document packing; per-document sharding strategy; adaptive sharding selection mechanism; delay execution of extremely long documents | 4 | 5 | 2 |
2025 | EuroSys | UToronto | Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization | fine-grained overlap-centric scheduling; symbolic-based performance analysis; imbalance-aware hierarchical tuning | 4 | 4 | 2 |
Optimizations for Specific Scenarios¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HKU | Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism | Fully Sharded Sparse Data Parallelism (FSSDP); sparsely materializes MoE parameters; two sparse collective communications | |||
2025 | arXiv | SJTU | PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline | dynamic interleaved pipeline; hierarchical schedule space for rapid pipeline schedule search; spatial-temporal subgraph reuse | 3 | 4 | 2 |
Experiments¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | JSC | Memory and Bandwidth are All You Need for Fully Sharded Data Parallel | an extensive analysis of the FSDP training distribution strategy; a grid search methodology; both simulation and empirical results | 2 | 4 | 1 |
Multi-Modal Optimizations¶
Challenge: multimodal data is more complex and requires more resources to train on.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training | multimodal mini-batch imbalance; batch post-balancing algorithm; node-wise all-to-all communicator for practical rearrangement of mini-batches | 4 | 4 | 3 |
Kernel-Level Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HUST | CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures | model segment profile-based cost model; communication-free tensor partition propagation property; extracting a set of unique model segments; Communication-Free Preserve | 4 | 5 | 3 |
LLM Inference Systems¶
Focusing on the optimizations for LLM inference systems.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | ISCA | DeepSeek | Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | software-hardware co-design for deepseek-v3; insight into hardware for ai architectures | 5 | 5 | 4 |
SLO-Aware Systems¶
Challenge: providing service that meets users' specific latency requirements with limited resources.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Berkeley | AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding | fine-grained speculative decoding; token tree verification; slo customization | |||
2025 | arXiv | UIUC | HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location | online-offline request co-location; interference-aware profiler; latency predictor; adaptive scheduler | |||
2025 | arXiv | PKU | Memory Offloading for Large Language Model Inference with Latency SLO Guarantees | effectively captures the tension between meeting SLOs and maximizing host memory usage; dynamic offloading interval; per-bus coordinator | |||
2025 | arXiv | Huawei | Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization | hybrid offline-online scheduling; preemptive scheduling for hardware utilization; lagrangian method for cost efficiency evaluation | |||
2025 | ASPLOS | BUAA | Past-Future Scheduler for LLM Serving under SLA Guarantees | LightLLM; predict future system memory usage; reduce evictions through better request scheduling | 3 | 2 | 3 |
Surveys¶
System Optimization Surveys¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | NEU | LLM Inference Serving: Survey of Recent Advances and Opportunities | KV cache and memory management; LLM computation optimization; Cloud LLM deployment; focus on system-level enhancements | |||
2024 | arXiv | CUHK | A Survey on Inference Optimization Techniques for Mixture of Experts Models | model compression; expert skip; expert merge; sparse to dense; expert parallel; expert offloading | |||
2024 | arXiv | PolyU | A Survey on Large Language Model Acceleration based on KV Cache Management | cache selection; budget allocation; cache merging; cache quantization; cache low-rank decomposition; attention grouping and sharing; memory management; hardware-aware design | |||
2025 | arXiv | THU | Beyond A Single AI Cluster: A Survey of Decentralized LLM Training | resource-driven paradigm; community-driven decentralization; organizational decentralization; decentralized LLM training taxonomy | |||
2025 | arXiv | FIU | Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions | distributed solutions for LMs; workload imbalance in LLM training; M-ICL; model security enhancement |
Application Surveys¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | PKU | Retrieval-Augmented Generation for AI-Generated Content: A Survey | Query Transformation; Data Augmentation; Recursive Retrieval; Chunk Optimization; Retriever Finetuning; Hybrid Retrieval; Re-ranking; Retrieval Transformation; Prompt Engineering; Decoding Tuning; Generator Finetuning; Output Rewrite; Adaptive Retrieval; Iterative RAG | |||
2024 | arXiv | WHU | A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges | personalized characteristics; perceive environmental information; utilize memory mechanisms; mutual interaction; agent self-reflection | |||
2024 | arXiv | PolyU | Deploying Foundation Model Powered Agent Services: A Survey | FM-powered agent services within the edge-cloud environment; low-level hardware perspective; high-level software perspective |
Multimodal Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UW–Madison | LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | query-block distributed exchange; shared visual token recomputation; sequence-parallelism with minimal communication overhead | |||
2025 | arXiv | Microsoft | Towards Efficient Large Multimodal Model Serving | fine-grained stage-aware resource management; multimodal workload-specific scheduling; model architecture-specific optimizations | |||
2025 | arXiv | Huawei | Efficiently Serving Large Multimedia Models Using EPD Disaggregation | encode-prefill-decode disaggregation; multimodal cache; intra-request parallel | |||
2025 | arXiv | TU/e | Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach | Multimodal Parallel Split Learning; computation-efficient training; server-side loss aggregation mechanism | |||
2025 | arXiv | HUST | FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework | resource-aware KV-cache memory pool; multimodal KV-cache compression; modality-specific compression |
Mixture-of-Experts LLM Systems¶
Challenge: efficiently coordinating and scaling expert models across multiple nodes, leading to issues like uneven workload distribution, high communication overhead, and difficulty in fault tolerance.
Expert Offloading and Placement¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | DATE | Berkeley | DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference | data-aware offloading; predictive pre-calculation; sequence-specific expert allocation | |||
2025 | arXiv | Stevens Tech | fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving | expert map; iteration-level probability distributions; track fine-grained input semantic embeddings; semantic-based and trajectory-based | |||
2025 | arXiv | Georgia Tech | MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing | ILP for expert placement; cross-layer dependencies; minimizing total dispatched token number | |||
2025 | EuroMLSys | EPFL | Accelerating MoE Model Inference with Expert Sharding | expert sharding for load balancing; tensor sharding for moe experts; fused expert computations for reduced kernel launches | |||
2025 | DAC | PKU | HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference | dynamically balances workloads across GPUs and CPUs; impact-driven prefetching; MoE-specialized cache management | 3 | 4 | 2 |
Batching and Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Alibaba | Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference | statically batching irregular workloads; batch-task-tile partition; decompress the mapping and dispatch the workload | |||
2025 | arXiv | Edinburgh | MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching | module-based batching; high-throughput MoE inference; full KV-cache offloading | |||
2025 | arXiv | KTH | Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference | fine-grained preemption; priority-aware scheduling; per-expert queues; expert-level preemption | |||
2025 | arXiv | UMich | MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints | two-stage performance modeling; analyzes the theoretical performance upper bound; captures how system execution mechanisms | 4 | 4 | 2 |
2025 | arXiv | NVIDIA | MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core | decouples parallelization strategies for attention and MoE layers; flexible and efficient token-level dispatcher; 5-D hybrid parallelism | 4 | 5 | 2 |
Memory and Communication Efficiency¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | fine-grained communication-computation overlapping for efficient MoE execution; dependency resolving method; adaptive workload assignment method; shared data buffers between communication and computation operations | |||
2025 | arXiv | UVA | eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference | expert prediction; task-aware expert loading; task-aware request scheduling | |||
2025 | MobiCom | HKUST | D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving | dually sparsely-gated Mixture-of-Experts; token-adaptive bit-width selection; matryoshka weight quantization; bit-width-aware I/O-compute pipeline | 3 | 4 | 4 |
2025 | OSDI | SJTU | Fast and Live Model Auto Scaling with O(1) Host Caching | auto-scaling with minimal caching; optimize parameter loading; enabling fine-grained layer-level scale | 3 | 3 | 2 |
Architectural Innovations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Shanghai AI | Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts | linear sequence modeling with MoE; sparse activation via moe layers; hybrid models combining linear-moe and transformer-moe layers | |||
2025 | arXiv | Berkeley | HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs | zebra parallelism; attention-expert disaggregation; asymmetric expert assignment mechanism; gather and squeeze strategy | 4 | 5 | 3 |
Compute-Kernel-Level Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | SJTU | Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | dual-side structured sparsity; sparse-sparse matrix multiplication kernel; vector-wise + 2:4 hybrid sparsity; token-aware activation compression |
Long Sequence LLM Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | SJTU & Alibaba | Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache | inefficient model parallelism intra-instance; inefficient resource management inter-instance; KV cache scheduling | |||
2025 | arXiv | PKU | ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs | hybrid data parallelism; data-aware sharding; a heuristic algorithm that reorganizes data assignment based on the characteristics of data and pipeline parallelism |
Sparse Attention¶
Solution: processing very long prompts introduces high latency and memory pressure; sparse attention reduces the computation and memory burden by computing only a subset of the attention matrix (e.g., local windows, selected blocks, or global tokens) instead of the full causal (lower-triangular) matrix.
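A small NumPy sketch contrasting the dense causal mask with one common sparse pattern, a local sliding window plus a few global tokens (the window size and global-token choice are illustrative, not any specific paper's pattern):

```python
import numpy as np

def causal_mask(n):
    """Dense causal attention: each query attends to all previous positions."""
    return np.tril(np.ones((n, n), dtype=bool))

def sparse_mask(n, window=4, num_global=2):
    """Sparse attention: attend only to a local window plus a few global tokens."""
    m = np.zeros((n, n), dtype=bool)
    for q in range(n):
        lo = max(0, q - window + 1)
        m[q, lo:q + 1] = True                     # local sliding window (still causal)
        m[q, :min(num_global, q + 1)] = True      # always attend to the first tokens
    return m

n = 16
full, sparse = causal_mask(n), sparse_mask(n)
print(full.sum(), sparse.sum())                   # far fewer entries need to be computed
```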
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | CWRU | Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques | sparse attention with graph computing perspective; work-optimal graph algorithms; achieve true sparsity | |||
2025 | MLSys | MIT | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | unified sparse attention; hybrid static and dynamic sparsity; hierarchical kv cache management with query-centric pruning |
Ring Computation¶
Solution: exploit a ring-shaped device layout to reduce communication overhead. The key idea is to overlap computation with communication as blocks rotate around the ring.
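A single-process NumPy sketch of the ring idea (the rotation is simulated by indexing instead of real sends; a streaming softmax keeps the blockwise result exact): each "device" owns one query block, KV blocks rotate around the ring, and in a real system the transfer of the next KV block overlaps with the current block's matmuls.

```python
import numpy as np

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Sketch of ring attention: KV blocks rotate around the ring while each
    device accumulates its query block's attention output with a streaming softmax."""
    P = len(Q_blocks)                       # number of devices in the ring
    outputs = []
    for dev in range(P):                    # each device owns one query block
        Q = Q_blocks[dev]
        m = np.full(Q.shape[0], -np.inf)    # running max of scores (per query row)
        denom = np.zeros(Q.shape[0])        # running softmax denominator
        acc = np.zeros_like(Q)              # running weighted sum of values
        for step in range(P):
            src = (dev - step) % P          # KV block resident on this device at this step
            S = Q @ K_blocks[src].T / np.sqrt(Q.shape[1])
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)       # rescale previously accumulated partials
            p = np.exp(S - m_new[:, None])
            denom = denom * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V_blocks[src]
            m = m_new
            # In a real system, the K/V block is sent to the next device here,
            # overlapping that transfer with the matmuls above.
        outputs.append(acc / denom[:, None])
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
scores = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
ref = (scores / scores.sum(axis=1, keepdims=True)) @ V        # plain full attention
out = ring_attention(np.split(Q, 2), np.split(K, 2), np.split(V, 2))
assert np.allclose(out, ref)                                   # ring result matches exactly
```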
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NIPS | UCB | Ring Attention with Blockwise Transformers for Near-Infinite Context | divide the input into blocks and each block is processed by a single GPU; ring-type device layout | 4 | 3 | 3 |
2024 | arXiv | SJTU | TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication | communication-oriented parallelism framework; inter-node P2P bidirectional communication bandwidth; optimization of attention block communication |
P-D Disaggregated Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | PKU | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | goodput-optimized; prefill-decoding interference; novel placement algorithm for p-d schema | |||
2024 | ISCA | UW | Splitwise: Efficient Generative LLM Inference Using Phase Splitting | optimized cache context transfer; performance per dollar; performance per watt; exploration of homogeneous and heterogeneous cluster deployments | |||
2024 | arXiv | CMU | A System for Microserving of LLMs | fine-grained sub-request level actions; dynamic reconfiguration according to workloads; unified KV cache abstraction | |||
2025 | arXiv | PKU | ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments | two-level hierarchical optimization; tabu search algorithm for GPU partition; a lightweight re-scheduling mechanism |
P-D Disaggregated System Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | KVDirect: Distributed Disaggregated LLM Inference | tensor-centric communication mechanism; pull-based KV cache transfer; dynamic GPU resource scheduling via RDMA | |||
2025 | arXiv | SYSU | Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation | attention disaggregation and offloading mechanism; low-latency decoding synchronization; resource-efficient prefill colocation; load-aware offloading scheduling | 4 | 4 | 3 |
2025 | arXiv | Alibaba | FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling | analyze the communication patterns; KV cache structure adjustment method; load-aware scheduling | 4 | 4 | 2 |
2025 | arXiv | NUS & USTC | DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving | a novel Tandem Serving execution model; two virtual subrequests; explicitly permit the two subrequests to execute on either GPU instance | 3 | 4 | 2 |
Throughput-Optimized Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HKUST | Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation | sampling-then-simulation cost model; model-level pipeline parallelism; minimum-total-latency application scheduling | 4 | 4 | 3 |
Fair Serving Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | Virginia Tech | Ensuring Fair LLM Serving Amid Diverse Applications | multi-tenant LLM platform; overload and interaction-driven throttling; weighted service counter | |||
2025 | arXiv | UIUC | Hierarchical Autoscaling for Large Language Model Serving with Chiron | hierarchical backpressure; interactive requests and batch requests; mixed instances | |||
2025 | arXiv | Berkeley | Locality-aware Fair Scheduling in LLM Serving | deficit-based longest prefix matching; distributed deficit-round coordination; prefix-aware fairness bound analysis |
RLHF System¶
Challenge: an RLHF system includes both training and inference. On top of that, multiple agents (LLMs) run in parallel, which makes the data flow more complex.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | EuroSys | HKU | HybridFlow: A Flexible and Efficient RLHF Framework | auto-mapping model placement; 3D-HybridEngine to reduce the communication overhead; hybrid programming | 4 | 4 | 3 |
Communication-Computation Overlap¶
Challenge: effectively hiding communication latency by overlapping it with computation, which requires careful scheduling and resource management to avoid bottlenecks and ensure that both communication and computation proceed efficiently without stalling each other.
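A minimal thread-based sketch of the overlap idea (the sleeps stand in for GPU kernels and NCCL collectives; timings and names are illustrative): launch each layer's gradient all-reduce asynchronously so it runs while the next layer's gradients are being computed.

```python
import threading
import time

def fake_all_reduce(name, seconds=0.2):
    """Stand-in for a collective; in a real system this is an async NCCL call."""
    time.sleep(seconds)
    print(f"all-reduce of {name} finished")

def compute_grad(name, seconds=0.2):
    """Stand-in for computing one layer's gradients."""
    time.sleep(seconds)
    print(f"gradient of {name} computed")

layers = ["layer3", "layer2", "layer1"]            # backward-pass order
start = time.time()
pending = None
for layer in layers:
    compute_grad(layer)                            # compute this layer's gradient...
    if pending:
        pending.join()                             # ...while the previous all-reduce ran
    pending = threading.Thread(target=fake_all_reduce, args=(layer,))
    pending.start()                                # launch communication asynchronously
pending.join()
print(f"total: {time.time() - start:.2f}s (vs. ~{0.4 * len(layers):.1f}s fully serialized)")
```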
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NSDI | KAIST | ARK: GPU-driven Code Execution for Distributed Deep Learning | communication-motivated DL system; pipeline DMA engine; GPU-direct-controlled DMA | |||
2024 | ASPLOS | PKU | Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | communication partition abstraction; hybrid LLM training tasks; 3-level decompose | |||
2024 | ASPLOS | UW–Madison | T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives | lightweight track and trigger; pre-programmed DMA commands; atomic memory update | |||
2024 | ASPLOS | UIUC | Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM | distributed SpMM; sparsity-aware partition; Synchronous Stripes and Asynchronous Stripes | |||
2024 | arXiv | AMD | Optimizing ML Concurrent Computation and Communication with GPU DMA Engines | concurrent computation and communication; compute and memory interference among concurrent kernels; schedule prioritization and careful resource partitioning |
Configuration Optimization¶
Challenge: the configuration space is too large to be searched manually.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | OSDI | PKU | Mirage: A Multi-Level Superoptimizer for Tensor Programs | automatic algebraic transformations of tensor programs; DAG-based search over the configuration space; automatic kernel generation | 4 | 4 | 3 |
2020 | ASPLOS | PKU | FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System | TVM auto-scheduling; RL-based strategy search; automatic optimization in a large configuration space | 4 | 4 | 3 |