Distributed Systems¶
Distributed algorithms¶
Focusing on distributed algorithms such as consensus and replication, e.g., Raft.
Challenge: concurrency, synchronization, and communication complexity across independent nodes.
Solution: distributed algorithms that coordinate computation and data management across multiple independent computer systems.
Computing Framework¶
Solution: Developing distributed algorithms requires a clear understanding of the computing framework, which scales out small computing units to process data at large scale. Common computing frameworks include MapReduce, Spark, etc.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2004 | OSDI | Google | MapReduce: simplified data processing on large clusters | divide the data processing into map and reduce stages; use master-worker architecture | 4 | 5 | 5 |
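To make the map/reduce split above concrete, here is a minimal in-process sketch (the word-count example and function names are illustrative, not taken from any particular framework; a real deployment runs the map and reduce tasks on separate workers coordinated by a master):

```python
from collections import defaultdict

def map_phase(documents):
    """Map stage: each worker emits (key, value) pairs independently."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce stage: aggregate all values that share a key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))  # {'the': 3, 'fox': 2, ...}
```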
Domain Specific Computing Framework¶
Challenge: different application scenarios impose their own specific constraints and bounds.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | PPoPP | NUDT | GraphCube: Interconnection Hierarchy-aware Graph Processing | interconnection hierarchy-aware; topology-aware graph partitioning; extreme-scale graph processing | 4 | 5 | 5 |
Parallel Strategies¶
Solution: using the computation and memory resources of multiple processors to solve a problem.
Challenge: communication overhead and load balancing
Data Parallelism ¶
Solution: Data parallelism addresses scenarios where a single GPU can accommodate the model, but the dataset's size necessitates distribution across multiple GPUs for efficient processing and accelerated training.
Modern DNN acceleration systems commonly use the combination of data parallelism and model parallelism.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2012 | Nips | Google | Large Scale Distributed Deep Networks | data parallelism; multiple model replicas jointly optimize the same model on different data shards; distributed model training | 3 | 4 | 3 |
2014 | OSDI | CMU | Scaling Distributed Machine Learning with the Parameter Server | the foundation of data-parallel distributed training; parameter server; pull-based data transfer | 4 | 5 | 3 |
2020 | SC | Microsoft | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | fixes the problem that data parallelism alone cannot reduce per-GPU memory usage | 3 | 4 | 3 |
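A minimal sketch of the data-parallel idea behind these systems, assuming a toy linear-regression workload with NumPy arrays standing in for per-worker shards; real systems replace the gradient averaging with an all-reduce or a parameter server:

```python
import numpy as np

# Toy data-parallel training: every worker holds a full copy of the weights,
# computes gradients on its own data shard, and the gradients are averaged
# before an identical update is applied everywhere.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(128, 2))
y = X @ w_true

num_workers = 4
shards = np.array_split(np.arange(len(X)), num_workers)
w = np.zeros(2)
lr = 0.1

for step in range(50):
    grads = []
    for shard in shards:                      # each iteration = one worker's local compute
        Xs, ys = X[shard], y[shard]
        grads.append(2 * Xs.T @ (Xs @ w - ys) / len(shard))
    w -= lr * np.mean(grads, axis=0)          # "all-reduce": average gradients across workers

print(w)  # close to w_true
```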
Model Parallelism ¶
Solution: Model parallelism addresses scenarios where the model's size exceeds the processing and memory capacity of a single GPU. There are two types of model parallelism:
- Pipeline parallelism: divide the model into pipeline stages; each GPU processes one or more stages.
- Tensor parallelism: split individual tensors across different GPUs.
Usually, pipeline parallelism and tensor parallelism are used together.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | arXiv | NVIDIA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | transformer-based intra-layer (tensor) model parallelism; divide the model across GPUs | 3 | 4 | 3 |
2021 | SC | NVIDIA | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Megatron 2; dives deep into tensor parallelism; how to train an LLM on thousands of GPUs | 4 | 4 | 3 |
2022 | arXiv | NVIDIA | Reducing Activation Recomputation in Large Transformer Models | Megatron3; sequence parallel; selective activation recomputation; reduce the amount of recomputed activation | 3 | 4 | 3 |
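A minimal NumPy sketch of the tensor-parallel split used by Megatron-style systems; the two halves of the weight matrix stand in for two GPUs, and the comment contrasts it with pipeline parallelism, which splits whole layers rather than tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))        # activations, replicated on every "device"
W = rng.normal(size=(16, 32))       # full weight matrix of one linear layer

# Tensor parallelism: split W by output columns across two devices; each device
# computes its partial output, and the shards are concatenated (an all-gather).
W0, W1 = np.hsplit(W, 2)
y_tp = np.concatenate([x @ W0, x @ W1], axis=1)

# Pipeline parallelism, by contrast, splits *layers* across devices: device 0 runs
# layer 0, sends activations to device 1, which runs layer 1, and so on.
assert np.allclose(y_tp, x @ W)     # sharded result matches the single-device result
```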
LLM-specific Parallel Strategies¶
Focusing on the parallel strategies for LLM-specific deep learning systems.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2022 | ACL | NUS | Sequence Parallelism: Long Sequence Training from System Perspective | splits input sequences into chunks; Ring Self-Attention; sparse attention | 3 | 4 | 3 |
Cloud computing platforms and architectures¶
Challenge: providing services to users requires scalability, resource management, fault tolerance, and cost-effectiveness when building and deploying large-scale distributed applications and services.
Cloud Platform LLM Scheduling¶
Challenge: meeting SLOs when providing LLM services on a cloud platform.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Azure | TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms | thermal/power property characterization; dynamically adjusts in response to power or cooling failures; thermal- and power-aware manner |
Microservices¶
Focusing on microservices.
Memory Management¶
Challenge: coordinating memory access and maintaining data consistency across multiple independent nodes with their own local memories, especially when dealing with shared data.
Remote Memory¶
Challenge: efficiently providing access to memory on a remote node while minimizing latency and overhead, and ensuring consistency and reliability despite network communication complexities and potential failures.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2020 | TC | Georgia Tech | Hierarchical Orchestration of Disaggregated Memory | XMemPod architecture for hierarchical memory orchestration; compressed swap page table (CSPT) for metadata management; hybrid swap-out algorithm for memory utilization; proactive swap-in optimization for performance; RDMA-based remote memory sharing for low-latency access | |||
2025 | ATC | HUST | Fast Distributed Transactions for RDMA-based Disaggregated Memory | fast commit protocol by coalescing validation and commit phases; RDMA-enabled offloading for data synchronization; priority-based locking for mission-critical transactions | 2 | 3 | 4 |
Scratchpad Memory¶
Challenge: efficiently allocating and coordinating limited fast memory across distributed nodes to minimize access latency and contention, while ensuring data consistency and scalability.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | ASPLOS | Cornell | Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories | work-stealing based dynamic task parallelism; stack/task queue in SPM; read-only data duplication | 3 | 3 | 3 |
Memory Optimization for Graph Processing¶
Challenge: efficiently handling the huge memory requirements of graph processing.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | PPoPP | KAIST | INFINEL: An efficient GPU-based processing method for unpredictable large output graph queries | unpredictable large output queries; one-phase GPU graph processing; kernel stop/restart | 4 | 4 | 3 |
LLM Memory Management¶
Solution: efficient memory management reduces memory usage, enabling larger batch sizes and higher throughput.
Memory Management Algorithms¶
Solution: efficient memory management algorithms, such as virtual memory and page tables.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | SOSP | UCB | Efficient Memory Management for Large Language Model Serving with PagedAttention | Paged KV-Cache management; Better memory management for larger batch size; Preemptive memory scheduling | 4 | 5 | 3 |
2025 | ASPLOS | Microsoft | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | uses the GPU hardware page table instead of vLLM's software paging; hooks the CUDA driver to support page-table modification | 2 | 3 | 3 |
2025 | arXiv | SJTU | eLLM: Elastic Memory Management Framework for Efficient LLM Serving | paged activations and weights; virtual memory across all scenarios; CPU memory swapping | 3 | 2 | 2 |
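A toy sketch of paged KV-cache bookkeeping in the spirit of PagedAttention; the class and method names are illustrative, and a real system stores actual key/value tensors in the physical blocks and layers eviction/preemption policies on top:

```python
class PagedKVCache:
    """Toy block allocator: logical token positions map to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block free list
        self.block_tables = {}                       # request id -> list of physical blocks
        self.lengths = {}                            # request id -> tokens written

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        used = self.lengths.get(req_id, 0)
        if used % self.block_size == 0:              # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; evict or preempt a request")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = used + 1
        block = table[-1]
        return block, used % self.block_size         # physical (block, offset) for this token

    def release(self, req_id):
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
print([cache.append_token("req-A") for _ in range(3)])  # [(0, 0), (0, 1), (1, 0)]
cache.release("req-A")
```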
Tradeoff between compute and memory¶
Solution: Transformer workloads are largely compute-bound. To improve performance, recomputation can sometimes be used to trade extra compute for reduced memory usage.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2022 | NIPS | Stanford | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Generalized Acceleration of Attention Mechanisms; Change attention to utilize the SRAM on GPU; use recompute to reduce IO burden | 4 | 5 | 4 |
2023 | ICLR | Stanford | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | optimize the thread block parallelization of attention; parallel memory access; reduce non-matmul operations | 4 | 4 | 3 |
2024 | Nips | Stanford | FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | Hopper architecture based optimization; fp8 quantization; backward support | 3 | 3 | 3 |
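A toy sketch of the recomputation (activation checkpointing) trade-off: keep activations only at segment boundaries and re-run the forward pass within a segment when the backward pass needs them. The layer function is a stand-in, not FlashAttention's kernel-level recomputation:

```python
import numpy as np

def layer(x):                      # stand-in for one transformer layer's forward pass
    return np.tanh(x)

def forward_checkpointed(x, layers, segment):
    """Store activations only at segment boundaries instead of after every layer."""
    saved = [x]
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % segment == 0:
            saved.append(x)
    return x, saved                # memory: O(num_layers / segment) activations

def recompute_segment(saved_input, layers_in_segment):
    """During backward, re-run the forward pass of a segment from its saved input."""
    acts = [saved_input]
    for f in layers_in_segment:
        acts.append(f(acts[-1]))
    return acts                    # extra compute in exchange for the memory saved

layers = [layer] * 8
out, saved = forward_checkpointed(np.ones((2, 4)), layers, segment=4)
print(len(saved), "checkpoints kept instead of", len(layers), "activations")
```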
General LLM Memory Management¶
Challenge: LLM memory management faces challenges like limited HBM memory, efficient KV Cache management, memory sharing between multiple GPUs, multi-level memory management.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2022 | SC | Microsoft | DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | kernel fusion; GPU-CPU-NVMe heterogeneous memory; PCIe-based memory prefetch | 4 | 4 | 3 |
2025 | arXiv | THU | Jenga: Effective Memory Management for Serving LLM with Heterogeneity | fixed-size embeddings; full-prefix dependency; two-level memory allocator | 4 | 4 | 3 |
2025 | FAST | THU | Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot | PD-disaggregate system; kv-cache centered; global kv-cache pool; dynamic SLO scheduler; paged KV-Cache storage | 3 | 4 | 2 |
Application specific memory management¶
Solution: Memory management is at the core of request scheduling. Application-specific memory management uses application information to manage memory.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UCSD | KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows | agent graph; prefetch KV Cache from CPU for next agent; agent-aware prefix cache management | 2 | 2 | 2 |
KV Cache Reuse Systems¶
Solution: reduce redundant computation and high memory consumption during inference by allowing the reuse of previously computed key-value pairs for shared or repeated parts of input sequences.
Prefix Sharing¶
Solution: reuse the KV Cache when input sequences have shared or repeated parts; use a prefix tree to manage the KV Cache.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | Nips | Stanford | SGLang: Efficient Execution of Structured Language Model Programs | KV-Cache share; python-like DSL; compute graph; LRU cache management strategy | 4 | 4 | 3 |
2024 | ACL | Microsoft | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | prefix aware attention compute; manage kv-cache chunks as prefix tree; reduce kv-cache redundancy | 3 | 4 | 2 |
2024 | arXiv | Microsoft | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | global prefix tree ahead-of-time; request reorder; horizontally fused prefix-shared attention kernel |||
2024 | arXiv | Berkeley | BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching | offline batch inference; resource-aware prefix tree; compute-intensive / memory-intensive requests | |||
2024 | arXiv | UChicago | DroidSpeak: Enhancing Cross-LLM Communication | selectively layer reuse; communication protocol for inter-agent exchanges; LLMs that share a common foundational model |
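A minimal sketch of the prefix-tree bookkeeping used for KV-cache reuse (illustrative only; systems such as SGLang build on this idea with a radix tree plus reference counting and LRU eviction):

```python
class PrefixNode:
    def __init__(self):
        self.children = {}     # token id -> PrefixNode
        self.kv_handle = None  # e.g., the KV-cache blocks for the prefix ending here

class PrefixTree:
    """Toy prefix tree: insert token sequences and look up the longest cached prefix."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens, kv_handle):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens):
        node, best_len, best_handle = self.root, 0, None
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.kv_handle is not None:
                best_len, best_handle = i + 1, node.kv_handle
        return best_len, best_handle   # reuse this many tokens' KV; recompute the rest

tree = PrefixTree()
tree.insert([1, 2, 3], kv_handle="blocks-for-system-prompt")
print(tree.longest_prefix([1, 2, 3, 9, 9]))  # (3, 'blocks-for-system-prompt')
```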
KV cache store¶
Solution: store the KV cache in the memory or other storage device, supporting multi-level storage.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | ATC | Huawei | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | store KV cache in the memory; multi level KV cache management; position mask modified | 3 | 3 | 3 |
2024 | SIGCOMM | UChicago | CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | efficient KV Cache streaming; KV Cache compression; knowledge delivery network; The transfer part of LMCache | 3 | 4 | 3 |
2024 | EuroSys | UChicago | CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | multiple precomputed text chunks; selective KV recompute; sparsity of attention matrices; The system intro of LMCache | 3 | 4 | 3 |
Other Techniques¶
Solution: KV cache reuse techniques beyond prefix sharing, since an exact shared prefix is a strong requirement and is not always available.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | Berkeley | Optimizing LLM Queries in Relational Workloads | prefix sharing maximization; KV cache hit rate; deduplication and cost estimation techniques |
KV Cache Storage Systems¶
Solution: efficiently storing and retrieving the key-value cache so it can be reused when needed.
Challenge: the prefetch and eviction of the KV cache, the balance between saving GPU memory and refetching time from the storage device.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | NVIDIA | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | block-sparse format; customizable attention template; dynamic load-balanced scheduling framework | |||
2025 | arXiv | PKU | FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference | imbalanced KV cache compression mitigation; fair-copying for load balancing; best-effort assignment |
KV Cache Evict Systems¶
Challenge: selectively discarding the least important key-value pairs to free memory for longer contexts or larger batch sizes, without significantly degrading the model's generation quality or adding computational overhead for the eviction process itself.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NIPS | UT-Austin | H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | sparsity for small cache size; heavy-hitters; greedy algorithm for low-cost policy | |||
2024 | arXiv | Fujitsu | CO2: Precise Attention Score Observation for improving KV Cache Replacement in Large Language Models | long measurement step; decay of the accumulated attention score; adjusting FIFO cache size |
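A toy score-based eviction policy in the spirit of heavy-hitter approaches such as H2O (a hedged sketch: keep the tokens with the largest accumulated attention mass plus a recent window; the scores and thresholds are illustrative, not from either paper):

```python
import numpy as np

def evict_kv(accumulated_scores, keep_heavy, keep_recent):
    """Return indices of tokens to keep: top-scoring 'heavy hitters' plus a recent window."""
    n = len(accumulated_scores)
    recent = set(range(max(0, n - keep_recent), n))
    older = [i for i in range(n) if i not in recent]
    # among older tokens, keep the ones with the largest accumulated attention scores
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:keep_heavy]
    return sorted(set(heavy) | recent)

scores = np.array([0.9, 0.1, 0.05, 0.7, 0.02, 0.03, 0.4, 0.1])
print(evict_kv(scores, keep_heavy=2, keep_recent=3))  # [0, 3, 5, 6, 7]
```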
Systems with Other Caches¶
Solution: use other caches (not just KV cache) to improve the performance of LLM inference.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | KAIST | Efficient LLM Inference with Activation Checkpointing and Hybrid Caching | activation checkpointing; KV-activation hybrid caching; balanced approach to determine the best ratio |
LLM Prefetching¶
Solution: prefetch data to hide memory transfers between devices and reduce memory access latency.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Huawei Zurich | PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving | computational graph-based prefetching; prefetch KV cache to L2 cache |
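A toy sketch of weight prefetching: a background thread fetches the next layer's weights while the current layer computes, so the transfer latency is hidden. Real systems use CUDA streams or DMA engines rather than Python threads; the function names and timings here are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_weights(layer_id):
    time.sleep(0.05)                  # stand-in for a CPU->GPU (or remote) transfer
    return f"weights[{layer_id}]"

def compute(layer_id, weights):
    time.sleep(0.05)                  # stand-in for running the layer
    return f"out[{layer_id}] using {weights}"

def run(num_layers):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_weights, 0)              # prefetch layer 0
        for i in range(num_layers):
            weights = pending.result()                       # wait only if transfer lags compute
            if i + 1 < num_layers:
                pending = pool.submit(fetch_weights, i + 1)  # overlap next transfer with compute
            compute(i, weights)

start = time.time()
run(num_layers=8)
print(f"elapsed ~{time.time() - start:.2f}s (vs ~0.80s if transfers were not overlapped)")
```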
Communication-Centric Optimization¶
Challenge: communication is a bottleneck in many distributed systems; these works aim to reduce it.
I/O Characterization and Optimization¶
Challenge: minimize data movement and maximize resource utilization across heterogeneous distributed environments.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2020 | ASPLOS | CMU | Livia: Data-Centric Computing Throughout the Memory Hierarchy | Memory service programming model; task graphs linked to data location; dynamic task/data scheduling for minimal movement | 2 | 4 | 3 |
2025 | arXiv | UOregon | Parallel I/O Characterization and Optimization on Large-Scale HPC Systems: A 360-Degree Survey | different HPC I/O stack layers; profiling and tracing tools; tuning techniques |
GPU-GPU Communication¶
Challenge: limited interconnect bandwidth between GPUs over NVLink or PCIe, synchronization delays in parallel workloads, and load imbalance across GPUs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Apple | SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models | sync-point drop; block-wise sensitivity analysis; attention output synchronization reduction | |||
2025 | arXiv | Microsoft | Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters | heterogeneous pipeline stages with flexible GPU counts and types; CPU offloading of both parameters and activations | 4 | 4 | 2 |
Many-Core Systems¶
Challenge: the heterogeneity of cores, the load imbalance, and the communication overhead.
Workload Characterization¶
Challenge: dynamic workloads across numerous cores, resource contention for shared hardware.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2015 | VLDB | Intel | GraphMat: High performance graph analytics made productive | vertex program to sparse matrix mapping; generalized SPMV for graph analytics; single-node multicore framework | 4 | 4 | 4 |
2018 | SC | Intel | Many-Core Graph Workload Analysis | multicore simulator sniper; selective caching and prefetching; heterogeneous high-performance low-power cores | |||
2018 | DATE | UGA | Parallel Code Generation of Synchronous Programs for a Many-core Architecture | banked memory mapping; worst-case response time analysis | |||
2025 | IPDPS | UChicago | Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems | lock-less and concurrent task queue xqueue; distributed tree barrier; NUMA-aware redirect push/work stealing |
Fault Propagation¶
Challenge: a fault in one core or component can easily spread to others due to shared resources, leading to system-wide reliability issues. Growing core counts make it hard to predict, detect, and contain errors effectively.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2008 | ASPLOS | UIUC | Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design | stuck-at fault; bridging fault; software failure detection | |||
2010 | PRDC | UBC | Modeling the Propagation of Intermittent Hardware Faults in Programs | instruction based intermittent fault; dynamic dependency graph (DDG) based propagation modeling |||
2015 | SC | IBM | Understanding the Propagation of Transient Errors in HPC Applications | fault propagation in MPI application; fault classification: V, ONA, WO, PEX, C; fault propagation speed factors |||
2023 | ISCA | UChicago | Understanding and Mitigating Hardware Failures in Deep Learning Training Accelerator Systems | NVDLA based fault injection framework; re-execution based light-weight recovery technique; failure effects: SlowDegrade, SharpSlowDegrade, SharpDegrade, LowTestAccuracy |
Fault Injection Technique¶
Challenge: It is difficult to target specific components, reproduce realistic fault scenarios, and observe system behavior without disturbing normal operation, especially as system scale and complexity increase.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2008 | VLSI | DISCA | Enhancement of Fault Injection Techniques Based on the Modification of VHDL Code | saboteurs and mutants technique based fault injection; VHDL level fault-tolerance mechanism | |||
2014 | DSN | UBC | Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults | fault injection quantification; assembly level fault injection; LLVM compiler based fault injector |
Communication¶
Challenge: efficiently managing data exchange between a large number of cores, due to limited bandwidth, high latency, and contention in shared resources like interconnects and memory.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UCLM | Understanding intra-node communication in HPC systems and Datacenters | intra- and inter-node simulation model; intra-node network interface bottleneck; impacts of communication pattern |
Heterogeneous Systems¶
Heterogeneous systems are systems that contain different types of processors, such as CPUs and GPUs.
Solution: utilize the heterogeneous resources to improve performance.
General Applications¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2013 | SOSP | MSR Silicon Valley | Dandelion: a Compiler and Runtime for Heterogeneous Systems | unified programming model; “single machine” abstraction; a rich object-oriented programming language for data-parallel computing | |||
2025 | EuroSys | SJTU | Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing | Bubble-less spatial-temporal sharing; kernel squad scheduling; fine-grained concurrent kernel management | 4 | 3 | 2 |
2025 | ISPASS | CMU | Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures | Effective regions for balanced utilization of PUs; Proximity-based kernel fusion recommendation; operator-kernel dependency graphs from PyTorch Profiler traces | 3 | 4 | 2 |
Decentralized Serving¶
Challenge: managing diverse hardware and software environments, balancing workloads across uneven resources, minimizing communication overhead, ensuring consistency without centralized control.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | ASPLOS | USC | Hop: Heterogeneity-aware Decentralized Training | iteration gap; queue-based synchronization; backup workers and bounded staleness | |||
2020 | ASPLOS | USC | Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training | Partial All-Reduce to reduce synchronization cost; group scheduling to avoid conflicts | |||
2025 | arXiv | Berkeley | DeServe: Towards Affordable Offline LLM Inference via Decentralization | decentralized LLM inference; high-latency optimization; idle GPU utilization; modular on-chain integration | |||
2025 | arXiv | HKUST | DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization | partial synchronization based local SGD; DFS algorithm with pruned search space; enables the opportunity of overlapping communication and computation |
ML Training Systems¶
Solution: balance between faster training and high precision.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | SOSP | CMU | Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling | heterogeneity-aware and adaptivity-aware; ILP formulation for scheduling; bootstrapped from observing just a few mini-batches |
LLM Inference Heterogeneous Systems ¶
Solution: managing diverse hardware and software environments, balancing workloads across uneven resources, meeting the SLO.
Mobile & Edge-Network Serving¶
Challenge: limited computation, memory, power coupled with intermittent and unreliable network connectivity, making it difficult to perform computationally intensive training tasks, manage large datasets, and ensure efficient communication and synchronization across distributed edge nodes.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | UIC | Priority-Aware Model-Distributed Inference at Edge Networks | priority-aware model distributed inference algorithm; prioritization of ML inference tasks; model-distributed inferencing mechanism | |||
2024 | arXiv | Yonsei | Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models | hybrid language model; selectively skip uplink transmissions; uncertainty-aware | |||
2024 | arXiv | UMD | Distributed Mixture-of-Agents for Edge Inference with Large Language Models | Mixture-of-Agents; semantics of the data being gossiped and its timeliness; queuing stability | |||
2025 | arXiv | PKU | SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network | hierarchical split learning; edge-cloud collaboration; LoRA adapter update | |||
2025 | arXiv | SJTU | HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators | both layer-level and tensor-level GPU-NPU parallelism; different tensor partition strategies; fast synchronization mechanism based on predictable kernel waiting times; tensor partition solver |
GPU-GPU Heterogeneous System¶
Solution: the system is composed of heterogeneous GPUs, without running inference on the CPU. The system needs to manage communication and memory across the heterogeneous GPUs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | CMU | Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | LLM model placement as a max-flow problem; per-request pipeline; mixed integer linear programming | |||
2025 | ICLR | HKUST | HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment | a combination of graph partitioning and max-flow algorithm; TP and PP with disaggregation; bottleneck and underutilized edges; swap edges |
XPU-GPU Heterogeneous System¶
Challenge: effectively managing and coordinating diverse hardware (CPUs, TPUs, etc.), interconnects, and memory hierarchies
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | ICML | Stanford | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | dynamic offload tensor; quantize the weights to 4-bits; linear aggregation of the store and load operations | 4 | 4 | 3 |
2025 | arXiv | CMU | Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures | SKIP profiling tool; TKLQT metric for CPU/GPU boundedness; proximity score kernel fusion | 2 | 3 | 2 |
2025 | SPAA | Huawei | WindVE: Collaborative CPU-NPU Vector Embedding | seamless CPU-NPU collaboration for vector embedding; linear regression based estimator; high-throughput offloading vector embedding | 2 | 4 | 3 |
2025 | arXiv | Huawei | High-Throughput LLM inference on Heterogeneous Clusters | lightweight profiling while avoiding resource-intensive throughput benchmarks; a scheduler that accounts for both instance computational capacity and memory usage; exhaustive search method | 2 | 4 | 2 |
2025 | ISCA | KAIST | EOD: Enabling Low Latency GNN Inference via Near-Memory Concatenate Aggregation | concatenated ZVC compression; precomputation for neighborhood explosion problem | 2 | 3 | 2 |
Heterogeneous Device Task Scheduling¶
Solution: assigning different parts of the LLM serving workload to the most suitable heterogeneous devices to maximize throughput and minimize latency.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | PACT | Yonsei | Virtual PIM: Resource-aware Dynamic DPU Allocation and Workload Scheduling Framework for Multi-DPU PIM Architecture | dynamic DPU allocation for multitasking; fine-grained scheduling | 3 | 2 | 2 |
2025 | arXiv | NUS | Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems | lightweight and input-aware framework; multiobjective and multi-constraint design space; dynamically creating optimal schedules | |||
2025 | HPCA | Samsung | PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM | task scheduling algorithm across host and PIM; interleave-batched GEMM; data layout adjustment | 2 | 3 | 3 |
Task Scheduling for specific tasks¶
Solution: in specific scenarios the scheduling goal differs; assigning tasks to different devices can close the gap between device characteristics and task characteristics.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | HPCA | Princeton | Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications | distributed data-local tiled architecture; task-based programming for pointer indirection; traffic-aware task scheduling with headerless NoC | 3 | 3 | 3 |
2025 | arXiv | Georgia Tech | HARP: A Taxonomy for Heterogeneous and Hierarchical Processors for Mixed-reuse Workloads | a taxonomy to classify the heterogeneous and hierarchical accelerators; characterize hardware organization of different accelerators; classify based on relative location of sub-accelerators | |||
2025 | arXiv | PKU | Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC | agent application-specific scheduling on heterogeneous SoC; heterogeneous execution graph with elastic kernels; bandwidth-aware dispatch for NPU-iGPU contention mitigation | 3 | 2 | 3 |
LLM Training Heterogeneous Systems¶
Solution: compared to LLM inference heterogeneous systems, these additionally need to handle the backward pass and heterogeneity issues.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | PKU | Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences | data sampling imbalance; data packing imbalance; subgraph abstraction | |||
2024 | arXiv | Ant Group | EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models | Local Stochastic Gradient Descent (Local SGD); consistent stragglers within heterogeneous devices; hierarchical distribution strategy on a two-dimensional device mesh; layer by layer forward syncing; pseudo-gradient penalty method | |||
2024 | arXiv | ZJU | Frenzy: A Memory-Aware Serverless LLM Training System for Heterogeneous GPU Clusters | efficient and low-overhead task-to-cluster scheduling; bin-packing algorithms; seamless and user-friendly | |||
2025 | arXiv | OSU | Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning | low-bandwidth interconnects; three-level hierarchical partitioning strategy; improved hierarchical partitioning on top of ZeRO++ | |||
2025 | arXiv | PKU | Split Fine-Tuning for Large Language Models in Wireless Networks | split fine-tuning; device and server partition; novel compression scheme and resource management algorithm | |||
2025 | arXiv | Neuchatel | SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks | partial pipeline parallelism; stage skipping; path scheduling algorithm |
Schedule Optimization¶
Solution: develop task scheduling algorithms to achieve efficient overall system performance despite incomplete and evolving system state information.
General Task Scheduling¶
Solution: optimizing the allocation and execution of diverse and dynamic workloads.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2019 | NSDI | MIT | Shinjuku: Preemptive Scheduling for µsecond-scale Tail Latency | preemptive scheduling; single-address space OS; hardware-supported virtualization | |||
2021 | SOSP | UPenn | When Idling is Ideal: Optimizing Tail-Latency for Heavy-Tailed Datacenter Workloads with Perséphone | reserve cores; non-conserving; request dispatching algorithm | |||
2017 | HPCA | UGent | Reliability-Aware Scheduling on Heterogeneous Multicore Processors | core reliability characteristics difference; system soft error rate; sampling-based reliability-aware scheduling algorithm | |||
2020 | TCAD | ASU | Runtime Task Scheduling Using Imitation Learning for Heterogeneous Many-Core Systems | offline Oracle optimization strategy; hierarchical imitation learning based scheduling; two-level scheduling |
Speculative Execution (Non-LLM) ¶
Solution: balancing the potential performance gains from speculative executions, including accurately predicting outcomes, handling incorrect speculations and their side effects across multiple nodes.
Refer to LLM Speculative Inference for the speculative execution algorithms for LLMs.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | MSR | Forerunner: Constraint-based Speculative Transaction Execution for Ethereum | constraint-based speculative transaction execution; many-future nature; specialized fast-path program | |||
2024 | arXiv | Politecnico di Milano | Minimizing speculation overhead in a parallel recognizer for regular texts | speculation overhead; chunk automaton; reduced-interface DFA |
LLM-Related Scheduling ¶
Challenge: efficiently managing the immense computational and memory demands of training and inference across numerous interconnected devices, requiring sophisticated strategies to partition massive models.
LLM Request Scheduling¶
Solution: develop intelligent strategies to route requests, prioritize urgent or critical tasks, handle varying input lengths and complexities, and manage resource contention to meet SLO requirements.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | UCSB | Multi-Bin Batching for Increasing LLM Inference Throughput | binning-based scheduling strategy; queueing-theoretical analysis; asymptotical throughput optimality | |||
2024 | arXiv | Yale | TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications | segmented generation; time-sensitive scheduling; latency-guided batch size selection | |||
2025 | arXiv | MSRI | Niyama : Breaking the Silos of LLM Inference Serving | QoS-driven LLM inference serving system; co-scheduling requests with diverse QoS targets on a shared rather than siloed infrastructure; allows graceful service degradation during overload conditions; deadline slack; a hybrid prioritization and an eager relegation policy | 4 | 4 | 3 |
2025 | arXiv | MIT | Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints | fluid dynamics approximation; Waiting for Accumulated Inference Threshold; a hierarchical framework comprising multiple segments | 3 | 4 | 2 |
2025 | arXiv | PKU | SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference | service-aware and latency-optimized scheduling algorithm; doubling budget (DB) scheduling algorithm; search-based placement algorithm | 3 | 4 | 2 |
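A toy illustration of SLO-aware request admission: order waiting requests by deadline slack and admit them until a KV-cache budget is exhausted. This is a generic sketch, not the algorithm of any paper above; the request tuples and budget unit are assumptions:

```python
def schedule_batch(waiting, now, kv_budget):
    """Pick requests for the next batch by smallest deadline slack, subject to a KV budget.
    Each request is (req_id, deadline, est_kv_blocks); purely illustrative."""
    by_slack = sorted(waiting, key=lambda r: r[1] - now)   # least slack first
    batch, used = [], 0
    for req_id, deadline, kv_blocks in by_slack:
        if used + kv_blocks <= kv_budget:
            batch.append(req_id)
            used += kv_blocks
    return batch

waiting = [("chat-1", 105, 8), ("batch-7", 500, 32), ("chat-2", 102, 6), ("agent-3", 120, 20)]
print(schedule_batch(waiting, now=100, kv_budget=40))  # tight-SLO requests admitted first
```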
Info Predict Scheduling¶
Challenge: general scheduling aims at better batching and meeting SLO requirements. By predicting information about requests (e.g., output length), scheduling can be made more efficient.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | Nips | Harvard | S3: Increasing GPU Utilization during Generative Inference for Higher Throughput | predict LLM request output lengths into fixed length buckets; Orca-based dynamic batching | 3 | 2 | 3 |
2024 | ASPLOS | UIUC | Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | length prediction; left time prediction; bert-based proxy model | 4 | 3 | 2 |
LLM Application-Level Scheduling¶
Solution: optimize the end-to-end latency of the application, including the scheduling of LLM instances.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | SJTU | Parrot: Efficient Serving of LLM-based Applications with Semantic Variable | Semantic Variable; application-level information; LLM applications as first-class citizens | |||
2024 | OSDI | CUHK | Teola: Towards End-to-End Optimization of LLM-based Applications | mismatch between request-level scheduling and end-to-end application performance; primitive-level dataflow graph; two-tier scheduling mechanism | |||
2024 | arXiv | Yext | SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering | constantly changing and sometimes adverse conditions; Dynamically Reconfigurable Horizontal Scaling Framework; dynamically adjust resource allocation based on query requirements | |||
2025 | arXiv | Berkeley | Autellix: An Efficient Serving Engine for LLM Agents as General Programs | formalize agentic programs as dynamic, non-deterministic DAGs; non-clairvoyant scheduler; simple load-balancing policy to balance data locality and KV-cache recomputation | |||
2025 | ICDCS | SJTU | LLMSched: Uncertainty-Aware Workload Scheduling for Compound LLM Applications | a DAG with regular stage, LLM stage, dynamic stage; bayesian network-based profiler; identify uncertainty-reducing stages | 4 | 4 | 3 |
2025 | arXiv | SJTU | Efficient Serving of LLM Applications with Probabilistic Demand Modeling | DAG-based scheduling; dynamic execution; CPU executor warmup | 3 | 1 | 1 |
LLM Speculative Inference ¶
Refer to non-LLM speculative execution.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | F&M College | AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration | simultaneous and independent predictions; asynchronous speculative decoding; rollback mechanism | |||
2024 | arXiv | Purdue | Constrained Decoding with Speculative Lookaheads | computational expense of generating lookaheads; speculated lookaheads; task specific reward function | |||
2024 | arXiv | Rutgers | Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface | active user intervention; speculative planning algorithm; UI-level rescheduling algorithm | |||
2024 | arXiv | USTC | Parallel Speculative Decoding with Adaptive Draft Length | adaptive draft length; pre-verify and post-verify; draft-then-verify framework; mutual waiting problem | |||
2024 | arXiv | SEU | SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding | reasoning tree construction; parallel drafting with speculative decoding; FCFS queue verification |
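A minimal draft-then-verify loop illustrating speculative decoding. The two toy "models" are placeholder functions; a real implementation verifies all draft tokens in a single batched target forward pass and uses rejection sampling to preserve the target distribution:

```python
def draft_model(tokens, k):
    """Cheap drafter: propose k tokens (a toy rule standing in for a small LLM)."""
    out = list(tokens)
    for _ in range(k):
        out.append((out[-1] + 1) % 50)
    return out[len(tokens):]

def target_next(tokens):
    """Expensive target model's greedy next token (toy rule; occasionally disagrees)."""
    nxt = (tokens[-1] + 1) % 50
    return 0 if len(tokens) % 7 == 0 else nxt

def speculative_decode(prompt, steps, k=4):
    tokens = list(prompt)
    for _ in range(steps):
        draft = draft_model(tokens, k)
        accepted = []
        for d in draft:
            t = target_next(tokens + accepted)      # verify the next draft token against the target
            if t == d:
                accepted.append(d)                  # match: accept and keep going
            else:
                accepted.append(t)                  # mismatch: take the target's token, stop
                break
        tokens += accepted                          # append verified tokens (plus a correction on mismatch)
    return tokens

print(speculative_decode([3, 4], steps=3))
```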
Spec + Others¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Huawei | Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling | speculative MoE; speculative token shuffling; speculative expert pre-grouping | |||
2025 | INFOCOM | UoA | SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models | internal neurons sparsification; model-agnostic acceleration framework; dynamic early-exit thresholds; multi-layered feature fusion | |||
2025 | arXiv | SUST | FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference | speculative decoding on memory-limited devices; efficient draft management with tree pruning and early stop to reduce redundancy and maintain causal relationships | 3 | 3 | 3 |
LLM Serving Outages and Incidents¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Vrije Universiteit Amsterdam | An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models | empirical characterization of outages; failure recovery optimization; public LLM service reliability |
Energy-Optimized LLM Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UvA | GREEN-CODE: Optimizing Energy Efficiency in Large Language Models for Code Generation | dynamic early exit; energy-aware code generation; reinforcement learning for llms |
Multi-LLM Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UCLA | Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving | Long-tail model popularity; Frequent idle periods; Rapid workload fluctuations | 3 | 4 | 2 |
DNN Scheduling¶
Solution: optimizing data parallelism and model parallelism while minimizing communication overhead between nodes, effectively managing limited GPU memory and other resources to achieve scalability and high throughput.
Refer to LLM-Related Scheduling for the LLM-related scheduling algorithms.
Task Offloading¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | USTC | Collaborative Inference for Large Models with Task Offloading and Early Exiting | early exit mechanism; jointly optimize its offloading strategy and the confidence threshold; distributed task offloading algorithm | |||
2025 | ISCA | ETHZ | OptiPIM: Optimizing Processing-in-Memory Acceleration Using Integer Linear Programming | integer linear programming for offload optimization; PIM-friendly mapping representation; accurate cost modeling for data layout | 4 | 2 | 3 |
General optimizations for Deep Learning Systems¶
Solution: general optimizations for deep learning systems.
If the paper is focusing on an above-mentioned specific scene (e.g., memory, scheduling, IO, etc.), it will be put in the corresponding section.
LLM Training Systems¶
Solution: arrange model parameters and data across multiple devices, reduce the time spent communicating, scale up smoothly as models and data keep growing—all while staying efficient and speeding up training.
General Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | THU | Enhancing Memory Efficiency in Large Language Model Training Through Chronos-aware Pipeline Parallelism | chronos-aware pipeline parallelism; temporal locality optimization; activation balancing | |||
2025 | arXiv | NUS | PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization | selective offload strategy; memory offload optimization; pipeline parallelism scalability; lifespan-based offloading | |||
2025 | arXiv | UCSD | WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training | workload-aware variable-length document packing; per-document sharding strategy; adaptive sharding selection mechanism; delay execution of extremely long documents | 4 | 5 | 2 |
2025 | EuroSys | UToronto | Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization | fine-grained overlap-centric scheduling; symbolic-based performance analysis; imbalance-aware hierarchical tuning | 4 | 4 | 2 |
Optimizations on Special Scene¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HKU | Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism | Fully Sharded Sparse Data Parallelism (FSSDP); sparsely materializes MoE parameters; two sparse collective communications | |||
2025 | arXiv | SJTU | PipeWeaver: Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline | dynamic interleaved pipeline; hierarchical schedule space for rapid pipeline schedule search; spatial-temporal subgraph reuse | 3 | 4 | 2 |
Experiments¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | JSC | Memory and Bandwidth are All You Need for Fully Sharded Data Parallel | an extensive analysis of the FSDP training distribution strategy; a grid search methodology; both simulation and empirical results | 2 | 4 | 1 |
Multi-Modal Optimizations¶
Challenge: multimodal data is more complex, and training on it requires more resources.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training | multimodal mini-batch imbalance; batch post-balancing algorithm; node-wise all-to-all communicator for practical rearrangement of mini-batches | 4 | 4 | 3 |
2025 | arXiv | ICT | ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism | unified prefix cache fusing vision and text tokens; modality-aware load balancer for bursty vision traffic | 2 | 3 | 2 |
Kernel-Level Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HUST | CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures | model segment profile-based cost model; communication-free tensor partition propagation property; extracting a set of unique model segments; Communication-Free Preserve | 4 | 5 | 3 |
LLM Inference Systems¶
Focusing on the optimizations for LLM inference systems.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | ISCA | DeepSeek | Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | software-hardware co-design for deepseek-v3; insight into hardware for ai architectures | 5 | 5 | 4 |
2024 | MLSys | SJTU | FlashDecoding++: Faster Large Language Model Inference on GPUs | asynchronized softmax with unified max value; flat GEMM optimization with double buffering; heuristic dataflow with hardware resource adaptation | 4 | 4 | 3 |
SLO-Aware Systems¶
Challenge: providing service for users to meet specific latency requirements with limited resources.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Berkeley | AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding | fine-grained speculative decoding; token tree verification; slo customization | |||
2025 | arXiv | UIUC | HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location | online-offline request co-location; interference-aware profiler; latency predictor; adaptive scheduler | |||
2025 | arXiv | PKU | Memory Offloading for Large Language Model Inference with Latency SLO Guarantees | effectively captures the tension between meeting SLOs and maximizing host memory usage; dynamic offloading interval; per-bus coordinator | |||
2025 | arXiv | Huawei | Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization | hybrid offline-online scheduling; preemptive scheduling for hardware utilization; lagrangian method for cost efficiency evaluation | |||
2025 | ASPLOS | BUAA | Past-Future Scheduler for LLM Serving under SLA Guarantees | LightLLM; predicts future system memory usage; reduces evictions via better request scheduling | 3 | 2 | 3 |
Surveys¶
System Optimization Surveys¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | NEU | LLM Inference Serving: Survey of Recent Advances and Opportunities | KV cache and memory management; LLM computation optimization; Cloud LLM deployment; focus on system-level enhancements | |||
2024 | arXiv | CUHK | A Survey on Inference Optimization Techniques for Mixture of Experts Models | model compression; expert skip; expert merge; sparse to dense; expert parallel; expert offloading | |||
2024 | arXiv | PolyU | A Survey on Large Language Model Acceleration based on KV Cache Management | cache selection; budget allocation; cache merging; cache quantization; cache low-rank decomposition; attention grouping and sharing; memory management; hardware-aware design | |||
2025 | arXiv | THU | Beyond A Single AI Cluster: A Survey of Decentralized LLM Training | resource-driven paradigm; community-driven decentralization; organizational decentralization; decentralized LLM training taxonomy | |||
2025 | arXiv | FIU | Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions | distributed solutions for LMs; workload imbalance in LLM training; M-ICL; model security enhancement |
Application Surveys¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | PKU | Retrieval-Augmented Generation for AI-Generated Content: A Survey | Query Transformation; Data Augmentation; Recursive Retrieval; Chunk Optimization; Retriever Finetuning; Hybrid Retrieval; Re-ranking; Retrieval Transformation; Prompt Engineering; Decoding Tuning; Generator Finetuning; Output Rewrite; Adaptive Retrieval; Iterative RAG | |||
2024 | arXiv | WHU | A survey on LLM-based multi-agent systems: workflow, infrastructure, and challenges | personalized characteristics; perceive environmental information; utilize memory mechanisms; mutual interaction; agent self-reflection | |||
2024 | arXiv | PolyU | Deploying Foundation Model Powered Agent Services: A Survey | FM-powered agent services within the edge-cloud environment; low-level hardware perspective; high-level software perspective |
Multimodal Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | UW–Madison | LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | query-block distributed exchange; shared visual token recomputation; sequence-parallelism with minimal communication overhead | |||
2025 | arXiv | Microsoft | Towards Efficient Large Multimodal Model Serving | fine-grained stage-aware resource management; multimodal workload-specific scheduling; model architecture-specific optimizations | |||
2025 | arXiv | Huawei | Efficiently Serving Large Multimedia Models Using EPD Disaggregation | encode-prefill-decode disaggregation; multimodal cache; intra-request parallel | |||
2025 | arXiv | TU/e | Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach | Multimodal Parallel Split Learning; computation-efficient training; server-side loss aggregation mechanism | |||
2025 | arXiv | HUST | FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework | resource-aware KV-cache memory pool; multimodal KV-cache compression; modality-specific compression |
Mixture-of-Experts LLM Systems¶
Challenge: efficiently coordinating and scaling expert models across multiple nodes, leading to issues like uneven workload distribution, high communication overhead, and difficulty in fault tolerance.
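A toy top-k router showing where the workload-imbalance problem comes from: each token picks its highest-scoring experts, and the resulting per-expert token counts are what placement, offloading, and scheduling policies try to balance (the names and shapes are illustrative, not from any specific system):

```python
import numpy as np

def route_tokens(token_states, gate_weights, top_k=2):
    """Toy MoE router: each token picks its top-k experts by gating score.
    Returns per-token expert assignments and the per-expert token counts that
    drive placement/offloading decisions."""
    logits = token_states @ gate_weights                 # (num_tokens, num_experts)
    top_experts = np.argsort(-logits, axis=1)[:, :top_k]
    num_experts = gate_weights.shape[1]
    load = np.bincount(top_experts.ravel(), minlength=num_experts)
    return top_experts, load

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))       # 16 tokens, hidden size 32
gates = rng.normal(size=(32, 8))         # 8 experts
assignments, load = route_tokens(tokens, gates)
print("per-expert load:", load)          # uneven load motivates placement/offloading policies
```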
Expert Offloading and Placement¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | DATE | Berkeley | DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference | data-aware offloading; predictive pre-calculation; sequence-specific expert allocation | |||
2025 | arXiv | Stevens Tech | fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving | expert map; iteration-level probability distributions; track fine-grained input semantic embeddings; semantic-based and trajectory-based |||
2025 | arXiv | Georgia Tech | MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing | ILP for expert placement; cross-layer dependencies; minimizing total dispatched token number | |||
2025 | EuroMLSys | EPFL | Accelerating MoE Model Inference with Expert Sharding | expert sharding for load balancing; tensor sharding for moe experts; fused expert computations for reduced kernel launches | |||
2025 | DAC | PKU | HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference | dynamically balances workloads across GPUs and CPUs; impact-driven prefetching; MoE-specialized cache management | 3 | 4 | 2 |
Batching and Scheduling¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Alibaba | Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference | statically batching irregular workloads; batch-task-tile partition; decompress the mapping and dispatch the workload | |||
2025 | arXiv | Edinburgh | MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching | module-based batching; high-throughput MoE inference; full KV-cache offloading | |||
2025 | arXiv | KTH | Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference | fine-grained preemption; priority-aware scheduling; per-expert queues; expert-level preemption | |||
2025 | arXiv | UMich | MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints | two-stage performance modeling; analyzes the theoretical performance upper bound; captures how system execution mechanisms | 4 | 4 | 2 |
2025 | arXiv | NVIDIA | MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core | decouples parallelization strategies for attention and MoE layers; flexible and efficient token-level dispatcher; 5-D hybrid parallelism | 4 | 5 | 2 |
Memory and Communication Efficiency¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | fine-grained communication-computation overlapping for efficient MoE execution; dependency resolving method; adaptive workload assignment method; shared data buffers between communication and computation operations | |||
2025 | arXiv | UVA | eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference | expert prediction; task-aware expert loading; task-aware request scheduling | |||
2025 | MobiCom | HKUST | D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving | dually sparsely-gated Mixture-of-Experts; token-adaptive bit-width selection; matryoshka weight quantization; bit-width-aware I/O-compute pipeline | 3 | 4 | 4 |
2025 | OSDI | SJTU | Fast and Live Model Auto Scaling with O(1) Host Caching | auto-scaling with minimal caching; optimize parameter loading; enabling fine-grained layer-level scaling | 3 | 3 | 2 |
2023 | ASPLOS | Google | TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators | hybrid heuristic-solver memory allocator for ML accelerators; contention-aware phased allocation strategy | 4 | 4 | 3 |
Architectural Innovations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | Shanghai AI | Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts | linear sequence modeling with MoE; sparse activation via moe layers; hybrid models combining linear-moe and transformer-moe layers | |||
2025 | arXiv | Berkeley | HeterMoE: Efficient Training of Mixture-of-Experts Models on Heterogeneous GPUs | zebra parallelism; attention-expert disaggregation; asymmetric expert assignment mechanism; gather and squeeze strategy | 4 | 5 | 3 |
Compute-Kernel-Level Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | SJTU | Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores | dual-side structured sparsity; sparse-sparse matrix multiplication kernel; vector-wise + 2:4 hybrid sparsity; token-aware activation compression |
Long Sequence LLM Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | SJTU & Alibaba | Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache | inefficient model parallelism intra-instance; inefficient resource management inter-instance; KV cache scheduling | |||
2025 | arXiv | PKU | ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs | hybrid data parallelism; data-aware sharding; a heuristic algorithm that reorganizes data assignment based on the characteristics of data and pipeline parallelism | |||
2025 | ICML | ByteDance | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | offload value cache to CPU and keep outliers on GPU; landmark-guided sparse KV selection per chunk | 3 | 3 | 3 |
Sparse Attention¶
Solution: handling long prompts with full attention introduces high latency; sparse attention reduces the computation and memory burden by computing only a subset of the attention matrix rather than all of its entries.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | CWRU | Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques | sparse attention with graph computing perspective; work-optimal graph algorithms; achieve true sparsity | |||
2025 | MLSys | MIT | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | unified sparse attention; hybrid static and dynamic sparsity; hierarchical kv cache management with query-centric pruning |
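A NumPy sketch of one simple sparsity pattern (a local causal window plus a few always-visible global tokens); it is meant only to illustrate computing a subset of the attention matrix, not any specific paper's kernel:

```python
import numpy as np

def sparse_attention(q, k, v, window=2, num_global=1):
    """Causal attention where each query attends only to a local window plus a few
    leading 'global' tokens, instead of the full causal attention matrix."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((n, n), -np.inf)
    for i in range(n):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = 0.0                  # local causal window
        mask[i, :min(num_global, i + 1)] = 0.0   # always-visible global tokens
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(6, 8))
print(sparse_attention(q, k, v).shape)  # (6, 8)
```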
Ring Computation¶
Solution: use the device layout to reduce communication overhead. The key idea is to overlap computation with communication.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | Nips | UCB | Ring Attention with Blockwise Transformers for Near-Infinite Context | divide the input into blocks and each block is processed by a single GPU; ring-type device layout | 4 | 3 | 3 |
2024 | arXiv | SJTU | TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication | communication-oriented parallelism framework; inter-node P2P bidirectional communication bandwidth; optimization of attention block communication |
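A simplified simulation of the ring idea: each "device" keeps its query block while key/value blocks rotate around the ring. For clarity this version gathers all score chunks before the softmax; real ring attention uses an online softmax so full score rows are never materialized, and the rotation overlaps with compute:

```python
import numpy as np

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Each 'device' owns one query block; KV blocks rotate around the ring so every
    device eventually sees all keys and values."""
    world = len(q_blocks)
    outputs = []
    for rank, q in enumerate(q_blocks):
        score_chunks, value_chunks = [], []
        for step in range(world):
            peer = (rank + step) % world            # the KV block that "arrives" at this step
            k, v = k_blocks[peer], v_blocks[peer]
            score_chunks.append(q @ k.T / np.sqrt(q.shape[-1]))
            value_chunks.append(v)
        scores = np.concatenate(score_chunks, axis=1)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        outputs.append(weights @ np.concatenate(value_chunks, axis=0))
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
blocks = np.split(x, 4)                             # 4 "devices", each with a block of the sequence
print(ring_attention_sim(blocks, blocks, blocks).shape)  # (8, 4)
```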
P-D Disaggregated Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | OSDI | PKU | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | goodput-optimized; prefill-decoding interference; novel placement algorithm for the p-d schema |||
2024 | ISCA | UW | Splitwise: Efficient Generative LLM Inference Using Phase Splitting | optimized cache context transfer; performance per dollar; performance per watt; exploration of homogeneous and heterogeneous cluster deployments | |||
2024 | arXiv | CMU | A System for Microserving of LLMs | fine-grained sub-request level actions; dynamic reconfiguration according to workloads; unified KV cache abstraction | |||
2025 | arXiv | PKU | ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments | two-level hierarchical optimization; tabu search algorithm for GPU partition; a lightweight re-scheduling mechanism |
P-D Disaggregated System Optimizations¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | ByteDance | KVDirect: Distributed Disaggregated LLM Inference | tensor-centric communication mechanism; pull-based KV cache transfer; dynamic GPU resource scheduling via RDMA | |||
2025 | arXiv | SYSU | Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation | attention disaggregation and offloading mechanism; low-latency decoding synchronization; resource-efficient prefill colocation; load-aware offloading scheduling | 4 | 4 | 3 |
2025 | arXiv | Alibaba | FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling | analyze the communication patterns; KV cache structure adjustment method; load-aware scheduling | 4 | 4 | 2 |
2025 | arXiv | NUS & USTC | DynaServe: Unified and Elastic Tandem-Style Execution for Dynamic Disaggregated LLM Serving | a novel Tandem Serving execution model; two virtual subrequests; explicitly permit the two subrequests to execute on either GPU instance | 3 | 4 | 2 |
Throughput-Optimized Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | arXiv | HKUST | Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation | sampling-then-simulation cost model; model-level pipeline parallelism; minimum-total-latency application scheduling | 4 | 4 | 3 |
Fair Serving Systems¶
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2024 | arXiv | Virginia Tech | Ensuring Fair LLM Serving Amid Diverse Applications | multi-tenant LLM platform; overload and interaction-driven throttling; weighted service counter | |||
2025 | arXiv | UIUC | Hierarchical Autoscaling for Large Language Model Serving with Chiron | hierarchical backpressure; interactive requests and batch requests; mixed instances | |||
2025 | arXiv | Berkeley | Locality-aware Fair Scheduling in LLM Serving | deficit-based longest prefix matching; distributed deficit-round coordination; prefix-aware fairness bound analysis |
RLHF System¶
Challenge: an RLHF system includes both training and inference. On top of that, multiple models (LLMs) run in parallel, which makes the data flow more complex.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | EuroSys | HKU | HybridFlow: A Flexible and Efficient RLHF Framework | auto-mapping model placement; 3D-HybridEngine to reduce the communication overhead; hybrid programming | 4 | 4 | 3 |
2025 | arXiv | Alibaba | Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library | bind many LLMs in one device cluster; fix the batch problem of long tail requests; reuse many utils in HybridFlow | 4 | 4 | 2 |
Communication-Computation Overlap¶
Challenge: effectively hiding communication latency by overlapping it with computation, which requires careful scheduling and resource management to avoid bottlenecks and ensure that both communication and computation proceed efficiently without stalling each other.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2023 | NSDI | KAIST | ARK: GPU-driven Code Execution for Distributed Deep Learning | communication-motivated DL system; pipeline DMA engine; GPU-direct-controlled DMA | |||
2024 | ASPLOS | PKU | Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | communication partition abstraction; hybrid LLM training tasks; 3-level decompose | |||
2024 | ASPLOS | UW–Madison | T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives | lightweight track and trigger; pre-programmed DMA commands; atomic memory update | |||
2024 | ASPLOS | UIUC | Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM | distributed SpMM; sparsity-aware partition; Synchronous Stripes and Asynchronous Stripes | |||
2024 | arXiv | AMD | Optimizing ML Concurrent Computation and Communication with GPU DMA Engines | concurrent computation and communication; compute and memory interference among concurrent kernels; schedule prioritization and careful resource partitioning |
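A toy illustration of overlapping gradient communication with backward computation: each layer's "all-reduce" is submitted to a background worker while the next layer's gradients are being computed. Threads and sleeps stand in for collectives and kernels; the timings are placeholders:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backward_layer(layer_id):
    time.sleep(0.03)                       # stand-in for computing this layer's gradients
    return f"grad[{layer_id}]"

def all_reduce(grad):
    time.sleep(0.03)                       # stand-in for the collective communication
    return f"synced({grad})"

def backward_with_overlap(num_layers):
    handles = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        for layer_id in reversed(range(num_layers)):
            grad = backward_layer(layer_id)                # compute layer L's gradients...
            handles.append(comm.submit(all_reduce, grad))  # ...and reduce them while L-1 computes
        return [h.result() for h in handles]

start = time.time()
backward_with_overlap(num_layers=6)
print(f"elapsed ~{time.time() - start:.2f}s (vs ~0.36s if compute and comm were serialized)")
```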
Configuration Optimization¶
Challenge: the configuration space is too large to be searched manually.
Year | Venue | Authors | Title | Tags | P | E | N |
---|---|---|---|---|---|---|---|
2025 | OSDI | PKU | Mirage: A Multi-Level Superoptimizer for Tensor Programs | automatic algebraic transformation of tensor programs; DAG-based search of the configuration space; auto-generates kernel functions | 4 | 4 | 3 |
2020 | ASPLOS | PKU | FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System | TVM auto-scheduling; RL-based strategy search; automatic optimization over a large configuration space | 4 | 4 | 3 |