Performance Modeling and Analysis

Hardware Performance Counter

Challenge: Software performance analysis and optimization are often limited by the lack of accurate, detailed information about the underlying hardware behavior.

Solution: Use hardware performance counters to gather data on CPU usage, memory access patterns, cache hits and misses, branch prediction, and other metrics that help analyze the performance of software applications and hardware systems.
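
Example: a minimal sketch of how raw counter readings are typically turned into the metrics named above. The `CounterSample` fields and the sample values are hypothetical; real event names and availability depend on the CPU and the tool used to read the counters.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counts over one measurement interval (field names are hypothetical)."""
    cycles: int
    instructions: int
    cache_references: int
    cache_misses: int
    branches: int
    branch_misses: int


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derive_metrics(s: CounterSample) -> dict:
    """Compute common derived metrics from one sample of raw counts."""
    return {
        "ipc": ratio(s.instructions, s.cycles),                # instructions per cycle
        "cache_miss_rate": ratio(s.cache_misses, s.cache_references),
        "branch_miss_rate": ratio(s.branch_misses, s.branches),
        "mpki": ratio(1000 * s.cache_misses, s.instructions),  # misses per kilo-instruction
    }


# Illustrative values only, not measurements.
sample = CounterSample(
    cycles=2_000_000,
    instructions=3_000_000,
    cache_references=100_000,
    cache_misses=5_000,
    branches=600_000,
    branch_misses=12_000,
)
metrics = derive_metrics(sample)
print(metrics)
```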

Survey

Year	Venue	Authors	Title	Tags	P	E	N
2013	TODAES	Crete	A Survey and Taxonomy of On-Chip Monitoring of Multicore Systems-on-Chip	debugging/performance/QoS monitor; physical parameter monitor; methodology based taxonomy	2	4	1
2016	CSUR	Oak Ridge Lab	Understanding GPU Power: A Survey of Profiling, Modeling, and Simulation Methods	external/internal power measurement; HPC based power model; GPU power simulation	3	3	1
2019	SP	UNC-Chapel Hill	SoK: The Challenges, Pitfalls, and Perils of Using Hardware Performance Counters for Security	non-determinism and overcounting effects; performance monitoring interrupt	3	4	1

Specific Application

Year	Venue	Authors	Title	Tags	P	E	N
2000	SC	UT	A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters	portable and machine-dependent layers based architecture; eventset for group management; counter multiplexing	2	4	2
2004	SC	UMD	Using Hardware Counters to Automatically Improve Memory Performance	two-phase dynamic page migration algorithm; sun fire link counter	3	4	3
2013	ISPASS	UTAustin	Non-determinism and overcount on modern hardware performance counter implementations	nondeterministic hardware interrupts; floating point unit related overcount; retired instruction overcount	2	4	2
2020	CONECCT	IIIT	Power, Performance And Thermal Management Using Hardware Performance Counters	fine-grained dynamic voltage and frequency scaling; PMC-based power and temperature correlation model; thermal zone and partition-based management	2	4	2
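
Example: counter multiplexing time-shares a small number of physical counters among more events than the hardware can count at once, so each event is only observed for part of the run. The sketch below shows the common linear extrapolation, assuming the event rate while the counter was scheduled is representative of the whole interval. The function and parameter names are hypothetical, and this is not claimed to be the exact estimator used by any tool in the table.

```python
def scale_multiplexed(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> float:
    """Estimate the full-interval count of a multiplexed event.

    raw_count        -- events counted while the event held a physical counter
    time_enabled_ns  -- total time the event was enabled
    time_running_ns  -- time the event was actually scheduled on a counter
    """
    if time_running_ns == 0:
        # The event never got a counter, so there is nothing to extrapolate from.
        return 0.0
    return raw_count * time_enabled_ns / time_running_ns


# Illustrative values: the event was scheduled for a quarter of the interval.
estimate = scale_multiplexed(raw_count=250_000,
                             time_enabled_ns=10_000_000,
                             time_running_ns=2_500_000)
print(estimate)
```

The estimate becomes less reliable as the scheduled fraction shrinks or when program behavior changes phase between time slices.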

Architecture Design

Challenge: Existing hardware performance counters expose limited information; extended counter architectures are needed to capture more detailed hardware behavior.

Year	Venue	Authors	Title	Tags	P	E	N
2006	ASPLOS	UW–Madison	A Performance Counter Architecture for Computing Accurate CPI Components	interval analysis based performance model; frontend miss table (FMT); shared FMT	3	3	2
2014	ISPASS	Intel	A Top-Down Method for Performance Analysis and Counters Architecture	top-down bottleneck analysis method; frontend bound; bad speculation; retiring; backend bound; top-down performance events	3	5	3
2015	ISCA	ANU	Computer Performance Microscopy with SHIM	double-time error correction; sample periods randomizing; CMP core sampling for low overhead	4	4	3
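
Example: a minimal sketch of the first level of a top-down breakdown, which splits pipeline issue slots into frontend bound, bad speculation, retiring, and backend bound. It assumes a 4-wide pipeline and the Level-1 formulas as commonly presented for the top-down method. The parameter names are hypothetical stand-ins for the underlying events, whose names and availability vary by microarchitecture.

```python
def top_down_level1(clk_unhalted: int,
                    uops_not_delivered: int,
                    uops_issued: int,
                    uops_retired_slots: int,
                    recovery_cycles: int,
                    pipeline_width: int = 4) -> dict:
    """Split total issue slots into the four top-level categories."""
    slots = pipeline_width * clk_unhalted
    if slots <= 0:
        raise ValueError("need a positive number of issue slots")

    # Slots in which the frontend delivered no uop to a ready backend.
    frontend_bound = uops_not_delivered / slots
    # Slots spent on uops that never retire, plus slots lost while recovering.
    bad_speculation = (uops_issued - uops_retired_slots
                       + pipeline_width * recovery_cycles) / slots
    # Slots that did useful work.
    retiring = uops_retired_slots / slots
    # Whatever remains is attributed to the backend.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)

    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


# Illustrative counts only, not measurements.
breakdown = top_down_level1(clk_unhalted=1000,
                            uops_not_delivered=800,
                            uops_issued=2200,
                            uops_retired_slots=2000,
                            recovery_cycles=50)
print(breakdown)
```

Because backend bound is computed as the remainder, measurement error in the other three categories accumulates there.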

Dataflow Architecture

Year	Venue	Authors	Title	Tags
2022	OSDI	UCB	Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning	inter-operator parallelism; intra-operator parallelism; ILP and DP hierarchical optimization
2023	MICRO	PKU	TileFlow: A Framework for Modeling Fusion Dataflow via Tree-based Analysis	3D design space of fusion dataflow; tree-based description; tile-centric notation
2024	ISCA	Stanford	The Dataflow Abstract Machine Simulator Framework	communicating sequential processes; event-queue free execution; context-channel based description; asynchronous distributed time

Connection Architecture

Year	Venue	Authors	Title	Tags
2014	JPDC	Inria	Versatile, scalable, and accurate simulation of distributed applications and platforms	API based communication&computation description; informed model of TCP for moderate size grids; file based modular network representation technique
2020	MICRO	Georgia Tech; NVIDIA	MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings	data-centric mapping; data reuse analysis; TemporalMap; SpatialMap; analytical cost model
2023	ISPASS	Georgia Tech	ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale	graph-based training-loop execution; multi-dimensional heterogeneous topology construction; analytical network backend
2024	ATC	THU	Evaluating Chiplet-based Large-Scale Interconnection Networks via Cycle-Accurate Packet-Parallel Simulation	packet-centric simulation; critical resources recording for process-order-induced deviations; unimportant stages elimination
2025	arXiv	UCLM	Understanding Intra-Node Communication in HPC Systems and Datacenters	Intra-/inter-node communication interference; Packet-level simulation (OMNeT++); PCIe/NVLink modeling; LLM communication patterns (DP, TP, PP) impact
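
Example: a minimal sketch of what an analytical (as opposed to packet-level) network model computes, using the textbook latency-bandwidth cost of a ring all-reduce. It assumes identical links, no contention, and no compute overlap. The function and parameter names are hypothetical, and it is not the model implemented by any simulator in the table.

```python
def ring_allreduce_time(num_nodes: int,
                        message_bytes: float,
                        link_bandwidth_bytes_per_s: float,
                        link_latency_s: float) -> float:
    """Estimate ring all-reduce time with a latency-bandwidth model.

    The ring algorithm runs (n - 1) reduce-scatter steps followed by
    (n - 1) all-gather steps; each step moves one chunk of size
    message_bytes / n over one link.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes == 1:
        return 0.0  # nothing to communicate
    chunk_bytes = message_bytes / num_nodes
    step_time = link_latency_s + chunk_bytes / link_bandwidth_bytes_per_s
    return 2 * (num_nodes - 1) * step_time


# Illustrative values chosen for easy arithmetic, not realistic hardware.
t = ring_allreduce_time(num_nodes=4,
                        message_bytes=400,
                        link_bandwidth_bytes_per_s=100,
                        link_latency_s=0.5)
print(t)
```

A closed-form estimate like this is cheap to evaluate across many topologies, at the cost of ignoring the congestion and interference effects that packet-level simulation captures.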

Redundancy Detection

Challenge: Redundant zeros in data can lead to inefficiencies in software performance, making it important to detect and eliminate them.

Year	Venue	Authors	Title	Tags
2020	SC	NC State	ZeroSpy: Exploring Software Inefficiency with Redundant Zeros	code-centric analysis for instruction detection; data-centric analysis for data detection
2020	SC	NC State	GVPROF: A Value Profiler for GPU-Based Clusters	temporal/spatial load/store redundancy; hierarchical sampling for reducing monitoring overhead; bidirectional search algorithm on dependency graph
2022	ASPLOS	NC State	ValueExpert: Exploring Value Patterns in GPU-accelerated Applications	value-related inefficiencies; data value pattern recognition; value flow graph; parallel intervals merging algorithm
2022	SC	NC State	Graph Neural Networks Based Memory Inefficiency Detection Using Selective Sampling	dead store; silent store; silent load; assembly-level procedural control-flow embedding; dynamic value semantic embedding; relative positional encoding for different compilation options
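
Example: a minimal sketch of the three redundancy patterns named in the tags above (dead store, silent store, silent load), detected on a toy memory trace. The trace format and the `classify_trace` helper are hypothetical. The tools in the table work on real binaries with sampling and far richer analysis.

```python
from collections import Counter


def classify_trace(trace):
    """Count redundant accesses in a trace of (op, address, value) tuples.

    dead store   -- a store overwritten by another store with no load in between
    silent store -- a store that writes the value already held at that address
    silent load  -- a load that repeats the previous load of the same address
                    and observes the same value
    """
    memory = {}       # address -> value currently stored
    last_access = {}  # address -> (op, value) of the most recent access
    counts = Counter()

    for op, addr, value in trace:
        prev = last_access.get(addr)
        if op == "store":
            if prev is not None and prev[0] == "store":
                counts["dead_store"] += 1    # the previous store was never read
            if addr in memory and memory[addr] == value:
                counts["silent_store"] += 1  # memory contents do not change
            memory[addr] = value
        elif op == "load":
            if prev is not None and prev[0] == "load" and prev[1] == value:
                counts["silent_load"] += 1   # same value re-read from same address
        else:
            raise ValueError(f"unknown op: {op!r}")
        last_access[addr] = (op, value)

    return counts


# Toy trace for illustration only.
trace = [
    ("store", 0x10, 0),
    ("store", 0x10, 0),  # dead (previous store unread) and silent (same value)
    ("load",  0x10, 0),
    ("load",  0x10, 0),  # silent load
    ("store", 0x20, 7),
    ("load",  0x20, 7),
    ("store", 0x20, 7),  # silent store, but not dead: the earlier store was read
]
counts = classify_trace(trace)
print(dict(counts))
```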

Variation Impact

Solution: Characterize sources of variation (hardware; software; environment); develop models to predict variation impact; implement techniques to reduce variation (e.g., dynamic voltage and frequency scaling, adaptive scheduling).

Year	Venue	Authors	Title	Tags
2009	HPCMP	UCSD	Measuring and Understanding Variation in Benchmark Performance	MPI communication variation; distribution of performance variation
2016	SC	UNM	Understanding Performance Interference in Next-Generation HPC Systems	extreme value theory; bulk-synchronous parallel based modeling; gang/earliest deadline first scheduling
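
Example: a minimal sketch of a first step in characterizing variation, summarizing repeated runs of the same benchmark with the coefficient of variation and the worst-to-best ratio. The `variation_summary` helper and the run times are hypothetical.

```python
import statistics


def variation_summary(run_times):
    """Summarize run-to-run variation for repeated runs of one benchmark."""
    if len(run_times) < 2:
        raise ValueError("need at least two runs to measure variation")
    mean = statistics.mean(run_times)
    stdev = statistics.stdev(run_times)  # sample standard deviation
    return {
        "mean": mean,
        "cv": stdev / mean,                              # coefficient of variation
        "max_over_min": max(run_times) / min(run_times), # worst run relative to best
    }


# Illustrative run times in seconds, not measurements.
summary = variation_summary([10.0, 10.0, 12.0, 8.0])
print(summary)
```

The coefficient of variation is unitless, so it can be compared across benchmarks with different run times; the worst-to-best ratio highlights tail behavior that the mean hides.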

Stall Attribution

Challenge: Stalls can be caused by hardware or software; identifying the root cause of each stall and its impact on performance is crucial for optimization.

Year	Venue	Authors	Title	Tags
2023	ICPE	NC State University	DrGPU: A Top-Down Profiler for GPU	device memory stall; synchronization stall; instruction related stall; shared memory related stall
2024	MICRO	NUDT	HyFiSS: A Hybrid Fidelity Stall-Aware Simulator for GPGPUs	memory/compute structural/data stall; synchronization stall; control stall; idle stall; cooperative thread array-sets based SM sampling algorithm