Security and Reliability

Error Pattern

Manycore Architecture

Year Venue Authors Title Tags P E N
2009 MICRO UIUC mSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems selective Triple Modular Redundant (TMR) replay method; symptom based fault detection; permanent/transient fault
2015 IEEE TSM NTU Wafer Map Failure Pattern Recognition and Similarity Ranking for Large-Scale Data Sets wafer map failure pattern; wafer map similarity ranking; radon/geometry-based feature extraction; WM-811K wafer map dataset

System Level

Year Venue Authors Title Tags P E N
2017 SC Argonne National Lab Run-to-run Variability on Xeon Phi based Cray XC Systems OS noise based core-level variability; tile-level variability; memory mode variability
2018 FAST UChicago Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems conversion among fail-stop/slow/transient; permanent/transient/partial slowdown; internal/external root causes

Hardware Fault

Year Venue Authors Title Tags P E N
2014 DTIS LIRMM A Survey on Simulation-Based Fault Injection Tools for Complex Systems runtime fault injection; compile-time fault injection
2021 ASPLOS UIUC BayesPerf: Minimizing Performance Monitoring Errors using Bayesian Statistics microarchitectural relationship incorporation; measurement uncertainty quantification; high-frequency sampling reduction 3 4 3
2024 arXiv GWU Algorithmic Strategies for Sustainable Reuse of Neural Network Accelerators with Permanent Faults stuck-at-0/1 faults; weight register fault; invertible scaling and shifting technique; elementary tile operations for mantissa fault
2025 arXiv NUDT FlexStep: Enabling Flexible Error Detection in Multi/Many-core Real-time Systems register checkpoints based error detection; memory access log unit; data buffering and channelling unit
2025 DAC SEU MEEK: Re-thinking Heterogeneous Parallel Error Detection Architecture for Real-World OoO Superscalar Processors data extraction unit; bespoke forwarding fabric; little core upgrade 3 4 3

NoC Fault

Year Venue Authors Title Tags P E N
2006 IOLTS UBC & WSU On-line Fault Detection and Location for NoC Interconnects code-disjoint based error detection algorithm; code-disjoint switch design 2 2 2
2011 ASPDAC NTHU On the Design and Analysis of Fault Tolerant NoC Architecture Using Spare Routers shift-and-replace allocation algorithm; defect-awareness-path allocation algorithm 3 2 2
2013 TVLSI NUDT Addressing Transient and Permanent Faults in NoC With Efficient Fault-Tolerant Deflection Router link-level error control scheme; on-line fault diagnosis mechanism; RL based fault-tolerant deflection routing 4 2 2
2017 TECS NTUA SoftRM: Self-Organized Fault-Tolerant Resource Management for Failure Detection and Recovery in NoC Based Many-Cores permanent fault; tweaked perfect failure detector; paxos algorithm to recover fault 2 4 2
2017 DDECS TTU From Online Fault Detection to Fault Management in Network-on-Chips: A Ground-Up Approach data-path fault detection; control part fault detection; assertion vector based fault localization 3 1 2

Fail-Slow

Challenge: Fail-slow faults degrade performance without causing a complete failure, which makes them harder to detect and diagnose than fail-stop failures.

Year Venue Authors Title Tags P E N
2019 ATC UChicago IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services slowdown detection based on peer score; sub-root causes for five kinds of root causes
2022 ATC SJTU & Alibaba NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow hardware infant mortality; write amplification factor; intra-node/rack failure 3 4 2
2023 FAST SJTU & Alibaba PERSEUS: A Fail-Slow Detection Framework for Cloud Storage Systems outlier data detection; regression model for detection threshold; risk evaluating algorithm 4 4 3
2025 ASPDAC Xiamen University A Fail-Slow Detection Framework for HBM Devices outlier data detection; regression model for detection threshold; risk evaluating algorithm 2 4 2
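
The challenge above can be made concrete with a small sketch of peer comparison: a fail-slow device still answers requests, so it only stands out when its latency is compared against similar devices. The code below is an illustrative assumption and not the algorithm of any paper listed here. The function name `find_fail_slow`, the modified z-score based on the median absolute deviation (MAD), and the threshold of 3.5 are all hypothetical choices.

```python
from statistics import median


def find_fail_slow(latencies_ms, threshold=3.5):
    """Return ids of devices whose latency is a slow-side outlier among peers.

    Hypothetical sketch: scores each device with the modified z-score,
    0.6745 * (value - median) / MAD, and flags scores above `threshold`.
    """
    values = list(latencies_ms.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # MAD is zero when more than half of the peers report the same
        # value; this sketch simply skips detection in that case.
        return []
    slow = []
    for dev, value in latencies_ms.items():
        score = 0.6745 * (value - med) / mad
        if score > threshold:  # only the slow side is of interest
            slow.append(dev)
    return slow


# Made-up latencies: ssd4 still works, but is far slower than its peers.
peers = {"ssd0": 1.1, "ssd1": 0.9, "ssd2": 1.0, "ssd3": 1.2, "ssd4": 9.5}
print(find_fail_slow(peers))
```

A median-based score is used here because a mean and standard deviation would themselves be skewed by the slow device; a production detector would also need to account for workload differences between peers.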

Physical Effects

RRAM

Challenge: Non-ideal effects of RRAM devices (e.g., device-to-device variation and cycle-to-cycle variation) can cause significant performance degradation.

Solution: Robust data representations, variation-aware training algorithms, and SRAM-based compensation.

Year Venue Authors Title Tags P E N
2019 DAC UCF Noise Injection Adaption: End-to-End ReRAM Crossbar Non-ideal Effect Adaption for Neural Network Mapping stuck-at-fault; crossbar wire resistance based IR drop; thermal noise model; shot noise; random telegraph noise
2019 DATE Georgia Tech Design of Reliable DNN Accelerator with Un-reliable ReRAM dynamic fixed-point data representation format; device variation aware training methodology
2020 DAC ASU Accurate Inference with Inaccurate RRAM Devices: Statistical Data, Model Transfer, and On-line Adaptation introduce statistical variations in knowledge distillation; on-line sparse adaptation with a small SRAM array
2020 DATE SJTU Go Unary: A Novel Synapse Coding and Mapping Scheme for Reliable ReRAM-based Neuromorphic Computing unary coding; priority mapping
2022 TCAD ASU Hybrid RRAM/SRAM in-Memory Computing for Robust DNN Acceleration integrates an RRAM-based IMC macro with a digital SRAM macro using a programmable shifter to compensate for RRAM variations; ensemble learning
2023 ISCAS TAMU Memristor-based Offset Cancellation Technique in Analog Crossbars peripheral circuitry to remove the systematic offset of crossbar
2024 LATS AMU Analysis of Conductance Variability in RRAM for Accurate Neuromorphic Computing analysis and quantification of conductance variability in RRAMs; analysis of conductance variation over multiple cycles
2025 arXiv AMU Energy-Efficient RRAM-Based Neuromorphic Computing with Adaptive Voltage and Frequency Scaling energy-efficient RRAM-based neuromorphic computing; adaptive voltage and frequency scaling 2 4 3
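
To illustrate the two non-ideal effects named in the challenge above, the following sketch injects them into a toy matrix-vector product. It is a simplified assumption, not a model taken from any paper in the table: device-to-device variation is modelled as a fixed lognormal multiplicative error applied once at programming time, cycle-to-cycle variation as a fresh Gaussian perturbation on every read, and signed weights are stored directly rather than as differential conductance pairs. The names `program_crossbar` and `read_mvm` and the sigma values are hypothetical.

```python
import random


def program_crossbar(weights, d2d_sigma, rng):
    """Device-to-device variation: each cell receives one fixed
    multiplicative lognormal error when it is programmed."""
    return [[w * rng.lognormvariate(0.0, d2d_sigma) for w in row]
            for row in weights]


def read_mvm(cells, x, c2c_sigma, rng):
    """Cycle-to-cycle variation: every read perturbs each cell again
    before the multiply-accumulate."""
    out = []
    for row in cells:
        acc = 0.0
        for g, xi in zip(row, x):
            acc += g * (1.0 + rng.gauss(0.0, c2c_sigma)) * xi
        out.append(acc)
    return out


rng = random.Random(0)  # fixed seed so repeated runs match
weights = [[0.5, -0.25], [1.0, 0.75]]
x = [1.0, 2.0]

ideal = read_mvm(weights, x, 0.0, rng)       # no variation at all
cells = program_crossbar(weights, 0.1, rng)  # programmed once
noisy = read_mvm(cells, x, 0.02, rng)        # differs on every read
print(ideal, noisy)
```

Applying the same perturbation to weights during training is one way to expose a network to these effects before deployment, which is the general idea behind the variation-aware training entries in the table.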

DRAM

Challenge: DRAM devices are sensitive to temperature and voltage variations, which can lead to performance degradation and reliability issues.

Year Venue Authors Title Tags P E N
2015 RACS NTU Thermal/Performance Characterization of CMPs with 3D-stacked DRAMs under Synergistic Voltage-Frequency Control of Cores and DRAMs coordinated dynamic voltage and frequency scaling; thermal efficiency quantification 3 2 2
2017 IEEE Access Yuan Ze University Thermal- and Performance-Aware Address Mapping for the Multi-Channel Three-Dimensional DRAM Systems inter-channel bank swapping; inter-channel bank reordering 3 3 2
2020 TCAD BUAA Temperature-Aware DRAM Cache Management—Relaxing Thermal Constraints in 3-D Systems temperature-safe cache operation; exploration on cache remapping; write-back optimization 4 3 2
2024 TCAD IIT 3D-TemPo: Optimizing 3-D DRAM Performance Under Temperature and Power Constraints reward-based dynamic power budgeting; adjacency awareness; DRAM low-power-based DTM 3 3 2

3DIC

Year Venue Authors Title Tags P E N
2004 ICCAD UCLA A thermal-driven floorplanning algorithm for 3D ICs combined bucket and 2D array; tile stack based model; horizontal and vertical heat flow analysis
2016 IJHMT UCR Analysis of critical thermal issues in 3D integrated circuits thermal hotspots; impact of thermal interface materials; power distribution; processor pitch and area

Fault-Tolerant Cache

Year Venue Authors Title Tags P E N
2009 ICCD NUS The Salvage Cache: A fault-tolerant cache architecture for next-generation memory technologies fault-bit protection for divisions; victim map based division replacement
2011 CASES UCSD FFT-Cache: A Flexible Fault-Tolerant Cache Architecture for Ultra Low Voltage Operation flexible defect map for faulty block; FDM configuration algorithm; non-functional lines minimization