Meta Shares How It Detects Silent Data Corruptions In Its Data Centres
Silent data corruptions (SDCs) are data errors that leave no record or trace in system logs. Sources of SDCs include datapath dependencies, temperature variance, and age, among other silicon factors. Because these errors are silent, they can stay undetected within workloads and propagate across several services. They can affect memory, storage, and networking as well as CPUs, causing data loss and corruption. Meta engineers started testing three years ago because they found SDCs difficult to detect once components had already gone into the company's production data centre fleets....