Papers presented at the recent International Reliability Physics Symposium (IRPS) showed the growing importance of lifetime monitoring in managing components as they age.
In his keynote at IRPS, Intel CTO Mike Mayberry talked about the growing difficulty of determining from manufacturing tests alone whether an individual SoC is likely to survive long-term. In some industries, such as automotive, manufacturers may simply have to err on the side of conservatism and sideline parts that show any sign of being prone to early failure, even at the risk of discarding good parts that were flagged as false positives. But in environments such as data centres, where regular maintenance is an expected cost, monitoring provides a way of maximising output and lifetime data throughput on the understanding that some parts will fail early. Alternatively, systems can shift operation to hidden or lightly used cores that take over the heavy lifting once a primary core has worn out. Early failures also provide clues that can guide design and manufacturing changes in the fab and assembly.
Because of the complex interactions between device behavior and reliability, machine learning is likely to form an underpinning for the management regime. One user working on methods for onchip monitoring that employ machine learning is Cisco Systems. Engineers from the networking company described how they take information from onchip analog sensors up to system-level controllers.
One experiment described at IRPS covered the use of sensors focused on the behavior of memory chips in optical-router line cards that had suffered early failures. The systems have memory for onboard failure logging (OBFL) that, on average, had collected 5000 records capturing data from 75 sensors. Those sensors monitor temperatures, currents and voltages at key points on the card. The OBFL entries record excursions from normality. To make sense of the data, the team used an isolation forest algorithm to pick out modules with strongly anomalous behavior, which in turn pointed to a common failure mode caused by hot carrier injection. The team found that memory tests, which need to be conducted during start-up because they are intrusive, could predict failures in cards that were still working. Another experiment showed that monitoring across a system can help detect transient failures caused by insufficient margins that affect the operation of a router and lead to an excessive number of dropped packets.
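As a rough illustration of how an isolation forest singles out an anomalous module, the following Python sketch builds a small forest from scratch using only the standard library. It is not Cisco's implementation: the module names, the three per-module features (mean temperature, current and voltage) and all of the values are invented for the example, and each tree is built on the full data set rather than on a random subsample as production implementations usually do.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop instead of retrying another feature
        return ("leaf", len(rows))
    threshold = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return ("node", feature, threshold,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, row, depth=0):
    """Number of splits needed to isolate row, with a correction for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, feature, threshold, left, right = tree
    child = left if row[feature] < threshold else right
    return path_length(child, row, depth + 1)


def anomaly_scores(rows, n_trees=100, seed=0):
    """Score each row in (0, 1]; rows that are isolated quickly score closer to 1."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(max(len(rows), 2)))
    trees = [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]
    norm = average_path_length(len(rows))
    scores = []
    for row in rows:
        mean_depth = sum(path_length(t, row) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# Hypothetical per-module summaries: (mean temperature in C, current in A, voltage in V).
data_rng = random.Random(42)
modules = {
    f"module-{i:02d}": (data_rng.gauss(55.0, 2.0),
                        data_rng.gauss(2.0, 0.1),
                        data_rng.gauss(1.00, 0.01))
    for i in range(20)
}
modules["module-07"] = (78.0, 3.1, 0.91)  # invented outlier: hot, high current, sagging rail

names = list(modules)
scores = anomaly_scores([modules[name] for name in names])
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[:3])
```

The outlier needs far fewer random splits to be separated from the rest, so it receives the highest score. That is the property that lets the technique flag unusual modules without labelled failure data.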
The experiments reported at IRPS focused on centralised data collection, training and evaluation. However, the amount of data the monitoring subsystems produce can be enormous, so the Cisco engineers are looking at ways to perform much of the analysis as close to the source system as possible, with smaller subsets of the data passed to central servers for in-depth analysis.
Onchip monitoring networks
IP suppliers are now positioning themselves to provide cores that SoC designers can incorporate into subsystems that actively look for aging effects. Earlier in the year, yield specialist PDF Solutions signed a deal with onchip-debug company UltraSoC to use onchip monitors to help guide manufacturing. Israeli startup ProteanTecs, which described its approach to cloud-based machine learning with onchip sensors at IRPS, is focused primarily on predicting failures due to component degradation. Like Mayberry, the company is taking the position that reliability engineering is moving from accelerated lifetime tests on a sample of devices to in-field monitoring and prediction across the entire manufacturing output.
The Proteus system described by the group captures data during both manufacturing test and system operation to provide a wider range of data points to the machine-learning models. In addition to monitoring temperature and voltage using conventional sensors to look for causes of device stress, the Proteus architecture uses “agent” sensors that focus on timing and on proxies for process-dependent variables such as threshold voltage and leakage current. The timing agents monitor changes in setup margins to look for the increase in combinatorial-logic delay caused by aging. As with test insertion, placement algorithms focus on high-value paths that are good candidates for in-field measurement. Typically, these are paths likely to be sensitive to timing changes as well as those that provide a high degree of coverage, such as high fan-in trees.
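To show how a trend in setup margin could be turned into a prediction, the Python sketch below fits a straight line to periodic slack readings and estimates how many operating hours remain before the slack falls to a guard band. This is an assumption-laden illustration rather than the Proteus method: the function names, the readings and the 100 ps guard band are hypothetical, and real aging mechanisms often follow a power law, so a linear fit is a deliberate simplification.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_margin(hours, slack_ps, guard_band_ps):
    """Estimate operating hours left before slack reaches the guard band.

    Returns None when the fitted slack is flat or improving, because the
    extrapolated line then never reaches the guard band.
    """
    slope, intercept = fit_line(hours, slack_ps)
    if slope >= 0:
        return None
    crossing_hour = (guard_band_ps - intercept) / slope
    return max(0.0, crossing_hour - hours[-1])


# Hypothetical readings from one timing agent: setup slack in picoseconds.
readings_hours = [0, 1000, 2000, 3000, 4000]
readings_slack = [120.0, 118.0, 116.0, 114.0, 112.0]

remaining = hours_until_margin(readings_hours, readings_slack, guard_band_ps=100.0)
print(f"Estimated hours before slack reaches the guard band: {remaining:.0f}")
```

With these invented readings the slack drops by 2 ps every 1000 hours, so the line reaches the 100 ps guard band at 10,000 hours, which is 6000 hours after the last reading.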
A particular focus for ProteanTecs is high-bandwidth memory stacks, which are likely to be prone to connection failures and which are very difficult to test conventionally during manufacturing. For this, the company has developed a signal-integrity probe IP core akin to the onchip logic analyzer probes found in field-programmable gate arrays (FPGAs). The software platform uses data from probes placed at strategic points to determine good candidates to replace data lanes that are likely to fail in the near future, showing one way in which reliability monitoring will help tune system layout for maximum resilience.
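One way to picture the lane-replacement decision is as a simple matching problem: lanes whose measured signal margin has fallen below a threshold are paired with the healthiest available spares, worst lane first. The Python sketch below assumes a single normalised margin score per lane and a pool of spare lanes. Those assumptions, along with the lane names, the scores and the 0.25 threshold, are invented for illustration and are not taken from the ProteanTecs platform.

```python
def plan_lane_repairs(lane_margin, spare_lanes, min_margin):
    """Map at-risk data lanes to healthy spares, worst lane first.

    lane_margin: dict of lane name -> margin score (higher is healthier);
                 it must contain an entry for every spare lane.
    spare_lanes: names of lanes reserved as replacements.
    min_margin:  lanes scoring below this value are treated as likely to fail.
    """
    spares = set(spare_lanes)
    at_risk = sorted(
        (margin, lane)
        for lane, margin in lane_margin.items()
        if lane not in spares and margin < min_margin
    )
    healthy_spares = sorted(
        (lane for lane in spare_lanes if lane_margin[lane] >= min_margin),
        key=lambda lane: lane_margin[lane],
        reverse=True,
    )
    # zip stops at the shorter list, so lanes are left unrepaired if spares run out.
    return {lane: spare for (_, lane), spare in zip(at_risk, healthy_spares)}


# Hypothetical margin scores derived from signal-integrity probe data.
margins = {
    "dq0": 0.42, "dq1": 0.18, "dq2": 0.40, "dq3": 0.22,
    "spare0": 0.39, "spare1": 0.44,
}
plan = plan_lane_repairs(margins, ["spare0", "spare1"], min_margin=0.25)
print(plan)
```

Here the weakest lane, dq1, is given the strongest spare, and dq3 takes the remaining one. A real system would also have to weigh factors such as routing constraints and how quickly each lane is degrading.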