NOCS '21: Proceedings of the 15th IEEE/ACM International Symposium on Networks-on-Chip
SESSION: NoCs for DNN accelerators
The increasing deployment of Deep Neural Networks (DNNs) in a myriad of applications has recently fueled interest in the development of specific accelerator architectures capable of meeting their stringent performance and energy consumption requirements.
DNN accelerators use three separate NoCs, namely the distribution, multiplier and reduction networks (DN, MN and RN, respectively), between the global buffer(s) and the compute units (multipliers/adders). These NoCs enable data delivery and, more importantly, on-chip reuse of operands and outputs to minimize the expensive off-chip memory accesses.
Among them, the RN, used to generate and reduce the partial sums produced during DNN processing, accounts for the largest fraction of chip area (25% of the total chip area in some cases) and power dissipation (38% of the total chip power budget), making it a first-order driver of the energy efficiency of the accelerator.
RNs can be orchestrated to exploit a Temporal, Spatial or Spatio-Temporal reduction dataflow. Among these, the last has shown superior performance. However, as we demonstrate in this work, a state-of-the-art implementation of the Spatio-Temporal reduction dataflow, based on the addition of Accumulators (Ac) to the RN (i.e., the RN+Ac strategy), can result in significant area and energy expenses. To cope with this important issue, we propose STIFT (which stands for Spatio-Temporal Integrated Folding Tree), which implements the Spatio-Temporal reduction dataflow entirely on the RN hardware substrate (i.e., without the need for extra accumulators). STIFT yields significant area and power savings compared with the more complex RN+Ac strategy, while preserving its performance advantage.
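As a rough illustration of the Spatio-Temporal reduction dataflow described in the abstract above, the following Python sketch folds a long dot product over a fixed number of multipliers: each fold is reduced spatially by an adder tree, and the per-fold results are accumulated over time. The function names and the `num_multipliers` parameter are hypothetical. The running sum kept outside the tree corresponds more closely to the RN+Ac baseline; the sketch does not model how STIFT integrates the accumulation into the tree itself.

```python
def tree_reduce(values):
    """Spatial reduction: pairwise adder tree over one fold of products."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with a zero operand
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def spatio_temporal_dot(weights, inputs, num_multipliers):
    """Dot product folded over `num_multipliers` multipliers (assumed > 0).

    Each loop iteration is one temporal step; the adder tree handles the
    spatial part, and `acc` plays the role of an accumulator at the root.
    """
    acc = 0
    for start in range(0, len(weights), num_multipliers):
        end = start + num_multipliers
        products = [w * x for w, x in zip(weights[start:end], inputs[start:end])]
        acc += tree_reduce(products)
    return acc


weights = list(range(1, 11))   # 10 operands
inputs = [2] * 10
result = spatio_temporal_dot(weights, inputs, num_multipliers=4)  # 3 folds
print(result)
```

With ten operands and four multipliers, the reduction takes three folds (4 + 4 + 2 products), which is the case a purely spatial tree of width four could not handle in one pass.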
Deep neural network (DNN) algorithms are expected to be core components of next-generation applications. These high-performance sensing and recognition algorithms are key enabling technologies of smarter systems that make appropriate decisions about their environment. The integration of these compute-intensive and memory-hungry algorithms into embedded systems will require the use of specific energy-efficient hardware accelerators. The intrinsic parallelism of DNN algorithms allows for the use of a large number of small processing elements, and the tight exploitation of data reuse can significantly reduce power consumption. To exploit these features, many dataflow models and on-chip communication proposals have been studied in recent years. This paper presents a comprehensive study of on-chip communication properties based on the analysis of application-specific features, such as data reuse and communication models, as well as the results of mapping these applications to architectures of different sizes. In addition, the influence of mechanisms such as broadcast and multicast on performance and energy efficiency is analyzed. This study leads to the definition of overarching features to be integrated into next-generation on-chip communication infrastructures for CNN accelerators.
Conventional AI accelerators are limited by von Neumann bottlenecks for edge workloads. Domain-specific accelerators (often neuromorphic) solve this by applying near/in-memory computing, NoC-interconnected massive-multicore setups, and data-flow computation. This requires an effective mapping of neural networks (i.e., an assignment of network layers to cores) to balance resources/memory, computation, and NoC traffic. Here, we introduce a mapping called Snake for the predominant convolutional neural networks (CNNs). It exploits the feed-forward nature of CNNs by folding layers onto spatially adjacent cores. We achieve a total NoC bandwidth improvement of up to 3.8X for MobileNet and ResNet vs. random mappings. Furthermore, we propose NEWROMAP, which further optimizes the Snake mapping through a meta-heuristic; it also simulates the NoC traffic and can work with TensorFlow models. Simulations show that communication is further optimized, with up to 22.52% latency improvement vs. pure Snake mapping.
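The idea of folding feed-forward layers onto spatially adjacent cores can be sketched as follows. This is a minimal illustration, not the authors' tool: it assumes one layer per core on a 2D mesh, measures traffic only as the Manhattan hop count between consecutive layers, and all function names are hypothetical.

```python
import random


def snake_mapping(num_layers, width):
    """Place layer i on a mesh in boustrophedon (snake) order."""
    coords = []
    for i in range(num_layers):
        row, col = divmod(i, width)
        if row % 2 == 1:            # odd rows run right-to-left
            col = width - 1 - col
        coords.append((col, row))
    return coords


def random_mapping(num_layers, width, height, seed=0):
    """Place layers on distinct, randomly chosen mesh cells."""
    rng = random.Random(seed)
    cells = [(x, y) for y in range(height) for x in range(width)]
    return rng.sample(cells, num_layers)


def total_hops(coords):
    """Sum of Manhattan distances between consecutive layers."""
    return sum(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(coords, coords[1:]))


snake = snake_mapping(16, width=4)
rand = random_mapping(16, width=4, height=4)
print("snake hops:", total_hops(snake))
print("random hops:", total_hops(rand))
```

Under these assumptions every pair of consecutive layers in the snake placement sits on neighbouring cores, so the hop count is the minimum possible for a chain of layers; a random placement can only match or exceed it.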
SESSION: Security and NoC routing
With the advancement of VLSI technology, Tiled Chip Multicore Processors (TCMP) with packet-switched Network-on-Chip (NoC) have emerged as the backbone of modern data-intensive parallel systems. Due to tight time-to-market constraints, manufacturers are exploring the possibility of integrating several third-party Intellectual Property (IP) cores in their TCMP designs. The presence of a malicious Hardware Trojan (HT) in the NoC routers can adversely affect communication between tiles, leading to degradation of overall system performance. In this paper, we model an HT mounted on the input buffers of NoC routers that can alter the destination address field of selected NoC packets. We study such HTs and analyse their first- and second-order impacts at the core level, cache level, and NoC level, both quantitatively and qualitatively. Our experimental study shows that the proposed HT can bring an application to a complete halt by stalling instruction issue and can significantly impact the miss penalty of L1 caches. The impact of re-transmission techniques in the context of HT-impacted packets getting discarded is also studied. We also expose the unrealistic assumptions and unacceptable latency overheads of existing mitigation techniques for packet header attacks and emphasise the need for alternative cost-effective HT management techniques for the same.
Network-on-Chip (NoC) is widely used as an efficient communication architecture in multi-core and many-core System-on-Chips (SoCs). However, the shared communication resources in NoCs, e.g., channels, buffers, and routers might be used to conduct attacks compromising the security of NoC-based SoCs. Almost all of the proposed encryption-based protection methods in the literature need to leave some parts of the packet unencrypted to allow the routers to process/forward packets accordingly. This uncovers the source/destination information of the packet to malicious routers, which can be used in various attacks. In this paper, we propose the idea of secure anonymous routing with minimal hardware overhead to hide the source/destination information while exchanging secure information over the network. The proposed method uses a novel source-routing algorithm that works with encrypted destination addresses and prevents malicious routers from discovering the source/destination of secure packets. To support our proposal, we have designed and implemented a new NoC architecture that works with encrypted addresses. The conducted hardware evaluations show that the proposed security solution combats the security threats at an affordable cost of 1% area and 10% power overheads chip-wide.
State-of-the-art System-on-Chip (SoC) designs consist of many Intellectual Property (IP) cores that interact using a Network-on-Chip (NoC) architecture. SoC designers increasingly rely on global supply chains for obtaining third-party IPs. In addition to inherent vulnerabilities associated with utilizing third-party IPs, NoC based SoCs enable attackers to exploit the distributed nature of NoC and its connectivity with various IPs to launch a plethora of attacks. Specifically, Denial-of-Service (DoS) attacks pose a serious threat in degrading the SoC performance by flooding the NoC with unnecessary packets. In this paper, we present a machine learning-based runtime monitoring mechanism to detect DoS attacks. The models are statically trained and used for runtime attack detection leading to minimum runtime performance overhead. Our approach is capable of detecting DoS attacks with high accuracy, even in the presence of unpredictable NoC traffic patterns caused by various application mappings. We extensively explore machine learning models and features to provide a comprehensive study on how to use machine learning for DoS attack detection in NoC-based SoCs.
SESSION: NoC design for modern systems
The integration of many processing elements per die makes it more difficult to provide low latency in the Network-on-Chip (NoC). Multihop bypass proposals, such as SMART, attack this problem by allowing flits to skip multiple routers in the path in a single cycle, drastically reducing latency while preserving a regular tiled layout. However, multihop bypass routers are more complex and relatively different from traditional NoC routers, since they rely on global broadcast signals and global allocation mechanisms. Additionally, the maximum number of nodes that can be bypassed within a single cycle is limited by the Critical Path Delay (CPD) of the NoC. Hence, a practical multihop bypass mechanism must also minimize this delay.
To simplify the design of multihop bypass mechanisms, this work introduces PlugSMART, an open-source pluggable Verilog module that extends a traditional router to support multihop bypass. PlugSMART follows a black-box approach, requiring minimal modifications to the original router. As an application of PlugSMART, we introduce ProSMART, a multihop bypass extension of the efficient NoC router ProNoC. ProSMART is evaluated using simulations, FPGA, and ASIC synthesis. Results show that it is more performant and requires significantly fewer resources than previous open-source designs. The comparison with OpenSMART++, the most recent state-of-the-art SMART-based NoC, shows up to a 50% reduction in both area and CPD. Overall, PlugSMART constitutes a simple alternative for fast and efficient upgrading of existing NoC routers, making it possible to implement multihop bypass and significantly improve performance while preserving the original characteristics of the router design.
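To see why the number of routers that can be bypassed per cycle matters, consider the following back-of-the-envelope Python sketch. It is an idealized model under loud assumptions, not the PlugSMART design: it counts only link-traversal cycles on a contention-free XY route, assumes bypassing stops at the turn between dimensions, and ignores setup and router pipeline cycles. The parameter `hpc_max` (maximum hops per cycle) is a hypothetical name for the limit that the abstract ties to the critical path delay.

```python
import math


def baseline_cycles(dx, dy):
    """One hop per cycle along an XY route."""
    return abs(dx) + abs(dy)


def bypass_cycles(dx, dy, hpc_max):
    """Up to `hpc_max` hops per cycle within each dimension (idealized)."""
    if hpc_max < 1:
        raise ValueError("hpc_max must be at least 1")
    return math.ceil(abs(dx) / hpc_max) + math.ceil(abs(dy) / hpc_max)


# A route spanning 7 hops in X and 3 hops in Y.
for hpc_max in (1, 2, 4, 8):
    print(hpc_max, baseline_cycles(7, 3), bypass_cycles(7, 3, hpc_max))
```

With `hpc_max = 1` the model degenerates to the baseline, and the benefit saturates once `hpc_max` exceeds the longest straight segment of the route, which is one reason a shorter critical path (and thus a larger reachable span per cycle) pays off mainly on larger meshes.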
The performance of graphics processing unit (GPU) workloads can be sensitive to the various clock domains that are dynamically tunable in modern GPUs. In this work, we observe that GPU application performance is sensitive to NoC clock frequencies and that the sensitivity varies during the execution of GPU kernels. We note that this heterogeneity is not handled well by traditional dynamic voltage and frequency scaling (DVFS) techniques. To that end, we introduce DUB, a Dynamic Underclocking and Bypassing technique, for such heterogeneous GPU workloads. We enable bypassing of re-timer flops and routers while underclocking the NoC frequency, thus enabling high power savings at minimal performance loss. Compared to the baseline, we observe a 26% improvement in power savings with only 3% degradation in performance, beating oracular DVFS techniques.
The recent line of Versal FPGA devices from Xilinx Inc. includes a hard Network-On-Chip (NoC) embedded in the programmable logic, designed to be a high-performance system-level interconnect. While the target markets for Versal devices include applications with real-time constraints, such as automotive driver assist, the associated development tools only provide figures for "structural latencies" of data packets, which assume that the network is otherwise idle. In a realistic setting, this information is not enough to ensure deadlines are met, as different packets can contend for NoC switch outputs, which causes packet contents to be buffered while in transit, increasing their latency. In this work, we present a formal description of the NPS switches that compose the Versal NoC from a flit (or packet) scheduling perspective, based on the available cycle-accurate switch simulation code. We then analyze a scenario where network clients transfer data periodically over a single switch, and propose a method for calculating worst-case communication times in this scenario.
The synchoros VLSI design style has been proposed as an alternative to the standard cell-based design style; the word synchoros is derived from the Greek word choros for space. Synchoricity discretises space with a virtual grid, the way synchronicity discretises time with clock ticks. SiLago (Silicon Lego) blocks are atomic synchoros building blocks like Lego bricks. SiLago blocks absorb all metal layer details, i.e., all wires, to enable composition by abutment of valid designs; valid in the sense of being technology design rules compliant, timing clean and OCV ruggedized. Effectively, composition by abutment eliminates logic and physical synthesis for the end user. Like the Lego system, synchoricity does need a finite number of SiLago block types to cater to different types of designs. Global NoCs are important system-level design components. In this paper, we show how, with a small library of SiLago blocks for global NoCs, it is possible to automatically synthesize arbitrary global NoCs of different types, dimensions, and topology. The synthesized global NoCs are not only valid VLSI designs, but their cost metrics (area, latency, and energy) are known with post-layout accuracy in linear time. We argue that this is essential to be able to do chip-level design space exploration. We show how the abstract timing model of such global NoC SiLago blocks can be built and used to analyse the timing of global NoC links with post-layout accuracy and in linear time. We validate this claim by subjecting the same VLSI designs of global NoCs to a commercial EDA tool's static timing analysis and show that the abstract timing analysis enabled by synchoros VLSI design gives the same results as the commercial EDA tools.
SESSION: Secure NoC-based systems
SoC security has become essential with devices now pervasive in critical infrastructure, homes and businesses. Today's embedded SoCs are becoming increasingly high-performance and complex, comprising multiple cores, accelerators, and IP blocks interconnected with a Network-on-Chip (NoC). As these IPs can originate from diverse sources, they cannot be trusted to form the root of trust in SoCs. However, the NoC itself, being the communication backbone linking all IPs, is naturally positioned to be the basis for a secure SoC. Therefore, there is a need for an efficient solution that meets the stringent requirements of modern embedded SoC designs while maintaining a high level of security.
In this paper, we demonstrate how statically-scheduled NoCs inherently enforce traffic isolation and non-interference of communication. The time-division multiplexing (TDM) of NoC links across applications provably ensures that security properties are fulfilled. However, conventional TDM NoCs are still vulnerable to side-channel attacks. We thus propose temporal and data obfuscation schemes that can be embedded within static TDM NoCs, randomizing source-destination communication patterns and switching activity over the links. Our proposed statically-scheduled Sentry-NoC links up untrusted IP blocks to form a secure SoC. Sentry-NoC targets key security properties to effectively mitigate side-channel attacks with an extremely low overhead, reducing average temporal correlation by 81% and average data correlation by 91%.
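The isolation property of a statically scheduled TDM NoC can be illustrated with a small slot-table sketch. This is a generic greedy allocator written for illustration, not the Sentry-NoC scheduler, and it does not model the temporal or data obfuscation schemes. The flow names, link labels, and the rule that a flow injected in slot `s` occupies its k-th link in slot `(s + k) % period` are assumptions.

```python
def build_schedule(flows, period):
    """Greedy static TDM allocation.

    `flows` maps a flow name to its ordered list of links. A flow injected
    in slot s uses path[k] during slot (s + k) % period. Returns the chosen
    start slot per flow and the (link, slot) -> flow ownership table.
    """
    owner = {}
    start = {}
    for name, path in flows.items():
        for s in range(period):
            needed = [(link, (s + k) % period) for k, link in enumerate(path)]
            if all(key not in owner for key in needed):
                for key in needed:
                    owner[key] = name
                start[name] = s
                break
        else:
            raise ValueError(f"no free slot for flow {name}")
    return start, owner


flows = {
    "A": ["r0->r1", "r1->r2"],
    "B": ["r1->r2", "r2->r3"],
    "C": ["r0->r1"],
}
start, owner = build_schedule(flows, period=4)
print(start)
for (link, slot), name in sorted(owner.items()):
    print(link, slot, name)
```

Because every (link, slot) pair has at most one owner, flows never contend for a link at run time, which is the sense in which a static schedule enforces non-interference. A fixed schedule is also perfectly periodic, which hints at why timing side channels remain a concern and why the abstract adds obfuscation on top.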
A Network-on-Chip (NoC) Firewall provides memory protection and process isolation. In this paper, we design, implement and validate hierarchical Linux security primitives on top of a custom NoC Firewall module embedded on the ARM-based Xilinx Zedboard FPGA. Our open-source, multi-layer security protocols aim to protect the privacy of application keys stored in non-cacheable BRAM. Experimental results derived from integrating security within a soft real-time electrocardiogram monitoring, analysis, and visualization application allow us to evaluate the software overhead of the proposed security primitives. Preliminary results indicate that the performance overhead for supporting data privacy is acceptable for one-time authentication schemes. However, relative to the processing requirements of the E-Health application, security overheads are large and cannot sustain continuous authentication schemes.