Energy-minimal dataflow architecture with programmable on-chip network

Inventors

Lucia, BrandonBeckmann, NathanGobieski, Graham

Interested in licensing this patent?

MTEC can help explore whether this patent might be available for licensing for your application.

Assignees

Carnegie Mellon University

Carnegie Mellon University is a global research institution based in Pittsburgh, Pennsylvania, recognized for interdisciplinary education, research, and innovation in science, engineering, arts, technology, and social sciences. The university leads advancements in artificial intelligence, robotics, digital health, and performing arts. Located in a technology-driven and culturally rich city, CMU powers real-world impact through research centers, industry engagement, workforce training, and initiatives that shape regional and global communities.

Publication Number

US-12541349-B2

Patent

Publication Date

2026-02-03

Expiration Date


Abstract

Disclosed herein is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency. The architecture includes a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. The architecture is able to realize both energy and area savings over prior art implementations by offloading most control operations into a programmable on-chip network where they can re-use existing network switches.

Core Innovation

The invention relates to a co-designed energy-minimal CGRA and compiler that compiles a program to support arbitrary control flow and irregular memory access. Control-flow primitives are inserted during compilation from an intermediate representation into a dataflow graph, enabling execution on a coarse-grained reconfigurable array. The approach uses ordering-operator-based dataflow concepts to represent dependencies and to drive control flow through the CGRA and its on-chip network.

A memory-ordering mechanism is enforced on the intermediate representation to create an ordering graph encoding dependencies between memory operations as ordering arcs. The compiler iteratively prunes the ordering graph to eliminate redundant ordering arcs already enforced by data and control dependencies, including path-sensitive transductive reduction that transforms a potentially cyclic ordering graph through strongly-connected components.

Control-flow operators are translated into the dataflow graph by inserting control-flow operators including Steer, Carry, Invariant, Merge, Order, and stream operator, and mapping operators onto the CGRA. The control-flow operators utilize logic of an on-chip network of the CGRA to control flow of the program by reusing NoC switches with added control-flow modules, in a bufferless 2D-torus CGRA fabric with heterogeneous processing elements.

Claims Coverage

The independent claim describes a system with a processor and compiler software that performs a compilation and mapping pipeline including intermediate representation generation, memory-ordering via an ordering graph with pruning, conversion to a dataflow graph with inserted control-flow operators, and mapping onto a CGRA where control-flow operators use on-chip network logic. The independent claim’s inventive features are focused on memory-order enforcement with ordering-arc pruning, translation to a dataflow graph with inserted control-flow operators, and CGRA mapping in which control-flow operators utilize on-chip network logic for program control.

High-level to intermediate representation compilation

The compiler software compiles a program written in a high-level language to an intermediate representation.

Memory-order enforcement via ordering graph arcs

The compiler software enforces memory ordering on the intermediate representation to create an ordering graph encoding dependencies between memory operations as ordering arcs.

Iterative pruning of redundant ordering arcs

The compiler software iteratively prunes the ordering graph to eliminate redundant ordering arcs already enforced by data and control dependencies.

Intermediate representation to dataflow graph translation

The compiler software translates the intermediate representation to a dataflow graph.

Insertion of control-flow operators into the dataflow graph

The compiler software inserts control-flow operators into the dataflow graph.

CGRA mapping using on-chip network logic for control flow

The compiler software maps operators in the dataflow graph to a coarse-grained reconfigurable array (CGRA) wherein the control-flow operators utilize logic of an on-chip network of the CGRA to control flow of the program.

CGRA with heterogeneous processing elements and bufferless 2D-torus on-chip network

The system includes a CGRA with a fabric of heterogeneous processing elements connected through a bufferless 2D-torus on-chip network, along with a RISC-V scalar core and SRAM main memory.

Path-sensitive transductive reduction for ordering-graph acyclicity

The system applies path-sensitive transductive reduction to an ordering graph to transform a potentially cyclic ordering graph into an acyclic graph outside of strongly-connected components.

Control-flow operators comprising Steer, Carry, Invariant, Merge, Order, and stream operator

The system defines control-flow operators comprising a steer operator, a carry operator, an invariant operator, a merge operator, an order operator, and a stream operator.

Dataflow graph optimization before mapping

The system optimizes the dataflow graph before mapping the operators in the dataflow graph.

C language to LLVM intermediate representation via clang

The system further limits that the high-level language is C and the intermediate representation is LLVM produced by clang.

Overall, the independent claim’s coverage centers on enforcing memory ordering through an ordering graph of ordering arcs with iterative pruning, converting the intermediate representation to a dataflow graph with inserted control-flow operators, and mapping onto a CGRA in which control-flow operators use on-chip network logic to control program flow. The refinements further specify a heterogeneous CGRA fabric with bufferless 2D-torus connectivity and concrete operator types, along with path-sensitive transductive reduction for handling cyclic ordering graphs, optimization of the dataflow graph prior to mapping, and constraints to compiling C to LLVM via clang.

Stated Advantages

Reported energy savings compared with prior CGRAs when control-flow is mapped in the NoC (e.g., ~40% energy).

Reported area savings compared with prior CGRAs when control-flow is mapped in the NoC (e.g., ~22% area).

Reported less energy from ILP mapping (e.g., ILP ~4.3% less energy).

Reported fast mapping solutions using SAT (e.g., SAT <3 min).

Documented Applications

Mapping and compilation of programs from a high-level language into a CGRA execution flow that supports arbitrary control flow and irregular memory access.

JOIN OUR MAILING LIST

Stay Connected with MTEC

Keep up with active and upcoming solicitations, MTEC news and other valuable information.