Guy Lemieux


Relevant Thesis-Based Degree Programs


Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Machine learning based techniques for routing interconnects in very large scale integrated (VLSI) circuits (2022)

Global routing is a significant challenge in Integrated Circuit (IC) designs due to circuits' increasing number of metal layers, transistors, and the resulting growth in design complexity. Congestion is a crucial factor contributing to routing completion because required interconnects of a design with no congestion can be easily routed. A circuit with congestion will have challenges during routing and may require a re-placement, which lengthens the time to complete the design and may delay time to market. Congestion also affects routing complexity, which may increase wire length and the number of vias and detours in a layout, affecting overall circuit performance. Furthermore, congested areas in a layout may cause manufacturing yield and reliability problems. Congested areas have a higher potential for creating shorts and opens which can eventually lead to unworkable chips. Traditional congestion estimation algorithms use simple, fixed models which do not change as the technology nodes scale to finer dimensions. To address this shortcoming, we investigate Machine Learning (ML) based congestion estimation approaches. By training from previously routed circuits, an ML-based estimator implicitly learns from the advanced design rules of a particular technology node as well as from the routing behaviours of routers. In this investigation, three ML-based approaches for congestion estimation are explored. First, a regression model to estimate congestion is developed, which is in average 9X faster than traditional approaches while maintaining a similar quality of routing solution. Second, a Generative Adversarial Network (GAN) based model is developed to accurately predict congestion of fixed-size tiles of a circuit. Third, a customized Convolutional Neural Network (CNN) is designed for congestion estimation which uses a sliding-window approach to smooth tile-based discontinuities. This CNN produces the best correlations with actual post-routing congestion compared with other state-of-the-art academic routers. Furthermore, an improved global routing heuristic is developed with which congestion estimators can be merged. Results show that my work achieves 14% reduction in runtimes on average compared with other routers and achieves significantly lower runtimes on difficult-to-route circuits. Overall, this work demonstrates the feasibility of using ML-based approaches to improve routing algorithms for complex IC implemented in nanometer technologies.

View record

Automated space/time scaling of streaming task graphs on field-programmable gate arrays (2019)

Parallel computing platforms provide good performance for streaming applications within a limited power budget. However, these platforms can be difficult to program. Moreover, when the size of the computing platform target changes, users must manually reallocate resources and parallelism. This thesis provides a framework to retarget applications described by a Streaming Task Graph (STG) for implementation on different platforms, where the framework can automatically scale the solution size to fit available resource or performance targets. First, we explore automated space/time scaling for STGs targeting a pipelined coarse-grained architecture. We produce a tool that analyzes the degrees of parallelism in a general stream application and finds possible bottlenecks. We introduce possible optimization strategies for STGs, and two algorithmic approaches: a classical approach based upon Integer Linear Programming (ILP), and a novel heuristic approach which introduces a new optimization and produces better results (using 30% less area) with a shorter run-time. Second, we explore automated space/time scaling for STGs targeting a fine-grained architecture (Field-Programmable Gate Array (FPGA)). We propose a framework on top of a commercial High-Level Synthesis (HLS) tool which adds the ability to automatically meet a defined area budget or target throughput. Within the framework, we use similar ILP and heuristic approaches. The heuristic automatically achieves over 95% of the target area budget on average while improving throughput over the ILP. It can also meet the same throughput targets as the ILP while saving 19% area.Third, we investigate automated space/time scaling of STGs in a hybrid architecture consisting of a Soft Vector Processor (SVP) and select custom instructions. To achieve higher performance, we investigate using dynamic Partial Reconfiguration (PR) by time-sharing the FPGA resources. The performance results achieve speedups far beyond what a plain SVP can accomplish: an 8-lane SVP achieves a speedup of 5.3 on the Canny-blur application, whereas the PR version is another 3.5 times faster, with a net speedup of 18.5 over the ARM Cortex A9 processor. By increasing the dynamic PR rate beyond what is available today, we also show that a further 5.7 times speedup can be achieved (105.9x speedup versus the Cortex A9).

View record

Architecture of Block-RAM-Based Massively Parallel Memory Structures: Multi-Ported Memories and Content-Addressable Memories (2016)

Since they were first introduced three decades ago, Field-Programmable Gate Arrays (FPGAs) have evolved from being merely used as glue-logic to implementing entire compute accelerators. These massively parallel systems demand highly parallel memory structures to keep pace with their concurrent nature since memories are usually the bottleneck of computation performance. However, the vast majority of FPGA devices provide dual-ported SRAM blocks only. In this dissertation, we propose new ways to build area-efficient, high-performance SRAM-based parallel memory structures in FPGAs, specifically Multi-Ported Random Access Memory and Content-Addressable Memory (CAM). While parallel computation demands more RAM ports, leading Multi-Ported Random Access Memory techniques in FPGAs have relatively large overhead in resource usage. As a result, we have produced new design techniques that are near-optimal in resource overhead and have several practical advantages. The suggested method reduces RAM usage by over 44% and improves clock speed by over 76% compared to the best of previous approaches. Furthermore, we propose a novel switched-ports technique that allows further area reduction if some RAM ports are not simultaneously active. A memory compiler is proposed to generalize the previous approach and allow generating Multi-Switched-Ports Random Access Memory. Content-Addressable Memories (CAMs), the hardware implementation of associative arrays, are capable of searching the entire memory space for a specific value within a single clock cycle. CAMs are massively parallel search engines accessing all memory content to compare with the searched pattern simultaneously. CAMs are used in a variety of scientific fields requiring high-speed associative searches. Despite their importance, FPGAs lack an area-efficient CAM implementation. We propose a series of scalable, area-efficient, and high-performance Binary Content-Addressable Memories (BCAMs) based on hierarchical search and data compression methods. Compared to current RAM-based BCAM architectures, our BCAMs require a maximum of 18% the RAM storage while enhancing clock speed by 45% on average, hence exhibiting a superior single-cycle search rate. As a result, we can build faster and more cost-effective accelerators to solve some of the most important computational problems.

View record

CAD algorithms and performance of Malibu: An FPGA with time-multiplexed coarse-grained elements (2011)

Modern Field-Programmable Gate Arrays (FPGAs) are used to implement a wide range of ever-larger circuits, many of which have both coarse-grained and fine-grained components. Past research into coarse-grained FPGAs optimized for such circuits have only demonstrated a 10% density advantage. In contrast, time-multiplexing of fine-grained FPGAs has demonstrated a 14x density improvement. This leaves an open question whether a time-multiplexed, coarse-grained FPGA can provide a similar density advantage. Even more important is whether the coarse-grained circuit structure can be exploited by Computer-Aided Design (CAD) tools to significantly reduce compile times.This thesis investigates a new type of FPGA in which coarse-grained, time-multiplexed resources are added to a traditional FPGA. Through time-multiplexing, density and compile time are improved. By retaining fine-grained logic and routing resources, performance does not suffer as much as in past attempts.This thesis also presents two CAD flows, M-CAD and M-HOT, to compile Verilog for this new FPGA. Both flows speed up compile times by more than 10x, which has not been demonstrated with any other flow (even flows that sacrifice quality). They can also achieve a circuit density greater than modern FPGAs, and can trade density for performance, something most FPGA CAD flows cannot do. At maximum density, M-HOT flow achieves a 26.1x compile time speedup, 2.5x the density, and 0.5x the performance of a commercial FPGA and CAD tool. At maximum performance, M-HOT achieves 1.0x density, and 0.7x performance.In contrast, M-CAD is a bit faster than M-HOT but achieves a lower quality result. In M-CAD, there are situations where the placer needs temporal information from the scheduler to make good decisions. Instead, M-HOT divides the circuit into heights to keep the integrated placement, routing, and scheduling problem tractable, but compile time suffers if there are too few heights. Although we show there is at most a theoretical 1.6x or 2.0x clock frequency improvement still remaining in M-HOT or M-CAD, respectively, the amount achievable may be far less. Future work should focus on improving the front-end synthesis, the coarse/fine-grained interface, and the coarse/fine-grained partitioning to provide higher quality input to the back-end CAD flow.

View record

Impact of Custom Interconnect Masks on Costs and Performance of Structured ASICs (2011)

As process technology scales, the design effort and Non-Recurring Engineering(NRE) costs associated with the development of Integrated Circuits (ICs) is becoming extremely high. One of the main reasons is the high cost of preparing and processing IC fabrication masks. The design effort and cost can be reduced by employing Structured Application Specific ICs (Structured ASICs). Structured ASICs are partially fabricated ICs that require only a subset of design-specific custom masks for their completion.In this dissertation, we investigate the impact of design-specific masks on the area, delay, power, and die-cost of Structured ASICs. We divide Structured ASICs into two categories depending on the types of masks (metal and/or via masks) needed for customization: Metal-Programmable Structured ASICs (MPSAs) that require custom metal and via masks; and Via-Programmable Structured ASICs (VPSAs) that only require via masks for customization. We define the metal layers used for routing that can be configured by one or more via, or metal-and-via masks as configurable layers. We then investigate the area, delay, power, and cost trends for MPSAs and VPSAs as a function of configurable layers.The results show that the number of custom masks has a significant impact ondie-cost. In small MPSAs (area 100 is achieved with three or four configurable layers. The lowest cost in VPSAs is also obtained with four configurable layers. The best delay and power in MPSAs is achieved with three or four configurable layers. In VPSAs, four configurable layers lead to lowest power and delay, except when logic blocks have a large layout area. MPSAs have up to 5x, 10x, and 3.5x better area, delay, and power than VPSAs respectively. However, VPSAs can be up to 50% less expensive than MPSAs. The results also demonstrate that VPSAs are very sensitive to the architecture of fixed-metal segments; VPSAs with different fixed-metal architectures can have a gap of up to 60% in area, 89% in delay, 85% in power, and 36% in die-cost.

View record

Recycling clock network energy in high-performance digital designs using on-chip DC-DC converters (2008)

Power consumption of CMOS digital logic designs has increased rapidly for the last several years. It has become an important issue, not only in battery-powered applications, but also in high-performance digital designs because of packaging and cooling requirements. At multi-GHz clock rates in use today, charging and discharging CMOS gates and wires, especially in clocks with their relatively large capacitances, leads to significant power consumption. Recovering and recycling the stored charge or energy about to be lost when these nodes are discharged to ground is a potentially good strategy that must be explored for use in future energy-efficient design methodologies.This dissertation investigates a number of novel clock energy recycling techniques to improve the overall power dissipation of high-performance logic circuits. If efficient recycling energy of the clock network can be demonstrated, it might be used in many high-performance chip designs, to lower power and save energy. A number of chip prototypes were designed and constructed to demonstrate that this energy can be successfully recycled or recovered in different ways:• Recycling clock network energy by supplying a secondary DC-DC power converter: the output of this power converter can be used to supply another region of the chip, thereby avoiding the need to draw additional energy from the primary supply. One test chip demonstrates energy in the final clock load can be recycled, while another demonstrates that clock distribution energy can be recycled.• Recovering clock network energy and returning it back to the power grid: each clock cycle, a portion of the energy just drawn from the supply is transferred back at the end of the cycle, effectively reducing the power consumption of the clock network. The recycling methods described in this thesis are able to preserve the more ideal square clock shape which has been a limitation of previous work in this area. Overall, the results provided in this thesis demonstrate that energy recycling is very promising and must be pursued in a number of other areas of the chip in order to obtain an energy-efficient design.

View record

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Design and implementation of RVV-Lite : a layered approach to the official RISC-V vector ISA (2023)

The open-source RISC-V Vector extension (RVV), whose specification was frozen in 2021, comprises over 400 instructions, with four integer and two floating-point data types. Its purpose is to accelerate applications in high-performance computing. Since it is the largest optional extension to the RISC-V ecosystem, it is prudent to assess the implementation cost of these instructions with respect to their value add, particularly in cost-sensitive embedded and domain-specific applications. Some instructions are never used by certain applications, while others create unique and potentially costly hardware requirements. They can instead be replaced by simpler instruction sequences.We propose RVV-Lite, a partitioning of RVV. This partitioning allows users to deploy a smaller implementation with a predefined subset of the instructions, which is often needed in embedded or domain-specific applications with limited area or power. The rationale behind the instruction groupings is explained, and implementation results are shown to help make informed choices about the cost of these incremental implementations. To simplify software management, we suggest reducing the number of possible configurations by subdividing primary instructions into 7 layers, while presenting 5 orthogonal extensions that can be added at any point. Instructions excluded from the primary and orthogonal instruction subsets are placed into a final layer of instructions, bridging the gap between RVV-Lite and RVV 1.0.With all layers and options, RVV-Lite consists of 280 instructions, resulting in the exclusion of 127 instructions from the Zve* embedded options proposed by the RVV specification. To demonstrate efficacy of this subset, we present cycle counts for common benchmarks and compare these, along with area and timing results, to other RISC-V Vector engines.

View record

Using mixed low-precision formats in multiply-accumulate (MAC) units for DNN training (2022)

Due to limited size, cost and power, embedded devices do not offer the same computational throughput as graphics processing units (GPUs) for training Deep Neural Networks (DNNs). The most compute-intensive stage of multilayer perceptron (MLP) and convolutional neural network (CNN) training is the general matrix multiply (GEMM) kernel which is executed three times per layer in each iteration: once for forward-propagation and twice for back-propagation. To reduce the number of operations, techniques such as distillation (to reduce model size) and pruning (to introduce sparsity) are commonly applied. This thesis considers another technique, where the computational effort of each operation is reduced using low-precision arithmetic.While the use of small data types is common in DNN inference, this is not yet common in DNN training. Previous work in the area is somewhat limited, sometimes only considering 16-bit floating-point formats or overlooking implementation details, such as the area and accuracy tradeoffs from exact digital multiplier designs.This thesis considers the use of mixed-precision operations (MPO) within the GEMM kernel for DNN training. To conduct these investigations, we have implemented a complete DNN training framework for embedded systems, Archimedes-MPO. Starting with the C++ library TinyDNN, we have abstracted each layer to use custom data types and accelerated the GEMM stage with CUDA and Vitis HLS to produce bit-accurate GPU and FPGA implementations. This framework allows us to exactly measure the impact of various multiplier and accumulator designs on area and accuracy.Compared to 32-bit floating-point multiplier, as few as 2 mantissa bits attain similar accuracy. Accuracy losses are reduced with adaptive loss scaling and the removal of hardware for rounding and not-a-number (NaN) representations. Removal of subnormals saves area as well, but hurts accuracy, so we propose a custom subnormal encoding as a compromise. For accumulation, 12-bit floating-point and 21-bit fixed-point formats work similarly. Fixed-point accumulation seems to have an area advantage, but the impact of a wider output data size can be costly on downstream logic. While precise results depend upon the model and dataset used, the observed trends and framework can help the design of future GEMM-based hardware accelerators for DNNs.

View record

Dynamic race detection for non-coherent accelerators (2020)

Modern System-on-Chip (SoC) designs are increasingly complex and heterogeneous, featuring specialized hardware accelerators and processor cores that access shared memory. Non-coherent accelerators are one common memory coherence model. The advantage of non-coherence in accelerators is that it cut costs on extra hardware that would have been used for memory coherence. However, the disadvantage is that the programmer must know where and when to synchronize between the accelerator and the processor. Getting this synchronization correct can be difficult.We propose a novel approach to find data races in software for non-coherent accelerators in heterogeneous systems. The intuition underlying our approach is that a sufficiently precise abstraction of the hardware's behaviour is straightforward to derive, based on common idioms for memory access. To demonstrate this, we derived simple rules and axioms to model the interaction between a processor and a massively parallel accelerator provided by a commercial FPGA-based accelerator vendor. We implemented these rules in a prototype tool for dynamic race detection, and performed experiments on eleven examples provided by the vendor. The tool is able to detect previously unknown races in two of the eleven examples, and additional races if we introduce bugs in the synchronization in the code.

View record

Real-time computer vision in software using custom vector overlays (2018)

Real-time computer vision places stringent performance requirements on embeddedsystems. Often, dedicated hardware is required. This is undesirable as hardwaredevelopment is time-consuming, requires extensive skill, and can be difficultto debug. This thesis presents three case studies of accelerating computer vision algorithmsusing a software-driven approach, where only the innermost computationis performed with dedicated hardware. As a baseline, the algorithms are initiallyrun on a scalar host processor. Next, the software is sped up using an existing vectoroverlay implemented in the FPGA fabric, manually rewriting the code to usevectors. Finally, the overlay is customized to accelerate the critical inner loops byadding hardware-assisted custom vector instructions. Collectively, the custom instructionsrequire very few lines of RTL code compared to what would be neededto implement the entire algorithm in dedicated hardware.This keeps design complexity low and yields a significant performance boost.For example, in one system, we measured a performance advantage of 2.4× to3.5× over previous state-of-the-art dedicated hardware systems while using farless custom hardware

View record

The DEVBOX development education platform : an environment for introducing Verilog to young students (2016)

Hardware description languages are considered to be challenging to learn, as are the logic design concepts required to effectively use them. University logic design courses are considered by many to be much more difficult than their software development counterparts and the subject is not generally addressed by pre-university curricula. Introducing hardware description earlier in a student’s career will improve their chances of success in future logic design courses. The barrier-to-entry faced in introducing hardware description, caused by complex development environments and the learning of fundamental logic design concepts, must be overcome in order to facilitate a self-directed and interactive educational environment for young students. DEVBOX, an embedded System-on-Chip programming education platform, simplifies the learning process with its bare-bones development tools and interactive instructional material. It has been designed to be a self-contained, installation-free educational device with a browser-based interface that is self explanatory and easy to use. By including Verilog, a popular hardware description language, alongside the software programming languages in its repertoire, DEVBOX caters to the needs of introductory and intermediate logic design students in a dynamic self-directed learning environment.

View record

Rapid Overlay Builder for Xilinx FPGAs (2014)

A field-programmable gate array (FPGA) is a type of programmable hardware, where a logic designer must create a specific hardware design and then "compile" it into a bitstream that "configures" the device for a specific function at power-up. This compiling process, known as place-and-route (PAR), can take hours or even days, a duration which discourages the use of FPGAs for solving compute-oriented problems. To help mitigate this and other problems, overlays are emerging as useful design patterns in solving compute-oriented problems. An overlay consists of a set of compiler-like tools and an architecture written in a hardware design language like VHDL or Verilog. This cleanly separates the compiling problem into two phases: at the front end, high-level language compilers can quickly map a compute task into the overlay architecture, which is now serving as an intermediate layer. Unfortunately, the back-end of the process, where an overlay architecture is compiled into an FPGA device, remains a very time-consuming task. Many attempts have been made to accelerate the PAR process, ranging from using multicore processors, making quality/runtime tradeoffs, and using hard macros, with limited success. We introduce a new hard-macro methodology, called Rapid Overlay Builder, and demonstrate a run-time improvement up to 22 times compared to a regular unaccelerated flow using Xilinx ISE. In addition, compared to prior work, ROB continues to work well even with high logic utilization levels of 89%, and it consistently maintains high clock rates. By applying this methodology, we anticipate that overlays can be implemented much more quickly and with lower area and speed overheads than would otherwise be possible. This will greatly improve the usability of FPGAs, allowing them to be used as a replacement for CPUs in a greater variety of applications.

View record

Coarse and fine grain programmable overlay architectures for FPGAs (2013)

Overlay architectures are programmable logic systems that are compiled on top of a traditional FPGA.These architectures give designers flexibility, and have a number of benefits, such as being designed or optimized for specific application domains, making it easier or more efficient to implement solutions, being independent of platform,allowing the ability to do partial reconfiguration regardless of the underlying architecture, and allowing compilation without using vendor tools,in some cases with fully open source tool chains.This thesis describes the implementation of two FPGA overlay architectures, ZUMA and CARBON.These overlay implementations include optimizations to reduce area and increase speed which may be applicable to many other FPGAs and also ASIC systems. ZUMA is a fine-grain overlay which resembles a modern commercial FPGA, and is compatible with the VTR open source compilation tools. The implementation includes a number of novel features tailored to efficient FPGA implementation, including the utilization of reprogrammable LUTRAMs, a novel two-stage local routing crossbar, and an area efficient configuration controller. CARBON is a coarse-grain, time-multiplexed architecture, that directly implements the coarse-grain portion of the MALIBU architecture. MALIBU is a hybrid fine-grain and coarse-grain FPGA architecture that can be built using the combination of both CARBON and ZUMA, but this thesis focuses on their individual implementations. Time-multiplexing in CARBON greatly reduces performance, so it is vital to be optimized for delay. To push the speed of CARBON beyond the normal bound predicted by static timing analysis tools, this thesis has applied the Razor dynamic timing error tolerance system inside CARBON. This can dynamically push the clock frequency yet maintain correct operation. This required developing an extension of the Razor system from its original 1D feed-forward pipeline to a 2D bidirectional pipeline.

View record

Accelerator Compiler for the VENICE Vector Processor (2012)

This thesis describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for the Microsoft Accelerator, a high-level data parallel library in C/C++ and C\#. This allows automatic compilation from high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to human-optimized VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 2.53GHz Intel Xeon E5540 processor, beating it in performance on four of six benchmarks by up to 3.2x. The compiler also delivers near-linear scaling performance on five of six benchmarks, which exceed scalability of the Multi-core target of Accelerator.

View record

Scalable and deterministic timing-driven parallel placement for FPGAs (2011)

This thesis describes a parallel implementation of the timing-driven VPR 5.0 simulated-annealing placement engine. By partitioning the grid into regions and allowing distant data to grow stale, it is possible to consider a large number of non-conflicting moves in parallel and achieve a deterministic result. The full timing-driven placement algorithm is parallelized, including swap evaluation, bounding-box calculation and the detailed timing-analysis updates. The partitioned region approach slightly degrades the placement quality, but this is necessary to expose greater parallelism. We also suggest a method to recover the lost quality.In simulated annealing, runtime can be shortened at the expense of quality. Using this method, the serial placer can achieve a maximum speedup of 100X while quality metrics degrades as much as 100%. In contrast, the parallel placer can scale beyond 500X with all quality metrics degrading by less than 30%. Specifically, at the point where the parallel placer begins to dominate over the serial placer, the post-routing minimum channel width, wirelength and critical-path delay degrades 13%, 10% and 7% respectively on average compared to VPR’s original algorithm,while achieving a 140X to 200X speedup 25 threads. Finally, it is shown that the amount of degradation in the parallel placer is independent of the number of threads used.

View record

VIPERS II: A soft-core vector processor with single-copy data scratchpad memory (2010)

Previous work has demonstrated soft-core vector processors in FPGAs can be applied to speed up data-parallel embedded applications, while providing the users an easy-to-use platform to tradeoff performance and area. However, its performance is limited by load and store latencies, requiring extra software design effort to optimize performance. This thesis presents VIPERS II, a new vector ISA and the corresponding microarchitecture, in which the vector processor reads and writes directly to a scratchpad memory instead of the vector register file. With this approach, the load and store operations and their inherent latencies can often be eliminated if the working set of data fits in the vector scratchpad memory. Moreover, with the removal of load/store latencies, the user doesn't have to use loop unrolling to enhance performance, reducing the amount of software effort required and making the vectorized code more compact. The thesis shows the new architecture has the potential to achieve performance similar to that of the unrolled versions of the benchmarks, without actually unrolling the loop. Hardware performance results of VIPERS II demonstrated up to 47x speedup over a Nios II processor with only 13x more resources used.

View record


If this is your researcher profile you can log in to the Faculty & Staff portal to update your details and provide recruitment preferences.


Learn about our faculties, research and more than 300 programs in our Graduate Viewbook!