Guy Lemieux

Professor

Relevant Degree Programs

 

Graduate Student Supervision

Doctoral Student Supervision (Jan 2008 - May 2019)
Automated space/time scaling of streaming task graphs on field-programmable gate arrays (2019)

Parallel computing platforms provide good performance for streaming applications within a limited power budget. However, these platforms can be difficult to program. Moreover, when the size of the computing platform target changes, users must manually reallocate resources and parallelism. This thesis provides a framework to retarget applications described by a Streaming Task Graph (STG) for implementation on different platforms, where the framework can automatically scale the solution size to fit available resource or performance targets. First, we explore automated space/time scaling for STGs targeting a pipelined coarse-grained architecture. We produce a tool that analyzes the degrees of parallelism in a general stream application and finds possible bottlenecks. We introduce possible optimization strategies for STGs, and two algorithmic approaches: a classical approach based upon Integer Linear Programming (ILP), and a novel heuristic approach which introduces a new optimization and produces better results (using 30% less area) with a shorter run-time. Second, we explore automated space/time scaling for STGs targeting a fine-grained architecture (Field-Programmable Gate Array (FPGA)). We propose a framework on top of a commercial High-Level Synthesis (HLS) tool which adds the ability to automatically meet a defined area budget or target throughput. Within the framework, we use similar ILP and heuristic approaches. The heuristic automatically achieves over 95% of the target area budget on average while improving throughput over the ILP. It can also meet the same throughput targets as the ILP while saving 19% area.Third, we investigate automated space/time scaling of STGs in a hybrid architecture consisting of a Soft Vector Processor (SVP) and select custom instructions. To achieve higher performance, we investigate using dynamic Partial Reconfiguration (PR) by time-sharing the FPGA resources. The performance results achieve speedups far beyond what a plain SVP can accomplish: an 8-lane SVP achieves a speedup of 5.3 on the Canny-blur application, whereas the PR version is another 3.5 times faster, with a net speedup of 18.5 over the ARM Cortex A9 processor. By increasing the dynamic PR rate beyond what is available today, we also show that a further 5.7 times speedup can be achieved (105.9x speedup versus the Cortex A9).

View record

Architecture of block-RAM-based massively parallel memory structures : multi-ported memories and content-addressable memories (2016)

Since they were first introduced three decades ago, Field-Programmable Gate Arrays (FPGAs) have evolved from being merely used as glue-logic to implementing entire compute accelerators. These massively parallel systems demand highly parallel memory structures to keep pace with their concurrent nature since memories are usually the bottleneck of computation performance. However, the vast majority of FPGA devices provide dual-ported SRAM blocks only. In this dissertation, we propose new ways to build area-efficient, high-performance SRAM-based parallel memory structures in FPGAs, specifically Multi-Ported Random Access Memory and Content-Addressable Memory (CAM). While parallel computation demands more RAM ports, leading Multi-Ported Random Access Memory techniques in FPGAs have relatively large overhead in resource usage. As a result, we have produced new design techniques that are near-optimal in resource overhead and have several practical advantages. The suggested method reduces RAM usage by over 44% and improves clock speed by over 76% compared to the best of previous approaches. Furthermore, we propose a novel switched-ports technique that allows further area reduction if some RAM ports are not simultaneously active. A memory compiler is proposed to generalize the previous approach and allow generating Multi-Switched-Ports Random Access Memory. Content-Addressable Memories (CAMs), the hardware implementation of associative arrays, are capable of searching the entire memory space for a specific value within a single clock cycle. CAMs are massively parallel search engines accessing all memory content to compare with the searched pattern simultaneously. CAMs are used in a variety of scientific fields requiring high-speed associative searches. Despite their importance, FPGAs lack an area-efficient CAM implementation. We propose a series of scalable, area-efficient, and high-performance Binary Content-Addressable Memories (BCAMs) based on hierarchical search and data compression methods. Compared to current RAM-based BCAM architectures, our BCAMs require a maximum of 18% the RAM storage while enhancing clock speed by 45% on average, hence exhibiting a superior single-cycle search rate. As a result, we can build faster and more cost-effective accelerators to solve some of the most important computational problems.

View record

CAD algorithms and performance of Malibu : an FPGA with time-multiplexed coarse-grained elements (2011)

Modern Field-Programmable Gate Arrays (FPGAs) are used to implement a wide range of ever-larger circuits, many of which have both coarse-grained and fine-grained components. Past research into coarse-grained FPGAs optimized for such circuits have only demonstrated a 10% density advantage. In contrast, time-multiplexing of fine-grained FPGAs has demonstrated a 14x density improvement. This leaves an open question whether a time-multiplexed, coarse-grained FPGA can provide a similar density advantage. Even more important is whether the coarse-grained circuit structure can be exploited by Computer-Aided Design (CAD) tools to significantly reduce compile times.This thesis investigates a new type of FPGA in which coarse-grained, time-multiplexed resources are added to a traditional FPGA. Through time-multiplexing, density and compile time are improved. By retaining fine-grained logic and routing resources, performance does not suffer as much as in past attempts.This thesis also presents two CAD flows, M-CAD and M-HOT, to compile Verilog for this new FPGA. Both flows speed up compile times by more than 10x, which has not been demonstrated with any other flow (even flows that sacrifice quality). They can also achieve a circuit density greater than modern FPGAs, and can trade density for performance, something most FPGA CAD flows cannot do. At maximum density, M-HOT flow achieves a 26.1x compile time speedup, 2.5x the density, and 0.5x the performance of a commercial FPGA and CAD tool. At maximum performance, M-HOT achieves 1.0x density, and 0.7x performance.In contrast, M-CAD is a bit faster than M-HOT but achieves a lower quality result. In M-CAD, there are situations where the placer needs temporal information from the scheduler to make good decisions. Instead, M-HOT divides the circuit into heights to keep the integrated placement, routing, and scheduling problem tractable, but compile time suffers if there are too few heights. Although we show there is at most a theoretical 1.6x or 2.0x clock frequency improvement still remaining in M-HOT or M-CAD, respectively, the amount achievable may be far less. Future work should focus on improving the front-end synthesis, the coarse/fine-grained interface, and the coarse/fine-grained partitioning to provide higher quality input to the back-end CAD flow.

View record

Impact of Custom Interconnect Masks on Costs and Performance of Structured ASICs (2011)

No abstract available.

Recycling clock network energy in high-performance digital designs using on-chip DC-DC converters (2008)

Power consumption of CMOS digital logic designs has increased rapidly for the last several years. It has become an important issue, not only in battery-powered applications, but also in high-performance digital designs because of packaging and cooling requirements. At multi-GHz clock rates in use today, charging and discharging CMOS gates and wires, especially in clocks with their relatively large capacitances, leads to significant power consumption. Recovering and recycling the stored charge or energy about to be lost when these nodes are discharged to ground is a potentially good strategy that must be explored for use in future energy-efficient design methodologies.This dissertation investigates a number of novel clock energy recycling techniques to improve the overall power dissipation of high-performance logic circuits. If efficient recycling energy of the clock network can be demonstrated, it might be used in many high-performance chip designs, to lower power and save energy. A number of chip prototypes were designed and constructed to demonstrate that this energy can be successfully recycled or recovered in different ways:• Recycling clock network energy by supplying a secondary DC-DC power converter: the output of this power converter can be used to supply another region of the chip, thereby avoiding the need to draw additional energy from the primary supply. One test chip demonstrates energy in the final clock load can be recycled, while another demonstrates that clock distribution energy can be recycled.• Recovering clock network energy and returning it back to the power grid: each clock cycle, a portion of the energy just drawn from the supply is transferred back at the end of the cycle, effectively reducing the power consumption of the clock network. The recycling methods described in this thesis are able to preserve the more ideal square clock shape which has been a limitation of previous work in this area. Overall, the results provided in this thesis demonstrate that energy recycling is very promising and must be pursued in a number of other areas of the chip in order to obtain an energy-efficient design.

View record

Master's Student Supervision (2010 - 2018)
Real-time computer vision in software using custom vector overlays (2018)

Real-time computer vision places stringent performance requirements on embeddedsystems. Often, dedicated hardware is required. This is undesirable as hardwaredevelopment is time-consuming, requires extensive skill, and can be difficultto debug. This thesis presents three case studies of accelerating computer vision algorithmsusing a software-driven approach, where only the innermost computationis performed with dedicated hardware. As a baseline, the algorithms are initiallyrun on a scalar host processor. Next, the software is sped up using an existing vectoroverlay implemented in the FPGA fabric, manually rewriting the code to usevectors. Finally, the overlay is customized to accelerate the critical inner loops byadding hardware-assisted custom vector instructions. Collectively, the custom instructionsrequire very few lines of RTL code compared to what would be neededto implement the entire algorithm in dedicated hardware.This keeps design complexity low and yields a significant performance boost.For example, in one system, we measured a performance advantage of 2.4× to3.5× over previous state-of-the-art dedicated hardware systems while using farless custom hardware

View record

The DEVBOX development education platform : an environment for introducing Verilog to young students (2016)

Hardware description languages are considered to be challenging to learn, as are the logic design concepts required to effectively use them. University logic design courses are considered by many to be much more difficult than their software development counterparts and the subject is not generally addressed by pre-university curricula. Introducing hardware description earlier in a student’s career will improve their chances of success in future logic design courses. The barrier-to-entry faced in introducing hardware description, caused by complex development environments and the learning of fundamental logic design concepts, must be overcome in order to facilitate a self-directed and interactive educational environment for young students. DEVBOX, an embedded System-on-Chip programming education platform, simplifies the learning process with its bare-bones development tools and interactive instructional material. It has been designed to be a self-contained, installation-free educational device with a browser-based interface that is self explanatory and easy to use. By including Verilog, a popular hardware description language, alongside the software programming languages in its repertoire, DEVBOX caters to the needs of introductory and intermediate logic design students in a dynamic self-directed learning environment.

View record

Rapid Overlay Builder for Xilinx FPGAs (2014)

A field-programmable gate array (FPGA) is a type of programmable hardware, where a logic designer must create a specific hardware design and then "compile" it into a bitstream that "configures" the device for a specific function at power-up. This compiling process, known as place-and-route (PAR), can take hours or even days, a duration which discourages the use of FPGAs for solving compute-oriented problems. To help mitigate this and other problems, overlays are emerging as useful design patterns in solving compute-oriented problems. An overlay consists of a set of compiler-like tools and an architecture written in a hardware design language like VHDL or Verilog. This cleanly separates the compiling problem into two phases: at the front end, high-level language compilers can quickly map a compute task into the overlay architecture, which is now serving as an intermediate layer. Unfortunately, the back-end of the process, where an overlay architecture is compiled into an FPGA device, remains a very time-consuming task. Many attempts have been made to accelerate the PAR process, ranging from using multicore processors, making quality/runtime tradeoffs, and using hard macros, with limited success. We introduce a new hard-macro methodology, called Rapid Overlay Builder, and demonstrate a run-time improvement up to 22 times compared to a regular unaccelerated flow using Xilinx ISE. In addition, compared to prior work, ROB continues to work well even with high logic utilization levels of 89%, and it consistently maintains high clock rates. By applying this methodology, we anticipate that overlays can be implemented much more quickly and with lower area and speed overheads than would otherwise be possible. This will greatly improve the usability of FPGAs, allowing them to be used as a replacement for CPUs in a greater variety of applications.

View record

Coarse and fine grain programmable overlay architectures for FPGAs (2013)

Overlay architectures are programmable logic systems that are compiled on top of a traditional FPGA.These architectures give designers flexibility, and have a number of benefits, such as being designed or optimized for specific application domains, making it easier or more efficient to implement solutions, being independent of platform,allowing the ability to do partial reconfiguration regardless of the underlying architecture, and allowing compilation without using vendor tools,in some cases with fully open source tool chains.This thesis describes the implementation of two FPGA overlay architectures, ZUMA and CARBON.These overlay implementations include optimizations to reduce area and increase speed which may be applicable to many other FPGAs and also ASIC systems. ZUMA is a fine-grain overlay which resembles a modern commercial FPGA, and is compatible with the VTR open source compilation tools. The implementation includes a number of novel features tailored to efficient FPGA implementation, including the utilization of reprogrammable LUTRAMs, a novel two-stage local routing crossbar, and an area efficient configuration controller. CARBON is a coarse-grain, time-multiplexed architecture, that directly implements the coarse-grain portion of the MALIBU architecture. MALIBU is a hybrid fine-grain and coarse-grain FPGA architecture that can be built using the combination of both CARBON and ZUMA, but this thesis focuses on their individual implementations. Time-multiplexing in CARBON greatly reduces performance, so it is vital to be optimized for delay. To push the speed of CARBON beyond the normal bound predicted by static timing analysis tools, this thesis has applied the Razor dynamic timing error tolerance system inside CARBON. This can dynamically push the clock frequency yet maintain correct operation. This required developing an extension of the Razor system from its original 1D feed-forward pipeline to a 2D bidirectional pipeline.

View record

Accelerator compiler for the VENICE vector processor (2012)

This thesis describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for the Microsoft Accelerator, a high-level data parallel library in C/C++ and C\#. This allows automatic compilation from high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to human-optimized VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 2.53GHz Intel Xeon E5540 processor, beating it in performance on four of six benchmarks by up to 3.2x. The compiler also delivers near-linear scaling performance on five of six benchmarks, which exceed scalability of the Multi-core target of Accelerator.

View record

Scalable and deterministic timing-driven parallel placement for FPGAs (2011)

This thesis describes a parallel implementation of the timing-driven VPR 5.0 simulated-annealing placement engine. By partitioning the grid into regions and allowing distant data to grow stale, it is possible to consider a large number of non-conflicting moves in parallel and achieve a deterministic result. The full timing-driven placement algorithm is parallelized, including swap evaluation, bounding-box calculation and the detailed timing-analysis updates. The partitioned region approach slightly degrades the placement quality, but this is necessary to expose greater parallelism. We also suggest a method to recover the lost quality.In simulated annealing, runtime can be shortened at the expense of quality. Using this method, the serial placer can achieve a maximum speedup of 100X while quality metrics degrades as much as 100%. In contrast, the parallel placer can scale beyond 500X with all quality metrics degrading by less than 30%. Specifically, at the point where the parallel placer begins to dominate over the serial placer, the post-routing minimum channel width, wirelength and critical-path delay degrades 13%, 10% and 7% respectively on average compared to VPR’s original algorithm,while achieving a 140X to 200X speedup 25 threads. Finally, it is shown that the amount of degradation in the parallel placer is independent of the number of threads used.

View record

VIPERS II: A soft-core vector processor with single-copy data scratchpad memory (2010)

No abstract available.

 
 

If this is your researcher profile you can log in to the Faculty & Staff portal to update your details and provide recruitment preferences.