Project deliverables (details)

2025 COMP4601 Projects (Release: 6/06/2025)

COMP4601 projects aim to provide experience at accelerating compute-intensive software using programmable logic. Project groups choose a typical algorithm to investigate, profile its performance in software, and employ techniques acquired during their studies to reduce the execution time by partitioning the computation between a processor and tightly-coupled programmable logic. Challenges include: maintaining focus on a well-defined compute-bound process, efficiently exchanging data between processor and logic, and managing the complexity of unfamiliar hardware platforms and software tools with tight deadlines.

Outline

In 2025, students are expected to form teams of 4 people to accelerate the kernel of a software task implemented on the Kria KV260 board. Teams will profile the task running as software, partition the task into components that could be accelerated, implement these in programmable logic using HLS, and measure overall improvements in performance and energy consumption of the task. An initial investigation is to be conducted in the period leading to a presentation of the project plan during Week 7. Final project reports are to be provided by the end of Week 10. A personal reflection is to be written and delivered at the start of Week 11.

Please choose a team representative to email Oliver with the names of the students working together and the project you are planning to work on by Wednesday 18 June.

Teams are permitted to work on any sufficiently well defined and focussed algorithm that promises to show speedup or power reduction through acceleration and that can be studied within the time allowed. The project list below is intended to act as a prompt or guide, but should not limit your imagination. Creativity and initiative in choosing your own project and focus in completing the milestones suggested in the flow outlined below are encouraged and rewarded. Please discuss your project ideas and progress with course staff regularly to ensure you obtain guidance when needed and stay on track with your development. This is particularly important while you are choosing a project to ensure that you do not attempt a problem that is too ambitious.

Please use the linked QR code to register for a Kria board prior to picking a board up from the School Admin Office in K17-111D. All students may borrow a board. Please remember to return these when you complete the course in Week 11.

Suggested project development flow

  1. Organize your team and develop your project plan without delay! Time is of the essence.
  2. a. Write or obtain C/C++ code that encapsulates the compute-intensive task you are aiming to accelerate; verify its correctness.
     b. Get the code running on the Kria's processing system (PS); pursue this goal in parallel, as it will demand significant research and testing.
  3. a. Profile your code on the development platform you used to write it, in order to identify the "hotspots": the compute-intensive kernels/loop nests you will accelerate.
     b. Profile the code on the Kria's PS.
  4. a. Partition the code so that the kernels are contained within functions that lend themselves directly to acceleration using HLS.
     b. Do this for the Kria.
  5. a. Obtain baseline performance and utilization data for the kernels. The emphasis at this stage is on correctness, not performance.
     b. Do this for the Kria, then implement the baseline (unaccelerated) kernels as IP cores in programmable logic. Take care to use a sensible communications approach (AXI-Lite for control, AXI-Stream for FIFO data coming from the code executing on the ARM processors, or DMA for large transfers from off-chip DRAM); the I/O strategy you choose may significantly affect performance due to bandwidth constraints. Discuss your intended approach with the demonstrators and start work on this aspect early, because it will take time to get right.
  6. a. Decide upon and implement a strategy for transforming the kernel code into a high-performing core using HLS; report on changes in performance and utilization as you progress your strategy.
     b. Do so on the Kria, updating your communications strategy as required.
  7. a. Report on the techniques used to improve the performance of the core using HLS, and compute the resulting overall improvement in performance for the code obtained in step 2a.
     b. Do so for the Kria and explain differences between the calculated improvement and the actual improvement in performance.
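The overall-improvement calculation asked for in the final step follows Amdahl's law: if the profiled kernel accounts for fraction p of total runtime and the HLS core speeds that kernel up by a factor s, the whole task speeds up by 1/((1 - p) + p/s). A quick sketch (the 90%/10x figures below are illustrative, not course targets):

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: speedup of the whole task when only the profiled
    kernel (kernel_fraction of total runtime) is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Example: profiling shows the kernel is 90% of runtime and HLS gives 10x.
print(round(overall_speedup(0.9, 10.0), 2))  # 5.26
```

Note how quickly the unaccelerated 10% dominates: this is why the partitioning step matters as much as the HLS optimisation itself.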

Deliverables/reporting requirements

The project deliverables are listed in detail on a separate page.

In outline, the following reports and presentations are expected: a presentation of the project plan (Week 7), the final project report (end of Week 10), and a personal reflection (start of Week 11).

Project list

  1. Accelerate Convolutional Neural Network

    CNNs provide state-of-the-art performance on image recognition tasks, at the cost of a very high computational load. The CNN algorithm involves a large number of convolutions and matrix multiplications that can be executed in parallel. On desktop platforms, GPUs and ASICs such as Google's TPU accelerate CNNs by executing these operations in parallel. On the Zynq devices found on the Kria/ZedBoard, similar acceleration is possible by configuring an acceleration circuit in the programmable logic.

    In this project, you'll develop FPGA cores to accelerate the execution of CNNs on Zynq devices. The speedup can be measured by the hardware-accelerated design's latency and/or throughput compared with a software-only implementation on the processing system (PS).
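    As a starting point, the 2D convolution at the heart of a CNN layer is just a nest of multiply-accumulates, which is exactly what HLS can unroll and pipeline. A minimal pure-Python reference (no framework assumed; the function name and valid-mode choice are ours):

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly cross-correlation, as in most
    CNN frameworks): slide the kernel over the image and accumulate."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (iw - kw + 1) for _ in range(ih - kh + 1)]
    for r in range(ih - kh + 1):
        for c in range(iw - kw + 1):
            acc = 0.0
            # This MAC loop nest is the natural target for unrolling in HLS.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out
```

    Every output pixel is independent of the others, which is the parallelism an accelerator exploits.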

    CNN background:

    Opportunities of acceleration in a CNN:

    Steps to approach this task:

    1. Study the CNN algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design directly from the C code of the PS implementation.

  2. Accelerated Histogram of Oriented Gradients

    In traditional (non-deep-learning) machine learning techniques, image features extracted from the input images replace the raw images for more efficient learning and inference. As this example demonstrates, Histogram of Oriented Gradients (HoG) descriptors coupled with an SVM can be very powerful in image classification tasks. In this project, you'll focus on accelerating the computation of HoG descriptors. For simplicity, consider only 8-bin HoG in this project. (https://medium.com/@basu369victor/handwritten-digits-recognition-d3d383431845)

    You'll use your HoG implementation to compute the HoG descriptors for a batch of images. The speedup can be measured by the hardware-accelerated design's latency and/or throughput compared with a software-only implementation on the PS.
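    The inner computation being accelerated is per pixel: take x/y gradients, convert them to magnitude and orientation, and vote the magnitude into one of 8 orientation bins per cell. A rough sketch of the binning step (the unsigned 0-to-180-degree convention is a common choice, not mandated here; real HoG also adds bilinear vote interpolation and block normalisation):

```python
import math

def hog_cell_histogram(gx, gy, nbins=8):
    """Accumulate gradient magnitudes into orientation bins for one cell.
    gx, gy: equal-length lists of x/y gradients for the cell's pixels."""
    hist = [0.0] * nbins
    for dx, dy in zip(gx, gy):
        mag = math.hypot(dx, dy)
        # Unsigned orientation, folded into [0, 180) degrees.
        ang = math.degrees(math.atan2(dy, dx)) % 180.0
        hist[int(ang / (180.0 / nbins)) % nbins] += mag
    return hist
```

    Each pixel's vote is independent, so the magnitude/orientation arithmetic pipelines well; the histogram accumulation is the part that needs care in hardware.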

    HoG algorithm background:

    Steps to approach this task:

    1. Study the HoG algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design directly from the C code of the PS implementation.

    Here's a test set of 10,000 images from the MNIST dataset. Junning has converted them into bitmap format, so they should be ready to use with no further conversion needed.

  3. Bitcoin Miner

    For this project you will implement a basic hardware accelerated Bitcoin miner. The Bitcoin mining algorithm is defined as:

    ```python
    def mine_bitcoin(header):
        # Loop until we find a good nonce, run out of time, or are killed.
        while not timeout:
            # Increment the nonce (a garbage field we are allowed to set
            # to influence the output hash).
            header.nonce += 1
            # Stop when we find our golden nonce!
            if sha256d(header) < header.target:
                break
        return header

    def sha256d(header):
        return sha256(sha256(header))
    ```
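    The `sha256(sha256(...))` above is available directly from Python's standard library, which gives you a golden software reference to check an HLS core against (the little-endian byte-order handling of a real Bitcoin header is omitted in this sketch):

```python
import hashlib

def sha256d(data: bytes) -> bytes:
    """Double SHA-256, the hash at the core of Bitcoin mining."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def meets_target(header: bytes, target: int) -> bool:
    """A header is 'mined' when its double hash, read as an integer,
    falls below the difficulty target."""
    return int.from_bytes(sha256d(header), "big") < target
```

    Comparing this reference against the core's output for random 80-byte headers is a cheap and thorough verification strategy.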

    NOTES:

    Alternative hash cores:
    Three extra hash functions are suitable substitutes for SHA256D; when integrated into the miner, they will mine a hypothetical coin based on those hash functions.

    We chose hash functions that produce 256-bit digests rather than 512-bit, as this allows more exploration of unroll factors on the relatively resource-constrained Kria/ZedBoard.

    We will implement these with the assumption that the input size is fixed at 80 bytes (640 bits), the size of the Bitcoin block header.

  4. Blake3 implementation and evaluation

    This function is quite new: it was proposed on 10 January 2020 by the team behind BLAKE2, whose predecessor BLAKE was a finalist in the SHA-3 competition (Keccak won); BLAKE2 is still widely used today as an alternative to SHA-3. See the attached specification.

    Tony Wu: "Due to it being so new, very few public implementations exist for FPGAs, if any. I believe this is doable within the COMP4601 project timeframe, especially with application of HLS."

    Students can read more about it and find a reference implementation written in Rust: https://github.com/BLAKE3-team/BLAKE3/
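    BLAKE3 itself is not in Python's standard library, but its predecessor BLAKE2 is (`hashlib.blake2s`/`blake2b`), which is handy for getting familiar with the family's API, including its built-in keyed (MAC) mode, while the HLS core is under development:

```python
import hashlib

# BLAKE2s: 256-bit digest, the same default output width as BLAKE3.
digest = hashlib.blake2s(b"hello world").hexdigest()

# The family supports keyed hashing out of the box (no HMAC wrapper needed).
mac = hashlib.blake2s(b"hello world", key=b"secret").hexdigest()
```

    Test vectors for BLAKE3 itself should come from the reference Rust implementation linked above.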

  5. Encryption/decryption

    This is an important application for FPGAs because the data width can be matched and the computation unrolled and pipelined to gain a good speedup over a CPU. Try designing and implementing a general solution that can process text as it is streamed through the FPGA, potentially as keys change. Choose from AES, DES, RSA, ...

    Tony, who is a crypto expert on FPGAs, suggests the following:

    AES Accelerator

    Design an AES accelerator that operates in ECB mode.

    NOTE: Please read up on the difference between ECB and CBC mode.
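    The ECB-vs-CBC difference is easy to see with a toy block cipher (a keyed XOR here, purely illustrative and in no way secure): ECB encrypts equal plaintext blocks to equal ciphertext blocks, while CBC chains in the previous ciphertext so they differ. The chaining is also why CBC encryption is harder to parallelise in hardware than ECB.

```python
def toy_encrypt(block: bytes, key: bytes) -> bytes:
    # Stand-in for AES's block operation; XOR is NOT a real cipher.
    return bytes(b ^ k for b, k in zip(block, key))

def ecb(blocks, key):
    # Each block is encrypted independently: fully parallelisable.
    return [toy_encrypt(b, key) for b in blocks]

def cbc(blocks, key, iv):
    # Each block is XORed with the previous ciphertext before encryption,
    # creating a serial dependence between blocks.
    out, prev = [], iv
    for b in blocks:
        c = toy_encrypt(bytes(x ^ y for x, y in zip(b, prev)), key)
        out.append(c)
        prev = c
    return out

blocks = [b"AAAA", b"AAAA"]              # two identical plaintext blocks
e = ecb(blocks, b"key!")
c = cbc(blocks, b"key!", iv=b"\x01\x02\x03\x04")
```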

    SHA3 Accelerator

    Design a SHA3 accelerator that can take an arbitrary input size.

    Milestones will be similar to AES Accelerator.
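    "Arbitrary input size" is the crux here: unlike the fixed 80-byte Bitcoin header, a SHA3 core must absorb the message in rate-sized chunks (136 bytes for SHA3-256) with Keccak padding at the end. Python's `hashlib` provides the reference behaviour, including incremental updates, which mirrors how data would arrive over an AXI stream:

```python
import hashlib

# One-shot and incremental hashing must agree, whatever the input length.
msg = b"a" * 1000                      # an arbitrary-length input
one_shot = hashlib.sha3_256(msg).hexdigest()

h = hashlib.sha3_256()
for i in range(0, len(msg), 136):      # 136 bytes = the SHA3-256 rate
    h.update(msg[i:i + 136])
streamed = h.hexdigest()
```

    Checking your core against `hashlib` for lengths around the rate boundary (135, 136, 137 bytes, ...) exercises the padding logic, which is where bugs usually hide.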

  6. Text/audio/image compression/decompression

    Decide on medium and algorithm; focus on either compression or decompression and perform the opposite operation in software for verification purposes. Given the limitations on time, stick to problems that use simple data structures!
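    Run-length encoding is about the simplest instance of this pattern: accelerate one direction and keep the inverse in software for verification, as suggested above. A toy sketch (not a proposed final algorithm):

```python
def rle_encode(data: bytes) -> list:
    """Compress a byte string into [count, byte] runs."""
    runs = []
    for b in data:
        if runs and runs[-1][1] == b:
            runs[-1][0] += 1
        else:
            runs.append([1, b])
    return runs

def rle_decode(runs: list) -> bytes:
    """Software inverse, used to verify the accelerated encoder."""
    return b"".join(bytes([b]) * n for n, b in runs)

sample = b"aaaabbbcca"
assert rle_decode(rle_encode(sample)) == sample   # round-trip check
```

    The round-trip check is the verification strategy in miniature: the decoder never needs to be fast, only correct.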

  7. Machine learning

    Opportunities abound in this HOT area. Consider accelerating k-means clustering, k-nearest neighbours, a support vector machine, a Naive Bayes classifier, or a neural net.
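    Of these, k-means is a good fit because its hot loop, assigning each point to its nearest centroid, is embarrassingly parallel, while the centroid update is a cheap reduction. A sketch of one iteration (1-D points for brevity):

```python
def assign_clusters(points, centroids):
    """For each point, the index of the nearest centroid. This distance
    loop dominates k-means runtime and parallelises well in hardware."""
    return [min(range(len(centroids)),
                key=lambda j: (p - centroids[j]) ** 2)
            for p in points]

def update_centroids(points, labels, k):
    """Recompute each centroid as the mean of its assigned points."""
    return [sum(p for p, l in zip(points, labels) if l == j) /
            max(1, sum(1 for l in labels if l == j))
            for j in range(k)]
```

    A natural partition puts `assign_clusters` in the PL and keeps the update and convergence test on the PS.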

  8. Real-valued Matrix operations at scale

    At the heart of many machine learning, inference, big data analytics and scientific applications are matrix-matrix and matrix-vector multiplications. In this project I would suggest parameterizing the datatypes and matrix/vector sizes. Compared with a baseline algorithm using float data, how do speed, utilization and accuracy compare as the problem size scales?
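    A quick way to explore the datatype axis before committing to HLS arbitrary-precision types is to emulate fixed-point arithmetic in software and compare it against the float baseline (the Q-format scale below is an illustrative choice, not a recommendation):

```python
def matvec(A, x):
    """Reference float matrix-vector multiply."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_fixed(A, x, frac_bits=8):
    """The same product with values quantised to a fixed-point grid,
    emulating an ap_fixed-style HLS datatype in software."""
    s = 1 << frac_bits
    q = lambda v: round(v * s)                    # quantise to integers
    Aq = [[q(a) for a in row] for row in A]
    xq = [q(v) for v in x]
    # Integer MACs, then one rescale back to the real domain.
    return [sum(a * b for a, b in zip(row, xq)) / (s * s) for row in Aq]
```

    Sweeping `frac_bits` and the matrix size, and plotting maximum error against DSP/LUT utilization, gives exactly the scaling comparison the project asks for.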

  9. Pattern matching

    String and regular-expression matching has frequently been targeted for FPGA acceleration.
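    One reason matching maps well to FPGAs is that it can be expressed bit-parallel: the classic Shift-And algorithm keeps one state bit per pattern position and updates all of them with a shift and an AND per input character, exactly the wide bitwise datapath programmable logic excels at. A software sketch:

```python
def shift_and_search(text: str, pattern: str):
    """Bit-parallel exact matching: returns start indices of matches."""
    m = len(pattern)
    # Precompute a bitmask per character: bit i set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state, hits = 0, []
    for pos, c in enumerate(text):
        # One shift and one AND update all m partial-match states at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(pos - m + 1)
    return hits
```

    In hardware the m-bit state register and mask lookup become a one-cycle update, giving one character per clock regardless of pattern length (up to the register width).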

  10. Roll your own project

    Propose a project to Oliver... the problem should have the potential to be sped up using FPGA hardware.

    Potential applications for acceleration include: