Project deliverables (details)

2025 COMP4601 Projects (Release: 6/06/2025)

COMP4601 projects aim to provide experience at accelerating compute-intensive software using programmable logic. Project groups choose a typical algorithm to investigate, profile its performance in software, and employ techniques acquired during their studies to reduce the execution time by partitioning the computation between a processor and tightly-coupled programmable logic. Challenges include: maintaining focus on a well-defined compute-bound process, efficiently exchanging data between processor and logic, and managing the complexity of unfamiliar hardware platforms and software tools with tight deadlines.

Outline

In 2025, students are expected to form teams of 4 people to accelerate the kernel of a software task implemented on the Kria KV260 board. Teams will profile the task running as software, partition the task into components that could be accelerated, implement these in programmable logic using HLS, and measure overall improvements in performance and energy consumption of the task. An initial investigation is to be conducted in the period leading to a presentation of the project plan during Week 7. Final project reports are to be provided by the end of Week 10. A personal reflection is to be written and delivered at the start of Week 11.

Please choose a team representative to email Oliver with the names of the students working together and the project you are planning to work on by Wednesday 18 June.

Teams are permitted to work on any sufficiently well defined and focussed algorithm that promises to show speedup or power reduction through acceleration and that can be studied within the time allowed. The project list below is intended to act as a prompt or guide, but should not limit your imagination. Creativity and initiative in choosing your own project and focus in completing the milestones suggested in the flow outlined below are encouraged and rewarded. Please discuss your project ideas and progress with course staff regularly to ensure you obtain guidance when needed and stay on track with your development. This is particularly important while you are choosing a project to ensure that you do not attempt a problem that is too ambitious.

Please use the linked QR code to register for a Kria board prior to picking a board up from the School Admin Office in K17-111D. All students may borrow a board. Please remember to return these when you complete the course in Week 11.

Suggested project development flow

  1. Organize your team and develop your project plan without delay! Time is of the essence.
  2. a. Write or obtain C/C++ code that encapsulates the compute-intensive task you are aiming to accelerate; verify its correctness.
     b. Get the code running on the Kria's processing system (PS); pursue this goal in parallel, as it will demand significant research and testing.
  3. a. Profile your code on the development platform you used to write it, in order to identify the "hotspots": the compute-intensive kernels/loop nests you will accelerate.
     b. Profile the code on the Kria's PS.
  4. a. Partition the code so that the kernels are contained within functions that lend themselves directly to acceleration using HLS.
     b. Do this for the Kria.
  5. a. Obtain baseline performance and utilization data for the kernels. The emphasis at this stage is on correctness, not performance.
     b. Do this for the Kria, then implement the baseline (unaccelerated) kernels as IP cores in programmable logic. Take care to use a sensible communications approach (AXI-Lite for control, AXI-Stream for FIFO data coming from the code executing on the ARM processors, or DMA for large transfers from off-chip DRAM); the I/O strategy you choose may significantly affect performance due to bandwidth constraints. Discuss your intended approach with the demonstrators and start work on this aspect early, because it will take time to get right.
  6. a. Decide upon and implement a strategy for transforming the kernel code into a high-performing core using HLS; report on changes in performance and utilization as you progress your strategy.
     b. Do so on the Kria, updating your communications strategy as required.
  7. a. Report on the techniques used to improve the performance of the core using HLS, and compute the resulting overall improvement in performance for the code obtained in step 2a.
     b. Do so for the Kria and explain differences between the calculated improvement and the actual improvement in performance.
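The overall-improvement calculation asked for in the final step follows Amdahl's law: if the profiled kernel accounts for fraction p of total runtime and the HLS core speeds that kernel up by a factor s, the whole task speeds up by 1/((1 - p) + p/s). A quick sketch (the 90%/10x figures below are illustrative, not course targets):

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: speedup of the whole task when only the profiled
    kernel (kernel_fraction of total runtime) is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Example: profiling shows the kernel is 90% of runtime and HLS gives 10x.
print(round(overall_speedup(0.9, 10.0), 2))  # 5.26
```

Note how quickly the unaccelerated 10% dominates: this is why the partitioning step matters as much as the HLS optimisation itself.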

Deliverables/reporting requirements

The project deliverables are listed in detail on a separate page.

In outline, the following reports and presentations are expected: a presentation of the project plan (Week 7), the final project report (end of Week 10), and a personal reflection (start of Week 11).

Project list

  1. Accelerate Convolutional Neural Network

    CNNs provide state-of-the-art performance on image recognition tasks, at the cost of a very high computational load. The CNN algorithm involves a large number of convolutions and matrix multiplications that can be executed in parallel. On desktop platforms, GPUs and ASICs such as Google's TPU accelerate CNNs by executing these operations in parallel. On the Zynq devices found on the Kria/ZedBoard, similar acceleration is possible by configuring an acceleration circuit in the programmable logic.

    In this project, you'll develop FPGA cores to accelerate the execution of CNNs on Zynq devices. The speedup can be measured by the hardware-accelerated design's latency and/or throughput compared with a software-only implementation on the processing system (PS).
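    As a starting point, the 2D convolution at the heart of a CNN layer is just a nest of multiply-accumulates, which is exactly what HLS can unroll and pipeline. A minimal pure-Python reference (no framework assumed; the function name and valid-mode choice are ours):

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly cross-correlation, as in most
    CNN frameworks): slide the kernel over the image and accumulate."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (iw - kw + 1) for _ in range(ih - kh + 1)]
    for r in range(ih - kh + 1):
        for c in range(iw - kw + 1):
            acc = 0.0
            # This MAC loop nest is the natural target for unrolling in HLS.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out
```

    Every output pixel is independent of the others, which is the parallelism an accelerator exploits.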

    CNN background:

    Opportunities of acceleration in a CNN:

    Steps to approach this task:

    1. Study the CNN algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design directly from the C code of the PS implementation.

  2. Accelerated Histogram of Oriented Gradients

    In traditional (non-deep-learning) machine learning techniques, image features extracted from the input images replace the raw images for more efficient learning and inference. As this example demonstrates, Histogram of Oriented Gradients (HoG) descriptors coupled with an SVM can be very powerful in image classification tasks. In this project, you'll focus on accelerating the computation of HoG descriptors. For simplicity, consider only 8-bin HoG in this project. (https://medium.com/@basu369victor/handwritten-digits-recognition-d3d383431845)

    You'll use your HoG implementation to compute the HoG descriptors for a batch of images. The speedup can be measured by the hardware-accelerated design's latency and/or throughput compared with a software-only implementation on the PS.
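    The inner computation being accelerated is per pixel: take x/y gradients, convert them to magnitude and orientation, and vote the magnitude into one of 8 orientation bins per cell. A rough sketch of the binning step (the unsigned 0-to-180-degree convention is a common choice, not mandated here; real HoG also adds bilinear vote interpolation and block normalisation):

```python
import math

def hog_cell_histogram(gx, gy, nbins=8):
    """Accumulate gradient magnitudes into orientation bins for one cell.
    gx, gy: equal-length lists of x/y gradients for the cell's pixels."""
    hist = [0.0] * nbins
    for dx, dy in zip(gx, gy):
        mag = math.hypot(dx, dy)
        # Unsigned orientation, folded into [0, 180) degrees.
        ang = math.degrees(math.atan2(dy, dx)) % 180.0
        hist[int(ang / (180.0 / nbins)) % nbins] += mag
    return hist
```

    Each pixel's vote is independent, so the magnitude/orientation arithmetic pipelines well; the histogram accumulation is the part that needs care in hardware.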

    HoG algorithm background:

    Steps to approach this task:

    1. Study the HoG algorithm carefully. Take note of parallel operations.
    2. Develop a software implementation. Test it thoroughly before continuing.
    3. Profile the implementation to identify performance bottlenecks.
    4. Ideally, you should be able to create your HLS design directly from the C code of the PS implementation.

    Here's a test set of 10,000 images from the MNIST dataset. Junning has converted them into bitmap format, so they should be ready to use with no further conversion needed.

  3. Bitcoin Miner

    For this project you will implement a basic hardware accelerated Bitcoin miner. The Bitcoin mining algorithm is defined as:

    ```python
    def mine_bitcoin(header):
        # Loop until we find a good nonce, run out of time, or are killed.
        while not timeout:
            # Increment the nonce (a garbage field we are allowed to set
            # to influence the output hash).
            header.nonce += 1
            # Stop when we find our golden nonce!
            if sha256d(header) < header.target:
                break
        return header

    def sha256d(header):
        return sha256(sha256(header))
    ```
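    The `sha256(sha256(...))` above is available directly from Python's standard library, which gives you a golden software reference to check an HLS core against (the little-endian byte-order handling of a real Bitcoin header is omitted in this sketch):

```python
import hashlib

def sha256d(data: bytes) -> bytes:
    """Double SHA-256, the hash at the core of Bitcoin mining."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def meets_target(header: bytes, target: int) -> bool:
    """A header is 'mined' when its double hash, read as an integer,
    falls below the difficulty target."""
    return int.from_bytes(sha256d(header), "big") < target
```

    Comparing this reference against the core's output for random 80-byte headers is a cheap and thorough verification strategy.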

    NOTES:

    Alternative hash cores:
    Three extra hash functions are suitable substitutes for SHA256D; when integrated into the miner, they will mine a hypothetical coin based on those hash functions.

    We chose hash functions that produce 256-bit digests rather than 512-bit, as this allows more exploration of unroll factors on the relatively resource-constrained Kria/ZedBoard.

    We will implement these with the assumption that the input size is fixed at 80 bytes (640 bits), the size of the Bitcoin block header.

  4. Blake3 implementation and evaluation

    This function is quite new: it was proposed on 10 January 2020 by the team behind BLAKE2, whose predecessor BLAKE was a finalist in the SHA-3 competition (Keccak won); BLAKE2 is still widely used today as an alternative to SHA-3. See the attached specification.

    Tony Wu: "Due to it being so new, very few public implementations exist for FPGAs, if any. I believe this is doable within the COMP4601 project timeframe, especially with application of HLS."

    Students can read more about it and find a reference implementation written in Rust: https://github.com/BLAKE3-team/BLAKE3/
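    BLAKE3 itself is not in Python's standard library, but its predecessor BLAKE2 is (`hashlib.blake2s`/`blake2b`), which is handy for getting familiar with the family's API, including its built-in keyed (MAC) mode, while the HLS core is under development:

```python
import hashlib

# BLAKE2s: 256-bit digest, the same default output width as BLAKE3.
digest = hashlib.blake2s(b"hello world").hexdigest()

# The family supports keyed hashing out of the box (no HMAC wrapper needed).
mac = hashlib.blake2s(b"hello world", key=b"secret").hexdigest()
```

    Test vectors for BLAKE3 itself should come from the reference Rust implementation linked above.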

  5. Encryption/decryption

    This is an important application for FPGAs because the data width can be matched and the computation unrolled and pipelined to gain a good speedup over a CPU. Try designing and implementing a general solution that can process text as it is streamed through the FPGA, potentially as keys change. Choose from AES, DES, RSA, ...

    Tony, who is a crypto expert on FPGAs, suggests the following:

    AES Accelerator

    Design an AES accelerator that operates in ECB mode.

    NOTE: Please read up on the difference between ECB and CBC mode.
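    The ECB-vs-CBC difference is easy to see with a toy block cipher (a keyed XOR here, purely illustrative and in no way secure): ECB encrypts equal plaintext blocks to equal ciphertext blocks, while CBC chains in the previous ciphertext so they differ. The chaining is also why CBC encryption is harder to parallelise in hardware than ECB.

```python
def toy_encrypt(block: bytes, key: bytes) -> bytes:
    # Stand-in for AES's block operation; XOR is NOT a real cipher.
    return bytes(b ^ k for b, k in zip(block, key))

def ecb(blocks, key):
    # Each block is encrypted independently: fully parallelisable.
    return [toy_encrypt(b, key) for b in blocks]

def cbc(blocks, key, iv):
    # Each block is XORed with the previous ciphertext before encryption,
    # creating a serial dependence between blocks.
    out, prev = [], iv
    for b in blocks:
        c = toy_encrypt(bytes(x ^ y for x, y in zip(b, prev)), key)
        out.append(c)
        prev = c
    return out

blocks = [b"AAAA", b"AAAA"]              # two identical plaintext blocks
e = ecb(blocks, b"key!")
c = cbc(blocks, b"key!", iv=b"\x01\x02\x03\x04")
```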

    SHA3 Accelerator

    Design a SHA3 accelerator that can take an arbitrary input size.

    Milestones will be similar to AES Accelerator.
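    "Arbitrary input size" is the crux here: unlike the fixed 80-byte Bitcoin header, a SHA3 core must absorb the message in rate-sized chunks (136 bytes for SHA3-256) with Keccak padding at the end. Python's `hashlib` provides the reference behaviour, including incremental updates, which mirrors how data would arrive over an AXI stream:

```python
import hashlib

# One-shot and incremental hashing must agree, whatever the input length.
msg = b"a" * 1000                      # an arbitrary-length input
one_shot = hashlib.sha3_256(msg).hexdigest()

h = hashlib.sha3_256()
for i in range(0, len(msg), 136):      # 136 bytes = the SHA3-256 rate
    h.update(msg[i:i + 136])
streamed = h.hexdigest()
```

    Checking your core against `hashlib` for lengths around the rate boundary (135, 136, 137 bytes, ...) exercises the padding logic, which is where bugs usually hide.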

  6. Text/audio/image compression/decompression

    Decide on medium and algorithm; focus on either compression or decompression and perform the opposite operation in software for verification purposes. Given the limitations on time, stick to problems that use simple data structures!
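    Run-length encoding is about the simplest instance of this pattern: accelerate one direction and keep the inverse in software for verification, as suggested above. A toy sketch (not a proposed final algorithm):

```python
def rle_encode(data: bytes) -> list:
    """Compress a byte string into [count, byte] runs."""
    runs = []
    for b in data:
        if runs and runs[-1][1] == b:
            runs[-1][0] += 1
        else:
            runs.append([1, b])
    return runs

def rle_decode(runs: list) -> bytes:
    """Software inverse, used to verify the accelerated encoder."""
    return b"".join(bytes([b]) * n for n, b in runs)

sample = b"aaaabbbcca"
assert rle_decode(rle_encode(sample)) == sample   # round-trip check
```

    The round-trip check is the verification strategy in miniature: the decoder never needs to be fast, only correct.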

  7. Machine learning

    Opportunities abound in this HOT area. Consider accelerating k-means clustering, k-nearest neighbours, a support vector machine, a Naive Bayes classifier, or a neural net.
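    Of these, k-means is a good fit because its hot loop, assigning each point to its nearest centroid, is embarrassingly parallel, while the centroid update is a cheap reduction. A sketch of one iteration (1-D points for brevity):

```python
def assign_clusters(points, centroids):
    """For each point, the index of the nearest centroid. This distance
    loop dominates k-means runtime and parallelises well in hardware."""
    return [min(range(len(centroids)),
                key=lambda j: (p - centroids[j]) ** 2)
            for p in points]

def update_centroids(points, labels, k):
    """Recompute each centroid as the mean of its assigned points."""
    return [sum(p for p, l in zip(points, labels) if l == j) /
            max(1, sum(1 for l in labels if l == j))
            for j in range(k)]
```

    A natural partition puts `assign_clusters` in the PL and keeps the update and convergence test on the PS.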

  8. Real-valued Matrix operations at scale

    At the heart of many machine learning, inference, big data analytics and scientific applications are matrix-matrix and matrix-vector multiplications. In this project I would suggest parameterizing the datatypes and matrix/vector sizes. Compared with a baseline algorithm using float data, how do speed, utilization and accuracy compare as the problem size scales?
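    A quick way to explore the datatype axis before committing to HLS arbitrary-precision types is to emulate fixed-point arithmetic in software and compare it against the float baseline (the Q-format scale below is an illustrative choice, not a recommendation):

```python
def matvec(A, x):
    """Reference float matrix-vector multiply."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_fixed(A, x, frac_bits=8):
    """The same product with values quantised to a fixed-point grid,
    emulating an ap_fixed-style HLS datatype in software."""
    s = 1 << frac_bits
    q = lambda v: round(v * s)                    # quantise to integers
    Aq = [[q(a) for a in row] for row in A]
    xq = [q(v) for v in x]
    # Integer MACs, then one rescale back to the real domain.
    return [sum(a * b for a, b in zip(row, xq)) / (s * s) for row in Aq]
```

    Sweeping `frac_bits` and the matrix size, and plotting maximum error against DSP/LUT utilization, gives exactly the scaling comparison the project asks for.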

  9. Pattern matching

    String and regular-expression matching has frequently been targeted for FPGA acceleration.
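    One reason matching maps well to FPGAs is that it can be expressed bit-parallel: the classic Shift-And algorithm keeps one state bit per pattern position and updates all of them with a shift and an AND per input character, exactly the wide bitwise datapath programmable logic excels at. A software sketch:

```python
def shift_and_search(text: str, pattern: str):
    """Bit-parallel exact matching: returns start indices of matches."""
    m = len(pattern)
    # Precompute a bitmask per character: bit i set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state, hits = 0, []
    for pos, c in enumerate(text):
        # One shift and one AND update all m partial-match states at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(pos - m + 1)
    return hits
```

    In hardware the m-bit state register and mask lookup become a one-cycle update, giving one character per clock regardless of pattern length (up to the register width).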

  10. Roll your own project

    Propose a project to Oliver... the problem should have the potential to be sped up using FPGA hardware.

    Potential applications for acceleration include: