Update: I graduated from MIT in June 2021 and have joined Waymo as a software engineer.
Hi, I’m Maleen. I am interested in novel software/hardware techniques to accelerate computationally challenging applications. I build C++ simulators to validate architectural ideas, write RTL to estimate complexity, and build FPGA solutions to accelerate real-world applications.
At MIT, I work with Daniel Sanchez on the Swarm project, where we investigate how to use speculative execution to accelerate certain hard-to-parallelize applications such as shortest path computations and discrete event simulations. My most recent project, Chronos, shows that these ideas can be used to build FPGA accelerators that run 5× faster than 40-threaded CPUs.
Before starting grad school, I did my undergrad at the University of Moratuwa in Sri Lanka. For my thesis, I was part of a 4-person team that designed and built an FPGA accelerator for HEVC/H.265 video decoding. This work led to the founding of Paraqum Technologies, which provides FPGA-based acceleration solutions to clients in the Asia-Pacific region.
In addition, I did consulting work for Wave Computing, helping write new benchmarks to validate their CGRA architecture, and I have also interned at Microsoft Research’s Brainwave project, a deep-learning accelerator for FPGAs.
I expect to graduate in summer 2021 and am looking for full-time industry roles.
PhD in Computer Science, Expected June 2021
MIT
BSc in Electronic and Telecommunication Engineering, 2013
University of Moratuwa, Sri Lanka
Chronos is a framework to build accelerators for applications with speculative parallelism. These applications consist of atomic tasks, sometimes with order constraints, and need speculative execution to extract parallelism. We demonstrate Chronos’s feasibility and benefits by building FPGA accelerators for four such applications. When run on commodity AWS FPGA instances, these accelerators outperform state-of-the-art software versions running on a higher-priced multicore instance by 3.5× to 15.3×.
This work studies the interplay between multithreaded cores and speculative parallelism. These techniques are often used together, yet they have been developed independently, causing major performance pathologies. This paper presents SAM, a simple instruction issue policy that addresses these pathologies by focusing execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently. On a system with 64 SMT cores of 8 threads each, SAM reduces wasted work by 52%.
Fractal is a new execution model that supports unordered and timestamp-ordered nested parallelism. Fractal lets programmers seamlessly compose speculative parallel algorithms, and lets the architecture exploit parallelism at all levels. Our approach sidesteps the issues of nested parallel HTMs and uncovers abundant fine-grain parallelism. As a result, Fractal outperforms prior speculative architectures by up to 88× at 256 cores.
This work presents spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. We show it is easy to modify programs to convey locality through hints. We design simple hardware techniques that allow a state-of-the-art, tiled speculative architecture to exploit hints. Hints make speculative parallelism practical on large-scale systems: at 256 cores, hints achieve near-linear scalability on nine challenging applications, improving performance over hint-oblivious scheduling by 3.3× gmean and by up to 16×.
This work presents an FPGA-based hardware implementation of a real-time 4K 30 fps HEVC decoder. Achieving such high performance at a low 150 MHz clock frequency required many architectural novelties, such as exploiting the sparsity of the transformed coefficient matrix, single-cycle reference pixel processing in intra prediction, and flexible 8 × 8 block ordering in DBF/SAO.