Projects

Developed a programming model (Gmap) and runtime system (Gmachine) for scalable graph processing.
Created graph primitives, such as traversals and set intersections, with hardware support to accelerate graph processing.
Engineered dynamic object relocation for dynamic load balancing and contention mitigation.
Architected hardware-accelerated routing mechanisms that use TCAM support in routers for longest-prefix matching.
Created a global address space using object-based addressing rather than conventional byte addressing.
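
As a rough illustration of the longest-prefix matching mentioned above, the sketch below mimics in software what a TCAM resolves in a single hardware lookup. It is not the project's routing logic: the 8-bit addresses, the prefixes, and the port names are hypothetical.

```python
# Illustrative only: a software stand-in for TCAM longest-prefix matching.
# The prefixes, port names, and 8-bit address width are hypothetical.

def longest_prefix_match(table, address_bits):
    """Return the port of the longest prefix in `table` that matches."""
    best_prefix, best_port = None, None
    for prefix, port in table.items():
        if address_bits.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix, best_port = prefix, port
    return best_port

routing_table = {
    "": "default",   # the empty prefix matches every address
    "10": "north",
    "1011": "east",
    "0": "west",
}

print(longest_prefix_match(routing_table, "10110001"))  # east
print(longest_prefix_match(routing_table, "10000000"))  # north
print(longest_prefix_match(routing_table, "11000000"))  # default
```

A TCAM compares the address against all stored prefixes in parallel, whereas this loop scans them one at a time; only the matching rule is the same.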

Project Name: AGILE | Funding Agency: IARPA and U.S. Army: Competing as one of six performers in the country (among Intel, AMD, Qualcomm, etc.) to develop a simulation of a novel architecture for the world’s fastest supercomputer, using current enabling technology and targeting a speedup of about 200x over conventional supercomputers.
Developing a proof-of-concept behavioral emulator in Python and the Structural Simulation Toolkit, along with a cycle-accurate FPGA-based emulator, for an active memory architecture. The experiments target the architectural overheads prevalent in conventional general-purpose hardware, with the architecture specialized for dynamic graph processing in AI and ML, n-body simulations, and Adaptive Mesh Refinement.
Enhancing performance by reducing starvation, latency, overheads, and contention via a ParalleX-based execution model, hardware mechanisms for global namespace translation, adaptive routing and reordering in a message-based runtime system, and graph primitive operations.
Projected that an instance of CCA would yield a 600x improvement in peak performance, a 300x increase in memory bandwidth, and a 95% reduction in physical footprint compared to Sunway TaihuLight.

High Performance Computing (OpenMP, MPI, C/C++); Master’s Thesis

Reduced time to solution by 90% with a message-driven runtime system (like Charm++) or Graph500, based on a comparative analysis of scaling results for graph processing algorithms such as single-source shortest path (SSSP).
Cut graph processing execution time by 39% by creating a parallel variant of Dijkstra’s algorithm for shared-memory processors using OpenMP, MPI, and the Parallel Boost Graph Library on a 100 GB graph.
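
For reference, a minimal sequential sketch of Dijkstra's SSSP, the kind of baseline a parallel variant is measured against. It is written in Python for brevity and is not the thesis implementation, which used OpenMP, MPI, and the Parallel Boost Graph Library; the tiny example graph is hypothetical.

```python
import heapq

def dijkstra(graph, source):
    """Sequential single-source shortest path over non-negative edge weights.

    `graph` maps each vertex to a list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry: a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical toy graph, far smaller than the 100 GB input described above.
graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 5)],
    "d": [],
}
distances = dijkstra(graph, "a")
print(distances["d"])  # 4, via a -> c -> b -> d
```

The priority queue makes the vertex order inherently serial, which is why parallel variants typically relax that ordering or partition the graph across workers.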

Natural Language Processing (C++11, Python, NLTK, SciKit, Alexa Skills Kit, SpaCy, coreNLP, Stardog, Neo4J)

Developed a speech-aided, NLP-based artificial intelligence bot with an end-to-end response time of ~300 ms. The bot stores information from simple English sentences in Stardog and Neo4J and answers questions about the stored information using keyword search.
Built a genre detection tool that classifies text (as sci-fi, history, physics, art, etc.) with 91% accuracy, using TF-IDF for topic modelling and k-nearest neighbors (KNN) for classification. Other techniques, including dependency parsing, bigram models, deep learning constructs like CNNs and RNNs, and an ensemble approach with multiple weak voting classifiers, were also tried but achieved lower accuracy.
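
The TF-IDF plus KNN pipeline can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the tool itself: the four training sentences, the whitespace tokenizer, the smoothed IDF formula, and the helper names are all hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training term."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return {term: math.log(len(docs) / count) + 1.0 for term, count in df.items()}

def vectorize(text, idf):
    """Map text to a sparse TF-IDF vector, ignoring terms unseen in training."""
    tokens = [t for t in text.lower().split() if t in idf]
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(text, train_vectors, labels, idf, k=3):
    """Majority vote among the k training vectors most similar to `text`."""
    query = vectorize(text, idf)
    ranked = sorted(zip(train_vectors, labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data; a real corpus would be far larger.
docs = [
    "the starship crew explored a distant alien galaxy",
    "robots and aliens fought aboard the starship",
    "the empire fell after a long medieval war",
    "the king signed a treaty ending the war",
]
labels = ["sci-fi", "sci-fi", "history", "history"]
idf = fit_idf(docs)
train_vectors = [vectorize(d, idf) for d in docs]

print(knn_predict("aliens attacked the starship", train_vectors, labels, idf))  # sci-fi
print(knn_predict("the king lost the war", train_vectors, labels, idf))         # history
```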

Advanced Operating Systems: Embedded OS Development in C (C, XINU, LINUX)

Implemented virtual memory and a lightweight file system to enhance the security and reliability of the memory management unit in the XINU operating system on an embedded SoC (BeagleBone Black), which is used in handheld gaming consoles and IoT devices.
Engineered process synchronization mechanisms using semaphores, promises, and futures.
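
To illustrate the two synchronization ideas named above, the sketch below uses Python's standard library rather than the XINU kernel code written in C. A counting semaphore caps how many workers run a critical section at once, and futures carry each worker's result back to the caller; the worker function and the limit of 2 are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

slots = threading.Semaphore(2)   # at most 2 workers inside the guarded region
lock = threading.Lock()          # protects the counters below
active = 0
peak = 0

def worker(n):
    """Square n while holding one of the semaphore's slots."""
    global active, peak
    with slots:
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # simulate work inside the guarded region
        with lock:
            active -= 1
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(worker, n) for n in range(6)]
    # result() blocks until the corresponding worker has finished
    results = [f.result() for f in futures]

print(results)     # [0, 1, 4, 9, 16, 25]
print(peak <= 2)   # True: the semaphore never admitted more than 2 workers
```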