Projects
Arxiv Sanity XL - Distributed Search Engine
Tech Stack: Java, Apache Zookeeper, FastAPI, PgVector, Postgres, Redis, Celery, Protobuf
- An event-driven, highly available distributed document search engine that uses a leader election algorithm to elect cluster leaders for efficient workload distribution.
- Implemented TF-IDF and BM25 ranking, generated document-chunk embeddings, and used PgVector ANN retrieval to fetch relevant arXiv papers for given keywords across a corpus of 30K+ research papers.
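The BM25 ranking named above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant (with the `+ 1.0` inside the logarithm) are all assumptions, and documents are assumed to be pre-tokenized lists of terms.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant); always positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["graph", "neural", "network"],
    ["quantum", "computing"],
    ["neural", "machine", "translation"],
]
scores = bm25_scores(["quantum"], docs)
```

Only the second document contains the query term, so it is the only one with a non-zero score.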
CodETL
Tech Stack: PySpark, Flask, Celery, Postgres, AWS S3, AWS EC2, Apache Airflow
- Spearheaded the development of a platform-agnostic ETL engine based on a standardized transformation map to spawn workers supporting 25+ transformations on the landing and staging layers.
- Created and integrated topologically sorted schedules to maximize parallel processing, orchestrating the end-to-end ETL process on Airflow.
- Recognized by Deloitte with an Outstanding award for reducing development effort by 40% compared with existing static SQL-based solutions.
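One way a topologically sorted schedule can expose parallelism is to group tasks into batches whose dependencies are already satisfied. The sketch below uses Kahn's algorithm for this. It is an illustration under assumptions, not the CodETL implementation: the function name `parallel_batches`, the dependency-map shape, and the task names are all hypothetical.

```python
from collections import defaultdict


def parallel_batches(deps):
    """Group tasks into batches that can run in parallel.

    `deps` maps each task to the set of tasks it depends on.
    Every task in a batch depends only on tasks from earlier batches.
    """
    indegree = {task: len(parents) for task, parents in deps.items()}
    dependents = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            dependents[parent].append(task)
            indegree.setdefault(parent, 0)  # parent listed only as a dependency

    ready = sorted(t for t, n in indegree.items() if n == 0)
    batches, done = [], 0
    while ready:
        batches.append(ready)
        done += len(ready)
        next_ready = []
        for task in ready:
            for child in dependents[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    if done != len(indegree):
        raise ValueError("dependency cycle detected")
    return batches


# Hypothetical ETL tasks: two extracts feed a staging load, then a report.
batches = parallel_batches({
    "extract_a": set(),
    "extract_b": set(),
    "load_staging": {"extract_a", "extract_b"},
    "report": {"load_staging"},
})
```

Here the two extract tasks land in the first batch and can run concurrently, while the staging load and the report each wait for the batch before them.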
DataLens
Tech Stack: Python, LangChain, LLM
- Developed an LLM-powered search engine and Q&A portal for structured and unstructured data, using Retrieval-Augmented Generation over a Postgres database and documents.
Implementation of Genetic Algorithm for Path Traversal
Tech Stack: JavaScript, p5.js
- Developed an interactive tool for creating obstacles and finding paths that avoid them in a 2D environment using a Genetic Algorithm, inspired by Dan Shiffman’s book “The Nature of Code.”
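The genetic-algorithm approach to path finding can be sketched as below. The project itself was written in JavaScript with p5.js, so this Python version is only an illustration under assumptions: the grid moves, the Manhattan-distance fitness, truncation selection, single-point crossover, elitism, and all names and parameter values are hypothetical choices, not the original design.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down on a grid


def walk(genes, start, obstacles):
    """Follow the moves until they run out or the next cell is an obstacle."""
    x, y = start
    for dx, dy in genes:
        nxt = (x + dx, y + dy)
        if nxt in obstacles:
            break  # blocked: the path ends here
        x, y = nxt
    return x, y


def fitness(genes, start, target, obstacles):
    """Higher is better: negative Manhattan distance from end point to target."""
    x, y = walk(genes, start, obstacles)
    return -(abs(target[0] - x) + abs(target[1] - y))


def evolve(start, target, obstacles, pop_size=50, path_len=20,
           generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(MOVES) for _ in range(path_len)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, start, target, obstacles), reverse=True)
        history.append(fitness(pop[0], start, target, obstacles))
        parents = pop[: pop_size // 2]  # truncation selection: keep the top half
        children = [pop[0]]             # elitism: the best path survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, path_len)
            child = a[:cut] + b[cut:]   # single-point crossover
            child = [rng.choice(MOVES) if rng.random() < mutation_rate else m
                     for m in child]    # per-gene mutation
            children.append(child)
        pop = children
    best = max(pop, key=lambda g: fitness(g, start, target, obstacles))
    return best, history


obstacles = {(2, 2), (3, 3)}
best, history = evolve(start=(0, 0), target=(5, 5), obstacles=obstacles)
```

Because the best path of each generation is copied into the next one unchanged, the best fitness recorded in `history` never decreases from one generation to the next.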