PhD · Data Scientist · NLP Researcher

Building systems
at the edge of
language & data.

I design production ML pipelines and publish research on LLM uncertainty, multi-label classification, and NLP — bridging academic rigor with real-world deployment.


12+

Publications

Citations

PhD

Industrial & Systems Eng.

Models Shipped

About Me

I'm a Data Scientist at Cooper University Health Care and an independent researcher specializing in large language models, uncertainty quantification, and NLP for clinical and financial domains.

My PhD in Industrial and Systems Engineering and Master's in Finance from Paris Dauphine give me an unusual vantage point: I think about systems, uncertainty, and optimization simultaneously across both technical and economic layers.

In production I've built a hospital-policies RAG chatbot (ChromaDB + Azure OpenAI), a call-mining quality pipeline processing thousands of recordings monthly, and a suite of forecasting tools. In research I treat LLM stochasticity as signal, not noise.

Hajar Sakai

Selected Research

JMIR AI · 2026

LLMs for Healthcare Text Classification: A Systematic Review

Comprehensive systematic review mapping the landscape of large language model applications in clinical and healthcare text classification, synthesizing evidence across benchmarks, architectures, and deployment contexts.

Published

IISE 2026 · Conference

Bayesian Latent Variable Modeling of LLM Stochasticity

A novel framing that treats temperature-induced variance in LLM outputs as a structured latent signal rather than noise, integrated with conformal prediction for calibrated uncertainty bounds.

Forthcoming
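For readers unfamiliar with conformal prediction, the calibration step can be sketched as below. This is a generic split-conformal sketch, not the paper's method: the function names, the absolute-residual score, and the sample numbers are all assumptions made for illustration.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold from held-out calibration scores.

    Picks the ceil((n + 1) * (1 - alpha))-th smallest score, which gives
    at least 1 - alpha marginal coverage under exchangeability.
    Returns infinity when the calibration set is too small for that rank.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction (absolute-residual score)."""
    return (point_prediction - q, point_prediction + q)

# Hypothetical calibration residuals |y - y_hat| from a held-out set
calibration_scores = [0.5, 0.1, 0.3, 0.7, 0.2, 0.9, 0.4]
q = conformal_quantile(calibration_scores, alpha=0.25)
low, high = prediction_interval(2.0, q)
```

With seven calibration scores and `alpha=0.25`, the rank is the 6th smallest score, so the interval half-width is the 6th smallest residual.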

Working Paper

Lookahead Bias in Financial LLMs

Investigating temporal leakage in LLM-based financial forecasting: identifying conditions under which training data that postdates the backtest period leads to systematically optimistic backtests, and proposing evaluation protocols to surface hidden bias.

In Progress

Multi-venue · 2023–2025

QUAD-LLM-MLTC / KDH-MLTC / HAMLET Frameworks

A family of multi-label text classification frameworks combining ensemble LLM prompting, knowledge distillation, and hierarchical label structures — demonstrated on clinical and biomedical corpora.

12+ Papers

Industry Projects

Production · Healthcare

Policies RAG Chatbot

End-to-end retrieval-augmented generation system over 178 hospital policy documents. Features multi-intent routing (semantic / index / clinical-links paths), delta indexing, inline citations, and Azure OpenAI embeddings.

ChromaDB · Azure OpenAI · Streamlit · Python
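The delta indexing mentioned in the description can be sketched as re-embedding only the documents whose content changed since the last run. This is a minimal illustration, not the production code: `plan_delta`, the manifest layout (document id mapped to content hash), and the sample policy texts are all hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_delta(manifest: dict, docs: dict) -> dict:
    """Compare current documents against the stored manifest.

    manifest maps doc id -> hash recorded at the last indexing run;
    docs maps doc id -> current text. Returns the ids to add, update,
    and delete so that unchanged documents are never re-embedded.
    """
    current = {doc_id: content_hash(text) for doc_id, text in docs.items()}
    return {
        "add": sorted(d for d in current if d not in manifest),
        "update": sorted(
            d for d in current if d in manifest and manifest[d] != current[d]
        ),
        "delete": sorted(d for d in manifest if d not in current),
    }

# Hypothetical example: one policy edited, one added, one retired
old_docs = {"visitor_policy": "Visiting hours are 9-5.", "hand_hygiene": "Wash hands."}
manifest = {doc_id: content_hash(text) for doc_id, text in old_docs.items()}
new_docs = {"visitor_policy": "Visiting hours are 8-8.", "mask_policy": "Masks required."}
plan = plan_delta(manifest, new_docs)
```

Only the ids under `add` and `update` would need new embeddings, and the ids under `delete` would be removed from the vector store.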

Production · Analytics

Call Mining & Quality Pipeline

Automated pipeline processing thousands of monthly call recordings via Azure Blob Storage and OpenAI Whisper transcription. Stratified sampling and LLM scoring across 9 quality dimensions for the Patient Access Center.

Azure Blob · gpt-audio · LLM Scoring · Fabric

Finance · NLP

10-K Sentiment Analyzer

Technical assessment combining the Loughran-McDonald financial lexicon with LLM-based sentiment analysis on 10-K filings. Designed to surface tone shifts predictive of firm-level risk across reporting periods.

LM Lexicon · LLM · NLP · Python
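The lexicon half of this approach can be sketched as a net-tone score compared across filings. This is a rough illustration only: the word sets below are tiny stand-ins rather than the actual Loughran-McDonald lists, and `tone` and `tone_shift` are hypothetical helper names, not the project's code.

```python
import re

# Illustrative stand-in word lists; the real dictionary is far larger
POSITIVE = {"gain", "improve", "strong", "achieve", "profitable"}
NEGATIVE = {"loss", "impairment", "litigation", "decline", "adverse"}

def tone(text: str) -> float:
    """Net tone: (positive hits - negative hits) / total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

def tone_shift(prior_filing: str, current_filing: str) -> float:
    """Change in net tone between two reporting periods."""
    return tone(current_filing) - tone(prior_filing)

# Hypothetical snippets standing in for two consecutive filings
shift = tone_shift("Strong gain", "Adverse loss")
```

A negative `shift` means the later filing reads more negatively than the earlier one under this lexicon.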

ML · Forecasting

Provider Star Rating Model

End-to-end ML workflow for provider performance prediction and star rating at Cooper. Combines structured EHR features with NLP signals from patient feedback, with statistical significance testing throughout.

Scikit-learn · NLP · Statistics · Sentiment Analysis

Let's connect

Open to research
collaborations & roles.

I'm interested in academic positions, research collaborations, and senior ML/NLP industry roles at the intersection of language models and high-stakes domains.

Email · GitHub · LinkedIn · Google Scholar · Download CV
