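A general-purpose SDK for Chinese large language models. It systematically improves interface adaptation, response parsing, and batch processing, and integrates closely with LLM application frameworks in the OpenAI ecosystem such as LangChain, LlamaIndex, and AutoGen. It can also be deployed as an Agent Skill in various AI coding tools.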
Updated May 23, 2026 - Python
Learn GenAI and Agentic AI from Zero to Production
Advanced RAG Pipelines and Evaluation
Advanced RAG pipeline optimization framework using DSPy. Implements modular RAG pipelines with Query-Rewriting, Sub-Query Decomposition, and Hybrid Search via Weaviate. Automates prompt tuning and few-shot selection using GEPA, SIMBA, MIPRO, COPRO, and BootstrapFewShot optimizers on datasets like FreshQA, HotpotQA, TriviaQA, Wikipedia and PubMedQA.
🚀 Production-ready modular RAG monorepo: Local LLM inference (vLLM) • Hybrid retrieval with Qdrant • Semantic caching • Docling document parsing • Cross-encoder reranking • DeepEval evaluation • Full observability with Langfuse • Open WebUI chat interface • OpenAI-compatible API • Fully Dockerized
BNS-LexAI is an AI-powered legal information and case understanding assistant.
pytest lab for testing LLMs: RAG eval, red teaming, guardrails, drift monitoring — 14 modules, 382 tests, zero API calls needed
A hands-on exploration of DeepEval, an open-source framework for evaluating and red-teaming large language models (LLMs). This repository documents my journey of testing, benchmarking, and improving LLM reliability using custom prompts, metrics, and pipelines.
Drop in deal documents → get an onboarding plan, draft invoice, and stakeholder summary. Multi-agent LangGraph pipeline with RAG, human approval, and self-correcting retries.
Automatically discover where and why your LLM is failing — embedding-space clustering + statistical hypothesis testing to surface input slices with elevated failure rates and audit test suite coverage gaps.
Production-grade LLM evaluation pipeline for RAG chatbot — DeepEval + RAGAS + Garak + CI/CD | Financial domain | 7 metrics | Adversarial testing
LLM evaluation framework for medical chatbot — DeepEval quality + RAG metrics + hallucination detection + red teaming | pytest CI/CD | LLaMA 3.1 8B | Groq
A research project to measure AI agent robustness. Contains automated testing pipelines and a benchmarking methodology developed to audit Agentic AI architectures for complex reasoning flaws.
[UNDER DEVELOPMENT] Clinical-RAG is a production-grade, citation-backed AI system designed to bridge the "Trust Gap" in medical information retrieval.
Production RAG evaluation pipeline: RAGAS (faithfulness · context recall · answer relevancy) + DeepEval (hallucination scoring). Lambda-triggered on KB updates. Regression gating blocks deployment at >5% metric drop.
STRICTLY DO NOT DELETE NOR UNARCHIVE - Personal Project - LLM validation platform
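Several entries above mention gating a deployment on evaluation scores, for example blocking a release when a metric drops by more than 5%. A minimal sketch of that idea follows. It assumes the 5% is a drop relative to a stored baseline, and the function name, metric names, and scores are hypothetical illustrations, not taken from any listed repository.

```python
def regressed_metrics(baseline, current, max_drop=0.05):
    """Return metrics whose score fell by more than max_drop relative to baseline.

    Assumption: "drop" means (baseline - current) / baseline. A metric missing
    from `current` is treated as a score of 0.0, so it counts as a regression.
    """
    failures = {}
    for name, base in baseline.items():
        if base <= 0:
            # A relative drop is undefined for a non-positive baseline; skip it.
            continue
        drop = (base - current.get(name, 0.0)) / base
        if drop > max_drop:
            failures[name] = drop
    return failures


# Hypothetical scores from a previous run and a candidate run.
baseline = {"faithfulness": 0.90, "context_recall": 0.80, "answer_relevancy": 0.85}
current = {"faithfulness": 0.84, "context_recall": 0.79, "answer_relevancy": 0.86}

failures = regressed_metrics(baseline, current)
deploy_allowed = not failures  # block the deployment if anything regressed
```

In this example only faithfulness regresses beyond the threshold, so the gate would block the deployment. A CI job could exit with a non-zero status whenever `deploy_allowed` is false.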