19–30% of the workday is spent searching for information (McKinsey / IDC). That is exactly the market RAG addresses.

74% → 89% recall@10: the share of cases where the correct answer lands in the top results after adding reranking (Databricks)

+33–40% answer accuracy from cross-encoder reranking, at a cost of about 120 ms of added latency (RAG research)

40–50% of routine requests are resolved by support with sources in the answer, freeing up people (RAG support benchmark)

Industry solutions

What you can build with RAG

Manufacturing

LLM-wiki over technical regulations, GOST standards, equipment manuals and occupational safety, with an answer and a source citation. Engineers and operators finding the right procedure: from minutes to seconds, with no risk of working off an outdated version.

Retail and e-commerce

RAG over product cards, return policies and FAQs for customer support and content managers. Request handling and content prep: 40–50% of routine questions are resolved automatically with a source.

Logistics

Knowledge base on tariffs, routes, customs and warehouse regulations with grounded answers. Dispatcher and customer service answers on shipping rules without escalations to experts.

Finance and insurance

RAG over regulations, product terms and compliance documents, with a mandatory citation to the exact clause. Consulting and compliance checks: answers are verifiable and auditable, hybrid RAG reduces hallucinations.

Development and real estate

LLM-wiki over project documentation, contracts and property sales policies. Managers finding terms for a site and deal without going through legal and the project team.

IT and support

Internal assistant over runbooks, the incident base and internal regulations for support agents. Resolving a ticket: time to find the right process and escalation matrix drops from 8–10 minutes to seconds.

HR and onboarding

Corporate LLM-wiki on company policies, benefits and procedures for employees. Onboarding and HR questions: a new hire finds the answer themselves, load on the HR line drops.

Legal and compliance teams

RAG over the contract base, precedents and internal policies, with provenance for every answer. Drafting opinions and reviewing contracts: every answer is tied to a specific document and clause.

Capabilities

RAG capabilities

Sources (wiki, regulations, product data, databases) → Chunking + embeddings → Vector store (pgvector / Qdrant)
User query → Retrieval: top-k candidates → Reranking (cross-encoder): top-n → LLM with grounded context → Answer + link to source (provenance)

A linear diagram of the anti-hallucination loop. On the left, knowledge sources (wiki, regulations, product data, databases) → chunking and embeddings → vector store (pgvector / Qdrant). The user query goes into retrieval (top-k candidates) → cross-encoder reranking (top-n best) → the LLM gets only the selected context → an answer with a link to the source. The model answers ONLY from the retrieved context, not from its memory; the provenance arrow leads from the answer back to the source document. The loop is loosely coupled — each block (store, retrieval, reranker, LLM) is replaceable independently.
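The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production code: the names `Chunk`, `answer`, `toy_retrieve`, `toy_rerank` and `toy_generate` are hypothetical, and the toy stages use plain word overlap where a real deployment would use a vector store, a cross-encoder and an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str  # document identifier, kept for provenance


# Each stage is a plain callable, so any one can be replaced independently.
Retriever = Callable[[str, int], list[Chunk]]              # query, top_k -> candidates
Reranker = Callable[[str, list[Chunk], int], list[Chunk]]  # query, candidates, top_n -> best
Generator = Callable[[str, list[Chunk]], str]              # query, context -> answer text


def answer(query: str, retrieve: Retriever, rerank: Reranker,
           generate: Generator, top_k: int = 20, top_n: int = 3) -> dict:
    """Retrieve, rerank, then generate from the selected context only."""
    candidates = retrieve(query, top_k)
    context = rerank(query, candidates, top_n)
    if not context:
        # Nothing retrieved: refuse rather than let the model answer from memory.
        return {"answer": None, "sources": []}
    return {"answer": generate(query, context),
            "sources": [c.source for c in context]}


# Toy stand-ins (hypothetical) so the sketch runs without external services.
DOCS = [
    Chunk("Returns are accepted within 30 days of delivery.", "returns-policy.md"),
    Chunk("Forklift inspection is required before every shift.", "safety-manual.pdf"),
]


def _overlap(query: str, chunk: Chunk) -> int:
    words = set(chunk.text.lower().rstrip(".").split())
    return len(set(query.lower().split()) & words)


def toy_retrieve(query: str, top_k: int) -> list[Chunk]:
    return [c for c in DOCS if _overlap(query, c) > 0][:top_k]


def toy_rerank(query: str, candidates: list[Chunk], top_n: int) -> list[Chunk]:
    return sorted(candidates, key=lambda c: _overlap(query, c), reverse=True)[:top_n]


def toy_generate(query: str, context: list[Chunk]) -> str:
    return context[0].text  # a real generator would call an LLM with this context


result = answer("when are returns accepted", toy_retrieve, toy_rerank, toy_generate)
print(result)
```

Because `answer` only depends on the three callables, swapping the store, the reranker or the model means passing a different function, with no change to the loop itself.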

Retrieval over corporate sources

The model answers from your documents, wiki and databases, not from internet "memory". An employee gets the answer in seconds instead of losing part of the 19–30% of the workday that goes to searching.

Grounding and source citations

Every answer shows which document it came from — answers are verifiable, and hallucinations are cut off at the architecture level, not by coaxing the model.

Vector store (pgvector / Qdrant)

Semantic search across millions of fragments: finds the answer by meaning, not by word match. pgvector — when data is already in Postgres; Qdrant — for high-load search with filters.

Chunking and content preparation

Documents are split into meaningful chunks with metadata — the model gets "less but more precise" context, which directly raises relevance and lowers query cost.
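A simple way to picture chunking with metadata is a sliding window over words. The sketch below is a deliberate simplification: `chunk_document`, its parameters and the metadata fields are hypothetical, and splitting on fixed word counts stands in for the structure-aware splitting (by heading, clause or section) that "meaningful chunks" implies.

```python
def chunk_document(text: str, source: str, max_words: int = 50,
                   overlap: int = 10) -> list[dict]:
    """Split text into overlapping word windows, each tagged with metadata."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: list[dict] = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append({
            "text": " ".join(window),
            "source": source,        # which document the chunk came from
            "start_word": start,     # position, useful for linking back
            "index": len(chunks),
        })
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 120-word document so the window boundaries are easy to see.
text = " ".join(f"w{i}" for i in range(120))
chunks = chunk_document(text, "manual.pdf", max_words=50, overlap=10)
print(len(chunks), [c["start_word"] for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the price of some duplicated text in the index.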

Reranking (cross-encoder)

The second stage reorders candidates by actual relevance: recall@10 rises from 74% to 89% and answer accuracy by 33–40%, for about 120 ms of added latency. A high return for a small latency cost.
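How a reranker moves the metric can be shown on toy data. In this sketch `hit_rate_at_k` measures the share of queries whose correct document lands in the top k, matching the way recall@10 is described on this page. The lookup table `TOY_SCORES` is a hypothetical stand-in for a cross-encoder, and the numbers are illustrative only, unrelated to the 74% and 89% figures.

```python
from typing import Callable


def hit_rate_at_k(rankings: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    """Share of queries whose correct document appears in the top k results."""
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int) -> list[str]:
    """Reorder first-stage candidates by a (query, document) relevance score."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_n]


# Hypothetical scores; a real cross-encoder reads the query and passage together.
TOY_SCORES = {("q1", "d3"): 0.9, ("q1", "d1"): 0.2, ("q2", "d1"): 0.8}


def toy_score(query: str, doc: str) -> float:
    return TOY_SCORES.get((query, doc), 0.0)


first_stage = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d4", "d5"]}
gold = {"q1": "d3", "q2": "d1"}

reranked = {q: rerank(q, docs, toy_score, top_n=3) for q, docs in first_stage.items()}

before = hit_rate_at_k(first_stage, gold, k=2)
after = hit_rate_at_k(reranked, gold, k=2)
print(before, after)
```

Note that reranking can only promote documents the first stage already retrieved, which is why the first stage fetches more candidates (top-k) than the model finally sees (top-n).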

Cache of vetted answers

On top of RAG we keep a cache of pre-validated answers to frequent and critical questions: the system returns a ready answer, bypassing retrieval, which further reduces hallucinations. This is not llm-wiki: there, knowledge is compiled in advance into a vetted base and read without chunk search (the Sloy memory approach). RAG and llm-wiki are different layers and can be combined.

RAG for support and employees

40–50% of routine requests are resolved automatically with a source in the answer; the internal assistant cuts regulation lookup time from minutes to seconds.

A loosely coupled, detachable stack

Storage, retrieval and model are decoupled: swap the LLM or vector DB without rewriting everything. The solution moves easily between teams and contractors — no vendor lock-in.

Quality evaluation and anti-hallucination

precision@K, provenance coverage and hallucination rate metrics are built into the pipeline — answer quality is measured, not declared, and does not silently degrade after changes.
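The three metrics can be computed from a small evaluation set. The definitions below are working assumptions, since the exact formulas are not spelled out above: precision@K is the relevant share of the top K retrieved documents, provenance coverage is the share of answers carrying at least one source, and hallucination rate is the share of answers a reviewer or automated judge labeled as unsupported. The record fields `sources` and `supported` are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def provenance_coverage(answers: list[dict]) -> float:
    """Share of answers that cite at least one source."""
    if not answers:
        return 0.0
    return sum(1 for a in answers if a["sources"]) / len(answers)


def hallucination_rate(answers: list[dict]) -> float:
    """Share of answers labeled as not supported by their context."""
    if not answers:
        return 0.0
    return sum(1 for a in answers if not a["supported"]) / len(answers)


# Hypothetical evaluation records; "supported" comes from a reviewer or judge.
answers = [
    {"sources": ["policy.md"], "supported": True},
    {"sources": [], "supported": False},
    {"sources": ["runbook.md"], "supported": True},
    {"sources": ["faq.md"], "supported": False},
]

p_at_4 = precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=4)
coverage = provenance_coverage(answers)
halluc = hallucination_rate(answers)
print(p_at_4, coverage, halluc)
```

Running these on a fixed question set after every change to chunking, the index or the model is what lets a regression show up as a number rather than as a user complaint.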

Approach

How we implement RAG

Without modifying the core

We don't fork or patch the RAG core. RAG stays on the standard upgradable version — business logic goes into separate microservices alongside it, so platform updates don't break your customizations.

International standards, not homegrown hacks

Where a mature international solution exists, we use it instead of inventing our own protocol or platform. Before writing code, we study how the problem is already solved in the industry.

Transferability

The solution is loosely coupled and documented: it can be handed over between teams and contractors without rewriting. You are not tied to us.

AI compatibility

RAG in the AI stack

Grounding for any LLM

The RAG layer feeds verified context into the model (GPT, Claude, open-source) — grounding answers in your data no matter which LLM you use today or switch to tomorrow.

Integration with MCP / the context layer

We connect the corporate knowledge base to agents via MCP as a standard source: RAG covers "what we know," MCP covers "how the agent retrieves it." Both layers are portable.

Operating behind the LLM & Security Gateway

Retrieval and model calls pass through a gateway: model routing, budgets, observability and PII obfuscation before sending — corporate knowledge does not leak out.
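As an illustration of the obfuscation step, the sketch below masks e-mail addresses and phone numbers before a prompt leaves the perimeter. It is a simplified assumption about how such a step might look: the `obfuscate` function and both patterns are hypothetical, and regular expressions alone miss many kinds of PII (names, addresses, IDs), so a real gateway would need more robust detection.

```python
import re

# Deliberately simple patterns; they favor readability over full coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def obfuscate(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


prompt = "Contact anna@example.com or +1 (555) 010-9999 about ticket 42."
masked = obfuscate(prompt)
print(masked)
```

Short numbers such as the ticket ID are left intact because the phone pattern requires at least ten characters, which keeps the context useful to the model.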

Foundation for AI agents

Agents that serve users and enter data rely on RAG as the source of truth — this turns a "chatty" assistant into a tool that answers with facts.

Integration with the Sloy platform

Sloy is corporate memory: knowledge is compiled in advance into an llm-wiki and read without retrieval ("No RAG"). RAG connects to Sloy as a second layer, for fresh and rare facts absent from the vetted base. Grounding and provenance work across multiple agents and scenarios.

News

What's new in RAG

2026-07-16
NVIDIA released Nemotron 3 Embed - open embedding models, No. 1 on RTEB
The 8B model scored 78.5% on RTEB and 75.5% on MMTEB Retrieval - first place overall. The 1B variant reaches 72.4% on RTEB (27% fewer errors than the previous 1B version), and NVFP4 quantization preserves 99%+ of BF16 accuracy at 2x throughput. Weights and training recipes are open and available on Hugging Face.

Anthropic Claude: Claude has no open embedding weights of its own, and its retrieval layer depends on third-party Voyage AI.
2026-07-16
LightOn released a multimodal reranker with a single relevance scale for text and scans
LightOn-rerank-LW-2B (2B, a LoRA adapter on Qwen3.5) ranks text passages and document scans with a single model and a single scale - 62.66 NDCG@10 on ViDoRe V3 versus 59.18 for Qwen3-VL-Reranker-2B and 59.40 for jina-reranker-m0. The 4B variant (64.69) beats Qwen3-VL-Reranker-8B (64.23) with half as many parameters. Weights are open on Hugging Face.

Elasticsearch: hybrid retrieval is built in (ELSER); putting text and scans/images on a unified relevance scale is what LightOn handles with a separate reranker layer on top.
2026-05-19
Hugging Face released the Ettin Reranker family - 6 open cross-encoder models from 17M to 1B
The 17M model beats 33M ms-marco-MiniLM-L12-v2 by +0.051 NDCG@10 (MTEB) with half as many parameters; the 150M model is the strongest mid-tier reranker up to 600M, outperforming the 596M Qwen3-Reranker-0.6B; the 1B model matches the quality of the 1.54B teacher model (a 0.0001 NDCG@10 difference) with 2.4x faster inference. Apache 2.0, open weights.
2026-03-03
Isaacus released Legal RAG Bench - an end-to-end RAG benchmark for legal documents
4,876 passages from the Judicial College of Victoria Criminal Charge Book plus 100 expert questions were added to the Massive Legal Embedding Benchmark (MLEB, October 2025) - evaluating not only retrieval, but end-to-end answer generation.