Training Custom Embedding Models
By Marcin Piekarski · builtweb.com.au · Last Updated: 11 February 2026
TL;DR
General-purpose embedding models work well for everyday text, but they struggle with specialized vocabulary in domains like law, medicine, or finance. Custom embedding models are trained on your domain's data to better understand the specific meanings and relationships in your field, typically improving retrieval accuracy by 10-30% on domain-specific tasks.
Why it matters
Embeddings are the foundation of modern AI search. Every time a RAG system retrieves documents, a chatbot finds relevant knowledge base articles, or a search engine ranks results by meaning rather than keywords, embeddings are doing the heavy lifting. They convert text into numerical representations that capture meaning.
The problem is that general embedding models are trained on broad internet text. They understand that "cat" and "kitten" are related, but they may not understand that in legal contexts, "consideration" means payment (not thoughtfulness), or that in medicine, "positive" test results are usually bad news. When your embedding model does not understand your domain's language, your entire retrieval pipeline suffers -- the AI retrieves the wrong documents and generates wrong answers.
Custom embeddings fix this by teaching the model the specific language, relationships, and meanings that matter in your domain.
When general embeddings fall short
General-purpose embedding models like OpenAI's text-embedding-3-large or open-source models from Sentence Transformers are remarkably capable. For most everyday use cases, they work well. But specific scenarios reveal their limitations:
Domain-specific vocabulary
Legal documents use terms like "estoppel," "laches," and "tortious interference" that rarely appear in general training data. A general model might produce similar embeddings for unrelated legal concepts simply because it has not seen enough legal text to distinguish them.
Industry jargon and acronyms
In healthcare, "MI" could mean myocardial infarction (heart attack) or motivational interviewing (a counseling technique). A general model has no way to know which meaning applies in your specific context. A custom model trained on your domain learns the right associations.
Specialized relationships
In pharmaceutical research, a general model might not understand that "aspirin" and "acetylsalicylic acid" are the same thing, or that "NSAIDs" is a category that includes both. Custom training teaches these domain-specific relationships.
Proprietary terminology
Every company has internal jargon: product names, process abbreviations, team-specific terms. General models have never seen these words and will produce poor embeddings for them.
Rule of thumb: If your users search using specialized terminology and the retrieval results feel hit-or-miss, the embedding model is probably the bottleneck.
How contrastive learning works (in plain English)
The most common way to train custom embeddings is contrastive learning. The idea is surprisingly intuitive.
Imagine you are teaching someone to organize a library. You show them pairs of books and say: "These two books belong on the same shelf" (positive pair) or "These two books belong on different shelves" (negative pair). After enough examples, they develop an intuition for which books are related.
Contrastive learning does exactly this with text:
- Positive pairs: You provide examples of text that should be similar. A user query and the document that correctly answers it. Two paragraphs about the same topic.
- Negative pairs: Text that should be different. A user query and an irrelevant document. Two paragraphs about unrelated topics.
- Training: The model learns to produce embeddings that are close together for positive pairs and far apart for negative pairs.
Over thousands of examples, the model learns the specific similarity patterns in your domain. A legal model learns that "breach of contract" and "contractual violation" should be close together. A medical model learns that "hypertension" and "high blood pressure" should be neighbors.
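The "close together for positives, far apart for negatives" objective can be made concrete with a little arithmetic. The sketch below (plain NumPy, with made-up 3-dimensional vectors standing in for real embeddings) computes an InfoNCE-style contrastive loss: it is small when the query's embedding is near the positive and large when it is near a negative instead.

```python
import numpy as np

def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: low when the query is close (by cosine
    similarity) to the positive and far from every negative."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = np.array([cos(query, positive)] + [cos(query, n) for n in negatives])
    logits = sims / temperature
    # Softmax probability that the positive outranks the negatives.
    p_pos = np.exp(logits[0]) / np.exp(logits).sum()
    return -np.log(p_pos)

# Toy "embeddings": the positive points the same way as the query.
q   = np.array([1.0, 0.0, 0.0])
pos = np.array([0.9, 0.1, 0.0])  # similar meaning -> low loss
neg = np.array([0.0, 1.0, 0.0])  # unrelated meaning

loss_good = contrastive_loss(q, pos, [neg])
loss_bad  = contrastive_loss(q, neg, [pos])  # labels swapped -> high loss
print(loss_good, loss_bad)
```

During training, gradients from this loss nudge the model's weights so that texts in positive pairs drift together in the embedding space, which is exactly the "same shelf / different shelf" intuition above.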
What you need to get started
Training data: the critical ingredient
The quality of your custom embedding model depends almost entirely on the quality of your training pairs. Here is what you need:
- Positive pairs: 1,000-10,000+ examples of (query, relevant document) pairs. More is better, but quality matters more than quantity.
- Hard negatives: Documents that are somewhat related but not the correct answer. These are more valuable than completely random negatives because they teach the model to make fine-grained distinctions.
Where to get training pairs:
- Search logs: If you have an existing search system, user clicks tell you which documents were relevant to which queries.
- Expert annotations: Have domain experts match queries to relevant documents. Expensive but high quality.
- Synthetic generation: Use an LLM to generate questions that a given document would answer. This is fast and surprisingly effective.
- Existing Q&A data: Support tickets, FAQ pages, or forum questions paired with their answers.
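Hard negatives can be mined automatically: retrieve with your current baseline and keep the top-scoring documents that are *not* the gold answer. The sketch below uses a naive word-overlap scorer purely as a stand-in for a real retriever (BM25 or a general embedding model); the corpus and query are invented examples.

```python
def mine_hard_negatives(query, gold_doc, corpus, top_k=3):
    """Return the top-scoring documents that are NOT the gold answer.
    score() is a toy word-overlap stand-in for your baseline retriever."""
    def score(q, doc):
        q_words, d_words = set(q.lower().split()), set(doc.lower().split())
        return len(q_words & d_words) / max(len(q_words), 1)
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked if d != gold_doc][:top_k]

corpus = [
    "Aspirin is an NSAID used for pain relief.",
    "Ibuprofen is an NSAID used for pain and fever.",    # related but wrong
    "Our refund policy allows returns within 30 days.",  # easy negative
]
negs = mine_hard_negatives(
    "Which NSAID is used for pain relief?",
    gold_doc=corpus[0],
    corpus=corpus,
)
print(negs[0])  # the ibuprofen document: topically close, not the answer
```

The ibuprofen document is the valuable training signal here: a random negative (the refund policy) teaches the model nothing it does not already know.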
Compute resources
Fine-tuning an embedding model is much cheaper than training a large language model. A single GPU can fine-tune a Sentence Transformer model in a few hours. Cloud compute costs are typically under $50 for a training run.
A baseline to compare against
Before training, measure how well a general embedding model performs on your domain. This gives you a clear before-and-after comparison. Without a baseline, you cannot know if your custom model is actually better.
Practical implementation steps
Start with a strong base model. Do not train from scratch. Fine-tune an existing model like all-MiniLM-L6-v2 or bge-large-en-v1.5 from Sentence Transformers. These already understand general language; you are teaching them your domain.
Prepare your training data. Format your positive pairs and hard negatives. Clean the data -- remove duplicates, fix obvious errors, ensure the positive pairs are genuinely relevant.
Train with the Sentence Transformers library. This is the most accessible option. The library handles the contrastive learning setup, loss functions, and training loop. A basic training script is about 30 lines of Python.
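A minimal script in that spirit might look like the following. It uses the Sentence Transformers `fit` API with `MultipleNegativesRankingLoss`, which treats the other documents in each batch as negatives, so you only need positive pairs to start. The pairs shown are hypothetical placeholders, and the training call is left commented out because it downloads the base model and takes real compute time.

```python
# Hypothetical (query, relevant_document) pairs -- replace with your own.
train_pairs = [
    ("What does estoppel mean?", "Estoppel bars a party from contradicting ..."),
    ("Is consideration required?", "Consideration is the payment exchanged ..."),
    # ... 1,000+ pairs in practice
]

def finetune(pairs, base_model="all-MiniLM-L6-v2", output_path="my-domain-model"):
    # Imported here so the data prep above runs without the library installed.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    # Pulls each (query, doc) pair together; the other documents in the
    # batch serve as in-batch negatives.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    model.save(output_path)
    return output_path

# finetune(train_pairs)  # uncomment to train (downloads the base model)
```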
Evaluate on a held-out test set. Reserve 10-20% of your pairs for testing. Measure Recall@10 (does the correct document appear in the top 10 results?), MRR (Mean Reciprocal Rank), and compare against your baseline.
Reindex your documents. Once you have a trained model, generate new embeddings for all your documents. Old embeddings from the general model are not compatible with the new model's embedding space.
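The reindexing step is mechanical but easy to get wrong. One way to keep it clean, sketched below, is a helper that re-embeds every document with whatever encoder you hand it; in real use you would pass your fine-tuned model's `encode` method, while here a dummy encoder keeps the example self-contained.

```python
def reindex(documents, encode):
    """Re-embed every document with the new model.
    `encode` is any callable mapping a list of texts to vectors --
    e.g. a fine-tuned SentenceTransformer's .encode method.
    Never mix these vectors with ones from the old model."""
    vectors = encode([doc["text"] for doc in documents])
    return {doc["id"]: vec for doc, vec in zip(documents, vectors)}

# Dummy encoder for illustration only; real code would pass model.encode.
def dummy_encode(texts):
    return [[float(len(t)), float(t.count(" "))] for t in texts]

docs = [{"id": "a", "text": "breach of contract"},
        {"id": "b", "text": "hypertension treatment"}]
index = reindex(docs, dummy_encode)
print(index["a"])  # [18.0, 2.0]
```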
A/B test in production. Run both the general and custom models in parallel and compare real-world retrieval quality. User satisfaction and click-through rates are the ultimate test.
Evaluating embedding quality
Numbers matter. Here are the key metrics:
- Recall@K: What percentage of the time does the correct document appear in the top K results? Recall@10 of 90% means the right answer is in the top 10 results 90% of the time.
- MRR (Mean Reciprocal Rank): How high is the correct document ranked on average? An MRR of 0.8 means the correct answer is typically the first or second result.
- NDCG (Normalized Discounted Cumulative Gain): Measures overall ranking quality, not just whether the right document appears but how high it is ranked.
Always compare these metrics against a general-purpose baseline. A custom model should meaningfully outperform the baseline on your domain data. If it does not, the training data may need improvement.
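All three metrics are straightforward to compute once you know, for each test query, where the correct document landed in the ranked results. The self-contained sketch below assumes the common single-relevant-document setup (one correct answer per query), with invented ranks for four test queries.

```python
import math

def recall_at_k(rank, k):
    """rank is the 1-based position of the correct doc, or None if missed."""
    return 1.0 if rank is not None and rank <= k else 0.0

def reciprocal_rank(rank):
    return 1.0 / rank if rank is not None else 0.0

def ndcg_single(rank):
    """NDCG with one relevant document: the ideal DCG (rank 1) is 1."""
    return 1.0 / math.log2(rank + 1) if rank is not None else 0.0

# Rank of the correct document for four test queries (None = not retrieved).
ranks = [1, 3, 2, None]
n = len(ranks)
recall10 = sum(recall_at_k(r, 10) for r in ranks) / n  # 0.75
mrr      = sum(reciprocal_rank(r) for r in ranks) / n  # ~0.458
ndcg     = sum(ndcg_single(r) for r in ranks) / n
print(recall10, mrr, round(ndcg, 3))
```

Run the same computation on the baseline model's ranks and the custom model's ranks over an identical held-out query set, and the before-and-after comparison falls out directly.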
The cost/benefit trade-off
Costs:
- Data collection: 1-4 weeks of effort (the biggest cost)
- Training compute: $10-$100 per training run
- Engineering time: 1-2 weeks for initial implementation
- Ongoing maintenance: Periodic retraining as your domain evolves
Benefits:
- 10-30% improvement in domain-specific retrieval accuracy
- Better user experience with more relevant search results
- Reduced hallucinations in RAG systems (better retrieval means better context)
- Smaller model sizes possible (a fine-tuned small model can outperform a general large model on your domain)
When it is worth it: If retrieval quality directly affects your product or your users rely on search to find critical information, custom embeddings are almost always worth the investment. If search is a minor feature and general embeddings perform acceptably, the effort may not be justified.
Common mistakes
- Skipping the baseline measurement. Without knowing how well general embeddings perform on your data, you cannot justify the investment or measure improvement. Always benchmark first.
- Using only easy negatives. Training with completely random negative pairs teaches the model obvious distinctions. Use hard negatives -- documents that are topically related but not the correct answer -- to teach the model nuance.
- Not having enough training data. With fewer than 500 pairs, the model may overfit and actually perform worse than the general baseline. If you cannot collect enough pairs, try synthetic data generation as a supplement.
- Forgetting to reindex. A new embedding model produces vectors in a different space than the old one. You must regenerate embeddings for all your documents. Mixing old and new embeddings will produce nonsensical search results.
- Training once and never updating. Domains evolve. New terminology appears, meanings shift, new document types are added. Plan for periodic retraining, at least quarterly for active domains.
What's next?
- Embeddings Explained -- foundational understanding of what embeddings are and how they work
- Embeddings and RAG -- how embeddings power retrieval-augmented generation systems
- Advanced RAG Techniques -- improving the entire retrieval pipeline beyond just embeddings
- Vector Database Examples -- where to store and search your custom embeddings
Frequently Asked Questions
How much data do I need to train custom embeddings?
A minimum of 1,000 query-document pairs to see meaningful improvement, with 5,000-10,000 pairs being the sweet spot. Quality matters more than quantity -- 2,000 expert-curated pairs will outperform 20,000 noisy ones. If you have limited data, synthetic pair generation using an LLM can help supplement.
Can I fine-tune OpenAI's embedding model or only open-source models?
Both options exist. OpenAI offers fine-tuning for their embedding models through their API. Open-source models via Sentence Transformers give you more control, lower ongoing costs, and no vendor lock-in. For most teams, starting with open-source Sentence Transformers is more practical and flexible.
How do I know if my custom embeddings are actually better?
Measure retrieval metrics (Recall@10, MRR) on a held-out test set for both your custom model and a general baseline. If your custom model does not improve these metrics by at least 5-10% on domain-specific queries, the general model may be good enough for your use case.
Do custom embeddings work for multilingual content?
Yes, but start with a multilingual base model like paraphrase-multilingual-MiniLM-L12-v2. Then fine-tune on your domain-specific pairs in all relevant languages. The model will learn domain terminology across languages simultaneously.
About the Authors
Marcin Piekarski · Frontend Lead & AI Educator
Marcin is a Frontend Lead with 20+ years in tech. Currently building headless ecommerce at Harvey Norman (Next.js, Node.js, GraphQL). He created Field Guide to AI to help others understand AI tools practically—without the jargon.
Credentials & Experience:
- 20+ years web development experience
- Frontend Lead at Harvey Norman (10 years)
- Worked with: Gumtree, CommBank, Woolworths, Optus, M&C Saatchi
- Runs AI workshops for teams
- Founder of builtweb.com.au
- Daily AI tools user: ChatGPT, Claude, Gemini, AI coding assistants
- Specializes in React ecosystem: React, Next.js, Node.js
Prism AI· AI Research & Writing Assistant
Prism AI is the AI ghostwriter behind Field Guide to AI—a collaborative ensemble of frontier models (Claude, ChatGPT, Gemini, and others) that assist with research, drafting, and content synthesis. Like light through a prism, human expertise is refracted through multiple AI perspectives to create clear, comprehensive guides. All AI-generated content is reviewed, fact-checked, and refined by Marcin before publication.
Key Terms Used in This Guide
Embedding
A list of numbers that represents the meaning of text, images, or other data. Similar meanings produce similar numbers, so computers can measure how 'close' two concepts are.
Model
The trained AI system that contains all the patterns and knowledge learned from data. It's the end product of training—the 'brain' that takes inputs and produces predictions, decisions, or generated content.
Training
The process of feeding large amounts of data to an AI system so it learns patterns, relationships, and rules, enabling it to make predictions or generate output.
AI (Artificial Intelligence)
Making machines perform tasks that typically require human intelligence—like understanding language, recognizing patterns, or making decisions.
Fine-Tuning
Taking a pre-trained AI model and training it further on your specific data to make it better at your particular task or adopt a specific style.
Related Guides
- Advanced RAG Techniques (Advanced, 9 min read) -- Go beyond basic RAG: hybrid search, reranking, query expansion, HyDE, and multi-hop retrieval for better context quality.
- Fine-Tuning Fundamentals: Customizing AI Models (Intermediate, 8 min read) -- Fine-tuning adapts pre-trained models to your specific use case. Learn when to fine-tune, how it works, and alternatives.
- Semantic Search: Search by Meaning, Not Keywords (Intermediate, 6 min read) -- Semantic search finds results based on meaning, not exact keyword matches. Learn how it works and how to implement it.