🎯 Fine-tuning Guide
Overview
OKai supports multiple AI model providers and offers extensive configuration options for fine-tuning model behavior, embedding generation, and performance optimization.
Model Providers
Each provider is identified by a member of the ModelProviderName enum:
enum ModelProviderName {
    OPENAI,
    ANTHROPIC,
    CLAUDE_VERTEX,
    GROK,
    GROQ,
    LLAMACLOUD,
    LLAMALOCAL,
    GOOGLE,
    REDPILL,
    OPENROUTER,
    HEURIST,
}
Provider Configuration
Each provider entry bundles default generation settings, an API endpoint, and a model name for each model class:
const models = {
    [ModelProviderName.ANTHROPIC]: {
        settings: {
            stop: [],
            maxInputTokens: 200000,
            maxOutputTokens: 8192,
            frequency_penalty: 0.0,
            presence_penalty: 0.0,
            temperature: 0.3,
        },
        endpoint: "https://api.anthropic.com/v1",
        model: {
            [ModelClass.SMALL]: "claude-3-5-haiku",
            [ModelClass.MEDIUM]: "claude-3-5-sonnet-20241022",
            [ModelClass.LARGE]: "claude-3-opus-20240229",
        },
    },
    // ... other providers
};
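As a rough illustration of how such a table might be consumed, the sketch below looks up the model name for a provider and model class. It is a minimal example under stated assumptions: `resolveModel` is a hypothetical helper, not part of OKai, and the enums are modelled as string-valued constant objects purely to keep the block self-contained.

```typescript
// Stand-ins for the enums above, modelled as constant objects (an assumption
// made for this sketch; the real definitions are TypeScript enums).
const ModelClass = { SMALL: "small", MEDIUM: "medium", LARGE: "large" } as const;
type ModelClass = (typeof ModelClass)[keyof typeof ModelClass];

const ModelProviderName = { ANTHROPIC: "anthropic", OPENAI: "openai" } as const;
type ModelProviderName = (typeof ModelProviderName)[keyof typeof ModelProviderName];

type ProviderConfig = {
    endpoint: string;
    settings: { temperature: number; maxOutputTokens: number };
    model: Partial<Record<ModelClass, string>>;
};

// Trimmed copy of the configuration table, with a single provider.
const models: Partial<Record<ModelProviderName, ProviderConfig>> = {
    [ModelProviderName.ANTHROPIC]: {
        endpoint: "https://api.anthropic.com/v1",
        settings: { temperature: 0.3, maxOutputTokens: 8192 },
        model: {
            [ModelClass.SMALL]: "claude-3-5-haiku",
            [ModelClass.MEDIUM]: "claude-3-5-sonnet-20241022",
        },
    },
};

// Hypothetical helper: returns the configured model name, or throws a
// descriptive error when the provider or the model class is not configured.
function resolveModel(provider: ModelProviderName, modelClass: ModelClass): string {
    const config = models[provider];
    if (!config) {
        throw new Error(`No configuration for provider "${provider}"`);
    }
    const name = config.model[modelClass];
    if (!name) {
        throw new Error(`Provider "${provider}" has no "${modelClass}" model`);
    }
    return name;
}
```

Failing loudly on a missing entry is a design choice for the sketch: a silent `undefined` model name would otherwise only surface later as an opaque API error.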
Model Classes
Models are categorized into different classes based on their capabilities:
enum ModelClass {
    SMALL, // Fast, efficient for simple tasks
    MEDIUM, // Balanced performance and capability
    LARGE, // Most capable but slower/more expensive
    EMBEDDING, // Specialized for vector embeddings
    IMAGE, // Image generation capabilities
}
Embedding System
Configuration
const embeddingConfig = {
    dimensions: 1536,
    modelName: "text-embedding-3-small",
    cacheEnabled: true,
};
Implementation
async function embed(runtime: IAgentRuntime, input: string): Promise<number[]> {
    // Check cache first
    const cachedEmbedding = await retrieveCachedEmbedding(runtime, input);
    if (cachedEmbedding) return cachedEmbedding;

    // Generate new embedding
    const response = await runtime.fetch(
        `${runtime.modelProvider.endpoint}/embeddings`,
        {
            method: "POST",
            headers: {
                Authorization: `Bearer ${runtime.token}`,
                "Content-Type": "application/json",
            },
            body: JSON.stringify({
                input,
                model: runtime.modelProvider.model.EMBEDDING,
                dimensions: 1536,
            }),
        },
    );

    const data = await response.json();
    // Optional chaining on every step, so an empty `data` array does not throw
    return data?.data?.[0]?.embedding;
}
Fine-tuning Options
Temperature Control
Configure model creativity vs. determinism:
const temperatureSettings = {
    creative: {
        temperature: 0.8,
        frequency_penalty: 0.7,
        presence_penalty: 0.7,
    },
    balanced: {
        temperature: 0.5,
        frequency_penalty: 0.3,
        presence_penalty: 0.3,
    },
    precise: {
        temperature: 0.2,
        frequency_penalty: 0.0,
        presence_penalty: 0.0,
    },
};
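One way these presets might be applied is to overlay them on a provider's default settings. The sketch below assumes a hypothetical `applyPreset` helper and a `ProviderSettings` shape inferred from the provider configuration in this guide; neither name comes from OKai itself.

```typescript
type SamplingSettings = {
    temperature: number;
    frequency_penalty: number;
    presence_penalty: number;
};

type PresetName = "creative" | "balanced" | "precise";

const temperatureSettings: Record<PresetName, SamplingSettings> = {
    creative: { temperature: 0.8, frequency_penalty: 0.7, presence_penalty: 0.7 },
    balanced: { temperature: 0.5, frequency_penalty: 0.3, presence_penalty: 0.3 },
    precise: { temperature: 0.2, frequency_penalty: 0.0, presence_penalty: 0.0 },
};

// Assumed shape of a provider's settings block (illustrative only).
type ProviderSettings = SamplingSettings & {
    stop: string[];
    maxInputTokens: number;
    maxOutputTokens: number;
};

// Hypothetical helper: returns a new settings object in which the preset's
// sampling values override the provider defaults; the input is left untouched.
function applyPreset(base: ProviderSettings, preset: PresetName): ProviderSettings {
    return { ...base, ...temperatureSettings[preset] };
}

// Usage: start from the Anthropic defaults and switch to the creative preset.
const anthropicDefaults: ProviderSettings = {
    stop: [],
    maxInputTokens: 200000,
    maxOutputTokens: 8192,
    frequency_penalty: 0.0,
    presence_penalty: 0.0,
    temperature: 0.3,
};
const creativeSettings = applyPreset(anthropicDefaults, "creative");
```

Returning a copy rather than mutating the defaults keeps a per-request override from leaking into later requests.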
Context Window
Manage token limits:
const contextSettings = {
    OPENAI: {
        maxInputTokens: 128000,
        maxOutputTokens: 8192,
    },
    ANTHROPIC: {
        maxInputTokens: 200000,
        maxOutputTokens: 8192,
    },
    LLAMALOCAL: {
        maxInputTokens: 32768,
        maxOutputTokens: 8192,
    },
};
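To stay within `maxInputTokens`, older context usually has to be dropped before a request is sent. The sketch below is one possible approach, not OKai's implementation: it keeps the most recent messages that fit the budget. The four-characters-per-token estimate is a rough assumption; a real tokenizer for the target model would give more accurate counts. `estimateTokens` and `fitToContext` are hypothetical names.

```typescript
// Rough heuristic (assumption): about four characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
    return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: walks the history from newest to oldest and keeps
// messages until the next one would exceed the input budget. The returned
// array preserves the original (oldest-first) order.
function fitToContext(messages: string[], maxInputTokens: number): string[] {
    const kept: string[] = [];
    let used = 0;
    for (const message of [...messages].reverse()) {
        const cost = estimateTokens(message);
        if (used + cost > maxInputTokens) break;
        kept.unshift(message);
        used += cost;
    }
    return kept;
}
```

Dropping whole messages from the oldest end is the simplest policy; summarizing the dropped portion is a common alternative when early context matters.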
Performance Optimization
Caching Strategy
import NodeCache from "node-cache";
import path from "path";

class EmbeddingCache {
    private cache: NodeCache;
    private cacheDir: string;

    constructor() {
        this.cache = new NodeCache({ stdTTL: 300 }); // 5 minute TTL
        this.cacheDir = path.join(__dirname, "cache");
    }

    async get(key: string): Promise<number[] | null> {
        // Check memory cache first
        const cached = this.cache.get<number[]>(key);
        if (cached) return cached;
        // Check disk cache
        return this.readFromDisk(key);
    }

    async set(key: string, embedding: number[]): Promise<void> {
        this.cache.set(key, embedding);
        await this.writeToDisk(key, embedding);
    }

    // readFromDisk and writeToDisk are not shown here
}
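The class above calls `readFromDisk` and `writeToDisk` without defining them. The following is a minimal sketch of what such helpers could look like, written as standalone functions that take the cache directory as a parameter (an assumption made to keep the example self-contained). It uses only Node.js built-ins; the file layout (one JSON file per key, named by the SHA-256 hash of the key) and the `cacheFilePath` helper are illustrative choices, not OKai's actual on-disk format.

```typescript
import { createHash } from "node:crypto";
import { mkdir, readFile, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Hash the key so arbitrary input text maps to a safe, fixed-length file name.
function cacheFilePath(cacheDir: string, key: string): string {
    const digest = createHash("sha256").update(key).digest("hex");
    return join(cacheDir, `${digest}.json`);
}

// Returns the cached embedding, or null on a miss. A missing, unreadable, or
// malformed file is treated as a miss rather than an error.
async function readFromDisk(cacheDir: string, key: string): Promise<number[] | null> {
    try {
        const raw = await readFile(cacheFilePath(cacheDir, key), "utf8");
        const parsed: unknown = JSON.parse(raw);
        const isNumberArray =
            Array.isArray(parsed) && parsed.every((n) => typeof n === "number");
        return isNumberArray ? (parsed as number[]) : null;
    } catch {
        return null;
    }
}

// Writes the embedding as JSON, creating the cache directory if needed.
async function writeToDisk(cacheDir: string, key: string, embedding: number[]): Promise<void> {
    await mkdir(cacheDir, { recursive: true });
    await writeFile(cacheFilePath(cacheDir, key), JSON.stringify(embedding), "utf8");
}
```

Treating a corrupt file as a cache miss means the worst case is a regenerated embedding, which is usually preferable to failing the request.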
Model Selection
async function selectOptimalModel(
    task: string,
    requirements: ModelRequirements,
): Promise<ModelClass> {
    if (requirements.speed === "fast") {
        return ModelClass.SMALL;
    } else if (requirements.complexity === "high") {
        return ModelClass.LARGE;
    }
    return ModelClass.MEDIUM;
}
Provider-Specific Optimizations
OpenAI
const openAISettings = {
    endpoint: "https://api.openai.com/v1",
    settings: {
        stop: [],
        maxInputTokens: 128000,
        maxOutputTokens: 8192,
        frequency_penalty: 0.0,
        presence_penalty: 0.0,
        temperature: 0.6,
    },
    model: {
        [ModelClass.SMALL]: "gpt-4o-mini",
        [ModelClass.MEDIUM]: "gpt-4o",
        [ModelClass.LARGE]: "gpt-4o",
        [ModelClass.EMBEDDING]: "text-embedding-3-small",
        [ModelClass.IMAGE]: "dall-e-3",
    },
};
Anthropic
const anthropicSettings = {
    endpoint: "https://api.anthropic.com/v1",
    settings: {
        stop: [],
        maxInputTokens: 200000,
        maxOutputTokens: 8192,
        temperature: 0.3,
    },
    model: {
        [ModelClass.SMALL]: "claude-3-5-haiku",
        [ModelClass.MEDIUM]: "claude-3-5-sonnet-20241022",
        [ModelClass.LARGE]: "claude-3-opus-20240229",
    },
};
Local LLM
const llamaLocalSettings = {
    settings: {
        stop: ["<|eot_id|>", "<|eom_id|>"],
        maxInputTokens: 32768,
        maxOutputTokens: 8192,
        repetition_penalty: 0.0,
        temperature: 0.3,
    },
    model: {
        [ModelClass.SMALL]: "NousResearch/Hermes-3-Llama-3.1-8B-GGUF",
        [ModelClass.MEDIUM]: "NousResearch/Hermes-3-Llama-3.1-8B-GGUF",
        [ModelClass.LARGE]: "NousResearch/Hermes-3-Llama-3.1-8B-GGUF",
        [ModelClass.EMBEDDING]: "togethercomputer/m2-bert-80M-32k-retrieval",
    },
};
Heurist Provider
const heuristSettings = {
    settings: {
        stop: [],
        maxInputTokens: 32768,
        maxOutputTokens: 8192,
        repetition_penalty: 0.0,
        temperature: 0.7,
    },
    imageSettings: {
        steps: 20,
    },
    endpoint: "https://llm-gateway.heurist.xyz",
    model: {
        [ModelClass.SMALL]: "hermes-3-llama3.1-8b",
        [ModelClass.MEDIUM]: "mistralai/mixtral-8x7b-instruct",
        [ModelClass.LARGE]: "nvidia/llama-3.1-nemotron-70b-instruct",
        [ModelClass.EMBEDDING]: "", // Add later
        [ModelClass.IMAGE]: "FLUX.1-dev",
    },
};
Testing and Validation
Embedding Tests
async function validateEmbedding(
    embedding: number[],
    expectedDimensions: number = 1536,
): Promise<boolean> {
    if (!Array.isArray(embedding)) return false;
    if (embedding.length !== expectedDimensions) return false;
    if (embedding.some((n) => typeof n !== "number")) return false;
    return true;
}
Model Performance Testing
async function benchmarkModel(
    runtime: IAgentRuntime,
    modelClass: ModelClass,
    testCases: TestCase[],
): Promise<BenchmarkResults> {
    // Typed explicitly: bare `[]` literals would be inferred as never[] in strict mode
    const results = {
        latency: [] as number[],
        tokenUsage: [] as number[],
        accuracy: [] as number[],
    };

    for (const test of testCases) {
        const start = Date.now();
        const response = await runtime.generateText({
            context: test.input,
            modelClass,
        });
        results.latency.push(Date.now() - start); // elapsed milliseconds
        // ... additional metrics
    }

    return results;
}
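Raw latency arrays are hard to compare across models, so a benchmark run is usually reduced to a few summary statistics. The sketch below shows one way to do that, assuming the latencies are in milliseconds as collected above. `summarizeLatency`, `percentile`, and the `LatencySummary` shape are hypothetical names, and the percentile uses the nearest-rank method, which is one of several common definitions.

```typescript
type LatencySummary = {
    count: number;
    mean: number;
    p50: number;
    p95: number;
    max: number;
};

// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(0, rank - 1)];
}

// Hypothetical helper: reduces a list of latencies (ms) to summary statistics.
function summarizeLatency(latencies: number[]): LatencySummary {
    if (latencies.length === 0) {
        throw new Error("Cannot summarize an empty latency list");
    }
    const sorted = [...latencies].sort((a, b) => a - b);
    const total = sorted.reduce((sum, value) => sum + value, 0);
    return {
        count: sorted.length,
        mean: total / sorted.length,
        p50: percentile(sorted, 50),
        p95: percentile(sorted, 95),
        max: sorted[sorted.length - 1],
    };
}
```

Reporting a high percentile alongside the mean matters for model calls, where a few slow responses can dominate perceived latency without moving the average much.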
Best Practices
Model Selection Guidelines
- Task Complexity
  - Use SMALL for simple, quick responses
  - Use MEDIUM for balanced performance
  - Use LARGE for complex reasoning
- Context Management
  - Keep prompts concise and focused
  - Use context windows efficiently
  - Implement proper context truncation
- Temperature Adjustment
  - Lower for factual responses
  - Higher for creative tasks
  - Balance based on use case
Performance Optimization
- Caching Strategy
  - Cache embeddings for frequently accessed content
  - Implement tiered caching (memory/disk)
  - Regular cache cleanup
- Resource Management
  - Monitor token usage
  - Implement rate limiting
  - Optimize batch processing
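The rate-limiting and batching advice above can be sketched with two small utilities. This is an illustrative example rather than anything OKai ships: `TokenBucket` is a standard token-bucket limiter with an injectable clock (so it can be tested deterministically), and `chunk` splits a list of inputs into fixed-size batches. The capacity, refill rate, and batch size are values the caller would tune to the provider's published limits.

```typescript
// Splits items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
    if (size < 1) throw new Error("Batch size must be at least 1");
    const batches: T[][] = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Token bucket: holds up to `capacity` tokens and refills continuously at
// `refillPerSecond`. Each permitted request consumes one token.
class TokenBucket {
    private readonly capacity: number;
    private readonly refillPerSecond: number;
    private tokens: number;
    private lastRefill: number;

    constructor(capacity: number, refillPerSecond: number, now: number = Date.now()) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefill = now;
    }

    // Returns true and consumes a token if a request may proceed at time `now`
    // (milliseconds); returns false if the caller should wait and retry.
    tryAcquire(now: number = Date.now()): boolean {
        const elapsedSeconds = (now - this.lastRefill) / 1000;
        this.tokens = Math.min(
            this.capacity,
            this.tokens + elapsedSeconds * this.refillPerSecond,
        );
        this.lastRefill = now;
        if (this.tokens < 1) return false;
        this.tokens -= 1;
        return true;
    }
}
```

A typical use would be to chunk the inputs for a batch embedding call and check `tryAcquire()` before sending each batch, sleeping briefly when it returns false.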
Troubleshooting
Common Issues
- Token Limits
    function handleTokenLimit(error: Error) {
        if (error.message.includes("token limit")) {
            return truncateAndRetry();
        }
    }
- Embedding Errors
    function handleEmbeddingError(error: Error) {
        if (error.message.includes("dimension mismatch")) {
            return regenerateEmbedding();
        }
    }
- Model Availability
    async function handleModelFailover(error: Error) {
        if (error.message.includes("model not available")) {
            return switchToFallbackModel();
        }
    }
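The failover handler above leaves `switchToFallbackModel` undefined. One possible shape for that logic is sketched below: try a list of candidate models in order and move on only when the failure looks like an availability problem. `withFallback` is a hypothetical helper, and matching on the "model not available" message simply mirrors the check used above; real provider errors are better distinguished by status code or error type where those are available.

```typescript
// Hypothetical helper: calls `call` with each candidate model in order and
// returns the first successful result. Only "model not available" errors
// trigger a fallback; any other error is rethrown immediately.
async function withFallback<T>(
    candidates: string[],
    call: (model: string) => Promise<T>,
): Promise<T> {
    const failures: string[] = [];
    for (const model of candidates) {
        try {
            return await call(model);
        } catch (error) {
            const message = error instanceof Error ? error.message : String(error);
            if (!message.includes("model not available")) throw error;
            failures.push(`${model}: ${message}`);
        }
    }
    throw new Error(`All candidate models failed (${failures.join("; ")})`);
}
```

Rethrowing unrelated errors keeps a bad request or an authentication failure from being silently retried against every model in the list.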