Open Source Tools We Love: The Stack Powering AIMatrix
Building AIMatrix wouldn’t have been possible without the incredible open source ecosystem. These tools saved us literally months of development time and helped us focus on what makes our platform unique.
Here’s our honest take on the open source tools we rely on daily, including what works great, what’s frustrating, and how we contribute back.
Vector Database: Qdrant
Why we chose it: When we were evaluating vector databases (Pinecone, Weaviate, Chroma), Qdrant struck the best balance of performance, features, and operational simplicity.
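To give a flavor of how we use it, here is a minimal sketch of a filtered similarity search against Qdrant's REST API. The collection name matches our config below; the tenant_id payload field, the local port 6333, and the short query vector are illustrative, not our production setup.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    // Top-5 nearest neighbours, restricted to one tenant via payload filtering
    val body = """
        {
          "vector": [0.12, -0.03, 0.27, 0.91],
          "filter": { "must": [ { "key": "tenant_id", "match": { "value": "acme" } } ] },
          "limit": 5,
          "with_payload": true
        }
    """.trimIndent()

    val request = HttpRequest.newBuilder(
            URI.create("http://localhost:6333/collections/agent_memories/points/search"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // scored points with their payloads
}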
What we love:
- Performance: Handles our 10M+ vector searches per day without breaking a sweat
- Filtering: Complex metadata filtering works exactly as expected
- API design: RESTful API is intuitive, gRPC option for performance
- Operational simplicity: Single binary, reasonable resource usage
- Clustering: Built-in distributed mode for high availability
What’s annoying:
- Memory usage: Can be memory-hungry with large collections
- Documentation: Some advanced features lack detailed examples
- Migration tools: Moving from other vector DBs requires custom tooling
Our contribution: We’ve contributed performance benchmarks and documentation improvements, particularly around metadata filtering patterns.
Testing: Testcontainers
Why it’s essential: Testing AI systems requires realistic data stores, message queues, and external services. Testcontainers lets us spin up real infrastructure for tests.
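The pattern looks roughly like this in Kotlin (the Qdrant image and port are examples; the small KContainer subclass is just the usual workaround for Testcontainers' recursive generics in Kotlin):

import org.testcontainers.containers.GenericContainer
import org.testcontainers.utility.DockerImageName

// Kotlin needs a concrete self-type for GenericContainer's recursive generics
class KContainer(image: DockerImageName) : GenericContainer<KContainer>(image)

fun main() {
    // Spin up a throwaway Qdrant instance for the duration of the test
    val qdrant = KContainer(DockerImageName.parse("qdrant/qdrant:latest"))
        .withExposedPorts(6333) // Testcontainers maps this to a random free host port

    qdrant.start()
    try {
        val baseUrl = "http://${qdrant.host}:${qdrant.getMappedPort(6333)}"
        println("Qdrant running at $baseUrl") // point the code under test here
    } finally {
        qdrant.stop()
    }
}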
What we love:
- Real environment testing: No more “works in tests, fails in prod” surprises
- Easy setup: Container management is completely handled
- Multiple languages: Works with our Kotlin, Python, and Node.js services
- CI/CD friendly: Plays well with GitHub Actions and other CI systems
What’s challenging:
- Startup time: Tests can be slower, especially with multiple containers
- Resource usage: CI/CD systems need more memory and CPU
- Flaky networks: Occasional container networking issues in CI
- Debugging: When tests fail, container logs can be hard to access
Message Queue: Apache Pulsar
Why we picked it over Kafka: Multi-tenancy, geo-replication, and built-in schema registry made Pulsar the better choice for our agent communication needs.
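A minimal sketch of the consumer side is below. The tenant, namespace, topic, and subscription names are made up for illustration; the real setup also wires in schemas and retry policies.

import org.apache.pulsar.client.api.PulsarClient
import org.apache.pulsar.client.api.Schema
import org.apache.pulsar.client.api.SubscriptionType

fun main() {
    val client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650")
        .build()

    // Key_Shared: work is spread across consumers, but every message with the
    // same key (e.g. one conversation) goes to the same consumer, in order.
    val consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://acme/agents/tasks") // tenant/namespace/topic
        .subscriptionName("agent-workers")
        .subscriptionType(SubscriptionType.Key_Shared)
        .subscribe()

    val msg = consumer.receive()
    println("key=${msg.key} payload=${msg.value}")
    consumer.acknowledge(msg)

    client.close()
}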
What works well:
- Multi-tenancy: Perfect for isolating customer data
- Schema evolution: Built-in schema registry saves tons of migration headaches
- Geo-replication: Automatic data replication across regions
- Flexible subscriptions: Key_Shared subscriptions are perfect for agent load balancing
Pain points:
- Operational complexity: More complex than simpler message queues
- Learning curve: Concepts like tenants, namespaces, and subscriptions take time
- Tooling: Fewer third-party tools compared to Kafka ecosystem
- Memory usage: Can be resource-intensive
Observability: OpenTelemetry + Jaeger
Why this combo: We needed distributed tracing across our polyglot services (Kotlin, Python, Node.js). OpenTelemetry’s vendor-neutral approach was perfect.
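The instrumentation has roughly this shape. Span and attribute names are illustrative, runAgent is a stand-in for our actual handler, and we assume the OpenTelemetry SDK and an exporter are configured at startup.

import io.opentelemetry.api.GlobalOpenTelemetry

// Assumes an OpenTelemetry SDK with an OTLP/Jaeger exporter was registered at startup
private val tracer = GlobalOpenTelemetry.getTracer("aimatrix.agents")

fun handleAgentRequest(agentId: String, prompt: String): String {
    val span = tracer.spanBuilder("agent.handle_request")
        .setAttribute("agent.id", agentId)
        .startSpan()
    val scope = span.makeCurrent() // downstream calls pick this span up as their parent
    return try {
        runAgent(agentId, prompt)
    } catch (e: Exception) {
        span.recordException(e)
        throw e
    } finally {
        scope.close()
        span.end()
    }
}

fun runAgent(agentId: String, prompt: String): String = "stubbed response for $agentId"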
The good:
- Vendor neutral: Not locked into any specific observability platform
- Language support: SDKs work consistently across our tech stack
- Rich context: Distributed traces show exactly how requests flow through agents
- Community: Strong ecosystem and active development
The frustrating:
- Configuration complexity: Getting all the exporters and samplers right is tricky
- Performance overhead: Tracing everything can impact performance
- Storage costs: Trace data volumes can get expensive quickly
- Query complexity: Finding specific traces in Jaeger can be challenging
Configuration: Typesafe Config (Lightbend Config)
Why we still use it: Despite being Java-focused, it’s the most robust configuration library we’ve found for complex, hierarchical configs.
# Our agent configuration in HOCON format
aimatrix {
  agents {
    default {
      timeout = 30s
      max-retries = 3
      memory {
        type = "vector"
        vector-db {
          url = "http://localhost:6334"
          collection = "agent_memories"
        }
      }
    }

    conversation-agent = ${aimatrix.agents.default} {
      model = "gpt-4"
      max-context-length = 8000
      memory {
        retention-days = 30
      }
    }

    reasoning-agent = ${aimatrix.agents.default} {
      model = "claude-3"
      timeout = 60s # Reasoning takes longer
      memory {
        retention-days = 90 # Keep reasoning chains longer
      }
    }
  }
}
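Loading it is one line, and the inherited values resolve as you'd expect. A rough sketch, assuming the block above lives in application.conf on the classpath:

import com.typesafe.config.ConfigFactory

fun main() {
    val config = ConfigFactory.load() // reads application.conf and resolves ${...} substitutions

    val reasoning = config.getConfig("aimatrix.agents.reasoning-agent")
    println(reasoning.getString("model"))                        // claude-3
    println(reasoning.getDuration("timeout"))                    // PT1M (the 60s override)
    println(reasoning.getInt("memory.retention-days"))           // 90
    println(reasoning.getString("memory.vector-db.collection"))  // agent_memories, inherited from default
}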
Why it’s great:
- Hierarchical configs: Inheritance and overrides work intuitively
- Environment handling: Easy environment-specific configurations
- Type safety: Typed accessors (durations, ints, nested configs) fail fast on malformed values
- HOCON format: More readable than JSON or YAML for complex configs
Downsides:
- Java ecosystem only: No native support for other languages
- Learning curve: HOCON syntax isn’t widely known
- Runtime errors: Some validation happens at runtime, not compile time
API Framework: Ktor
Why we chose it: Lightweight, Kotlin-native, and perfect for building APIs that integrate with our agent systems.
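A stripped-down example of the kind of endpoint we expose (the routes and responses are illustrative; the real handlers call into the agent runtime):

import io.ktor.server.application.call
import io.ktor.server.engine.embeddedServer
import io.ktor.server.netty.Netty
import io.ktor.server.request.receiveText
import io.ktor.server.response.respondText
import io.ktor.server.routing.get
import io.ktor.server.routing.post
import io.ktor.server.routing.routing

fun main() {
    embeddedServer(Netty, port = 8080) {
        routing {
            get("/health") { call.respondText("OK") }

            // Suspending handler: the coroutine can await the agent without blocking a thread
            post("/agents/{id}/messages") {
                val agentId = call.parameters["id"]
                val message = call.receiveText()
                call.respondText("agent $agentId received: $message")
            }
        }
    }.start(wait = true)
}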
What rocks:
- Kotlin-first: Feels natural with our agent code
- Coroutine support: Async handling without callback hell
- Modular: Use only what you need, no bloat
- Testing: Built-in testing support is excellent
Rough edges:
- Ecosystem: Smaller community compared to Spring Boot
- Documentation: Some advanced features need better examples
- Deployment: Requires more setup compared to Spring Boot
- Learning curve: If your team knows Spring, there’s adjustment time
How We Give Back
Our contributions to the ecosystem:
- Documentation improvements: We’ve added examples for Qdrant metadata filtering and Kalasim custom distributions
- Bug reports and fixes: Fixed memory leaks in Testcontainers and performance issues in Ktor
- Benchmark data: Shared performance comparisons for various vector databases
- Example projects: Open-sourced simplified versions of our agent architectures
Our open source releases:
- aimatrix-agent-sdk: Simplified SDK for building AI agents (coming soon)
- simulation-testing-utils: Testcontainers extensions for simulation testing
- otel-agent-instrumentation: OpenTelemetry instrumentation for AI agent systems
The Hidden Costs
Open source isn’t free:
- Learning time: Each tool has a learning curve
- Integration work: Making everything work together takes effort
- Operational overhead: Managing updates, security patches, compatibility
- Support burden: When things break, you’re the support team
But the benefits far outweigh the costs. These tools let a small team build sophisticated AI infrastructure that would have taken much longer to develop from scratch.
What’s Next
We’re evaluating:
- DuckDB for analytics workloads (love the simplicity)
- Grafana Mimir for metrics (Prometheus is struggling with our scale)
- Apache Arrow for data processing between services
- Temporal for workflow orchestration (might replace our custom orchestration)
The open source ecosystem for AI infrastructure is evolving rapidly. Tools that didn’t exist two years ago are now essential parts of our stack.
Building on open source means we can focus on what makes AIMatrix unique - the agent orchestration and reasoning capabilities - while leveraging the community’s work on infrastructure, observability, and data management.
That’s a pretty good deal.