Democratising AI Deployments: Webinar by Daniel Lenton
Since the release of ChatGPT at the end of 2022, enterprises of all shapes and sizes have started to spend a huge amount of money querying Large Language Models (LLMs), and this spend is only growing. Since this monumental moment, the AI landscape has rapidly evolved, and the AI-as-a-Service market has exploded onto the scene, with many companies exposing public endpoints to various proprietary and open-source models. Each of these models differs not only in output quality but also in runtime performance characteristics, such as time-to-first-token (TTFT), inter-token latency (ITL), and resilience to request concurrency. The benchmarking landscape is very messy, and a central source of truth is urgently needed.

In this talk, we explain why benchmarks must be presented across time, and how this unique insight enables us to paint a much clearer picture of the LLM endpoint landscape. Furthermore, we introduce the concept of dynamic routing, and through live demos we demonstrate how it can hugely improve both the speed and cost of querying LLMs, especially given how frequently performance changes across providers throughout any given day. We close by discussing future directions, exploring how far dynamic routing can really be pushed, and how this ties in with an expanding multi-modal endpoint landscape.
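To make the dynamic-routing idea concrete, here is a minimal sketch of one plausible approach: given live benchmark metrics (TTFT, ITL, cost) for each endpoint, pick the one minimising a blended latency-and-cost score for the expected response length. All provider names, metric values, and the scoring formula are invented for illustration; they are not the actual routing logic presented in the talk.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    """A candidate LLM endpoint with its latest benchmark metrics (hypothetical)."""
    provider: str
    ttft_s: float         # time-to-first-token, seconds
    itl_s: float          # inter-token latency, seconds per output token
    cost_per_mtok: float  # USD per million output tokens

def route(endpoints, expected_tokens=500, cost_weight=0.5):
    """Pick the endpoint minimising latency + weighted cost for a request."""
    def score(e):
        latency = e.ttft_s + e.itl_s * expected_tokens       # end-to-end latency estimate
        cost = e.cost_per_mtok * expected_tokens / 1_000_000  # dollar cost of the response
        return latency + cost_weight * cost
    return min(endpoints, key=score)

# Made-up snapshot of live metrics; in practice these would be refreshed
# continuously, since provider performance shifts throughout the day.
endpoints = [
    Endpoint("provider-a", ttft_s=0.40, itl_s=0.012, cost_per_mtok=0.90),
    Endpoint("provider-b", ttft_s=0.25, itl_s=0.020, cost_per_mtok=0.60),
]
best = route(endpoints)
```

Because the metrics feeding `route` change over the day, the chosen provider can change from one request to the next, which is precisely why time-resolved benchmarks matter for routing decisions.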