⚔️ Want to see how to easily compare your LLM app's performance with OpenAI GPT-4o-mini vs Meta Llama 3.1 vs Mistral AI Large 2?
It's been a huge week for model releases, which means more choice in which model powers your LLM applications. But newer isn't always better, and every application is different. It's critical to test model performance on your own pipeline, not just go by the stats threads on Twitter.
Luckily, Arize AI's Phoenix makes it really easy to compare model and prompt changes on your own app. Check out this video to see how:
https://lnkd.in/gH9uAmGv
How do you hire the right model for the right job without changing a single line of code? At Arize:Observe, Facundo Santiago covered the Microsoft Azure AI model catalog.
Watch: https://lnkd.in/g94AtZxm
🚀 Join us at the next #AGIBuildersMeetup in San Francisco! Connect with #AI builders and enthusiasts for an evening of inspiring tech talks and networking. Limited capacity, so make sure to register asap!
🗓️ Date: July 30th
📍 Location: San Francisco, California
🕒 Time: 5:30 PM - 8:00 PM, PDT
Here is what you can expect:
✅ Rogue Agents: Stop AI from Misusing APIs (Twilio)
✅ Choosing Your Champion: LLM Inference Backend Benchmarks (BentoML)
✅ Evaluating Agents and Assistants (Arize AI)
See the comments for the link to register ⬇️
Llama 3.1 was just released, so we tapped Chris Park and Aman Khan to discuss live. Join us next week as we dive into the new Llama herd (hopefully they don't bite).
Will the latest Llama family of models ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation?
We'll take a closer look at what they did here, decide if we should believe the hype around Meta's "most capable model to date," and talk about the future of open source.
Join us here: https://lnkd.in/dmEY6C8F
Bazaarvoice, a top platform for user-generated content (UGC) and social commerce, has leveraged AI for much of its history — and now has a pioneering LLM app in production.
Lou Kratz, Principal Research Engineer at Bazaarvoice, leads those efforts from a technical perspective. “The biggest impact AI has at Bazaarvoice is around ensuring the content that we provide our clients — which are generated by users — is of high quality,” he recently said. In addition to leveraging other AI systems, “we used generative AI recently to release what we call a content coach that guides consumers through the process of writing a good review.”
Kratz sees two challenges that you may not want to neglect when getting an LLM app into production:
📊 ❌ Data quality for RAG: “You look at something like retrieval augmented generation — it’s really powerful, it can really make things explainable and usable to the general public — but it's only as good as the data we give it. When it comes to business-specific data, the first challenge is getting that cleaned up.”
👩🏻‍🏫 📗 Education: “Almost all of our data scientists and engineers have become mentors…in order to help people understand the specifics about how AI works and if it solves their use case.”
⭐ Link to his full Arize:Observe talk on creating evals from scratch in the comments.
How do you efficiently fine-tune and serve OSS LLMs using open source packages? Arnav Garg, Senior MLE at Predibase, presented on this at Observe and the recording is now available.
He covers core fine-tuning and inference techniques in Ludwig and LoRAX that enable faster, cheaper, smaller, and better models compared to closed-source providers.
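The key idea behind adapter-based fine-tuning (LoRA, which LoRAX builds on) can be sketched in a few lines of NumPy. This is a minimal illustration of why it makes training cheaper and smaller, with made-up dimensions — not code from the talk or from Ludwig/LoRAX:

```python
import numpy as np

# A frozen pretrained weight matrix (illustrative size).
d_out, d_in = 512, 512
W = np.random.randn(d_out, d_in)

# LoRA: instead of updating W, learn a low-rank update B @ A.
rank = 8
A = np.random.randn(rank, d_in) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))             # trainable, zero init

def forward(x):
    # Base output plus the low-rank adapter's contribution.
    return W @ x + B @ (A @ x)

# At init, B is zero, so the adapter leaves the base model unchanged.
x = np.random.randn(d_in)
assert np.allclose(forward(x), W @ x)

# Trainable parameters shrink from d_out*d_in to rank*(d_in + d_out).
full = d_out * d_in            # 262,144
lora = rank * (d_in + d_out)   # 8,192
print(f"LoRA trains {lora / full:.1%} of the full matrix's parameters")
```

Because the base weights stay frozen, a server like LoRAX can keep one copy of `W` in memory and swap many tiny `(A, B)` adapters on top of it.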
OSS FTW: https://lnkd.in/gBcTZj73
NATO has an LLM app in production with a big goal.
“At NATO, we have hundreds and hundreds of thousands of documents and it’s impossible for a human to make sense of all this information. We’re trying to use the power of LLMs to get the right information and the right answers to people and give them certainty that the answers that they are getting are correct and can be traced — and help them then use it for other things like writing reports, building presentations, creating new policies, comparing (doing semantic similarity, also sentiment analysis) so we can really use the power of generative AI to be more productive and faster,” Arnau Pons of NATO Allied Command Transformation (ACT) told us at Arize:Observe.
Lessons learned? As with anything new, achieving results requires being deliberate. “The whole community has tried to rush to make a demo — and that’s great, that impresses everybody at first sight — but when you get to production…that’s when expectations are above what the actual outputs are," he continued. "Falling short there means that we need to change gears and be a lot more deliberate in making sure that we increase the accuracy and value and bring the users along.”
👨‍🏫 Link to the full talk in the comments.
Bridging LLMs with APIs presents a significant challenge. At Arize:Observe, Shishir Patil presented Gorilla LLM, which surpasses all open-source LLMs at writing API calls.
Watch the full talk here: https://lnkd.in/gmADVG3S
At Arize:Observe this year, Joe Palermo talked about custom models research at OpenAI, where they leverage large-scale fine-tuning to create domain-specific models that tackle complex problems.
"The goal with custom models is to use our research stack to do large scale fine tuning on behalf of customers in order to produce models that are radically better in specific domains..."
Watch the recording: https://lnkd.in/g5n7F6AF
🛝 Introducing a reimagined prompt playground 🛝
A prompt playground offers a UI for experimenting with prompt templates, input variables, models, and LLM parameters.
This demo explores Arize’s new prompt playground, covering prompt optimization (including leveraging Arize Copilot) and running optimized prompts on datasets.
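Under the hood, the experiment loop a playground automates looks roughly like the sketch below. The template, variable sets, and `call_llm` stub are hypothetical illustrations, not Arize's API:

```python
from itertools import product

# Hypothetical prompt template with input variables.
template = "Summarize the following {doc_type} in {style} style:\n{text}"

# The dimensions a playground lets you sweep: variables and LLM parameters.
variable_sets = [
    {"doc_type": "support ticket", "style": "bullet-point", "text": "..."},
    {"doc_type": "support ticket", "style": "narrative", "text": "..."},
]
params_grid = [{"temperature": 0.0}, {"temperature": 0.7}]

def call_llm(prompt, **params):
    # Stand-in for a real model call (GPT-4o-mini, Llama 3.1, Mistral, ...).
    return f"[completion of {len(prompt)}-char prompt with {params}]"

# Run every template-variables-parameters combination and record the outputs.
runs = []
for variables, params in product(variable_sets, params_grid):
    prompt = template.format(**variables)
    runs.append({"prompt": prompt, "params": params,
                 "output": call_llm(prompt, **params)})

print(f"{len(runs)} prompt/parameter combinations executed")
```

A playground adds the UI, tracing, and dataset plumbing on top of this loop so you can compare the recorded outputs side by side.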