Introducing RouteLLM: a routing framework developed by LMSYS (UC Berkeley) in collaboration with Anyscale. RouteLLM optimizes query handling by dynamically selecting between high-performance proprietary LLMs and cost-effective open-source models, cutting costs by more than 2x without sacrificing quality. Using human preference data, augmented with an LLM-as-a-judge, our routers evaluate query complexity to choose the appropriate model. Rigorous testing on benchmarks like MMLU and GSM8K confirms cost-efficient, high-quality performance. Explore our open-source code, models, and preference data on GitHub: https://lnkd.in/gJyijZRx, and try our online demo: https://lnkd.in/gME-GC2a
Learn more about RouteLLM and how it can transform your LLM applications:
Read our blog here: https://lnkd.in/gfPD-u-y
LMSYS blog here: https://lnkd.in/ga_MgERE
Full research paper here: https://lnkd.in/gqRy7Pjy
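For a sense of how routing slots into an existing application, here is a minimal sketch based on the RouteLLM README. The `Controller` class, the "mf" (matrix factorization) router name, the threshold in the model string, and the model identifiers are assumptions taken from the repo at the time of writing; check them against the current docs.

```python
# A minimal sketch of routing with RouteLLM, following the project's README.
# Assumptions to verify against the repo: the Controller import path, the
# "mf" router name, and the exact model identifiers.
import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-..."  # elided; set your own key

# The controller exposes an OpenAI-compatible client that routes each
# query to the strong or weak model based on predicted query complexity.
client = Controller(
    routers=["mf"],  # matrix-factorization router trained on preference data
    strong_model="gpt-4-1106-preview",
    weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

# The model string encodes the router and its cost/quality threshold:
# lower thresholds send more queries to the cheaper weak model.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```

A simple factual query like this one should land on the weak model; the threshold controls how aggressively the router prefers it.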
Cyber Security Professional | Pentester | CEH-V12 in Prep📚 | Network | Nmap | Python | Linux | Security Frameworks | Ensuring Digital Safety | Polyglot
Performed port scanning using the sx tool in a lab. The sx tool is a command-line network scanner that can perform ARP scans, ICMP scans, TCP SYN scans, UDP scans, and application scans such as SOCKS5 scans, Docker scans, and Elasticsearch scans.
Command: sx <options>
Options:
-> arp: performs an ARP scan.
-> docker: performs a Docker scan.
-> elastic: performs an Elasticsearch scan.
-> icmp: performs an ICMP scan.
-> udp: performs a UDP scan.
-> tcp: performs a TCP scan.
-> --json: prints scan results in JSON format.
-> tee: a standard shell utility (not part of sx) that copies its stdin to both stdout and a file; useful for saving scan output, e.g. sx arp <subnet> | tee arp.cache.
-> -p: specifies the range of ports to be scanned.
-> --help: lists the available commands and flags.
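If you want to post-process scan results rather than read them off the terminal, the --json flag pairs nicely with a small script. Below is a hedged Python sketch: the subnet is a placeholder for a lab network, and it assumes sx is on PATH and emits one JSON object per line with --json (verify against your sx version; ARP scans typically also need root privileges).

```python
# A minimal sketch of driving sx from Python and capturing its JSON output.
# Assumptions: sx is installed and on PATH, ARP scans are run with sufficient
# privileges, and --json emits one JSON object per responding host per line.
# The subnet 192.168.1.0/24 is a placeholder for your lab network.
import json
import subprocess

proc = subprocess.run(
    ["sx", "arp", "192.168.1.0/24", "--json"],
    capture_output=True, text=True, check=True,
)

# Parse each line as a separate JSON record and print the discovered hosts.
for line in proc.stdout.splitlines():
    host = json.loads(line)
    print(host)
```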
Computer Vision Engineer @Ultralytics | Solving Real-World Challenges🔎| Python | Published Research | Open Source Contributor | GitHub 🌟 | Daily Computer Vision LinkedIn Content 🚀 | Technical Writer VisionAI @Medium📝
🧠 MemGPT: OS-inspired LLMs that manage their own memory 🧠 🔗 Project link - https://lnkd.in/dsAJnBgx 🔥 MemGPT gives LLMs virtually unlimited context, drawing inspiration from the hierarchical memory systems of traditional operating systems. ✅ Has virtually unlimited memory! ✅ Can easily connect to external data sources ✅ Comes with LanceDB (YC W22) support by default, providing scalable semantic search via archival storage ✅ Supports many LLMs out of the box and can be plugged into a custom LLM server 😍 You can simply dump all of your data into it and ask it to look for information in its archival memory :) See this simple experiment where Ayush Chaurasia ingested the MemGPT docs and asked questions about them. Drop a 🌟 https://lnkd.in/dsAJnBgx #llms #gpt4
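To make the OS analogy concrete, here is a toy Python sketch of the paging idea: a bounded "main context" evicts its oldest items to an unbounded archival store, which can later be searched. This is a conceptual illustration only; ArchivalStore and MainContext are hypothetical stand-ins, not MemGPT's actual API.

```python
# Conceptual sketch of the MemGPT idea: page information between a small
# in-context window ("RAM") and a large archival store ("disk").
# Toy illustration only -- these classes are hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class ArchivalStore:
    """Unbounded external memory (MemGPT backs this with a vector DB like LanceDB)."""
    entries: list[str] = field(default_factory=list)

    def insert(self, text: str) -> None:
        self.entries.append(text)

    def search(self, query: str, k: int = 3) -> list[str]:
        # Stand-in for semantic search: a naive keyword match.
        return [e for e in self.entries if query.lower() in e.lower()][:k]

@dataclass
class MainContext:
    """Bounded 'RAM': what actually fits in the LLM's prompt."""
    max_items: int = 5
    items: list[str] = field(default_factory=list)

    def add(self, text: str, archive: ArchivalStore) -> None:
        self.items.append(text)
        # On overflow, evict the oldest item to archival storage --
        # the paging step the OS analogy refers to.
        while len(self.items) > self.max_items:
            archive.insert(self.items.pop(0))

archive = ArchivalStore()
ctx = MainContext(max_items=2)
for note in ["User likes OCaml", "Project deadline is Friday", "User's dog is Rex"]:
    ctx.add(note, archive)

print(ctx.items)                # recent items still in context
print(archive.search("OCaml"))  # evicted item recovered from archival memory
```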
📣 Don't miss out on the chance to learn more about LLMs with DeepLearning.AI's new free course. I am so thankful for this valuable opportunity. 🙏 #Deeplearning_AI #LLM #LLMOptimization #LLMOps
Learn how to build an optimized LLM inference system from the ground up in our new short course, Efficiently Serving LLMs, built in collaboration with Predibase and taught by Travis Addair. Whether you're serving your own LLM or using a model hosting service, this course will give you a deep understanding of the optimizations required to efficiently serve many users at once.
- Learn how LLMs generate text one token at a time, and how techniques like KV caching, continuous batching, and quantization speed things up and optimize memory usage for serving multiple users (see the sketch after this list).
- Benchmark the performance of these LLM optimizations to explore the trade-offs between quickly responding to an individual user’s request vs. serving many users at once.
- Use techniques like low-rank adaptation (LoRA) to efficiently serve hundreds of unique, custom fine-tuned models on a single device, without sacrificing throughput.
- Use Predibase's LoRAX framework to see optimization techniques in action on a real LLM server.
Sign up here: https://lnkd.in/db5MC88S
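As a taste of the first bullet, here is a minimal sketch of greedy, token-at-a-time decoding with a KV cache, assuming the Hugging Face transformers library and gpt2 as a stand-in model; it illustrates the technique, not the course's own code.

```python
# Minimal sketch of token-by-token generation with a KV cache, one of the
# optimizations the course covers. Assumes the Hugging Face `transformers`
# library; "gpt2" is a small stand-in model for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Efficient serving means", return_tensors="pt").input_ids
past_key_values = None  # the KV cache: keys/values of all previous tokens

with torch.no_grad():
    for _ in range(20):
        # With a cache, each step feeds only the newest token; attention
        # over earlier tokens reuses the cached keys/values instead of
        # recomputing them, which is what makes decoding fast.
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```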
Discover the realm of tech jargon through the #TechDogs tech dictionary. Understand the word of the day with its simplified definition. Word of the day - Cache on a STick (COASt) https://bit.ly/3tTcTnb
Just finished this course. If you are deploying AI to production or managing teams doing so, I highly recommend this short course to gain a better understanding of what's happening under the hood.
Registration for the course is open. If you want to learn how to serve LLMs efficiently, what the challenges are, and how to solve them, this course is my recommendation.
AI/ML Consultant with a decade of Data Solutions Expertise | Data Governance Advocate | Lead Solutions Architect | Principal Data Consultant | Educator in Cutting-edge AI Technologies | AWS & Azure A.I. Certified | CISM
Unlock the secrets to efficiently serving LLMs at scale with expert insights and real-world optimizations in our collaboration with Predibase - dive in now!
20240320 Andrew Ng: Learn how to build an optimized LLM inference system from the ground up in our new short course, Efficiently Serving LLMs, built in collaboration with Predibase and taught by Travis Addair. Sign up here: https://lnkd.in/db5MC88S #llm #datascience
Learn how to build an LLM inference system.
Machine Learning Engineer @ Georgian | Founder & Convener, Sushiksha
Wonder how the results compare to GPT-4o, both with regard to cost and performance? Also, are there hidden costs, such as hosting the causal LLM?