Private Event

LLM EVALUATION AND QUALITY CONTROL

 
 
Google Meet
About Event

We will be bringing together some of the founders, developers, and executives who are building in the LLM space!


CONTEXT

Evaluating the output of large language models is crucial, and it is subtler than it may first appear. These models generate text based on statistical patterns rather than true understanding or knowledge, so assessing their output requires careful attention to factors such as coherence, relevance, and factual accuracy.

Quality control of the behavior of products built using large language models is another important aspect to consider. These models are often used for various use cases, such as chatbots, virtual assistants, and content generation tools. Ensuring that the behavior of these products aligns with product requirements, human preferences, and ethical and legal standards is crucial. It is necessary to implement robust mechanisms for filtering and moderating the generated content to maintain the integrity and trustworthiness of the products and their associated businesses.
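
As a rough, hypothetical illustration of what such a filtering and quality-control gate might look like in code (the blocklist, length limit, and relevance heuristic below are placeholders rather than any particular product's rules), a response check can run before anything is shown to a user:

import re

# Hypothetical quality-control gate for LLM output; all thresholds and rules
# here are illustrative placeholders, not a recommended production policy.
BLOCKLIST = ["social security number", "credit card number"]
MAX_CHARS = 2000


def violates_policy(text: str) -> bool:
    """Naive moderation check: flag any blocklisted phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def passes_quality_gate(answer: str, question: str) -> bool:
    """Reject answers that are empty, too long, off-policy, or unrelated to the question."""
    if not answer.strip():
        return False
    if len(answer) > MAX_CHARS:
        return False
    if violates_policy(answer):
        return False
    # Crude relevance heuristic: the answer should share at least one content word
    # with the question; real systems would use an LLM judge or a trained classifier.
    question_words = set(re.findall(r"\w{4,}", question.lower()))
    answer_words = set(re.findall(r"\w{4,}", answer.lower()))
    return bool(question_words & answer_words)


print(passes_quality_gate("Paris is the capital of France.", "What is the capital of France?"))  # True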


FORMAT

We will have 8 sessions throughout the day. Each 50(ish) minute session will consist of a 15-20 minute presentation followed by at least 30 minutes of discussion about the topic, driven by questions from the audience.


JOIN THE COMMUNITY

If you want to speak about any of these topics or similar ones, reach out to amir@ai.science

JOIN THE CONVERSATION ON OUR SLACK

Want to catch up on the content of previous workshops?

See WORKSHOP MATERIAL


SCHEDULE

FRIDAY, DECEMBER 8th (times are in ET):

9:00 Dr. Daniel Rock (Assistant Professor @ The Wharton School, University of Pennsylvania) GPTs are GPTs!!!

10:00 Dr. Andrew McMahon (Head of MLOps at NatWest Group) Intro to LLMOps

11:00 Percy Chen (PhD Student @ McGill U. / R&D Engineer @ Aggregate Intellect) Sherpa - Open Source Project Update [INCLUDES DEMO]

12:00 Suhas Pai (CTO @ Hudson Labs, formerly Bedrock AI) Break! Machine Learning Trivia and Networking

13:00 Meg Risdal (Sr. Product Manager @ Google / Kaggle) Empirical Rigor in ML

14:00 Dr. Val Andrei Fajardo (Founding ML Engineer @ LlamaIndex) Evaluating Multi-Modal RAG Systems [TUTORIAL]

15:00 Dr. Bénédicte Pierrejean (Sr. ML Scientist @ Ada) Automatic Evaluation of Dialogue Systems

16:00 Benjamin Labaschin (MLE @ Workhelix) Normie Tools for Validating LLM Outputs [INCLUDES DEMO]


SPEAKERS


Daniel Rock

Daniel Rock is an Assistant Professor of Operations, Information, and Decisions at the Wharton School of the University of Pennsylvania. His research is on the economic effects of digital technologies, with a particular emphasis on the economics of artificial intelligence. He has recently worked on studies addressing the types of occupations that are most exposed to machine learning, measuring the value of AI skillsets to employer firms, and adjusting productivity measurement to include investments in intangible assets. His research has been published in various academic journals and featured in outlets such as The New York Times, Wall Street Journal, Bloomberg, Harvard Business Review, and Sloan Management Review. Daniel received his B.S. from the Wharton School of the University of Pennsylvania, and his M.S. and Ph.D. from the Massachusetts Institute of Technology.

Generative Pre-trained Transformers are General Purpose Technologies - Using a new rubric-based approach on tasks from the O*NET database, we quantify the labor market impact potential of LLMs. Both human annotators and GPT-4 assess tasks based on their alignment with LLM capabilities and the capabilities of complementary software that may be built on top of these models with our rubric. Our findings reveal that between 61 and 86 percent of workers (for LLMs alone versus LLMs fully integrated with additional software) have at least 10 percent of their tasks exposed to LLMs. Additional software systems have the potential to increase the percentage of the U.S. workforce that has at least 10 percent of their work tasks exposed to the capabilities of LLMs by nearly 25 percent. We find that LLM impact potential is pervasive, LLMs improve over time, and complementary investments will be necessary to unlock their full potential. This suggests LLMs are general-purpose technologies. As such, LLMs could have considerable economic, societal, and policy implications, and their overall impacts are likely to be significantly amplified by complementary software.


Andrew McMahon

Andrew (Andy) McMahon is a machine learning engineer and data leader with a passion for delivering valuable solutions that are robust, reliable and scalable. As Head of MLOps at NatWest Group, he is responsible for driving operational best practice for AI and ML products and services across the bank and runs the internal MLOps Centre of Excellence. Andy has delivered high-value ML solutions across multiple industries and is a multi-award winning data practitioner and leader. He is also the author of the popular technical book, Machine Learning Engineering with Python, which is a practical guide to building real solutions using the latest ML engineering and MLOps best practices.

An Introduction to LLMOps - The explosion of interest in LLMs over the past year has meant that more organizations and teams are questioning what it will take to run solutions using these models in production. In this talk, Andy will bring some of his thoughts and ideas around what the new marriage between MLOps and LLMs (LLMOps) means and where the challenges lie. This talk is useful for data scientists, ML and MLOps engineers as well as business stakeholders who want to get an idea of where this new field is and where it is going.


Bénédicte Pierrejean

Bénédicte Pierrejean is a Senior ML Scientist in the Applied Machine Learning team at Ada. She has a PhD in Natural Language Processing and is passionate about improving customers' experiences using ML.

Automatic Evaluation of Dialogue Systems - LLMs are making a huge impact in a variety of fields and they are especially powerful for customer service. They make it easier to move away from structured dialogue flows and provide users with natural conversations. However, using LLMs for dialogue systems is not without challenges. One such challenge is evaluation. How do we ensure that the model we use produces content that is safe, accurate and relevant? How do we make sure we are driving conversations towards resolution?

This talk will focus on how we use LLMs for evaluation by simulating conversations between our production dialogue systems and users. We will see how using this approach we can reproduce realistic testing conditions and be able to quickly assess the impact of any new change to our production pipelines.
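
As a minimal, hypothetical sketch of this idea (not Ada's actual pipeline; it assumes an OpenAI-style chat API, and the model name, persona, and rubric are placeholders), one model can play the user, the bot under test can reply, and a judge model can grade the finished transcript:

from openai import OpenAI

# Hypothetical sketch of conversation simulation for dialogue evaluation.
# Assumes OPENAI_API_KEY is set; model name, persona, and rubric are placeholders.
client = OpenAI()


def llm(system: str, messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system}] + messages,
    )
    return response.choices[0].message.content


def production_bot(history: list) -> str:
    # Stand-in for the real production dialogue system under test.
    return llm("You are a customer support assistant for an airline.", history)


def simulated_user(persona: str, history: list) -> str:
    # Flip roles so the simulator sees the bot's turns as incoming "user" messages.
    flipped = [{"role": "user" if m["role"] == "assistant" else "assistant",
                "content": m["content"]} for m in history]
    return llm(f"You are a customer. {persona} Reply with one short message.", flipped)


history = []
persona = "You want to change a flight that was booked for the wrong date."
for _ in range(3):  # three user/bot exchanges
    history.append({"role": "user", "content": simulated_user(persona, history)})
    history.append({"role": "assistant", "content": production_bot(history)})

transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
verdict = llm(
    "You are an evaluator. Answer PASS or FAIL with one sentence of justification.",
    [{"role": "user", "content": "Was this conversation safe, accurate, and driven "
                                 "toward resolution?\n\n" + transcript}],
)
print(verdict)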


Benjamin Labaschin

Ben Labaschin is Principal MLE and a founding engineer at Workhelix. He has worked as a data scientist and economist at various companies over the past decade.

Normie Tools for Validating LLM Outputs - As Principal MLE at a generative AI startup, I use LLMs a lot. Whether it be OpenAI or Llama 2, whatever the generative process is, the same problem persists: how can I trust that the output generated for me has the proper structure and content? It may surprise you to learn that, more often than not, I reach for the standard tools of the trade to accomplish 90% of my needs. In this talk I will discuss these tools and how you can use them to overcome the problem of stochastic generative output.
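
The abstract does not name the specific tools, but as one plausible sketch of this "standard tools" approach (the schema, retry count, and stub model below are hypothetical), output can be validated against an explicit schema with Pydantic and regenerated when it fails:

import json
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError  # assumes Pydantic v2


# Hypothetical example of validating LLM output with ordinary tooling;
# the schema and retry policy are placeholders, not the speaker's actual setup.
class ProductReview(BaseModel):
    product: str
    sentiment: str  # e.g. "positive" / "negative" / "neutral"
    score: int      # 1-5


def parse_llm_output(raw: str) -> Optional[ProductReview]:
    """Return a validated object, or None if the output is malformed."""
    try:
        return ProductReview.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None


def get_review(prompt: str, generate: Callable[[str], str], max_retries: int = 3) -> ProductReview:
    """`generate` is any callable mapping a prompt to model text (OpenAI, Llama 2, ...)."""
    for _ in range(max_retries):
        parsed = parse_llm_output(generate(prompt))
        if parsed is not None:
            return parsed
    raise RuntimeError("Model never produced valid JSON")


def fake_model(prompt: str) -> str:
    # Stub so the snippet runs without an API key.
    return '{"product": "headphones", "sentiment": "positive", "score": 5}'


print(get_review("Summarize this review as JSON ...", fake_model))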


Meg Risdal

Meg is a lead Product Manager at Kaggle based in Toronto. Her academic background is in linguistics where she studied sociophonetics and language variation. She has Master's degrees from UCLA and North Carolina State University.

Empirical rigor in ML as a massively parallelizable challenge - The foundation for the remarkable advances in the field of machine learning is a rigorous process of peer review. New results are subject to significant scrutiny from expert reviewers, compared against a wide range of relevant benchmarks, and independently reproduced by other researchers as part of the overall validation cycle of a mature scientific discipline. In this talk, we examine some of the structural issues that may be causing parts of this process to break down in our field, and note the ways in which the rapid expansion and growth of the field may be exacerbating these problems. We then show how a transparent, community-driven approach can help address many of these fundamental, structural issues. From a broad perspective, it turns out that machine learning competitions and other open challenges have exactly the characteristics that allow us to address the bottlenecks and constraints identified above in a massively parallel way, where the parallelization happens across the open community. We argue that such competitions and challenges are therefore more important now than ever before, and show how they can improve the quality, trustworthiness, and rigor of results while simultaneously increasing the pace of progress for the field.


Percy Chen

Percy is a PhD student at McGill University. His research interests include model-driven software engineering, trustworthy AI, verification for ML systems, Graph Neural Networks, and Large Language Models.

Percy will present an update on Aggregate Intellect's open-source "thinking companion" Sherpa!

In this session, we'll explore the unique challenges and methods of testing Large Language Model (LLM)-based systems, contrasting them with traditional software testing approaches. Using our open-source Sherpa project as a case study, we'll demonstrate practical implementations of these testing strategies, highlighting how they ensure the robustness and reliability of the components that interact with LLMs. The session will also include the latest updates to the Sherpa project, ending with some open questions and challenges we will tackle next.
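
As a rough, hypothetical illustration of that contrast (this is not Sherpa's actual code or test suite), the deterministic glue around a model call can be unit-tested with the model mocked out, while real model output is checked with looser, property-style assertions:

from unittest.mock import MagicMock


class CitationExtractor:
    """Hypothetical component: asks an LLM for sources and normalizes its answer."""

    def __init__(self, llm):
        self.llm = llm  # any object with a .complete(prompt) -> str method

    def extract(self, question: str) -> list:
        raw = self.llm.complete(f"List the sources relevant to: {question}")
        return sorted({line.strip("- ").strip() for line in raw.splitlines() if line.strip()})


def test_extract_normalizes_model_output():
    # Classic deterministic unit test: the LLM is mocked, so the glue code's
    # de-duplication and sorting behaviour can be asserted exactly.
    fake_llm = MagicMock()
    fake_llm.complete.return_value = "- docs/setup.md\n- docs/setup.md\n-  README.md"
    assert CitationExtractor(fake_llm).extract("how do I install?") == ["README.md", "docs/setup.md"]


def test_real_model_output_has_expected_properties(real_llm=None):
    # With a live model the output is non-deterministic, so assert properties
    # (non-empty, right types) rather than exact strings.
    if real_llm is None:
        return  # effectively skipped unless a real client is supplied
    sources = CitationExtractor(real_llm).extract("how do I install?")
    assert sources and all(isinstance(s, str) for s in sources)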


Suhas Pai

Suhas is the CTO & Co-founder of Hudson Labs (formerly Bedrock AI), an NLP startup operating in the financial domain, where he conducts research on LLMs, domain adaptation, text ranking, and more. He was co-chair of the Privacy WG at BigScience, chair of the TMLS 2022 and TMLS NLP 2022 conferences, and is currently writing a book on Large Language Models.

Suhas will host the "ML Trivia" session where he will teach us all about ML stuff we should know already!

Participants will be assigned to groups of 3-5 people. Each group will have 10 minutes to get to know each other and brainstorm answers to the questions together before submitting answers individually. Groups will be scrambled two more times to give you an opportunity to meet more people and brainstorm about LLMs with them. After three rounds of collaborative trivia (about 5 questions per round), Suhas will go through the questions and explain the nuances and important lessons behind each of them. The top 3 individuals will be given perpetual bragging rights about their LLM skills.


Val Andrei Fajardo

Andrei is a Founding Software/Machine Learning Engineer at LlamaIndex. He has nearly 10 years of industry experience building Machine Learning systems and has a PhD in Statistics from the University of Waterloo.

Evaluating Multi-Modal RAG Systems - In this workshop, we briefly go over Multi-Modal RAG systems and how they can be evaluated with both traditional and modern (i.e., LLM-based) techniques. We'll work through the material interactively in a live tutorial on using the LlamaIndex library to ingest text documents and images into a vector store, build a Multi-Modal RAG engine, and finally evaluate it against benchmarks.
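
As a library-agnostic sketch of the evaluation step only (the tutorial itself uses LlamaIndex, whose evaluator classes wrap the same idea; the model name, prompts, and example data below are placeholders), an LLM judge can score a multi-modal RAG response for relevancy and faithfulness, with image evidence passed in as captions or descriptions:

from openai import OpenAI

# Library-agnostic sketch of LLM-based RAG evaluation; assumes OPENAI_API_KEY is set
# and uses a placeholder model name. This is not the LlamaIndex API itself.
client = OpenAI()


def judge(metric_prompt: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": metric_prompt + "\nAnswer YES or NO."}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")


def evaluate_rag_response(query: str, contexts: list, answer: str) -> dict:
    """Score one multi-modal RAG response; image evidence arrives as captions/descriptions."""
    context_block = "\n---\n".join(contexts)
    relevancy = judge(
        f"Query: {query}\nRetrieved context (text and image descriptions):\n{context_block}\n"
        "Is the retrieved context relevant to the query?"
    )
    faithfulness = judge(
        f"Context:\n{context_block}\nAnswer: {answer}\n"
        "Is every claim in the answer supported by the context?"
    )
    return {"relevancy": relevancy, "faithfulness": faithfulness}


print(evaluate_rag_response(
    query="What does the chart on page 3 show about Q4 revenue?",
    contexts=["Image caption: bar chart of quarterly revenue, with Q4 highest at $2.1M."],
    answer="Q4 revenue was the highest of the year at $2.1M.",
))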