Presented by
Open Source for AI
Providing all developers the resources to understand, use, and contribute to the development and direction of AI
Hosted By
178 Went

Real-world Data Prep for LLMs: Challenges and Solutions

YouTube
Past Event
About Event

Building LLM applications? One of the top problems you'll face is presenting the LLM with good input data, because good LLM responses depend on it. The clean, native-text PDFs used in explainer articles and example code are rarely what you'll encounter in production. Real-world data is wild, to say the least!


Here are some challenges you'll face:
- Scanned PDFs
- Scans with non-standard orientations
- PDF forms with checkboxes and radio buttons
- Handwritten forms
- Documents photographed with a smartphone
- Complex tables
- Tables that span pages

In this practical workshop, we'll compare the libraries and techniques at our disposal, looking at their strengths and limitations. The aim is to arm you with the knowledge to extract raw text from real-world documents and send that text to Large Language Models, which can then structure the data for easy downstream processing.

Your speaker, Shuveb Hussain, is the co-founder and CEO of Unstract, an open-source startup building an LLM-powered platform that extracts data from unstructured documents, helping automate critical business processes. Unstract currently extracts and structures millions of pages of real-world data every month. The company offers two products: LLMWhisperer, a raw text extraction API, and Unstract, an LLM-powered data structuring platform.
