LLM Data Prep Workshop: Dealing with real-world documents

Unstract

Chennai, Tamil Nadu

Past Event

About Event

Building LLM applications? One of the biggest problems you’ll face is giving the LLM good input data, because good responses depend on it.
The clean, native-text PDFs used in explainer articles and example code are rarely what you’ll encounter in production. Real-world data is wild, to say the least!

Here are some challenges you’ll face:
- Scanned PDFs
- Scans with non-standard orientations
- PDF forms with checkboxes and radio buttons
- Handwritten forms
- Documents photographed with a smartphone
- Complex tables
- Tables that span pages

What will you be learning?

In this practical workshop, we’ll compare the libraries and techniques at our disposal, looking at their strengths and limitations.

The workshop aims to equip you to extract raw text from real-world documents and send it to Large Language Models, which can then structure the data for easy downstream processing.
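
As a rough illustration of that flow, here is a minimal Python sketch, not the workshop's own code and not the API of any particular product. It assumes raw text has already been extracted from a page by some library. The function names, the `min_chars` threshold, and the prompt wording are all hypothetical choices made for this example.

```python
import json


def looks_like_scanned(page_text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a page whose text layer is nearly empty
    is probably a scan and needs OCR before it can go to an LLM.
    The min_chars threshold is an assumption, not a recommended value."""
    return len(page_text.strip()) < min_chars


def build_structuring_prompt(raw_text: str, fields: list[str]) -> str:
    """Assemble a prompt asking an LLM to return the given fields as JSON.
    The wording is illustrative; real prompts need tuning per use case."""
    schema = json.dumps({name: None for name in fields}, indent=2)
    return (
        "Extract the following fields from the document text below.\n"
        "Reply with JSON matching this shape, using null for missing values:\n"
        f"{schema}\n\n"
        "Document text:\n"
        f"{raw_text}"
    )
```

A caller would first check `looks_like_scanned(page_text)` to decide whether OCR is needed, then pass the final raw text and a list of field names to `build_structuring_prompt` and send the result to whichever LLM is in use.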

Who is speaking?

Your speaker, Shuveb Hussain, is the co-founder and CEO of Unstract, an open source startup building an LLM-powered platform that extracts data from unstructured documents, helping automate critical business processes.

Unstract currently extracts and structures millions of pages of real-world data every month. It offers two products: LLMWhisperer, a raw text extraction API, and Unstract, an LLM-powered data structuring platform.
