Implement a Search Engine
Text and Vector Search from Scratch - Alexey Grigorev
Join us for a workshop where we will dive into the fundamentals of building a search engine from scratch. This session will focus on two primary search methodologies: text search and vector search. Designed for enthusiasts and learners interested in understanding the core principles of search engines, this workshop requires no prior experience with search technologies or machine learning.
Workshop Outline:
1. Introduction to Search Engines:
- Overview of how search engines work
- Differences between text search and vector search
2. Basics of Text Search:
- Implementing a simple in-memory text search engine
- Tokenization and text preprocessing
- Inverted index creation for efficient text search
3. Introduction to Embeddings:
- What embeddings are and how they work
- Different types of embeddings (e.g., word2vec, BERT)
- Use cases for embeddings in search
4. Basics of Vector Search:
- Implementing a simple in-memory vector search engine
- Converting text data into vectors using embeddings
- Calculating similarity between vectors (e.g., cosine similarity)
5. Combining Text and Vector Search:
- Strategies for integrating text and vector search
- Practical applications and examples
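To give a feel for the text search part of the outline (tokenization plus an inverted index), here is a minimal sketch in basic Python. It is not the workshop's actual notebook code: the function names `tokenize`, `build_index`, and `search`, the sample documents, and the choice to require every query word to match are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Start from the postings of the first token, then intersect with the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical sample documents, used only for this illustration.
docs = [
    "How to build a search engine",
    "Vector search with embeddings",
    "Cooking pasta at home",
]
index = build_index(docs)
print(search(index, "search engine"))  # prints [0]
```

The index is built once, so a query only touches the postings of its own tokens instead of scanning every document. A fuller engine would usually also rank the matches (for example with TF-IDF) rather than return an unordered set.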
About the workshop:
This hands-on session covers:
- Text Search Implementation: Build a basic in-memory search engine that indexes and searches text data. You will learn how to preprocess text, create an inverted index, and efficiently retrieve results based on keyword queries.
- Vector Search Implementation: Explore how to use embeddings to transform text into vectors and implement a vector search engine. You will understand how to compute similarities between vectors to retrieve relevant results.
- Hands-On Coding: Work through practical coding examples in a Jupyter Notebook using basic Python. This will give you a clear understanding of both text and vector search mechanisms.
- Integration Techniques: Discuss methods to combine both search approaches to enhance search results and improve retrieval accuracy.
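The vector search idea described above (turn text into vectors, then rank by similarity) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the workshop's implementation: `embed` here is a toy term-count vector standing in for a real embedding model such as word2vec or BERT, and the names `build_vocab`, `embed`, `cosine_similarity`, and `vector_search`, as well as the sample documents, are hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(docs):
    """Assign each distinct token a position in the vector."""
    vocab = {}
    for text in docs:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text, vocab):
    """Toy stand-in for an embedding model: a term-count vector."""
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query, docs, doc_vectors, vocab, top_k=2):
    """Return the top_k (document, score) pairs, most similar first."""
    q = embed(query, vocab)
    scored = [(cosine_similarity(q, v), i) for i, v in enumerate(doc_vectors)]
    # Sort by descending score; break ties by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_k]]


# Hypothetical sample documents, used only for this illustration.
docs = [
    "How to build a search engine",
    "Vector search with embeddings",
    "Cooking pasta at home",
]
vocab = build_vocab(docs)
doc_vectors = [embed(d, vocab) for d in docs]
results = vector_search("vector embeddings", docs, doc_vectors, vocab)
print(results[0][0])  # prints Vector search with embeddings
```

Because the toy vectors only count words, this version rewards word overlap and nothing else. Swapping `embed` for a learned embedding model is what lets vector search match texts that are related in meaning but share no words.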
This workshop is an optional bonus module designed to provide additional insights and practical experience in search engine implementation. There are no assignments or prerequisites, making it accessible to anyone interested in expanding their knowledge.
About the speaker:
Alexey is the founder of DataTalks.Club — a community of 54,000+ data enthusiasts.
Alexey has written a few books about machine learning. One of them is Machine Learning Bookcamp, a book for software engineers who want to get into machine learning.
As the creator of MLOps Zoomcamp, he has helped thousands of people get started with MLOps (Machine Learning Operations) through this free, hands-on training program.
DataTalks.Club is the place to talk about data. Join our Slack community!