

Web Crawlers for AI Projects
In this session, we’ll explore how web crawlers help build high-quality datasets for AI and machine learning projects. We’ll begin with the core concepts of web crawling and how it differs from scraping, along with key ethical considerations such as respecting robots.txt and rate limits. We’ll then examine real-world use cases, including data collection for large language models, Retrieval-Augmented Generation (RAG) systems, and sentiment analysis.
Attendees will be introduced to widely used tools such as Scrapy, BeautifulSoup, and Selenium, and will learn how to construct scalable data pipelines, from seeding URLs and parsing web pages to cleaning and storing the resulting content. We’ll also cover strategies for handling dynamic sites, CAPTCHAs, and multilingual content, as well as techniques for deduplicating and filtering crawled data so that it stays relevant and high quality for AI models. You’ll see how crawlers can be optimized for both batch and real-time use cases, and how to design distributed systems that scale using task queues and proxy rotation.
We’ll walk through an end-to-end example in which we build a domain-specific dataset for fine-tuning a language model or powering a knowledge-augmented chatbot. By the end of the session, you’ll understand how to design, implement, and scale a web crawler pipeline tailored to the data needs of your AI project. Practical tips, code templates, and architectural patterns will also be shared to help you get started right away.
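The pipeline stages named in the abstract (seeding URLs, honoring robots.txt and a rate limit, parsing pages, and deduplicating content) can be illustrated with a minimal sketch. This is not the session's own code template. It uses only the Python standard library rather than Scrapy or BeautifulSoup, and the names `crawl`, `PageParser`, `fetch`, and `example-bot` are hypothetical. Fetching is passed in as a function so the sketch runs without network access.

```python
import hashlib
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser


class PageParser(HTMLParser):
    """Collects text chunks and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Naive text extraction: script/style content is not filtered out.
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seeds, fetch, robots_txt, user_agent="example-bot",
          max_pages=100, delay_seconds=0.0):
    """Breadth-first crawl from `seeds`; returns deduplicated page records.

    `fetch(url)` is a caller-supplied function (an assumption of this sketch)
    that returns HTML as a string, or None on failure. `robots_txt` is the
    text of the site's robots.txt file.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    frontier = deque(seeds)      # URLs waiting to be visited
    seen_urls = set(seeds)       # avoids queueing the same URL twice
    seen_hashes = set()          # avoids storing identical content twice
    records = []

    while frontier and len(records) < max_pages:
        url = frontier.popleft()
        if not robots.can_fetch(user_agent, url):
            continue             # respect robots.txt
        time.sleep(delay_seconds)  # crude fixed-delay rate limit
        html = fetch(url)
        if html is None:
            continue

        parser = PageParser()
        parser.feed(html)
        text = " ".join(parser.text_parts)

        # Exact-match deduplication on case-normalized text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            records.append({"url": url, "text": text})

        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith("http") and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)

    return records
```

In a real crawler, `fetch` would wrap an HTTP client, `delay_seconds` would be set per host, and the exact-hash check would likely be replaced by near-duplicate detection. Those choices depend on the project and are not covered by this sketch.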
----
Speaker: Krishnan Ramaswamy
Gen AI Product Development & Principal Architect @ Cisco for AI, ML, and Gen AI-enabled computer networking products & solutions.
In my current role as Gen AI Product Owner and Principal Architect at Cisco, I spearhead the innovation of AI/ML-enabled solutions within the Customer Experience portfolio, concentrating on Data Center and Security domains. Our team's initiatives have led to the identification of transformative Gen AI solutions, significantly enhancing productivity across support teams, customers, and partners.
I take pride in leading the development of advanced AI technologies, such as Conversational AI Search and custom LLM-based applications, which have been instrumental in advancing Cisco's product configuration and deployment design. The strategic integration of next-generation products and services under my guidance has driven meaningful advancements, positioning us at the forefront of the industry's evolution.