Podcast episodes
Season 3
Saurabh Rai - Growing Resume Matcher
Topics:
00:00 Intro - how do you like our new design?
00:52 Greets
01:55 Saurabh's background
03:04 Resume Matcher: 4.5K stars, 800 community members, 1.5K forks
04:11 How did you grow the project?
05:42 Target audience and how to use Resume Matcher
09:00 How did you attract so many contributors?
12:47 Architecture aspects
15:10 Cloud or not
16:12 Challenges in maintaining OS projects
17:56 Developer marketing with Swirl AI Connect
21:13 What you (listener) can help with
22:52 What drives you?
Show notes:
- Resume Matcher: https://github.com/srbhr/Resume-Matcher website: https://resumematcher.fyi/
- Ultimate CV by Martin John Yate: https://www.amazon.com/Ultimate-CV-Cr...
- fastembed: https://github.com/qdrant/fastembed
- Swirl: https://github.com/swirlai/swirl-search
Season 2
Sid Probstein - Creator of SWIRL - Search in siloed data with LLMs
Topics:
00:00 Intro
00:22 Quick demo of SWIRL on the summary transcript of this episode
01:29 Sid’s background
08:50 Enterprise vs Federated search
17:48 How vector search covers for missing folksonomy in enterprise data
26:07 Relevancy from vector search standpoint
31:58 How ChatGPT improves programmer’s productivity
32:57 Demo!
45:23 Google PSE
53:10 Ideal user of SWIRL
57:22 Where SWIRL sits architecturally
1:01:46 How to evolve SWIRL with domain expertise
1:04:59 Reasons to go open source
1:10:54 How SWIRL and Sid interact with ChatGPT
1:23:22 The magical question of WHY
1:27:58 Sid’s announcements to the community
YouTube version: https://www.youtube.com/watch?v=vhQ5LM5pK_Y
Design by Saurabh Rai: https://twitter.com/_srbhr_
Check out his Resume Matcher project: https://www.resumematcher.fyi/
Atita Arora - Search Relevance Consultant - Revolutionizing E-commerce with Vector Search
Topics:
00:00 Intro
02:20 Atita’s path into search engineering
09:00 When it’s time to contribute to open source
12:08 Taking management role vs software development
14:36 Knowing what you like (and coming up with a Solr course)
19:16 Read the source code (and cook)
23:32 Open Bistro Innovations Lab and moving to Germany
26:04 Affinity to Search world and working as a Search Relevance Consultant
28:39 Bringing vector search to Chorus and Querqy
34:09 What Atita learnt from Eric Pugh’s approach to improving Quepid
36:53 Making vector search with Solr & Elasticsearch accessible through tooling and documentation
41:09 Demystifying data embedding for clients (and for Java based search engines)
43:10 Shifting away from generic to domain-specific in search+vector saga
46:06 Hybrid search: where it will be useful to combine keyword with semantic search
50:53 Choosing between new vector DBs and “old” keyword engines
58:35 Women of Search
1:14:03 Important (and friendly) People of Open Source
1:22:38 Reinforcement learning applied to our careers
1:26:57 The magical question of WHY
1:29:26 Announcements
See show notes on YouTube: https://www.youtube.com/watch?v=BVM6TUSfn3E
Connor Shorten - Research Scientist, Weaviate - ChatGPT, LLMs, Form vs Meaning
Topics:
00:00 Intro
01:54 Things Connor learnt in the past year that changed his perception of Vector Search
02:42 Is search becoming conversational?
05:46 Connor asks Dmitry: How Large Language Models will change Search?
08:39 Vector Search Pyramid
09:53 Large models, data, Form vs Meaning and octopus underneath the ocean
13:25 Examples of getting help from ChatGPT and how it compares to web search today
18:32 Classical search engines with URLs for verification vs ChatGPT-style answers
20:15 Hybrid search: keywords + semantic retrieval
23:12 Connor asks Dmitry about his experience with sparse retrieval
28:08 SPLADE vectors
34:10 OOD-DiskANN: handling the out-of-distribution queries, and nuances of sparse vs dense indexing and search
39:54 Ways to debug a query case in dense retrieval (spoiler: it is a challenge!)
44:47 Intricacies of teaching ML models to understand your data and re-vectorization
49:23 Local IDF vs global IDF and how dense search can approach this issue
54:00 Realtime index
59:01 Natural language to SQL
1:04:47 Turning text into a causal DAG
1:10:41 Engineering and Research as two highly intelligent disciplines
1:18:34 Podcast search
1:25:24 Ref2Vec for recommender systems
1:29:48 Announcements
For Show Notes, please check out the YouTube episode below.
This episode on YouTube: https://www.youtube.com/watch?v=2Q-7taLZ374
Podcast design: Saurabh Rai: https://twitter.com/srvbhr
Evgeniya Sukhodolskaya - Data Advocate, Toloka - Data at the core of all the cool ML
Toloka’s support for Academia: grants and educator partnerships
https://toloka.ai/collaboration-with-educators-form
https://toloka.ai/research-grants-form
These are pages leading to them:
https://toloka.ai/academy/education-partnerships
https://toloka.ai/grants
Topics:
00:00 Intro
01:25 Jenny’s path from graduating in ML to a Data Advocate role
07:50 What goes into the labeling process with Toloka
11:27 How to prepare data for labeling and design tasks
16:01 Jenny’s take on why Relevancy needs more data in addition to clicks in Search
18:23 Dmitry plays the Devil’s Advocate for a moment
22:41 Implicit signals vs user behavior and offline A/B testing
26:54 Dmitry goes back to advocating for good search practices
27:42 Flower search as a concrete example of labeling for relevancy
39:12 NDCG, ERR as ranking quality metrics
44:27 Cross-annotator agreement, perfect list for NDCG and Aggregations
47:17 On measuring and ensuring the quality of annotators with honeypots
54:48 Deep-dive into aggregations
59:55 Bias in data, SERP, labeling and A/B tests
1:16:10 Is unbiased data attainable?
1:23:20 Announcements
This episode on YouTube: https://youtu.be/Xsw9vPFqGf4
Podcast design: Saurabh Rai: https://twitter.com/srvbhr