How synthetic data powers expert LLMs

The demand for high-quality data to train large language models (LLMs) is outpacing supply, especially for domain-specific enterprise needs. In this session, research scientists Shivchander Sudalairaj and Hao Wang break down how Red Hat is tackling this challenge with an open source toolkit for synthetic data generation and processing.

SDG-Hub (synthetic data generation hub) is a modular and scalable open source toolkit that enables researchers and practitioners to quickly generate high-quality, domain-specific training data. Learn how SDG-Hub helps customize and fine-tune LLMs for specific industry expertise, using techniques like retrieval-augmented generation (RAG) and data subset selection for efficient training.
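
To get a feel for the "modular blocks composed into flows" idea before watching, here is a minimal, self-contained Python sketch. It deliberately does not call the real SDG-Hub API; the Record, Block, Flow, QuestionGenBlock, and LengthFilterBlock names and the stubbed teacher model are illustrative assumptions only, meant to convey the composition pattern described above and in the linked article.

# Illustrative sketch only: these classes are NOT the SDG-Hub API.
# They mimic the general idea of composing small processing "blocks"
# into a reusable synthetic-data "flow".
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Record:
    """One unit of training data moving through the pipeline."""
    context: str        # e.g., a chunk of a proprietary document
    question: str = ""  # synthetic question to be generated

class Block:
    """Base class: a block transforms a stream of records."""
    def run(self, records: Iterable[Record]) -> Iterable[Record]:
        raise NotImplementedError

class QuestionGenBlock(Block):
    """Asks a teacher LLM (stubbed by the caller) to write a question per context."""
    def __init__(self, generate_fn: Callable[[str], str]):
        self.generate_fn = generate_fn  # assumption: caller supplies the LLM call

    def run(self, records):
        for r in records:
            r.question = self.generate_fn(f"Write one exam question about:\n{r.context}")
            yield r

class LengthFilterBlock(Block):
    """Drops records whose generated question is too short to be useful."""
    def __init__(self, min_chars: int = 20):
        self.min_chars = min_chars

    def run(self, records):
        return (r for r in records if len(r.question) >= self.min_chars)

class Flow:
    """Chains blocks in order, like a declarative pipeline config."""
    def __init__(self, blocks: list[Block]):
        self.blocks = blocks

    def run(self, records: Iterable[Record]) -> list[Record]:
        for block in self.blocks:
            records = block.run(records)
        return list(records)

# Usage with a fake "teacher model" so the sketch runs offline.
fake_teacher = lambda prompt: "What accounting treatment applies to the lease described above?"
flow = Flow([QuestionGenBlock(fake_teacher), LengthFilterBlock(min_chars=20)])
synthetic = flow.run([Record(context="Excerpt from a finance policy document ...")])
print(synthetic[0].question)

In a real pipeline, the stubbed teacher call would be an actual LLM endpoint and the flow would typically come from a declarative configuration rather than hand-written classes.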

00:36 Why synthetic data and the SDG-Hub toolkit
01:31 Why we are running out of high-quality data for LLMs
02:08 The gap in proprietary data for custom LLMs
03:51 Synthetic data in today’s LLM pipeline
04:22 Synthetic data generation algorithms
04:40 Choosing the right generation algorithm
05:06 Introducing SDG-Hub
05:21 LLM customization workflow overview
06:08 What is SDG-Hub?
09:30 How to create your own custom flow in four easy steps
10:44 Use case: Training an IBM finance bot with raw data
12:20 From raw documents to domain-tuned models workflow
13:36 Generating synthetic data using SDG-Hub
14:56 Data subset selection for efficient training (see the sketch after this chapter list)
15:24 The three dimensions of LLM efficiency
17:22 Use case 1: Enhancing efficiency with data subset selection
18:10 Use case 2: Targeted data subset selection
18:44 Use case 3: Complementary data selection and bias mitigation
24:04 Data subset selection pipeline visualized
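
The chapters from 14:56 through 24:04 walk through data subset selection. As a rough, generic illustration of the idea (not the specific algorithm presented in the session), the sketch below keeps a small, diverse slice of synthetic examples by clustering TF-IDF vectors with k-means and retaining the example nearest each cluster center; the toy examples, the TF-IDF embedding, and the choice of k-means are all assumptions for demonstration purposes.

# Generic illustration of data subset selection, NOT the method from the talk:
# cluster the examples and keep one representative per cluster, so the model
# trains on a smaller but still diverse slice of the synthetic data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

synthetic_examples = [
    "Q: How is goodwill impairment tested? A: ...",
    "Q: How is goodwill impairment recognized? A: ...",        # near-duplicate
    "Q: What are the lease classification criteria? A: ...",
    "Q: How are deferred taxes calculated? A: ...",
    "Q: How do you compute deferred tax liabilities? A: ...",  # near-duplicate
    "Q: What does IFRS 16 change about leases? A: ...",
]

subset_size = 3  # training budget: keep 3 of the 6 examples

# Embed with TF-IDF (a stand-in for whatever embedding model you prefer).
vectors = TfidfVectorizer().fit_transform(synthetic_examples)

# Cluster into `subset_size` groups, then keep the example closest to each centroid.
kmeans = KMeans(n_clusters=subset_size, n_init=10, random_state=0).fit(vectors)
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors)

selected = [synthetic_examples[i] for i in sorted(set(closest))]
print(f"Kept {len(selected)} of {len(synthetic_examples)} examples:")
for ex in selected:
    print(" -", ex)

In practice you would swap TF-IDF for a stronger embedding model and tune the subset budget against downstream accuracy, which is the efficiency trade-off the session discusses.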

🔗See SDG-Hub’s pre-built workflows for yourself: https://developers.redhat.com/articles/2025/10/27/sdg-hub-building-synthetic-data-pipelines-modular-blocks

#RedHat #AI #LLM

Date: November 28, 2025