Course Description

In this course, you will analyze how large language models are constructed from diverse text sources and examine the entire model life cycle, from pretraining data collection to generating meaningful outputs. You’ll explore how choices about data type, genre, and tokenization affect a model’s performance, and you’ll learn to compare real-world corpora such as Wikipedia, Reddit, and GitHub.

Through hands-on projects, you will design tokenizers, quantify text characteristics, and apply methods like byte-pair encoding to see how different preprocessing strategies shape model capabilities. You’ll also investigate how models interpret context by studying keywords in context (KWIC) views and embedding-based analysis.

By the end of this course, you will have a clear understanding of how data selection and processing decisions influence the way LLMs behave, preparing you to evaluate or improve existing models.

You are required to have completed the following courses or have equivalent experience before taking this course:

  • LLM Tools, Platforms, and Prompts
  • Language Models and Next-Word Prediction
  • Fine-Tuning LLMs

Faculty Author

David Mimno

Benefits to the Learner

  • Summarize the life cycle of a language model, detailing each phase from data collection through inference
  • Assess the impact of data collection and curation choices on a model’s predictive capabilities and domain coverage
  • Analyze pretraining documents to see how an LLM extends prompts within specific, real-world contexts
  • Classify and quantify text collections by genre, language, and code to gauge their effect on model behavior
  • Explain how a pretraining dataset’s composition influences tokenizer coverage and performance across different text domains

Target Audience

  • Engineers
  • Developers
  • Analysts
  • Data scientists
  • AI engineers
  • Entrepreneurs
  • Data journalists
  • Product managers
  • Researchers
  • Policymakers
  • Legal professionals

Available Sections

  • Dec 03, 2025 to Dec 16, 2025 (Type: 2 week; Total Number of Hours: 16.0; Contract Fee: $100.00)
  • Feb 25, 2026 to Mar 10, 2026 (Type: 2 week; Total Number of Hours: 16.0; Contract Fee: $100.00)
  • May 20, 2026 to Jun 02, 2026 (Type: 2 week; Total Number of Hours: 16.0; Contract Fee: $100.00)