| Class | Semester | Instructor | Department | License |
|---|---|---|---|---|
| AI for Social Science Methods | Fall 2025 | Daniel Karell | Sociology | CC BY-NC-SA 4.0 |

Learning Objectives
- Students learn how to use LLMs to generate data comparable to data created by people responding to a survey.
- Students explore the effectiveness and limitations of using LLMs as substitutes for human participants in a survey.
- Students gain insights that enable them to critically assess a debate in social science over using LLM-derived data as a complement to – or even a replacement for – human-produced data.
Overview (see attached for complete instructions)
The goals of this activity are to gain (1) hands-on familiarity with prompting techniques (some of which we have read about) and (2) experience using large language models (LLMs) to label observations in a dataset and extract information from text, which are common tasks in social science research. We will be using a dataset provided by Armed Conflict Location and Event Data (ACLED). ACLED is a non-profit organization that collects data on violent conflict and protests around the world. It organizes and publishes these data, along with a codebook, which is a guide explaining the data. For the exercises below, we will use both a sample of ACLED’s data and its codebook. When working through the example code in the following section(s), you do not need to submit answers to any questions in the text. These questions are meant to help you reflect on what is happening in the example analysis. Try to answer them to yourself to check your understanding. You only need to submit answers to the questions and prompts in the “Exercises” section below.
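To make the labeling task concrete, here is a minimal sketch of a zero-shot labeling prompt built from codebook definitions, plus a check that the model's reply is a valid label. It is an illustration under stated assumptions, not the assignment's own code: the function names `build_labeling_prompt` and `parse_label`, the one-line definitions, and the event description are all hypothetical, and no model is actually called.

```python
from typing import Dict, Optional

# Placeholder codebook: the one-line definitions are illustrative assumptions.
# Replace them with the wording from the actual codebook.
CODEBOOK = {
    "Battles": "armed clashes between organized armed groups",
    "Protests": "public demonstrations in which participants are not violent",
    "Riots": "demonstrations or mobs in which participants use violence",
}


def build_labeling_prompt(event_note: str, codebook: Dict[str, str]) -> str:
    """Build a zero-shot prompt that asks the model for exactly one label."""
    definitions = "\n".join(f"- {label}: {text}" for label, text in codebook.items())
    return (
        "You are coding descriptions of conflict events.\n"
        "Choose exactly one label from this list:\n"
        f"{definitions}\n\n"
        f"Event description: {event_note}\n"
        "Answer with the label only."
    )


def parse_label(reply: str, codebook: Dict[str, str]) -> Optional[str]:
    """Map a raw model reply to a codebook label, or None if nothing matches."""
    cleaned = reply.strip().strip(".\"'").lower()
    for label in codebook:
        if cleaned == label.lower():
            return label
    return None


# Made-up event description, for illustration only.
prompt = build_labeling_prompt("Students marched peacefully to the ministry.", CODEBOOK)
print(prompt)
print(parse_label(" protests. ", CODEBOOK))
```

Returning `None` for replies that do not match a label makes it easy to count how often the model strays from the codebook, which is itself a useful diagnostic when comparing prompting techniques.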
Reflections
The goal of this activity is to explore the effectiveness and limitations of using large language models (LLMs) as substitutes for human participants in social science research. The title of this assignment, “Silicon Subjects,” refers to the idea of using LLM-derived (or “silicon” or “synthetic”) data as a complement to – or even a replacement for – “organic”, human-produced data. In this assignment, we will be engaging with insights and analyses from four assigned readings: Argyle, et al. 2023; Bisbee, et al. 2024; Lyman, et al. 2025; and Broska, et al. 2025. Please familiarize yourself with these articles before you start this assignment. You can find the articles in the Files/readings/ folder on Canvas.
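As a rough illustration of what "silicon" data looks like in practice, the sketch below builds a persona-conditioned prompt from one respondent's characteristics and then compares the distribution of synthetic answers with the distribution of human answers. Everything specific here is an assumption made for illustration: the respondent fields, the survey item, the toy answer lists, and the helper names are hypothetical, and total variation distance is simply one convenient way to summarize the gap, not a method taken from the readings. No model is called; the synthetic answers stand in for whatever an LLM would return.

```python
from collections import Counter


def build_persona_prompt(respondent, question, options):
    """Condition the model on a first-person backstory, then ask one survey item."""
    backstory = (
        f"I am {respondent['age']} years old. "
        f"I live in {respondent['state']}. "
        f"Politically, I consider myself {respondent['ideology']}."
    )
    choices = ", ".join(options)
    return (
        f"{backstory}\n\n"
        f"Survey question: {question}\n"
        f"Answer with one of: {choices}."
    )


def response_shares(responses, options):
    """Return the share of responses falling in each answer option."""
    counts = Counter(responses)
    return {option: counts[option] / len(responses) for option in options}


def total_variation_distance(p, q):
    """Half the summed absolute difference between two answer distributions."""
    return 0.5 * sum(abs(p[key] - q[key]) for key in p)


OPTIONS = ["agree", "neutral", "disagree"]

# Hypothetical respondent and made-up survey item.
respondent = {"age": 34, "state": "Ohio", "ideology": "moderate"}
prompt = build_persona_prompt(
    respondent,
    "The government should do more to reduce income inequality.",
    OPTIONS,
)
print(prompt)

# Toy data: real human answers would come from a survey,
# synthetic answers from the LLM's replies to persona prompts.
human = ["agree", "agree", "disagree", "neutral"]
synthetic = ["agree", "agree", "agree", "neutral"]

gap = total_variation_distance(
    response_shares(human, OPTIONS),
    response_shares(synthetic, OPTIONS),
)
print(gap)
```

A gap of 0 would mean the synthetic and human answer distributions are identical, and a gap of 1 would mean they do not overlap at all. A single summary number like this hides which subgroups are reproduced well and which are not, so it is a starting point for the comparison rather than a verdict.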
Readings and Resources
1. “Out of One, Many: Using Language Models to Simulate Human Samples” by Argyle, et al. in Political Analysis (2023)
2. “Synthetic Replacements for Human Survey Data? The Perils of Large Language Models” by Bisbee, et al. in Political Analysis (2024)
3. “Balancing Algorithmic Fidelity and Alignment in Silicon Sampling Research Methods” by Lyman, et al. in Sociological Methods & Research (2025)
4. “Nationally Representative, Locally Misaligned: The Biases of Generative Artificial Intelligence in Neighborhood Perception” by Bollen, et al. in Political Analysis (2025)
5. “The Mixed Subjects Approach: Treating Generative AI as (Potentially) Informative Observations in Experiments” by Broska, et al. in Sociological Methods & Research (2025)