Campaign Trend Podcast

The Digital Twin Dilemma: Why AI Can't Replace Real Voters — Yet

Eric Wilson Season 4 Episode 8

Every pollster knows the problem: for every hundred voters you ask to take a poll, only one or two actually respond. So what if AI could just answer the questions for them? It's a seductive idea — and Ben Leff has put it to the test.

Ben is co-founder and CEO of Verasight, a survey research firm founded by academics and trusted by leading institutions and media organizations. He and his colleagues G. Elliott Morris and Peter Enns recently published a series of papers asking the question the industry is buzzing about: can AI digital twins replace human survey respondents? Their verdict, after rigorous real-world testing, is a firm no, for now.

In this conversation with host Eric Wilson, Ben breaks down exactly what synthetic sampling is, how Verasight built and stress-tested it, and where it consistently fell apart. Ben sees a narrow lane where synthetic data could earn a legitimate spot in the toolkit — as a directional pre-screen, a quick and cheap starting point before committing to a full sample. But for anything requiring precision in close races, the humans still have to show up.

Visit our website: CampaignTrend.com

SPEAKER_00

You say, here's an example data set from the census. Let's pick a thousand of these people at random to try to predict how they'll behave.

SPEAKER_02

Welcome to the Campaign Trend Podcast, where you're joining in on a conversation with the entrepreneurs, operatives, and experts who make professional politics happen. I'm your host, Eric Wilson. My guest today is Ben Leff, co-founder and CEO of Verasight, a survey research firm that's become a trusted source for high-quality polling data. Ben and his colleagues, G. Elliott Morris and Peter Enns, just published a report asking a question that every pollster is worried about: can AI digital twins replace human survey respondents? And their answer, after a whole series of experiments, is not in any way that we've tried. So here's the challenge that we face. For every hundred voters you ask to take a poll, only one or two actually do. That is what's making polling more expensive and less reliable every cycle, because we have to keep those surveys open for longer. So it sure would be a lifesaver if AI could answer questions like a real voter at a fraction of the cost. So today we're going to get into what synthetic samples actually get right, where they fail, why that matters, and whether the human benchmark itself is holding up. So, Ben, it's not much of a leap. AI chatbots are genuinely good at writing like humans now. There are the tells, the "it's not X, it's Y" cadence that we all recognize. But it is convincing. So it's not a crazy leap to wonder whether AI could also answer questions like a human, say, in a political survey. So explain for us what the idea of a digital twin is and walk us through how you tested it and iterated on it.

SPEAKER_00

Yeah, so it's not a crazy idea, and that's why a lot of people are trying it. Businesses have spun up trying to do just that, and people are actually using this type of information. So the way you build a synthetic sample is this: we know what the American public looks like from the census and similar high-quality surveys, so you have a demographic picture of the United States. Just like in a real survey, where you take a sample of a thousand real people, what that means is they'll have different age groups, income levels, educations. When you build the synthetic sample, you do the same thing. So you say, here's an example data set from the census. Let's pick a thousand of these people at random to try to predict how they'll behave. So that's sort of the first step in building a synthetic sample. We're gonna randomly sample from those benchmarks, calling them digital twins from the public. And then what you're doing with the LLM is, for each demographic persona, you're predicting how they would answer a set survey question. So in the context of politics, it might be whether they approve or disapprove of President Trump, who they plan to vote for in the midterm election, for instance, or how they feel about a variety of issues. So essentially what you're doing is first building sort of this profile of what your sample looks like, and then for each individual in that synthetic sample, the model is guessing how they would answer the question. So that's at a very high level what we're doing when we say building a synthetic sample. Now, where it gets more complicated is if you really believe in the promise of synthetic data, and one thing we wanted to test and take seriously is to say, what if we gave the model even more information before asking it to predict how the human would respond, or what the model thinks the human would respond? So great.
So in addition to just giving the model demographic data, in this case we took a year's worth of public Verasight survey data, about 15,000 respondents. And we looked at which respondents were closest to the synthetic persona. So let's say it's a 24-year-old female Democrat in New York City. We found a survey respondent just like that, and we assigned their attitudes from the last year's worth of surveys. You basically preloaded some opinions. Exactly. Exactly. So the thesis being the more data you give the model, the better it will perform. So that's the thesis.
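
The steps Ben describes (draw personas at random from a demographic frame, attach the attitudes of the most similar past respondent, then ask a model to predict an answer) can be sketched in a few lines of Python. This is only an illustration, not Verasight's code: the tiny `FRAME` and `RESPONDENTS` tables, the field names, the similarity score, and the prompt wording are all assumptions made for this example, and no model is actually called.

```python
import random

# Hypothetical stand-in for a census-style frame; real work would use
# microdata with far more rows and fields.
FRAME = [
    {"age": 24, "sex": "F", "party": "D", "region": "Northeast"},
    {"age": 67, "sex": "M", "party": "R", "region": "South"},
    {"age": 45, "sex": "F", "party": "I", "region": "Midwest"},
    {"age": 33, "sex": "M", "party": "D", "region": "West"},
]

# Hypothetical past survey respondents, each with stored attitudes.
RESPONDENTS = [
    {"age": 25, "sex": "F", "party": "D", "region": "Northeast",
     "answers": {"economy": "getting worse"}},
    {"age": 70, "sex": "M", "party": "R", "region": "South",
     "answers": {"economy": "getting better"}},
]


def draw_personas(frame, n, seed=0):
    """Step 1: pick n personas at random (with replacement) from the frame."""
    rng = random.Random(seed)
    return [rng.choice(frame) for _ in range(n)]


def closest_respondent(persona, respondents):
    """Step 2: find the past respondent who looks most like the persona."""
    def similarity(r):
        # Assumed scoring rule: count matching categories, penalize age gap.
        matches = sum(persona[k] == r[k] for k in ("sex", "party", "region"))
        return matches - abs(persona["age"] - r["age"]) / 100
    return max(respondents, key=similarity)


def build_prompt(persona, prior_answers, question):
    """Step 3: assemble the text a model would be asked to complete."""
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    priors = "; ".join(f"{k} = {v}" for k, v in prior_answers.items())
    return (f"Respondent profile: {profile}\n"
            f"Earlier survey answers: {priors}\n"
            f"Question: {question}\n"
            f"Predict this respondent's answer.")


personas = draw_personas(FRAME, 1000)
match = closest_respondent(FRAME[0], RESPONDENTS)
prompt = build_prompt(FRAME[0], match["answers"],
                      "Do you approve of the president?")
```

In this sketch the 24-year-old Northeastern Democrat is paired with the 25-year-old respondent, whose earlier answers are "preloaded" into the prompt. Whether that extra context helps is exactly what the papers tested.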

SPEAKER_02

Which is basically the fundamental conceit of all AI, right? Give it more data and it performs better, yeah.

SPEAKER_00

Exactly. And what we found, which is discouraging, is more information doesn't always make the model perform better. And that's a really important overall conclusion from our series of papers that we published. Because if every time we gave the model more information, or every time we used the newest and latest model (and sometimes the newest models did worse than the previous model), it was getting better, then a reasonable conclusion would be it's just a matter of time until synthetic data is as accurate. But what we're finding is, depending on the question, different models perform better, and whether the model has more or less information affects the performance. And so it doesn't point to sort of this clear direction where synthetic data is going to continue to get more accurate.

SPEAKER_02

So the headline conclusion then from the report, which we'll link to in the show notes, is that these synthetic samples, even when you made them really robust, are fundamentally just predictions and not measurements of public opinion. Explain that distinction for us in the context of political polling.

SPEAKER_00

Totally. So what we did to just measure the effectiveness of synthetic samples in the reports is whenever we ran the simulations, we would also run a real survey of a nationally representative sample of a thousand Americans. So that would be what we would compare the synthetic predictions to. And what we found is on average the error rate was a little over six and a half percent. And that was across about 15 questions.

SPEAKER_02

Now, the error rate means that the AI answer was, on average, six points different from the actual humans' answer.

SPEAKER_00

Exactly. And sometimes that was really low, closer to two points or less, and sometimes it was above twelve points. So what's really problematic as a pollster is unless you have the real survey, you don't know if your synthetic prediction is closer to the 2% or closer to the 12%. So you really don't know how far off you are. And when we say 12%, it could be 12% higher or 12% lower. So that's a big range of error that makes it very hard to draw conclusions, particularly for political pollsters when elections are super tight. That is far from the precision needed. So go ahead. Yeah. And then I just really quickly want to touch on, and it might be part of your next question, when the error rate is really low, because people love that and are like, oh, synthetic samples are amazing. So the example I want to give is, and you'll see this in the report: if we tell the model this person is a Democrat, a registered Democrat, they voted for Kamala in the last election, and then we ask it to predict whether they approve or disapprove of President Trump, the model says they're gonna disapprove of President Trump. And as a whole, the synthetic model is very accurate. That's when the model is super accurate and the error is very low. What I would contend is that's not very impressive. We don't need an LLM. If I asked you, Eric, yeah, if I asked you, Eric, to tell me whether a Democrat who voted for Kamala in the last election approves or disapproves of President Trump, you would probably have about the same rate of success as the LLM. I drink less water, actually. There you go. Yeah. And you probably don't charge as much.
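
The "error rate" in this exchange is an average of per-question gaps between the synthetic topline and the real survey topline. A minimal Python sketch of that calculation follows. The three questions and all percentages are made-up placeholders chosen to echo the range Ben mentions (a couple of points on stable questions, twelve on a current-events question); they are not figures from the report.

```python
def absolute_errors(synthetic, real):
    """Per-question gap, in percentage points, between prediction and survey."""
    return {q: abs(synthetic[q] - real[q]) for q in real}


# Hypothetical toplines (percent giving a particular answer).
real = {
    "approve_president": 42.0,
    "generic_ballot_dem": 47.0,
    "aware_of_event": 60.0,
}
synthetic = {
    "approve_president": 44.0,
    "generic_ballot_dem": 46.0,
    "aware_of_event": 72.0,
}

errors = absolute_errors(synthetic, real)
mean_error = sum(errors.values()) / len(errors)   # 5.0 points on average
worst_question = max(errors, key=errors.get)      # "aware_of_event"
```

The average of 5 points hides a spread from 1 to 12, and computing any of these numbers requires the real survey. Without it, there is no way to tell which kind of question you are looking at.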

SPEAKER_02

Here's how I've come to think about it. So AI is basically a rear view mirror. It's not a windshield looking forward, because the LLM is trained essentially to mimic text that it's already read. And so when I'm explaining this in trainings, I use the example of a game of chance where you are probabilistically filling in the next likely word, and it's gonna get some things wrong. But it makes sense to me that your model fell apart on things like the Venezuela operation, you know, events that had happened after it was trained. Is that really what the problem is? Or can we give it more data or better prompts? Or is that just kind of baked into the models?

SPEAKER_00

It's a mix of both. The models are trained on, as you said, historic data, and most public opinion surveys are measuring the present or trying to predict the future. And so the question you're referring to, which had the largest error, was about people's awareness of the raids in Venezuela. That's when the study was first fielded, so that was top of mind. And the model did really poorly with it because it did not have a large pool of data to draw on. So you're exactly right. Whenever public opinion research is polling a new topic or there's a major political event, a new scandal comes out, a candidate exits the race, that's where we would predict the models to perform worse relative to questions that are super predictable or stable over time. Going back to the presidential approval measure, for instance, Donald Trump has been in the picture for 10 years at this point. And the model has 10 years' worth of data that it's been trained on. So you would expect it to be more accurate on questions related to approval of the president.

SPEAKER_02

Yeah, and by the way, that was where the more robust version of the AI got more accurate, with generic ballot, Trump approval. But we already have that polling. It's really easy to get that data and it's everywhere. So it's just one of those things where it may do a good job there, but it's not necessarily helping us with the underlying problem, which is that it's getting harder to reach real people. One thing that the Venezuela question kind of reminded me of: I think it was Jimmy Kimmel who used to do the sort of man on the street interviews, where he would go ask people there in LA, well, you know, can you believe that so-and-so, who's been dead for 20 years, didn't win the Grammy? And they'll say, oh yeah, well, that was a terrible thing. And so I wonder if there's a thing with humans where we just have to give our opinions, whereas the AI might sort of default to unsure or neutral.

SPEAKER_00

We actually found some of the reverse. The model was very often not saying don't know. You know, if you asked it about a current events issue, the model didn't want to say that people would answer not sure, don't know. We found really low rates, sometimes zero, of LLMs predicting don't know responses.

SPEAKER_02

Just total, total confidence.

SPEAKER_00

Exactly. Yeah.

SPEAKER_02

You're listening to the Campaign Trend Podcast. I'm speaking with Ben Leff from Verasight about the question of whether AI can replace humans when it comes to polls, or are we stuck with asking people for their opinions? Ben, one of the most surprising results is that more information, better models, more voter file data, richer prompts, that didn't really help, at least reliably, and as you mentioned, sometimes made things worse. Can you give us a sense of why this doesn't behave like, you know, a normal data problem?

SPEAKER_00

I think a normal data problem has a real stable answer. And the crux of what we do as public opinion researchers is we're measuring something that's continually changing. So it's inherently different from the typical problems that an LLM is trying to solve. The other reason that makes this very different is the way LLMs are trained: what they want to do is give you sort of the modal response, what they think is the most common response. And this gets really problematic for survey researchers when we're trying to look at crosstabs in particular, so breaking down the data by either certain ethnicities or certain income levels, because the model is, again, designed to tell you what it thinks the average person wants to hear, in this case, what the average response is. So what we really found is while the average error rate we said was around 6.5%, that really grew at the crosstab level. And we saw pretty problematic trends where the model was just resorting to stereotypes in a more extreme way at the crosstab level. And that's where you saw more deviations from how groups were actually behaving in real survey data.

SPEAKER_02

Yeah. So here's where I think I might have kind of a contrarian take. Your benchmark is traditional human polling, right? So you're comparing it to the real poll. And what that is, is a snapshot in time. It's backward looking by its very nature. And you mentioned that survey research leans on this principle of the representative sample. So the idea of, you know, a 40-something college-educated Republican in this zip code, go ask him his opinions, and then roughly people who look like him are probably going to have the same opinions. And I think that held, well, I guess, is that a fair statement of the representative sample before I proceed? Absolutely. Okay. And so I think that held when we all watched the same news, Walter Cronkite every night, or Tom Brokaw, and we read the same newspaper, the Washington Post. But today our media diet is as unique as we are. Even within a household, they're going to be totally different. So I am becoming, or have been, skeptical that anyone can still capture a truly representative sample, right? So how worried are you that the human benchmark itself is getting shakier underneath all of this, that the fundamental conceit of public opinion research is starting to falter?

SPEAKER_00

Well, I'm very concerned. I think public opinion research is harder than ever. A generation ago, when you could just rely on random digit dialing and response rates were above 30%, it was much easier to get a snapshot of what the public thinks and believes. So I do think public opinion research as a whole is incredibly challenging. And you know, we focus on multi-mode recruitment, so finding the best way to get in contact with different types of Americans. Sometimes it's text message, sometimes it's online, sometimes it's still snail mail. We're trying to model that optimal recruitment mechanism for each type of American that we're trying to target. And that's also where relying on really high quality public opinion or survey data comes in. So the census or Federal Reserve type data, using those as benchmarks when evaluating synthetic data can also be really valuable. But I fully agree with what you're asserting about the broader challenges in public opinion research making it really hard for survey researchers. What I don't believe is that the answer is relying on synthetic data.

SPEAKER_02

Yeah. I'm just thinking about, you know, I have really strong opinions about something depending on the last podcast I listened to or the last YouTube video I saw. And so if you asked me yesterday what I thought about Thomas More and Oliver Cromwell, I probably would have said, I really don't think about them that much. But I listened to a podcast this morning about the topic, and now I've got some strong opinions that'll probably trail off. So I think that is making it so much more difficult.

SPEAKER_00

Totally. But what I think you're speaking to, though, is why it's so important to be doing rapid survey research or continual survey research, because another trend, I think to your point, is public opinion is changing more rapidly. It's not one news show a night. People are getting information constantly. Right. So the thought that an LLM trained on historic data will be able to keep the continual pulse is just hard to justify.

SPEAKER_02

So now that we sort of understand the science of it, I want to shift a little bit to the reality, because a lot of our listeners run campaigns where a real poll might be out of reach. And so this promise of polling for everyone can be a really seductive pitch. So when someone approaches them selling a synthetic sampling product, what are the specific questions that a campaign should ask? And I guess how would you suss out what's a defensible use case from one that will confidently give them wrong answers?

SPEAKER_00

Totally. So I think the first question is just to ask to see an example of a synthetic data prediction next to the real survey result. And when you're doing that, have your critical hat on, in the sense of the example we took earlier. If they brag about a synthetic model that predicted that in the most Republican district in the country a Republican's gonna win, that's probably not an incredibly useful compliment to the model. But as it gets more granular and there are other types of questions, that's what I would really pay attention to, the average. The second thing I would ask to see is the crosstab level. Because one thing we noticed in the data is sometimes the actual prediction aggregates to the correct answer. So in our earlier study, one thing I found fascinating is 20% of people who approved of Trump, the model said they disapproved of Trump. But 20% of Americans who disapproved of Trump, the model said they approved of Trump. So the aggregate data looked pretty good and pretty accurate. A broken clock is right twice a day. Yeah. Yeah, exactly. But when you look at the crosstab level, it falls apart. So I would ask to see the crosstab information as well. I think it's always helpful to pretest before you get a survey in the field. So, you know, one thing survey researchers really struggle with is trying to figure out how many people they want to survey. And if something's gonna be super close, you want to survey more people because you need more precision. So imagine there's a world where a synthetic sample tells you Americans are gonna be so opposite, you know, so far apart. That might be a useful indicator, not perfect, but a useful indicator, that maybe I don't need as large of a sample size, for instance, to start.
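
The cancellation Ben describes, where 20% of approvers and 20% of disapprovers are each misclassified yet the topline still looks right, is easy to reproduce with toy numbers. The Python sketch below uses an entirely hypothetical table of 1,000 respondents split into two invented age groups, with the errors arranged so they offset overall but not within each group. It illustrates the arithmetic only and does not use the study's data.

```python
# Hypothetical counts keyed by (group, actual answer, predicted answer).
# 100 of 500 approvers and 100 of 500 disapprovers are misclassified (20% each).
CELLS = {
    ("under_45", "approve", "approve"): 150,
    ("under_45", "approve", "disapprove"): 100,
    ("under_45", "disapprove", "disapprove"): 250,
    ("45_plus", "approve", "approve"): 250,
    ("45_plus", "disapprove", "approve"): 100,
    ("45_plus", "disapprove", "disapprove"): 150,
}

ACTUAL, PREDICTED = 1, 2  # positions of the two answers inside each key


def approve_share(cells, column, group=None):
    """Percent approving, overall or within one group, for the given column."""
    rows = {k: n for k, n in cells.items() if group is None or k[0] == group}
    total = sum(rows.values())
    approving = sum(n for k, n in rows.items() if k[column] == "approve")
    return 100 * approving / total


def individual_agreement(cells):
    """Percent of respondents whose predicted answer matches their actual one."""
    total = sum(cells.values())
    correct = sum(n for k, n in cells.items() if k[ACTUAL] == k[PREDICTED])
    return 100 * correct / total


topline_gap = abs(approve_share(CELLS, PREDICTED) - approve_share(CELLS, ACTUAL))
crosstab_gaps = {
    g: abs(approve_share(CELLS, PREDICTED, g) - approve_share(CELLS, ACTUAL, g))
    for g in ("under_45", "45_plus")
}
agreement = individual_agreement(CELLS)
```

With these numbers the topline gap is 0 points, each crosstab is off by 20 points, and only 80% of individual predictions are correct. That is why a vendor's aggregate comparison alone says little about subgroup accuracy.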

SPEAKER_02

Got it. Okay. So sort of like um, you know, minimum detectable effect, sort of range estimate.

SPEAKER_00

Exactly. That could be an interesting way to get started with synthetic estimates.

SPEAKER_02

Yeah, I like that. That gives me an idea I'm gonna have to come back to. And so I think for now, the instinct, and I think where your report certainly comes down, is that for better or for worse, asking actual humans what they think is kind of the best thing we have right now. You don't slam the door shut, though. You say it may yet have value, and you gave kind of one example right now. Looking ahead, where do you think this could earn a spot in the toolkit? And what's kind of the standard of evidence for you to change your mind?

SPEAKER_00

Totally. Well, there's always gonna be new models coming out and additional data sources. So we're gonna continue to test this issue just as technology continues to evolve. There's also a range of public opinion research. There's what I would call quick, dirty, and cheap public opinion polls, and really high quality benchmarking polls where you need intense precision. So I could imagine a world where synthetic data tackles that lower end of the spectrum, where I just want a quick estimate and I just need to be directionally right. I don't need precision in predicting who's gonna win a close race. I just need a bit of a quick sense of the direction. There I can see a value to starting with synthetic data in the future as the models continue to evolve.

SPEAKER_02

Yeah. And I have two thoughts on this, and I'd like to get your reaction to them. One of those is we already see kind of an example of this with multimodal survey research, where it used to be the case, look, I'm not gonna send out a link on text message because I don't know who that person is, or I don't know who's gonna scan that QR code off of the postal mail. But, you know, the live interview over the phone got harder and harder to do. So we've accepted a variation there. I'm wondering if we might get to the point where we say, look, it's pretty easy to get an old person on the phone and talk for 20 minutes. Younger people, not so much. So maybe we try and supplement them with a synthetic sample. So we're almost doing not multimodal collection, but multimodal generation. What do you think about that? Is that crazy?

SPEAKER_00

I think it's crazy, but people are doing it. It's something called synthetic boosting, which is sort of what you're describing, where you want to double the sample size or you're just low on a certain demographic group, so you generate synthetic data for the demographic group. The reason I think it's crazy is I think the people that are hardest to survey are gonna be the ones where the synthetic predictions are the worst. That old retired person, if they've been a registered Republican, voted Republican for 40 years, I think the synthetic model is gonna do a good job predicting who they're gonna vote for, relative to the person you say you are, who's listening to podcasts, changing their opinion quickly, and is both the hardest to poll and also the most important, because their beliefs quickly change.

SPEAKER_02

Right. So it's still not solving our problem of having these people that we need to talk to. And there's another area where I wonder, because this is something that I confront a lot, which is, you know, in the digital and tech world, we're looking at these really close races and how we can make a difference. And the mathematics of it start to become really difficult, right? Because we are looking at something that might move a percentage point on the ballot, which is a big deal on election day. But in order to be able to detect that effect, I might have to survey more voters than are actually in that congressional district. So that's the other area where I'm kind of hopeful, you know, if we could supplement it or at least give our own, you know, synthetic view of the world.
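
The arithmetic behind Eric's point can be sketched with the textbook normal-approximation formula for comparing two proportions. The Python example below assumes two independent, equally sized groups drawn by simple random sampling (say, voters who saw a message and voters who did not), a two-sided 5% significance level, and 80% power. None of those settings come from the conversation, and a real design would also need to account for weighting and nonresponse.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Respondents needed in EACH group to detect p1 vs p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # extra margin for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


one_point = sample_size_per_group(0.50, 0.51)    # a 1-point shift
five_points = sample_size_per_group(0.50, 0.55)  # a 5-point shift
```

Under these assumptions a one-point shift needs tens of thousands of completed interviews per group, while a five-point shift needs well under two thousand. Because the required sample grows with the inverse square of the effect, small ballot movements are very expensive to measure with real respondents.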

SPEAKER_00

Totally. I think there are ways for AI and synthetic models to be helpful. It's just figuring out the right cause for the given problem. And there's, unfortunately, no shortage of problems.

SPEAKER_02

Yeah, and a conversation for another day, but I think that this is kind of where some of the prediction markets that we're seeing, you know, might blend in. Or, you know, that's what's so tough: we've got more data than ever before, but we just don't know what it's telling us and what we should do about it.

SPEAKER_00

And figuring out which data to integrate is gonna be the new challenge going forward.

SPEAKER_02

Right. So the example is when my wife tells me that she saw a dinosaur walk down the street, I'm more likely to believe her than when my four-year-old son tells me that he saw a dinosaur walk down the street. And so we may have to get some reps on that. Exactly.

SPEAKER_00

What's going on?

SPEAKER_02

Thanks to Ben Leff for a great conversation. You can learn more about Verasight in our show notes, and I'll include a link to that report, which is very fascinating. And I've read it a few times now. If this episode made you a little bit smarter or gave you something to think about, all we ask is that you share it with a friend or colleague. You're gonna look smarter in the process, and more people learn about the show. So it's a win-win all around. Remember to subscribe to the Campaign Trend Podcast wherever you listen to your podcasts, so you never miss an episode. And you can visit our website at campaigntrend.com for even more. Thanks for listening. We'll see you next time.

SPEAKER_01

The business of Artic Show is produced by advocacy content attempts to media productive.