Zoom sucks, right? If you've spent any time on it — or any other video-chat app — you've probably felt it. No matter how good your internet connection is, the personal connection rarely clicks. You don't know when to start talking. You overlap and interrupt the person you're talking to, and they don't seem to be listening. Everyone is accounted for, but no one feels present.
It's not just you. In a study last year, people who were face-to-face responded to yes/no questions in 297 milliseconds, on average, while those on Zoom chats took 976 milliseconds. Conversational turns — handing the mic back and forth between speakers, as it were — exhibited similar delays. The researchers hypothesized that something about the scant 30- to 70-millisecond delay in Zoom audio disrupts whatever neural mechanisms we meatbags use to get in sync with one another, that magic that creates true dialogue.
This seems like something science could fix, doesn't it? But first we'd have to get past the vibes and understand what's actually bad — or maybe even good? — about video chat.
That's the idea that struck Andrew Reece, the lead scientist at BetterUp Labs, the research arm of the big online coaching company. He knew from his job that video chats could be good. "We basically connect people on Zoom calls to have one person help another person be happier at work," he says. But there was one problem: "We don't record those calls. All we know is, when our members come out they're like, 'That was great, I want to do it again.' It's sort of a black box what's going on in there."
So Reece wondered whether it might be possible to crack open that black box and learn why those conversations were working. Was it something about what people said, or how they said it? Was it the way they sounded, what their faces looked like? "We decided: OK, the first thing we wanted to do is just see conversation writ large," Reece says. "What kind of dynamics could we capture?"
The result is the largest-ever database of one-on-one Zoom conversations. It's called CANDOR, short for Conversation: A Naturalistic Dataset of Online Recordings. Reece and his colleagues examined more than 1,600 conversations — some 850 hours and 7 million words total. The researchers paired volunteers who had never met and asked them to hop on Zoom and chat for half an hour about any old thing — with Record turned on. Which means that unlike most conversational databases, CANDOR didn't just encode their words, which were transcribed automatically by algorithms. It also automatically captured things like the tone, volume, and intensity of conversational exchanges, recording everything from facial expressions to head nods to the number of "ums" and "yeahs." And before and after each Zoom chat, the researchers interviewed the participants to measure their reactions: their likes and dislikes, their favorite moments, their unspoken anxieties. Did they appreciate it, say, when their partner kept nodding in agreement? Or did it secretly piss them off?
"We have variables from really low-level, millisecond-by-millisecond turn taking, all the way up to, 'Did you enjoy this conversation and why?'" says Gus Cooney, a social psychologist who helped develop CANDOR. "It's nicely processed, easily analyzable, and it's been really, really vetted." So this new corpus, as such databases are known among social scientists, might do more than help us better understand how our coworkers perceive us on Zoom. It may shed new light on what we talk about when we talk about talking today — the conversation of the future.
You don't say!
Think about the complexity of even the simplest of dialogues, and how we pass the conversational baton back and forth. You talk, I make some uh-huh noises, I talk, you hit me with an "OK" or two, you talk again and I nod, you shift to a new topic with a question, and I give you a yes or no before picking up the new thread.
That pas de deux is a miracle of human communication. When people talk, somehow we almost never overlap. The gaps between you-go and I-go are just about a quarter of a second — literally the blink of an eye, so fast that we must be predicting when our turn will come. We use fillers like uh-huhs and OKs — linguists call those "backchannels" — to align with one another. A nod while someone's talking is encouragement; a nod at the end is off-putting. A "yes" comes within a half-second; a "no" takes longer. If I say yes but delay until the back half of that second, you think I mean no. "Um" means "wait a little longer"; "uh" means "I'm about to get to my point." The answering noise "huh" sounds pretty much the same in a dozen languages.
The analysis of conversation goes back a long way — at least to the early 1970s and a classic paper on turn taking as dialogue's primary engine. But the complexity of the data always made it a real slog. "It used to be very much at the fringe, because it was technically challenging to deal with real speech. Written stuff was dead easier. You can just go look at it," says Simon Garrod, a cognitive psychologist at the University of Glasgow who is one of the field's leading researchers. "That's changed because technology has changed. Suddenly everything is recorded, speech is recorded. It's all there."
But people — well, grad students — still had to listen to or watch the recordings and note all the things that might be of interest to a researcher, a process called coding. "Transcription was a real struggle, actually," Garrod says. "It took hours and hours of people's work to do it, and you had to do it repeatedly." That meant you needed a big team — and a lot of money to pay them.
So in 2018, Reece connected with Cooney, a grad-school pal who studied conversation. New tech, they thought, might solve the coding issue, and even account for the complexities of overlapping back-channel speech and the timing of turns. They figured they could just get volunteers to have a half hour of chitchat and ask them about how it felt.
It turned out to be a lot harder than they expected. Everyone's video was laggy, which meant they had to scrap hundreds of hours of video for quality reasons. They also had to figure out how to get software to stitch together the two sides of the conversation precisely enough to allow them to analyze interactions down to the millisecond. "Hundreds of hours were spent on that particular problem," Reece says.
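For a sense of what that stitching problem involves, here is a minimal sketch of one way to estimate the offset between two separately recorded sides of a call: cross-correlate coarse loudness envelopes of the two files. It assumes each participant's local recording captured the same mixed meeting audio and was exported to WAV; the file names are hypothetical, and this illustrates the general technique rather than the CANDOR team's actual pipeline.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

ENV_RATE = 1000  # envelope frames per second, i.e. roughly millisecond resolution


def coarse_envelope(signal: np.ndarray, rate: int) -> np.ndarray:
    """Mean absolute amplitude in non-overlapping ~1 ms frames."""
    if signal.ndim > 1:                      # mix stereo down to mono
        signal = signal.mean(axis=-1)
    hop = max(rate // ENV_RATE, 1)
    n_frames = len(signal) // hop
    frames = np.abs(signal[: n_frames * hop].astype(np.float64))
    return frames.reshape(n_frames, hop).mean(axis=1)


def estimate_offset_seconds(path_a: str, path_b: str) -> float:
    """Seconds by which the shared audio shows up later in file A than in file B."""
    rate_a, audio_a = wavfile.read(path_a)
    rate_b, audio_b = wavfile.read(path_b)
    assert rate_a == rate_b, "resample one file first if the sample rates differ"

    env_a = coarse_envelope(audio_a, rate_a)
    env_b = coarse_envelope(audio_b, rate_b)

    # The peak of the cross-correlation marks the shift that best lines the
    # two envelopes up.
    corr = correlate(env_a, env_b, mode="full")
    lag_frames = np.argmax(corr) - (len(env_b) - 1)
    return lag_frames * max(rate_a // ENV_RATE, 1) / rate_a


# Example (hypothetical file names):
# offset = estimate_offset_seconds("participant_a.wav", "participant_b.wav")
# Trimming that many seconds from the front of the later file lines the two
# recordings up closely enough to study turn taking.
```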
When they finally assembled all the videos and built the neural nets to process the dialogues, many of their findings confirmed previous research. That was good; it signaled that their dataset was big enough to trust. But this was back in 2020, the year we all began struggling with how to interact over Zoom. So they were looking at something relatively new in the world. What made people happy on a video chat? And what made one person more fun to talk to than another?
The Tom Cruises of Zoom
Cooney and Reece's first pass at the data suggests that "good conversationalists" on Zoom are those who talk faster, louder, and more intensely. They're the Tom Cruises, as it were, of the interactive back-and-forth. People rated by their partners as better conversationalists spoke 3% faster than bad conversationalists — uttering about six more words a minute. And while the average loudness of speakers didn't change across bad or good conversations, the "good" talkers varied their decibel levels more than the "bad" talkers did. Cooney and Reece's team speculate that the good ones were better at reading the Zoom room, calibrating their volume to the curves of the conversation.
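As a rough illustration of the loudness measure (a sketch, not the study's code), the frame-by-frame decibel levels of a speaker's audio, and how much they vary, can be pulled out with an audio library such as librosa. The file name below is hypothetical.

```python
import librosa
import numpy as np

# Load one speaker's audio at its native sample rate (hypothetical file name).
y, sr = librosa.load("speaker_a.wav", sr=None)

# Frame-by-frame loudness: RMS energy converted to decibels relative to the peak.
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]
loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

print("mean loudness (dB rel. to peak):", float(loudness_db.mean()))
print("loudness variation (std dev):  ", float(loudness_db.std()))
```

A speaker who modulates their volume to fit the conversation should show a larger spread, even if their average loudness is the same as everyone else's.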
But loudness, it turns out, isn't as good a metric as intensity — maybe because intensity is more subtle, a combination of the frequencies and sibilance of speech and the emotion conveyed by everything from tone to body language. To help the computer assess something so ineffable — like, what is this thing you humans call love? — the CANDOR team fed it the Ryerson Audio-Visual Database of Emotional Speech and Song. That enabled the candorbots to draw on more than 7,000 recordings of 24 actors saying and singing things with different emotional shading, from happy or sad to fearful or disgusted. The machine found that women rated as better Zoom conversationalists tended to be more intense. The differences among men, strangely, were statistically insignificant. (The reverse was true for happiness. Male speakers who appeared to be happier were rated as better conversationalists, while the stats for women didn't budge.)
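In miniature, that training step might look something like the sketch below, which fits a small classifier on RAVDESS-style clips so it can score the intensity of new speech. It's a toy stand-in for whatever models the researchers actually built; the directory path is hypothetical, and the labels come from the RAVDESS file-naming convention, in which the fourth dash-separated field marks normal versus strong delivery.

```python
import glob
import os

import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression


def clip_features(path: str) -> np.ndarray:
    """Average MFCCs over a clip: a crude fixed-length summary of how it sounds."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)


# Hypothetical local copy of the RAVDESS speech recordings.
paths = sorted(glob.glob("ravdess/**/*.wav", recursive=True))
X = np.array([clip_features(p) for p in paths])

# RAVDESS file names encode emotional intensity in their fourth dash-separated
# field: "01" for normal delivery, "02" for strong.
labels = np.array([os.path.basename(p).split("-")[3] == "02" for p in paths], dtype=int)

model = LogisticRegression(max_iter=1000).fit(X, labels)
# model.predict_proba(clip_features("some_turn.wav").reshape(1, -1)) then gives
# a rough "intensity" score for a new stretch of conversation audio.
```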
Then there's nodding. Better-rated conversationalists nodded "yes" 4% more often and shook their heads "no" 3% more often. They were not "merely cheerful listeners who nod supportively," the researchers note, but were instead making "judicious use of nonverbal negations." Translation: An honest and well-timed no will score you more points than an insincere yes. Good conversationalists are those who appear more engaged in what their partners are saying.
Another question the researchers looked at: How much new stuff do you have to say, when it's your turn to talk, to keep a conversation fresh? The results were inconclusive. The coding system found that some degree of "semantic similarity" between turns is ideal — the highly rated conversationalists, in general, changed the subject and brought up new ideas more often than the poorly rated ones. But the machine couldn't decide whether the low-rated talkers had nothing interesting to add, or whether they just tended to repeat themselves more. More research is needed, apparently. "I still think that's one of the coolest things," Reece says.
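To get a feel for what a semantic-similarity measure does, here's a small sketch that uses an off-the-shelf sentence-embedding model (not necessarily anything the researchers used) to compare each turn of a made-up exchange with the turn before it.

```python
from sentence_transformers import SentenceTransformer, util

turns = [
    "I just got back from a camping trip in Utah.",
    "Oh nice, which park did you go to?",
    "Zion, mostly. Have you ever been out that way?",
    "No, but I keep meaning to. I mostly hike around here.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(turns)

# Cosine similarity between each turn and the one before it: low scores
# suggest a topic shift, scores near 1.0 suggest staying on (or repeating) a theme.
for i in range(1, len(turns)):
    score = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
    print(f"{score:.2f}  {turns[i]!r}")
```

Low scores flag a topic change; scores near 1.0 mean a turn mostly restates the last one. The number by itself can't distinguish a talker with nothing to add from one who simply repeats themselves, which is exactly the ambiguity the researchers ran into.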
"The assumption was, the thing that makes you tired or sad is the medium. But it doesn't seem like that's true."Andrew Reece, BetterUp
Overall, the study found, people liked chatting on Zoom, even during the hellish first year of the pandemic. That January, CANDOR found, barely anyone mentioned COVID-19; by December, it came up in almost every conversation. At the beginning of the year, only a quarter of the conversationalists talked about politics; by Christmas, politics came up in nearly half the chats. Yet when the researchers asked participants to rate their "positive feelings" — defined as "good, pleasant, happy" — on a scale of one to 10, the mean rose from a little above a 6 before the video chats to more than 7 afterward. Everyone, across all demographic groups, experienced that rise in happiness, especially people between 50 and 69 years old.
Some of the biggest surprises were what the researchers didn't find. The good news for BetterUp, which depends on video chat for its business model, was the lack of any evidence that people dislike Zoom itself. "The assumption was, the thing that makes you tired or sad is the medium," Reece says. "But it doesn't seem like that's true. We see massive effects of: You feel better when you talk to a stranger online." The very act of chatting, it turns out, makes people happy — even when it's over Zoom.
The study also failed to confirm other assumptions. The old chestnuts about men interrupting women more than vice versa, or women being more accommodating and "affiliative" in their turn taking? No evidence. Video chat making it hard for people to have smooth conversations? Nope. So maybe all those old findings are wrong. Or maybe CANDOR's algorithms weren't finely tuned enough to recognize jerky men or jerky audio. After all, you can't spell "mansplaining" without "AI." Either way, Cooney says, there's more to follow up on here in the corpus.
Next up for the CANDOR team: trying to analyze the optimal pace of smiles, and how quickly to smile back when your partner smiles first. "We've only done the top-line cut, which is to see how these things relate to overall enjoyment," Cooney says. "Really digging into how the moment-to-moment smiles 10 seconds ago relate to current smiles and relate to future smiles — that's something we're just on the cusp of understanding."
Empathetic kung-fu
The CANDOR corpus is a good start — maybe. "Things like this study are exciting and going in the right direction — recording everything in real time, real humans talking to each other," says Nick Enfield, a linguist at the University of Sydney and author of "How We Talk." "We can get it transcribed at the flick of a switch, because we've got computational power to do so now."
But, Enfield says, the dataset has some serious limitations, however huge it might be. For one thing, it's only in American English, which means scientists in the field can't use it to explore and identify cross-language commonalities. And for another, the conversations involved people who were randomly paired, which might be just weird enough to skew the data. "How much of your life — not your professional life, but your real life — is spent getting to know a complete stranger?" Enfield says.
BetterUp has a financial incentive to optimize Zoom behaviors: It wants people to come out of conversations feeling good, feeling heard, feeling understood. And the CANDOR results certainly suggest some ways that a conversationalist can project those sensations. But whether those feelings are authentic — on either side of a dialogue — is a whole other story. Breaking these quantifications into qualifications, into "good" and "bad," turns conversations into empathic kung fu. These are the kind of simulated responses that successful dinner-party hosts, psychotherapists, and reporters operationalize to achieve their ends. As a onetime professional TV journalist, I promise you that I can nod intently into a camera and judiciously introduce new subjects for hours at a stretch.
Maybe that doesn't matter if you're looking for a Zoom mentor through a service like BetterUp. Coaches gonna coach, right? But someday databases like CANDOR could be used to train artificial intelligences to imitate the way humans conduct conversations. Chatbots that serve as customer-service representatives or intake staffers at urgent-care centers may learn to nod and smile like the world's greatest conversationalists, but they're not going to feel anything. They can't. All they'll know is how to make us feel good — with deepfaked faces that understand precisely when to say uh-huh, and how widely to smile, down to the millimeter, no matter what they're actually saying. Studying Zoom calls may help us have better conversations on Zoom. But it could also end up making a weird future even weirder.
Adam Rogers is a senior correspondent at Insider.