Wednesday, October 07, 2009

Techwrecks VI: MIT study uses public Facebook data to identify gay users; huge implications for market research, civil liberties

Wow. Using nothing more than publicly available profile data — primarily relying on the principle of "homophily" (which does and doesn't mean what you think it means — "homo" = same, "philia" = friendship, so "homophily" = birds of a feather flock together) — MIT students were able to create a program that successfully identified Facebook users' sexual orientation with startling accuracy. Basically, they predicted, if you're gay, you're more likely to have gay social connections than if you're straight. And while this theory may be less valid for people in certain contexts (stereotypes aside, I'm sure if you're a straight male in musical theater or fashion design, you will bollix up the system), it holds true enough that the study has both interesting and troubling consequences, for market research and for civil liberties. From the Boston Globe article:

People of one race tend to have spouses, confidants, and friends of the same race, for example. Jernigan and Mistree downloaded data from the Facebook network, choosing as their sample people who had joined the MIT network and were in the classes 2007-2011 or graduate students. They were interested in three things people frequently fill in on their social network profile: their gender, a category called “interested in” that they took to denote sexuality, and their friend links. 

Using that information, they “trained” their computer program, analyzing the friend links of 1,544 men who said they were straight, 21 who said they were bisexual, and 33 who said they were gay. Gay men had proportionally more gay friends than straight men, giving the computer program a way to infer a person’s sexuality based on their friends.
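To make the friends-predict-the-person idea concrete, here is a toy sketch in Python. It is not the MIT students' program (the article doesn't describe their actual model); it only illustrates the general approach of guessing an undisclosed attribute from the share of a user's friends who disclose it. The user IDs, the generic labels "A" and "B", the function names, and the 0.5 threshold are all invented for illustration.

```python
from typing import Dict, Optional, Set


def labeled_friend_fraction(
    user: str,
    friends: Dict[str, Set[str]],
    labels: Dict[str, str],
    target: str,
) -> Optional[float]:
    """Share of a user's friends with a disclosed label that equals `target`.

    Friends who disclose nothing are ignored. Returns None when no friend
    has a disclosed label, since there is then nothing to base a guess on.
    """
    known = [labels[f] for f in friends.get(user, set()) if f in labels]
    if not known:
        return None
    return sum(1 for label in known if label == target) / len(known)


def infer_label(
    user: str,
    friends: Dict[str, Set[str]],
    labels: Dict[str, str],
    target: str,
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the study
) -> Optional[bool]:
    """Guess whether `user` has the `target` label, based only on friends."""
    fraction = labeled_friend_fraction(user, friends, labels, target)
    if fraction is None:
        return None
    return fraction >= threshold


# Made-up data: u1 and u5 disclose nothing about themselves.
friends = {
    "u1": {"u2", "u3", "u4"},
    "u5": {"u2", "u6", "u7", "u8"},
}
labels = {"u2": "A", "u3": "A", "u4": "B", "u6": "B", "u7": "B", "u8": "B"}

guess_u1 = infer_label("u1", friends, labels, "A")
guess_u5 = infer_label("u5", friends, labels, "A")
```

With this made-up data, two of u1's three labeled friends are "A", so the sketch guesses "A" for u1; only one of u5's four is, so it doesn't for u5. A real system would need a trained threshold (or a proper statistical model) and a lot more data than this.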

And it's not just sexual orientation that this kind of process works on:

Other work, by researchers at the University of Maryland, College Park, analyzed four social networks: Facebook, the photo-sharing website Flickr, an online network for dog owners called Dogster, and BibSonomy, in which people tag bookmarks and publications. Those researchers blinded themselves to the profiles of half the people in each network, and launched a variety of “attacks” on the networks, to see what private information they could glean by simply looking at things like groups people belonged to, and their friendship links.  On each network, at least one attack worked. Researchers could predict where Flickr users lived; Facebook users’ gender, a dog’s breed, and whether someone was likely to be a spammer on BibSonomy.

So what are the implications for market research? Well, since the industry was invented back in the 1920s (coinciding with the arrival of broadcast media, naturally), market researchers have been engaged in the study of consumer behavior based on three basic principles. The first is that if you ask people questions about themselves, they'll respond to those questions accurately. The second is that those responses expose patterns that are meaningful clues as to how similar consumers are likely to behave. And the third is that even if some of the people in question are lying or mistaken, in aggregate, the mob speaks truth.

Now, even those of us in the actual market research industry know there are giant problems with these assumptions. For one, people lie all the time. To themselves, as well as to telemarketers who interrupt their dinners with random calls about feminine hygiene. Meanwhile, identifying links between query responses and future behavior is a tricky craft that sadly is best performed in hindsight (polls are much better at telling you why something may have happened than what might happen next—this sleight-of-hand is why political pollsters manage to stay in business, even though their predictive results often suck lemons).

But the biggest problem in modern market research is the notion that aggregate results are still useful in a world where communications have become more and more granular. We don't live in a broadcast world any longer; messages and media are increasingly customized for microniches (or personalized for individuals). The secret sauce that marketers are looking for is a way to do this kind of detailed fine-tuning without requiring individuals to opt in: There are, after all, many messages that people would choose not to see, and many people who choose not to see any messages at all (hello, AdBlock Plus). And while bribing people to provide information works, it's still easier and cheaper to obtain information from third-party public sources, assuming you can do so with any consistent validity. And for information that's potentially dangerous to disclose, like sexual orientation, plenty of people choose to lie simply to protect their careers, or their families.

Which brings up the civil liberties element in this exercise. On the one hand, this program uses publicly available data to draw its conclusions, and social networks are fundamentally rooted in homophily, so there's no way of altering their business model to avoid generating patterns that might be useful to third parties, ranging from marketers to law enforcement. (It's not a huge leap to imagine that if you're, say, a child pornographer, a not-insignificant number of your Facebook buddies might share your sick interests.)

So is there any solution?

For one, social networks might eventually be forced to turn off public searchability of profile pages by default, and to make it much harder to opt in to exposing them. At the least, they should auto-alert all first-degree friends of a user who wants to make his/her profile public to search, and give them the option of hiding their friendship status on such pages.

You don't need to erect a total firewall; putting enough black holes in the social graph renders programs that rely on aggregate parsing, like the simple one created by those MIT students, useless.
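A toy illustration of why holes in the graph hurt this kind of inference, again in Python and again with invented user IDs, generic labels, and function names. It assumes a simple estimate based on the share of a user's visible friends who disclose a given label; any friend who has hidden the friendship link simply drops out of the count.

```python
from typing import Dict, Optional, Set


def visible_fraction(
    user: str,
    friends: Dict[str, Set[str]],
    labels: Dict[str, str],
    target: str,
    hidden: Set[str],
) -> Optional[float]:
    """Share of a user's *visible* labeled friends whose label is `target`.

    `hidden` holds friends who opted to hide their friendship link, so an
    outside observer never sees them. Returns None when nothing is visible.
    """
    seen = [
        labels[f]
        for f in friends.get(user, set())
        if f not in hidden and f in labels
    ]
    if not seen:
        return None
    return sum(1 for label in seen if label == target) / len(seen)


# Made-up data: u1 has four friends, evenly split between two labels.
friends = {"u1": {"u2", "u3", "u4", "u5"}}
labels = {"u2": "A", "u3": "A", "u4": "B", "u5": "B"}

full_view = visible_fraction("u1", friends, labels, "A", hidden=set())
partial_view = visible_fraction("u1", friends, labels, "A", hidden={"u2", "u3"})
no_view = visible_fraction(
    "u1", friends, labels, "A", hidden={"u2", "u3", "u4", "u5"}
)
```

With every link visible, the estimate for u1 is 0.5. Once the two "A" friends hide their links it drops to 0.0, and when all four hide there is nothing left to estimate from. How many links have to disappear before a real classifier breaks down is an empirical question this sketch doesn't answer.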

On the other hand, imagine, if you will, what Facebook would have turned into had Google purchased it, and combined its gargantuan pool of active search data with the site's burgeoning trove of profile data. Can you say...Skynet? I knew you could.

