Andreas Weigend | Social Data Revolution | Fall 2015
School of Information | University of California at Berkeley | INFO 290A-03

Video:
https://www.youtube.com/watch?v=qw3V4SrMaMM
https://www.youtube.com/watch?v=yXygI8y4XZ4


Introduction

Data is being generated everywhere. Google knows when you wake up and first check your email in the morning, and the power company can detect the minute change in current when you turn on your espresso machine. We all leave traces of data, whether we want to or not.

This class is about data of the people, data by the people, to make that data for the people, not against the people. Together, we will explore what world we might want to create, based on the data we have: data created implicitly just by living, and data created explicitly.

This is not about promoting the magic of data or the collapse of privacy but asking the very important questions about how we want to live with this abundance of social data.

This is about the Social Data Revolution.

Social Data and Where it Comes From:

Social data is data about people and their relationships.

The communication graph is a central element of social data. Who friends whom? Who follows whom? Who calls whom? Who ignores whom?
For example, in 1993 Michael Schwartz at the University of Colorado at Boulder studied email headers and constructed a social graph predicting who might want to know whom based on mutual contacts.
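The idea of suggesting who might want to know whom from mutual contacts can be illustrated with a small sketch. This is not Schwartz's actual method, only a minimal friend-of-friend count over a made-up list of (sender, recipient) pairs; the names `build_graph`, `suggest_contacts`, and the sample messages are all hypothetical.

```python
from collections import defaultdict

def build_graph(messages):
    """Build an undirected contact graph from (sender, recipient) pairs."""
    graph = defaultdict(set)
    for sender, recipient in messages:
        if sender != recipient:
            graph[sender].add(recipient)
            graph[recipient].add(sender)
    return graph

def suggest_contacts(graph, person):
    """Rank people `person` has not yet contacted by number of mutual contacts."""
    direct = graph.get(person, set())
    counts = defaultdict(int)
    for contact in direct:
        for candidate in graph.get(contact, set()):
            # Skip the person themselves and anyone they already know.
            if candidate != person and candidate not in direct:
                counts[candidate] += 1
    # Most mutual contacts first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical email headers reduced to (sender, recipient) pairs.
messages = [
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "dave"), ("carol", "dave"), ("carol", "erin"),
]
graph = build_graph(messages)
suggestions = suggest_contacts(graph, "alice")
print(suggestions)
```

Here alice has never written to dave, but both of her contacts have, so dave ranks above erin, who shares only one contact with her.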





The transaction graph connects purchases and payments. With 2 billion MasterCards in the world, it becomes easy to predict infidelity, for example. What you buy, when aligned with purchases of others, is a very telling social data point. It can reveal answers about interests and behavior.

Geolocation is perhaps the most interesting and scary data source we have. It gives the power to know 24/7 where someone is going, how fast they are walking, who they are meeting, whether they are turning off their phone, and whether someone else is turning off their phone at the same time and place. Google Latitude allowed users to see their exact location, as tracked by Google, over time.





To think about:

What would you not share about yourself? Financial data? DNA? Some of this you are already sharing and some of this data you cannot help but share. Some of this data, if shared, might provide you with a lot of value. While in this country we might shy away from talking about our wealth, we often already put it on display through the place we live, the car we drive, and the clothes we wear. You might not share your DNA, but if your cousin shared his and your sister hers, how much would someone be able to find out about you? We are sharing, intentionally and not, more and more information because we often get something in return, but could we get more in return for the data we share?

Example: What could we do for car insurance with social data?

Some car insurance is billed on a pay-as-you-drive system, but what about pay-how-you-drive? Would that be a better world? If you are speeding, this system would automatically take a bit of money out of your bank account. Most laws, like speed limits, were made in a time when there was no data. Should we use social data to figure out how fast you should be driving at any point in time? Must we have a fixed speed limit no matter what the weather is like or how many people are driving around you?




Studying Data

One of the ways we learn is by drawing distinctions. Do we assume data is persistent, there forever, or do we assume data is ephemeral, there for the moment and gone the next? What other distinctions exist? Quality and quantity? Accuracy and precision? For example, when Google Latitude reports you were somewhere you were not, is it a mistake? Did Google have the wrong model?

One big distinction is private vs. public - what is the scope within which we want to make data available?

Attitudes Towards Data

Give-to-Get Model

Many data refineries have a give-to-get model.
For example, in order for Google Maps to instruct “turn left at the next intersection”, it has to first know:
  1. Where you are.
  2. Where you want to go.
It would be very difficult for Google (or anyone!) to give directions without those two points.



The Value of Data

Data is as valuable as the decision that is impacted by that piece of data. What may be very valuable to you may not be worth much to the refinery at all. When people talk about data being valuable, always ask - what decision depends on it? For example, what part did that data play in deciding whether to buy certain stocks or what clothes to wear?

Assume you had all the data in the world... What would you do?

This course is about the enormous amount of data that we create in basically everything we do and asking the questions: What do we do with the data? What could we build based on what data we have about ourselves, and what we may have from other people?

The goal is to ask questions rather than just answer them. What you think is the case vs. what is actually the case - that delta is where the learning happens. Identifying your assumptions or expectations and then challenging them creates opportunities for learning.

PHAME

The last century was all about the physical sciences, running experiments on particles to see how they interact. This century is about the social sciences. More and more we will now be running experiments on, and then instrumenting interactions between, humans. PHAME is a mnemonic for remembering the steps:
  • P - Problem statement
  • H - Hypothesis
  • A - Action
  • M - Metric
  • E - Evaluate

Metrics

You need to think about how you will measure the data. It is easiest, and often very effective, to simply count it. We should also think about the data, and change in the data, over time. The metrics you use will determine the data you collect and the questions you can ask of the data.
  • “Not everything that can be counted counts, and not everything that counts can be counted.” - Albert Einstein, apocryphal

Two examples of creative social metrics:


Allegory of the Cave

With Facebook and Google, we are in the same situation as in the allegory of the cave. We are sitting in the cave and all we are looking at are shadows on the wall. We make sense out of these shadows. We see them interact. They don’t jump from one place to another. These are only shadows from our perspective. But what about the light source, shining on the objects and casting the shadows on the wall? True, Google does not make up webpages and Facebook doesn’t make up posts. Just as with the shadows, there is something real behind them, but do we have ways of actually understanding the light source? Is the light source actually more important than the shadows?

For Google, it would be entering a search term. My search term is real. And the pages are real. But the search order, the page ranking that has certain pages show up over other pages… That is under Google’s control.

For Facebook, my online friends are real and so are their posts, but the way my friends show up on Messenger and how their posts show up on my Newsfeed is dictated by Facebook.


Questions to Ponder:


  1. Have you thought about the quantity and diversity of the traces of data you are generating daily?
  2. What data are people not willing to share? Perhaps insurance, health history, Google searches, or financial statements? Why? Why not?
  3. Most laws were made when there was no data. So what should we do now? What should we build (or rebuild)?
  4. Does Google make mistakes too? Would we be able to tell?
  5. How do you put a price tag on your data? Is it related to how important the decision you make with the data is?
  6. When do the machines know us better than we know ourselves? When is it the opposite?




Administrivia:

  • Data Safari on 10/2
    • Meet at Hearst Mining Circle at 9am.
    • Lunch at Facebook
    • Dinner at Google
  • HW1 (due 9/27 @ 5pm):
    • If you had all the data in the world at your fingertips, what would you do? Come up with two experiments that you would like to run - one for Facebook, and one for Google. You don’t have to only use data that already exists.



Contributors:

Emily Lutz (emily.lutz@berkeley.edu), Andre King (andre.king@berkeley.edu), Daniel Griffin (daniel.griffin@berkeley.edu)