Is Your Organization Ready for the Impending Flood of Data?
You’ve got several orders of magnitude more data coming your way, warns Google’s chief economist.
With a mission to “organize the world’s information and make it universally accessible,” Google is a central part of the current focus on huge amounts of data. Even the name Google is rooted in largeness, as it was derived from googol, an alternate term for 10^100.
Hal Varian, chief economist at Google and emeritus professor at UC Berkeley, has been with Google for more than a decade and has unique insight into the past and future of data analytics.
In a conversation with Sam Ransbotham, associate professor of information systems at the Carroll School of Management at Boston College and guest editor for the MIT Sloan Management Review Data and Analytics Big Idea Initiative, Varian says that companies need to beef up their systems to function within an overwhelming data flow — including new voice-command system data and other computer-mediated transactions.
Thank you for taking the time to talk with us. Can you tell us a bit about your background at Google?
When I stepped down from being dean at Berkeley, I ran into Eric Schmidt, who told me he had joined this cute little company down in the Valley, and asked if I could come down and help them out. So I went down in May of 2002, and met everybody. There were maybe 200 people there, located in one building — Building Zero. I had so much fun during that year that I stuck around and consulted for Google when I returned to Berkeley. And then later, as the company grew so rapidly and there were so many demands on my time, I shifted over to a full-time position in 2007.
When I got there, I said, “Eric, what do you want me to work on?” He said, “Why don’t you take a look at this ad auction. I think it might make us a little money.” They had implemented this novel method for selling ads, namely the AdWords auction. That had been rolled out in February of 2002. When I got there, I took a look at it and ended up constructing an economic model, doing some fairly detailed empirical analysis, trying to understand its properties and how it should evolve and so on. So that occupied most of my time that first year.
You’ve said that Internet devices that interact with the physical world will soon be the norm, and that ubiquitous, constantly connected devices will learn on their own, with some verbal instruction from their users. How do you see organizations becoming able to take advantage of these devices?
There are a lot of processes where there are sensors, or what I call computer-mediated transactions: you have a computer in the middle of the transaction that can capture a wealth of information about the transaction.
As you know, for instance, GE has a lab on the industrial Internet, and what they’re trying to do is to improve their system monitoring for big devices like airplanes, electricity generators and so on.
And if you look at just mobile phones, they capture a huge amount of information that can be used for navigation and for reminding people to do things and for scheduling. The voice recognition — at least on the Android side — is so good now that it’s quite feasible to give almost all those instructions verbally. And we think, or I think, that the most natural interface for all those things is a verbal one: you speak to your house, you speak to your phone, you speak to your car. And I think we’ll find that is the norm for most kinds of activities of that sort.
Once these devices are collecting data, what about the ability to take that data and then to use it, process it, extract it, visualize and communicate it? What do organizations need to be doing to be able to do this?
Well, the challenge for most big companies is that they grow by acquisition. And so you end up with several separate data systems that don’t communicate easily.
Google has tremendous discipline in this respect. Basically, when we acquire a company, we integrate its software into our way of doing things. What’s great about that is you can take an engineer from one project on Google, move them to another project at Google, and they are productive pretty much immediately, because everyone is using the same conventions and the same coding style, the same basic blocks for storing and accessing data.
But Google is virtually unique in that respect. There are very few other companies that are able to do that. And because you have that integrated system, then it’s much easier to access data and use the tools that we’ve developed to do this kind of analytics. So, for example, creating a dashboard for some process. You’ve created some new system, you want to monitor it; you create a dashboard to display that monitor. That’s a half-hour implementation at Google because of these great tools that we’ve developed.
How is Google able to do this where others can’t?
It’s very costly, because when you do an acquisition, you bring somebody in, you’ve got to basically redo their system to align with Google’s.
On the other hand, Google’s system of doing things is good. It’s evolved over many years. It’s usually a step up over what people have when they come in. And that’s because Google is a company that’s run by engineers. It was founded by engineers. Larry Page, Sergey Brin and Eric Schmidt, they all have essentially PhDs in computer engineering, and so they were willing to spend the money to make this standardization happen across the company.
And I guess it’s a trade-off between paying up front one time and paying later, with every dashboard, in your example. I used to have a software company, and one of the things that used to drive me nuts was people saying, “Oh, yeah, we’ll just do it this way for now.” I’ve never seen a “for now” that didn’t turn into a “forever.”
Yep.
You mentioned that a lot of commonly used data analytics techniques don’t really apply to datasets with millions of observations. We came from a mindset of dealing with 100 or so observations, and so we have some, I think “bad habits,” are what you called it. Can you give some examples of some organizational bad habits? And how do organizations unlearn some of these bad habits?
For example, in one case, we’ve seen organizations where they’ve had to monitor a lot of data. They’ll build the system, as you said, “for now”; it’ll handle 90 days of data or something like that. And that’s really bad, because so much data is highly seasonal. In consumer data, there are holiday effects, there are weather effects, there are all sorts of things going on, and you just can’t do anything much with 90 days of data. So you have to have at least two years of data to really get the seasonality right in a lot of cases.
So a lot of these companies, they’re thinking too small or they’re just doing something for the moment. You’ve got to plan for much bigger, if you really want to provide high-quality service to your user base.
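The point about 90 days of data can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the synthetic `daily_sales` series, its baseline of 100, and its single yearly cycle are invented, not data from Google or any company mentioned here. It simply compares the average seen through a 90-day window with the average over two full years.

```python
import math
from statistics import mean

DAYS_PER_YEAR = 365

def daily_sales(day: int) -> float:
    """Hypothetical daily sales: a baseline of 100 plus a yearly seasonal swing."""
    return 100.0 + 20.0 * math.sin(2 * math.pi * day / DAYS_PER_YEAR)

def window_mean(num_days: int) -> float:
    """Average sales over the first num_days days of the synthetic series."""
    return mean(daily_sales(d) for d in range(num_days))

short_view = window_mean(90)                  # roughly one season of data
long_view = window_mean(2 * DAYS_PER_YEAR)    # two complete yearly cycles

print(f"90-day average:   {short_view:.1f}")
print(f"two-year average: {long_view:.1f}")
```

In this toy series the 90-day window covers only the rising part of the cycle, so its average sits well above the true baseline, while the two-year window recovers the baseline almost exactly. Real consumer data would also carry holiday and weather effects, which this sketch does not model.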
Another thing we do — this isn’t exactly a managerial issue — but when you look at econometricians and statisticians and so on, they’re used to working with relatively small amounts of data. They’ll do a lot of in-sample forecasting and things like that — see, this regression fits really well. But when you talk to a person who’s used to working with large amounts of data, they’re always going to do out-of-sample forecasting, out-of-sample predicting, because you get a much more realistic estimate of what it is you’re trying to predict.
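The gap between in-sample and out-of-sample error described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a method attributed to Google: the data is synthetic, and the model is a deliberately extreme one (it copies the answer from the nearest training point) chosen because it makes the in-sample fit look perfect. The names `make_data`, `predict`, and `mse` are hypothetical helpers.

```python
import random

def make_data(n: int, seed: int = 0) -> list[tuple[float, float]]:
    """Synthetic (x, y) pairs with distinct x values: y = 2x plus Gaussian noise."""
    rng = random.Random(seed)
    return [(i / 10, 2 * (i / 10) + rng.gauss(0, 1)) for i in range(n)]

def predict(train, x):
    """Predict y for x by copying the y of the nearest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def mse(train, points):
    """Mean squared error of the nearest-point model on the given points."""
    return sum((predict(train, x) - y) ** 2 for x, y in points) / len(points)

data = make_data(200)
random.Random(1).shuffle(data)           # shuffle so the split is not ordered by x
train, holdout = data[:150], data[150:]  # the holdout is never seen by the model

in_sample_error = mse(train, train)        # scored on the data it memorized
out_of_sample_error = mse(train, holdout)  # scored on data it has not seen

print(f"in-sample MSE:     {in_sample_error:.3f}")
print(f"out-of-sample MSE: {out_of_sample_error:.3f}")
```

Because the model memorizes its training data, the in-sample error is exactly zero, which says nothing about how it predicts new observations. The holdout error is the more realistic estimate.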
So at Google, we have two groups, the statisticians and the machine-learning people — and there’s some overlap in the groups, but I have to say, I think we’ve learned a lot from each other in terms of how to deal with these massive datasets.
Oh, and the one criticism — I will mention, the one thing that the machine-learning guys are not used to doing is taking samples. Because they want to work with a trillion observations, when it might be just as good to take a 5% sample. They find it challenging. And of course in production work, when you’re really doing the production, you may have to be able to deal with data that size. But when you’re doing the analytics, I’ve found that doing sampling is fine for lots of things.
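As a rough illustration of why a 5% sample can be enough for analytics, the sketch below estimates a mean from the full data and from a 5% simple random sample. The population is synthetic and far smaller than a trillion observations, and the function name `compare_sample_to_full` and its parameters are assumptions made for this example.

```python
import random

def compare_sample_to_full(n: int = 100_000, fraction: float = 0.05, seed: int = 0):
    """Return (full mean, sample mean, sample size) for a synthetic population."""
    rng = random.Random(seed)
    # Hypothetical measurements: mean 100, standard deviation 15.
    population = [rng.gauss(100, 15) for _ in range(n)]
    # Simple random sample without replacement.
    sample = rng.sample(population, round(n * fraction))
    full_mean = sum(population) / len(population)
    sample_mean = sum(sample) / len(sample)
    return full_mean, sample_mean, len(sample)

full_mean, sample_mean, sample_size = compare_sample_to_full()

print(f"full mean:   {full_mean:.2f} ({100_000} rows)")
print(f"sample mean: {sample_mean:.2f} ({sample_size} rows)")
```

The sample touches one-twentieth of the rows, yet its mean should land close to the full-data mean, because the standard error shrinks with the square root of the sample size. This holds for simple summaries like averages; rare events or heavy-tailed quantities may need larger or stratified samples.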
So, how do we get the new generation of managers to understand the data that’s available, and what could be done with it?
Well, there is this problem of getting the data from your point of sale or from your devices, from your customers, into the cloud. So you’ve got to set up that pipeline. And that can be pretty challenging, integrating the systems. But there’s enough commonality at this point that it’s much, much more straightforward than it once was.
So now you’ve got the data available in some data warehouse configuration, and then the question is, how do I access it? How do I input it in decisions? How do I utilize that data effectively? That’s where people are now. They say, “Let’s go hire a data scientist or some statisticians. Let’s go hire some data engineers.” And they find out everybody else is trying to hire the same people.
The bottleneck ends up being, in many cases, finding those skilled data scientists to hire. Now, I will say, universities have been very good at creating programs to educate people in this area. At Berkeley, at Stanford, at many other places around the country, they’ve kind of jumped into this — into providing such programs. So I think this shortage is going to be alleviated in a few years.
I’m not going to ask you to reveal the next cool Google thing that’s coming out that we don’t know about, but are there any initiatives that you’re pretty excited about at Google right now?
One of the things that I think we’ve been working on and we’re excited about internally and it’s getting a lot of external attention as well is Google Now, which is on the Android phones, and I think there’s also a version for iOS now. That’s basically a personal digital assistant. You mostly interact with Google Now through speech.
And it’s just like having an outstanding administrator who’s watching out for you and reminding you where you parked the car and your calendar appointments and what the weather’s going to be like on that trip you’re taking tomorrow, and on and on and on. It’s just proactive in answering questions.
Proactive is an interesting word choice.
Larry Page used to say, “The trouble with Google is that you have to ask it a question. You shouldn’t have to ask it a question; it should just give you the answer.” And that’s what Google Now does. It gives you the answer.
That’s a pretty exciting development. And by the way, I think that’s going to be a big area of competition in the industry, because of course, Apple has Siri, and Microsoft has Cortana, and they’re all going to be competing in providing these personal digital assistant capabilities.
You’re so pro-voice. On a personal level, there’s nothing more annoying to me than when I call in somewhere and get into the phone tree that wants me to speak, and then it doesn’t understand my beautiful Southern accent.
Yes, well, the nice thing about Google Now is that the voice recognition can be personalized to you. So it is much more accurate than the generic systems.
How you interact is a choice that you make, but you can also do interaction via the phone screen or via your computer. I think we’re going to move to this more natural way of just asking your house to turn on the lights or open the garage door, or — these sorts of things. It’s a more natural way to communicate.
I assume that the garage door, then, won’t start spouting off how much RAM it has and what version of the BIOS it’s using. I think that’s the other step we’ve got to get through too, where, when you ask the light to turn on, that it just turns on, rather than giving you a lot of details about its wattage.
Well, maybe if you’re lonely and need someone to talk to, you could talk to your garage door.
By the way, there’s an interesting angle on economic policy, because obviously I think we’ve known for a long time we should be doing the Consumer Price Index using scanner data, because it’s out there and available.
And the statistical agencies know this, too. It’s just that they’re kind of hard-pressed for funds, and they have to spend money on keeping the current systems going. But I think we’re going to get more reliable, more timely, more up-to-date economic statistics as well. There’s the public sector side of things, but there’s also the private sector. It can be a lot more responsive in these areas.
Even in the U.S., GDP comes out quarterly, right? It should come out daily. Maybe that’s an exaggeration, but it could come out at much higher frequencies. It’s technologically possible. It’s just putting those systems in place.
If you look at Walmart or Target or Panera, they can tell you how much money was spent in their stores yesterday, or maybe even a few hours ago. The government, though, takes three months to see how much expenditure is taking place around the country. Really, tapping into those systems or being able to utilize… the real-time data that’s generated in the private sector, you’d get much more responsive data-gathering in the public sector as well.
All this data that we see, the economic data gathered by the government, is gathered from the private sector. But it’s gathered through traditional means of questionnaires and surveys and documents sent around. That system is, you know, like the container ships. They used to unload the ships, load the cargo into trucks, drive somewhere, unload it again, do all this stuff. You can really eliminate those intermediate steps by modernizing the way data is gathered. That saves money and time and improves accuracy on both sides, both for the people providing the data and the people crunching the numbers.
Let me follow up with one question, which is: what do you see is next, and what’s the big thing that you’re worried about or excited about in the broader picture of data and analytics?
I think we’re going to see a lot more of these computer-mediated transactions, a lot more information being produced by sensors. And that will have the capabilities I described — customization, personalization, experimentation, contractual integration and so on. We’ll see a lot of activity in that area, as it diffuses from the high-tech, Silicon Valley companies into lower-tech businesses.
I mean, one business that I think is quite interesting is Panera Bread, the chain of bakery and snack shops. They’re actually very heavy users of information technology. They know what’s sold in every store, so they can experiment. Fifty percent of their clientele are using the loyalty cards, so they can personalize and customize offerings at different branches. And you think about bakeries: what is that, a 12,000-year-old business? But there they are doing this high-tech data analysis.
Everybody has the point-of-sale cash registers, everybody has a data warehouse now. As the expertise becomes more available, they will be able to do the same sort of thing. It’s going to be a major source of competitive advantage.
So it’s not just high-tech. It’s going to be every kind of business.