What is data science and engineering? To kick off the podcast season on the topic, Tom Dietterich, distinguished professor of computer science, explains what it is, how it is related to Big Data, and shares his thoughts on the pros and cons.
NARRATOR: From the College of Engineering at Oregon State University, this is Engineering Out Loud.
MOVIE CLIP (2001: A Space Odyssey): Hello Hal, do you read me? Do you read me, Hal? Affirmative, Dave.
RACHEL ROBERTSON: Hello podcast listeners. Do you read me? This is Rachel Robertson from the College of Engineering at Oregon State University. Welcome to the first episode of Engineering Out Loud. Today I have Jens Odegaard with me in the studio. Jens, can you tell us what you do?
JENS ODEGAARD: Yeah, I work in marketing and communications in the College of Engineering here at Oregon State and I kind of focus on nuclear engineering as well as radiation health physics.
ROBERTSON: Very good. And I’m also a communicator for the College of Engineering but I focus on electrical engineering and computer science. So, our listeners are probably wondering what that clip from 2001: A Space Odyssey has to do with engineering. It sounds pretty cool but it also has a purpose, which will be revealed later in the podcast. First I wanted everyone to hear from you, Jens. Since it was your idea to have this podcast, maybe you could just tell us a little bit about why you had this idea. What was the purpose?
ODEGAARD: So I come from a marketing and communications background but I’m talking to engineers here all the time on campus and they have really interesting research, really cool things that they produce and are building here, but most of the time it’s published in technical journals and in places where the general public wouldn’t run across it. So, I was thinking if we made a podcast to tell stories about their research and innovation that they are doing here it would open it up to a more broad audience and also help break it down from the technical details to a place that everyday folks can understand, like myself.
ROBERTSON: We should also mention that there is a whole team of us who have been working on this podcast including Krista Klinkhammer, Steve Frandzel, and Johanna Carson and you’ll be hearing from them all later on in the season. And our behind-the-scenes folks Mitch Lea, Jack Forkey, and Megan Kilgore. So, we should also tell our listeners a bit about the format.
ODEGAARD: Yeah, so the format of the podcast will be three seasons per year: fall, winter and spring to coincide with our academic terms here at Oregon State, and in each season we’ll explore one big topic and we’ll have six episodes per season that each tie back to that main topic from a different angle. So for this first season we are going to explore data science and engineering and now I’m going to turn it back over to you, Rachel, to introduce data science and engineering and tell us what that even means.
ROBERTSON: All right! So, here we go. So data science and engineering is an area of research that holds great promise for improving our lives, but also has some potential pitfalls.
MOVIE CLIP (2001: A Space Odyssey): Open the pod bay doors, Hal. I’m sorry, Dave. I’m afraid I can’t do that.
ROBERTSON: Now we are back with Hal in a fictional 2001. But in the reality of 2016 we have not yet experienced a robot takeover. Instead, the age of Big Data has ushered in a host of concerns we could not imagine in 1968 when the film came out. So, to help us understand this topic of data science and engineering I’ve invited Tom Dietterich to talk with us. Tom is a distinguished professor here, which is the highest rank the university can give a professor. He’s also just really fun to talk to. But more relevantly he is considered to be one of the founders of machine learning, a branch of artificial intelligence.
TOM DIETTERICH: So I’m very interested in how we can use data to build computer programs that do interesting things. And I do a lot of applied research in problems in sustainability and ecosystems and lots of the things that people on this campus are concerned about.
ROBERTSON: As we began putting together the episodes for this season we realized that data science and engineering can be tricky to define, but one thing I’ve figured out for sure is that it usually involves Big Data, extremely large data sets that can be analyzed for any number of applications. So, my first question for Tom was how is Big Data related to data science and engineering.
DIETTERICH: Well, so, when we talk about Big Data we could be talking about multiple things. On the one hand you have big data that is coming from scientific instruments — for example, the Hubble Space Telescope or any of these other huge telescopes that NASA has, or satellites that are sensing the Earth. They generate tons of data every day. But the other kind of Big Data that we think about a lot today is sometimes called our digital exhaust which is when we use our cell phones, when we search on the web, even now when we drive around in our car with our Google driving directions enabled, we are leaving a trail of the locations we’ve been, the websites we’ve visited. So that, for example, now Google can tell you for a restaurant or a stadium – when does it tend to be busy? And that’s based just on tracking how many phones were there at different times of the day. So, that’s a kind of Big Data…sort of as a side effect of other things we are doing.
ROBERTSON: So, now that we have a handle on Big Data, the next question is how is it related to data science and engineering?
DIETTERICH: So, the field of data science is really the marriage of statistics and computer science. In the past, data sets were small enough that statisticians could analyze them sort of interactively and manually. But with these massive data sets we need much more automation to find the interesting galaxies (if you are analyzing the telescope data) or to learn to detect where accidents might be happening because you are seeing some sort of bunching up of the cell phone traffic. And so we can’t afford to have people do all that manually. There’s just too much data and too many questions. And so that’s when …so data science then brings in automated advanced algorithms from computer science to deal with the huge amounts of data for those problems. Data engineering. Well, that means different things to different people. Some people think of it as engineering the data, but I think normally at Oregon State we think of that as doing engineering using data.
ROBERTSON: In fact, in this season of our podcast you will hear from several researchers who are using data for engineering in a broad range of topics: Geoff Hollinger is teaching underwater robots to incorporate human preferences in their decisions about where to travel to gather data; Xiaoli Fern is using machine learning to identify birds by their song from recordings in the woods to help biologists track bird populations; and Haizhong Wang models various evacuation scenarios – such as walking or driving – in response to tsunamis to help improve evacuation plans. But data engineering is not all for science.
DIETTERICH: Facebook encourages you to label your friends and then they use a machine learning system to …for the computer to analyze all of those images and the person’s name and try to figure out what kinds of patterns in the images predict that this is Tom Dietterich’s face versus someone else’s face. So, that’s doing engineering with data. There are many other examples. Self-driving cars have to recognize pedestrians and pets and dangerous conditions and so on and all of that is collected again by using data and then applying machine learning techniques to create the software to do those recognition tasks.
ROBERTSON: During this season we will talk about the many benefits of data science and engineering, but first we wanted to talk more about what the dangers are. One of the obvious dangers is the privacy concerns we open ourselves up to when we use applications and devices that can be tracked such as Facebook, Google and our cell phones.
DIETTERICH: We have generally had an expectation of privacy in the U.S. even in the public sphere where we are walking around on the streets. I mean, people say one of the attractions of living in a big city is the kind of anonymity you have, well, you don’t have that anonymity anymore. Presumably we will be able to have things like, you know, you can look up and ask where your friends are and if they have allowed you to access that information it will tell you they are in the subway at 25th Street and heading in your direction or something. And you can see how that might be useful but you can also see how that might also really quickly lead to dangerous things. It’s an ideal tool for a stalker or a terrorist or a criminal. So we currently are relying primarily on the fact that companies have a tremendous interest in not having privacy be violated because people will stop using their products, but we don’t really have strong rules and regulations and so many companies have been compromised in one way or another. Data sets compromised, I mean, you know, our social security numbers seem to be pretty accessible, credit card numbers, health records and so that’s I think is the big danger is that without really good care in computer security all this information has the potential to become public.
MOVIE CLIP (2001: A Space Odyssey): This mission is too important for me to allow you to jeopardize it.
ROBERTSON: So, back to Hal and Dave from the beginning of our podcast. Our fear of an evil artificial intelligence has been around for a while and it resurfaced recently when Bill Gates, Stephen Hawking and Elon Musk spoke publicly about the dangers of artificial intelligence. In fact, Musk declared it was “our biggest existential threat.” Tom Dietterich has been the voice of the academic perspective in this debate. So, we’ve talked about what you think the dangers are of Big Data, but what are they not?
DIETTERICH: Well, Hollywood loves the story of the robot that develops its own will and its own goals and those goals conflict with humans and so it decides it needs to kill the humans to achieve its goals. Perhaps the most realistic scenario might be from 2001 where the computer system on the spacecraft was programmed to prioritize the mission goals above things like keeping the crew alive and so it made a decision that the crew was a threat to those goals so it kills the crew. And of course then we have an epic man versus machine competition and man wins! And it’s a great story. I think that the real threats are more likely to be software bugs. We have a lot of trouble in making software that works correctly. As we all know, apps on our phones crash, our phones have to be rebooted and things like this, and you certainly wouldn’t want a computer system in a safety-critical application like a self-driving car that says, ‘oh, you need to pull over to the side of the road because I have to reboot.’ This is just …a car is not a phone and so developing software that is reliable and robust in these kinds of high-risk applications, that’s a huge challenge for computer science because it’s always been a challenge in avionics and spacecraft and even with the very best software engineering techniques mistakes still happen in those settings.
ROBERTSON: So, there are some obvious dangers to Big Data, but what are the benefits of it?
DIETTERICH: Well, I think there are tremendous potential benefits particularly in health care, for example. So, right now when someone is studying the effectiveness of a drug or looking for potential interactions between multiple drugs that might be adverse they really have to guess in advance what those might be and then include them in a controlled clinical trial. But there has been recent work at Microsoft that says we can detect adverse drug interactions just by looking at people’s web searching behavior. They might look up one drug and then another and then might also be looking up some symptom that they’re having. Other benefits, well, we already have a lot of benefits like I was mentioning before, I can now look at my phone and ask well, ‘When is this restaurant busiest?’ Or I can look on the Google map and say what time do I need to leave Corvallis in order to make sure I get to the Portland airport by 9:30 in the morning and it can tell me well, it usually takes between an hour and a half and two hours on a Thursday morning to get there. How does it know that? Well, because it’s got this information from people’s phones in their cars, plus some very clever algorithms. So, I benefit from other people sharing their phone data when they are driving. So, there are big benefits like medical and there are convenience benefits like driving. A lot of the benefits, it’s hard to even imagine what they are right now. I like to tell the story that when I was a young scientist we had just developed the internet and we thought the purpose of the internet was to allow computers to communicate with each other and so when the internet was rolled out and everything was working around 1985, we said, ‘okay, we’re done.’ We had no idea that the main purpose of the internet would be to put people communicating with other people and that this would be the revolutionary thing.
So, I think it’s very difficult to forecast what will be the ultimate ways in which we use these new technologies. But they are likely to be interesting!
ROBERTSON: To find out about the interesting things we are already doing with data science and engineering, keep listening to this season of Engineering Out Loud from the College of Engineering at Oregon State. This episode produced by me, Rachel Robertson, with additional editing by Mitch Lea. Our intro music is The Ether Bunny by Eyes Closed Audio on SoundCloud and used with permission via a Creative Commons 3.0 license. For more episodes visit engineeringoutloud.oregonstate.edu.