Author: Karl Schliep

September 11, 2021


Maneuvering the Machine Learning and Artificial Intelligence Ecosystem with Karl Schliep

008 - GAT Podcast: Force Multiplier

38 min read



MJ 00:00 Hi, everyone. This is Michael Jelen from the Global Applied Technology podcast. The GAT team, as we call ourselves, is a globally distributed team of software engineers, data scientists, graphic designers, and industry experts who serve clients through our products built atop the BRG DRIVE™ analytics platform. We're helping some of the world's largest and most innovative clients and governments transform raw data into actionable insights, drive efficiency through automation, and empower collaboration to improve business decisions. You can learn more about us, our products, and our team on our website, brggat.com. And if you have any questions or comments, please email us at gat@thinkbrg.com.

Today, I'll be speaking with Karl Schliep, who's the lead data scientist in the artificial intelligence and machine learning team at BRG. He's the one in charge of developing our state-of-the-art machine learning solutions. We cover the differences between machine learning and artificial intelligence, the legal implications of AI, and some pretty fun examples of this technology in action to catch oil-stealing pipeline pirates and analyze surveillance video. Please enjoy this conversation with Karl. Hi, Karl. How's it going?

KS 01:04 Hey. Good.

MJ 01:06 Great. Well, thanks for joining me today to talk about a topic that I think both of us are very excited about, machine learning and artificial intelligence. First, for people who haven't met you, I'd love if you could start off with a little introduction about who you are, how you first stumbled upon machine learning, and what really brought you to this space.

KS 01:24 Sure. Yeah, I'm Karl Schliep. I got my PhD in 2017 in materials science and engineering. And before that, I got my undergraduate degrees in chemistry and math. So, a lot of people wonder, "Why are you in data science now? You've got this whole career of science behind you. What made that whole transition happen? Why are you here today doing what you're doing at Berkeley Research Group?"

So actually, my whole history of data science and where it's coming from stems from my work in the sciences. I worked with a lot of image data. I had a bunch of images, and I was processing them all by hand, day after day. Actually, I figured out all the keystrokes. If I do control-C-up, enter, backslash, ta, ta, ta, ta, ta—I had this whole system. I'd sit there with a nice motion, click, click, click, click. I could work through hundreds of documents a day. And I did that for months, I would say. I'm a little depressed to say that, but in my PhD, months were probably spent doing this data analysis, this image analysis, by hand.

Eventually, I got wise, and I started looking into like MATLAB, some software that people use. And I figured out, hey, I can automate some of this. I can take these images, I can take these keystrokes, I can do this a smarter way. Started doing that a little bit, got better at it. I was doing hundreds in hours now.

Once I started doing hundreds in hours, my perspective changed. I was like, well, now I can analyze this quickly. What more can I do? I started learning more and more about how there's this whole environment, this whole community of people developing cool techniques for analyzing things. The whole open-source world: you can basically find anything now, where I'm not starting from scratch, building everything up all by myself in my own little bubble. I can start using these things that everybody else has been using for years. It was this cool tool that everyone else was using. And my eyes opened. I was like, "Oh my gosh, I've got to learn how to do these cool things."

So, it must have been 2017, or maybe beforehand, that I started looking into Python, working with Python, still doing my science-y stuff. And eventually, I decided the science was too much; I was going to move more toward learning this cool new tool. And so, I joined the data science workforce, worked in healthcare for a little while, and then moved over to Berkeley Research Group, where I'm now working in the legal space of data science.

MJ 03:50 Wow, that is awesome. I love it. Sometimes frustration and repetition drive innovation and a better way of doing things. So super glad you were able to automate out that very manual process and move into something that's more exciting and fun.

KS 04:03 Yeah, absolutely. Wasted a lot of time.

MJ 04:07 Absolutely. And so, for people who aren't familiar with machine learning—I know it's a very, very broad subject that covers a lot of different verticals and specializations. Would you be able to maybe just start at the ground level and build this up so that we know what that really means when you're talking about machine learning? And maybe even throw in artificial intelligence. How are those two related? And what does that really mean?

KS 04:28 Sure. Yeah. So, machine learning in its most basic sense is just a numerical solution to problems. People were running into these problems, and they found smart ways to come to a solution. The most common one is a linear regression. We've all done it using Excel. You can make a couple of clicks, and you get a linear regression. But what is that actually doing? That is the fundamental idea of machine learning. You have some problem, and you change the way you think about it. Rather than finding the best solution, you want to find a solution that minimizes how wrong you are. So, one way to do it, you might think naively, is, "I'm just going to try every possible solution. There's only so many, right?" In the numerical sense, there's only so many. So, you try every single one. And that's kind of the basis for a lot of it.
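
The idea Karl describes here, reframing a fit as "minimize how wrong you are" and, naively, trying every candidate solution, can be sketched in a few lines of Python. The data points and the grid of candidate slopes below are invented purely for illustration:

```python
# Toy version of "minimize how wrong you are": fit y = m * x by trying
# many candidate slopes and keeping the one with the smallest squared
# error. (Data and candidate grid are made up for illustration.)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

def squared_error(m):
    """Total squared error of slope m on the data: how wrong are we?"""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys))

# "Try every possible solution" over a coarse grid of slopes 0.00..5.00.
candidates = [i / 100 for i in range(501)]
best = min(candidates, key=squared_error)

print(best)  # a slope near 2.0
```

The exact least-squares answer for this data is a slope of about 1.99, and the brute-force grid search lands on the same value; smarter algorithms just get there without trying every option.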

As time went by, we got smarter. We developed better algorithms for doing it faster. And we developed cool new techniques for actually looking at the data in different ways. So, the fundamentals of machine learning are we're trying to just get to some optimized solution. We're trying to get to the best solution. In real cases, nobody ever gets to their best solution. We reach a solution that's good enough. I don't need to be right 99.99999% of the time. If I'm right 95% of the time, that's good enough for most applications, right?

MJ 05:41 Yeah.

KS 05:41 So that's kind of machine learning as an idea. And why is it starting now? This is probably a thing that I didn't hear about 15 years ago. Why is machine learning starting to become more prevalent in society? It's because computers got a lot faster and algorithms got a lot better. We had the open-source community to help spread it far and wide, making it easier to get into. And then we also had cloud computing. I have access, right from my computer, to hundreds of computers where I can say, "Hey, all these computers in the world, I'm going to use you for 20 seconds to solve this really hard problem. Done and done."

MJ 06:22 That's super awesome. And I think it's important to note that this comes from standing on the shoulders of giants who figured out these algorithms little by little and tinkered with them over the course of years. Every person in the community seems to be building on that to create better and better packages, to the point where now—not to say that getting started in it is trivial—there are things like PyTorch and TensorFlow that, pretty much out of the box, are able to help you use saved models that exist out there and apply them to your specific problem.

So, I don't know, could you spend a little bit of time talking about maybe some of the major verticals in different areas? Things that come to my mind are, what is a convolutional neural network? What's a recurrent neural network? How do these things fit into this overall machine learning ecosystem?

KS 07:11 Yeah, sure. So, without getting too technical, deep learning is a subset of machine learning. That's where you don't really know what you're looking for. Right? With classic machine learning, if you're looking at an X-Y axis, you have two variables, X and Y, and I want to minimize something there. With deep learning, we're looking for something deeper, something beyond what humans can try to optimize. So, we actually set up these problems where we pass our system through a bunch of different variables and we just tinker, tinker, tinker, and we get to some fine-tuned solution. That's what a lot of these convolutional neural networks are doing. It's a neural network of a bunch of tinkering knobs. And we let the system flow. I mean, these take weeks, months to train. I think there was a recent natural language processing model that came out that trained for like 100 billion years or something; they probably had 500,000 computers all operating on 16 cores, all running for like nine straight months to come out with this one model. This one model.

Now, it's not used for only one thing. Everyone else can take this model now. It's free on the internet. You can take it, you can add in your specific information, and it tinkers slightly to give you a more specific solution to what you're looking for. It's amazing the types of technologies we're able to get out of this by framing the problem this way, where we can come to an awesome solution that can be generalized across the board. That model was GPT-3: basically, a language interpreter.

So, I can type anything into it. I could say, "Program me a video game," and it would start writing out in a language like Python or C. It would start programming you a game. It's so smart that you can actually tell it to program itself. And it can probably learn to do that in different ways. There are limitations, but you could tell it, "Write me a novel like Hemingway," and it would write you a novel like Hemingway. It would take all of the things that it knows about Hemingway, all of the books that Hemingway has ever written, and make up something that's not perfect, but not bad for being generated entirely by a computer from one question. It's unbelievable.

MJ 09:30 Yeah, it is pretty amazing. So I guess this family of technologies that we're calling machine learning draws upon a bunch of different kinds of methods to try to solve these problems. And over time, these methods have been honed, and new methods invented. As you mentioned, deep learning is an entirely new kind of methodology where we're essentially trying to create artificial neurons to simulate the way the human brain works, turning switches on and off and changing the weighting and importance of certain things over and over again, so that any pathway that seems to be working gets stronger and stronger. And, I guess, it continues to optimize in that direction.

But as a result, as a user, what we have is a computer we can talk to. And it can talk back to us. Or in some cases, as you're driving a car, it detects people, slows you down automatically. All of the computer vision work is incredibly interesting and useful. And so, it's cool to see this being applied, as you mentioned, both in the natural language processing text world, along with the images and video recognition that's coming out right now. So very, very cool space. And it's amazing that these have all been packaged up into pretty easy to use, straightforward things these days. But it's really quite novel when you go back and think about it. That's very cool.

KS 10:45 Yeah, absolutely. And you're touching on a topic that we mentioned: artificial intelligence. So, machine learning is a subset of artificial intelligence. Artificial intelligence is the umbrella that encompasses every different way that humans interact with a computer. The idea of artificial intelligence is that we're trying to make a computer system that does something a human could do, but better. We want a computer system to be able to do things like move your body.

We've got robotics, which is a subset of artificial intelligence. We want it to be able to see faster than we do: we've got computer vision. We want it to be able to read text to us: there's audio. There's also thoughts: natural language processing. We're trying to get the ideas from our mouths into what a computer might understand. So, a lot of what artificial intelligence is, is trying to emulate human actions in a computer and have the computer perform them better than we can. So, if you just asked me, "Write a Hemingway book," oh man, my wife would kill me because I'm not the greatest or most read up on Hemingway. It would not be good. And so, with a computer, we're already surpassing in a lot of ways what humans can do.

MJ 12:01 So, is it safe to say—?

KS 12:01 That is artificial intelligence: how to make a better human.

MJ 12:06 Yeah. So, is it safe to say, sort of, that machine learning is applying mathematics to solve specific problems, and then artificial intelligence is the application of machine learning to specific problems in order to solve them in a way that makes it appear intelligent to us as a user?

KS 12:24 Absolutely. Absolutely. That's right. Yep.

MJ 12:26 And so, I guess when we start to take a step back and look at all the different industries that are using this, it seems like it's something that's a buzzword but also being applied in many, many different spaces. I know we already mentioned computer vision and driverless cars, the ability to generate content, create books like Hemingway. How do these things work? Do we start with samples that are trained, or does it just kind of go out and ingest all the information? How do we start this whole process? And where does the data come from?

KS 12:55 Yeah. So, there are actually two main ways that we use machine learning in any instance. We're either using supervised learning, where we've got some sort of historical data and we're trying to find trends in it. That's like linear regression: we've got some data, we want to find a trend in it. But there's also unsupervised learning, which is trying to cluster everything together, to find similar objects. So, if you had a bunch of pictures of dogs and cats, one way you could look at it is: I want to find all the things that are dogs in one group, so that later, if I have something that looks kind of like a dog, it would know, "Oh, this is a dog," without going back to a bunch of labeled data that said, "This is a dog, this is a cat," as inputs.
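
As a toy illustration of the supervised/unsupervised split Karl describes, here is a sketch in plain Python, with made-up one-dimensional "features" standing in for the dog and cat pictures:

```python
# Supervised vs. unsupervised learning on invented 1-D "features"
# (say, ear length in cm). Purely illustrative.

# --- Supervised: we have labels and learn from them (1-nearest neighbor).
labeled = [(4.0, "cat"), (4.5, "cat"), (9.0, "dog"), (10.0, "dog")]

def classify(x):
    # Predict the label of the closest labeled example.
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

print(classify(9.5))  # "dog"

# --- Unsupervised: no labels, just group similar points (2-means, 1-D).
points = [4.0, 4.5, 9.0, 9.5, 10.0]
centers = [points[0], points[-1]]          # crude initialization
for _ in range(10):                        # a few refinement passes
    groups = [[], []]
    for p in points:
        nearest = min((0, 1), key=lambda i: abs(p - centers[i]))
        groups[nearest].append(p)
    centers = [sum(g) / len(g) for g in groups if g]

print(sorted(round(c, 2) for c in centers))  # two cluster centers
```

The supervised model needs the "cat"/"dog" labels; the unsupervised one never sees them and still separates the two groups.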

So, those are the two ways you can look at it. And its uses in society go from face detection on your phone to spam filtering in your Gmail to Netflix recommendations. How does it know what I want to watch? The way Netflix does it is, it clusters you together with similar users. Those other users will watch some new show, and it'll get recommended to others. If it's picked up and they start watching it, then that little cluster, that bubble, grows. And now everyone in that same bubble gets one more thing that they can look into and watch.
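
The Netflix-style clustering Karl outlines, grouping you with similar users and surfacing what they watched, can be sketched like this; the users and shows are invented for illustration:

```python
# Toy "cluster you with similar users" recommender: suggest the shows
# your most similar user has watched that you haven't. (All data invented.)

watched = {
    "ana":  {"Dark", "Lupin", "Ozark"},
    "ben":  {"Dark", "Lupin", "Narcos"},
    "cara": {"Bridgerton", "The Crown"},
}

def recommend(user):
    others = [u for u in watched if u != user]
    # Similarity = number of shows watched in common.
    most_similar = max(others, key=lambda u: len(watched[u] & watched[user]))
    suggestions = watched[most_similar] - watched[user]
    return most_similar, suggestions

print(recommend("ana"))  # ben is most similar; he suggests "Narcos"
```

Real systems use far richer similarity measures, but the core loop, find your neighbors, surface what worked for them, is the same.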

The sciences, coming from them, it's slowly being adopted there too. The way they do it is, you have this giant scope of materials. Polymers, let's say. You've got a bunch of different chemicals you could throw together, and they all do different things. There are too many options to choose from, so they use machine learning to decide, "I'm going to try this system next, then this system." They use gradient descent, an optimization algorithm for deciding which way to go in this giant landscape of options. It'll lead you to the best solution faster.
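
Gradient descent itself is simple to sketch: rather than trying every option in the landscape, you repeatedly step downhill along the slope. The one-dimensional "badness" function below is made up for illustration:

```python
# Gradient descent in one dimension: follow the slope downhill toward
# a good solution instead of trying every option. We minimize a made-up
# "badness" landscape f(x) = (x - 3)**2 + 1, whose minimum is at x = 3.

def f(x):
    return (x - 3) ** 2 + 1

def gradient(x):
    return 2 * (x - 3)  # derivative of f

x = 0.0       # start somewhere arbitrary in the landscape
step = 0.1    # learning rate: how far to move each iteration
for _ in range(100):
    x -= step * gradient(x)  # step against the slope

print(round(x, 3))  # converges toward the minimum at x = 3
```

With many variables (many candidate chemicals, say), the same downhill-stepping idea applies, just in a much higher-dimensional landscape.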

It's also used in healthcare and finance. I mean, predicting the stock market, everybody always wants that one. What am I going to buy next on Amazon? They send you an ad? Well, yeah, I would like that. Doctors' offices use it. I would not be surprised if someone told me that it was used to produce the COVID-19 vaccine. It's prevalent everywhere, but it's kind of behind the scenes. It's under our radar. We just get the end products out, and we get to have fun with them and use them in our everyday lives.

MJ 15:20 Yeah, it is really fascinating how general purpose this technology is. And it can be applied everywhere in the world. And I guess in our day-to-day life at BRG and the things that you're working on, how do you use this technology? And what are some examples that you've seen with your clients?

KS 15:35 Yeah. So, we work in the legal space. We're trying to help lawyers out. I mean, lawyers have a really tough problem with how data has been exploding over the last couple of decades. Back in the day, when Enron happened, to look through all their emails and things and track down who did the wrongdoings, they just brought in a thousand lawyers. Those thousand lawyers—my boss was one of them—looked through all these documents by hand. And that's kind of all you could do. It was tedious and it took months. But nine months in, they were able to find the things they needed.

I mean, on a previous case, I had 12 terabytes. Actually, I'm working on a 20-terabyte case coming up soon. 20 terabytes of data. We're talking—

MJ 16:22 And just for anyone who doesn't know, yeah, how many pages would that represent if you were to estimate?

KS 16:26 Billions.

MJ 16:27 Billions. Okay.

KS 16:28 Billions of pages of documents. And these can be emails. They can be Excel documents. These are PDFs. These are all sorts of different types of information that these lawyers get to review and look through and try to figure out, when did person X know idea Y? Because that's very important. They need to know, when did you know this? Is this your fault? Is it somebody else's fault? And so, the way that we use it in our group is, we're trying to help these lawyers out. For a long time, they could do it with just sheer willpower. But now there's no way. So, we've got a bunch of systems in place to actually help them out.

So, the first thing lawyers want to do when they get a new case is, they want to know, what are the most important things I need to know about this case? So, there's this hot-document retrieval, where they have a couple of keywords; they know this lawsuit is about pickle jars or something. So, we search through and find all the documents that have "pickle jar" in them. Sure, that's cool.

From there, they can start formulating their ideas of how they're going to try this case and get a better idea of who the people involved are. The next step is they actually need to identify all of the important documents. Not just those about pickle jars, or coal, or whatever the important term is. They need to find slightly broader terms. So instead of coal, maybe they're looking at flue gas or medical adhesives, things of the same genre. So, they need to figure out what keywords they can use, what information gain they can get by adding more and more documents into the scope of what they want to look at.

Once they have their scope of all the documents they want to look at, they can throw away the softball team emails, they can throw away the beers on Friday night emails. So, they don't care about those. Once they have their scope of like, "Okay, here's all the documents that we kind of know are probably important," they have to filter it down again. Because at the end of the day, for these lawyers, they need to boil down, they need to filter down all this information into something that's digestible in the courts. They need to go to a judge and in front of a jury and say, "Here's what we know in less than a couple of hours." Because nobody's going to sit through one hundred days of us going through millions of documents.

So, after that, we have what's called technology-assisted review, TAR. It's a big thing in the field. Basically, let's say your scope is a million documents. You could throw a thousand lawyers at it, but there's no way anybody wants to pay for that. That's super expensive. So, we've got this technology-assisted review that is the same idea I was talking about for determining which polymers to use: it's a continuous learning model, or active learning model, where we give it a little bit of information and it points us in the right direction. And it just keeps pointing us in the right direction, so that instead of a million documents, maybe we only have to look at twenty thousand. That's a huge increase in productivity. At the end of analyzing those twenty thousand, we can say with 95% confidence that we missed less than 1 percent of the documents, which for most court cases is totally fine. They're not going to drill you down to every last document, because the courts would be backed up for years.
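
A heavily simplified sketch of that active-learning loop, not BRG's actual TAR system, might look like this: rank the unreviewed pool by similarity to what reviewers have already marked relevant, and review the top-ranked documents first. The documents and seed label below are invented:

```python
# Toy technology-assisted review: review a seed document, rank the rest
# by word overlap with known-relevant documents, and review the
# highest-ranked next. (Documents and the seed label are invented.)

docs = {
    1: {"pickle", "jar", "contract"},
    2: {"pickle", "shipment", "invoice"},
    3: {"softball", "team", "friday"},
    4: {"beers", "friday", "night"},
    5: {"jar", "invoice", "dispute"},
}
relevant_seed = {1}            # reviewers labeled doc 1 as relevant
reviewed = set(relevant_seed)

def score(doc_id):
    # Similarity to known-relevant docs = number of shared words.
    return sum(len(docs[doc_id] & docs[r]) for r in relevant_seed)

order = []
while len(reviewed) < len(docs):
    # Pick the highest-scoring unreviewed document to review next.
    pool = [d for d in docs if d not in reviewed]
    nxt = max(pool, key=score)
    order.append(nxt)
    reviewed.add(nxt)
    # A real system would retrain on the new label at this point.

print(order)  # likely-relevant docs surface before the softball emails
```

The pickle-jar documents (2 and 5) come up for review before the softball-team and Friday-beers emails (3 and 4), which is the whole point: reviewers spend their hours on what the model thinks matters.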

And so, there's a bunch of steps along the way. And our job is basically just to help the lawyers out. They have a bunch of crazy ideas. "Oh, I wonder if we could find all the things that have handwriting on them?" Okay, sure. We can develop our own handwriting detection model in-house, apply it to your documents, and say, "Hey, here's a flag for you. Do you think these are important?" You want to know all the major important ideas? We call them entities. Do you want to know all the proper nouns that people are talking about? Do you want to know how they're linked together?

There's link analysis: "Wow, everybody talks about Amazon online. Oh, they're all shopping over Christmas. Okay, that's fine. That makes sense." So, we try to provide what we call a fact pattern analysis, information that the lawyers can then digest and be like, "Okay, I'm looking through time. A lot of people here are talking." We have outlier detection. We say, "Okay, most of the time they only send ten emails on a Monday. This Monday, they sent four hundred. Why did they send four hundred emails?" And we also do sentiment analysis. Why are so many of them negative? Why is the mood detected in our sentiment analysis bad?
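
The outlier check Karl describes amounts to flagging days whose email volume sits far from the typical level, for instance more than two standard deviations from the mean. A minimal sketch, with invented counts:

```python
# Minimal outlier detection on daily email volume: flag any day whose
# count is more than two standard deviations from the mean.
# (The counts are invented for illustration.)
import statistics

daily_emails = {
    "Mon": 12, "Tue": 9, "Wed": 11, "Thu": 10, "Fri": 10,
    "Mon2": 400,   # the suspicious Monday
}

mean = statistics.mean(daily_emails.values())
stdev = statistics.pstdev(daily_emails.values())

outliers = [day for day, n in daily_emails.items()
            if abs(n - mean) > 2 * stdev]

print(outliers)  # only the 400-email day stands out
```

Real fact-pattern analysis would segment by sender and weekday and use sturdier statistics, but the shape of the check is the same: model "normal," then flag what falls far outside it.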

So those are the things we isolate, the information that the lawyers can then use to start building their case. And then they can look deeper. And I could go on and on about the different things that we do, but I don't want to bore you. We are the people they come to when they're like, "I have data questions, and I don't know how to get to the right answer." We're your guide through the maze that is the data set.

MJ 21:36 Very cool. Yeah. And it's very interesting. I've been in this space as well for the past fifteen years, watching from the very beginning. I remember being in a room where I was the technology person in charge of ensuring that four hundred review attorneys had documents to look at. And at that point, they were manually redacting, putting little black boxes over certain things. The very beginning of this was that we could at least put the documents in some digital format and have people manually look at them rather than flip through actual physical pages. That was the first thing. And then after that, having the ability to read the documents and at least search for keywords was a huge benefit. But at that point, I guess the clustering technology that you mentioned wasn't quite there yet. So if things were misspelled, you wouldn't be picking them up. And I remember looking at lists of keywords that had all the different possible spellings of certain things. So that was maybe the next phase. And then I think it all really exploded once you started to leverage machine learning in that space, to cluster together similar documents and concepts, and to start to understand and classify what kinds of documents we're looking at. Are these Excels? Are they Word documents? Are there videos in there? And from that point forward, it really took off.

And some of the stuff that you're doing with linking entities together, applying that sentiment analysis and understanding: "Hey, when they talk about this is, is it positive? Is it negative? Are they angry? Why is it?" I think that really has gone very, very far to be able to get us to the answer or the key information as quickly as possible, finding that needle in a haystack.

But I assume that this wasn't a very easy process, convincing judges or regulators that, yep, this is a magic black box of technology where I just click a couple of buttons, and this is what comes out. That's not something that's very comfortable for someone who has probably been in the legal field and used to dealing with paper for quite some time. Tell me a little bit about how you've been able to convince regulators. I assume that's a big challenge in this industry.

KS 23:38 Yeah, it absolutely is. And as with a lot of things in the legal space, it's precedent. We didn't set the precedent, but sometime back in the early 2000s it started becoming a thing. And there's been a lot of academic research on it. They have, let's say, ten thousand documents; they give one lawyer a chance to go through and label them. They get fatigued. And then there's the model. The model they used was able to do it in one hour; the lawyer by themselves, it took them twenty hours or something like that. And they compare. And there are multiple of these studies. Year after year they keep proving the same thing over and over again: you can throw humans at it, but humans still make mistakes. Maybe after your five hundredth document of the day, you start skimming, or you miss an important word, an important phrase. Maybe you didn't have enough coffee that day, you know?

MJ 24:31 Yeah.

KS 24:31 So, all these human factors. I mean, in the legal space at least, people think that humans are perfect when they're reviewing things. They're not. And we can link back to these articles. We can tell them it's better, it's faster, it's cheaper. The main way we try to convince them is monetary. If we can convince them that you can make more money doing it our way, oh, then it's easy. Then it's easy. Convincing the judges, getting it admissible in court, and defending it in court, that's always a difficult thing.

And a lot of that's education. We have to educate the courts about how these systems work. Otherwise, you get somebody up on the stand, and they spend four hours trying to talk everybody through it and make sure everybody's on the same page, especially when it gets into the heavy statistics. A lot of what we're doing is built upon mathematics and stats. And if the courts aren't educated, and there aren't precedents before you, I can imagine being in court with two statisticians going back and forth trying to explain how this stuff works to the rest of the court. It doesn't end well. So without the precedent, without the monetary value, I think it'd be pretty difficult.

The other one is necessity. There's no way they can do 12 terabytes of data. They just can't. I mean, if they tell their client, "We have 12 terabytes of data; that will take us seven years," the client will be like, "Okay, see you." That's actually something lawyers are selling in their own right: "We are able to handle your big data cases. You've got big data? We know the people who can help handle that." And so it gets them more clients. They bring it to us. We help them out. We guide them through. And that's how we've been able to sell a lot of what we've been doing.

MJ 26:24 Cool. And what are some of the challenges that you're coming up against right now? Because as you went through the evolution and the history, we talked about kind of each of the different major milestones and challenges that were difficult to overcome. What's the current challenge? Where are we right now in the current state of this industry?

KS 26:41 So the industry is actually blowing up. There are so many legal startups. I think they smell the blood in the water. They know that there's money to be had. Back in 2007, there were legal startups too, but they didn't have the precedent. It wasn't so commonplace. Now that it is, there are so many different avenues in the legal field that you can take to try to make a name for yourself. It's like the wild west of startups. Twenty pop up; one comes out the sole survivor. And they all have their different ways of taking on the ideas. So there are a couple of commercially available pieces of software that people have. And their biggest selling point is that they host the data and they have all their analytics and machine learning baked into it. And they've had probably one hundred court cases, so they've got precedent. Everyone knows this software works. It's easy-peasy.

The thing is, we don't really want to compete with that, because they've got their whole thing. We want to add additional features. So they can use that software, and it might get them an average answer. If they need something more, if they're like, "This model only gets me 90 percent of the way," come talk to us. We can get you to 95. We can also answer all the things that this software doesn't do. It's only got a couple of things that it does, and we do a bunch of other clustering and guiding along the way.

MJ 28:14 I was also wondering if you could maybe give me a couple of examples of some of the different specific areas where maybe you take it from 90 to 95 percent. What are some tangible ways that lawyers are seeing that?

KS 28:28 So yeah, in the review process. Let's go to that. Let's say you've got a million documents. If your model is only able to get you to 90 percent, you're still reviewing a bunch of documents. And if we're able to get you 5 percent more, that drastically reduces—I mean, that halves what you have to do. That halves the work that you're going to have to put into it. It doesn't work out exactly that way, because with how the models work, we're guiding you up to a certain point. So I think in the past, we don't have a good example for saying, "If you had done it this way, you would have had to review fifty thousand documents. If you did it our way, you would have only had to do twelve thousand." But that's kind of where we're at. And those are metrics that I think we need to outline and figure out for ourselves.

MJ 29:13 Yeah. Another one that I thought was super interesting that we're collaborating on together was a recent law that was passed relating to the amount of time that someone spends in a physical location while they're at their workplace. And if they spend a certain amount of time in that physical location and don't have to reach very far or move around a whole lot, then it would be required for the employer to provide them seating. And I know this is a big deal with large warehouses and people who have more stationary-style jobs, where you may be serving customers at a counter or something like that. But it was just fascinating to me that—the old way of solving this problem was someone would go and stand in that location, that office, whatever, with a clipboard and watch that person all day and see if that person is moving around, reaching for things, how much of the time are they spending there. And they have like a stopwatch going back and forth. And I'm sure it sort of relates back to your initial frustration of manually hitting shift-up-C or whatever it was. There must be a better way to do this.

And so you were able to build a model that would look at the security camera footage and measure exactly where the person was. We define a specific hot zone around their workstation, then use body-position measurements to see what they're doing. Are they bending? Are they reaching? Are they sitting? Are they squatting? Then, in aggregate, you can look at that and give a court, or anyone, a pretty clear picture of whether this person spends the required amount of time in that space to need a chair at the office. It was incredible to see all of that automated into a single package and displayed on the screen, with a little bounding box following the person around the video. I thought it was really quite powerful. And I don't know that it would have been possible even a handful of years ago, but the technology has gotten to that point now.
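The dwell-time idea described here can be sketched in a few lines. Everything below is hypothetical: the detections would really come from a video model, and the zone coordinates, frame rate, and helper names are made up for illustration:

```python
# Given per-frame person detections (bounding boxes) from a video model,
# count how long the person's position falls inside a predefined "hot
# zone" around the workstation.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def dwell_seconds(detections: List[Box], hot_zone: Box, fps: float) -> float:
    """Seconds the detected person's center spent inside the hot zone."""
    zx1, zy1, zx2, zy2 = hot_zone
    frames_inside = 0
    for box in detections:
        cx, cy = center(box)
        if zx1 <= cx <= zx2 and zy1 <= cy <= zy2:
            frames_inside += 1
    return frames_inside / fps

# Toy example: 3 of 4 frames inside the zone at 2 frames/second -> 1.5 s
frames = [(100, 100, 140, 220), (105, 102, 145, 222),
          (400, 90, 440, 210), (110, 100, 150, 220)]
zone = (80, 80, 200, 260)
print(dwell_seconds(frames, zone, fps=2.0))  # 1.5
```

The real system layers body-position classification (bending, reaching, sitting) on top of this, but the core measurement is the same: detections per frame, aggregated into time-in-zone.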

KS 31:12 Yeah. And the best part is this isn't straight off the shelf. It's coming straight from the researchers. We evaluated, I think, four different researchers' works, where the state of the art is a 2021 paper. Because of how far open-source programming has come in the sciences, the researchers publish their own GitHub repos. We download a repo, set up the whole system, apply it to our samples, and then test across those different works which one performs best in our situation. And that's the product we're bringing to people.

These systems didn't exist months ago. The research paper probably came out in January, and we're applying it already. So it's just amazing how far and how fast we're able to develop these products for people, and how useful they are. I mean, we're cutting out hundreds of hours of people's time. As you said, you'd have had ten people in stores with cameras, so they could go back home, play the footage back, and sit there with a stopwatch and a clipboard writing it all down. The total number of hours is reduced drastically, and it's a lot easier nowadays.

MJ 32:22 Yeah. And I think one of the things that I know is very important is the input to all of this. And we had talked about, in some cases, the pages being the raw data that we're looking at. In this situation, we're talking about security camera footage as the raw data. But I mean, if you were to be speaking to a client about some of the most important things, how would you describe the importance of input data in this process? And how critical is it that that's, I guess, clean and informative and accurate?

KS 32:50 Yeah. As data scientists, this is something we always harp on: we can't pull something from nothing. If you want to be able to use machine learning and data science, you've got to have the data. So, if you're a business owner and you want to evaluate how your systems are doing, you have to have the data. You have to have a schema built up. You have to have your retention policies: I want to keep this data for five years; I only want to keep this important information. It doesn't all have to be in the same format. I understand it gets difficult as the years go on—there's Excel 2012, Excel 2017, or whatever. But as long as you have some mapping, you know where your data is, and you're not throwing away an old hard drive, that's what you can actually pull real insight out of. And if you're in the legal field, you've got to look at everything.

I worked on a case covering, what is it, 2004 to 2007—the earlier days of computer systems. What we were able to do is look back through the Windows server logs showing when people logged in and out of their computers, across terabytes of data for over thirty thousand employees. That's where the information lies, and it's hidden away. You would never suspect, "Okay, I'm just going to grab this old log file from 2004 that I don't even know where it is on my computer." That's actually where a lot of really useful information can be found, as long as you keep it. If the data's there, we can generally pull information out of it. Without the data, we're worthless.

MJ 34:17 Yeah. And I think the cool thing is we're at a point in society where so much data exists that we may not even think of it to solve the problem. We've been installing cameras everywhere. We've been installing sound sensors everywhere. And it really falls to the lawyer, or the team working on the case, to think creatively about how to use this information to come up with a solution.

I'll give an example. We were working on a pipeline dispute where there was suspicion that pirates were tapping into an oil pipeline and stealing oil, which actually happens way more than you'd think in the world. I was very shocked by that. And the method for figuring out where along the pipeline this was happening more frequently was small sound sensors placed along the pipeline to hear the oil flowing through the pipe. You'd hear a certain frequency and a certain sound when the pipe was full of oil, and a different one at lower pressure. So by listening to all of this sound data and applying machine learning, you're able to identify—especially if the sensors are set up at some regular interval—exactly where in the pipeline the theft seems to be occurring. If someone asked, "Who's stealing oil from me? And where is it being stolen?" I don't know that I would have thought of that method to solve the problem. But data is everywhere. And I think it's important that we find the right place to use it.
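A toy version of that localization idea: sensors spaced along the pipeline each report a characteristic sound level, and a tap shows up as a reading that deviates sharply from the pipeline-wide baseline. The sensor names, levels, and z-score threshold below are invented for illustration; the real system worked from richer frequency data:

```python
# Flag sensors whose acoustic level deviates sharply from the mean
# across all sensors; flagged positions indicate where to look for a tap.

from statistics import mean, stdev

def flag_anomalous_sensors(readings: dict, z_threshold: float = 2.0) -> list:
    """Return sensor ids whose reading is > z_threshold sigma from the mean."""
    values = list(readings.values())
    mu, sigma = mean(values), stdev(values)
    return [sid for sid, v in readings.items()
            if abs(v - mu) / sigma > z_threshold]

# Sensor id -> mean acoustic level (arbitrary units); km_12 sits near a tap,
# so lower pressure gives it a noticeably lower reading.
levels = {"km_00": 0.98, "km_03": 1.01, "km_06": 0.99, "km_09": 1.02,
          "km_12": 0.55, "km_15": 1.00, "km_18": 0.97}
print(flag_anomalous_sensors(levels))  # ['km_12']
```

Regular sensor spacing is what turns "a sensor is anomalous" into "the theft is near kilometer 12."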

KS 35:43 Absolutely. Yeah. I mean, without the data, again, that question would be unsolved.

MJ 35:48 Yeah.

KS 35:48 But that brings then the next question of, with all this data that's out there, what data should we be using? Can we tap people's personal phones? Is that something the government's doing? How do you ethically use some of this data? How do you ethically apply artificial intelligence and machine learning to some of these situations without infringing on people's personal rights, without applying biases of our own historical wrongdoings?

MJ 36:19 Yeah, probably the biggest experiment in that space is the social credit score that the Chinese government keeps for its citizens. And it is interesting that certain data, like security camera footage, is certainly used for that: what times of day people are entering or leaving their houses or certain buildings. But I know that's a huge can of worms we could open and dissect. I'd love to save that for another day, because I think it's super fascinating and certainly a place where the industry is going.

Yeah, I guess if you were to sum up and encapsulate a lot of what we've discussed today—we started way at the beginning with what machine learning is; went into artificial intelligence and how we apply it to specific business problems; went through a little of the history of the legal environment and how it's been used there; talked about regulators and how to get people on board by educating them; and talked about practical advice, to be creative with the data you have but also make sure that data is very good. Where do you see this heading? And how do you want to encapsulate this for our listeners to see what they can do with machine learning?

KS 37:29 Yeah, absolutely. So, data is everywhere. Machine learning and data science are going to be the future—they basically are the future. They already lead into every part of our lives. And so early adopters, those who can get ahead of the curve, are going to be the ones making the most money, pushing the envelope, changing the future. And as part of being that change, hopefully you, anyone listening here, will be able to help lead the discussions on the ethics of it. If you're an early adopter, you know where the technology stands in your field. And you can be the one to decide, "Maybe this isn't how we should use this. Maybe this is a place where we need some introspection, to look back on how we've done things with systems in the past," to figure out, going forward, how we address machine learning and the ethics of how we use it.

MJ 38:18 I love it. So, machine learning is the future. And you get to be the one to decide how we use it. That's perfect.

KS 38:24 Absolutely.

MJ 38:24 Well, thank you so much, Karl. It's been such a pleasure talking about this. I can't wait to dive in a little bit more into the ethics of AI. We'll save that for a separate session. But really appreciate your time today. This was excellent.

KS 38:35 Awesome. Thanks, Michael, for having me on.

MJ 38:36 Thank you. Talk to you soon. Bye.

The views and opinions expressed in this podcast are those of the participants and do not necessarily reflect the opinions, position, or policy of Berkeley Research Group or its other employees and affiliates.

