April 19, 2022
Analytics and Data Engineering with Nick Haylund, phData
015 - GAT Podcast: Force Multiplier
S1 00:00 Welcome to Force Multiplier, the official podcast of BRG's Global Applied Technology team. The GAT team, as we call ourselves, are a globally distributed team of software engineers, data scientists, graphic designers, and industry experts who serve clients through our BRG DRIVE analytics platform. We're helping some of the world's largest and most innovative companies and governments transform data into actionable insights. I'm Michael Jelen, and in these conversations, we speak with people both internal and external to BRG to discuss how technology and specifically software acts as a force multiplier to extend the impact of people across any kind of professional organization. Today I'll be speaking with Nick Haylund. Some listeners might recognize Nick from the Alteryx community or from Tessellation, a leader in data analytics. With Tessellation recently merging with phData, Nick now leads the analytics delivery arm of phData. He's focused on empowering customers to make faster, more reliable data-driven decisions by building robust platforms, processes, and data-literate teams using technologies such as Snowflake, Tableau, Power BI, Alteryx, KNIME, and more. Please enjoy this conversation with Nick.

S1 01:08 Hi, everyone. My name is Michael Jelen from BRG's Global Applied Technology Practice. I'm here today with Nick from phData. Hi, Nick. Thank you for taking the time to talk to me.

S2 01:18 Hey. Appreciate the invite. Thanks.

S1 01:20 I am super excited. Today we're going to be talking about analytics engineering and data engineering, two very interesting topics, and along the pipeline and evolution of how things have been progressing in the industry. To start off, since I think we might post this on both podcasts, I'll give a quick little introduction about me. And then, Nick, I'd love if you can do the same thing.

S2 01:38 Yeah. Most definitely.
S1 01:39 So I am the lead of our machine learning and artificial intelligence team at the Global Applied Technology Practice at BRG. So we build tools that enable users to interact with and collaborate with data. So typically we take the approach of simple machines like analytics, forms, workflow, and we can couple them together to create more complex solutions to problems. Today we're going to be talking about tools that democratize machine learning. So very excited to jump in there with Nick.

S2 02:05 Yeah. Again, thanks for having me. Nick Haylund here, director of analytics for phData. I used to work for this company called Tessellation, which merged with phData just a few months ago. And we're happy to be working with each other. And I think it's a good conversation to have now especially since we do have folks on the data engineering side from phData, usually building up with technologies like Snowflake, Databricks, and others. And then on our side, we're kind of code-free, low-code first in our analytics space. So it's definitely a conversation that we've been having a lot about the difference between analytics engineering and data engineering. So happy to be here and looking forward to getting into it here.

S1 02:44 Awesome. Well, thanks so much, Nick. Let's start off with data engineering. At a high level, I'm familiar with it as a necessary evil where you always have to take any kind of data feed that's coming in before you do something with it; so before you turn it into analytics or often before you apply more complicated machine learning models on top of it. But could you go into a bit more detail about what that is and what analytics engineering is?

S2 03:06 Sure. I'm going to do a much better job of describing the analytics engineering space. So I'll just preface that for my data engineering friends on the call and at work here. But data engineering, I like to think of it as the IT building for the business.
So making data sets and reports available for the business, which can sometimes take months. It's a lot to do also with the digital transformation that a lot of companies are going through right now, standing up really robust and highly available platforms that the business can make great decisions on. And not even necessarily just the business, but also IT making certain products based on those data sets as well. On the analytics engineering side, I think it's a little bit different, where I kind of view it as the business building for the business using low- or no-code tools. So data engineering, you've got our Python friends, maybe even some R friends, plethora of other languages that have traditional backgrounds in training, whether that be a couple of years in boot camps and learning that code or perhaps graduating with degrees in that space versus the business, a lot of times knowing what they're starting off with, what they need to end with and not traditionally knowing the in-betweens.

S1 04:17 Got it. So I guess in general, it's safe to say that data engineering is something that's a lot more manual, requires higher degrees of technical sophistication. Often that might be more of an IT style role. And it sounds like analytics engineering is more GUI or user-interface driven, a lot of drag and drop, low or no-code, and something that the business can do itself?

S2 04:39 Yeah, I think so. And a lot of times when I'm talking to people about analytics engineering, they're all of a sudden going to be using a new tool. For example, Alteryx is a good one to reference here. I like to compare a lot of times the VLOOKUP with the join tool so that if we're talking about the business, something that would resonate with them, it's going to be a lot of times Excel or Microsoft Access or those kind of traditional spreadsheet-like systems and translating that into a more efficient way to do things which can be these low, no-code analytical platforms.

S1 05:08 Gotcha.
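To make the VLOOKUP-versus-join comparison concrete, here is a minimal sketch in plain Python. The table contents and the `left_join` helper are hypothetical illustrations, not anything taken from Excel or Alteryx; the point is only that a VLOOKUP and a left join both attach columns from a lookup table to each row that shares a key.

```python
# Hypothetical data: an "orders" sheet and a "customers" lookup sheet.
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 2, "customer_id": "C2", "amount": 75.5},
    {"order_id": 3, "customer_id": "C9", "amount": 10.0},  # no matching customer
]
customers = [
    {"customer_id": "C1", "region": "Midwest"},
    {"customer_id": "C2", "region": "West"},
]


def left_join(left, right, key):
    """Keep every left row and attach the matching right row's columns.

    Simplification: if the right table repeats a key, the last row wins.
    """
    lookup = {row[key]: row for row in right}  # index the lookup table once
    extra_cols = [col for col in (right[0] if right else {}) if col != key]
    joined = []
    for row in left:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in extra_cols:
            # None plays the role of VLOOKUP's #N/A when nothing matches.
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


result = left_join(orders, customers, "customer_id")
for row in result:
    print(row)
```

A spreadsheet user would write one VLOOKUP formula per column and copy it down every row; the join expresses the same intent once, for the whole table.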
And I think on either side of the spectrum, it's pretty clear to see the distinction. But I presume that this is indeed quite a wide spectrum with a lot of things that sit in the middle there. Are there things at the intersection there? Is there a mixture of areas where you can mostly do things through GUI, but add in a little bit of code, or mix and match a bit?

S2 05:28 Yeah, we like to talk about that space, too. So a lot of times these platforms do thrive in like, "Hey, we don't need any programming skills to work our platform and stuff." But there's a lot of applications out there like Tableau Prep and Alteryx that do provide the code-friendly space through the release of SDKs, Software Development Kits. So if you do have something that you want to implement with the low, no-code solution, there's a way to do that these days. So while it's traditionally maybe 80% of the users might use it for only drag-and-drop functionality, for those programmers on nights and weekends that actually do want to put a Python package into, let's say, an Alteryx workflow, that's available these days in a lot of these platforms.

S1 06:08 Awesome. Yeah. Do you have any examples of cool areas where you've seen what would normally be an extremely complicated or difficult process be automated through one of these low or no-code solutions?

S2 06:19 Yeah. So those are the fun use cases, right? To get back to-- I'm in a little group. It's me and one other person, Tom Larson, that we like to develop Python SDK tools for Alteryx. And one of the examples that we have on that one-- and we have a free package, kind of a suite of tools out on GitHub that make your job [inaudible], but one of those tools is a VADER sentiment analysis tool. So if folks are on the call or on the podcast here that are familiar with VADER, that's a way that you can score sentiment of certain phrases or sentences.
So if I want to get the general sentiment of an entire conversation, or even of just one sentence, there's a Python package you can pass the text through. It'll generate a score from negative one to one, from very negative to very positive and everything in between. So traditionally speaking you'd have to basically program in Python and create endpoints and make sure you're able to pass these sentences and phrases through this tool. But through a low to no-code way, a lot of folks can start off with a data set full of sentences and phrases; maybe they get a survey data set and they're really interested in generally the sentiment of what people are saying versus just making a word cloud at the end of the day of an entire survey. So through a way of just having your inputs and knowing what your output is, which is going to be sentiment around every single sentence that's being talked about, that in-between space now provides a way that you can just pass the sentences through this building block or this tool and get your output.

S1 07:46 Awesome. I love that. And I actually do remember a handful of years ago we were working with a Middle Eastern government and their objective was to monitor on Twitter the sentiment that each person was saying related to different policies. So anytime a new policy would happen, it would have a specific hashtag. And the government wanted to know, "Do people like this? Or do people not like it?" And we had to undertake the full manual process that you have just described in order to achieve that. So pretty cool that things have come a long way and you're able to just do that very, very easily. I wonder, though, if something becomes very easy like that, whether the tool could potentially be used by people that don't fully understand it, and if there are pitfalls here.
Do you think that there are risks as we start to democratize some of these more complicated machine learning algorithms to your average, everyday user?

S2 08:32 Yeah. No, most definitely. And I think that could be a whole conversation about ethics with data science in general, right? In particular with this example, I think that making sure that people understand what they're running is still very important. So while we can still enable folks to do enrichment through a join tool, because that's going to be pretty straightforward, once we start to get to predicting behavior, making decisions based on that, you have to be very careful with that. So one example my colleague John Emery did was create a tool in Alteryx using the sentiment analysis but built with R. So he's also a person that works in the space of code-friendly that can create these custom tools to enable folks that aren't coders to use these packages. And one of the things that we had to address on that front was if you're dealing with a hospital that's talking about cancer, and they want to scrape hundreds or thousands of articles off the Internet or off their own website and provide sentiment analysis about whether something is being talked about positively or negatively, that's going to introduce some problems because if you're traditionally scoring something that might tag cancer as something that's very negative, well, you want to make sure you pair that with text that says "cure cancer." That's going to be probably a positive sentiment in most scenarios. So there's still a space, a need, great importance around making sure whatever you are using from a data science or modeling standpoint, that you have someone who understands the underlying package and that you're doing a responsible job at using it.

S1 10:04 Great.
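The "cure cancer" pitfall described above can be sketched with a toy lexicon scorer in plain Python. This is not VADER and not the tool discussed in the conversation: the word scores, the `sentiment_score` helper, and the `alpha` constant are all hypothetical, chosen only to show how a general-purpose word list can mislabel domain text and how a domain-specific override changes the result. VADER itself is a lexicon- and rule-based package with a much larger curated word list and extra handling for things like negation.

```python
import math
import re

# Hypothetical, hand-made word scores; a real lexicon has thousands of entries.
GENERIC_LEXICON = {
    "cure": 1.5,
    "great": 3.0,
    "love": 3.0,
    "terrible": -2.5,
    "cancer": -3.0,
}

# Domain override for a hospital: "cancer" is a topic there, not a complaint.
HOSPITAL_LEXICON = {**GENERIC_LEXICON, "cancer": 0.0}


def sentiment_score(text, lexicon, alpha=15.0):
    """Sum the word scores, then squash the total into the range (-1, 1)."""
    words = re.findall(r"[a-z']+", text.lower())
    raw = sum(lexicon.get(word, 0.0) for word in words)
    return raw / math.sqrt(raw * raw + alpha)


survey = [
    "New cure for cancer shows promise",
    "The waiting room was terrible",
    "I love the nursing staff",
]

for sentence in survey:
    generic = sentiment_score(sentence, GENERIC_LEXICON)
    hospital = sentiment_score(sentence, HOSPITAL_LEXICON)
    print(f"{sentence!r}: generic={generic:+.2f}, hospital={hospital:+.2f}")
```

With the generic word list, the first sentence comes out negative because "cancer" outweighs "cure"; with the hospital override it comes out positive. That is the kind of adjustment someone who understands the underlying package has to make before the scores are used for decisions.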
So we can usually get you some preliminary results, but obviously, you'd want to be able to tailor and ensure that all of the biases that are perhaps embedded with that system are being taken into account.

S2 10:15 And that's happening more and more in what's often referred to as the AutoML space. A lot of folks that don't have a data science background want to start to play around and maybe confirm a hunch about a certain thing happening: "I think that this variable is driving this behavior. Sales is being driven one way or another. And I want to do some predictions around whether that's going to happen again in the same quarter this year as it did last year." That space is being opened up. And it's accelerating that modeling. But I think there's also some responsibilities that need to be taken in that space too.

S1 10:45 Cool. And for the users or for the listeners who have not heard of AutoML before, what would that be? Would that be somewhere in your data pipeline, you have a prebuilt model, and you're essentially just pushing some data directly into this prebuilt model? It might not be tuned specifically for the purpose that you need it for, but it might just give you some raw results. Is that how it works?

S2 11:05 Yeah. So there are a couple of examples now in this space. So first and foremost you need that clean data set. You need most likely a very robust data set so you can start to do some modeling around that. So that's where your Alteryx, your KNIME, your Tableau Prep, Power Query comes into play with prepping, cleansing, and enriching that data. Then you're ready to make some decisions or create some visualizations on that. And for the making-decisions part, creating this model, there's something like AutoML that's available through Alteryx where you can pass a data set through. Let's say a customer churn modeling is what you're after and want to see how likely someone is to leave your company as a customer.
What we would need, of course, on that front is most likely historical data around who left, who stayed, and then their kind of customer segmentation or their customer profile to predict future behavior. So something like that is a really great use case that I can see some good results. There's always going to be-- I think you mentioned, Jelly-- the tweaking and making sure that you're accounting for business-specific logic. So the folks that have been at a cell phone company for a long time know certain triggers that would cause somebody to leave that might not be in the data set. Then you might have to go hunt down and create that data set and enrich your data set more. DataRobot, of course, is a big vendor in that space. If you have some pretty good data sets, you can go a long way with that. And also there's ThoughtSpot. So they have this product called SpotIQ. So if you can run your data set through that, you can get some good AI-driven insights. Speaking of AI-driven insights, Alteryx just released Auto Insights. Lot of cool things in the market there.

S1 12:40 Yeah, that's super interesting. I find that a lot of firms, especially as they are transitioning up the ladder in their realm of digital maturity, are suddenly coming across more data than they know what to do with. And it feels like the limiting factor is often human attention. "We don't know what to look at. There's just way too much. I'm not going to find the things that are going to move the needle or drive business in the correct direction." So the idea of using AI to have those features pop out and have someone take a look at it in further detail seems super, super useful. But I guess similar to our previous points, deploying AutoML or sometimes using Auto Insights as the sole method of achieving these gains might be a little risky, right? I mean, maybe it gets you 80% there, but it's not production-ready.
So pushing it into production, you might miss things or not see them. Again, Auto Insights, it sounds like it might be super great to call out a lot of things, but it probably won't get everything, and so you can't rely on that as your only method of finding those items. I guess if people are starting to think about these tools, everything from the analytics engineering components all the way through to some of this auto-machine learning that can be applied at certain points, what are some easy use cases that people can think about? What are maybe some industries or areas that they'd be familiar with that perhaps could benefit from this the most?

S2 13:56 Yeah. Great consideration. So we definitely got into the AutoML space and the code-friendly space with Python packages. But at the end of the day, the majority of these users of these analytics engineering platforms are going to be the folks that are in the business just trying to automate their work and accelerate their work so that they can kind of move on to what they want to do, which is analyze what's going on, make decisions about what's going on, move it into that reporting and analytics space. So I think a lot of industries that are secretly or not so secretly thriving in this analytics space are going to be your finance, your tax, and your audit, which might not be the first thing you think of. If you're a student in college right now and you're using Alteryx, you're using R and Python and stuff, then you're like, "What? I don't think I could be a finance person in analytics," which is just not true. So even my background. I came from a finance and e-com background. And the whole reason I got into the data analytics space is because I was so frustrated at how long it was taking me to process my month-end close information. And then all of a sudden it would be day six, day seven. And then I was moving on to this one project maybe that I could spend a week on.
Then it was all of a sudden month-end close cycle all over again. So I think there's a lot of folks out there right now in my space that kind of backed into the data analytics field because they were using these tools in a way that maybe the tools weren't initially created to do.

S2 15:17 So Alteryx, for example. They started out as a spatial company, cutting and slicing and dicing and pushing spatial data for their analysis. SRC is the name that they had before. I think they improved it by calling it Alteryx. But here you have the spatial analytics processing tool that became one of the biggest now for finance organizations like PwC, [inaudible]. Those organizations have hundreds if not thousands of folks that are driving that tool and platform to do audits and to do tax and to do advisory. So there's a lot of industries out there that are thriving right now with analytics tools that you probably wouldn't imagine. So it's not just about predicting data and stuff with these tools, it's definitely about getting the job done and bringing it from good to better.

S1 16:03 Yeah. It feels incredibly liberating to have these tools because I feel like a number of people probably lose a huge percentage of their day in their day-to-day functions spending it on data prep or data engineering or things that aren't their core competency and certainly not what they were brought on board to do. It seems like this automation really moves people more towards being able to make business decisions, think more critically about the information that they're seeing, and really keep moving the organization forward. I think in the data science industry in general-- or, I guess data science in any industry, I should say, this is a necessary component.
And a lot of people might not have a data science background, but yet it seems like you may be able to get 50, 60, 70 percent of the way there with some of these tools and enable and empower your average person in an organization to be able to make better decisions with that data. So I think it is pretty cool and it's very interesting that we're moving along that spectrum. When you mention some good and some better solutions, I guess every organization is overwhelmed by this advancement and how fast things are changing. Could you maybe talk a little bit about where they can start? What would be a good solution? How would they move to something that's a little better? And then maybe what would be the best or optimal solution with what we've got right now?

S2 17:14 Yeah, most definitely. So I like referring to that kind of mentality. And I think a lot of times I approach projects with that. I think the good solution is what a lot of people are doing today. They're getting the job done. It might be very frustrating, take a long time. They might be having to read from a manual piece of paper about a routine that they have to follow at the end of every month so that they're getting the right tax entries at the end of the day. So it's a solution. So we can call it good. Moving to better, I think is definitely the whole premise of this kind of analytics engineering space and this kind of data processing low, no-code. Automating that manual work is going to be something that's going to allow people to save hours, if not at least a day, if not a week, and then really move into that space. The best solution, I think, yeah, is we're going to be talking about stuff that's going to be highly available. It's going to be stuff that's supported. It's going to be stuff that supports break fixes. It's going to be something that's going to be in a state that has data governance checks around it, data quality, integrity, all of it around that.
And so moving from Excel to some kind of drag-and-drop GUI tool where I can automate some stuff, maybe even schedule it without a server available to me, and then eventually moving into a platform-first solution like a Snowflake or something in that space. That's how, at least for me, I think of kind of the good, better, best solution: moving from flat files to spare data to a more automated way of doing things, and then finally an IT kind of backed and supported fully available solution.

S1 18:42 Yeah. And from a technology perspective, I think that's a great illustration of the good, better, best journey. But I guess that alone probably wouldn't get people where they need to go because in parallel you also have the human component associated with that. You need to make sure that your people are speaking the same language of data fluency. And you also need to make sure that data maturity is increasing. In every organization, obviously, they mean something slightly different. But how do you see that evolving together with a good, better, best migration at a firm?

S2 19:12 What we find is-- or, what we found is that the biggest failure of a company when they do try to go on this digital transformation with a platform or technology, a lot of times what's lacking there, is that process piece. So you think about people, process, and technology, which is a very popular term. We also want to make sure that community and upskilling and training is at the forefront of that conversation. So making sure that not only does someone know how to drag an Alteryx tool onto the canvas, we need to know that they know what a canvas is. We need to know if you say, "Go to the configuration window," they know exactly where to point. If we say, "Hey, drag a join tool onto the Tableau Prep canvas over here," we know that they are connecting that to their Excel roots and saying, "Okay.
I'm basically going to be creating a VLOOKUP here, but I'm just going to be using a join tool." And getting that fluency into the business is not only going to accelerate people building these data products, it's also going to be able to get people on the same page when they're talking to their IT counterparts. So if at some point you do want to migrate a certain process from a canvas kind of analytics engineering solution to, let's say - I don't know - Snowflake, or creating a pipeline in that space, we're going to be able to hand off that mapping to maybe IT to make something highly available. And having the business be able to say, "Yeah. I joined this data set, this data set," is going to accelerate that conversation with our more technical counterparts. That piece alone is going to be huge. So one way that we do that at phData is we have the datacoach.com platform. Subtle plug here. But the way that we do handle that, no biases-- I think we do a pretty good job of making sure people care about the data process they're creating and that they're learning about.

S2 20:54 So one of the things I know when I was learning-- it was always kind of tough to get started with a new technology or a new space. When I had these practice problems going on, I didn't really care too much about the data. I cared about the outcome, which was me learning. But I didn't really care too much about the output because, at the end of the day, I would probably never use that data set or that process ever again. So the capstone project that we have in place is creating a capstone around a true business problem. And we always encourage people to pick your most frustrating problem and start with that one because you're going to care to solve that and automate that and then take that to the next level because nothing could be more motivating to somebody than saving 8 hours at the end of the month.
Enablement, training, fluency, it's all a key piece of this kind of transformation that we're trying to drive here.

S1 21:40 Awesome. And before we get into the conclusion to wrap up, I feel like it's a great time to ask where can people learn more about that platform, you, the firm, and where can they continue to read and learn up and increase their data maturity?

S2 21:53 Yeah. You got it. So visit our website either datacoach.com for that enablement. Otherwise, phdata.io is a great place to reach out to us. We're also on Twitter. phdatainc is our handle. Or feel free to reach out directly to me at nick612haylund.

S1 22:10 I love it. We covered so many topics today from the evolution and usability of engineering and getting data into a clean and usable format. We talked about some of the automation that's out there for dragging and dropping ML directly into it, or Auto Insights to be able to show you what data you should be looking at; the journey that everyone's going through from good to better to best and how that also associates with our learning as humans and the community and the organization and how we grow it. With all of those things in mind, is there anything that you would love to leave people with here related to these topics?

S2 22:47 I think the best way to get started with this stuff is to go to these partners that we're talking about here; grab a trial; throw a data set that you care about onto the canvas, into the white space, and you start building and create the portfolio that you want to build and have some fun with it.

S1 23:01 Awesome. Well, thank you so much, Nick. This has been such a pleasure.

S2 23:04 Yeah. Likewise, Jelly. Thanks a lot.

S1 23:05 Looking forward to talking to you soon. Thanks.

S1 23:08 The views and opinions expressed in this podcast are those of the participants and do not necessarily reflect the opinions, position, or policy of Berkeley Research Group or its other employees and affiliates.