About the Episode
With a growing number of interconnected applications distributed across different environments, enterprise software systems are becoming more complex every day. At the same time, today's organizations and their customers expect applications to be available 24/7. So how can IT teams ensure all of these applications and systems are running at their best? Observability offers a solution.
In this episode, Marc LeBlanc talks with Adriana Villela, Principal Developer Advocate at Dynatrace, about the importance of observability for today's enterprises and why people and culture, more than tools, are the keys to success.
Transcript
MOBIA_Solving_for_Change_Episode_14
Adriana Villela: [00:00:00] The definition of observability that I like, which I give total credit to Hazel Weakly for, because she came up with this definition and it's: The ability to ask meaningful questions, get useful answers, and then act upon what you've learned.
The reality is, observability is a team sport. It has to be practiced by everyone in the organization.
The argument that I will make for an observability team, I would see it as more of like an observability practices team or a center of excellence, as some people may call it, where there's gotta be some oversight as far as observability practices go, right? It can't be a free-for-all. We need to save developers from themselves because we like to do what we like to do.
Marc LeBlanc: This is Solving for Change, the podcast where you'll hear stories from business leaders and technology industry experts about how they executed bold business transformations in response to shifts in the market or advances in technology. In every episode, we'll explore real-world strategies that fuel [00:01:00] successful evolution.
I'm your host this month, Marc LeBlanc.
I'm excited today to welcome Adriana from Dynatrace. She's a principal developer advocate. Thanks for joining us on Solving for Change today, Adriana.
Adriana Villela: Thanks for having me.
Marc LeBlanc: So we wanted to have an episode talking about observability and how it fits into the challenges of today and the future.
Maybe just before we get into some of those conversations, can you explain to me a little bit about what does your role entail? How do you get engaged and what is it you're trying to help customers with?
Adriana Villela: So, as a developer advocate, I bridge the gap between end users and engineers within the organization to make sure that we get that feedback from the end users back to the org.
But also, my area of focus is observability. So part of my job is to really educate folks [00:02:00] on observability practices, what it's all about. I'm heavily engaged in the OpenTelemetry community. And OpenTelemetry, for those who aren't familiar with it, is basically an open source framework for instrumenting your application.
And the intent is that it's vendor neutral. So the idea is, most of the major observability vendors out there, including Dynatrace, can ingest OpenTelemetry data so everyone's ingesting the same data. And really what differentiates things is what is being done with the data to make it useful to you.
Marc LeBlanc: Very interesting. I'm wondering, thinking about your experience, do you have one moment you can share where a customer or an organization had that "Aha" moment and they saw observability just click?
Adriana Villela: I think the "Aha" moment is when your code is instrumented and you're like, "Oh my God! All of a sudden I can see what's actually happening in my [00:03:00] system."
I can even speak for myself. For me, distributed tracing is, I would say, the backbone of observability and being able to see a distributed trace of your service calls from start to finish.
Like, I click a button on my application to add something to my shopping cart and then it gets added. And seeing what that looks like from start to finish I think is the real "aha" moment in being able to get those insights from your application. That's, I think, what hooks people on observability.
Marc LeBlanc: I agree. And let's break that down a little bit because I think, if you put on the hat of a business owner or an organization and they're thinking about their applications, they're just thinking about time to market. They're thinking about functionality. "How do I get in front of my customer?"
What goal is... You know, as a developer advocate, or even just someone that represents observability, what is it you're trying to bring to the table? Because that could sometimes maybe be seen as [00:04:00] a gate.
Adriana Villela: Yeah, absolutely. And I think, here's the trick, right? It's often hard, especially if your application code isn't instrumented or maybe it's using some other old type of vendor-specific instrumentation. If you're saying, "Hey, you need to instrument your code now," you're basically introducing new technical debt into your code, right? Because it's something that you didn't have before. And so you're introducing that new technical debt. It's new code, new bugs, right? It's as simple as adding a log message that can still introduce a bug into your code. So there's that.
There's the learning curve. Especially, you know, the industry is shifting towards OpenTelemetry, so that means having to learn how to instrument your code using OpenTelemetry. And that's something that can be a little bit scary. There is a bit of a learning curve to OpenTelemetry. I think the folks in Otel are working really hard to help bridge that [00:05:00] gap, but it's still there. So, then there's that scare factor, right? The, "Oh my God, I have to learn this new stuff. I don't have time." Many organizations are faced with the reality that they might have some stuff that is constantly on fire. And so you're saying, "My house is on fire, but you want me to instrument my code, but I have to put out the fire. But..." And that's where you kind of have that contradiction, right? Because if you don't instrument your code, you won't have those greater insights into your application. So, you have to take the time to sit down and instrument your application to like put an end to those fires that are happening. So, I think that's probably the hardest sell.
And there's also the cultural aspect. People are used to doing things the way that they want to do it, right? Because they've been doing it for years and years and years. I was [00:06:00] at an organization where one of the senior leaders was like, "Well, we're going to do it this way because it worked at my previous organization five years ago."
Well, guess what? Technology has changed drastically in five years, and what worked for that organization five years ago just isn't going to work for your organization now.
Marc LeBlanc: Yeah, you hit on a couple things I think are worth doing a double click on. One was: okay, if we don't get started with observability, the house may already be on fire and we just don't know. So, we've got to take the time to fix it.
And then the other element I'd like to double click on after is just around the culture of that. So double clicking in onto, you know, maybe your house is on fire and we need to take the time. When an organization decides: yes, we're going to do this, we're going to get better insights, better observability into our platform and our applications, how do they get started?
Adriana Villela: So there are a few things that you can do. One thing that really lowers the [00:07:00] barrier to entry is, if you're instrumenting your code with OpenTelemetry, OpenTelemetry has this functionality called auto-instrumentation or zero-code instrumentation. Now it's not available for all languages, but you know, the biggies that we see a lot, like Java, .NET, Python, and I wanna say PHP and Go, I think, are kind of the main five for which this is available. Now, as the name implies, zero-code instrumentation means you do not have to touch your code to be able to instrument it.
You usually use an agent that wraps around your code. And so it'll instrument basically any popular, common libraries that your code uses. It'll inject that instrumentation for you, so that lowers the barrier to entry a fair bit. So, that's one way to [00:08:00] start.
Another thing to do is basically, get in the habit of instrumenting as you code. So we're not saying, go back to your old legacy code and instrument it bit by bit. I mean, eventually it would be kind of nice to do that, but you know what? To not get overwhelmed, just start instrumenting any of your new code. So then that puts you into that habit of instrumentation and hopefully it turns into muscle memory. In the same way that we've got test-driven development and we get into the habit of writing our unit tests as we code, let's get into the habit of writing our instrumentation as we code. And that's actually known as observability-driven development.
And then the final piece of advice that I would give around instrumentation is: any homegrown libraries or frameworks that you have in your organization, look at instrumenting those as well, because chances are most of your code will touch that. So, you end up with a huge win by just doing that.
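The instrument-as-you-code habit described above can be sketched in a few lines. The snippet below is a toy illustration with a hand-rolled tracer, not the real OpenTelemetry API (which would use the `opentelemetry-api` package and calls like `tracer.start_as_current_span`); the `add_to_cart` function and its attributes are invented for the example:

```python
import time
from contextlib import contextmanager

# Toy stand-in for a tracer. It records (name, duration, attributes)
# per span. Real OpenTelemetry code would use the opentelemetry-api
# package instead of this hand-rolled helper.
SPANS = []

@contextmanager
def span(name, **attributes):
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({
            "name": name,
            "duration_s": time.monotonic() - start,
            "attributes": attributes,
        })

def add_to_cart(cart, item_id, quantity):
    # New business logic gets its span the moment it is written,
    # rather than being retrofitted later.
    with span("add_to_cart", item_id=item_id, quantity=quantity):
        cart[item_id] = cart.get(item_id, 0) + quantity
    return cart

cart = add_to_cart({}, item_id="sku-42", quantity=2)
print(cart)              # {'sku-42': 2}
print(SPANS[0]["name"])  # add_to_cart
```

The point of the pattern is that every new operation ships with its own telemetry, so the "muscle memory" builds the same way it does with unit tests.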
[00:09:00] So, these are some ways to get started. I would add a couple of cautions because this is something I've seen out in the field, which is basically--it's very tempting when an organization decides to instrument their code with OpenTelemetry or whatever for that matter, to basically say, "Hey, you know, my developers don't have time to instrument the code. Let's get another team to do it." And they'll often try to tap an observability team to do that. And I can speak from experience because I was running an observability practices team at one organization and leadership wanted us to do that and to create dashboards. And the problem with doing that is, we are not familiar with your code. So, you're basically asking us the equivalent of, "Write comments for my code." Well, I don't know what your code does.
Similar to the dashboards, I [00:10:00] don't know what's important to you for dashboarding, so don't ask me to do that. So, you really have to invest that time to do the instrumentation to learn it.
These approaches will definitely make it better and less overwhelming, but at the end of the day, you still have to invest the time in instrumentation to learn it.
Marc LeBlanc: That's really interesting and I really wanna get into that, but I wanna keep going on the, how do you get started? We talked a little bit just now around instrumenting your code for your application layer. Talk to me as we go up the stack a little bit. Because I think there's two more elements.
I think you know, and maybe this is an observability versus monitoring conversation, but what about the platform layer and then what about that user experience layer? Because at the end of the day, the company, or your customer really cares about, "Can I do the thing the application promises I can do?" So how does observability fit into there? Are there different considerations or is there a different thought pattern?
Adriana Villela: I mean, I guess at the end of the day, [00:11:00] you want to... It's not just about the application instrumentation, right? You do absolutely want to instrument your infrastructure. You want to instrument your front end and your backend, right? Because you want to gain those insights. And there's a couple of interesting things that have cropped up recently because I think like OpenTelemetry, for example, when it was initially started, I think there was a lot of emphasis put in on instrumenting for the backend, right? But as we know, front end is equally important. And so now there's a lot of attention shifting over to... Not shifting, but there's more love being given to the front end of things and even to mobile observability. Because let's face it, a lot of the time, we interact with the interwebs through our mobile phones. So, having that kind of observability into our mobile applications is super important, but also the [00:12:00] infrastructure observability. So it's still equally important to capture your infrastructure metrics and your infrastructure logs and have that be part of your overall observability story. So we can't just say, "oh, it's just the application." Well, the application runs somewhere. So, we have to care also about the stuff that's running our application.
Marc LeBlanc: And then, you know, maybe just an opinion or if you have some concrete experience. All this instrumentation, how are the senior leadership of these companies, how are they ingesting that? How is that information getting to them? Is that a dashboard? Is it something different?
Adriana Villela: Yeah, I would say for senior leadership, you probably want it in the form of a dashboard, something digestible. Something easy to understand that they can glance at.
I don't feel that most dashboards are tailored to senior leaders. Definitely you don't want... [00:13:00] Actually, let me rephrase that: I think the dashboards that your SREs will use are probably not the same dashboards that you want your senior leaders to look at. I also think, and this goes into an interesting realm, which is: bringing observability outside of that traditional realm of technology and into other aspects of how a business operates.
So, basically being able to relate business events even to make those business events observable, make our internal business workflows observable. And that is going to look, I would say, probably a little bit different than what you would use for say, your application observability or in your infrastructure observability.
Marc LeBlanc: Let's switch gears. I want to dig into the cultural aspects because you touched on a couple things that really resonate with me. [00:14:00] And if I kind of pull together everything that you said, I'm really hearing that observability is not a siloed sport. You don't have an observability team. You don't... You hear this in other disciplines as well. You don't have a DevOps team. That's a great way to set your DevOps folks up for failure.
Maybe just unpack that a little bit. What do you mean? You know, why is there risk in hiring just an observability team?
Adriana Villela: Yeah, I think you explained it perfectly. It basically silos your whole observability operation, right? You run the risk of isolating observability as just another pillar in your organization.
And the reality is observability is a team sport. It has to be practiced by everyone in the organization because, when you think about it, the developers are the ones who are going to instrument the code, but then guess what? [00:15:00] So, the developers can leverage their own instrumented code to troubleshoot their code, but also when that code gets handed off to QA, the QAs can go and analyze the data in whatever observability tool that you have and tell the developers when they're running their exploratory tests or what have you. If they find the bug, they can use the instrumented data to go back to developers and say, "Hey, I found a bug and here's what I think the problem is," because you've instrumented that code. Or if they find a bug but they're not sure what's going on, that's another opportunity to go back to developers and say, "I found a bug. I don't know what the problem is. Maybe you're missing instrumentation." So, instrumentation becomes a quality gate.
And then by the time it's handed over for production, then at that point SREs are hopefully equipped with enough information to be able to do their troubleshooting during production incidents, but also [00:16:00] the instrumented data assists them because some of the instrumentation that you're doing is metrics. And SREs rely on metrics to create SLIs and, in turn, SLOs. So, they use that to basically inform the creation of these two things.
But on top of that--and I think this is where you have an observability superpower, and especially OpenTelemetry enables this--is that correlation of your data, right? So, your metrics and your traces and your logs are correlated. So, when an SLO is breached, you can go back to the underlying SLI. Well, that underlying SLI is tied to a metric. That metric is tied to a trace, and then we can like start to dig deep and kind of understand what the underlying root cause is. So now we have this connection from development all the way to production. And there's no way that [00:17:00] you can have an observ... Imagine if you had an observability team instrumenting your code separately. That would make no sense.
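The metric-to-SLI-to-SLO chain described above can be sketched roughly like this; the counts and the 99.9% objective are illustrative numbers, not anything from the episode:

```python
# Toy sketch of the chain: instrumented request metrics feed an SLI
# (a measured ratio), which is judged against an SLO (a target).

def availability_sli(success_count, total_count):
    """SLI: the fraction of requests that succeeded over some window."""
    return success_count / total_count if total_count else 1.0

def slo_breached(sli, objective=0.999):
    """SLO: e.g. 99.9% of requests must succeed over the window."""
    return sli < objective

# These counts would normally come from instrumented request metrics.
sli = availability_sli(success_count=99_870, total_count=100_000)
print(sli)                # 0.9987
print(slo_breached(sli))  # True -> time to dig into the correlated traces
```

When the breach fires, the correlation Adriana describes is what lets you walk from the SLO back through the SLI's metric to the traces behind it.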
Now, the argument that I will make for an observability team, I would see it as more of an observability practices team or a center of excellence, as some people may call it, where there's gotta be some oversight as far as observability practices go, right? It can't be a free-for-all. We need to save developers from themselves because we like to do what we like to do. And so, an observability practices team would basically put in a set of guidelines and they would be the experts on instrumentation and what are the new things that are out and about and what are the best ways to configure and deploy your Otel collectors, for example. But the implementation would be left up to like these teams that [00:18:00] are specializing in the various aspects, right? The developers for instrumenting code and your platform engineering team, or what have you, for managing your Otel collectors. But having that oversight, ensuring that you don't end up with like a tools free-for-all. That's another thing. Because oftentimes you get into some organizations where they're like, "Well, this is the organization's tool for observability." But then there's some random rogue team that's like, "And we're running our own homegrown observability stack on Bob's desktop under his desk." So having that kind of oversight, I think, it allows for standardization, it allows for collaboration. I don't think your observability practices team should operate as a standalone pillar, either. It has [00:19:00] to have like input from those other teams, because otherwise we're doomed to turn it into a little kingdom. And I think that's where things start to unravel for many organizations: with the little kingdoms.
Marc LeBlanc: Yeah. Again, just picking up on things you're saying as we go through this, and I don't think it's a new term, but you touched on your platform engineering teams, and you also touched on standardization.
How do you see that evolving? How has that evolved over the, maybe the last say, five or six years? Because I do believe that with more conversations around platform engineering and what that kind of a team can provide, we are seeing standardization. So what has that meant for observability?
Adriana Villela: I think for observability, and specifically around OpenTelemetry, that would be standardization around [00:20:00] how you deploy your OpenTelemetry collectors, for example. And for those who aren't familiar, the OTel Collector is basically a vendor-neutral agent that is essentially a data pipeline. It ingests your telemetry data from multiple sources--and it can be from applications, from infrastructure, et cetera--it processes the data, so it can massage it, add or remove attributes, do data masking, which prevents PII data from getting out there, and then it sends the data to a backend for storage and analysis.
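That receive-process-export flow can be mimicked in a few lines. A real Collector is a standalone process configured in YAML; the Python sketch below only mirrors the shape of the pipeline, and the record fields and "sensitive" attribute keys are invented for the example:

```python
# Minimal sketch of an OpenTelemetry Collector-style pipeline:
# ingest telemetry records, run them through processors (here, PII
# masking), then hand them to an exporter (here, a plain list standing
# in for a backend like Dynatrace).

SENSITIVE_KEYS = {"user.email", "credit_card"}  # illustrative PII keys

def mask_pii(record):
    """Processor step: redact attribute values that would leak PII."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in record.items()}

def run_pipeline(records, processors, exporter):
    for record in records:
        for process in processors:
            record = process(record)
        exporter.append(record)

backend = []
run_pipeline(
    records=[{"span": "checkout", "user.email": "a@example.com"}],
    processors=[mask_pii],
    exporter=backend,
)
print(backend[0])  # {'span': 'checkout', 'user.email': '***'}
```

The masking step is the interesting part: PII is scrubbed in the pipeline, before the data ever reaches the backend.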
And so, I think platform engineering teams can provide basically standard deployments and configurations of your OpenTelemetry Collector and make those configurations standard and available to developers for local development as well. So, there's that aspect and then we can also take it a little bit further. And I [00:21:00] guess that's where you get into teams that are responsible for doing CICD. We can look into the observability of our CICD pipelines, and that one's a really interesting area because we often forget that our CICD pipelines, although internal, they are production systems. They're internal production systems, and your live pipeline is basically your production. Being able to have insights into how your code gets built, from when it gets built to when it gets deployed--having observability of that is so important because, especially like when our pipelines are working, it's great. Yay, things are smooth, but then when things start going caca, things slow down, or maybe you think it's working, but because your pipeline is observable, you can look at [00:22:00] the telemetry and realize, "Oh, there's something weird going on." So having that standardization, the observability of those pipelines, I think is super important. And that one's kind of a tricky one because I think observability of CICD is still fairly nascent.
There is a special interest group within OpenTelemetry that is looking at standardizing observability of CICD pipelines, focusing on what the OpenTelemetry semantic conventions should look like: What are the standard attributes that we are going to use to capture telemetry on our CICD pipelines, so we have that further observability and insight into our pipelines?
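As a sketch of what pipeline observability means in practice, the toy below emits one timing record (a stand-in for a span) per pipeline stage. The `cicd.pipeline.stage` attribute key is illustrative, not necessarily what the SIG's semantic conventions settle on:

```python
import time

# Toy sketch of treating a CI/CD pipeline as an observable production
# system: emit one telemetry record per stage, with a duration.

def run_pipeline(stages):
    telemetry = []
    for name, step in stages:
        start = time.monotonic()
        step()  # run the stage's work
        telemetry.append({
            "cicd.pipeline.stage": name,  # illustrative attribute key
            "duration_s": time.monotonic() - start,
        })
    return telemetry

telemetry = run_pipeline([
    ("build", lambda: None),
    ("test", lambda: None),
    ("deploy", lambda: None),
])
print([t["cicd.pipeline.stage"] for t in telemetry])  # ['build', 'test', 'deploy']
```

With per-stage durations in hand, the "things slow down" case stops being a hunch and becomes something you can see in the telemetry.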
Marc LeBlanc: I'm thinking about... I'm going way back to the beginning of our conversation about how do you get started and I'm thinking about platform engineering. So, I know in other disciplines we've seen, [00:23:00] let's just tackle containerization.
If you go tackle containerization the way you might try to get it introduced to your organization, you'd find a really good champion group. Maybe there's an application and they're all in. That's how you get started and then you go out from there. Do you find that there's a similar approach for observability? Is that a similar tactic if an organization has said, "Yeah, let's do this"? Is that how they get started? They find a champion group to really define what's great.
Adriana Villela: I think a champion group goes a long way. Even in one of my previous organizations when I was rolling out observability practices to the org, if it hadn't been for a couple of OpenTelemetry enthusiasts, we wouldn't have gotten nearly as far as we did. So, definitely having that kind of grassroots support, the folks who are going to be instrumenting having that enthusiasm. [00:24:00] Because it's one thing for like our team to say, "Yay, observability! Yay, Otel!" It's another to have a team that's actually using it and say, "Yes, this thing is legit."
But then there's the other side of things, which is that support from up top, right? From executives and from management. And this becomes a little bit tricky as well because oftentimes, some execs will come in and say, "We need to do this observability thing." But they're not really sure how to go about it. And so, I think, having an observability practices team steer them in the right direction and educate them goes a long way. And then having support from managers, because they are managing the individual contributors who are responsible for rolling out observability things. So, instrumenting the code and configuring infrastructure [00:25:00] related to observability. So having their support, their buy-in, their understanding is super important.
So, there's like a huge amount of education and internal advocacy, and that's something that we had to do in my previous organization: really focus on that internal advocacy. For us at the time, for example, I was championing OpenTelemetry--I think it was back in 2021--and traces hadn't even made it to GA, and so the entire organization was like, "You're telling me to instrument my code with OpenTelemetry and traces haven't even gone to GA. What the hell?" And my thought at the time was like, this is gonna be big. You need to be patient. And one of the things that I did to sort of quell people's concerns was, I got basically two folks in OpenTelemetry leadership at the time [00:26:00] to come in, and they were from competing companies, from competing observability companies. But this speaks to the power of the OpenTelemetry community: they were there on behalf of OpenTelemetry, not on behalf of their respective organizations. We weren't even using their products, either of their products. But they were there as a unified front to talk about OpenTelemetry, and they came in for a Q and A on OpenTelemetry.
It was funny because I remember telling my observability practices team, "Oh my God! What if nobody asks questions and we have them for an hour? Let's maybe pad some questions just in case, because I don't know how people are going to react." And we didn't even have to crack open any of our questions because the developers were full of questions. It was a really engaging session, and at the end of the day, everyone ended up feeling a lot better about using OpenTelemetry as a result.
So, really having that [00:27:00] conversation with external people. So it's like, "Don't take our word for it. Talk to other people who are out there doing it." I think that that's really important as well, like really showing people how OpenTelemetry is used in the wild. That's actually one of the things that I do.
I'm one of the maintainers of the OTel End User SIG, and we basically bridge the gap between end users and maintainers, but we also connect end users with each other. And one of the things that we do is share end user stories so that other OTel end users, whether they've been around for a while or they're just getting started, can hear how it is that people use OpenTelemetry out in the wild. So then they can be like, "Okay, this thing's doable. I feel good." You know, it gives them a sense of confidence, a sense of community as well.
Marc LeBlanc: I have a couple of closing thoughts here, and we've been talking for a while. We've talked about observability, we've talked about some of the cultural aspects. We've talked [00:28:00] about some of the technical challenges, and I kind of love that we haven't mentioned any of the tools. We talked about the OpenTelemetry framework, but it's not a tool, it's a framework.
So, how do tools like Dynatrace, like that platform, how does that help observability? We should at least touch on that. We like to try to touch on the people. We talked about the process. We talked a little bit about the tooling. So how does a platform like Dynatrace come in and how does it help facilitate observability?
Adriana Villela: Yeah, that's a great question. And for starters, I'm going to actually give you a definition of observability, because I think it'll be very helpful in explaining this.
So, the definition of observability that I like, which I give total credit to Hazel Weakly for, because she came up with this definition, is: the ability to ask meaningful questions, get useful answers, and then act upon what you've learned. And I think this definition is really important to answering your question because at the end of the day, [00:29:00] especially with something like OpenTelemetry, which is basically used for generating and capturing the data that you're instrumenting, that data has to go somewhere, right? And that somewhere is gonna be an observability backend like Dynatrace. And the interesting thing about the observability landscape is that there are many tools out there.
Some tools have been around since before observability and they were monitoring tools that have pivoted into observability. There are tools that are good for just one thing. Like Jaeger, for example, open source--it is just for ingesting traces. Prometheus, as we know, it's for storing metrics. And then you get tools that are unified observability tools like Dynatrace that can [00:30:00] ingest the main OpenTelemetry signals. So, the traces, the logs, and the metrics. And I'd say that having a tool like Dynatrace that provides you with unified observability is super important because observability is not just a matter of analyzing each signal in isolation. Previously, you might have heard of the three pillars of observability, right? The traces, the logs, the metrics. But they're more like, to borrow a term from my former coworker, Ted Young, it's more of a braid, right? All of these signals are intertwined. They each play an important part in the observability show and you can't just do observability with one or the other. So, being able to have a platform like Dynatrace that can ingest all three signals, to not just [00:31:00] ingest but give you the ability to correlate those signals and see those correlations, is really important because then you're not just ingesting the data. It's what is Dynatrace doing with your data? Is it allowing you to ask meaningful questions, get useful answers, and act on the data? And I think that's really what differentiates observability vendors that ingest OTel data. That's what differentiates them from one another.
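The "braid" of intertwined signals is easiest to see with correlated data. The sketch below fakes one trace and two log lines that do or don't share a trace ID; the IDs, fields, and messages are all invented for illustration:

```python
# Toy sketch of signal correlation: traces and logs share a trace ID,
# so a slow or failing trace can be walked straight to its log lines.

traces = [{"trace_id": "abc123", "span": "checkout", "duration_ms": 2400}]
logs = [
    {"trace_id": "abc123", "level": "ERROR", "msg": "payment timeout"},
    {"trace_id": "def456", "level": "INFO", "msg": "healthy"},
]

def logs_for_trace(trace_id, log_records):
    """Jump from a trace straight to its correlated log lines."""
    return [log for log in log_records if log["trace_id"] == trace_id]

slow = traces[0]  # e.g. surfaced by a latency SLO breach
print(logs_for_trace(slow["trace_id"], logs))
# [{'trace_id': 'abc123', 'level': 'ERROR', 'msg': 'payment timeout'}]
```

A unified backend does this join (and much more) across all three signals; analyzing each signal in isolation loses exactly this connection.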
So, what kinds of patterns can it identify in your data, whether it's--many organizations, many vendors are leveraging AI for that sort of thing, including Dynatrace--what kinds of patterns is it seeing in your data where it can say, "Hey, you know, you might wanna take a look at this. This looks kind of interesting. Maybe this might be a problem." And, so I think [00:32:00] that that's where the tooling comes in. That's where we start to see the power of observability.
Marc LeBlanc: Yeah, I completely agree. Adriana, we're just kind of wrapping up here. I wanted to summarize probably my main two takeaways from our conversation today.
Like I said, we talked about the culture, we talked about the technology. We talked about the tooling.
On the tooling front, really, what I took from your explanation was: really understand the various tools in your ecosystem. Understand that there may be some overlap, but the important thing is to have something that's taking all that information and aggregating it together, giving you that correlated view. That's important on the tooling front. On the technology front, you have to understand what's in your ecosystem, the infrastructure, the platform, what are customers expecting.
But probably the most impactful thing was around the culture, the people part and that you can have observability [00:33:00] people and you can have developers. But if they're not talking together, your observability people don't understand what the app is doing and your application people don't understand what the observability people know. So, really it's a matter of getting those cultures meshed together for good success.
Anything else you'd like to add there?
Adriana Villela: I think you've covered it perfectly. I think just a reminder, observability is a team sport. Do not spin up an observability team and hope that your problems are going to magically go away. And don't ask them to instrument your code for you because that's just not gonna work out.
Marc LeBlanc: Thanks so much, Adriana. Thank you so much for being a guest today.
Adriana Villela: Thanks so much for having me.
Marc LeBlanc: Thank you for listening to Solving for Change. If you enjoyed this episode, leave us a rating and review on your favourite podcast service and look for our next episode.
About our hosts
Marc LeBlanc is Director of the Office of the CTO at MOBIA. An experienced technologist who has worked in large enterprises, start-ups, and as an independent consultant, he brings a well-rounded perspective to the challenges and opportunities businesses face in the age of digital acceleration. A thoughtful and engaging speaker, Marc enjoys exploring how technology and culture intersect to drive growth for today’s enterprises. His enthusiasm for these topics made him instrumental in creating and launching this podcast.