Best practices for devops observability

My first blog post in 2005 was about application logging, and I shared 10 practices I hoped my development team would follow. Today’s devops implementations around observability are more sophisticated and scalable, but several of my recommendations remain relevant, especially logging session IDs, specifying code-location details, and ensuring logs are free from security and privacy-related data.

Building observability best practices into the software development life cycle is critically important today because more apps deliver mission-critical customer and employee experiences. Devops teams should be concerned with more than a faster mean time to resolve major incidents; they should also ensure that apps and services are observable so they can continuously improve an experience’s robustness, reliability, and ease of use.

“The need for strong devops practices has never been felt so strongly,” says Arvind Jha, senior vice president, software development at Newgen. “With enterprises accelerating digital workflows and self-service options for customers, employees, and partners, IT professionals are challenged to manage, maintain, secure, and keep compliant thousands of applications.”

Tame complex cloud-native apps and microservices with observability

Today’s cloud-native applications are more complex than last generation’s two- and three-tier app architectures running in data centers. A single transaction may cross several microservices, query multiple databases, interact with third-party software as a service, and run on multiple clouds. When application errors, poor performance, and data issues impact user experiences, having centralized and meaningful information to discover root causes is key to resolving issues faster and fixing defects.

What are some use cases where strong observability practices can impact operations?

  • Identifying which microservices and APIs are causing performance bottlenecks
  • Reviewing user inputs that cause application errors
  • Tracing through a customer journey to identify where users are struggling
  • Recognizing when there’s a security breach and collecting forensics
  • Pinpointing the source of data issues to be remedied with improved data validation

Here are four observability best practices for devops teams building and supporting apps, microservices, and databases.

Seek business collaboration to optimize data collection

Logging every JSON payload between all apps and services may seem like a good idea. But having too much data can slow down resolving issues, complicate finding root causes, or require expensive data science efforts to identify problem areas.

How should devops teams decide what data is important to collect and how long to retain it?

Anant Adya, executive vice president at Infosys Cobalt, suggests, “Agile development teams should closely collaborate with developer and stakeholder groups when managing devops observability tools to ensure alignment on goals. This will help teams avoid gathering and evaluating loads of data that don’t serve a specific purpose or drive a particular outcome.”

The conversation with business stakeholders can help create a strategy and identify which user activities require higher service-level objectives. Developers must also understand what data is sensitive and either leave it out or mask it before it is captured in logs, databases, or observability tools.
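One way to mask sensitive values before they reach a log is a filter in Python's standard logging module. The sketch below is illustrative only: the two patterns, the placeholder strings, and the "checkout" logger name are assumptions, and a real list of sensitive fields would come from the conversation with stakeholders described above.

```python
import logging
import re

# Hypothetical patterns for illustration; they are not an exhaustive
# definition of sensitive data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits


def mask_sensitive(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)


class MaskingFilter(logging.Filter):
    """Mask the fully formatted message before handlers emit it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = mask_sensitive(record.getMessage())
        record.args = None  # arguments are already merged into msg
        return True  # keep the record, now masked


logger = logging.getLogger("checkout")  # assumed logger name
logger.addFilter(MaskingFilter())
```

A filter attached to a logger applies only to records logged directly on that logger. Attaching the same filter to a handler instead would cover every record that handler emits.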

Anand Krishnan, global head of presales and solutions at Persistent Systems, recommends, “In observability, it’s about what to log, not just when and where. It’s important to stitch together different traces of a customer’s journey through different systems and applications and collate that in a central repository.”
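As a rough illustration of stitching traces together, the sketch below groups events from several systems by a shared identifier and orders each journey by time. The field names `trace_id`, `ts`, `system`, and `step` are assumptions made for this example; in practice, a tracing library or observability platform would normally propagate and collect these identifiers.

```python
from collections import defaultdict
from typing import Iterable


def stitch_journeys(*sources: Iterable[dict]) -> dict[str, list[dict]]:
    """Merge events from many systems into one time-ordered list per trace."""
    journeys: dict[str, list[dict]] = defaultdict(list)
    for source in sources:
        for event in source:
            journeys[event["trace_id"]].append(event)
    for events in journeys.values():
        events.sort(key=lambda e: e["ts"])  # order each journey by timestamp
    return dict(journeys)


# Hypothetical events from two systems that share a trace ID.
web_events = [
    {"trace_id": "t1", "ts": 1, "system": "web", "step": "add_to_cart"},
    {"trace_id": "t2", "ts": 2, "system": "web", "step": "search"},
]
payment_events = [
    {"trace_id": "t1", "ts": 3, "system": "payments", "step": "charge_failed"},
]

journeys = stitch_journeys(payment_events, web_events)
```

Here `journeys["t1"]` holds the web event followed by the payment event, so someone investigating the failed charge sees the whole journey in one place.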

Centralizing observability data can be daunting for large devops organizations. They may use application performance management (APM) tools and AIops platforms such as BigPanda, Datadog, Dynatrace, Moogsoft, New Relic, OpsRamp, or Splunk to centralize data, correlate alerts, and monitor performance.

Prashanth Samudrala, vice president of customer success at AutoRabit, shares several decision points where business stakeholders and technical leaders can provide valuable input on observability tools and data. He says, “Best practices include clearly defining metrics and goals centered around quality and overall productivity, running frequent scans for technical debt to understand key problems, and flexing the team’s problem-solving muscle to learn and improve continuously.”

Move beyond resolving production incidents

Site reliability engineers (SREs), network operation centers (NOCs), and devops teams use application logs, infrastructure alerts, and other telemetry data for incident management. They may use APM tools, IT service management (ITSM) platforms, automation capabilities, and AIops solutions to resolve incidents and perform root cause analysis.

But SREs and devops teams should be proactive and use observability data and tools to improve apps while developing and testing them. Tomas Kratky, founder and CEO of Manta, says, “Observability is part of quality assurance—the last mile needed to improve incident detection and simplify its resolution.”

I recommend identifying the users and customers for observability data and tools. Self-organizing teams should declare which developers, QA engineers, SREs, and incident managers will become experts in using the tools, provide feedback to dev teams, and be most responsible for improving quality when developing and testing software. Improving observability should be a key activity in a continuous testing strategy.

Another best practice comes from Chris Cooney, developer advocate at Coralogix. “One approach to observability that has become ubiquitous amongst high-performing teams is full-stack observability, which enables teams to avoid the data silo,” he says. “Rather than settling for disparate observability data, with logs in one system, metrics in another, and traces in a third, teams are rendering everything together.”

Drive continuous improvement with observability

A select group of engineers may take lead responsibility for software quality, but they will need the full dev team to drive continuous improvements. David Ben Shabat, vice president of R&D at Quali, recommends, “Organizations should strive to create what I would call ‘visibility as a standard.’ This allows your team to embrace a culture of end-to-end responsibility and maintain a focus on continuous improvements to your product.”

One way to address responsibility is by creating and following a standardized taxonomy and message format for logs and other observability data. Agile development teams should assign a teammate to review logs every sprint and add alerts for new error conditions.
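A standardized message format could look something like the following sketch, which uses Python's standard logging module to emit every record as JSON with the same fields, including a session ID and the code location. The field names, the "orders" logger, and the sample session ID are assumptions for illustration, not a prescribed taxonomy.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as JSON with one agreed-upon set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            # Code location: module, function, and line of the logging call.
            "location": f"{record.module}.{record.funcName}:{record.lineno}",
            # session_id is supplied by callers through the `extra` argument.
            "session_id": getattr(record, "session_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")  # assumed service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Callers attach the session ID instead of embedding it in the message text.
logger.info("order submitted", extra={"session_id": "abc123"})
```

Because every entry carries the same keys, the teammate reviewing logs each sprint can filter by level, service, or session ID without parsing free-form text.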

Ben Shabat adds, “Also, automate as many processes as possible while using logs and metrics as a gauge for successful performance.”

Ashwin Rajeev, cofounder and CTO of Acceldata, agrees automation is key to driving observable applications and services. He says, “Modern devops observability solutions integrate with CI/CD tools, analyze all relevant data sources, use automation to provide actionable insights, and provide real-time recommendations.”

Coordinate monitoring and observability practices

Historically, monitoring tools were more likely to be instrumented by NOCs or ITSM and other IT ops teams, whereas observability practices stemmed from developers looking to centralize application log files and use them to resolve defects and address performance issues. Today, devops teams bring these capabilities together and develop a holistic approach to observing and monitoring customer journeys and employee experiences.

Krishnan says unifying alerts and monitoring is an important step. “Previously, you could alert an engineer about a systems issue by sending an email or doing a core dump. Today it’s about integrating it into the developer ecosystem and production support systems,” he says.

Devops teams want to release frequently and address production issues quickly. Investing in observability helps find issues before they reach production and provides the forensics to discover hard-to-find problems.
