No Time for Panic: Gen AI and LLMs in Incident Handling

Generative Artificial Intelligence (AI), bolstered by Large Language Models (LLMs), is redefining incident response management. It shifts the traditional reactive approach toward a proactive one, helping teams anticipate and mitigate incidents before they spiral out of control.

Picture this: a digital assistant, tirelessly working around the clock, streamlining the incident response process. Ben Wiegelmann, a Senior Product Manager at PagerDuty, describes the transformation: "We've harnessed the power of generative AI to expedite incident updates. The person sending the message merely reviews, approves, and dispatches it. This innovation is a game-changer for our customers who operate 24/7 teams, keeping stakeholders in the loop."

But it's not just PagerDuty. Other vendors are also riding the AI wave to enhance their incident response. Logz.io, for instance, uses AI to surface incident response recommendations, facilitating quicker and more effective responses. Rubrik and Microsoft are using generative AI models to streamline operations and improve efficiency: when the Rubrik Security Cloud machine learning models detect anomalous activity, an incident is automatically created in Microsoft Sentinel. This proactive approach allows for quicker response times and more efficient handling of potential security threats.

Sarika Mohapatra, ex-VP of Engineering at Devo, highlights several use cases where generative AI can be a game-changer in incident handling: root cause analysis with suggested resolutions, an SRE assistant for remediation and incident communication, and postmortem generation. For root cause analysis, the AI can monitor metrics and alerts from different tools to flag a potential failure automatically, form a hypothesis of the root cause, and suggest actions for resolution. As an SRE assistant, it can guide remediation by providing accurate steps for custom services and operations, which can sharply reduce both MTTR and human error. For incident communication, it can summarize the status of an incident, ask follow-up questions, and generate and send status updates for various functions. Lastly, it can integrate with various tools to generate a high-quality postmortem draft.

However, the implementation of AI is not a standalone solution. The key to successful deployment lies in striking a balance between automation and human intervention. This is where the 'Human-in-the-Loop' approach comes into play. Leoor Engel, Director of Engineering - Incident Response at PagerDuty, emphasizes, "You don't want to cap the upside of what the AI can do; it can really impress you. But we need to ensure there's a human-in-the-loop to read, refine, and correct the generated content. It's about striking the right balance between automation and human intervention for accurate and customer-friendly results."
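The review-then-send flow described above can be sketched as a small approval gate. This is a minimal illustration, not PagerDuty's implementation: the names StatusUpdate, review, and dispatch are hypothetical, and the send callable stands in for whatever channel actually delivers the message.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class StatusUpdate:
    """An AI-generated incident update awaiting human review (hypothetical type)."""
    incident_id: str
    text: str
    status: Status = Status.DRAFT
    reviewer: Optional[str] = None


def review(update: StatusUpdate, reviewer: str, approve: bool,
           edited_text: Optional[str] = None) -> StatusUpdate:
    """Record a human decision; the reviewer may refine the text first."""
    if edited_text is not None:
        update.text = edited_text
    update.status = Status.APPROVED if approve else Status.REJECTED
    update.reviewer = reviewer
    return update


def dispatch(update: StatusUpdate, send: Callable[[str], None]) -> None:
    """Send only updates that a named human has approved."""
    if update.status is not Status.APPROVED or not update.reviewer:
        raise PermissionError("update has not been approved by a human reviewer")
    send(update.text)
```

The point of the design is that dispatch refuses anything still in draft, so the generated text can be as ambitious as the model allows while a person remains the last step before stakeholders see it.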

One of the standout features of LLMs is their ability to retain context. Rather than treating each input independently, an LLM interprets a series of inputs in relation to one another, within the limits of its context window. Everaldo Aguiar, Senior Engineering Manager, Data Science at PagerDuty, explains how PagerDuty extracts data from various sources, such as incident notes and Slack channels, to enhance the user experience and the effectiveness of the incident response process.
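One way to picture this is a step that merges incident notes and chat messages into a single chronological context before anything is sent to a model. The sketch below is an assumption about how such a step might look, not a description of PagerDuty's pipeline; Entry, build_context, and the character budget are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Entry:
    timestamp: datetime
    source: str   # e.g. "incident_note" or "slack" (illustrative labels)
    author: str
    text: str


def build_context(entries: List[Entry], max_chars: int = 4000) -> str:
    """Merge entries chronologically, keeping the newest ones that fit the budget."""
    ordered = sorted(entries, key=lambda e: e.timestamp)
    lines = [f"[{e.timestamp:%H:%M}] ({e.source}) {e.author}: {e.text}"
             for e in ordered]
    kept: List[str] = []
    used = 0
    for line in reversed(lines):      # walk from newest to oldest
        cost = len(line) + 1          # +1 for the joining newline
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(reversed(kept))  # restore chronological order
```

Dropping the oldest entries first is only one possible policy; a real system might summarize older material instead of discarding it.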

While AI and LLMs are impressive, they are not without their potential pitfalls. The most common failure modes of LLMs can be critical when applied to incident response. For instance, 'hallucination' can occur when the AI generates information not based on input data, leading to the dissemination of inaccurate or misleading information. 'Misprioritization' can result in the AI incorrectly ranking tasks or alerts, potentially delaying responses to critical incidents. The 'black boxing' phenomenon refers to the opaque nature of AI decision-making processes, making it challenging to comprehend why the AI made a particular decision or recommendation. Mitigating these risks calls for organizations to invest in their most valuable asset: their people. By enhancing the resilience, adaptability, and expertise of their incident response teams, they can create a robust counterbalance to the inherent uncertainties of LLMs.
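Tooling can support the people doing that review. As a rough illustration of catching one kind of hallucination, the sketch below flags concrete details in a generated draft (IDs, times, numbers) that do not appear anywhere in the source material, so a reviewer knows where to look first. The function name, the token pattern, and the substring matching are all simplifying assumptions; a check like this cannot detect subtler fabrications.

```python
import re
from typing import List

# Details worth verifying: ticket-style IDs, clock times, and numbers or percentages.
_CHECKABLE = re.compile(r"\b[A-Z]+-\d+\b|\b\d{1,2}:\d{2}\b|\b\d+(?:\.\d+)?%?")


def ungrounded_facts(draft: str, sources: List[str]) -> List[str]:
    """Return checkable tokens from the draft that never occur in the sources."""
    corpus = "\n".join(sources)
    flagged: List[str] = []
    for token in _CHECKABLE.findall(draft):
        # Plain substring matching: crude, and it can miss cases such as
        # "5" being treated as present because "15" is in the corpus.
        if token not in corpus and token not in flagged:
            flagged.append(token)
    return flagged
```

A non-empty result does not prove the draft is wrong; it only marks details a human should confirm before the update goes out.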

The blend of AI's computational power and human judgement holds the potential to revolutionize incident response, making it more efficient, accurate, and user-friendly. However, it's crucial to be aware of the potential risks and to take steps to mitigate them, ensuring that the implementation of AI in incident response is both effective and safe.


Banner Image Credits: From the Expo at Great International Developer Summit
