Cortex Data Lake Platform Monitoring

A scalable data platform to help network and security administrators manage the ingestion, forwarding, and retention of their company’s firewall logs.

Role

Led the end-to-end redesign of Cortex Data Lake, resulting in a reduced mean time to resolution and an improved observability experience.

Timeline

Feb 2020 to Aug 2021

Problem Space

Viewing firewall logs and relevant log metrics is crucial for network and security admins, because it confirms that the firewalls are working to protect the network. However, the existing experience did not provide the log information needed to confirm that firewalls were up and running.

UX Pain Points

Design Thinking

Troubleshooting User Journey Map

As a cross-functional team, we mapped out a typical troubleshooting journey for a disconnected firewall. In the old experience, the user would view logs in Explore App, which is an indirect way to figure out whether a firewall was disconnected. The logging service app was practically unusable.

Affinity Mapping

I led a workshop where product managers and sales engineers wrote down common customer needs they had heard in interviews and clustered them into relevant categories. At the end of the session, we voted on the features that should be prioritized. This helped us align on product goals, and it gave me a head start on what the information architecture could look like.

Sketching, collaborating, brainstorming

The previous activity gave me a clear idea of what I needed to prioritize and which features I needed to design. This was a very large project with a lot of requirements, so I broke them down into parts and sketched out potential individual components. We had daily review sessions with our product manager and engineering lead, which helped us align on concepts and stay aware of any technical constraints.

Iteration after iteration, we focused on:

  • What metrics helped users the most

  • What charts effectively tell the story or solve the problem (the “telos”)

  • How much time the users will spend on this platform

Customer interviews eliminated our biases about what information we thought was important versus what customers thought was important. For example, in Layout 4, we had a capacity usage chart (how many TBs were used over time). Almost none of the customers we interviewed found this useful. Instead, they wanted the dashboard to show metrics specific to the ingestion and forwarding of logs. Through these interviews, we were able to nail down what they valued the most.

[Image: layout iterations]

Design Solutions

Dashboard

The CDL dashboard is a one-stop shop for security and network admins to quickly visualize the health of their firewall logs and ensure that logs are coming in and going out at a healthy rate. Through multiple customer interviews, we prioritized the log metrics that would help users get the most out of this dashboard.

Metrics Design

Connection Status

This is one of the most important widgets on the dashboard. It shows how many firewalls are connected and disconnected, since disconnections are largely the cause of logs not being ingested. It also gives network and security admins a way to keep track of firewall health.

Latency

Ingestion and forwarding latency is a metric that indicates the “freshness of logs”. An increase in ingestion latency means that there are delays in accessing incoming logs, and a decrease means that logs are coming in and going out faster.

Service Availability

Admins would also like to know the overall service availability, and what might have happened to cause device disconnections. This widget shows whether there was a service outage in the last 24 hours.

Log Rate

Admins would also like to know the average number of logs being ingested and forwarded during a given time period. Again, this helps them determine the health of their firewalls.

Log Type Table

The breakdown of log types tells admins whether they need to buy more log quota or adjust their existing quota.

Firewall Inventory

The inventory page is a way for network and security admins to take a closer look at each Firewall in the network.

Each firewall’s log data is now consolidated on this page, and admins can see each firewall’s connection history over time.

Easy Firewall Onboarding

Users can also easily onboard new firewalls into Cortex Data Lake using this quick onboarding flow. This is a major improvement over the previously complicated experience, which required switching between three applications.

Reduced Jumps in Troubleshooting Journey

Now the user can do most of their troubleshooting within the Cortex Data Lake app, since all of the important log and firewall metrics have been consolidated in one platform.

Results

  • Eliminated the need to go to Explore App and Panorama to detect disconnections. All log metrics are now consolidated into one place, which saves a lot of troubleshooting time.

  • Reduced the number of support calls generated by the lack of log metrics.

  • Being able to see latency metrics helps users evaluate the freshness of log data ingested into FLS. This was a highly requested feature by customers.

  • 4.6/5 rating on G2

Lessons

  • You can work with very little data about your customer by getting the right people in the room.

  • Work with your engineering partners every day to see what is possible and to reduce design debt.

  • Involve your stakeholders (including QA) in brainstorming, design activities, and design review sessions. This improves cross-functional alignment and avoids surprises.
