Monitor Account Activity Using Databricks System Tables


Ever wonder what’s really happening behind the scenes in your Databricks workspace? Who’s reading or writing data, what files are being accessed, or whether any unusual activity is taking place in your storage? You don’t need a complex monitoring setup to find out. With a little understanding of system tables and audit logs, you can uncover a detailed picture of how your Databricks storage is being used.

What are System Tables?

Think of system tables like the activity log on your phone or computer. They quietly keep track of everything happening in your Databricks environment: who did what, when, and how.

These logs include things like:

  • Who logged in
  • What queries were run
  • How long each task took
  • Which tables or files were accessed

And the best part? You can look into these records just like reading any other table, by running SQL queries.
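For example, a quick look at recent sign-ins can be done with an ordinary SQL query against the audit log table. The sketch below is illustrative only — it assumes the system.access.audit table described in the Databricks documentation, and the exact service and action names may vary by release:

```sql
-- Sketch: most recent login events from the audit log
-- (assumes system.access.audit; action names may differ by release)
SELECT
  event_time,
  user_identity.email AS user_email,
  action_name
FROM system.access.audit
WHERE action_name = 'login'
ORDER BY event_time DESC
LIMIT 20;
```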

Note: To access system tables, your workspace must be enabled for Unity Catalog.

You can refer to the official Databricks documentation to see which types of system tables are available.

Why Should You Monitor Account Activity? (Benefits)

Here are a few real-world reasons:

  1. Security: Catch unauthorized access attempts.
  2. Performance: Find out which queries are slowing things down.
  3. Audit: Know who changed what and when.
  4. Usage trends: See which users are active and which features they use.
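To give a taste of the performance angle, a query history sketch like the one below can surface the slowest statements of the past day. This is a hedged example — it assumes the system.query.history table with start_time and total_duration_ms columns, as documented for Unity Catalog workspaces:

```sql
-- Sketch: ten slowest statements in the last 24 hours
-- (assumes system.query.history with a total_duration_ms column)
SELECT
  statement_id,
  executed_by,
  total_duration_ms / 1000.0 AS duration_seconds
FROM system.query.history
WHERE start_time >= now() - INTERVAL 1 DAY
ORDER BY total_duration_ms DESC
LIMIT 10;
```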

Use Cases:

Let’s walk through a simple scenario where system tables are especially useful.

Scenario: Identifying Underutilized Clusters for Cost Optimization

Goal:

You want to identify clusters that:

  • Have been running for a long time,
  • Have a high number of workers,
  • But are underutilized in terms of CPU usage.

This helps in cost optimization by flagging clusters that might be over-provisioned or idle.

Tables Involved:

  • compute.clusters: Cluster metadata (e.g., size, owner, creation time).
  • compute.node_timeline: Node-level performance metrics (e.g., CPU usage).
  • compute.node_types: Hardware specs for node types.

Example Query:

WITH avg_cpu_usage AS (
  SELECT
    cluster_id,
    AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_utilization
  FROM system.compute.node_timeline
  WHERE end_time >= now() - INTERVAL 1 DAY
  GROUP BY cluster_id
)

SELECT
  c.cluster_id,
  c.cluster_name,
  c.owned_by,
  c.worker_count,
  c.driver_node_type,
  c.worker_node_type,
  c.create_time,
  a.avg_cpu_utilization
FROM system.compute.clusters c
JOIN avg_cpu_usage a ON c.cluster_id = a.cluster_id
WHERE a.avg_cpu_utilization < 20
  AND c.worker_count >= 4
  AND c.delete_time IS NULL
ORDER BY a.avg_cpu_utilization ASC;

What does this query do?

  1. Calculates the average CPU utilization per cluster over the past 24 hours.
  2. Filters for clusters with 4 or more workers and low CPU usage (< 20%).
  3. Returns metadata like cluster name, owner, and node types for further analysis.

Benefits:

  • Cost Savings: Identify clusters that can be downsized or terminated.
  • Operational Efficiency: Spot idle resources and reallocate them.
  • Governance: Understand who owns the clusters and when they were created.
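To put a rough dollar figure on a flagged cluster, you can follow up with the billing system table. The sketch below is an assumption-laden example — it presumes the system.billing.usage table with usage_metadata.cluster_id and usage_quantity (in DBUs) columns, per the Databricks billable-usage documentation:

```sql
-- Sketch: DBU consumption per cluster over the last 7 days
-- (assumes system.billing.usage; usage_quantity is measured in DBUs)
SELECT
  usage_metadata.cluster_id AS cluster_id,
  SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 7 DAYS
  AND usage_metadata.cluster_id IS NOT NULL
GROUP BY usage_metadata.cluster_id
ORDER BY total_dbus DESC;
```

Cross-referencing the low-utilization clusters from the earlier query against their DBU totals shows where right-sizing would save the most.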

Final Thoughts:

Databricks System Tables make it much easier to understand what’s happening in your data platform. Whether you’re a data engineer trying to improve performance or someone in charge of data security and compliance, these tables give you the clear, useful insights you need—all in one place.


Pallavi A
