Integrating IIS Logs into Apache Druid for Real-Time Analytics


Introduction:

Apache Druid is a real-time analytics database designed for fast, interactive exploration of large-scale datasets. One common use case is analysing web server traffic data to gain insights into website usage patterns, visitor behaviour, and performance metrics. Internet Information Services (IIS) is a popular web server used by many organizations to host websites on Windows servers.

By loading IIS logs into Apache Druid, users can unlock valuable insights from their web server traffic data. This process involves extracting relevant information from the log files and transforming it into a format compatible with Apache Druid. Once loaded into Druid, users can perform interactive queries, create visualizations, and derive actionable insights to optimize their web server infrastructure and improve user experience.

Generating and Storing IIS Log Files:

When your website hosted on an Internet Information Services (IIS) server is accessed by visitors, the server automatically generates log files to record various details about each request. These log files contain important information such as the IP address of the visitor, the requested URL, the status of the request, and more.

These log files are typically stored in a specific directory on the server. For example, in a typical Windows environment, the default path for storing IIS log files is C:\inetpub\logs\LogFiles (the same path the conversion script below reads from).
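For orientation, a W3C-format IIS log file starts with directive lines (beginning with #) and a "#Fields:" header naming the columns, followed by space-delimited data rows. A minimal sketch of parsing such a file, using an invented sample with a reduced field list:

```python
# Illustrative W3C-format IIS log excerpt (all values are invented).
sample_log = """#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Fields: date time s-ip cs-method cs-uri-stem sc-status
2024-02-14 10:15:02 192.168.1.10 GET /index.html 200
2024-02-14 10:15:05 192.168.1.10 GET /missing.html 404
"""

# Data rows follow the "#Fields:" directive; directive lines start with "#".
data_rows = [line for line in sample_log.splitlines()
             if line and not line.startswith("#")]
fields = ["date", "time", "s-ip", "cs-method", "cs-uri-stem", "sc-status"]
entries = [dict(zip(fields, row.split())) for row in data_rows]

print(entries[0]["cs-uri-stem"])  # → /index.html
print(entries[1]["sc-status"])    # → 404
```

Real log files carry more fields (the full list appears in the script below), but the parsing principle is the same: split each row on whitespace and zip it with the field names from the header.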

Converting IIS Log Files to JSON Format:

A Python script can automate the conversion of IIS log files into JSON format. This script extracts relevant information from the log files, transforms it into JSON, and stores it in a specified directory. The script iterates through each log file, parses the data, and creates JSON entries for each log entry.

Python script:

import json
import os

log_folder_path = 'C:/inetpub/logs/LogFiles'
output_folder_path = 'C:/inetpub/logs/JsonFiles'
field_names = ["date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
               "s-port", "cs-username", "c-ip", "cs(User-Agent)", "cs(Referer)",
               "sc-status", "sc-substatus", "sc-win32-status", "time-taken"]

# Ensure the output directory exists before writing JSON files.
os.makedirs(output_folder_path, exist_ok=True)

for log_file_name in os.listdir(log_folder_path):
    if log_file_name.endswith(".log"):
        log_file_path = os.path.join(log_folder_path, log_file_name)
        json_output_path = os.path.join(output_folder_path, f"{log_file_name.split('.')[0]}.json")
        log_entries = []
        with open(log_file_path, 'r') as log_file:
            found_fields = False
            for line in log_file:
                # The "#Fields:" directive marks the start of the data rows.
                if line.startswith("#Fields:"):
                    found_fields = True
                    continue
                # Skip other directive lines ("#Software:", "#Date:", ...) and blanks.
                if not found_fields or line.startswith("#") or not line.strip():
                    continue
                values = line.strip().split()
                log_entry = {field: value for field, value in zip(field_names, values)}
                # Combine the separate date and time fields into a single timestamp
                # and place it first, since Druid expects a primary timestamp column.
                log_entry["datetime"] = log_entry["date"] + " " + log_entry["time"]
                del log_entry["date"]
                del log_entry["time"]
                ordered_log_entry = {"datetime": log_entry["datetime"]}
                ordered_log_entry.update(log_entry)
                log_entries.append(ordered_log_entry)
        # Write newline-delimited JSON (one object per line), the format Druid ingests.
        with open(json_output_path, 'w') as json_file:
            for log_entry in log_entries:
                json_file.write(json.dumps(log_entry) + '\n')
        print(f"Conversion complete for {log_file_name}. JSON data saved to {json_output_path}")
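Because Druid ingests newline-delimited JSON, each line of the output file must parse as a standalone JSON object. A quick sanity check of that round trip, sketched here with an in-memory buffer standing in for one of the generated files:

```python
import io
import json

# Simulated log entries in the shape the conversion script produces.
entries = [
    {"datetime": "2024-02-14 10:15:02", "cs-method": "GET", "sc-status": "200"},
    {"datetime": "2024-02-14 10:15:05", "cs-method": "GET", "sc-status": "404"},
]

# Write newline-delimited JSON, exactly as the script does ('\n' per entry).
buffer = io.StringIO()
for entry in entries:
    buffer.write(json.dumps(entry) + '\n')

# Each line should round-trip back to the original dictionary.
buffer.seek(0)
parsed = [json.loads(line) for line in buffer if line.strip()]
assert parsed == entries
print("All lines parse as valid JSON")
```

Running a check like this against a real output file before transferring it to the Druid server can save a failed ingestion task later.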

Output:

Each processed .log file produces a matching .json file in the output directory, containing one JSON object per request, with the combined datetime field first.
Loading JSON Formatted Log Files into Apache Druid:

Once you have converted your IIS log files into JSON format using the Python script provided earlier, the next step is to load these JSON files into Apache Druid for analysis. Below is a simple guide to transfer and load these files into Apache Druid using SCP (Secure Copy Protocol) and SSH (Secure Shell):

  1. SCP Command Overview:
    • SCP (Secure Copy Protocol) is a secure file transfer protocol used to securely transfer files between a local host and a remote host.
    • In this case, we will use SCP to transfer the JSON formatted log files from the local system to the remote server where Apache Druid is installed.
  2. SCP Command Syntax:

scp /path/to/local/files/*.json sshuser@remote_server:/path/to/remote/directory

  3. Command Breakdown:
    • /path/to/local/files/*.json: Specifies the path to the JSON formatted log files on the local system. The wildcard *.json is used to select all JSON files in the specified directory.
    • sshuser@remote_server: Specifies the SSH username and IP address (or hostname) of the remote server where Apache Druid is installed.
    • :/path/to/remote/directory: Specifies the path to the directory on the remote server where the JSON files will be copied.
  4. Loading JSON Files into Apache Druid:
    • Once the JSON files are transferred to the remote server, you can proceed to load them into Apache Druid using the Druid indexing service.
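One way to drive the indexing service is to POST a native batch ingestion spec to the Druid Overlord API (/druid/indexer/v1/task). The sketch below is a minimal, illustrative spec; the datasource name iis_logs and the directory path are assumptions you would replace with your own values:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/path/to/remote/directory",
        "filter": "*.json"
      },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "iis_logs",
      "timestampSpec": { "column": "datetime", "format": "yyyy-MM-dd HH:mm:ss" },
      "dimensionsSpec": {
        "dimensions": ["s-ip", "cs-method", "cs-uri-stem", "c-ip", "sc-status"]
      },
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "none" }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

The timestampSpec points at the combined datetime column the conversion script produced; the dimensions list here is a subset of the available fields, chosen for illustration. Alternatively, the web console walkthrough below builds an equivalent spec interactively.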

Loading JSON Files into Apache Druid Web Console:

To load JSON files into the Apache Druid Web Console, you can utilize the “Load data local” feature, which allows you to ingest data directly from files stored on the local file system. Here’s a step-by-step guide to accomplish this:

  1. Access the Apache Druid Web Console:
    • Open your web browser and navigate to the Apache Druid Web Console URL. Typically, this can be accessed at http://localhost:8888.

  2. Navigate to the “Load data” Page:
    • Once logged in, navigate to the “Load data” page within the Druid Web Console. This page is where you initiate the data ingestion process.
  3. Choose the Directory Containing Your JSON Files:
    • Within the “Load data local” interface, specify the directory where your JSON files are stored. This directory must be accessible from the Apache Druid server. After completing the parsing, schema, and tuning steps and submitting the ingestion spec, you can perform interactive queries on the IIS log data.
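Once ingestion completes, the data can be explored from the console’s Query view using Druid SQL. An illustrative query, assuming the datasource was named iis_logs and that sc-status was ingested as a string dimension:

```sql
-- Top 10 requested URLs with error counts over the last day (illustrative).
SELECT
  "cs-uri-stem" AS url,
  COUNT(*) AS requests,
  SUM(CASE WHEN "sc-status" >= '400' THEN 1 ELSE 0 END) AS errors
FROM iis_logs
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY requests DESC
LIMIT 10
```

Queries like this surface the usage patterns and error hotspots mentioned in the introduction without any additional tooling.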

Conclusion:

Integrating IIS logs into Apache Druid empowers organizations to gain valuable insights from their web server traffic data in real-time.

By following the steps outlined in this guide, users can seamlessly convert, transfer, and load their IIS log files into Apache Druid, enabling them to perform interactive queries, create visualizations, and derive actionable insights to optimize their web server infrastructure and improve user experience.


Neha Vittal Annam
