Apache Druid is a real-time analytics database designed for exploring large-scale datasets. One common use case is analyzing web server traffic to gain insight into website usage patterns, visitor behavior, and performance. Internet Information Services (IIS) is a popular web server used by many organizations to host websites on Windows servers.
By loading IIS logs into Apache Druid, users can unlock valuable insights from their web server traffic data. This process involves extracting relevant information from the log files and transforming it into a format compatible with Apache Druid. Once loaded into Druid, users can perform interactive queries, create visualizations, and derive actionable insights to optimize their web server infrastructure and improve user experience.
When your website hosted on an Internet Information Services (IIS) server is accessed by visitors, the server automatically generates log files to record various details about each request. These log files contain important information such as the IP address of the visitor, the requested URL, the status of the request, and more.
These log files are typically stored in a specific directory on the server, often within the IIS installation directory. In a typical Windows environment, the default path for IIS log files is `C:\inetpub\logs\LogFiles` (the same directory the conversion script below reads from).
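An IIS log file in the default W3C Extended Log format starts with comment headers and then one space-delimited line per request. The excerpt below is a hand-written illustration (all values invented), with the field list matching the one used by the conversion script later in this article:

```
#Software: Microsoft Internet Information Services 10.0
#Version: 1.0
#Date: 2024-01-15 00:00:01
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
2024-01-15 00:00:01 10.0.0.4 GET /index.html - 80 - 203.0.113.7 Mozilla/5.0 - 200 0 0 15
```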
A Python script can automate the conversion of IIS log files into JSON format. This script extracts relevant information from the log files, transforms it into JSON, and stores it in a specified directory. The script iterates through each log file, parses the data, and creates JSON entries for each log entry.
```python
import json
import os

log_folder_path = 'C:/inetpub/logs/LogFiles'
output_folder_path = 'C:/inetpub/logs/JsonFiles'

field_names = ["date", "time", "s-ip", "cs-method", "cs-uri-stem",
               "cs-uri-query", "s-port", "cs-username", "c-ip",
               "cs(User-Agent)", "cs(Referer)", "sc-status",
               "sc-substatus", "sc-win32-status", "time-taken"]

for log_file_name in os.listdir(log_folder_path):
    if log_file_name.endswith(".log"):
        log_file_path = os.path.join(log_folder_path, log_file_name)
        json_output_path = os.path.join(
            output_folder_path, f"{log_file_name.split('.')[0]}.json")
        log_entries = []
        with open(log_file_path, 'r') as log_file:
            found_fields = False
            for line in log_file:
                # Skip header lines until the #Fields declaration is seen.
                if line.startswith("#Fields:"):
                    found_fields = True
                    continue
                # Ignore any other comment lines (#Software, #Version, #Date).
                if not found_fields or line.startswith("#"):
                    continue
                values = line.strip().split()
                log_entry = {field: value
                             for field, value in zip(field_names, values)}
                # Merge the separate date and time columns into one field.
                log_entry["datetime"] = (log_entry["date"] + " "
                                         + log_entry["time"])
                del log_entry["date"]
                del log_entry["time"]
                # Put datetime first so it is easy to pick as the timestamp.
                ordered_log_entry = {"datetime": log_entry["datetime"]}
                ordered_log_entry.update(log_entry)
                log_entries.append(ordered_log_entry)
        # Write newline-delimited JSON, one object per line.
        with open(json_output_path, 'w') as json_file:
            for log_entry in log_entries:
                json_file.write(json.dumps(log_entry) + '\n')
        print(f"Conversion complete for {log_file_name}. "
              f"JSON data saved to {json_output_path}")
```
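As a quick check of the parsing logic, the snippet below applies the same split-and-zip approach from the script to a single hypothetical log line (field values invented for illustration):

```python
# Demonstrates the per-line parsing used in the conversion script.
field_names = ["date", "time", "s-ip", "cs-method", "cs-uri-stem",
               "cs-uri-query", "s-port", "cs-username", "c-ip",
               "cs(User-Agent)", "cs(Referer)", "sc-status",
               "sc-substatus", "sc-win32-status", "time-taken"]

# A hypothetical log entry in the order declared by the #Fields header.
line = ("2024-01-15 00:00:01 10.0.0.4 GET /index.html - 80 - "
        "203.0.113.7 Mozilla/5.0 - 200 0 0 15")

values = line.strip().split()
log_entry = {field: value for field, value in zip(field_names, values)}

# Merge date and time into a single datetime key, as the script does.
log_entry["datetime"] = log_entry["date"] + " " + log_entry["time"]
del log_entry["date"], log_entry["time"]

print(log_entry["datetime"])   # 2024-01-15 00:00:01
print(log_entry["sc-status"])  # 200
```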
Once you have converted your IIS log files into JSON format using the Python script above, the next step is to load these JSON files into Apache Druid for analysis. Below is a simple guide to transferring the files to the Druid server using SCP (Secure Copy Protocol), which runs over SSH (Secure Shell):
```shell
scp /path/to/local/files/*.json sshuser@remote_server:/path/to/remote/directory
```
To load JSON files into the Apache Druid Web Console, you can utilize the “Load data local” feature, which allows you to ingest data directly from files stored on the local file system. Here’s a step-by-step guide to accomplish this:
1. Open the Druid Web Console:
2. Navigate to the “Load data” page: from the console home screen, click “Load data” and choose “Local disk” as the input source.
3. Choose the directory with the JSON files: enter the base directory you copied the files to and a file filter such as `*.json`, then continue through the parse, schema, and tuning steps to submit the ingestion spec.
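Once the ingestion task completes, the data can be queried with Druid SQL over HTTP (Druid accepts SQL at `POST /druid/v2/sql` with a JSON body). The sketch below only builds a sample query payload; the datasource name `iis_logs` and the router URL on port 8888 are assumptions for a default local setup:

```python
import json

# Hypothetical datasource name; replace with the one chosen at ingestion.
DATASOURCE = "iis_logs"

# Assumed default router address for a local Druid install.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Count requests per HTTP status code.
query = {
    "query": f"""
        SELECT "sc-status" AS status, COUNT(*) AS hits
        FROM "{DATASOURCE}"
        GROUP BY "sc-status"
        ORDER BY hits DESC
    """
}

payload = json.dumps(query)
print(payload)

# To run it against a live cluster (not executed here), something like:
# import urllib.request
# req = urllib.request.Request(DRUID_SQL_URL, data=payload.encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```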
Integrating IIS logs into Apache Druid empowers organizations to gain valuable insights from their web server traffic data in real-time.
By following the steps outlined in this guide, users can seamlessly convert, transfer, and load their IIS log files into Apache Druid, enabling them to perform interactive queries, create visualizations, and derive actionable insights to optimize their web server infrastructure and improve user experience.
Neha Vittal Annam