Logs are a critical part of any system: they provide vital information about an application and answer questions about what the system is doing and what has happened. Most of the processes running on a system generate logs in one form or another. For convenience, these logs are often collected in files on a local disk, with log rotation enabled. When the system is hosted on a single machine, file logs are easy to access and analyze, but once the system grows to multiple hosts, log management becomes a nightmare. It is difficult to look up a particular error across thousands of log files on hundreds of servers without the help of dedicated tools. A common approach to this problem is to deploy and configure a centralized logging system, so that data from each log file on each host is pushed to a central location.
There are many different ways of building a centralized logging solution. The easiest is to replicate log files onto a central server on a schedule, using scripts or standard tools such as cron and rsync. This approach has a serious drawback, however: it only provides centralized storage for log data without aggregating it.
Another option is to use syslog, a standard for message logging that separates three concerns: the software that generates messages, the system that stores them, and the software that reports on and analyzes them. The two most popular implementations of the syslog standard are syslog-ng and rsyslog. With these systems, you can send log messages in the syslog format from multiple clients to a log server. This approach is especially popular on UNIX-like systems, where one of these daemons is usually preinstalled, and many hardware and software vendors make their products compatible with the syslog standard.
As the amount of generated data grew dramatically, a new class of log collection and processing software was designed to meet the requirements of high-volume, high-throughput systems. Originally built for event streaming and processing, such systems also came to be widely used for log processing. Each system of this kind has its own specific features, but their high-level architectures are quite similar. They usually consist of multiple logging clients or agents installed on each host. These clients send log messages to a cluster of aggregators, which then forward the messages to distributed, scalable data storage. The key idea is that the log collectors are distributed and horizontally scalable, which makes it possible to handle thousands of messages per second from many logging hosts. Let’s talk about two of the most popular open-source solutions widely used in high-throughput systems: Apache Flume and Logstash.
Apache Flume is a scalable solution for streaming logs into Apache Hadoop. It is a distributed data ingestion system for efficiently collecting, aggregating and moving huge amounts of data into HDFS. In Flume terminology, a single message is called an “event”, and events flow through one or more Flume agents to reach their final destination (usually HDFS). An event consists of a message body and headers that carry metadata such as a timestamp, the event originator and so on. An agent is a Java process that hosts the components through which events flow: sources, channels and sinks.
A Flume source consumes events posted to it by an external source. When the source receives an event, it stores it in one or more Flume channels. A channel is a simple store that keeps an event until it is consumed by a Flume sink. The sink takes the event from the channel and either puts it into external storage (such as HDFS) or forwards it to another Flume agent.
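This source–channel–sink wiring is declared in a plain properties file. Here is a minimal sketch of a single-agent configuration; the agent name `a1`, the port and the component names are illustrative assumptions:

```properties
# Name the components of agent "a1"
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# NetCat source: turns each line of text received on port 44444 into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# In-memory channel buffering events between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink: writes events to the agent's own log at INFO level
a1.sinks.k1.type = logger

# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An agent configured this way is typically started with `flume-ng agent --conf-file example.conf --name a1`.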
Apache Flume supports the following event source listeners:
- Avro Source. Listens on Avro port and receives events from external Avro client streams.
- Thrift Source. Listens on Thrift port and receives events from external Thrift client streams.
- Exec Source. Runs a given Unix command on start-up and expects that process to continuously produce data on standard out.
- NetCat Source. Listens on a given port and turns each line of text into an event.
- Syslog Sources. Reads syslog data and generates Flume events.
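For example, an Exec source can tail an application log file and turn each new line into an event; the file path and component names below are illustrative assumptions:

```properties
# Exec source: run "tail -F" and emit each output line as an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp/app.log
a1.sources.r1.channels = c1
```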
Additionally, there is a set of standard sinks:
- HDFS Sink. Writes events into the Hadoop Distributed File System (HDFS).
- Logger Sink. Logs events at the INFO level. Typically useful for testing and debugging purposes.
- Avro Sink. Forms one half of Flume’s tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname/port pair.
- Thrift Sink. Forms the other half of Flume’s tiered collection support. Flume events sent to this sink are turned into Thrift events and sent to the configured hostname/port pair.
- File Roll Sink. Stores events on the local filesystem.
- HBase Sink. Writes data to HBase.
- ElasticSearch Sink. Writes data to an Elasticsearch cluster.
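A sink is configured in the same properties file as the source and channel. As a sketch, an HDFS sink that rolls events into date-bucketed directories might look like this; the namenode address, path and roll interval are assumptions:

```properties
# HDFS sink: write events into date-bucketed directories on HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 300 seconds
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.channel = c1
```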
Flume’s extensibility and scalability model can be leveraged to apply real-time analytics to data flows. Performance tests have shown that the system is capable of logging up to 70K events per second. Apache Flume comes with a log4j client, so you can easily feed log messages from your Java application into Flume. Moreover, the community has developed clients for other languages as well, so finding one that fits your stack shouldn’t be a problem.
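With the bundled log4j appender, pointing a Java application’s logs at a Flume agent’s Avro source is a matter of configuration. A sketch of a `log4j.properties`, where the hostname and port are assumptions that must match a running agent’s Avro source:

```properties
# Route application logs to a Flume agent's Avro source via the Flume log4j appender
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume-agent.example.com
log4j.appender.flume.Port = 41414
```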
Apache Flume advantages:
- guarantees data delivery
- scales horizontally
- high throughput
- fault tolerance
Logstash is a tool for managing events and logs. It is written in JRuby and requires a JVM to run. Usually one client is installed per host; it can listen to multiple sources, including log files, Windows events, syslog events and so on. The downside of using the JVM is that memory usage can be higher than you would expect for mere log transportation. To address this, the community developed Lumberjack, a lightweight shipper that is deployed on each host and collects and forwards logs to Logstash running on the centralized log hosts. A Logstash instance on a host can likewise act as just a client (shipper) that sends log messages to centralized storage.
Typically, a centralized logging solution based on Logstash consists of the following components:
- Multiple Collector Agents. Used for collecting log messages from each host (Logstash shipper or its lightweight version Lumberjack).
- Broker. Used as a queue and broker to feed messages and logs to Logstash (Redis or RabbitMQ).
- Indexer. Responsible for pulling messages from the broker and putting them into external storage.
- Distributed Storage. Used for permanent storage and indexing of all the log messages (Elasticsearch).
- Data Visualization. Web interface for searching and analyzing logs stored by ES (Kibana).
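The indexer in this pipeline is itself a Logstash instance, configured in Logstash’s own config DSL. A sketch with Redis as the broker; the hostnames and the Redis list key are assumptions, and exact option names vary between Logstash versions:

```
# Indexer: pull events from the Redis broker and store them in Elasticsearch
input {
  redis {
    host      => "broker.example.com"
    data_type => "list"
    key       => "logstash"
  }
}
output {
  elasticsearch {
    host => "es.example.com"
  }
}
```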
Like Flume, Logstash has many event sources that listen for incoming messages from log files, syslog, Windows event logs, standard output and so on. One of the coolest things about Logstash is its ability to filter messages, modify them and transform them from one format to another. You can configure your own rules for each event source. Once the log messages are in Elasticsearch, you can query them any way you want. Conveniently, there is a web interface called Kibana for interactive queries and log visualization.
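As a sketch of such filtering, the `grok` filter can parse unstructured lines into named fields and the `date` filter can promote a parsed timestamp to the event’s timestamp; the example assumes Apache access-log input:

```
filter {
  grok {
    # Parse Apache access-log lines into structured fields
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
  date {
    # Use the parsed timestamp as the event's own timestamp
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
```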
To conclude, log management nowadays goes far beyond searching for specific keywords in log files. As systems become more powerful and complicated, and each component produces logs in a different format, you need more versatile tools that let you see what the system is doing and what has happened. In most cases you don’t need to reinvent the wheel, because many tools are available on the market, both open-source and proprietary. All you need to do is carefully consider which one fits your particular case best. Flume, Logstash, Splunk – all of them are great software products that solve log management tasks in slightly different ways but equally make your life easier.