What do we offer!

How can we manage log contents in near real time ?

This is the question ! How many times we have run troubleshooting or attack analysis on log data having to do with different files , different formats, different content ect. 

Big Data can give an answer to this. All demo below tries to give an answer to tha question and has been produced with HortonWorks


First of all we need to collect log data in near real time. One possible solution is to use the combination of Storm and Kafka. Storm is a streming software while Kafka is a publish subscribe message broker. What we have implemented is a "producer" job that gets' log messages in input and produce a message in the Kafka queue according to a particular topic.

For this demo we have used following log data:

  • Apache webserver log
  • Tomcat application server log
  • Oracle Database audit data

For Apache an Tomcat logs we have build two producers in java which accepts line in standard input and send it to the Kafka broker with a suitable topic. 

For Oracle Database audit data we have implemented a java coverter which transforms the plain text audit file in a fixed format file. We have then imported the file in HDFS.


Working with Storm we talk about "topologies". Topologies are build by spout and bolts. Spouts are the components in charge of "streaming" the data in the form of "tuples". Bolts are the consumers of the tuples specialised in different purposes. Here below the example:


This is the topology for the Apache webserver log. kafkaSpoutOHS is the defined Spout in Storm getting messages from Kafka queue. The three Bolts used are specialised for the following:

logOHSEventBolt. This bolt writes a line in the storm topology log file for each tuple received from the Spout

hdfsBoltOHS. Writes the line received from the Spout in the HDFS file system. Temporary files are created in a staging area and then collapsed and moved to the warehous area of Hive in partitioned directories

hbaseBoltOHS. Writes the line received from the Spout in a HBase table.

Another topology is created for Tomcat log.

A log tailer is on the Apache server. We have built a java class but can be a normal "tail -f" shell command. The java class might be better because it will open the new file when the log "tailed" in that moment is rolled.

The Kafka producer is another java class which is in charge of keeping every single line of the log and send it to Kafka message queue as a single message.

The KafkaSpoutOHS subscribed to the Kafka topic and get the log messages.

Data have been organised in HDFS, HBase and Hive as follows:


The hdfsBoltOHS writs files in a HDFS area (in our case we have left the default) suitable for Hive to create table based on that files.







Messages containing log content have been parsed with regex. Example with apache access log: - - [21/Apr/2015:16:50:25 +0200] "GET /private/index.html HTTP/1.1" 302 303 - - [21/Apr/2015:16:54:35 +0200] "POST /UpdateAgentCacheServlet?shortcircuit=false HTTP/1.1" 200 2 - - [21/Apr/2015:16:55:44 +0200] "POST /UpdateAgentCacheServlet?shortcircuit=false HTTP/1.1" 200 2

and here below the regular expression used to parse each line

String logEntryPattern = "^([\\d.]+) (\\S+) (\\S+) \\[([0-9]{2}\\/[a-zA-Z]{3}\\/[0-9]{4}\\:[0-9]{2}\\:[0-9]{2}\\:[0-9]{2})\\s[\\+\\-](\\d{4})\\]\\s\\\"(.*?)\\\"\\s(\\S+)\\s(\\S+)";

Here below an example of the files stored in hdfs in the hive area. /app/hive/warehouse/logohs_events_text_partition/date=2015-05-03-18. The partitioning rule is based on the "hour". 2015-05-03-18 means alldata in that partition have been collected from 18:00 to 18:59.

HDFS Hive Area

Tables have been defined in Hive and Hbase :

Hive Table

HBase table

create table logohs_events_text_partition  

(ipaddress string,  

 timestamp timestamp,  

 operation string,  

 code1       string,  

 code2       string,  

 fullline  string)  

partitioned by (date string)  



hbase(main):003:0> describe 'LogOHS_events'
'LogOHS_events', {NAME => 'events',DATA_BLOCK_ENCODING => 'NONE',




BLOCKSIZE => '65536',IN_MEMORY => 'false',

BLOCKCACHE => 'true'}

Data Example

column=events:ip,  timestamp=1430679602285, value=                                               
column=events:op,  timestamp=1430679602285, value=GET /private/lista.html HTTP/1.1                            
column=events:tim, timestamp=1430679602285, value=\x00\x00\x01L\xE7;\xC08                                    
column=events:cd1, timestamp=1430679602041, value=304                                                        
column=events:cd2, timestamp=1430679602041, value=- 


Similarly we have created other two tables to contain log data coming from Tomcat and Database Audit data. (logohs_events_text_partition, logdatabaseaudit)

Data will continue to flow into the tables as soon as the "tail" procedure captures new log messages and stream down to Kafka message broker , Storm and so on.

Meanwhile it is possible to run some queries to see potential security breach. For the scope of this simple demo we have created a new table log_monitor populated with the data of a single day coming from all logs. Here below the script:

insert into table log_monitor

select 'DB', timestamp,  CONCAT(COALESCE(dbuser, '-'),COALESCE(action, '-')) as operation

from logdatabaseaudit

where timestamp > '2015-04-27'


insert into table log_monitor

select 'TOMCAT', timestamp, operation

from logapa_events_text_partition

where timestamp > '2015-04-27'


insert into table log_monitor

select 'OHS', timestamp, operation

from logohs_events_text_partition

where timestamp > '2015-04-27'


 Now if we select data from log_monitor ordering result by timestamp we can see important information. For example how many operation we have got on the single servers in the single hours:

SELECT module, hour(timestamp) as hour, count(*)
FROM `default.log_monitor`
group by module, hour(timestamp)
order by hour;


log monitor1 

We can see a lot of activity into the Database in hour 9. So we can drill down in the single hour and see the details of operations

SELECT timestamp, module, operation

FROM `default.log_monitor`

where hour(timestamp) = 9

order by timestamp;

And here the result:

log monitor2

Here the analysis leads to the following: despite normal operations in OHS and TOMCAT addressing "lista.html" and "listaOrdini.jsp", there is a suspicious access to the table "EMP" addressed by the SYS user directly in the Database outside application control.


Share your social media marketing partner
Category: Contenuto Sito

Your Next Step...