This article will show you how to do some Apache log analysis using Riak and MapReduce. Specifically it will give an example of how to extract URLs from Apache logs stored in Riak (the map phase) and provide a count of how many times each URL was requested (the reduce phase).
So what is Riak? According to Wikipedia it’s “a NoSQL database implementing the principles from Amazon’s Dynamo paper”. Or, put another way, it’s a distributed key-value store that has built-in support for MapReduce. If you aren’t familiar with MapReduce a good starting point would be to read Google’s MapReduce paper. I am not going to go over how to install Riak; there’s a good tutorial for that on the Riak website. Riak also has a lot of other features that won’t be covered here.
Right, off we go.
The first thing to do is get your log data into Riak. The example uses Riak’s HTTP API, which lets you read, create and delete content using standard HTTP methods (GET, PUT, POST, DELETE). Run the following command from wherever your log data is stored:
curl -v -X PUT http://localhost:8091/riak/logs/2011-08-23 \
-H "Content-Type: text/plain" --data-binary @2011-08-23.log
In this case I am storing the log data in a bucket called “logs” and I am also providing a key (“2011-08-23”) for this particular log. If you POST to just the bucket, i.e. you don’t specify a key, Riak will generate a key for you; you will be able to see the generated key in the “Location” header of the HTTP response. Also, note the use of the --data-binary flag. It’s really important because if you use -d instead, curl will very kindly strip out all of the newline characters from the text, as I eventually found out! Not what you want when each log entry sits on its own line.
Now that the log data is stored in Riak, you can query it. This is where MapReduce comes in. Riak’s MapReduce supports writing map and reduce functions in either JavaScript or Erlang. Unlike some other frameworks, such as Hadoop, Riak currently has no support for applying an optional combiner function after each map task. For example, when counting the number of words in a set of documents, each map task may produce lots of records of the type <“at”, 1>. Rather than sending all of these individual records over the network, it would be beneficial to merge the counts on each node before sending them to the reduce phase. In this scenario, however, it wouldn’t be difficult to add the merge logic to the map function itself.
I used JavaScript for my map and reduce functions. Queries are specified using JSON. The query for analysing the log(s) looks something like this:
{
  "inputs": [["logs", "2011-08-23"]],
  "query": [
    { "map": { "language": "javascript", "name": "LogAnalyzer.mapLogEntry" } },
    { "reduce": { "language": "javascript", "name": "LogAnalyzer.reduceLogEntry" } }
  ]
}
The inputs consist of an array of arrays; each entry specifies the bucket name and the corresponding key of the log we want to process. In this case the bucket name and key correspond to the log that was loaded previously. Only one map phase is defined but you can specify more than one if you want to. During the map phase the request URL is extracted from each line in the logs. The reduce phase takes the results from the map phase and sums the counts for each URL, returning the totals to the client. Both the map and reduce phases in this example refer to named functions rather than inline source. The source for the functions is here.
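To give a feel for what the two phases might look like, here is a minimal sketch. It is not the author's actual source: the regular expression, the assumption that each log line contains a quoted request line such as "GET /feed/ HTTP/1.1", and the shape of the returned data are all illustrative guesses. It relies on Riak passing the stored object to the map function with its content under values[0].data, and it merges counts inside the map function, which stands in for the missing combiner step.

```javascript
// Hypothetical stand-in for the LogAnalyzer object named in the query.
// The real functions linked from the article may differ.

// Helper (assumed name): add n to the running total for a URL.
function addCount(totals, url, n) {
  var current = Object.prototype.hasOwnProperty.call(totals, url) ? totals[url] : 0;
  totals[url] = current + n;
}

var LogAnalyzer = {
  // Map phase: called once per input object (one stored log).
  mapLogEntry: function (value, keyData, arg) {
    var lines = value.values[0].data.split("\n");
    // Assumption: the request line is quoted, e.g. "GET /feed/ HTTP/1.1".
    var requestPattern = /"[A-Z]+ (\S+)[^"]*"/;
    var counts = {};
    for (var i = 0; i < lines.length; i++) {
      var match = requestPattern.exec(lines[i]);
      if (match) {
        // Merge here so each URL is emitted once per log, not once per line.
        addCount(counts, match[1], 1);
      }
    }
    return [counts];
  },

  // Reduce phase: receives a list of URL-to-count objects and merges them.
  // It also accepts its own output as input, since a reduce function may be
  // applied to partial results more than once.
  reduceLogEntry: function (values, arg) {
    var totals = {};
    for (var i = 0; i < values.length; i++) {
      for (var url in values[i]) {
        if (Object.prototype.hasOwnProperty.call(values[i], url)) {
          addCount(totals, url, values[i][url]);
        }
      }
    }
    return [totals];
  }
};
```

Note that this sketch returns a single object holding every URL, so its output would be shaped a little differently from the author's, and the extracted URLs depend entirely on your log format.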
You can define anonymous JavaScript functions directly in your queries; there is an example of using an anonymous function halfway down this page. Unless your functions are trivial I recommend that you name your functions and have Riak load them when it starts up. To do that you will need to modify the configuration file (<node>/etc/app.config) for each node in your cluster. Open the config file, locate the variable “js_source_dir” and set it to wherever you have your JavaScript files, e.g. {js_source_dir, "/Users/simon/Projects/Riak/js"}. Make sure it’s uncommented. You will need to restart your nodes for the changes to take effect.
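For comparison, here is a rough sketch of how a query with an inline anonymous function could be put together. In place of "name", the phase carries a "source" field holding the function text. The function itself (listKeys, which just returns the key of each input object) is a made-up placeholder and has nothing to do with the log analysis.

```javascript
// Placeholder map function (illustrative only): emits the key of each input.
var listKeys = function (value) {
  return [value.key];
};

var query = {
  inputs: [["logs", "2011-08-23"]],
  query: [
    // "source" holds the function text inline instead of a "name".
    { map: { language: "javascript", source: listKeys.toString() } }
  ]
};

// Print the JSON so it can be saved to a file and submitted as a query.
console.log(JSON.stringify(query, null, 2));
```

This works for one-liners, but anything longer soon becomes awkward to maintain as a string inside JSON, which is why named functions are the better choice for real work.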
To run the query, save it to a file, open up a terminal and run the following command:
curl -v -H "Content-Type: application/json" \
http://localhost:8091/mapred -d @log-query.json
If all goes well, you should get back something like this:
[{"www.simonbuckle.com/feed/" : 19}, {"www.simonbuckle.com/2006/01/19/design-revamp-2" : 1}, ...]
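Once the totals are back on the client you will probably want to rank them. The following is a small sketch of one way to do that, assuming the response is a list of objects mapping URLs to counts, as in the sample above. The helper name topUrls is my own invention for illustration.

```javascript
// Hypothetical helper: flatten the result objects into (url, count) pairs
// and return the `limit` most requested URLs, highest count first.
function topUrls(results, limit) {
  var pairs = [];
  results.forEach(function (entry) {
    Object.keys(entry).forEach(function (url) {
      pairs.push({ url: url, count: entry[url] });
    });
  });
  pairs.sort(function (a, b) {
    return b.count - a.count;
  });
  return pairs.slice(0, limit);
}

// The two entries shown in the sample response above.
var sample = [
  { "www.simonbuckle.com/feed/": 19 },
  { "www.simonbuckle.com/2006/01/19/design-revamp-2": 1 }
];
console.log(topUrls(sample, 1));
```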
So that’s how to do some analysis on your Apache logs using Riak. All the code can be found on GitHub. There’s also an example of how to do a distributed word count that I didn’t cover here.
There is a lot more information about Riak on the Riak website. At some point it would be nice to be able to specify queries in languages other than JavaScript and Erlang. The map and reduce phases in this example are trivial but I can envisage a scenario where you might want to do some rather complex analysis during each phase so it would be nice to be able to use external libraries rather than having to write stuff from scratch each time.
Feel free to leave a comment if you have any questions.
I want to retrieve entries from a log file where the searched content is restricted. Alternatively, I need to access the log entries for a specific URL and show details such as IP, name and date.