This post is not about a solution to a generic issue. This post is about how I found some way to cut and assemble a large log file using a few Bash tools (and subliminally why Unix principle of implementing stuff may be the superior way).

So today, there came a need to sort through a large access log file for a list of request counts for each key. It’s not just a key — request count relationship. To elaborate briefly,

In WSO2 API Manager, authentication happens through OAuth2. This involves a Consumer Key for a particular application and user, and a pair of access, refresh tokens to be used in actual API invocation. A particular Consumer Key can be used to generate multiple Access Tokensand Refresh Tokens. The requirement was to find how many requests were done using Access Tokens mapped to each Consumer Key. It’s only the Access Token that gets logged into the access log, so getting to this list of numbers required some kind of scripting.

I start with one-liners.

First I need to get the actual list of requests that were made with an Access Token.

grep "Bearer " access-log-with-thousands-of-lines.log

This gives me a search result with too many information. I need only the Access Tokens used. In other words, I need only the next 37 characters (36 being the number of characters in an Access Token) after a match for Bearer.

grep -o -P "Bearer.{0,37}" access-log-with-thousands-of-lines.log

Now it gives me a manageable list of search results.

Bearer fgjk7d29-b6b8-3024-8a76-5639ddlcwslo
Bearer sdlof320-c797-3c50-80b1-065nslgnxcld
Bearer worh366f-e133-3093-9c68-0495ndlgskdl
Bearer pyof8a2c-b534-37eb-9afc-0349ruaslkfw
Bearer "-" "Mozilla/5.0"
Bearer 2bfeslgo-056a-3e60-92be-34erlkgjarf3
Bearer ertlkjix-056a-3e60-92be-lkdsjf023498
Bearer lsn24fx0-8dd1-385e-8c54-09wpoisajf09
Bearer 03jdl59f-ded7-33a7-ba6d-q2308wafhj00
Bearer "-" "Mozilla/5.0"
Bearer "-" "Mozilla/5.0"
Bearer mbn05ns9-f97d-3401-881e-3940859ajaf2

I need to remove the Bearer part and get a unique list of access tokens (because from another discussion I heard that a new Access Token was generated for each request).

grep -o -P "Bearer.{0,37}" access-log-with-thousands-of-lines.log | sed 's/^Bearer //' | sort | uniq

This gives me a further cleaned up list of access tokens with an additional line that I don’t want to count in, "-" "Mozilla/5.0". I need to remove this last line as well.

fgjk7d29-b6b8-3024-8a76-5639ddlcwslo
sdlof320-c797-3c50-80b1-065nslgnxcld
worh366f-e133-3093-9c68-0495ndlgskdl
pyof8a2c-b534-37eb-9afc-0349ruaslkfw
2bfeslgo-056a-3e60-92be-34erlkgjarf3
ertlkjix-056a-3e60-92be-lkdsjf023498
lsn24fx0-8dd1-385e-8c54-09wpoisajf09
03jdl59f-ded7-33a7-ba6d-q2308wafhj00
mbn05ns9-f97d-3401-881e-3940859ajaf2
"-" "Mozilla/5.0"

> grep -o -P "Bearer.{0,37}" access-log-with-thousands-of-lines.log | sed 's/^Bearer //' | sort | uniq | sed '$ d' > tokens.txt
> cat tokens.txt
fgjk7d29-b6b8-3024-8a76-5639ddlcwslo
sdlof320-c797-3c50-80b1-065nslgnxcld
worh366f-e133-3093-9c68-0495ndlgskdl
pyof8a2c-b534-37eb-9afc-0349ruaslkfw
2bfeslgo-056a-3e60-92be-34erlkgjarf3
ertlkjix-056a-3e60-92be-lkdsjf023498
lsn24fx0-8dd1-385e-8c54-09wpoisajf09
03jdl59f-ded7-33a7-ba6d-q2308wafhj00
mbn05ns9-f97d-3401-881e-3940859ajaf2

So now I have a cleared up list of access tokens. I need to find the Consumer Key corresponding to each of the access token next. For this, I need to get into the database. The select statement needed for this is something similar to the following.

select consumer_key,username,app_name,access_token from my_table where access_token='fgjk7d29-b6b8-3024-8a76-5639ddlcwslo' OR access_token='sdlof320-c797-3c50-80b1-065nslgnxcld' OR access_token='worh366f-e133-3093-9c68-0495ndlgskdl' OR access_token='pyof8a2c-b534-37eb-9afc-0349ruaslkfw' OR access_token='2bfeslgo-056a-3e60-92be-34erlkgjarf3' OR access_token='ertlkjix-056a-3e60-92be-lkdsjf023498' OR access_token='lsn24fx0-8dd1-385e-8c54-09wpoisajf09' OR access_token='03jdl59f-ded7-33a7-ba6d-q2308wafhj00' OR access_token='mbn05ns9-f97d-3401-881e-3940859ajaf2';

To come up with that with the list of access tokens, I make the following move.

stmt="select consumer_key,username,app_name,access_token from my_table where ";while read token; do
stmt+="access_token='$token' OR "
done < tokens.txt; echo "${stmt:0:-4};";

Since that Bashfu could be a little bit hard to see to a lay person, let me explain what happens.

  • Write the first part of the non-repetitive statement
stmt="select consumer_key,username,app_name,access_token from my_table where ";
  • Loop through the access tokens and write the repetitive part
while read token; do
stmt+="access_token='$token' OR "
done < tokens.txt;
  • Get the final statement but without the last OR removed and a semi-colon added
echo "${stmt:0:-4};";

With the help of a friend (both because I don’t have access to the database server and because I only learn SQL absolutely when I need to — latter being the strongest reason) I get the consumer key, username, application name, and the access token in the following format, in a file called db_output.txt .

consumer_key|username|app_name|access_key

Now what I need to do is to go through each consumer key, get all the access keys mapped to it, and sum up the number of requests done with each access key. For this, I’m going to have to use some advanced Bashfu, which forces me to a scripting position.

First I need to sort the retrieved list of consumer keys in db_output.txt file, by the contents of the first column.

cat db_output.txt | sort -k1 > db_output_sorted.txt

Following is the ad-hoc script I come up with to perform the advanced bashfu.

#!/bin/bash

prev_consumer_key=""
prev_access_token=""
prev_username=""
prev_appname=""

total_count=0

# start looping
while read line; do
  # cut up the line by separator `|`
  consumer_key=$(echo $line | cut -d '|' -f1)
  username=$(echo $line | cut -d '|' -f2)
  appname=$(echo $line | cut -d '|' -f3)
  access_token=$(echo $line | cut -d '|' -f4)

  # search for access token in the log file and get the line count for the results
  grep_count=$(grep $access_token access-log-with-thousands-of-lines.log | wc -l)

  # first loop
  if [[ $prev_consumer_key == "" ]]; then
    prev_consumer_key="$consumer_key"
    prev_appname="$appname"
    prev_username="$username"
    prev_access_token="$access_token"
  fi

  if [[ $prev_consumer_key == $consumer_key ]]; then
    # Add up to the request count found for earlier access token
    total_count=$(( total_count + grep_count ))
  else [[ $prev_consumer_key != $consumer_key ]];
    # Display the total for the previous consumer key
    echo -e "$prev_consumer_key\t$prev_username\t$prev_appname\t$total_count"

    # Start counting for the next one
    total_count="$grep_count"
    prev_consumer_key="$consumer_key"
    prev_appname="$appname"
    prev_username="$username"
    prev_access_token="$access_token"
  fi
done < db_output_sorted.txt

# show the final line too
echo -e "$prev_consumer_key\t$prev_username\t$prev_appname\t$total_count"

Pretty self-explanatory!!

With this script I get the number of requests made with access tokens mapped to each consumer key.

> ./filter_log.sh 
consumer_key100   username100   appname100   1
consumer_key335   username335   appname335   1
consumer_key202   username202   appname202   13
consumer_key410   username410   appname410   1
consumer_key099   username099   appname099   45
consumer_key001   username001   appname001   18

But I still need it sorted by the numerical request count , with the most consumer key with the most request count on the top.

> ./filter_log.sh | sort -k4 -nr
consumer_key099   username099   appname099   45
consumer_key001   username001   appname001   18
consumer_key202   username202   appname202   13
consumer_key410   username410   appname410   1
consumer_key335   username335   appname335   1
consumer_key100   username100   appname100   1

Flawless Victory!!!

The path taken here isn’t the only way to achieve the same result. The following much simpler script takes a different approach using awk to sum up the request counts, and it turns out to be a few seconds faster than the first script (even with the additional IO and nested looping)!!

#!/bin/bash

# loop, read last line for the input values
while read line; do
  # read values from the line delimitted by `|`
  consumer_key=$(echo $line | cut -d '|' -f1)
  username=$(echo $line | cut -d '|' -f2)
  appname=$(echo $line | cut -d '|' -f3)

  # get all the access tokens for the consumer key
  grep "$consumer_key" db_output_sorted.txt  | cut -d '|' -f4 > tmp_access_tokens.txt

  # collect all the counts for access keys found and sum them together using awk
  while read line; do
    grep $line access-log-with-thousands-of-lines.log | wc -l
  done < tmp_access_tokens.txt | awk '{ SUM += $1 } END { print SUM }' > tmp_total_count.txt

  total_count=$(cat tmp_total_count.txt)

  # remove temp files
  rm -rf tmp_access_tokens.txt tmp_total_count.txt

  # output!
  echo -e "$consumer_key\t$username\t$appname\t$total_count"
done < <(cat db_output_sorted.txt | cut -d '|' -f1,2,3 | sort | uniq)
# input the list of unique consumer keys, their usernames, and app names

There you go! Not much of a help in terms of copy value, but I hope it helps you get an idea on how to bash (tehee!) your way out of a complex situation.


Written on May 1, 2018 by chamila de alwis.

Originally published on Medium