This post is not about a solution to a generic issue. This post is about how I found some way to cut and assemble a large log file using a few Bash tools (and subliminally why Unix principle of implementing stuff may be the superior way).
So today, there came a need to sort through a large access log file for a list of request counts for each key. It’s not just a key — request count relationship. To elaborate briefly,
In WSO2 API Manager, authentication happens through OAuth2. This involves a Consumer Key for a particular application and user, and a pair of access, refresh tokens to be used in actual API invocation. A particular Consumer Key can be used to generate multiple Access Tokensand Refresh Tokens. The requirement was to find how many requests were done using Access Tokens mapped to each Consumer Key. It’s only the Access Token that gets logged into the access log, so getting to this list of numbers required some kind of scripting.
I start with one-liners.
First I need to get the actual list of requests that were made with an Access Token.
grep "Bearer " access-log-with-thousands-of-lines.log
This gives me a search result with too many information. I need only the Access Tokens used. In other words, I need only the next 37 characters (36 being the number of characters in an Access Token) after a match for Bearer
.
grep -o -P "Bearer.{0,37}" access-log-with-thousands-of-lines.log
Now it gives me a manageable list of search results.
Bearer fgjk7d29-b6b8-3024-8a76-5639ddlcwslo
Bearer sdlof320-c797-3c50-80b1-065nslgnxcld
Bearer worh366f-e133-3093-9c68-0495ndlgskdl
Bearer pyof8a2c-b534-37eb-9afc-0349ruaslkfw
Bearer "-" "Mozilla/5.0"
Bearer 2bfeslgo-056a-3e60-92be-34erlkgjarf3
Bearer ertlkjix-056a-3e60-92be-lkdsjf023498
Bearer lsn24fx0-8dd1-385e-8c54-09wpoisajf09
Bearer 03jdl59f-ded7-33a7-ba6d-q2308wafhj00
Bearer "-" "Mozilla/5.0"
Bearer "-" "Mozilla/5.0"
Bearer mbn05ns9-f97d-3401-881e-3940859ajaf2
I need to remove the Bearer
part and get a unique list of access tokens (because from another discussion I heard that a new Access Token was generated for each request).
grep -o -P "Bearer.{0,37}" access-log-with-thousands-of-lines.log | sed 's/^Bearer //' | sort | uniq
This gives me a further cleaned up list of access tokens with an additional line that I don’t want to count in, "-" "Mozilla/5.0"
. I need to remove this last line as well.
fgjk7d29-b6b8-3024-8a76-5639ddlcwslo
sdlof320-c797-3c50-80b1-065nslgnxcld
worh366f-e133-3093-9c68-0495ndlgskdl
pyof8a2c-b534-37eb-9afc-0349ruaslkfw
2bfeslgo-056a-3e60-92be-34erlkgjarf3
ertlkjix-056a-3e60-92be-lkdsjf023498
lsn24fx0-8dd1-385e-8c54-09wpoisajf09
03jdl59f-ded7-33a7-ba6d-q2308wafhj00
mbn05ns9-f97d-3401-881e-3940859ajaf2
"-" "Mozilla/5.0"
> grep -o -P "Bearer.{0,37}" access-log-with-thousands-of-lines.log | sed 's/^Bearer //' | sort | uniq | sed '$ d' > tokens.txt
> cat tokens.txt
fgjk7d29-b6b8-3024-8a76-5639ddlcwslo
sdlof320-c797-3c50-80b1-065nslgnxcld
worh366f-e133-3093-9c68-0495ndlgskdl
pyof8a2c-b534-37eb-9afc-0349ruaslkfw
2bfeslgo-056a-3e60-92be-34erlkgjarf3
ertlkjix-056a-3e60-92be-lkdsjf023498
lsn24fx0-8dd1-385e-8c54-09wpoisajf09
03jdl59f-ded7-33a7-ba6d-q2308wafhj00
mbn05ns9-f97d-3401-881e-3940859ajaf2
So now I have a cleared up list of access tokens. I need to find the Consumer Key corresponding to each of the access token next. For this, I need to get into the database. The select
statement needed for this is something similar to the following.
select consumer_key,username,app_name,access_token from my_table where access_token='fgjk7d29-b6b8-3024-8a76-5639ddlcwslo' OR access_token='sdlof320-c797-3c50-80b1-065nslgnxcld' OR access_token='worh366f-e133-3093-9c68-0495ndlgskdl' OR access_token='pyof8a2c-b534-37eb-9afc-0349ruaslkfw' OR access_token='2bfeslgo-056a-3e60-92be-34erlkgjarf3' OR access_token='ertlkjix-056a-3e60-92be-lkdsjf023498' OR access_token='lsn24fx0-8dd1-385e-8c54-09wpoisajf09' OR access_token='03jdl59f-ded7-33a7-ba6d-q2308wafhj00' OR access_token='mbn05ns9-f97d-3401-881e-3940859ajaf2';
To come up with that with the list of access tokens, I make the following move.
stmt="select consumer_key,username,app_name,access_token from my_table where ";while read token; do
stmt+="access_token='$token' OR "
done < tokens.txt; echo "${stmt:0:-4};";
Since that Bashfu could be a little bit hard to see to a lay person, let me explain what happens.
- Write the first part of the non-repetitive statement
stmt="select consumer_key,username,app_name,access_token from my_table where ";
- Loop through the access tokens and write the repetitive part
while read token; do
stmt+="access_token='$token' OR "
done < tokens.txt;
- Get the final statement but without the last
OR
removed and a semi-colon added
echo "${stmt:0:-4};";
With the help of a friend (both because I don’t have access to the database server and because I only learn SQL absolutely when I need to — latter being the strongest reason) I get the consumer key, username, application name, and the access token in the following format, in a file called db_output.txt
.
consumer_key|username|app_name|access_key
Now what I need to do is to go through each consumer key, get all the access keys mapped to it, and sum up the number of requests done with each access key. For this, I’m going to have to use some advanced Bashfu, which forces me to a scripting position.
First I need to sort the retrieved list of consumer keys in db_output.txt
file, by the contents of the first column.
cat db_output.txt | sort -k1 > db_output_sorted.txt
Following is the ad-hoc script I come up with to perform the advanced bashfu.
#!/bin/bash
prev_consumer_key=""
prev_access_token=""
prev_username=""
prev_appname=""
total_count=0
# start looping
while read line; do
# cut up the line by separator `|`
consumer_key=$(echo $line | cut -d '|' -f1)
username=$(echo $line | cut -d '|' -f2)
appname=$(echo $line | cut -d '|' -f3)
access_token=$(echo $line | cut -d '|' -f4)
# search for access token in the log file and get the line count for the results
grep_count=$(grep $access_token access-log-with-thousands-of-lines.log | wc -l)
# first loop
if [[ $prev_consumer_key == "" ]]; then
prev_consumer_key="$consumer_key"
prev_appname="$appname"
prev_username="$username"
prev_access_token="$access_token"
fi
if [[ $prev_consumer_key == $consumer_key ]]; then
# Add up to the request count found for earlier access token
total_count=$(( total_count + grep_count ))
else [[ $prev_consumer_key != $consumer_key ]];
# Display the total for the previous consumer key
echo -e "$prev_consumer_key\t$prev_username\t$prev_appname\t$total_count"
# Start counting for the next one
total_count="$grep_count"
prev_consumer_key="$consumer_key"
prev_appname="$appname"
prev_username="$username"
prev_access_token="$access_token"
fi
done < db_output_sorted.txt
# show the final line too
echo -e "$prev_consumer_key\t$prev_username\t$prev_appname\t$total_count"
Pretty self-explanatory!!
With this script I get the number of requests made with access tokens mapped to each consumer key.
> ./filter_log.sh
consumer_key100 username100 appname100 1
consumer_key335 username335 appname335 1
consumer_key202 username202 appname202 13
consumer_key410 username410 appname410 1
consumer_key099 username099 appname099 45
consumer_key001 username001 appname001 18
But I still need it sorted by the numerical request count , with the most consumer key with the most request count on the top.
> ./filter_log.sh | sort -k4 -nr
consumer_key099 username099 appname099 45
consumer_key001 username001 appname001 18
consumer_key202 username202 appname202 13
consumer_key410 username410 appname410 1
consumer_key335 username335 appname335 1
consumer_key100 username100 appname100 1
Flawless Victory!!!
The path taken here isn’t the only way to achieve the same result. The following much simpler script takes a different approach using
awk
to sum up the request counts, and it turns out to be a few seconds faster than the first script (even with the additional IO and nested looping)!!
#!/bin/bash
# loop, read last line for the input values
while read line; do
# read values from the line delimitted by `|`
consumer_key=$(echo $line | cut -d '|' -f1)
username=$(echo $line | cut -d '|' -f2)
appname=$(echo $line | cut -d '|' -f3)
# get all the access tokens for the consumer key
grep "$consumer_key" db_output_sorted.txt | cut -d '|' -f4 > tmp_access_tokens.txt
# collect all the counts for access keys found and sum them together using awk
while read line; do
grep $line access-log-with-thousands-of-lines.log | wc -l
done < tmp_access_tokens.txt | awk '{ SUM += $1 } END { print SUM }' > tmp_total_count.txt
total_count=$(cat tmp_total_count.txt)
# remove temp files
rm -rf tmp_access_tokens.txt tmp_total_count.txt
# output!
echo -e "$consumer_key\t$username\t$appname\t$total_count"
done < <(cat db_output_sorted.txt | cut -d '|' -f1,2,3 | sort | uniq)
# input the list of unique consumer keys, their usernames, and app names
There you go! Not much of a help in terms of copy value, but I hope it helps you get an idea on how to bash (tehee!) your way out of a complex situation.
Written on May 1, 2018 by chamila de alwis.
Originally published on Medium