AndBlu's Notes: 2017-06

2017-06-02

Example: How to Extract Parts of Lines With Regexp in AWK

This note describes how to extract parts (elements) of the lines of a text file using regular expressions in AWK.

In this example I need to extract DB names and their DB-service names from a text file.

Here is how the last five relevant lines are listed with AWK:

OS> awk '/^::DB:/{ print }' HostsDBs.lst | tail -5

::DB: host0033 EE DW1[P]

::DB: host0192 EE DB01T_SITE1[P](DB_SERVICE_01)

::DB: host0284 EE DB02T_SITE1[P](DB_SERVICE_02)

::DB: host0285 EE DB02P_SITE2[P](DB_SERVICE_02)

::DB: host0286 EE DB02P_SITE1[S]

In awk (here I am using the GNU variant of awk, gawk) elements (fields) are separated by whitespace by default.
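
To illustrate the default splitting, here is a minimal sketch that feeds one of the sample lines above to awk and prints the number of elements (NF) together with the fourth element. The variable names "line" and "fourth" are only chosen for this illustration.

```shell
# One sample line from the listing above.
line='::DB: host0192 EE DB01T_SITE1[P](DB_SERVICE_01)'

# awk splits the line on whitespace: NF is the element count, $4 the fourth element.
fourth=$(printf '%s\n' "$line" | awk '{ print NF ": " $4 }')
echo "$fourth"
```

The line splits into four elements, so $4 holds the whole DB entry including the bracketed and parenthesized parts.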

So, below, the fourth element ($4), for example DB01T_SITE1[P](DB_SERVICE_01), is passed to the AWK function "match()".

The second parameter of "match()", delimited by slashes ("/"), is the regular expression.

It matches only elements that:

start with a sequence of one or more characters (specified by ".+"; within an element these are always non-whitespace characters)
followed by "[" (specified by "\[")
followed by another sequence of one or more characters (specified by ".+")
followed by a sequence of one or more characters in parentheses (specified by "\(.+\)")
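
As a quick check of these rules, the following sketch tests the same pattern (without the capturing parentheses) against three of the sample elements, using the "~" match operator. The variable name "matches" is only chosen for this illustration.

```shell
# Test three sample elements against the pattern described above.
matches=$(printf '%s\n' 'DW1[P]' 'DB01T_SITE1[P](DB_SERVICE_01)' 'DB02P_SITE1[S]' |
  awk '{
    # "~" is true when the regular expression matches somewhere in $1.
    if ($1 ~ /.+\[.+\(.+\)/) print $1 ": match"
    else                     print $1 ": no match"
  }')
echo "$matches"
```

Only the element that carries a DB-service name in parentheses should be reported as a match.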

If the regular expression matches, the parts matched by the sub-expressions in non-escaped parentheses (specified by "(.+)") are written into the array "arr". This third parameter of "match()" is a gawk extension.

If the regular expression matches, "match()" returns a non-zero value (true) and the two elements of the array (arr[1] and arr[2]) are printed.

Here the lines where the DB-service is given (for example DB_SERVICE_01) are listed:

OS> gawk '/^::DB:/{ if(match($4, /(.+)\[.+\((.+)\)/, arr)){ print arr[1]" "arr[2]} }' HostsDBs.lst | tail -5

DB01T_SITE1 DB_SERVICE_01

DB02T_SITE1 DB_SERVICE_02

DB02P_SITE2 DB_SERVICE_02
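
Where gawk is not available, a similar result can probably be obtained with the standard awk functions "index()" and "substr()" instead of a regular expression with capture groups. The following is only a sketch of this swapped-in technique: the function name "extract_db_services" is hypothetical, and it assumes that every element contains at most one "[", one "(" and one ")", as in the sample data.

```shell
# Hypothetical helper: reads lines on stdin, prints "<DB name> <DB-service name>".
extract_db_services() {
  awk '/^::DB:/ {
    lb = index($4, "[")   # position of the first "[" (0 if absent)
    lp = index($4, "(")   # position of the first "(" (0 if absent)
    rp = index($4, ")")   # position of the first ")" (0 if absent)
    # Require a non-empty DB name, a non-empty part between "[" and "(",
    # and a non-empty service name, as the ".+" parts of the regexp do.
    if (lb > 1 && lp > lb + 1 && rp > lp + 1)
      print substr($4, 1, lb - 1), substr($4, lp + 1, rp - lp - 1)
  }'
}

# Sample input taken from the listing at the top of this note.
result=$(extract_db_services <<'EOF'
::DB: host0033 EE DW1[P]
::DB: host0192 EE DB01T_SITE1[P](DB_SERVICE_01)
::DB: host0286 EE DB02P_SITE1[S]
EOF
)
echo "$result"
```

Unlike the greedy regular expression, this variant always cuts at the first "[", "(" and ")", so it is an approximation and not an exact replacement for the gawk command.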