2017-06-02

Example: How to Extract Parts of Lines With Regexp in AWK


Here I describe how to extract parts (elements) of the lines of a text file using AWK's regular expressions.

In this example I need to extract DB names and their DB-service names from a text file:
  • the source file is HostsDBs.lst
  • the relevant lines (containing the DB-name and DB-service) start with "::DB:"
  • the DB-name is followed by "[P]" or by "[S]"
  • the DB-services for a given DB are surrounded by "()"

Here is how the last five relevant lines are listed with AWK:
OS> awk '/^::DB:/{ print }' HostsDBs.lst | tail -5
::DB: host0033 EE DW1[P]
::DB: host0192 EE DB01T_SITE1[P](DB_SERVICE_01)
::DB: host0284 EE DB02T_SITE1[P](DB_SERVICE_02)
::DB: host0285 EE DB02P_SITE2[P](DB_SERVICE_02)
::DB: host0286 EE DB02P_SITE1[S]

In awk (here I am using the GNU variant of awk, gawk), elements are separated by whitespace by default.
So, below, the fourth element ($4) is passed to the AWK function "match()" (for example: DB01T_SITE1[P](DB_SERVICE_01)).
The second parameter of "match()", delimited by "//", is the regular expression.
It matches only elements that:
  • start with a sequence of one or more characters (specified by ".+"); these are non-whitespace characters, because an element never contains whitespace
  • followed by "[" (specified by "\[")
  • followed by a sequence of one or more characters (specified by ".+"), here "P]" or "S]"
  • followed by a sequence of one or more characters in parentheses (specified by "\(.+\)")
If the regular expression matches, the parts of the expression in non-escaped parentheses (specified by "(.+)") are written into the array "arr". This third parameter of "match()" is a gawk extension.
If the regular expression matches, "match()" returns true and the two elements of the array (arr[1] and arr[2]) are printed.

Here the lines where the DB-service is given (for example DB_SERVICE_01) are listed:
OS> gawk '/^::DB:/{ if(match($4, /(.+)\[.+\((.+)\)/, arr)){ print arr[1]" "arr[2]} }' HostsDBs.lst | tail -5
DB01T_SITE1 DB_SERVICE_01
DB02T_SITE1 DB_SERVICE_02
DB02P_SITE2 DB_SERVICE_02
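
The array parameter of "match()" used above works only in gawk. For other awk variants, the following is a possible sketch (not the method used above) based on the two-parameter "match()", which sets the built-in variables RSTART and RLENGTH. The function name extract_db_services is hypothetical, and the sample lines are copied from the listing at the top.

```shell
# Hypothetical helper: print "DB-name DB-service" for each "::DB:" line with a service.
extract_db_services() {
    awk '/^::DB:/ {
        # Two-parameter match() sets RSTART and RLENGTH for the "(...)" part.
        if (match($4, /\([^)]+\)/)) {
            service = substr($4, RSTART + 1, RLENGTH - 2)   # drop "(" and ")"
            name = $4
            sub(/\[.*/, "", name)                           # cut from the first "[" onwards
            print name, service
        }
    }' "$@"
}

# Sample input; the function reads stdin when no file name is given.
extract_db_services <<'EOF'
::DB: host0033 EE DW1[P]
::DB: host0192 EE DB01T_SITE1[P](DB_SERVICE_01)
::DB: host0286 EE DB02P_SITE1[S]
EOF
```

Of the three sample lines only the second one carries a DB-service, so the call prints "DB01T_SITE1 DB_SERVICE_01". A file name can be passed instead of stdin, for example: extract_db_services HostsDBs.lst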