We have seen in a previous awk introduction article that awk can be an effective tool for everything from small one-liners up through some interesting applications. There are certainly more complex languages at our disposal if a situation calls for it; perl and python come to mind. Applications requiring networking support, database access, user interfaces, binary data or more extensive library support and complexity are best left to other languages with better support in these areas.
Nevertheless, awk is a superb language for testing algorithms and applications with some complexity, especially where the problem can be broken into chunks which can streamed as part of a pipe. It’s an ideal tool for augmenting the features of shell programming as it is ubiquitous; found in some form on almost all Unix/Linux/BSD systems. Many problems dealing with text, log lines or symbol tables are handily solved or at the very least prototyped with awk along with the other tools found on Unix/Linux systems.
While awk lends itself well to operating on input a line at a time, processing and then sending some output for each line, it may also be used in a more traditional batching-style application where it reads all the input, processes and then sends the processed output onward.
Yet Another Sudoku Puzzle Solver – YASPS for Awk
The application I chose to use as an example is “yet another sudoku puzzle solver”. I must confess at the outset that I have never sat down to solve one of these puzzles myself, but sketched this out over a few days while commuting on a train and watching other people work on them. It was far more fun I think than actually doing any of the puzzles..
Download YASPS Program for Awk: solve.awk
The input format I’ve chosen is one which is easy to parse in awk and is fairly traditional in the Unix environment. Blank lines and those starting with a hash (#) character are ignored making it easy to insert comments. Extra spaces may be used between columns for readability. An example is shown in the following figure:
------------------------------------------------------------------------------- # I forget where I got this.. # this is one of the hardest ones I've found for this algorithm, although # after transposing the matrix it can be solved in a fraction of the time 2 0 0 6 7 0 0 0 0 0 0 6 0 0 0 2 0 1 4 0 0 0 0 0 8 0 0 5 0 0 0 0 9 3 0 0 0 3 0 0 0 0 0 5 0 0 0 2 8 0 0 0 0 7 0 0 1 0 0 0 0 0 4 7 0 8 0 0 0 6 0 0 0 0 0 0 5 3 0 0 8 -------------------------------------------------------------------------------
There is almost no error checking in this program, but it would be easy to add in front or as part of a wrapper script. I’ll leave that as an exercise for the reader.
This program uses a very simple depth-first recursive backtracking algorithm with up-front and ongoing elimination of invalid entries. Awk may not have the expressive power for representing complex data that perl or other languages have, but with care, many moderate sized problems and data sets can be used. This algorithm may not be the best one around, but it is certainly fast enough for most problems and is easy to implement.
With any problem, representing the data effectively makes the task of designing a program much easier. I have represented the complete state of the puzzle in a matrix called “master”. This is hardly used for much except keeping a record of what is where and for doing the final output.
The main workhorse variables are three other arrays. Intuitively we know from the recursive trial and error method we’ve chosen that we will need to check the validity of trial numbers quite often. To accelerate that process we maintain associative arrays for storing the current state for the rows, columns and each region (which, although not technically correct, we’ll call a “quadrant”). These are the arrays R, C and Q. (Note that awk is case sensitive.)
Sometimes it helps when trying to factor out common computations from nested for-loops or recursive function calls to pre-compute values which are used often. I had tried this with the “regmap” matrix which would store a quadrant number given the row, column values but I found the time savings in this case to be ambiguous. I’ve left it commented out, but your mileage may vary and the technique is often very useful.
Recursive algorithms are often best designed and therefore described in a top-down manner. The top function in this program is called “search()” and is called from the “END” pattern after the problem data has been read into the arrays.
At a high level, search() starts with the supplied row and column parameters and looks for the next empty space to check. If there aren’t any, the problem has been solved and it returns with the solution. If there is an empty space (represented by zero), it tests available numbers (with the inuse() function, for numbers not in use in that row, column or quadrant) by inserting a number into the arrays using mark() and calling itself recursively. If the recursive search() function returns a zero it means that it has failed, so the mark() function is again called to un-mark the trial number. It then loops around until the possibilities are exhausted or the search() call returns success.
The beauty of many recursive algorithms is the inherent elegance and simplicity of the solutions. While they are sometimes not the fastest algorithms, they are often “fast enough” and easier to design. This program solves most puzzles in less than a few seconds. One thing I noticed while trying this program on different puzzles is that if a problem took a longer period of time to solve (in tens of seconds), simply transposing the matrix would often give the same solution in a fraction of a second. With today’s multi-core CPUs, this suggests one possibility for speeding it up: Write a wrapper script which starts several instances of the program with different transpositions of the matrix. An example is shown with the previous puzzle shown above and the transposed version in the following figure where the transposed problem was solved four times faster.
------------------------------------------------------------------------------- marion:~/sudoku$ time awk -f solve.awk THE-HARDEST-SO-FAR.dat # Searches=134214 2 8 3 6 7 1 9 4 5 9 7 6 5 4 8 2 3 1 4 1 5 3 9 2 8 7 6 5 6 7 4 1 9 3 8 2 8 3 4 2 6 7 1 5 9 1 9 2 8 3 5 4 6 7 3 2 1 7 8 6 5 9 4 7 5 8 9 2 4 6 1 3 6 4 9 1 5 3 7 2 8 real 0m10.009s user 0m9.889s sys 0m0.024s marion:~/sudoku$ time awk -f solve.awk /tmp/transposed.dat # Searches=32253 8 3 4 7 9 2 6 1 5 2 1 9 6 5 8 7 3 4 7 6 5 4 1 3 8 2 9 3 4 6 5 7 9 2 8 1 5 2 8 3 6 1 9 4 7 1 9 7 8 2 4 3 5 6 9 8 1 2 4 7 5 6 3 4 5 2 9 3 6 1 7 8 6 7 3 1 8 5 4 9 2 real 0m2.487s user 0m2.456s sys 0m0.008s -------------------------------------------------------------------------------
When something even faster is required, it can often be accomplished by translating the algorithm into another language with a more direct representation of the data sets. I did a translation of this program to C once with an interesting twist on the data indexing. This version probably executes a few orders of magnitude faster, largely due to the way that the data is represented. Probably we’ll release “Yet Another Sudoku Puzzle Solver Using C” as an another article later.
I believe that awk deserves a place in everyone’s toolkit. Its simplicity relative to other languages is perhaps seen as a weakness, but I see it as one of its strengths. The language can be learned in an afternoon and used without resorting to reference books for solving many day-to-day problems. I use it on a regular basis straight from the command line, right through to implementing things such as compilers for DSLs (Domain Specific Languages).
Recommended Awk Books
- The AWK Programming Language by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. The original book by the authors of the language have some excellent examples of moderately complex programs and is still my favorite book on awk. Published by Addison-Wesley, 1988. ISBN 020107981X.
- More Programming Pearls: Confessions of a Coder by Jon Bentley, AT&T Bell Labs, Murray Hill, NJ. Another book for using awk as a prototyping tool for algorithms may be found in More Pearls. Excellent reading. Year of Publication: 1988. ISBN:0201118890
Comments on this entry are closed.
Nice article!
thanks for this – it sparked my interest in learning more about awk. very well-written. keep ’em coming!
Thanks for the feedback! Appreciate the comments..
Cheers.
Awk is great for reports but I haven’t tried computing any algorithms with it! Interesting article.
Great article … it’s amazing what can be done with awk in short order. It’s refreshing to see an awk article that incorporates both the text manipulating power of awk and its ability to compute algorithms. Too many awk articles focus just on text file manipulation, nice job!
Thanks everyone for your comments!
I had put a teaser in the article suggesting a better/faster way to index the data in C. Anyone with some ideas on this?
Exchange line 100 of the original script from:
return 1
to this:
dump()
return 0
and change line 143 of the original script from:
dump()
to this
# dump()
and voila, you will get all solutions, not just the first one.
You can test it on this:
# multiple solutions
3 0 8 6 0 9 0 1 4
1 0 4 3 0 8 0 6 9
6 0 9 4 0 1 0 3 8
7 4 3 1 8 5 6 9 2
8 6 2 9 4 7 3 5 1
9 1 5 2 3 6 8 4 7
4 8 7 5 9 3 1 2 6
2 3 6 8 1 4 9 7 5
5 9 1 7 6 2 4 8 3
Have a nice evening
I cant believe the power of Awk… I was refused to change my grep for awk, but now… I’ll try awk!
Thanks for all your comments. There are some more interesting awk applications coming, so stay tuned!
Thank for the article, it is a nice way to spur awk interest.
Thanks Alfredo,
Mssion accomplished I guess; that’s what I was trying to do.. 😉
Stay tuned, as there are some follow-up articles coming.