Wednesday, February 03, 2010

Linux Newbie : Why grep almost never yields something productive

Getting Started Level 0

Every Linux newbie hears about the power of grep sooner or later. But no sooner does the newbie try the grep command than the experiment ends badly. There are a couple of reasons why grep almost never yields anything productive: the basic problem is a lack of knowledge of regular expressions, and the second is a lack of knowledge of grep's switches. This realization should not deter a new user from using grep. A very nice, detailed tutorial on regular expressions (regex) is available here. A migrating user who has used dir and similar commands in DOS, or the Windows search box, almost never expects (nor can fathom) the power provided by grep. Firstly, grep means "global regular expression print", so knowledge of regex is required to use grep effectively. Now what should a newbie do? Wait until their regex knowledge improves? NO, NOT AT ALL. That would scare the user away from ever using grep. Here is a first simple command that lists the files in the current directory.
grep -l ".*" *
REMEMBER that this lists only files (not directories), and only files that are not empty, because an empty file has no line for the pattern to match. This is not what grep is meant for; the command is given simply to build the user's confidence that "yes we can" use grep. The ls command is the usual way to list the contents of the current working directory; grep is for doing something more productive. Now let's do something ls cannot do and find all the files with the word "include" IN them. The following simple command does this.
grep -l "include" * 
To search all the subdirectories as well, add r to the options, which tells grep to search recursively.
grep -lr "include" *
This is all one actually needs to start using grep a little bit more effectively.
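
To try the Level 0 commands safely, here is a small sketch that builds a scratch directory and runs them there. The file names (main.c, notes.txt, empty.txt, sub/util.c) are made up for illustration, and mktemp -d is assumed to be available, as it is on most Linux systems.

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
printf '#include <stdio.h>\n' > main.c
printf 'nothing to see here\n' > notes.txt
: > empty.txt    # an empty file: it has no lines, so no pattern can match

# List every non-empty file (empty.txt is left out).
nonempty=$(grep -l ".*" *)
echo "$nonempty"

# List only the files whose contents mention "include".
top_level=$(grep -l "include" *)
echo "$top_level"

# Add a subdirectory, then repeat the search recursively.
mkdir sub
printf '#include "util.h"\n' > sub/util.c
recursive=$(grep -lr "include" *)
echo "$recursive"
```

The first search should print main.c and notes.txt, the second only main.c, and the recursive one should add sub/util.c to the list.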

Jumping to Level 1

Now, to be more productive, one has to start using regular expressions. A step-by-step approach works better than going off to a regular expressions tutorial first and coming back totally lost. The following can serve as the first command that uses a regular expression. The average user might not even sense any difference, because it is essentially the same command as above, but a slight modification makes a huge difference. To check for files that contain either trap or drap, put t and d in square brackets before rap; the brackets match exactly one character from the set inside them. The following command lists all the files that contain either trap or drap.
grep -l "[td]rap" * 
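
As a quick check of the bracket idea, the sketch below creates three made-up files (a.txt, b.txt, c.txt are illustrative names only) and runs the same search against them. Only the files containing trap or drap should be listed; the one with wrap should not.

```shell
# Scratch directory with three small test files.
demo=$(mktemp -d)
cd "$demo"
printf 'the mouse avoided the trap\n' > a.txt
printf 'drap is not a real word\n' > b.txt
printf 'wrap it up\n' > c.txt

# [td] matches exactly one character: either t or d.
matches=$(grep -l "[td]rap" *)
echo "$matches"
```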

Jumping to Level 2

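This simple addition has enhanced our power of using grep. Two more characters add to that power: "^", which anchors the pattern to the start of a line, and "$", which anchors it to the end of a line. Note that grep works line by line, so these anchors refer to lines, not words. The following command lists the files that contain a line starting with the letter a.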
grep -l "^a" *
And the following command lists the files that contain a line ending with the letter d.
grep -l "d$" *
In combination, the following command finds the files that contain a line starting with the letter a and ending with the letter d.
grep -l "^a.*d$" *
This is accomplished by adding ".*" between the two letters: "." matches any single character and "*" repeats it zero or more times, so anything may appear between the start and the end of the line.
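
Because "^" and "$" anchor to the start and end of a line rather than a word, it helps to see the two side by side. The sketch below uses two made-up files (line.txt and word.txt). The -w switch, which restricts matches to whole words, is an assumption here: it is supported by GNU and BSD grep but may be missing from other implementations.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'abandoned\n' > line.txt          # the whole line starts with a and ends with d
printf 'they said and ran\n' > word.txt  # contains the word "and", but the line starts with t

# Anchored search: only the file whose LINE starts with a and ends with d.
by_line=$(grep -l "^a.*d$" *)
echo "$by_line"

# Whole-word search (-w): any file containing a WORD that starts with a and ends with d.
by_word=$(grep -lw "a[a-z]*d" *)
echo "$by_word"
```

The anchored search should list only line.txt, while the whole-word search should list both files.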

Jumping to Level 3

Now back to the brackets discussed earlier. More characters can be added to the search simply by adding them inside the brackets. For example, to find lines starting with a, e, i, o or u, the following command can be used.
grep -l "^[aeiou]" *
This command lists the files that contain a line starting with a lowercase vowel, i.e., a, e, i, o or u. Brackets can also be combined with the anchors, as in the following command, where the files listed contain a line that starts with a or b and ends with d or s.
grep -l "^[ab].*[ds]$" *
The expression in brackets can be shortened by using ranges.
A-Z matches all the upper case letters
a-z matches all the lower case letters
0-9 matches all the digits

Using the above ranges, the following command lists files that contain a line starting with a letter and ending with a digit. Combining A-Za-z ensures that the starting letter can be either upper case or lower case.
grep -l "^[A-Za-z].*[0-9]$" *
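
Here is a small sketch of the range search in action. The three file names (id.txt, low.txt, no.txt) and their contents are invented for the demonstration.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'Room 101\n' > id.txt    # starts with an upper case letter, ends with a digit
printf 'version 2\n' > low.txt  # starts with a lower case letter, ends with a digit
printf '2 fast\n' > no.txt      # starts with a digit and ends with a letter

# A line must start with a letter and end with a digit to match.
ranged=$(grep -l "^[A-Za-z].*[0-9]$" *)
echo "$ranged"
```

Only id.txt and low.txt should be listed, since the line in no.txt fails both the start and the end condition.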
The above discussion is enough for a Linux newbie to appreciate the grep command and be comfortable with it at different levels. To attain more power with grep, consult the man / info pages of grep or any advanced grep or regex tutorial on the Internet.

References
  1. Linux Manual grep
  2. http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

22 comments:

What is Linux? said...

This post might be useful if it were in English. Please proofread your text! I started reading, discovered that the text was horrible, and started googling to find information elsewhere. That is a shame.

Anonymous said...

Quite obviously English is not the poster's first language and I thought his post was easily understood and very helpful. I consider your comment to be needlessly offensive and unhelpful

Zahid Irfan said...

Thank you very much for the valuable feedback. True English is not my first language but will try to improve in future. "What is Linux" had to go elsewhere for help which is truly shame and would act as real motivation to write better in future.

johnno said...

Zahid your English is close to perfect, and writing technical articles in a second language is never easy. "What is Linux" should be renamed "What a Jerk/NetCoward". Anyway, thanks for the post I am a long term Linux user (almost 10 years now!), but have never really used grep unless as a piped command to filter something out (e.g. $ps aux | grep process.). So thanks for this post, hope to see some more tips like this in future!

Jinha said...

Shame on "What is Linux?". I think you should thank the author for writing this article in a language you can read even though it is not his[her] native language.

Chris said...

Your English is fine - 'What is Linux?' is a child.

However, isn't grep really for searching file content rather than file names? I thought find was the tool for searching file names?

Cheers.

Anonymous said...

I agree with johnno. What is linux is obviously a windows fan boy and probably doesn't even know where to find the terminal app. I thought you did an excellent job presenting the basic fundamentals of grep. Your English is as good if not better than most native speakers of the language. Keep up the good work.

Anonymous said...

I noticed a couple of problems in your tutorial. I tried running
grep -l "" * on a Solaris system and got grep: RE error 41: No remembered search string.. It would have been more effective, IMO, if you used ".*" instead of "", but that is a bit more advanced. The second, bigger problem was when you searched by "^a.d$" and said it would match any word starting with a and ending with d. That is incorrect; it will match any LINE that starts with an a and ends with d, with one and only one character (any character) in between. Same with the examples in Level 3.

On the other hand, interesting way to list only files in the directory. I'm not sure if I would have thought of that one.

Anonymous said...

I am a Linux lover and thought this was very helpful. I have been using the ls command and I like using grep better now. Thank you very much for your efforts to help less experienced folks get comfortable with new commands as the fm "of RTFM" doesn't mean anything if you don't understand the terminology. Thanks again

Anonymous said...

Your approach to this is neatly done. However, as the poster two above points out, there are errors which you really should be prepared to correct, since some poor soul is going to be following this and not getting what he expected. Come on.. a few keystrokes to correct it is all that's required.

Zahid Irfan said...

Thanks for all the comments. Specially the one who corrected a couple of mistakes. I have corrected all the errors and now it can be utilized in a robust fashion. Thank you for your help.

Jose_X said...

I really appreciate the work done to create tutorials (text or video tutorials). Let me see if I can contribute some further ideas.

>> grep -l ".*" *

You might want to explain the details of this and of other command lines, each in a distinct side bar or through a distinct hyperlink. This would be for the benefit of those that have not used man pages before or find them difficult to understand.

Example of a sidebar explanation to the above:

This line has 4 parts.

The first is the grep command. The second, third, and fourth parts are the three arguments given to the grep command.

The first argument consists of a single flag value; this requires "-" immediately followed by the actual flag value "l". The details of how this flag affects the behavior of grep can be found in the manual page, eg, http://unixhelp.ed.ac.uk/CGI/man-cgi?grep . This "l" flag changes the behavior of grep, from printing lines that match the regular expression, to printing the name of any of the files given to grep which has a line that matches the regular expression.

The second argument is ".*". What grep sees is .* but we need to place this in quotes to prevent the command line interpreter (ie, the shell) from expanding .* into something else and then giving this expanded value to grep. .* tells grep to match any value (that is what "." means) any number of times including zero times (that is how "*" affects what precedes it). This is regular expression syntax. Note that grep will not register a match if a file is empty.

The third argument is *. We have a single asterisk with no quotations. This value is interpreted and expanded by the shell into the name of every file in the current directory. This list of file names is what is handed over to the grep command.

For each file name, grep will open the file and search the contents of that file against the regular expression above (.*). Grep will ignore directories, links to directories (and perhaps other things ??). Grep causes any empty files to register as a failed match; however, any file that is not empty will have each line of such file match .*. Thus, this grep command line will output the name of every normal nonempty file in the current directory.

Note that if we omitted the -l flag, then grep would list every single line of every normal nonempty file in the current directory.

It should also be noted that the syntax understood by grep is different than that understood by egrep. Egrep understands more popular regular expression syntax. For example, egrep is closer to perl and javascript, two languages with strong regular expression support. Meanwhile, grep is closer to sed and ed, two much less frequently used tools. Also, "grep -E" is the same as "egrep".

End sidebar explanation.

Jose_X said...

I noticed a typo that was still present when I read the tutorial:

"The following simply command does this simply."

The first "simply" should be "simple".

This might help commenters like "What is Linux?", who made the first comment above.

Jose_X said...

In the United States, copyright is automatic. This means, unless given permission, you can't really copy verbatim someone else's comment to reuse it, except to a limited extent ("fair use": example if the comment is very short or you quote only a portion of it).

Since I live in the United States, I will explicitly give the author of the blog permission to use anything I wrote within comments verbatim under the condition that this tutorial also is licensed at least somewhat liberally. CC by SA http://creativecommons.org/licenses/by-sa/3.0/ might be a good license because it would allow others to take this tutorial and change it (presumably to improve it), but only under the condition that you get some credit and that they also license their changes in a similar fashion to everyone else.

BTW, I am like a stranger because I am not stating who I am; thus, I don't know if the above permission/license means anything. I am also not a lawyer.

There are some other comments to this tutorial that might be of interest: http://www.linuxtoday.com/infrastructure/2010020402535OSSW

InLoveWIthHills said...

@johnno nothing serious but I find it hard to believe that you have used linux for 10 years and have never tried to use grep to find for a string in a bunch of files.. :)

Zahid Irfan said...

@the boy who never knew
What johnno probably meant was that apart from using grep for pipe or simplest pattern matching he did not use it much. This is the idea I wanted to highlight in the text.

Golodh said...

Funnily enough this is exactly why GUI's are so useful.

A proper GUI:
- insulates you from regular expression complexities
- offers you access to all the switches grep has, without you having to memorize them first (or look them up)
- lets you focus on getting things done instead of spending your time fighting the system

Just Google for "grep gui".

That ought to take care of 90% of your grep needs. For the remaining 10% you can either hunker down with a manual or ask around on a forum.

Alternatively, (if you're using MS Windows) install Total Commander. It's a Norton Commander clone with a very nice search function built-in.

Cheers

Anonymous said...

I actually know regex quite well but... I didn't know grep could use them ^.^'

I should have known, I guess, but I used grep before learning to work with regex...

Thanks for the article!

dutchkind said...

Great article! I use grep a lot to create scripts that are much faster than most GUI, but this article gives me ways to expand the use of grep. Thanks

Suramya Tomar said...

@Golodh, the problem with using a GUI to search is that it cannot be used within a script or automated to perform certain tasks if a set of requirements is met.

Hence the popularity of grep.

@zahid: Great article. Love the way you have explained it.

- Suramya

Anonymous said...

I read somewhere that grep was named after the ex command g/re/p (g for 'apply globally', / is a separator, 're' stands for any Regular Expression, and 'p' is the ex print command). I do not know which explanation of the grep name is really correct, though...

Anonymous said...

The g/re/p came from 'ed' not 'ex'.

'ex' is the line mode of 'vi' (visual editor).

'grep' happened because 'ed' was limited in how big the files could be and the Unix system at Bell Labs was supposed to help editing patent applications.

The large files were too big for 'ed' so they were run through 'split' to create files small enough for 'ed'.

Then each one of the files would be loaded into 'ed' and a g/re/p command would print matching lines, the matches were captured to another file and 'ed' was then moved on to the next file.

The files output by 'split' were removed after the last was examined because only the large file and the matching lines were needed later.

The person doing this was complaining about how tedious it was while on a break and was overheard by Ken Thompson.

When they came in to work on Monday users were greeted with a message that the system had a new command, 'grep', which would search for patterns in large files without invoking 'ed'.

What Ken did was extract the Regular Expression code from 'ed', put a line at a time reader in front of it and send the matching lines to standard out.