Lynx. Web data extraction

Besides browsing and displaying web pages, Lynx can dump the formatted text of a web document, or its HTML source, to standard output. That output can then be processed with tools available on Linux, such as gawk, Perl, sed and grep.

External links

Counting the number of external links

Lynx sends the list of links found in a local web page named "elinks.htm" to standard output. The first grep keeps only the part of each line that starts with "http:", which drops the list numbering. The second grep removes every line containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", leaving only the external links of the page. Finally, wc counts the remaining links and displays the number:


lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l
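The pipeline above matches only "http:" links, so "https:" links are not counted, and the unescaped dots in the site patterns match any character. A minimal sketch of a stricter filter follows, assuming the site's own host is compmiscellanea.com with an optional "www." prefix. The function name external_links and the sample URLs are hypothetical and serve only as an illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `lynx -dump -listonly` on stdin
# and prints only the links that do not point to compmiscellanea.com.
external_links() {
    # Keep the URL part of each line, for both http and https links.
    grep -E -o 'https?://.*' |
    # Drop links whose host is compmiscellanea.com or www.compmiscellanea.com.
    grep -E -v '^https?://(www\.)?compmiscellanea\.com(/|$)'
}

# Sample input imitating the numbered list printed by lynx (made-up URLs).
sample='   1. http://compmiscellanea.com/en/lynx.htm
   2. https://www.compmiscellanea.com/en/
   3. http://example.org/page.htm
   4. https://example.net/'

# Count the external links in the sample.
printf '%s\n' "$sample" | external_links | wc -l

# With lynx installed, the same helper could be used as:
#   lynx -dump -listonly "elinks.htm" | external_links | wc -l
```

In the sample, the first two links belong to the site itself, so only the last two are counted.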

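Finding external links and saving them to a file

The same pipeline as above, except that the extracted links are not counted but saved to a file named "file.txt":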

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Finding external links, omitting duplicate entries and saving the resulting output to a file

Lynx sends the list of links found in a local web page named "elinks.htm" to standard output. The first grep keeps only the part of each line that starts with "http:". The second grep removes every line containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", leaving only the external links of the page. sort then orders the links and uniq deletes the duplicate entries. The output is saved to a file named "file.txt":


lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt
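The sort step matters because uniq compares only adjacent lines, so duplicates that are not next to each other survive without it. A small sketch with a made-up list of links (the URLs are placeholders, not taken from any real page) shows the difference:

```shell
#!/bin/sh
# Made-up list of links in which the duplicate entries are not adjacent.
links='http://example.org/b.htm
http://example.org/a.htm
http://example.org/b.htm'

# uniq alone compares only neighbouring lines, so nothing is removed here.
without_sort=$(printf '%s\n' "$links" | uniq | wc -l | tr -d ' ')

# Sorting first brings the duplicates together, so uniq can drop one of them.
with_sort=$(printf '%s\n' "$links" | sort | uniq | wc -l | tr -d ' ')

echo "uniq only: $without_sort lines, sort | uniq: $with_sort lines"
```

The shorter form sort -u gives the same result as sort | uniq in this pipeline.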

Internal links

Counting the number of internal links

Lynx sends the list of links found in a local web page named "elinks.htm" to standard output. Grep extracts only the links that start with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the internal links), and wc counts them and displays the number:


lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Finding internal links and saving them to a file

Lynx sends the list of links found in a local web page named "elinks.htm" to standard output. Grep extracts only the links that start with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the internal links), and the result is saved to a file named "file.txt":


lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

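Finding internal links, omitting duplicate entries and saving the resulting output to a file

Lynx sends the list of links found in a local web page named "elinks.htm" to standard output. Grep extracts only the links that start with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the internal links), sort orders them and uniq deletes the duplicate entries. The output is saved to a file named "file.txt":

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt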
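The internal-link patterns above are not anchored, so an external link that merely contains the site address (in a query string, for example) would also be treated as internal. A hedged sketch of an anchored variant follows. The function name internal_links and the sample URLs are hypothetical, and "https:" links are accepted as an extra assumption:

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `lynx -dump -listonly` on stdin
# and prints the unique links whose host is compmiscellanea.com.
internal_links() {
    # Keep the URL part of each line, for both http and https links.
    grep -E -o 'https?://.*' |
    # Keep only URLs that begin with the site's own host name.
    grep -E '^https?://(www\.)?compmiscellanea\.com(/|$)' |
    # Sort and delete duplicate entries in one step.
    sort -u
}

# Sample input imitating the numbered list printed by lynx (made-up URLs).
# The last link is external but carries the site address in its query string.
sample='   1. http://compmiscellanea.com/en/lynx.htm
   2. http://www.compmiscellanea.com/en/
   3. http://compmiscellanea.com/en/lynx.htm
   4. http://example.org/go?to=http://compmiscellanea.com/'

printf '%s\n' "$sample" | internal_links

# With lynx installed, the same helper could be used as:
#   lynx -dump -listonly "elinks.htm" | internal_links > file.txt
```

For the sample, the repeated link is listed once and the external redirect link is left out.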

The reason for using lynx -dump -listonly rather than plain lynx -dump is that a web page may contain plain text strings that look like links (containing "http://", for instance), as is the case with the kompx.com/en/elinks.htm page. Lynx -dump outputs the formatted text of the page, where real links and link-like plain text strings look just the same, so grep cannot tell one from the other. Lynx -dump -listonly outputs only the list of links, so there is no confusion with plain text strings that look like links.
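The effect can be sketched without running lynx at all. The two strings below are rough, hand-written imitations of what lynx -dump and lynx -dump -listonly print for the same page. They are assumptions for illustration only, and the exact layout varies between Lynx versions:

```shell
#!/bin/sh
# Rough imitation of `lynx -dump` output: page text followed by the link list.
# The text mentions a URL that is not a real link on the page.
full_dump='   Type http://example.org/ into the address bar, or see the [1]manual.

References

   1. http://example.com/manual.htm'

# Rough imitation of `lynx -dump -listonly` output for the same page.
list_only='References

   1. http://example.com/manual.htm'

# In the full dump grep cannot tell the plain-text URL from the real link,
# so both lines are counted.
printf '%s\n' "$full_dump" | grep -c 'http:'

# With the link list alone, only the real link is counted.
printf '%s\n' "$list_only" | grep -c 'http:'
```

The first count is 2 and the second is 1, although the imitated page has a single real link.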
