Kompx.com or Compmiscellanea.com

Lynx. Web data extraction

Operating systems : Linux

Besides displaying web pages, Lynx can dump either the formatted text of a web document or its HTML source to standard output. That output can then be processed with tools available on Linux, such as gawk, Perl, sed and grep. Some examples:

Dealing with external links

Count number of external links

Lynx sends the list of links found in a web page to standard output. The first grep keeps only the part of each line that starts at "http:", which strips the list numbering. The second grep drops every line containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the page remain. wc then counts the remaining lines and displays the number:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l
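To see what each stage of this pipeline does without fetching a page, the sketch below feeds a hand-written sample through the same grep stages. The sample imitates the kind of numbered list that "lynx -dump -listonly" prints. The sample URLs and the variable names are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical sample of "lynx -dump -listonly" output (made up for illustration).
sample_links='References

   1. http://www.compmiscellanea.com/en/lynx.htm
   2. http://compmiscellanea.com/en/elinks.htm
   3. http://www.example.net/
   4. http://www.example.org/page.htm'

# Stage 1: cut each line down to the part that starts at "http:".
# Stage 2: drop the links that point to the site itself.
# Stage 3: count what is left (tr removes padding that some wc versions add).
external_count=$(printf '%s\n' "$sample_links" \
  | grep -o "http:.*" \
  | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" \
  | wc -l | tr -d ' ')

echo "External links: $external_count"
```

With this sample, the two compmiscellanea.com entries are filtered out and the two remaining entries are counted.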

Find external links and save them to a file

Lynx sends the list of links found in a web page to standard output. The first grep keeps only the part of each line that starts at "http:". The second grep drops every line containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the page remain. The result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Find external links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. The first grep keeps only the part of each line that starts at "http:". The second grep drops every line containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the page remain. sort puts them in order and uniq deletes the duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt
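The sort step is needed because uniq only collapses identical lines that are next to each other. A minimal sketch of the difference, using a made-up list of links (the URLs and variable names are assumptions for illustration):

```shell
#!/bin/sh
# Hypothetical list of extracted links with a non-adjacent duplicate.
links='http://www.example.net/
http://www.example.org/page.htm
http://www.example.net/'

# uniq alone only collapses neighbouring identical lines, so all 3 lines survive.
without_sort=$(printf '%s\n' "$links" | uniq | wc -l | tr -d ' ')

# Sorting first puts identical lines next to each other, so uniq leaves 2.
with_sort=$(printf '%s\n' "$links" | sort | uniq | wc -l | tr -d ' ')

echo "uniq only: $without_sort, sort | uniq: $with_sort"
```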

Dealing with internal links

Count number of internal links

Lynx sends the list of links found in a web page to standard output. Grep keeps only the part of each line that starts at "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links). wc then counts the extracted links and displays the number:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Find internal links and save them to a file

Lynx sends the list of links found in a web page to standard output. Grep keeps only the part of each line that starts at "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links). The result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

Find internal links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. Grep keeps only the part of each line that starts at "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links). sort puts them in order and uniq deletes the duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt
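In the patterns above an unescaped dot matches any character, and a host name that merely begins with "compmiscellanea.com" would also be accepted. A stricter variant is sketched below as one possible alternative, not as the method used on this page. It escapes the dots, allows "https" and an optional "www.", and requires a "/" or the end of the line right after the host name. The helper name internal_links and the sample URLs are assumptions.

```shell
#!/bin/sh
# Hypothetical sample of "lynx -dump -listonly" output (made up for illustration).
sample_links='References

   1. http://www.compmiscellanea.com/en/lynx.htm
   2. https://compmiscellanea.com/en/elinks.htm
   3. http://compmiscellanea.com.example.net/
   4. http://www.compmiscellanea.com/en/lynx.htm'

# Assumed helper: extract only links whose host is exactly
# compmiscellanea.com or www.compmiscellanea.com, over http or https.
internal_links() {
  grep -E -o 'https?://(www\.)?compmiscellanea\.com(/.*)?$'
}

# Entry 3 is rejected (different host), entry 4 duplicates entry 1.
internal_unique=$(printf '%s\n' "$sample_links" | internal_links | sort | uniq)
printf '%s\n' "$internal_unique"
```

Links with an explicit port number after the host name would not match this pattern. The expression would need to be extended for such pages.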

"lynx -dump -listonly" is used instead of just "lynx -dump" because a web page may contain plain text strings that look like links (containing "http://", for instance), as is the case with the http://www.kompx.com/en/elinks.htm page. "lynx -dump" sends the formatted text of the page to output, where real links and link-like plain text strings look just the same, so grep cannot tell one from the other. "lynx -dump -listonly" outputs only the list of links, so link-like plain text strings cause no confusion.
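The difference can be illustrated on two hand-written samples that imitate the two kinds of output for the same imaginary page. The sample text, URLs and variable names are assumptions, not real Lynx output.

```shell
#!/bin/sh
# Hypothetical excerpt of plain "lynx -dump" output: body text, then the link list.
full_dump='   The address http://www.example.org/ is written as plain text here.
   Visit the [1]example home page.

References

   1. http://www.example.net/'

# Hypothetical "lynx -dump -listonly" output for the same page: the list only.
list_only='References

   1. http://www.example.net/'

# The same grep is applied to both samples.
from_dump=$(printf '%s\n' "$full_dump" | grep -o "http:.*" | wc -l | tr -d ' ')
from_list=$(printf '%s\n' "$list_only" | grep -o "http:.*" | wc -l | tr -d ' ')

echo "matches in full dump: $from_dump, matches in link list: $from_list"
```

In the full dump the plain text address is matched along with the real link, so the count is one too high. The list-only sample yields just the real link.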

