Kompx.com or Compmiscellanea.com

Lynx. Web data extraction

Operating systems : Linux

Aside from browsing / displaying web pages, Lynx can dump the formatted text of the content of a web document or its HTML source to standard output. And that then may be processed by means of some tools present in Linux, like gawk, Perl, sed, grep, etc. Some examples:

Dealing with external links

Count number of external links

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http:", sends the result further again to grep that picks lines not starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (external links of the web page) out of it, wc counts the number of links extracted and displays it:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l

Find external links and save them to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http:", sends the result further again to grep that picks lines not starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (external links of the web page) out of it and saves them to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Find external links, omit duplicate entries and save the output to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http:", sends the result further again to grep that picks lines not starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (external links of the web page) out of it, sort sorts them and uniq deletes duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt

Dealing with internal links

Count number of internal links

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (internal links), wc counts the number of links extracted and displays it:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Find internal links and save them to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (internal links) and saves them to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

Find internal links, omit duplicate entries and save the output to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (internal links), sort sorts them and uniq deletes duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt

The reason behind using "lynx -dump -listonly" instead of just "lynx -dump" is that there may be web pages with plain text strings looking like links (containing "http://" for instance) in the text of the content, as it is the case with http://www.kompx.com/en/elinks.htm page. "Lynx -dump" would send to output formatted text where real links and plain text links like strings would look just the same and grep would not be able to discern one from another. "Lynx -dump -listonly" gives only a list of links, so that there is no confusion with plain text links looking strings.


Aliosque subditos et thema

 

Float bottom

 

There is no "float: bottom" in CSS, but there is a way to achieve it by some other means. Example: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Link 1 Link 2 Link 3 Float bottom HTML / XHTML. Code: <div class="box1"> <div class="content1"> <div class="left1">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</div> <div class="menu1"> <p><a href="#">Link 1</a></p> <p><a href="#">Link 2</a></p> <p><a href="#">Link 3</a></p> </div> </div> <div class="bottom1">Float bottom</div> </div> CSS. Code: .box1 {position: relative; top: 0px; left: 0px; float: left; height: auto; width: 100%;} .content1 {position: relative; top: 0px; left: 0px; float: left; height: auto; width: 100%;} .left1 {position: relative; top: 0px; left: 0px; float: left; height: auto; width: 64%;} .menu1 {position: relative; top: 0px; left: 0px; float: left; height: auto; width: 36%;} .bottom1 {position: absolute; bottom: 0px; right: 0px;} /* Extra CSS, just styling the look */ .box1 {color: #ddd; text-align: center;} .content1 {background: #bbb;} .left1 {min-height: 100px; padding: 2%; text-align: justify; background: #006; -moz-box-sizing: border-box; -webkit-box-sizing: border-box; -ms-box-sizing: border-box; box-sizing: border-box;} .menu1 {padding: 2%; float: right; background: #060; -moz-box-sizing: border-box; -webkit-box-sizing: border-box; -ms-box-sizing: border-box; box-sizing: border-box;} .menu1 p {position: relative; top: 0px; left: 0px; float: left; height: auto; width: 100%; padding: 0px; margin: 0px;} .menu1 a {color: #ddd; text-decoration: none;} .menu1 a:hover {text-decoration: underline;} .bottom1 {padding: 2%; color: #eee; background: #600;} There is a web page with a div box, containing its content - box1. There are two div boxes inside it: 1. "content1" with the content proper on the left and menu on the right. 2. "bottom1" after the content1.

Unzip multiple files. Linux

 

Unzip multiple zip files into one directory by Linux command line unzip. Contrary to possible expectations, "unzip *.zip" is not going to work, *.zip should be put into quotes: unzip "*.zip" There may be files with the same names in these archives. To avoid overwriting: unzip -B "*.zip" "Unzip -B" makes unzip to overwrite duplicates during extraction process, but saving a backup copy of each overwritten file. The names for these backup copy files are created by adding tilde ("~") at the end of the original names of the files. If a file extension is present, then "~" is added after it. If that is not enough, unique sequence number (up to 5 digits) is appended after the "~". "Unzip -B" is not too practical. For example, since when the sequence number range for numbered backup files gets exhausted (99999, or 65535 for 16-bit systems), the backup file with the maximum sequence number is deleted and replaced by the new backup version without notice ( More on the subject ). The number of files in an archive may not be always known in advance or may be more than possible sequence number range, so "Unzip -B" is not a great choice. Renaming duplicate files by adding "~" at the end of their names, after the extension, is not too convenient either. But another built-in option is even worse. If the "-B" modifier is not used, each time a file with same name as there already unpacked is being extracted, unzip asks "replace example.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename:". And each time "r" must be hit, then a new name has to be input. So some bash or another script solving the problem should probably be prepared and used instead.