Kompx.com or Compmiscellanea.com

Lynx. Web data extraction

Operating systems : Linux

Aside from browsing and displaying web pages, Lynx can dump the formatted text of a web document, or its HTML source, to standard output. That output can then be processed with tools available in Linux, such as gawk, Perl, sed, and grep. Some examples:

Dealing with external links

Count number of external links

Lynx sends the list of links found in a web page to standard output. The first grep keeps only the part of each line that begins with "http:" and passes the result to a second grep, which drops the lines containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the web page remain. Finally, wc counts the remaining links and displays the number:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l

Find external links and save them to a file

Lynx sends the list of links found in a web page to standard output. The first grep keeps only the part of each line that begins with "http:" and passes the result to a second grep, which drops the lines containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the web page remain. The result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Find external links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. The first grep keeps only the part of each line that begins with "http:" and passes the result to a second grep, which drops the lines containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the web page remain. Then sort orders them and uniq deletes the duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt
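The three pipelines above differ only in their last stage, so the shared filtering can be wrapped in a small shell function. What follows is a minimal sketch rather than part of the original commands: the function name external_links and the variable SITE_RE are hypothetical, and the site pattern is anchored with its dots escaped, which is slightly stricter than the patterns used above.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of "lynx -dump -listonly" on stdin
# and prints the external links, one per line, without duplicates.
# SITE_RE is an assumed variable holding the site's host name as a regex.
SITE_RE='compmiscellanea\.com'

external_links() {
    grep -o 'http:.*' |                        # keep the URL part of each line
    grep -E -v "^http://(www\.)?${SITE_RE}" |  # drop links to the site itself
    sort -u                                    # sort and remove duplicates
}

# Usage (assumes lynx is installed and elinks.htm is in the current directory):
# lynx -dump -listonly "elinks.htm" | external_links > file.txt
```

Here sort -u stands in for "sort | uniq"; both give the same result for this purpose.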

Dealing with internal links

Count number of internal links

Lynx sends the list of links found in a web page to standard output. Grep extracts only the links that begin with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links), then wc counts the extracted links and displays the number:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Find internal links and save them to a file

Lynx sends the list of links found in a web page to standard output. Grep extracts only the links that begin with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links), and the result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

Find internal links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. Grep extracts only the links that begin with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links), then sort orders them and uniq deletes the duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt
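Internal and external links can also be counted in a single pass by classifying each link with awk, one of the tools mentioned at the start. This is only a sketch under assumptions: the function name count_links is hypothetical, and the site pattern is written with an anchor and escaped dots instead of the looser patterns used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of "lynx -dump -listonly" on stdin
# and prints how many http links are internal and how many are external.
count_links() {
    grep -o 'http:.*' |
    awk '
        # Links to the site itself, with or without the "www." prefix
        /^http:\/\/(www\.)?compmiscellanea\.com/ { internal++; next }
        # Everything else is an external link
        { external++ }
        END { printf "internal: %d\nexternal: %d\n", internal, external }
    '
}

# Usage (assumes lynx is installed and elinks.htm is in the current directory):
# lynx -dump -listonly "elinks.htm" | count_links
```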

The reason for using "lynx -dump -listonly" instead of just "lynx -dump" is that a web page may contain plain text strings that look like links (containing "http://", for instance) in its content, as is the case with the http://www.kompx.com/en/elinks.htm page. "lynx -dump" would output formatted text in which real links and link-like plain text strings look just the same, so grep would not be able to tell one from the other. "lynx -dump -listonly" outputs only the list of links, so link-like plain text strings cause no confusion.
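The difference can be illustrated without running Lynx at all. In the sketch below, the two here-documents are hand-written approximations of what "lynx -dump" and "lynx -dump -listonly" might print for a page whose text mentions a URL without linking to it. They are not captured from a real page, and the function names and example.com / example.org addresses are made up for the illustration.

```shell
#!/bin/sh
# Hand-written approximation of "lynx -dump" output: the body text contains
# a plain text URL, and the real link is listed under "References".
sample_dump() {
cat <<'EOF'
   Type http://example.com/ in the address bar, or visit the [1]home page.

References

   1. http://example.org/
EOF
}

# Hand-written approximation of "lynx -dump -listonly" output for the same page.
sample_listonly() {
cat <<'EOF'
References

   1. http://example.org/
EOF
}

# Count the lines that grep would treat as containing a link.
count_http() {
    grep -c 'http:'
}

# sample_dump | count_http       counts the plain text URL as well
# sample_listonly | count_http   counts only the real link
```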


Other articles and topics


Arachne web browser. Installing and setting up for dial-up internet connection


A: Installing Arachne web browser on a disk created in RAM. Arachne runs the fastest this way. RAM size should allow for a RAM disk of 6 MB or more.

In order to install and set up Arachne web browser for a dial-up internet connection, several programs have to be at hand:

1. Arachne web browser
2. If Arachne web browser is to be used for surfing web pages with a character encoding other than Latin for West European languages, visit www.glennmcc.org/apm/ to find the available character set packages and download the necessary one.
3. Mouse driver, mouse.com for instance
4. Archivers. For example, PKZIP and PKUNZIP
5. If the system is not MS-DOS 6.0+, QEMM97
6. If the system is not MS-DOS 6.0+, TDSK

Installing and setting up Arachne web browser, step by step:

1. Create a RAM disk. The drive letter assigned to it follows from these rules: A: and B: go to floppy drives (even if there is only one, both letters are reserved anyway), and C: goes to the first active primary MS-DOS partition on the first physical hard disk. If there are more disks, as many consecutive letters are used as are needed to name them all. Unless devices are installed using DRIVER.SYS or similar drivers, the next drive letter is assigned to the RAM disk. To be sure, after adding the line that creates the RAM disk to CONFIG.SYS (see below), restart the computer and check which letter has been assigned to the RAM disk. In this case, it is E:

How many megabytes can be reserved for the RAM disk depends on the RAM size. Basically, the more the better, since the web browser cache, for instance, is going to swell during prolonged and intensive use within a session. In this example the RAM disk is 12 000 KB. The maximum size for the RAMDRIVE.SYS MS-DOS driver is 32 767 KB; for TDSK it is 64 MB.

In order to create such a disk, the following line has to be added somewhere in the middle of CONFIG.SYS:

DEVICE=C:\DOS\RAMDRIVE.SYS 12000 512 512 /E

2. Create a folder, for example C:\DRIVERS\. Put a mouse driver there, for instance mouse.com

3. Add a line that starts the mouse driver to AUTOEXEC.BAT. Specify the full path to the driver there; the path may be any:

LH C:\DRIVERS\MOUSE.COM

4. Run MemMaker or OPTIMIZE from QEMM97 to optimize base memory management. If it is MemMaker, press Enter at any suggestion and MemMaker will handle the rest itself. The computer is going to restart several times, and MemMaker will re-run each time; again, just pressing Enter is a safe choice. If it is QEMM97 (specifically OPTIMIZE), there are going to be several restarts too, and each time just pressing Enter is OK.

5. Start the installation of Arachne web browser on the RAM disk. In the case discussed it is E:

A195GPL.EXE

Press Y to continue. Press N to specify the path to the folder Arachne web browser is to be installed in. Specify the path to the folder Arachne web browser is to be installed in. In the case discussed it is E:\ARACHNE\.

CSS vertical alignment


CSS vertical alignment of a block element containing text and images. It works for various combinations of inline and block elements.

Example: CSS vertical alignment CSS vertical alignment

HTML / XHTML. Code:

<div class="parent">
  <div class="child">
    <div class="childcontent">CSS vertical alignment</div>
    <div class="childcontent"><img src="image.jpg" width="68" height="68" alt="Image" /></div>
    <div class="childcontent">CSS vertical alignment</div>
  </div>
</div>

CSS. Code:

.parent {position: relative; left: 0px; top: 0px; height: 200px; display: table;}
.child {position: relative; left: 0px; top: 0px; display: table-cell; vertical-align: middle;}
.childcontent {position: relative; left: 0px; top: 0px;}

Note: .parent and .childcontent may be floated left ("float: left;") or not, but .child must be without "float: left;" for this method of CSS vertical alignment to work.

[ 1 ] As well as Netscape 6.01+, Mozilla 0.6+.
[ 2 ] As well as Netscape 6.01+, Mozilla 0.6+.