Kompx.com or Compmiscellanea.com

Lynx. Web data extraction

Operating systems : Linux

Aside from browsing / displaying web pages, Lynx can dump the formatted text of the content of a web document or its HTML source to standard output. And that then may be processed by means of some tools present in Linux, like gawk, Perl, sed, grep, etc. Some examples:

Dealing with external links

Count number of external links

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http:", sends the result further again to grep that picks lines not starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (external links of the web page) out of it, wc counts the number of links extracted and displays it:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l

Find external links and save them to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http:", sends the result further again to grep that picks lines not starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (external links of the web page) out of it and saves them to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Find external links, omit duplicate entries and save the output to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http:", sends the result further again to grep that picks lines not starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (external links of the web page) out of it, sort sorts them and uniq deletes duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt

Dealing with internal links

Count number of internal links

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (internal links), wc counts the number of links extracted and displays it:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Find internal links and save them to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (internal links) and saves them to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

Find internal links, omit duplicate entries and save the output to a file

Lynx sends list of links from the content of a web page to standard output. Grep looks only for lines starting with "http://compmiscellanea.com" and "http://www.compmiscellanea.com" (internal links), sort sorts them and uniq deletes duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt

The reason behind using "lynx -dump -listonly" instead of just "lynx -dump" is that there may be web pages with plain text strings looking like links (containing "http://" for instance) in the text of the content, as it is the case with http://www.kompx.com/en/elinks.htm page. "Lynx -dump" would send to output formatted text where real links and plain text links like strings would look just the same and grep would not be able to discern one from another. "Lynx -dump -listonly" gives only a list of links, so that there is no confusion with plain text links looking strings.


Aliosque subditos et thema

 

Arachne web browser. Installing and setting up for internet connection via Ethernet

 

A : Installing Arachne web browser on a disk created in RAM - Arachne runs the fastest this way. RAM size should allow for a RAM disk of 6 MB or more. In order to install and set up Arachne web browser for internet connection via Ethernet, there have to be several programs at hand: 1. Arachne web browser [ Download ] 2. If Arachne web browser is to be used for surfing web pages with character encoding other than Latin for West European languages, visit www.glennmcc.org/apm/ to find available character set packages and download the necessary one. 3. Mouse driver, mouse.com for instance [ Download ] 4. Packet driver for Ethernet network card: http://www.crynwr.com/drivers/ http://www.georgpotthast.de/sioux/packet.htm 5. Microsoft Network Client 3.0 [ Download ] 6. NDIS 2.0 driver for Ethernet network card. For example: Realtek RTL8029AS [ Download ]. Drivers for other network cards may be found, for instance, on web sites of Ethernet cards manufacturers. 7. Archivers. For example, PKZIP [ Download ] and PKUNZIP [ Download ] 8. If it is not MS-DOS 6.0+ to be used, QEMM97 [ Download ] 9. If it is not MS-DOS 6.0+ to be used, TDSK [ Download ] Installing and setting up Arachne web browser, step by step: 1. Create a RAM disk. Which drive letter will be assigned to it comes from the assumption that A: and B: go to floppy drives (even if there is only one, both letters will be reserved anyway), C: goes to the first active primary MS-DOS partition on the first physical hard disk. If there are more disks, then there will be as many letters used consecutively as to name them all. Unless there are no devices installed using DRIVER.SYS or similar drivers, the next drive letter will be assigned to the RAM disk. In order to be sure, after having the relevant string for making RAM disk added to CONFIG.SYS (See below), computer could be restarted and what letter is assigned to the RAM disk checked by experiment. In this case, it is E: Depending on RAM size it needs to be decided how many megabytes can be reserved for RAM disk. Basically, the more the better. Since, for instance, web browser cache is going to swell during prolonged and intensive use within a session. In this example the RAM disk is 12 000 KB. The maximum size for RAMDRIVE.SYS MS-DOS driver is 32 767 KB, the one of TDSK - 64 MB. In order to create such a disk, the string has to be added somewhere in the middle of CONFIG.SYS as follows: DEVICE=C:\DOS\RAMDRIVE.SYS 12000 512 512 /E 2. Create a folder, for example C:\DRIVERS\. Put there: a mouse driver, for instance mouse.com, a packet driver for Ethernet network card and a NDIS 2.0 driver for Ethernet network card. 3. Add a string starting mouse driver to AUTOEXEC.BAT. Specify there the full path to the driver, may be any: LH C:\DRIVERS\MOUSE.COM 4. Prepare installation floppies of Microsoft Network Client 3.0: DSK3-1.EXE -d A: DSK3-2.EXE -d A: 5. Start setup.exe from the first floppy and begin Microsoft Network Client 3.0 installation. Installation is starting. Press Enter to continue Select folder for Microsoft Network Client 3.0 to be installed to. It may be any or the suggestion of the installer may be left as it is - in the case discussed it is left as it is.

CSS centering <hr />

 

CSS centering <hr />, if its width is less than 100%. Horizontal centering. Example: HTML / XHTML. Code: <hr /> CSS. Code: hr {width: 50%; margin: 0 25% 0 25%;} /* Extra CSS, just styling the look: */ hr {height: 1px; float: left; border: 0px; color: #f00; background: #f00;} Note: mostly it works both with float: left and float: none. But float: left makes it for sure. [ 1 ] As well as Netscape 4.04+, Mozilla 0.6+. [ 2 ] As well as Netscape 4.04+, Mozilla 0.6+.