Kompx.com or Compmiscellanea.com

Lynx. Web data extraction

Operating systems : Linux

Aside from browsing and displaying web pages, Lynx can dump either the formatted text of a web document or its HTML source to standard output. That output may then be processed with tools available in Linux, such as gawk, Perl, sed and grep. Some examples:

Dealing with external links

Count number of external links

Lynx sends the list of links found in a web page to standard output. The first grep, run with -o, prints only the part of each line that begins with "http:", which strips the list numbering. The second grep, run with -v, drops every line that contains "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the page remain. Finally, wc -l counts the extracted links and displays the number:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l

Find external links and save them to a file

Lynx sends the list of links found in a web page to standard output. The first grep, run with -o, prints only the part of each line that begins with "http:". The second grep, run with -v, drops every line that contains "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the page remain. The result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Find external links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. The first grep, run with -o, prints only the part of each line that begins with "http:". The second grep, run with -v, drops every line that contains "http://compmiscellanea.com" or "http://www.compmiscellanea.com", so that only the external links of the page remain. Then sort orders the links and uniq deletes the duplicate entries, which are adjacent after sorting. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt
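The three pipelines above share the same filtering steps, so they can be wrapped in one small shell function. The following is a minimal sketch and not part of the original commands: the function name external_links and its host argument are hypothetical, and it assumes the input is the numbered list printed by "lynx -dump -listonly". Unlike the commands above, it also accepts "https:" links and anchors the site pattern at the start of the URL, so an external URL that merely contains the site address is not dropped.

```shell
#!/bin/sh
# external_links HOST
# Reads the output of "lynx -dump -listonly" on standard input and prints
# the unique links that do not point to HOST or www.HOST.
# (Hypothetical helper, for illustration only.)
external_links() {
    # Escape the dots so that they match literally in the regular expression.
    host=$(printf '%s' "$1" | sed 's/\./\\./g')
    # 1. Keep only the URL part of each line (drops the list numbering).
    # 2. Drop URLs whose host is HOST or www.HOST.
    # 3. Sort and remove duplicate entries in one step.
    grep -E -o 'https?://.*' |
        grep -E -v "^https?://(www\.)?${host}(/|:|\$)" |
        sort -u
}
```

It might be used as: lynx -dump -listonly "elinks.htm" | external_links compmiscellanea.com > file.txt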

Dealing with internal links

Count number of internal links

Lynx sends the list of links found in a web page to standard output. Grep, run with -E -o, prints only the part of each line that begins with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the internal links). Then wc -l counts the extracted links and displays the number:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Find internal links and save them to a file

Lynx sends the list of links found in a web page to standard output. Grep, run with -E -o, prints only the part of each line that begins with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the internal links). The result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

Find internal links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. Grep, run with -E -o, prints only the part of each line that begins with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the internal links). Then sort orders the links and uniq deletes the duplicate entries, which are adjacent after sorting. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt
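The counting commands for external and internal links each need their own run of Lynx. As a rough sketch under stated assumptions, both counts could be taken from a single listing by comparing the host part of every link with the site's host name in awk. The function name count_links and its host argument are hypothetical, the input is assumed to be the output of "lynx -dump -listonly", and "https:" links are counted as well, which the commands above do not do.

```shell
#!/bin/sh
# count_links HOST
# Reads "lynx -dump -listonly" output on standard input and prints how many
# links point to HOST (or www.HOST) and how many point elsewhere.
# Duplicate links are counted each time they occur.
# (Hypothetical helper, for illustration only.)
count_links() {
    grep -E -o 'https?://.*' |
        awk -v host="$1" '
            {
                url = $0
                sub(/^https?:\/\//, "", url)   # strip the scheme
                sub(/^www\./, "", url)         # strip an optional "www."
                sub(/[\/:?#].*$/, "", url)     # keep only the host part
                if (url == host) internal++; else external++
            }
            END { printf "internal: %d\nexternal: %d\n", internal, external }
        '
}
```

It might be used as: lynx -dump -listonly "elinks.htm" | count_links compmiscellanea.com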

The reason for using "lynx -dump -listonly" instead of just "lynx -dump" is that a web page may contain plain-text strings that look like links (containing "http://", for instance) in its content, as is the case with the http://www.kompx.com/en/elinks.htm page. "lynx -dump" would output formatted text in which real links and such link-like plain-text strings look just the same to grep, so it could not tell one from the other. "lynx -dump -listonly" outputs only the list of links, so there is no confusion with link-like plain-text strings.
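To make the difference concrete, the sketch below runs the same grep over two hand-written samples that imitate the two kinds of Lynx output. The sample text, the function names full_dump and list_only, and the example.org addresses are made up for illustration; they are not the actual output for the page mentioned above.

```shell
#!/bin/sh
# Imitation of "lynx -dump": formatted text followed by the list of links.
full_dump() {
    cat <<'EOF'
   [1]A real link. A plain-text address: http://example.org/not-a-link

References

   1. http://example.org/real-link
EOF
}

# Imitation of "lynx -dump -listonly": the list of links only.
list_only() {
    cat <<'EOF'
References

   1. http://example.org/real-link
EOF
}

# The plain-text address is counted together with the real link.
full_dump | grep -o "http:.*" | wc -l

# Only the real link is counted.
list_only | grep -o "http:.*" | wc -l
```

With these samples the first count is one higher than the second, because grep also matches the address that appears only as text.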


Other subjects and topics

Network setup in DOS. Microsoft Network Client 3.0

In order to install Microsoft Network Client 3.0 and set up a network in DOS, several programs have to be at hand:

1. Microsoft Network Client 3.0
2. An NDIS 2.0 driver for the Ethernet network card, for example Realtek RTL8029AS. Drivers for other network cards may be found, for instance, on the web sites of Ethernet card manufacturers.
3. QEMM97, if the DOS to be used is not MS-DOS 6.0+

Installing Microsoft Network Client 3.0 and setting up a network in DOS, step by step:

1. Create a folder, for example C:\DRIVERS\, and put the NDIS 2.0 driver for the Ethernet network card there.
2. Prepare the installation floppies of Microsoft Network Client 3.0:

DSK3-1.EXE -d A:
DSK3-2.EXE -d A:

3. Start setup.exe from the first floppy and begin the Microsoft Network Client 3.0 installation. The installer then goes through the following screens:

Installation is starting. Press Enter to continue.
Select the folder for Microsoft Network Client 3.0 to be installed to. It may be any folder, or the installer's suggestion may be left as it is; in the case discussed it is left as it is. Press Enter.
The Microsoft Network Client 3.0 installer examines the system files.
Select the driver for the Ethernet network card. If the right driver is not on the list, choose "*Network adapter not shown on list below ...". Press Enter.
The next dialogue appears only if the right driver was not on the proposed list of Ethernet network card drivers and "*Network adapter not shown on list below ..." has been selected. Specify the path to the folder containing the appropriate driver for the Ethernet network card. In the case discussed it is C:\DRIVERS\. Press Enter.
Select the driver from the C:\DRIVERS\ folder specified in the previous step. In the case discussed it is RTL8029AS PCI Ethernet Adapter. Press Enter.
Choose whether to let Microsoft Network Client 3.0 use more RAM to get the best performance. Either choice is acceptable; in this example it is allowed. Press Enter.
Enter a user name of up to 20 characters. It can contain Latin letters, numbers and the characters listed. In the case discussed it is "net".

Windows console applications

Some time ago text-based applications were the only kind of software the average end user dealt with. Even after programs with a graphical user interface started to become widespread, console applications retained their strong position for a while. Gradually, however, GUI software virtually superseded text-based applications in the daily use of the average end user. Even so, there are still console programs that can more or less compete with GUI software and help the average user solve various problems and carry out numerous tasks on modern computers.

Windows console applications. File managers
Windows console applications. Multimedia
Windows console applications. Web browsers
Windows console applications. Text editors

Besides file managers, multimedia programs, text editors and web browsers, there are plenty of other text-based programs and utilities for use under Windows, both standalone and included in MS Windows distributions. For example: ipconfig and netstat for working with the network; the Windows built-in FTP client, useful for some tasks; CommandBurner for burning CDs / DVDs from the command line, or cdburn and dvdburn from Windows Server 2003 Support Tools for the same purpose; etc.