Kompx.com or Compmiscellanea.com

Lynx. Web data extraction

Operating systems : Linux

Aside from browsing / displaying web pages, Lynx can dump the formatted text of a web document, or its HTML source, to standard output. That output can then be processed with tools available on Linux, such as gawk, Perl, sed and grep. Some examples:

Dealing with external links

Count number of external links

Lynx sends the list of links found in a web page to standard output. The first grep extracts from each line the part beginning with "http:" and passes the result to a second grep, which discards the lines containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", leaving only the external links of the web page. wc counts the remaining links and displays the number:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l
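To see what the filter chain does without fetching a page, the sketch below feeds it a hand-written sample in the style of "lynx -dump -listonly" output. The sample URLs and the function names sample_list and count_external are illustrative assumptions, not part of the article; real Lynx output depends on the page and the Lynx version.

```shell
# Hypothetical sample of a link list as printed by "lynx -dump -listonly".
sample_list() {
cat <<'EOF'
References

   1. http://www.compmiscellanea.com/en/lynx.htm
   2. http://compmiscellanea.com/en/links.htm
   3. http://elinks.or.cz/
   4. http://lynx.isc.org/
   5. http://elinks.or.cz/
EOF
}

# Same filter chain as in the article, reading from standard input instead of lynx.
count_external() {
  grep -o "http:.*" |
    grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" |
    wc -l
}

# Three of the five sample links are external (one of them appears twice).
sample_list | count_external
```

Duplicates are counted here, since nothing in this chain removes them.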

Find external links and save them to a file

Lynx sends the list of links found in a web page to standard output. The first grep extracts from each line the part beginning with "http:" and passes the result to a second grep, which discards the lines containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", leaving only the external links of the web page. The result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Find external links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. The first grep extracts from each line the part beginning with "http:" and passes the result to a second grep, which discards the lines containing "http://compmiscellanea.com" or "http://www.compmiscellanea.com", leaving only the external links of the web page. sort sorts them and uniq deletes duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt
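The commands above match only "http:" links and treat the dots in the host name as regular expression wildcards. A possible variant, sketched below under assumptions, also accepts "https:", escapes the dots, anchors the site pattern to the start of the link, and uses "sort -u" in place of "sort | uniq". The function names unique_external and sample_mixed, the host parameter and the sample URLs are hypothetical.

```shell
# Hypothetical sample list mixing http and https links.
sample_mixed() {
cat <<'EOF'
References

   1. http://www.compmiscellanea.com/en/lynx.htm
   2. https://elinks.or.cz/
   3. http://lynx.isc.org/
   4. https://elinks.or.cz/
EOF
}

# Reads a link list on standard input and prints unique external links.
# $1: site host without "www." (an illustrative parameter, not from the article).
unique_external() {
  # Escape dots so that "." in the host name matches only a literal dot.
  site_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  grep -E -o 'https?://.*' |
    grep -E -v "^https?://(www\.)?${site_re}(/|\$)" |
    sort -u
}

sample_mixed | unique_external compmiscellanea.com
```

With real pages the list would come from lynx instead of sample_mixed, for example: lynx -dump -listonly "elinks.htm" | unique_external compmiscellanea.com > file.txt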

Dealing with internal links

Count number of internal links

Lynx sends the list of links found in a web page to standard output. Grep extracts from each line the part beginning with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links), wc counts the links extracted and displays the number:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Find internal links and save them to a file

Lynx sends the list of links found in a web page to standard output. Grep extracts from each line the part beginning with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links), and the result is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

Find internal links, omit duplicate entries and save the output to a file

Lynx sends the list of links found in a web page to standard output. Grep extracts from each line the part beginning with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (internal links), sort sorts them and uniq deletes duplicate entries. The output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt
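The internal and external cases can also be combined into one small report. The sketch below is only one way to do it: it deduplicates the list once and then counts unique internal and external links using the same patterns as the article. The function names link_report and sample_links and the sample URLs are assumptions made for illustration.

```shell
# Hypothetical sample of a link list as printed by "lynx -dump -listonly".
sample_links() {
cat <<'EOF'
References

   1. http://www.compmiscellanea.com/en/lynx.htm
   2. http://compmiscellanea.com/en/links.htm
   3. http://elinks.or.cz/
   4. http://www.compmiscellanea.com/en/lynx.htm
   5. http://lynx.isc.org/
EOF
}

# Reads a link list on standard input and prints counts of unique links.
link_report() {
  # Extract the links and drop duplicates once.
  links=$(grep -o "http:.*" | sort -u)
  # grep -c prints the number of matching lines.
  internal=$(printf '%s\n' "$links" |
    grep -E -c "http://compmiscellanea.com|http://www.compmiscellanea.com")
  total=$(printf '%s\n' "$links" | grep -c "http:")
  echo "internal: $internal, external: $((total - internal))"
}

sample_links | link_report
```

For the sample above this prints "internal: 2, external: 2".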

The reason for using "lynx -dump -listonly" instead of just "lynx -dump" is that a web page may contain plain-text strings that look like links (containing "http://", for instance), as is the case with the http://www.kompx.com/en/elinks.htm page. "lynx -dump" outputs formatted text in which real links and link-like plain-text strings look the same, so grep cannot tell one from the other. "lynx -dump -listonly" outputs only the list of links, so link-like plain-text strings cause no confusion.
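The difference can be illustrated with two hand-written samples: one resembling full "lynx -dump" output, where the body text contains a link-like plain string, and one resembling "lynx -dump -listonly" output for the same page. Both samples and the function names sample_dump and sample_listonly are hypothetical; actual Lynx output will differ in detail.

```shell
# Hypothetical full dump: formatted text followed by the list of links.
sample_dump() {
cat <<'EOF'
   The [1]ELinks home page. A plain string like http://example.org/ is not a link.

References

   1. http://elinks.or.cz/
EOF
}

# Hypothetical list-only dump of the same page.
sample_listonly() {
cat <<'EOF'
References

   1. http://elinks.or.cz/
EOF
}

# The same grep finds two "links" in the full dump, but only the real one in the list.
echo "from -dump:     $(sample_dump | grep -o 'http:.*' | wc -l | tr -d ' ')"
echo "from -listonly: $(sample_listonly | grep -o 'http:.*' | wc -l | tr -d ' ')"
```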


Other subjects and topics

 

ELinks

 

ELinks is an effort to create an advanced text-based web browser. It started as a fork of the Links browser, aiming first to try out several features that were weak or absent in Links. Hence the "E" in "ELinks": "Experimental" [Links]. The success of the effort led to the "E" being read as "Extended" or "Enhanced".

A crossroads came when the Links browser reached a certain level of completeness, surpassing in some areas what was then the most advanced text-mode web browser, Lynx. One way forward was to display graphics and move beyond pure text; the other was to enhance text-based web surfing beyond the boundaries reached first by Lynx and then by Links, while still staying in text mode. The first course resulted in a Links version capable of displaying the graphic content of web pages, Links2. The second is the ELinks web browser.

Lynx was and is very mature software of its kind. Its authors conceived and implemented an elaborate concept of web surfing in text mode, with specific abstractions and conventions that helped to overcome many restrictions and shortcomings of text-based surfing and created an experience distinctly different from the rapidly expanding graphical web. But over time HTML and hardware moved forward, scripting languages spread, and the whole world of presenting, finding and consuming information advanced. New possibilities appeared. Many of them were realized in the Links web browser, but the next shift in the visual presentation of web documents, from mostly HTML to more CSS, opened new roads even within text mode. That is where ELinks tries to go: colors in capable consoles, some CSS positioning and even the beginnings of JavaScript / ECMAScript support.

The technical side of networking (such as SSL support) and support for various text encodings were already strong in Links, but ELinks enhanced some features and worked others out more fully. ELinks moved the concept of a text-mode web browser forward and became its most advanced example. Lynx still holds its position firmly, though. Its concept of text-mode web surfing, which simplifies and takes a different approach to presenting and handling information rather than trying to resemble graphical web browsers, works quite well. As web documents grow more complicated in implementation (and given all the inevitable restrictions of text-mode web browsing), handling them in a different way competes well with trying to be like mainstream, full-featured graphical desktop browsers. Browsers for small-screen mobile devices face a similar dilemma: imitate computers with full-sized displays, or transform the web document to suit the characteristics of the environment. Text-based web browsers are used mostly on computers with fairly large displays, so there are fewer dimensional restrictions and more temptations: Lynx chooses to stay restrained, ELinks to extend.

Features

Text-based web browser. Versions for Linux, other *nix systems, Windows, DOS, OS/2, BeOS and some others. HTML (including tables and frames). Meagre support for CSS and JavaScript. Support for 16, 88 or 256 color palettes in capable terminal emulators / consoles. Tabbed browsing, background download with queuing. Mouse support. Editing of text boxes / forms in web pages in an external text editor. Shortcuts for URLs. Scripting in Perl, Lua, Guile, Ruby. Passing the URI of the current web page, or of a link in it, to external applications: from a clipboard application (to copy the URI and paste it somewhere else) to another web browser, etc. Control over how the HTML of visited web pages is rendered, such as whether to display frames. Bookmarks. HTTP and proxy authentication. Persistent HTTP cookies. SSL. http, https, ftp, fsp, IPv4, IPv6 and, experimentally, BitTorrent, gopher and nntp protocols.

Configuration

Go to "ELinks.

Arachne web browser. Installing and setting up for internet connection via Ethernet

 

A : Installing Arachne web browser on a disk created in RAM. Arachne runs fastest this way. RAM size should allow for a RAM disk of 6 MB or more.

To install and set up Arachne web browser for internet connection via Ethernet, several programs have to be at hand:

1. Arachne web browser
2. If Arachne web browser is to be used for surfing web pages with a character encoding other than Latin for West European languages, visit www.glennmcc.org/apm/ to find the available character set packages and download the necessary one.
3. Mouse driver, mouse.com for instance
4. Packet driver for the Ethernet network card: http://www.crynwr.com/drivers/ http://www.georgpotthast.de/sioux/packet.htm
5. Microsoft Network Client 3.0
6. NDIS 2.0 driver for the Ethernet network card. For example: Realtek RTL8029AS. Drivers for other network cards may be found, for instance, on the web sites of Ethernet card manufacturers.
7. Archivers. For example, PKZIP and PKUNZIP
8. If MS-DOS 6.0+ is not used, QEMM97
9. If MS-DOS 6.0+ is not used, TDSK

Installing and setting up Arachne web browser, step by step:

1. Create a RAM disk. Its drive letter follows from the usual assignment: A: and B: go to floppy drives (even if there is only one, both letters are reserved anyway), and C: goes to the first active primary MS-DOS partition on the first physical hard disk. If there are more disks, as many consecutive letters are used as are needed to name them all. Provided no devices are installed using DRIVER.SYS or similar drivers, the next drive letter is assigned to the RAM disk. To be sure, after the line creating the RAM disk has been added to CONFIG.SYS (see below), restart the computer and check which letter has actually been assigned to the RAM disk. In this case, it is E:

Depending on RAM size, decide how many megabytes can be reserved for the RAM disk. Basically, the more the better, since the web browser cache, for instance, is going to swell during prolonged and intensive use within a session. In this example the RAM disk is 12 000 KB. The maximum size for the RAMDRIVE.SYS MS-DOS driver is 32 767 KB, for TDSK it is 64 MB. To create such a disk, add the following line somewhere in the middle of CONFIG.SYS:

DEVICE=C:\DOS\RAMDRIVE.SYS 12000 512 512 /E

2. Create a folder, for example C:\DRIVERS\. Put there: a mouse driver (mouse.com, for instance), a packet driver for the Ethernet network card and an NDIS 2.0 driver for the Ethernet network card.

3. Add a line starting the mouse driver to AUTOEXEC.BAT. Specify the full path to the driver, which may be any:

LH C:\DRIVERS\MOUSE.COM

4. Prepare installation floppies of Microsoft Network Client 3.0:

DSK3-1.EXE -d A:
DSK3-2.EXE -d A:

5. Start setup.exe from the first floppy and begin the Microsoft Network Client 3.0 installation.

Installation is starting. Press Enter to continue.

Select the folder for Microsoft Network Client 3.0 to be installed to. It may be any folder, or the installer's suggestion may be left as it is. In the case discussed, it is left as it is.