
Lynx browser. Creating sitemap.xml

Operating systems : Linux

There are more than a few online services for sitemap.xml generation. But it is also possible to do it yourself, by means of the Lynx web browser and several Linux command line utilities. An example bash script employing them, named "sitemap.sh", is described below.

Bash script creating a sitemap.xml file:

#!/bin/bash

cd /home/me/sitemap/www/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

cd /home/me/sitemap/www2/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

echo "...Done"

After the bash script file is prepared, run "chmod +x sitemap.sh" to make it executable.
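For example, assuming the script is saved as sitemap.sh in the current directory, it may be made executable and run like this:

chmod +x sitemap.sh
./sitemap.sh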

Download sitemap.sh in the sitemap.sh.tar.gz archive ( after downloading and unpacking it, put the web site name with "www" instead of http://www.compmiscellanea.com/ and the web site name without "www" instead of http://compmiscellanea.com/ in the file. Replace "static" in the last sed command of the file with a string that unnecessary links contain, so that they get removed. Then "chmod +x sitemap.sh" and run sitemap.sh ).

Commentary

Download sitemap2.sh with line-by-line commentary in the sitemap2.sh.tar.gz archive.

Before running the bash script, three folders should be created. Since Lynx may miss some links depending on whether the domain name of the web site to be crawled is given with or without "www", the script runs Lynx twice: once crawling the web site by its name with "www" and once by its name without "www".

The two resulting files are put into two of these folders, here "/home/me/sitemap/www/" and "/home/me/sitemap/www2/". The third one, "/home/me/sitemap/sitemap/", is for the sitemap.xml created in the end.
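For example, assuming the same paths as in the script, the three folders may be created in one go:

mkdir -p /home/me/sitemap/www /home/me/sitemap/www2 /home/me/sitemap/sitemap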


1. Path to bash:

#!/bin/bash

2. Changing to the folder where Lynx is going to put the files obtained from crawling the web site with "www" in its name:

cd /home/me/sitemap/www/

3. Running Lynx to crawl the web site. Since some links may be missed if the domain name is given only with or only without "www", the script runs Lynx twice, crawling the web site by its name with "www" and by its name without "www". Here it is with "www".

Lynx will automatically go through all the pages and the links on them. All cookies are accepted. The time Lynx spends trying to connect when following each link may be set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null
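After this step, assuming a typical Lynx build, the traversal output lands in the current folder, roughly along these lines (the lnk*.dat files come from "-crawl", the rest from "-traversal"):

ls /home/me/sitemap/www/
lnk00000000.dat  lnk00000001.dat  ...  reject.dat  traverse.dat  traverse2.dat

It is traverse.dat, the plain list of traversed URLs, that the rest of the script works with.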

4. Changing to another folder, where Lynx is going to put the files obtained from crawling the web site without "www" in its name:

cd /home/me/sitemap/www2/

5. Running Lynx to crawl the web site again. Since some links may be missed if the domain name is given only with or only without "www", the script runs Lynx twice, crawling the web site by its name with "www" and by its name without "www". Here it is without "www".

Lynx will automatically go through all the pages and the links on them. All cookies are accepted. The time Lynx spends trying to connect when following each link may be set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

6. Running Lynx twice, with and without "www" in the domain name, creates two files with the collected links. So here the content of the second file is appended to the end of the first one:

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

7. Links gathered while crawling the web site by its name without "www" have no "www." in the URLs, so to make the collection uniform, the remaining links are stripped of "www.". The list is then sorted alphabetically by sort, uniq removes duplicate entries, and the result is written into a file named "sitemap.xml", created in the process:

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml
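For instance, a hypothetical pair of duplicate entries

http://www.example.com/page.htm
http://example.com/page.htm

is reduced by this pipeline to a single line:

http://example.com/page.htm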

8. If there are &, ', ", >, < in URLs, they are to be replaced by &amp;, &apos;, &quot;, &gt;, &lt;. Other special and non-ASCII characters are supposed to be made compliant with the current sitemap.xml file standards [ 1 ] and common practice [ 2 ] by the web site's developers or its CMS.

Otherwise Lynx will attempt to interpret the URLs according to its own rules and abilities, read them and write them to traverse.dat. Depending on the environment Lynx is run in, it may be more or less successful at this.

So, & is replaced by &amp;

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

9. ' is replaced by &apos;

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

10. " is replaced by &quot;

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

11. > is replaced by &gt;

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

12. < is replaced by &lt;

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml
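Taken together, these replacements turn a hypothetical URL such as

http://example.com/page.php?a=1&b=2

into

http://example.com/page.php?a=1&amp;b=2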

13. www. is added to all the links:

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

14. <url><loc> is added before every line:

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

15. </loc></url> is added at the end of every line:

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml
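At this point, after steps 13-15, a hypothetical entry like

http://example.com/page.htm

has become

<url><loc>http://www.example.com/page.htm</loc></url>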

16. The opening tags of the XML document and a comment are added before the content of the file:

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

17. The closing tag of the XML document is added after the content of the file:

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

18. Unnecessary links containing a given string are removed:

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml
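So with "static" as the string, a hypothetical entry like

<url><loc>http://www.example.com/static/style.css</loc></url>

is deleted from the file.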

19. Reporting that the process is complete:

echo "...Done"
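Put together, the resulting sitemap.xml should look roughly like this, with hypothetical example.com URLs standing in for the real ones:

<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

<!-- created by sitemap.sh from http://www.compmiscellanea.com/en/lynx-browser-creating-sitemap.xml.htm -->

<url><loc>http://www.example.com/</loc></url>
<url><loc>http://www.example.com/page.htm</loc></url>

</urlset>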


Lynx browser docs on "-traversal" and "-crawl" switches: CRAWL.announce.

