Kompx.com or Compmiscellanea.com

Lynx browser. Creating sitemap.xml

Operating systems : Linux

There are more than a few online services for generating sitemap.xml. But it is also possible to do it yourself, using the lynx web browser and several Linux command line utilities. An example bash script employing them, named "sitemap.sh", is described below.

Bash script creating a sitemap.xml file:

#!/bin/bash

cd /home/me/sitemap/www/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

cd /home/me/sitemap/www2/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

echo "...Done"

Once the bash script file is prepared, run "chmod +x sitemap.sh" to make it executable.

Download sitemap.sh in sitemap.sh.tar.gz archive. After downloading and unpacking it, replace http://www.compmiscellanea.com/ in the file with the web site name with "www", and http://compmiscellanea.com/ with the web site name without "www". Replace "static" in the last sed command of the file with a string that marks the unnecessary links to be removed. Then "chmod +x sitemap.sh" and run sitemap.sh.

Commentary

Download sitemap2.sh with line by line commentary in sitemap2.sh.tar.gz archive.

Before running the bash script, three folders should be created. Lynx may miss some links depending on whether the domain name of the web site to be crawled is given with or without "www", so the bash script runs lynx twice: once crawling the web site by its name with "www" and once by its name without "www".

The two result files are put into two separate folders, here "/home/me/sitemap/www/" and "/home/me/sitemap/www2/". The third folder, "/home/me/sitemap/sitemap/", is for the sitemap.xml created in the end.
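The three folders could be prepared in one step before the first run. Below is a minimal sketch, assuming the same example paths as in the script. The function name make_sitemap_dirs and its base directory argument are illustrative and not part of sitemap.sh:

```shell
#!/bin/bash

# Hypothetical helper: create the three working folders under a base directory.
# The default base path matches the example paths used in sitemap.sh.
make_sitemap_dirs() {
    local base="${1:-/home/me/sitemap}"
    # -p creates missing parent folders and does not fail if a folder already exists
    mkdir -p "$base/www" "$base/www2" "$base/sitemap"
}
```

Calling "make_sitemap_dirs /home/me/sitemap" would then create www/, www2/ and sitemap/ under that path. Running it a second time is harmless.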


1. Path to bash:

#!/bin/bash

2. Going to the folder where lynx browser is going to put the files obtained from crawling the web site with "www" in its name:

cd /home/me/sitemap/www/

3. Running lynx browser to crawl the web site, here by its name with "www". This is the first of the two runs: lynx may miss some links depending on whether the domain name is given with or without "www", so the bash script crawls the web site both ways.

Lynx automatically goes through all the pages and the links on them. All cookies are accepted. The time lynx is to spend trying to connect when following each link may be set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

4. Going to another folder, where lynx browser is going to put the files obtained from crawling the web site without "www" in its name:

cd /home/me/sitemap/www2/

5. Running lynx browser to crawl the web site again, here by its name without "www". This is the second of the two runs, so that links missed when crawling with "www" can be collected.

Lynx automatically goes through all the pages and the links on them. All cookies are accepted. The time lynx is to spend trying to connect when following each link may be set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

6. The two lynx runs, one with "www" and one without, create two files with the links collected. Here the content of the second file is appended to the end of the first one:

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

7. Links gathered by lynx crawling the web site by its name without "www" have no "www." in their URLs, so to make the collection of links uniform, "www." is stripped from the rest of the links. Then the links are sorted alphabetically by sort, and uniq removes duplicate entries. The result is written into a file named "sitemap.xml", created in the process:

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

8. If there are &, ', ", >, < in URLs, they are to be replaced by &amp;, &apos;, &quot;, &gt;, &lt;. Other special and non-ASCII characters are supposed to be made compliant with the current sitemap.xml file standards [ 1 ] and common practice [ 2 ] by the web site's developers or its CMS.

Otherwise lynx attempts to interpret the URLs according to its own rules and abilities, to read them and then write them to traverse.dat. Depending on the environment lynx is run in, it may or may not succeed.

So, & is replaced by &amp;:

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

9. ' is replaced by &apos;

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

10. " is replaced by &quot;

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

11. > is replaced by &gt;

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

12. < is replaced by &lt;

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

13. www. is added to all the links:

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

14. <url><loc> is added before every line:

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

15. </loc></url> is added after every line:

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

16. Opening tags of XML document and a comment are added before the content of the file:

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

17. Closing tag of XML document is added after the content of the file:

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

18. Unnecessary links with a given string in them are removed:

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

19. Reporting that the process is completed:

echo "...Done"
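Steps 6 to 18 above can also be sketched as a single filter that reads the collected links and prints the finished sitemap, without editing the file in place many times. This is only an illustration under stated assumptions, not the sitemap.sh from the archive: the function name urls_to_sitemap is hypothetical, only http:// links are handled, the XML header is reduced to the plain urlset element without the schemaLocation attributes and the comment, "www." is stripped only right after "http://", and unwanted links are dropped first rather than last:

```shell
#!/bin/bash

# Hypothetical helper: read crawled URLs on stdin (one per line, as in
# traverse.dat) and print a complete sitemap on stdout.
# $1 is the string marking links to be dropped ("static" in the article).
urls_to_sitemap() {
    local skip="${1:-static}"

    # Simplified header (assumption: no schemaLocation attributes, no comment)
    printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
    printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'

    # 1. drop unwanted links
    # 2. strip "www." so links from both crawls look the same
    # 3. sort and remove duplicates
    # 4. escape &, ', ", >, < (the ampersand must be handled first)
    # 5. put "www." back and wrap every line in <url><loc>...</loc></url>
    grep -v -F -- "$skip" |
    sed -e 's|^http://www\.|http://|' |
    sort -u |
    sed -e 's/&/\&amp;/g' \
        -e "s/'/\&apos;/g" \
        -e 's/"/\&quot;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/</\&lt;/g' \
        -e 's|^http://|http://www.|' \
        -e 's|^|<url><loc>|' \
        -e 's|$|</loc></url>|'

    printf '%s\n' '</urlset>'
}
```

With the example folders from the article, it could be used as: cat /home/me/sitemap/www/traverse.dat /home/me/sitemap/www2/traverse.dat | urls_to_sitemap static > /home/me/sitemap/sitemap/sitemap.xml. Keeping everything in one pipeline means the output file is written once, and the order of the escaping rules is visible in one place.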


Lynx browser docs on "-traversal" and "-crawl" switches: CRAWL.announce.

