Kompx.com or Compmiscellanea.com

Lynx browser. Creating sitemap.xml

Operating systems : Linux

There are more than a few online services for sitemap.xml generation. But it is also possible to do it yourself, using the lynx web browser and several Linux command-line utilities. An example bash script employing them, named "sitemap.sh", is described below.

Bash script creating a sitemap.xml file:

#!/bin/bash

cd /home/me/sitemap/www/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

cd /home/me/sitemap/www2/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

echo "...Done"

After the bash script file is prepared, run "chmod +x sitemap.sh" to make it executable.

Download sitemap.sh in sitemap.sh.tar.gz archive. After downloading and unpacking it, replace http://www.compmiscellanea.com/ in the file with the address of the web site with "www", and http://compmiscellanea.com/ with the address of the web site without "www". Replace "static" in the last sed command of the file with a string that the unnecessary links to be removed have in common. Then run "chmod +x sitemap.sh", then run sitemap.sh.

Commentary

Download sitemap2.sh with line by line commentary in sitemap2.sh.tar.gz archive.

Before running the bash script, three folders should be created. Since the lynx browser may miss some links depending on whether the domain name of the web site to be crawled is given with or without "www", the bash script runs lynx twice: once crawling the web site by its name with "www" and once by its name without "www".

The two result files are put into two separate folders, here "/home/me/sitemap/www/" and "/home/me/sitemap/www2/". The third folder, "/home/me/sitemap/sitemap/", is for the sitemap.xml created in the end.
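The three folders can be created in one step. The following is a minimal sketch; the function name make_sitemap_dirs is hypothetical, and the base path in the usage comment is the one used in this article.

```shell
#!/bin/bash
# Hypothetical helper: create the three working folders under a base folder.
# "www" and "www2" receive the lynx crawl results, "sitemap" the final file.
make_sitemap_dirs() {
    base="$1"
    # -p creates missing parent folders and does not fail if a folder exists
    mkdir -p "$base/www" "$base/www2" "$base/sitemap"
}

# Usage with the paths from the article:
# make_sitemap_dirs /home/me/sitemap
```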


1. Path to bash:

#!/bin/bash

2. Going to a folder - lynx browser is going to put there the files obtained from crawling a web site with "www" in its name:

cd /home/me/sitemap/www/

3. Running lynx browser to crawl a web site. Since lynx may miss some links depending on whether the domain name of the web site to be crawled is given with or without "www", the bash script runs lynx browser twice: once crawling the web site by its name with "www" and once by its name without "www". Here it is with "www".

Lynx will automatically go through all the pages and the links on them. All cookies are accepted. The time lynx spends trying to connect when following each link is set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

4. Going to another folder - lynx browser is going to put there the files obtained from crawling the web site without "www" in its name:

cd /home/me/sitemap/www2/

5. Running lynx browser to crawl a web site. Since lynx may miss some links depending on whether the domain name of the web site to be crawled is given with or without "www", the bash script runs lynx browser twice: once crawling the web site by its name with "www" and once by its name without "www". Here it is without "www".

Lynx will automatically go through all the pages and the links on them. All cookies are accepted. The time lynx spends trying to connect when following each link is set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null
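Since the same lynx command is run twice in different folders, the two crawls could be wrapped in one function. This is only a sketch: the name crawl_site is hypothetical, the lynx options are the ones used in this article, and the check that lynx is installed is an addition. The subshell keeps the "cd" from changing the working folder of the rest of the script.

```shell
#!/bin/bash
# Hypothetical helper: crawl one start URL and leave traverse.dat in a folder.
# $1: folder lynx should write its files into, $2: start URL
crawl_site() {
    # Stop early with a message if lynx is not installed
    command -v lynx >/dev/null 2>&1 || { echo "lynx not found" >&2; return 1; }
    # The parentheses run cd and lynx in a subshell
    ( cd "$1" && lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 "$2" > /dev/null )
}

# Usage with the paths and addresses from the article:
# crawl_site /home/me/sitemap/www/  http://www.compmiscellanea.com/
# crawl_site /home/me/sitemap/www2/ http://compmiscellanea.com/
```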

6. Running lynx browser twice, crawling the web site by its name with "www" and crawling the web site by its name without "www", creates two files with the links collected. So here the content of the second file is added to the end of the first one:

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

7. Links gathered by lynx crawling the web site by its name without "www" have no "www." in the URLs, so to make the collection of links uniform, "www." is stripped from the rest of the links. Then the links are sorted alphabetically by sort. Then uniq removes duplicate entries. Then the result is written into a file named "sitemap.xml" created in the process:

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml
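The effect of this step can be tried on a few sample lines without crawling anything. The sketch below is a variation, not the script's own expression: it removes "www." only directly after the scheme, while the script's pattern removes it anywhere in the line, and it uses "sort -u" in place of "sort | uniq". The function name normalize_urls and the example.com URLs are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: strip "www." after the scheme, sort, drop duplicates.
normalize_urls() {
    # \(https\{0,1\}://\) matches "http://" or "https://" and is kept as \1
    sed -e 's#^\(https\{0,1\}://\)www\.#\1#' | sort -u
}

# Three sample links, two of which differ only by "www."
printf '%s\n' \
    'http://www.example.com/b.htm' \
    'http://example.com/a.htm' \
    'http://example.com/b.htm' | normalize_urls
```

With this input only two lines remain, a.htm and b.htm, both without "www.".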

8. If there are &, ', ", >, < in URLs, they are to be replaced by &amp;, &apos;, &quot;, &gt;, &lt;. Other special and non-ASCII characters are supposed to be made compliant with the current sitemap.xml file standards [ 1 ] and common practice [ 2 ] by the web site's developers or its CMS.

Otherwise lynx will try to interpret such URLs according to its own rules and abilities before writing them to traverse.dat. Depending on the environment lynx is run in, the result may or may not be correct.

So, & is replaced by &amp;

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

9. ' is replaced by &apos;

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

10. " is replaced by &quot;

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

11. > is replaced by &gt;

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

12. < is replaced by &lt;

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml
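The five replacements of steps 8 to 12 could also be done in a single sed call reading from standard input. A minimal sketch, with the hypothetical function name xml_escape; the order matters, since "&" has to be replaced first so that the "&" of the entities added afterwards is not escaped again.

```shell
#!/bin/bash
# Hypothetical helper: escape the five XML special characters in one pass.
xml_escape() {
    # In the replacement part "\&" is a literal "&"
    sed -e 's/&/\&amp;/g' \
        -e "s/'/\&apos;/g" \
        -e 's/"/\&quot;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/</\&lt;/g'
}

# A sample URL with "&" and "'" in the query string
printf '%s\n' "http://www.example.com/page.htm?a=1&b='x'" | xml_escape
```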

13. www. is added to all the links:

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

14. <url><loc> is added before every line:

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

15. </loc></url> is added after every line:

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml
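Steps 13 to 15 can be sketched as one filter. Using "#" instead of "/" as the sed delimiter avoids escaping every slash in the URLs and in the closing tags. The function name wrap_urls and the sample URL are hypothetical, and unlike the script, "www." is added only at the start of the line.

```shell
#!/bin/bash
# Hypothetical helper: put "www." back and wrap every line in <url><loc> tags.
wrap_urls() {
    # In the second expression "&" stands for the whole matched line
    sed -e 's#^http://#http://www.#' \
        -e 's#.*#<url><loc>&</loc></url>#'
}

printf '%s\n' 'http://example.com/a.htm' | wrap_urls
```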

16. Opening tags of XML document and a comment are added before the content of the file:

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

17. Closing tag of XML document is added after the content of the file:

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml
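As an alternative to inserting the opening and closing tags with sed, they could be printed around the list of links. The sketch below assumes the <url> lines arrive on standard input; the function name emit_sitemap is hypothetical, and the schemaLocation attributes and the comment from the script are left out for brevity.

```shell
#!/bin/bash
# Hypothetical helper: print the XML declaration, the opening <urlset> tag,
# the <url> lines read from standard input, and the closing tag.
emit_sitemap() {
    printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
    printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    cat    # copies standard input unchanged
    printf '%s\n' '</urlset>'
}

printf '%s\n' '<url><loc>http://www.example.com/</loc></url>' | emit_sitemap
```

Since printf '%s\n' prints its argument literally, none of the dots, slashes or quotes need escaping here.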

18. Unnecessary links with a given string in them are removed:

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml
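The same kind of filtering can be tried on sample lines with grep. This is a sketch of a variation, not the script's method: it treats the string literally instead of as a regular expression and works on standard input, so it could be applied to the plain list of links before the tags are added. The function name drop_matching and the sample URLs are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: drop every line that contains the given string.
# $1: string that marks unnecessary links
drop_matching() {
    # -F: fixed string, -v: print only lines that do not match.
    # "|| true" keeps the exit status at 0 even if nothing is left.
    grep -F -v -- "$1" || true
}

printf '%s\n' \
    'http://example.com/a.htm' \
    'http://example.com/static/logo.png' | drop_matching 'static'
```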

19. Reporting the process is completed:

echo "...Done"


Lynx browser docs on "-traversal" and "-crawl" switches: CRAWL.announce.

