Kompx.com or Compmiscellanea.com

Lynx browser. Creating sitemap.xml

Operating systems : Linux

There are more than few online services for sitemap.xml generation. But it is also possible to do it yourself, by means of lynx web browser and several Linux command line utilities. An example bash script employing them, named "sitemap.sh" is described below.

Bash script creating a sitemap.xml file:

#!/bin/bash

cd /home/me/sitemap/www/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

cd /home/me/sitemap/www2/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

echo "...Done"

After the bash script file is prepared: "chmod +x sitemap.sh" to make it executable.

Download sitemap.sh in sitemap.sh.tar.gz archive ( After downloading and unpacking it, put a web site name with "www" instead of http://www.compmiscellanea.com/ and a web site name without "www" instead of http://compmiscellanea.com/ in the file. Replace "static" in the last line of the file by a string unnecessary links should possess to be removed. Then "chmod +x sitemap.sh". Then run sitemap.sh ).

Commentary

Download sitemap2.sh with line by line commentary in sitemap2.sh.tar.gz archive.

Before running the bash script, three folders should be created. Since lynx browser may miss some links if a web site domain name to be crawled is put with or without "www", bash script runs lynx twice, crawling the web site by its name with "www" and crawling the web site by its name without "www".

The two result files are put into two of these separate folders, here they are "/home/me/sitemap/www/" and "/home/me/sitemap/www2/". And "/home/me/sitemap/sitemap/" is for sitemap.xml created in the end.


1. Path to bash:

#!/bin/bash

2. Going to a folder - lynx browser is going to put there the files obtained from crawling a web site with "www" in its name:

cd /home/me/sitemap/www/

3. Running lynx browser to crawl a web site. Since some links may be missed by lynx if the domain name of the web site to be crawled is put with or without "www", bash script runs lynx browser twice, crawling the web site by its name with "www" and crawling the web site by its name without "www". Here it is with "www".

Lynx will automatically go through all the pages and the links on them. All cookies are to be accepted. An amount of time lynx is to try to connect following each link may be set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

4. Going to another folder - lynx browser is going to put there the files obtained from crawling the web site without "www" in its name:

cd /home/me/sitemap/www2/

5. Running lynx browser to crawl a web site. Since some links may be missed by lynx if the domain name of the web site to be crawled is put with or without "www", bash script runs lynx browser twice, crawling the web site by its name with "www" and crawling the web site by its name without "www". Here it is without "www".

Lynx will automatically go through all the pages and the links on them. All cookies are to be accepted. An amount of time lynx is to try to connect following each link may be set in seconds by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

6. Running lynx browser twice, crawling the web site by its name with "www" and crawling the web site by its name without "www", creates two files with the links collected. So here the content of the second file is added to the end of the first one:

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

7. Links gathered by lynx crawling the web site by its name without "www" have no "www." in the URLs, so to make links collection uniform, the rest of links are stripped from "www.". Then sorted alphabetically by sort. Then uniq removes duplicate entries. Then the result is written into a file named "sitemap.xml" created in the process:

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

8. If there are &, ', ", >, < in URLs, they are to be replaced by &amp;, &apos;, &quot;, &gt;, &lt;. Other special and non-ASCII characters are supposed to be made compliant with the current sitemap.xml file standards [ 1 ] and common practice [ 2 ] by the web site's developers or its CMS.

Otherwise lynx is going to attempt to understand the URLs according to its rules and abilities, to try and read them, then write them to traverse.dat. Depending on the environment lynx is run in, sometimes it will be more or less successful, sometimes more or less not.

So, & is replaced by &amp;

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

9. ' is replaced by &apos;

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

10. " is replaced by &quot;

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

11. > is replaced by &gt;

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

12. < is replaced by &lt;

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

13. www. is added to all the links:

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

14. <url><loc> is added before every line:

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

15. </url></loc> is added after every line:

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

16. Opening tags of XML document and a comment are added before the content of the file:

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

17. Closing tag of XML document is added after the content of the file:

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

18. Unnecessary links with a given string in them are removed:

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

19. Reporting the process is completed:

echo "...Done"


Lynx browser docs on "-traversal" and "-crawl" switches: CRAWL.announce.


Aliosque subditos et thema

 

Selecting all in mc

 

Selecting all files and folders in a directory viewed in a panel of Midnight Commander. Achieving the same effect a does Ctrl - A under Microsoft Windows or Cmd - A in Mac OS X: Open a directory --> Ctrl - Shift - = --> Enter * [asterisk] --> Enter A directory with files and folders Selection options prompt Entering * [asterisk] All files and folders has been selected The keyboard shortcut of Ctrl - Shift - = is meant to type plus symbol. So just Shift - = may also happen to be enough. The "Using shell patterns" option in selection options prompt has to be checked on in order to make asterisk to work as the symbol for "all" or "any".

CSS rollover. Changing background image position

 

CSS rollover. Changing background image position. Example: CSS rollover. Changing background image position HTML / XHTML. Code: <div class="example"> <a href="css-rollover-changing-background-image-position.htm">CSS rollover. Changing background image position</a> </div> CSS. Code: .example {position: relative; top: 0px; left: 0px; height: auto; width: 100%; white-space: nowrap; padding: 10px 20px 10px 20px;} .example a {height: 35px; width: 100%; font-size: 24px; line-height: 35px; color: #006; text-decoration: none; text-align: right; display: block; background: url("bgimg.png") no-repeat left top;} .example a:hover {color: #c00; background: url("bgimg.png") no-repeat left bottom;} Background image: CSS rollover Changing background image position The concept : 1. A link with width and height declared explicitly (here it is 100% and 35px), plus display: block - this makes a 100% by 35px rectangle, the whole area of which is a hyperlink. Line-height is declared explicitly as well, equal to the rectangle's height (35px here) - it helps to have the same vertical alignment of the link text across various web browsers. Text-align: right is added to move the link text to the right in the the link block. (The link block area is highlighted in green to make it more obvious) CSS rollover. Changing background image position 2. An image for the link block area background with height equal to two heights of the link block area is created. The image has two parts, the upper one and the lower one, each of height equal to the height of the link block area; here it is 35px. 3. The background image URL is added to the background CSS property of the link. The CSS background property values of no-repeat and left position the background image to the left of the link block area. The CSS background property value of top places the upper part of the background image on the same level as the link text. 4. The background image URL is added to background CSS property of the link's :hover pseudo-class. The CSS background property values of no-repeat and left of the link's :hover pseudo-class position the background image to the left of the link block area. The CSS background property value of bottom of the link's :hover pseudo-class places the lower part of the background image on the same level as the link text, when a user moves the mouse pointer over the link block area. So, when there is no mouse pointer over the link block area, the upper part of the background image is placed on the same level as the link text. When there is a mouse pointer moved over the link block area, the lower part of the background image gets on the same level as the link text. Mouse out - the upper part is displayed, mouse over - the lower part shows up. And no need to preload any additional images. CSS rollover.