Kompx.com or Compmiscellanea.com

Lynx browser. Creating sitemap.xml

Operating systems: Linux

There are more than a few online services for sitemap.xml generation. But it is also possible to do it yourself, by means of the lynx web browser and several Linux command line utilities. An example bash script employing them, named "sitemap.sh", is described below.

Bash script creating a sitemap.xml file:

#!/bin/bash

cd /home/me/sitemap/www/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

cd /home/me/sitemap/www2/

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

echo "...Done"

After the bash script file is prepared, run "chmod +x sitemap.sh" to make it executable.

Download sitemap.sh in the sitemap.sh.tar.gz archive ( After downloading and unpacking it, put the web site name with "www" instead of http://www.compmiscellanea.com/ and the web site name without "www" instead of http://compmiscellanea.com/ in the file. Replace "static" in the last sed command of the file with a string that unnecessary links contain, so that those links get removed. Then "chmod +x sitemap.sh". Then run sitemap.sh ).
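For instance, for a hypothetical web site example.com, and supposing the unnecessary links contain the string "print" (both are placeholders used only for illustration, not taken from the original script), the edited lines might look like this:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.example.com/ > /dev/null

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://example.com/ > /dev/null

sed -i '/print/d' /home/me/sitemap/sitemap/sitemap.xml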

Commentary

Download sitemap2.sh with line-by-line commentary in the sitemap2.sh.tar.gz archive.

Before running the bash script, three folders should be created. Since the lynx browser may miss some links depending on whether the domain name of the web site to be crawled is given with or without "www", the script runs lynx twice: once crawling the web site by its name with "www" and once by its name without "www".

The two resulting files are put into two separate folders, here "/home/me/sitemap/www/" and "/home/me/sitemap/www2/". The third folder, "/home/me/sitemap/sitemap/", is for the sitemap.xml created in the end.
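The folders may be created, for example, like this (the paths are those assumed by the script above):

mkdir -p /home/me/sitemap/www /home/me/sitemap/www2 /home/me/sitemap/sitemap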


1. Path to bash:

#!/bin/bash

2. Changing to the folder where lynx is going to put the files obtained from crawling the web site with "www" in its name:

cd /home/me/sitemap/www/

3. Running lynx browser to crawl the web site. Since some links may be missed by lynx depending on whether the domain name is given with or without "www", the script runs lynx twice, crawling the web site by its name with "www" and by its name without "www". Here it is with "www".

Lynx automatically goes through all the pages and the links on them. All cookies are accepted. The time, in seconds, lynx keeps trying to connect when following each link is set by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://www.compmiscellanea.com/ > /dev/null

4. Changing to another folder, where lynx is going to put the files obtained from crawling the web site without "www" in its name:

cd /home/me/sitemap/www2/

5. Running lynx browser to crawl the web site again, this time by its name without "www" (see step 3 for why both runs are needed).

Lynx automatically goes through all the pages and the links on them, accepting all cookies; the connection timeout per link is again set by the "-connect_timeout" option:

lynx -crawl -traversal -accept_all_cookies -connect_timeout=30 http://compmiscellanea.com/ > /dev/null

6. The two lynx runs, with "www" and without "www", create two files with the collected links. Here the content of the second file is appended to the end of the first one:

cat /home/me/sitemap/www2/traverse.dat >> /home/me/sitemap/www/traverse.dat

7. Links gathered while crawling the web site by its name without "www" have no "www." in their URLs, so to make the collection uniform, "www." is stripped from the rest of the links as well. The list is then sorted alphabetically by sort, uniq removes duplicate entries, and the result is written to a newly created file named "sitemap.xml":

cat /home/me/sitemap/www/traverse.dat | sed -e 's/\<www\>\.//g' | sort | uniq > /home/me/sitemap/sitemap/sitemap.xml

8. If there are &, ', ", >, < characters in URLs, they are to be replaced by &amp;, &apos;, &quot;, &gt;, &lt;. Other special and non-ASCII characters are expected to be made compliant with the current sitemap.xml standard [ 1 ] and common practice [ 2 ] by the web site's developers or its CMS.

Otherwise lynx attempts to interpret such URLs according to its own rules and abilities and writes them to traverse.dat as it understands them. Depending on the environment lynx runs in, the result will sometimes be more or less usable, sometimes not.

So, & is replaced by &amp;

sed -i 's/\&/\&amp\;/g' /home/me/sitemap/sitemap/sitemap.xml

9. ' is replaced by &apos;

sed -i "s/'/\&apos\;/g" /home/me/sitemap/sitemap/sitemap.xml

10. " is replaced by &quot;

sed -i 's/"/\&quot\;/g' /home/me/sitemap/sitemap/sitemap.xml

11. > is replaced by &gt;

sed -i 's/>/\&gt\;/g' /home/me/sitemap/sitemap/sitemap.xml

12. < is replaced by &lt;

sed -i 's/</\&lt\;/g' /home/me/sitemap/sitemap/sitemap.xml

13. www. is added to all the links:

sed -i 's/http:\/\//http:\/\/www\./g' /home/me/sitemap/sitemap/sitemap.xml

14. <url><loc> is added before every line:

sed -i -e 's/^/<url><loc>/' /home/me/sitemap/sitemap/sitemap.xml

15. </loc></url> is added after every line:

sed -i -e 's/$/<\/loc><\/url>/' /home/me/sitemap/sitemap/sitemap.xml

16. The XML declaration, the opening <urlset> tag and a comment are added before the content of the file:

sed -i -e '1 i <?xml version="1\.0" encoding="UTF-8"?>\r\r<urlset xmlns="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9" xmlns:xsi="http:\/\/www\.w3\.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9 http:\/\/www\.sitemaps\.org\/schemas\/sitemap\/0\.9\/sitemap\.xsd">\r\r<!-- created by sitemap.sh from http:\/\/www.compmiscellanea.com\/en\/lynx-browser-creating-sitemap.xml\.htm -->\r\r' /home/me/sitemap/sitemap/sitemap.xml

17. The closing </urlset> tag is added after the content of the file:

sed -i -e '$ a \\r</urlset>' /home/me/sitemap/sitemap/sitemap.xml

18. Unnecessary links containing a given string (here "static") are removed:

sed -i '/static/d' /home/me/sitemap/sitemap/sitemap.xml

19. Reporting that the process is completed:

echo "...Done"


Lynx browser docs on "-traversal" and "-crawl" switches: CRAWL.announce.


Aliosque subditos et thema

 

Lynx. Web data extraction

 

Aside from browsing / displaying web pages, Lynx can dump the formatted text of a web document's content or its HTML source to standard output. That output may then be processed by tools present in Linux, like gawk, Perl, sed, grep, etc. Some examples:

Dealing with external links

Count the number of external links. Lynx sends the list of links from the content of a web page to standard output. Grep keeps only the lines starting with "http:" and passes the result to a second grep, which picks out the lines not starting with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the external links of the web page); wc counts the number of links extracted and displays it:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | wc -l

Find external links and save them to a file. Lynx sends the list of links from the content of a web page to standard output. Grep keeps only the lines starting with "http:" and passes the result to a second grep, which picks out the external links and saves them to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" > file.txt

Find external links, omit duplicate entries and save the output to a file. As above, but sort sorts the links and uniq deletes duplicate entries; the output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -o "http:.*" | grep -E -v "http://compmiscellanea.com|http://www.compmiscellanea.com" | sort | uniq > file.txt

Dealing with internal links

Count the number of internal links. Lynx sends the list of links from the content of a web page to standard output. Grep keeps only the lines starting with "http://compmiscellanea.com" or "http://www.compmiscellanea.com" (the internal links); wc counts the number of links extracted and displays it:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | wc -l

Find internal links and save them to a file. Grep keeps only the internal links and saves them to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" > file.txt

Find internal links, omit duplicate entries and save the output to a file. Grep keeps only the internal links, sort sorts them, uniq deletes duplicate entries, and the output is saved to a file:

lynx -dump -listonly "elinks.htm" | grep -E -o "http://compmiscellanea.com.*|http://www.compmiscellanea.com.*" | sort | uniq > file.txt

The reason for using "lynx -dump -listonly" instead of just "lynx -dump" is that some web pages contain plain text strings looking like links (containing "http://", for instance) in the text of the content, as is the case with the http://www.kompx.com/en/elinks.htm page.

Windows console applications. Multimedia

 

MPlayer : FFmpeg

Media players appeared long ago, but their heyday began with the mass spread of personal computers powerful enough to play video files. This coincided with the mass spread of operating systems and other software with a graphical user interface. However, a program with a graphical user interface is dualistic by nature: there is code responsible for the graphical user interface, for the appearance, and there is code for performing the task the application has been created for in the first place. Both consume system resources, and their response time to user actions adds up to a certain amount of waiting. In cases or concepts where appearance is taken as less important, to the point of sparing or almost sparing its use, console applications, among others, appear. Moreover, the separation of the GUI and the engine makes it easier to change the graphical user interface or perform complex automated operations. The scheme is implemented for media players for Windows as well. MPlayer, for instance, in its usual form is a console application: starting up quickly, responding fast to user actions, spending system resources almost entirely on its immediate task. On this basis, if desired, one or another graphical interface may be added, creating, all in all, a new application.

MPlayer - / home page / Console media player for Windows. Basis for SMPlayer and UMPlayer. There are versions for Linux, FreeBSD, NetBSD, OpenBSD, Apple Darwin, Mac OS X, QNX, OpenSolaris/Solaris, Irix, HP-UX, AIX, some other *nix systems, BeOS, Syllable, AmigaOS, AROS, MorphOS, DOS, Windows. Supported video and audio formats, static images, subtitles, etc. ( List and More extensive list of video and audio codecs ).

Screenshots: MPlayer: "Dead Man", MPlayer: "Sky Captain and the World of Tomorrow", MPlayer: "10,000 BC", MPlayer: "13 Tzameti", MPlayer: "The Draughtsman's Contract", MPlayer: "Balzaminov's Marriage".

FFmpeg - / home page / Pack of utilities and libraries for working with video and audio files. Created in and for Linux, but there is a Windows variant. The source code may be compiled for some other operating systems. Supported file formats and codecs: ( List ). Also, VLC media player can be run in text mode, using ncurses.
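A minimal sketch of starting VLC with its ncurses text-mode interface, assuming vlc is installed and "video.avi" stands for an existing media file (the file name is a placeholder):

vlc -I ncurses video.avi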