ER/Box powered by Compiere
 

Apache Forrest Notes

Generate Google Sitemap file

Submitting a Google Sitemap file may help Google represent your site accurately in its search index. The following modifications let Forrest generate a valid Sitemap file automatically on each run (or on the fly in interactive mode). The whole process is controlled by three new entries in your project's forrest.properties file.

1. Add New Default Properties

The following new property entries are added to the Forrest default properties file {FORREST_HOME}\main\webapp\forrest-default.properties:

# Switch to generate a Google Sitemap file
project.generate-google-sitemap=false

# Name of the Google Sitemap file
project.google-sitemap-name=sitemap.xml

# Prefix to generate absolute urls in Google Sitemap file
project.site-uri-prefix=http://localhost/

2. Override New Default Properties

The new default properties have to be adjusted in the properties file of your project {PROJECT_HOME}\forrest.properties:

# Switch to generate a Google Sitemap file
project.generate-google-sitemap=true

# Name of the Google Sitemap file
#project.google-sitemap-name=sitemap.xml

# Prefix to generate absolute urls in Google Sitemap file
project.site-uri-prefix=http://www.erbox.org/
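The relationship between the default and the project properties can be sketched in a few lines of Python. This is only an illustration, assuming that a project entry simply takes precedence over the default of the same name; it does not reproduce how Forrest actually loads the files through Ant, and the helper names parse_properties and effective_settings are hypothetical.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            props[key.strip()] = value.strip()
    return props


def effective_settings(defaults_text: str, project_text: str) -> Dict[str, str]:
    """Assumption: project entries override defaults of the same name."""
    merged = parse_properties(defaults_text)
    merged.update(parse_properties(project_text))
    return merged


DEFAULTS = """
project.generate-google-sitemap=false
project.google-sitemap-name=sitemap.xml
project.site-uri-prefix=http://localhost/
"""

PROJECT = """
project.generate-google-sitemap=true
#project.google-sitemap-name=sitemap.xml
project.site-uri-prefix=http://www.erbox.org/
"""

settings = effective_settings(DEFAULTS, PROJECT)
for name in sorted(settings):
    print(name, "=", settings[name])
```

Because the sitemap name is commented out in the project file, the default sitemap.xml stays in effect, while the switch and the URI prefix take the project values.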

3. Make Properties Available to Cocoon Processing

The properties project.google-sitemap-name and project.site-uri-prefix have to be available in some of the Cocoon pipelines (i.e. *.xmap files) for further processing. This can be achieved by modifying the file {FORREST_HOME}\main\webapp\WEB-INF\xconf\forrest-core.xconf.

The lines

<google-sitemap-name>@project.google-sitemap-name@</google-sitemap-name>
<site-uri-prefix>@project.site-uri-prefix@</site-uri-prefix>

have to be added to the following sections:

...
<component-instance name="defaults" class="org.apache.forrest.conf.ForrestConfModule">
  <values>
    INSERT HERE
    ...
  </values>
</component-instance>
...

and

...
<component-instance name="project" class="org.apache.forrest.conf.ForrestConfModule">
  <values>
    INSERT HERE
    ...
  </values>
</component-instance>
...

4. Add New Request to Generate the Sitemap File

The site generation process is driven by Apache Ant; the main build file is {FORREST_HOME}\main\build.xml, which in turn calls {FORREST_HOME}\main\targets\site.xml. To additionally generate a Google Sitemap file, a second call to Cocoon has to be added after the original call that generates the complete website. Note that the two Cocoon calls have to run inside a <sequential> tag, which must be added as well.

...
    <parallel>
      <sequential> <!-- NEW TAG -->
        <java classname="org.apache.cocoon.Main"
            fork="true"
          ...
          UNMODIFIED: ORIGINAL CALL TO COCOON TO GENERATE WEBSITE
          ...
        </java> 

        <!-- START: ADDITIONAL CALL TO COCOON TO GENERATE SITEMAP FILE -->
        <if>
          <equals arg1="${project.generate-google-sitemap}" arg2="true"/>
          <then>
            <java classname="org.apache.cocoon.Main"
                fork="true"
                dir="${forrest.core.webapp}"  
                maxmemory="${forrest.maxmemory}"
                failonerror="true">
              <jvmarg line="${forrest.jvmargs}"/>
              <!-- The whole value must be on one line -->
              <jvmarg value="-Djava.endorsed.dirs=${forrest.endorsed.lib-dir}${path.separator}${java.endorsed.dirs}"/>
              <jvmarg value="-Dorg.apache.cocoon.core.LazyMode=true"/>
              <arg value="--logLevel=${project.debuglevel}"/>
              <arg value="--Logger=${project.logger}"/>
              <arg value="--logKitconfig=${project.logkitfile}"/>
              <arg value="--destDir=${project.site-dir}"/>
              <arg value="--xconf=${project.configfile}"/>
              <arg value="--brokenLinkFile=${project.brokenlinkfile}"/>
              <arg value="--workDir=${project.cocoon-work-dir}"/>
              <arg value="--followLinks=false"/>
              <arg value="${project.google-sitemap-name}"/>
              <classpath>
                <path refid="forrest.cp"/>
              </classpath>
              <syspropertyset>
                <propertyref prefix="forrest."/>
                <propertyref prefix="project."/>
              </syspropertyset>
            </java> 
          </then>
        </if>
        <!-- END: ADDITIONAL CALL TO COCOON -->

      </sequential> <!-- NEW TAG -->

      <sequential>
        <echo>
Copying the various non-generated resources to site.
...

The new call to Cocoon is a copy of the call that generates the site, with the following exceptions:

  • The crawling process starts from ${project.google-sitemap-name} instead of ${project.start-uri}.
  • The site is not crawled recursively, because the parameter <arg value="--followLinks=false"/> is set.

5. Set Default Encoding of XML Serializer in sitemap.xmap to UTF-8

The Google Sitemap file has to be provided in UTF-8 encoding. Since the default encoding of the XML serializer in {FORREST_HOME}\main\webapp\sitemap.xmap is not UTF-8 it has to be customized:

...
      <map:serializer name="xml" mime-type="text/xml"
                        src="org.apache.cocoon.serialization.XMLSerializer">
        <encoding>UTF-8</encoding>
      </map:serializer>
...

6. Add New Matcher to sitemap.xmap

The new request to ${project.google-sitemap-name} has to be handled inside of a Cocoon pipeline in the file {FORREST_HOME}\main\webapp\sitemap.xmap. This is achieved by the following modification:

...
      <!-- Body content -->
      <map:match pattern="**.xml">

        ...
        
        <map:match pattern="linkmap.xml">
          <map:mount uri-prefix="" src="linkmap.xmap" check-reload="yes" />
        </map:match>
        
        <!-- NEW MATCHER -->
        <map:match pattern="{project:google-sitemap-name}">
          <map:mount uri-prefix="" src="googlesitemap.xmap" check-reload="yes" />
        </map:match>

        <map:match pattern="forrest-issues.xml">
          <map:mount uri-prefix="" src="issues.xmap" check-reload="yes" />
        </map:match>
...

7. Add New Pipeline googlesitemap.xmap

As can be seen from the new matcher above, matching requests are handled in the (new) separate pipeline {FORREST_HOME}\main\webapp\googlesitemap.xmap. This file originated as a copy of {FORREST_HOME}\main\webapp\linkmap.xmap and was adapted to handle parameters as needed to generate a Google Sitemap file.

Contents of the file:

<?xml version="1.0"?>
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:components>
    <map:generators default="file"/>
    <map:serializers default="html"/>
    <map:transformers default="xslt"/>
    <map:matchers default="wildcard"/>
  </map:components>

  <map:pipelines>
 
     <map:pipeline>     
      <map:match pattern="{project:google-sitemap-name}">
        <map:generate src="cocoon://abs-linkmap" />
        <map:transform src="{forrest:stylesheets}/googlesitemap2document.xsl">
          <map:parameter name="site-uri-prefix" value="{project:site-uri-prefix}" />
        </map:transform>
        <map:serialize type="xml" />
      </map:match>
    </map:pipeline>

  </map:pipelines>

</map:sitemap>

8. Add New Transformation googlesitemap2document.xsl

As can be seen from the new pipeline above the transformation is performed by the (new) XSLT stylesheet {FORREST_HOME}\main\webapp\resources\stylesheets\googlesitemap2document.xsl. This file originated as a copy of {FORREST_HOME}\main\webapp\resources\stylesheets\linkmap2document.xsl and was adapted to generate Google Sitemap syntax.

Contents of the file:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                   xmlns="http://www.google.com/schemas/sitemap/0.84">
   <xsl:output method="xml" 
               version="1.0" 
               omit-xml-declaration="no" 
               indent="yes" />
       
   <xsl:param name="site-uri-prefix"/>

   <xsl:template match="/">
     <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
       <xsl:apply-templates select="*[not(self::site)]" />
     </urlset>
   </xsl:template>     

   <xsl:template match="*">
     <xsl:if test="@label">
       <xsl:if test="@href">
         <url>
           <loc><xsl:value-of select="$site-uri-prefix"/><xsl:value-of select="@href"/></loc>
           <xsl:if test="@lastmod">
             <lastmod><xsl:value-of select="@lastmod"/></lastmod>
           </xsl:if>
           <xsl:if test="@changefreq">
             <changefreq><xsl:value-of select="@changefreq"/></changefreq>
           </xsl:if>
           <xsl:if test="@priority">
             <priority><xsl:value-of select="@priority"/></priority>
           </xsl:if>
         </url>
       </xsl:if>
     </xsl:if>
     <xsl:apply-templates/>
   </xsl:template>

</xsl:stylesheet>

With these steps completed, Forrest is now ready to generate a Google Sitemap file for your publication.
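To make the selection logic of the stylesheet easier to follow, here is a rough Python equivalent. It is a sketch only, assuming the input resembles site.xml with href values that are already relative to the site root; the real input comes from cocoon://abs-linkmap and may differ. The function name sitemap_entries and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Optional attributes that are copied into the <url> entry when present.
OPTIONAL_ATTRS = ("lastmod", "changefreq", "priority")


def sitemap_entries(linkmap_xml: str, site_uri_prefix: str) -> List[Dict[str, str]]:
    """Return one entry per element that has both a label and an href."""
    root = ET.fromstring(linkmap_xml)
    entries: List[Dict[str, str]] = []
    for node in root.iter():  # document order, root element included
        if node.get("label") is None or node.get("href") is None:
            continue
        entry = {"loc": site_uri_prefix + node.get("href")}
        for name in OPTIONAL_ATTRS:
            value = node.get(name)
            if value is not None:
                entry[name] = value
        entries.append(entry)
    return entries


SAMPLE = """<site label="ER/Box">
  <about label="Home">
    <index label="ER/Box" href="index.html"
           lastmod="2006-07-16" changefreq="weekly" priority="0.9"/>
    <credits label="Credits" href="credits.html"/>
  </about>
</site>"""

entries = sitemap_entries(SAMPLE, "http://www.erbox.org/")
for entry in entries:
    print(entry)
```

Nodes without an href, such as <about> in the sample, produce no entry, and the optional elements appear only where the corresponding attribute is set.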

9. Insert Additional Attributes in site.xml

Entries in the Google Sitemap file are generated for all nodes of your project's site.xml file ({PROJECT_HOME}\src\documentation\content\xdocs\site.xml) that have both a label and an href attribute. This should comprise exactly those nodes which make up your publication.

You may append three new attributes to each node which will be reflected in your Google Sitemap file:

  • lastmod
  • changefreq
  • priority

For the syntax and semantics of these attributes, please see the Sitemap protocol documentation.
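Since Forrest copies these attribute values into the Sitemap file unchanged, it can be worth checking them before publishing. The following Python sketch checks the three attributes against the value ranges of the Sitemap protocol. As a simplification it accepts only the date-only form YYYY-MM-DD for lastmod, although the protocol also allows full W3C date-time values; the function name check_sitemap_attrs is hypothetical.

```python
import re
from datetime import date
from typing import Dict, List

# Values the Sitemap protocol allows for <changefreq>.
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}
DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def check_sitemap_attrs(attrs: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means nothing was found."""
    problems: List[str] = []

    lastmod = attrs.get("lastmod")
    if lastmod is not None:
        match = DATE_RE.match(lastmod)
        try:
            if match is None:
                raise ValueError(lastmod)
            # date() rejects impossible dates such as 2006-02-30.
            date(*(int(part) for part in match.groups()))
        except ValueError:
            problems.append("lastmod is not a valid YYYY-MM-DD date: " + lastmod)

    changefreq = attrs.get("changefreq")
    if changefreq is not None and changefreq not in CHANGEFREQ_VALUES:
        problems.append("changefreq has an unknown value: " + changefreq)

    priority = attrs.get("priority")
    if priority is not None:
        try:
            value = float(priority)
        except ValueError:
            value = None
        if value is None or not 0.0 <= value <= 1.0:
            problems.append("priority is not between 0.0 and 1.0: " + priority)

    return problems


print(check_sitemap_attrs(
    {"lastmod": "2006-07-16", "changefreq": "weekly", "priority": "0.9"}
))
```

Attributes that are absent are not reported, which matches the fact that all three are optional.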

Hints for site.xml
1. The project's site.xml file should not be indented with tab characters; tabs produce some strange characters in the resulting Google Sitemap file.

2. The <site> tag should not carry an href attribute; otherwise a superfluous entry is generated in the Google Sitemap file (you will usually have an entry for the top-level index.html file anyway). This problem could probably be circumvented by better selection criteria in googlesitemap2document.xsl (see above).
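Both hints can be checked mechanically before running Forrest. The following Python sketch reports tab characters and an href attribute on the root element of a site.xml file. It is an illustration only: the function name lint_site_xml is hypothetical, and the check assumes that the root element of the file is the <site> tag.

```python
import xml.etree.ElementTree as ET
from typing import List


def lint_site_xml(text: str) -> List[str]:
    """Return warnings for the two site.xml pitfalls described in the hints."""
    warnings: List[str] = []

    # Hint 1: tab characters anywhere in the file.
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            warnings.append("line %d: contains a tab character" % number)

    # Hint 2: an href attribute on the root element (assumed to be <site>).
    root = ET.fromstring(text)
    if root.get("href") is not None:
        warnings.append("root element has an href attribute")

    return warnings


SAMPLE = (
    '<site label="ER/Box" href="">\n'
    '\t<index label="Home" href="index.html"/>\n'
    '</site>'
)

for warning in lint_site_xml(SAMPLE):
    print(warning)
```

To check a real file, read its contents and pass them to lint_site_xml.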

Here's a snippet of the site.xml file of www.erbox.org:

  <about label="Home">
    <index label="ER/Box" href="index.html" description="ER/Box homepage"
                            lastmod="2006-07-16" changefreq="weekly" priority="0.9"/>
    <credits label="Credits" href="credits.html" description="Credits"
                            lastmod="2006-07-16" changefreq="monthly" priority="0.3"/>
    <news label="News" href="newsfeed.html" description="ER/Box Project News"
                            lastmod="2006-07-16" changefreq="daily" priority="0.7"/>
  </about>

  <projectHome label="Project" tab="project">
    <index label="Overview" href="project/index.html" description="Project index page"
                            lastmod="2006-07-16" changefreq="weekly" priority="0.6"/>
    <sfProjectSummary label="SF Project Statistics" href="sfProjectStatsShort.html"
                            description="Sourceforge Project Summary (including basic stats)"
                            lastmod="2006-07-16" changefreq="weekly" priority="0.5"/>
  </projectHome>

  <documentationHome label="Documentation" tab="documentation">
    <index label="Introduction" href="documentation/index.html" description="Introduction"
                            lastmod="2006-07-16" changefreq="monthly" priority="0.4"/>
    <install label="Installation" description="Installation">

and this is the corresponding output in the generated Google Sitemap file (line breaks added for readability):

<?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
   
     
  
    <url><loc>http://www.erbox.org/index.html</loc><lastmod>2006-07-16</lastmod>
                         <changefreq>weekly</changefreq><priority>0.9</priority></url>
      <url><loc>http://www.erbox.org/credits.html</loc><lastmod>2006-07-16</lastmod>
                           <changefreq>monthly</changefreq><priority>0.3</priority></url>
      <url><loc>http://www.erbox.org/newsfeed.html</loc><lastmod>2006-07-16</lastmod>
                           <changefreq>daily</changefreq><priority>0.7</priority></url>
  

  
    <url><loc>http://www.erbox.org/project/index.html</loc><lastmod>2006-07-16</lastmod>
                         <changefreq>weekly</changefreq><priority>0.6</priority></url>
    <url><loc>http://www.erbox.org/sfProjectStatsShort.html</loc><lastmod>2006-07-16</lastmod>
                         <changefreq>weekly</changefreq><priority>0.5</priority></url>
  

  
      <url><loc>http://www.erbox.org/documentation/index.html</loc>
                           <lastmod>2006-07-16</lastmod><changefreq>monthly</changefreq>
                           <priority>0.4</priority></url>
      
        <url><loc>http://www.erbox.org/documentation/install.html</loc>
                             <lastmod>2006-07-16</lastmod><changefreq>weekly</changefreq>
                             <priority>0.4</priority></url>
        <url><loc>http://www.erbox.org/documentation/installWin.html</loc>
                             <lastmod>2006-07-16</lastmod><changefreq>weekly</changefreq>
                             <priority>0.5</priority></url>

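The extra empty lines presumably stem from whitespace between the nodes of the input, which the stylesheet passes through to the output. They could probably be avoided with a smarter XSLT stylesheet (googlesitemap2document.xsl; see above).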
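The empty lines are whitespace between elements and should not matter to an XML parser. As a quick sanity check of a generated file, the following Python sketch extracts all <loc> values and lists those that do not start with the expected prefix. The function names read_locs and locs_outside_prefix are hypothetical, and the sample is a shortened version of the output shown above.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the generated file.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def read_locs(sitemap_xml: bytes) -> List[str]:
    """Return the text of every <loc> element in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    tag = "{%s}loc" % SITEMAP_NS
    return [loc.text.strip() for loc in root.iter(tag) if loc.text]


def locs_outside_prefix(locs: List[str], prefix: str) -> List[str]:
    """Return the URLs that do not start with the given site-uri-prefix."""
    return [loc for loc in locs if not loc.startswith(prefix)]


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">


  <url><loc>http://www.erbox.org/index.html</loc><lastmod>2006-07-16</lastmod></url>

  <url><loc>http://www.erbox.org/credits.html</loc></url>
</urlset>"""

locs = read_locs(SAMPLE)
print(locs)
print(locs_outside_prefix(locs, "http://www.erbox.org/"))
```

An empty second list means that every entry carries the configured project.site-uri-prefix; entries that still start with the default http://localhost/ would show up there.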