
Sunday, June 22, 2014

Ajax Crawling using HTMLUnit

Lately, Ajax is finding more and more use in modern web applications. However, for websites that load a lot of their content "on document load" using JavaScript, there is a problem related to SEO: the crawler only gets the HTML returned by the server request; it cannot execute the Ajax calls that build the rest of the page.

You could maintain two different versions of the same page and present one to the crawler and one for actual use, but this approach has severe handicaps in terms of SEO, because you really want to present only one link to both the user and the crawler. As we know, SEO is far more about link building and off-page credibility than it is about on-page content.
With the solution I present here, you can take any URL on your site and convert it into an Ajax-crawlable URL without having to write code on the page, or with that page in mind. It is a generic, site-wide solution that is ready to go without spending mental cycles on it on a per-page basis.
A full understanding of what is required to achieve the optimal solution can be gained from the following links (in short, the crawler requests the Ajax page with the _escaped_fragment_ query parameter and expects the server to return the fully rendered HTML):
Google Ajax-Crawling Understanding
Google Ajax-Crawling Full Specification

Solution for Java Sites

Our solution has the following steps:
  1. Create an Adapter that converts a Request with Ajax to HTML using HTMLUnit.
  2. A publisher that writes the HTML to cache/storage
  3. A Filter that distinguishes Crawl requests and uses the cache if available or generates it

STEP 1 : Request to HTML using HTMLUnit

/**
 * An adapter that takes in the URL address as a String and returns the HTML String as output using HTMLUnit.
 * <br/>
 * Note: A call to this adapter is synchronous and thread blocking.
 *
 * @author Arjun Dhar
 */
public class AjaxUrlToHTMLTransformer implements Transformer {

    private static Logger log = LoggerFactory.getLogger(AjaxUrlToHTMLTransformer.class);

    /**
     * {@link WebClient#waitForBackgroundJavaScript(long)}
     * @default 15000
     */
    private int javaScriptWaitSecs = 15000; // value is in milliseconds, as expected by waitForBackgroundJavaScript

    /**
     * {@link BrowserVersion}
     * @default {@link BrowserVersion#FIREFOX_24}
     */
    private BrowserVersion browser = BrowserVersion.FIREFOX_24;

    /**
     * Connect to servers that have any SSL certificate
     * @see WebClientOptions#setUseInsecureSSL(boolean)
     */
    private boolean supportInsecureSSL = true;

    /**
     * If false, will ignore JavaScript errors
     * @default false
     */
    private boolean haltOnJSError = false;

    private static final SilentCssErrorHandler cssErrhandler = new SilentCssErrorHandler();

    @Override
    public Object transform(Object input) {
        if (input == null) {
            return null;
        }
        final WebClient webClient = new WebClient(browser);
        WebClientOptions options = webClient.getOptions();
        options.setJavaScriptEnabled(true);
        options.setThrowExceptionOnScriptError(haltOnJSError);
        options.setUseInsecureSSL(supportInsecureSSL);

        // Optimizations
        //options.setPopupBlockerEnabled(true); // No use for popups
        options.setCssEnabled(false); // For crawling we don't care about CSS, since it's going to be window-less
        webClient.setCssErrorHandler(cssErrhandler);
        options.setAppletEnabled(false);

        // The following two lines make it possible to wait for the initial JS to load the products via AJAX
        // and include them in the final HTML
        webClient.waitForBackgroundJavaScript(javaScriptWaitSecs); // Wait for document.ready auto search to fire and fetch page results via AJAX
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        try {
            final HtmlPage page = webClient.getPage(input.toString());
            final String pageAsXml = page.asXml();
            webClient.closeAllWindows();
            return pageAsXml;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    //TODO: Rest of getters/setters for the bean
    //...
}
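
To make the contract concrete, here is a minimal usage sketch (the URL is only a placeholder):

AjaxUrlToHTMLTransformer transformer = new AjaxUrlToHTMLTransformer();
// Blocks until background JavaScript has (hopefully) finished, then returns the rendered markup
String renderedHtml = (String) transformer.transform("http://www.example.com/products?_escaped_fragment_=");
System.out.println(renderedHtml);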

STEP 2 : Publisher to publish resulting HTML to Storage

public class Publisher implements ApplicationContextAware {

    /**
     * Base path on the FileSystem where we will store cached files.
     * Note: This is not mandatory, as you may choose any mechanism to store your cache of SEO friendly pages.
     */
    private String folderLocation;

    @Autowired
    private ApplicationContext springAppContext; // Optional; helpful to use Files as Resources when using Spring

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) {
        this.springAppContext = applicationContext; // required by ApplicationContextAware
    }

    /**
     * It is critical to have a mapping/location resolution on how a Web-Request translates to a path in the storage.
     * You could define any algorithm suitable to you. For demonstration purposes we value the PATH & QUERY PARAMS,
     * so we have an implementation that generates a cache file name based on those inputs.
     *
     * @param webPath as String. This API will take the literal value of the webPath.
     *        For this reason it is recommended the webPath be cleaned of protocol, host and port before it is passed,
     *        if we wish a more generic page-level match. It expects the decoded version of the String.
     * @param facetNameId optional suffix appended to the generated file name (may be null)
     */
    public URI getLocation(String webPath, String facetNameId) throws Exception {
        String relativePath = webPath.replaceFirst("\\.[a-zA-Z0-9]+(\\?|$)", "")
                .replaceAll("(http(s)?)|\\?|\\&|\\.|\\:|=|/", "-")
                .replaceAll("(-)+", "-")
                .replaceFirst("^-|;(?i:jsessionid)=(.)*$", "")
                + ((facetNameId != null) ? "_" + facetNameId : "");
        String fileName = cleanChars(relativePath); // Some pseudo method that cleans out special chars etc. to give a legal file name
        return new URI("file:///" + fileName + ".html");
    }

    public void publish(String html, String webPath) throws Exception {
        URI publishFilePath = getLocation(webPath, null);
        String outputPath = publishFilePath.toString();
        if (springAppContext != null) {
            Resource res = springAppContext.getResource(outputPath);
            outputPath = res.getFile().getAbsolutePath();
        }
        FileWriter fileWriter = new FileWriter(outputPath);
        fileWriter.write(html);
        fileWriter.flush();
        fileWriter.close();
    }

    /**
     * Fetch the content from a given URI defined by getLocation
     */
    public String getContent(URI uri) {
        //TODO: Your implementation of fetching the file contents from the URI. Standard stuff...
    }
}
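
The getContent implementation is left as a TODO above. A minimal sketch, assuming the URI points to a local file written by publish(...) (uses java.io.File, java.nio.file.Files and java.nio.charset.StandardCharsets), could be:

public String getContent(URI uri) {
    try {
        // Resolve through Spring when available (mirrors publish(...)), else treat it as a plain file URI
        File file = (springAppContext != null)
                ? springAppContext.getResource(uri.toString()).getFile()
                : new File(uri);
        // A missing file simply means nothing has been published yet; the filter in Step 3 will then regenerate the page
        if (!file.exists()) {
            return null;
        }
        return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
    } catch (Exception e) {
        return null;
    }
}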

STEP 3 : Filter to present HTMLUnit generated pages

/**
 * If the application uses Ajax pages that generate HTML via JavaScript and we want to make those pages also Crawler friendly
 * as specified in the Google specs, then we use this filter so that the pages generated by {@link Publisher}
 * can be served by the filter directly when the Crawler requests them.
 * <br/>
 * Filter Init Params:
 * <ul>
 *   <li>htmlPublisherBeanId - String bean id of the {@link Publisher}; used to generate a fresh page &amp; publish it if no published version is available</li>
 * </ul>
 *
 * @author arjun_dhar
 */
public class SEOAjaxGenPagesWicketFilter extends WicketFilter {

    private static Logger log = LoggerFactory.getLogger(WicketFilter.class);

    /**
     * As per the Google specs, the crawler will convert all #! (Ajax page) URLs to URLs with the param
     * _escaped_fragment_ and expect the server to render back the pure HTML version.
     */
    public static final String GOOGLE_AJAX_CRAWL_FRAGMENT = "_escaped_fragment_=";

    public SEOAjaxGenPagesWicketFilter() throws Exception {
        super();
    }

    private boolean isSEOCrawlRequest(final HttpServletRequest request) {
        String q = request.getQueryString();
        return q != null
                ? q.contains(GOOGLE_AJAX_CRAWL_FRAGMENT) && !q.contains("seoajaxcycle=true") /* Avoid recursion */
                : false;
    }

    private transient Publisher htmlPublisher;
    private transient HtmlPublisherListener htmlPublisherListener; // exposes the Step 1 transformer; its initialization is not shown in this post (see the sketch below)

    protected final Object getSpringBean(String beanId) {
        // N.B. the init-param name used here must match the one declared for the filter in web.xml
        return WebApplicationContextUtils
                .getRequiredWebApplicationContext(getFilterConfig().getServletContext())
                .getBean(getFilterConfig().getInitParameter(beanId));
    }

    /**
     * Write the HTML to the Filter Response
     *
     * @param html as String
     * @param response as {@link HttpServletResponse}
     * @throws IOException
     */
    protected void writeToResponse(String html, HttpServletResponse response) throws IOException {
        HttpServletResponseWrapper wrapper = new HttpServletResponseWrapper(response);
        wrapper.setCharacterEncoding("UTF-8");
        wrapper.setContentType("text/html"); // V. important
        wrapper.setContentLength(html.length()); // note: length() counts chars, not UTF-8 bytes
        wrapper.getWriter().write(html);
        wrapper.getWriter().flush();
        wrapper.getWriter().close();
    }

    @Override
    @SuppressWarnings("all")
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
        if (request instanceof HttpServletRequest) {
            HttpServletRequest httpRequest = (HttpServletRequest) request;
            if (isSEOCrawlRequest(httpRequest)) {
                String html = null;
                if (htmlPublisher == null) {
                    htmlPublisher = (Publisher) getSpringBean("htmlPublisherBeanId");
                }
                String queryStr = httpRequest.getQueryString() != null
                        ? MarshalUtils.decode(httpRequest.getQueryString(), "UTF-8")
                                .replaceFirst("(\\?|\\&)" + htmlPublisher.getIgnoreQueryParam(), "") // getIgnoreQueryParam() is a Publisher property not shown in Step 2
                        : null;
                String filteredRelativePath = URLDecoder.decode(httpRequest.getRequestURI(), "UTF-8")
                        + (!StringUtils.isEmpty(queryStr) ? "?" + queryStr : "");
                try {
                    URI uri = htmlPublisher.getLocation(filteredRelativePath, null);
                    html = htmlPublisher.getContent(uri);
                    if (html == null) {
                        throw new Exception("No content found, forcing Exception");
                    }
                } catch (Exception e) {
                    log.error("[doFilter] Failed to resolve published content, will try to fetch it afresh now ...", e);
                    try {
                        html = (String) htmlPublisherListener.getAjaxToHTMLTransformer().transform(
                                httpRequest.getRequestURL() + "?"
                                + (!StringUtils.isEmpty(queryStr) ? queryStr + "&" : "")
                                + "seoajaxcycle=true" /* Avoid recursion */);
                        try {
                            htmlPublisher.publish(html, filteredRelativePath);
                        } catch (Exception e2) {
                            log.error("[doFilter] Failed to publish HTML file for path " + httpRequest.getRequestURI(), e2);
                        }
                    } catch (Exception e2) {
                        log.error("[doFilter] Failed to generate HTML content for " + httpRequest.getRequestURI()
                                + "; Giving up! Passing on to chain to process", e2);
                        super.doFilter(request, response, chain);
                        return; // avoid running the chain twice
                    }
                }
                if (html != null) {
                    writeToResponse(html, (HttpServletResponse) response);
                } else {
                    log.warn("[doFilter] Could not find a Published SEO crawler friendly page for path " + httpRequest.getRequestURI());
                    super.doFilter(request, response, chain);
                }
            } else {
                super.doFilter(request, response, chain);
            }
        } else {
            super.doFilter(request, response, chain);
        }
    }
}

Configurations

Our example uses Spring, so you can wire the Publisher bean via Spring or instantiate it however you like. In addition, the following Maven dependency and web.xml configuration are needed:
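
If you prefer annotation-based configuration to the applicationContext.xml referenced in the web.xml below, the wiring could look roughly like this sketch (bean names here are assumptions; they must line up with whatever ids your filter init-params point to):

@Configuration
public class SeoCrawlConfig {

    // Rendering adapter from Step 1
    @Bean
    public AjaxUrlToHTMLTransformer ajaxUrlToHTMLTransformer() {
        return new AjaxUrlToHTMLTransformer();
    }

    // Publisher from Step 2; the bean name should match the id the filter looks up via its init-param
    @Bean(name = "webToCrawlFrndlyPublisher")
    public Publisher webToCrawlFrndlyPublisher() {
        return new Publisher();
    }
}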

Maven Dependency for HTMLUnit

...
  <dependency>
   <groupId>net.sourceforge.htmlunit</groupId>
   <artifactId>htmlunit</artifactId>
   <version>2.15</version>
  </dependency>
...
Please note that you may get some exceptions while running this if there are dependency conflicts with your other libraries. A very common conflict is caused by nekoHTML.
Add the following exclusion to whichever library pulls in the conflicting version. I use the OWASP toolkit, which typically conflicts with this, so I add the exclusion to the OWASP library.
   <exclusions>
     <exclusion>
      <groupId>net.sourceforge.nekohtml</groupId>
      <artifactId>nekohtml</artifactId>
     </exclusion>
   </exclusions>

Configure your web.xml to accept the Filter

 <context-param>
   <param-name>contextConfigLocation</param-name>
   <param-value>classpath:applicationContext.xml</param-value>
 </context-param>
 <listener>
   <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
 </listener>
 ...
 <filter>
   <filter-name>seo.ajax.filter</filter-name>
   <filter-class>com.neurosys.seo.ajax.wicket.filters.SEOAjaxGenPagesWicketFilter</filter-class>
   <init-param>
     <param-name>contentExtractorBeanId</param-name>
     <param-value>seoResourceContentExtractor</param-value>
   </init-param>
   <init-param>
     <param-name>htmlPublisherListenerBeanId</param-name>
     <param-value>webToCrawlFrndlyPublisher</param-value>
   </init-param>
 </filter>
 <filter-mapping>
   <filter-name>seo.ajax.filter</filter-name>
   <url-pattern>/*</url-pattern>
 </filter-mapping>
...

