Lately, Ajax is finding more and more use in modern web applications. However, for websites that load much of their content on document load using JavaScript, there is an SEO problem: the crawler only gets the HTML returned by the server request; it cannot execute the Ajax that fills in the page.
You could write a solution with two different versions of the same page, presenting one to the crawler and one to actual users, but this has severe handicaps in terms of SEO. The reason is that you really want to present only one link to both the user and the crawler, since SEO is far more about link building and off-page credibility than it is about on-page content.
With the solution I present here, you can take any URL on your site and make it Ajax-crawlable without having to write code on that page, or with that page in mind. As a generic, site-wide solution it is ready to go without spending mental cycles thinking about it on a per-page basis.
A full understanding of what is required to achieve the optimal solution can be gained from the following links:
Google Ajax-Crawling Understanding
Google Ajax-Crawling Full Specification
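In short, per the specification above: for URLs that use a #! hash fragment, the crawler re-requests the page with the fragment passed in an _escaped_fragment_ query parameter; for pages without hash fragments, adding <meta name="fragment" content="!"> to the page head tells the crawler to re-request the page as http://www.example.com/page?_escaped_fragment_= (host and path illustrative). The server is expected to answer that request with an HTML snapshot of the fully rendered page, which is exactly what the steps below generate and serve.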
Solution for Java Sites
Our solution has the following steps:
- Create an adapter that converts a request with Ajax to HTML using HTMLUnit
- A publisher that writes the HTML to cache/storage
- A filter that distinguishes crawl requests and uses the cache if available, or generates it
STEP 1 : Request to HTML using HTMLUnit
import org.apache.commons.collections.Transformer; // Single-method transform(Object) interface

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebClientOptions;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/**
 * An adapter that takes in the URL address as a String and returns the HTML String as output using HTMLUnit.
 *
 * Note: A call to this adapter is synchronous and thread blocking.
 *
 * @author Arjun Dhar
 */
public class AjaxUrlToHTMLTransformer implements Transformer {

    private static Logger log = LoggerFactory.getLogger(AjaxUrlToHTMLTransformer.class);

    /**
     * Time to wait for background JavaScript, in milliseconds (despite the field name).
     * {@link WebClient#waitForBackgroundJavaScript(long)}
     * @default 15000
     */
    private int javaScriptWaitSecs = 15000;

    /**
     * {@link BrowserVersion}
     * @default {@link BrowserVersion#FIREFOX_24}
     */
    private BrowserVersion browser = BrowserVersion.FIREFOX_24;

    /**
     * Connect to servers that have any SSL certificate.
     * @see WebClientOptions#setUseInsecureSSL(boolean)
     */
    private boolean supportInsecureSSL = true;

    /**
     * If false, JavaScript errors are ignored.
     * @default false
     */
    private boolean haltOnJSError = false;

    private static final SilentCssErrorHandler cssErrhandler = new SilentCssErrorHandler();

    @Override
    public Object transform(Object input) {
        if (input == null) {
            return null;
        }
        final WebClient webClient = new WebClient(browser);
        WebClientOptions options = webClient.getOptions();
        options.setJavaScriptEnabled(true);
        options.setThrowExceptionOnScriptError(haltOnJSError);
        options.setUseInsecureSSL(supportInsecureSSL);

        // Optimizations
        //options.setPopupBlockerEnabled(true); // No use for popups
        options.setCssEnabled(false); // For crawling we don't care about CSS since it's going to be window-less
        webClient.setCssErrorHandler(cssErrhandler);
        options.setAppletEnabled(false);

        // The following two lines make it possible to wait for the initial JS to load
        // the content via AJAX and include it in the final HTML
        webClient.waitForBackgroundJavaScript(javaScriptWaitSecs); // Wait for the document.ready logic to fire and fetch page results via AJAX
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        try {
            final HtmlPage page = webClient.getPage(input.toString());
            final String pageAsXml = page.asXml();
            webClient.closeAllWindows();
            return pageAsXml;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    //TODO: Rest of the getters/setters for the bean
    //...
}
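For a quick sanity check, the adapter can be driven directly (the URL below is illustrative):

// Usage sketch: invoke the adapter directly; in the full solution the filter in Step 3 drives it
Transformer toHtml = new AjaxUrlToHTMLTransformer();
String html = (String) toHtml.transform("http://www.example.com/products?category=shoes");
// 'html' now holds the page as rendered after the background JavaScript/AJAX has completed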
STEP 2 : Publisher to publish resulting HTML to Storage
import java.io.FileWriter;
import java.net.URI;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;
import org.springframework.core.io.Resource;

public class Publisher implements ApplicationContextAware {

    /**
     * Base path on the FileSystem where we will store cached files.
     * Note: This is not mandatory, as you may choose any mechanism to store your cache of SEO friendly pages.
     */
    private String folderLocation;

    @Autowired
    private ApplicationContext springAppContext; // Optional; helpful to use Files as Resources when using Spring

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) { // Required by ApplicationContextAware
        this.springAppContext = applicationContext;
    }

    /**
     * It is critical to have a mapping/location resolution for how a web request translates to a path in storage.
     * You can define any algorithm suitable to you. For demonstration purposes we value the PATH & QUERY PARAMS,
     * so this implementation generates a cache file name based on those inputs.
     *
     * @param webPath as String. This API takes the literal value of the webPath, so it is recommended
     *        the webPath be cleaned of protocol, host and port before it is passed, if you want a more
     *        generic page-level match. It expects the decoded version of the String.
     */
    public URI getLocation(String webPath, String facetNameId) throws Exception {
        String relativePath = webPath.replaceFirst("\\.[a-zA-Z0-9]+(\\?|$)", "")
                                     .replaceAll("(http(s)?)|\\?|\\&|\\.|\\:|=|/", "-")
                                     .replaceAll("(-)+", "-")
                                     .replaceFirst("^-|;(?i:jsessionid)=(.)*$", "")
                              + ((facetNameId != null) ? "_" + facetNameId : "");
        String fileName = cleanChars(relativePath); // Some pseudo method that cleans out special chars etc. to give a legal file name
        return new URI("file:///" + fileName + ".html");
    }

    public void publish(String html, String webPath) throws Exception {
        URI publishFilePath = getLocation(webPath, null);
        String outputPath = publishFilePath.toString();
        if (springAppContext != null) {
            Resource res = springAppContext.getResource(outputPath);
            outputPath = res.getFile().getAbsolutePath();
        }
        FileWriter fileWriter = new FileWriter(outputPath);
        fileWriter.write(html);
        fileWriter.flush();
        fileWriter.close();
    }

    /**
     * Fetch the content from a given URI defined by getLocation
     */
    public String getContent(URI uri) {
        //TODO: Your implementation of fetching the file contents from the URI. Standard stuff...
        return null; // Placeholder; see the sketch below
    }
}
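getContent is left open above; a minimal sketch using standard java.nio, assuming the file: URIs produced by getLocation and returning null when no snapshot exists yet, could be:

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical body for Publisher#getContent, assuming file: URIs from getLocation
public String getContent(URI uri) {
    try {
        Path path = Paths.get(uri);
        if (!Files.exists(path)) {
            return null; // No cached snapshot yet; the caller can generate one
        }
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}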
STEP 3 : Filter to present HTMLUnit generated pages
/**
 * If the application uses Ajax pages that generate HTML via JavaScript and we want to make those pages crawler friendly
 * as specified in the Google specs, then we use this filter so that the pages generated by {@link Publisher} can be
 * served directly by the filter when the Crawler requests them.
 *
 * Filter Init Params:
 * - htmlPublisherBeanId - String Bean Id of the {@link Publisher}; used to generate a fresh page & publish it if no published version is available
 */
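The filter in this example is Wicket-specific (the com.neurosys.seo.ajax.wicket.filters.SEOAjaxGenPagesWicketFilter wired into the web.xml below), and its full source is not reproduced here. As a rough, framework-neutral sketch of the same idea, assuming the Publisher and AjaxUrlToHTMLTransformer above (the class name CrawlFilter and the URL-handling details are illustrative, not the original code):

import java.io.IOException;
import java.net.URI;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import org.springframework.context.ApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;

// Hypothetical servlet filter: serve the cached snapshot to crawlers, or generate and publish one
public class CrawlFilter implements Filter {

    private Publisher publisher; // Resolved from Spring via the htmlPublisherBeanId init-param
    private AjaxUrlToHTMLTransformer transformer = new AjaxUrlToHTMLTransformer();

    @Override
    public void init(FilterConfig config) throws ServletException {
        ApplicationContext ctx = WebApplicationContextUtils
                .getRequiredWebApplicationContext(config.getServletContext());
        publisher = (Publisher) ctx.getBean(config.getInitParameter("htmlPublisherBeanId"));
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        // Per the Google scheme, crawler requests carry the _escaped_fragment_ query parameter
        if (request.getParameter("_escaped_fragment_") == null) {
            chain.doFilter(req, res); // Normal user: let the Ajax page render as usual
            return;
        }
        try {
            String webPath = request.getRequestURI();
            URI location = publisher.getLocation(webPath, null);
            String html = publisher.getContent(location);
            if (html == null) {
                // No snapshot cached yet: render the real page with HTMLUnit and publish it
                html = (String) transformer.transform(request.getRequestURL().toString());
                publisher.publish(html, webPath);
            }
            res.setContentType("text/html");
            res.getWriter().write(html);
        } catch (Exception e) {
            throw new ServletException(e);
        }
    }

    @Override
    public void destroy() { }
}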
Configurations
Our example uses Spring, so you can wire the Publisher bean via Spring or instantiate it however you like. In addition, the following Maven dependency and web.xml configuration are needed.

Maven Dependency for HTMLUnit
Please note, you may get some exceptions while running this if there are dependency conflicts with your other libraries. A very common conflict is a result of nekoHTML.

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.15</version>
</dependency>
Add the following exclusion to the applicable library. I use the OWASP toolkit, and HTMLUnit's nekoHTML typically conflicts with it, so I add the following exclusion to the OWASP dependency.
<exclusion>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
</exclusion>
Configure your web.xml to accept the Filter
<context-param>
    <param-name>contextConfigLocation</param-name>
    <param-value>classpath:applicationContext.xml</param-value>
</context-param>
...
<listener>
    <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
</listener>

<filter>
    <filter-name>seo.ajax.filter</filter-name>
    <filter-class>com.neurosys.seo.ajax.wicket.filters.SEOAjaxGenPagesWicketFilter</filter-class>
    <init-param>
        <param-name>contentExtractorBeanId</param-name>
        <param-value>seoResourceContentExtractor</param-value>
    </init-param>
    <init-param>
        <param-name>htmlPublisherListenerBeanId</param-name>
        <param-value>webToCrawlFrndlyPublisher</param-value>
    </init-param>
</filter>
...
<filter-mapping>
    <filter-name>seo.ajax.filter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>