Thursday, July 10, 2014

Learning from Brazil's Defeat

Reference to Context

For those who followed the World Cup, Brazil going down to Germany by 6 goals must have been a shock. However, what many don't know is that a Goldman Sachs team of number crunchers had twice predicted that Brazil would win the World Cup. Making this prediction worse, they did not just get the outcome wrong: many of the teams they had earlier predicted to advance did not even make it past the 2nd round. Surely a thrashing like this would have been predictable if the system had any credibility.

Link to their earlier prediction: http://www.goldmansachs.com/our-thinking/outlook/world-cup-sections/world-cup-book-2014-statistical-model.html
Link to the revised prediction: http://www.goldmansachs.com/our-thinking/outlook/world-cup-sections/world-cup-prediction-model-update-6-26-2014.html

Inference

I have always believed that using statistics to predict where you may fail is easier than predicting where you may succeed.
One should always listen to numbers & reason, but where do you draw the line?
To me, what becomes clearer is that failure is easier to predict using mathematical models, but success is not a numbers thing. Human will and intuition are more powerful tools.

People may say, well, isn't it two sides of the same coin? It's not! That's because there are many faces to the coin, and the outcome of one is not a simple inverse of the other.

Risks/failures can be assessed better by probability, I feel. This is because failure only requires the first breaking point: if there are multiple related events, failure is the result of ANY one event failing. Think of a for { } loop with a break on the first fail condition.
This is easier to approximate and calculate, and hence to trust (for me). Success, on the other hand, is NOT impacted by failure in the same way, and neither related nor unrelated successes guarantee further success. Success is harder to prove or trust!
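
The loop analogy can be written down directly. The sketch below is only an illustration: the check names and probabilities are made up, and the probability formula assumes the events are independent, which real-world events rarely are.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstFailure {

    /** Returns the name of the first check that fails, or null if every check passes. */
    static String firstFailure(Map<String, Boolean> checks) {
        for (Map.Entry<String, Boolean> check : checks.entrySet()) {
            if (!check.getValue()) {
                return check.getKey(); // break on the first fail condition
            }
        }
        return null;
    }

    /** P(at least one failure) = 1 - product of (1 - p_i), assuming independent events. */
    static double probabilityOfAnyFailure(double... failureProbabilities) {
        double allPass = 1.0;
        for (double p : failureProbabilities) {
            allPass *= (1.0 - p);
        }
        return 1.0 - allPass;
    }

    public static void main(String[] args) {
        // Hypothetical checks, in the order they are evaluated
        Map<String, Boolean> checks = new LinkedHashMap<>();
        checks.put("check A", true);
        checks.put("check B", false);
        checks.put("check C", true);

        System.out.println("First failure: " + firstFailure(checks));
        System.out.println("P(any failure): " + probabilityOfAnyFailure(0.1, 0.2, 0.3));
    }
}
```

Even with modest individual risks, the chance that at least one event fails grows quickly as events are added, which is why the failure side is the easier one to estimate.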

Even in website analytics, I have observed that you can predict why people are NOT buying something better than why they WOULD be buying it. Luckily, in E-Commerce, consumers are more interested in their reasons for failure, so they can improve their conversions, than in what's already working for them.

Summary

Arguably, you can be surer of what WON'T work than of what WILL work. Though one should back up all facts with numbers, as far as predictions go I'd be careful not to guarantee or trust predictions on the basis of numbers alone.


Sunday, June 22, 2014

Ajax Crawling using HTMLUnit

Lately, Ajax is finding more and more use in modern web applications. However, for websites that load a lot of their content on document load using JavaScript, there is a problem related to SEO: the crawler only gets the HTML returned by the server request; it cannot directly execute Ajax.

You could write a solution where you have two different versions of the same page and present one to the crawler and one for actual use; however, this has severe handicaps in terms of SEO crawling. The reason is that you really want to present only 1 link to both the user & the crawler, since SEO is more about link building and off-page credibility than it is about on-page content.
With the solution I present here, you can take any URL on your site and convert it to an Ajax-crawlable URL without having to code on the page or with that page in mind. As a generic site-wide solution, it's ready to go without spending mental cycles on it on a per-page basis.
A full understanding of what is required to achieve the optimal solution can be gained from the following links:
Google Ajax-Crawling Understanding
Google Ajax-Crawling Full Specification
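
As a rough illustration of the URL convention in that specification, the sketch below rewrites a "#!" URL into the "_escaped_fragment_" form that the crawler requests from the server. The class name, method names and example.com URLs are hypothetical, and the escaping is deliberately simplified (it only handles the characters that would obviously break a query string); the specification remains the authority on the exact character set.

```java
public class EscapedFragmentUrls {

    static final String HASH_BANG = "#!";
    static final String PARAM = "_escaped_fragment_=";

    /** Rewrites a "#!" URL into the form a crawler would request; other URLs are returned unchanged. */
    static String toCrawlerUrl(String prettyUrl) {
        int idx = prettyUrl.indexOf(HASH_BANG);
        if (idx < 0) {
            return prettyUrl;
        }
        String base = prettyUrl.substring(0, idx);
        String fragment = prettyUrl.substring(idx + HASH_BANG.length());
        // Append to an existing query string if there is one
        String separator = base.contains("?") ? "&" : "?";
        return base + separator + PARAM + escape(fragment);
    }

    /** Simplified escaping (an assumption of this sketch, not the full rule set of the spec). */
    static String escape(String fragment) {
        StringBuilder sb = new StringBuilder();
        for (char c : fragment.toCharArray()) {
            if (c == '%' || c == '&' || c == '#' || c == '+' || c <= ' ') {
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCrawlerUrl("http://example.com/shop#!category=shoes&page=2"));
        System.out.println(toCrawlerUrl("http://example.com/shop?lang=en#!page=2"));
    }
}
```

The server never sees the part after "#", which is why the crawler has to move it into a query parameter before the server can respond with pre-rendered HTML.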

Solution for Java Sites

Our solution has the following steps:
  1. Create an Adapter that converts a Request with Ajax to HTML using HTMLUnit.
  2. A publisher that writes the HTML to cache/storage
  3. A Filter that distinguishes Crawl requests and uses the cache if available or generates it
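
Before the real classes, here is a minimal sketch of how the three steps fit together. It is not the implementation that follows: the Renderer interface stands in for the HTMLUnit adapter, an in-memory map stands in for the published files, and all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CrawlSnapshotFlow {

    /** Stand-in for step 1: turns a URL into fully rendered HTML. */
    interface Renderer {
        String render(String url);
    }

    // Stand-in for step 2: the real publisher writes to storage instead of memory
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Renderer renderer;

    CrawlSnapshotFlow(Renderer renderer) {
        this.renderer = renderer;
    }

    /**
     * Step 3: for crawl requests, serve the cached HTML or generate and cache it.
     * Returns null for normal requests, meaning "let the application handle it".
     */
    String handle(String url, boolean crawlRequest) {
        if (!crawlRequest) {
            return null;
        }
        String html = cache.get(url);
        if (html == null) {
            html = renderer.render(url);
            cache.put(url, html);
        }
        return html;
    }
}
```

The expensive rendering step therefore runs once per URL, and every later crawl request for that URL is answered from the cache.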

STEP 1 : Request to HTML using HTMLUnit

import org.apache.commons.collections.Transformer; // assumed: the Transformer interface from Apache Commons Collections
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebClientOptions;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/**
 * An adapter that takes in the URL address as a String and returns the HTML String as output using HTMLUnit.
 *
 * Note: A call to this adapter is synchronous and thread blocking.
 *
 * @author Arjun Dhar
 */
public class AjaxUrlToHTMLTransformer implements Transformer {
    private static Logger log = LoggerFactory.getLogger(AjaxUrlToHTMLTransformer.class);

    /**
     * Passed to {@link WebClient#waitForBackgroundJavaScript(long)}; the value is in milliseconds.
     * @default 15000
     */
    private int javaScriptWaitSecs = 15000;

    /**
     * {@link BrowserVersion}
     * @default {@link BrowserVersion#FIREFOX_24}
     */
    private BrowserVersion browser = BrowserVersion.FIREFOX_24;

    /**
     * Connect to servers that have any SSL certificate
     * @see WebClientOptions#setUseInsecureSSL(boolean)
     */
    private boolean supportInsecureSSL = true;

    /**
     * If false will ignore JavaScript errors
     * @default false
     */
    private boolean haltOnJSError = false;

    private static final SilentCssErrorHandler cssErrhandler = new SilentCssErrorHandler();

    @Override
    public Object transform(Object input) {
        if (input == null) {
            return null;
        }

        final WebClient webClient = new WebClient(browser);
        WebClientOptions options = webClient.getOptions();
        options.setJavaScriptEnabled(true);
        options.setThrowExceptionOnScriptError(haltOnJSError);
        options.setUseInsecureSSL(supportInsecureSSL);

        // Optimizations
        //options.setPopupBlockerEnabled(true); // No use for popups
        options.setCssEnabled(false); // For crawling we don't care about CSS since it's going to be window-less
        webClient.setCssErrorHandler(cssErrhandler);
        options.setAppletEnabled(false);

        // Re-synchronize Ajax calls so they complete as part of loading the page
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        try {
            final HtmlPage page = webClient.getPage(input.toString());
            // Wait for the document.ready auto search to fire and fetch the page results via AJAX, so they are
            // included in the final HTML. The wait has to come after the page load to have any effect.
            webClient.waitForBackgroundJavaScript(javaScriptWaitSecs);
            return page.asXml();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            webClient.closeAllWindows(); // release the client even when loading fails
        }
    }

    // TODO: Rest of the getters and setters for the bean
    // ...
}

STEP 2 : Publisher to publish resulting HTML to Storage

import java.io.File;
import java.io.FileWriter;
import java.net.URI;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.context.ApplicationContextAware;
import org.springframework.core.io.Resource;

public class Publisher implements ApplicationContextAware {

    /**
     * Base path on the file system where we will store cached files.
     * Note: This is not mandatory, as you may choose any mechanism to store your cache of SEO friendly pages.
     */
    private String folderLocation;

    @Autowired private ApplicationContext springAppContext; // Optional. Helpful to use files as Resources when using Spring

    /** Required by {@link ApplicationContextAware}. */
    @Override
    public void setApplicationContext(ApplicationContext applicationContext) {
        this.springAppContext = applicationContext;
    }

    /**
     * It is critical to have a mapping/location resolution on how a web request translates to a path in the storage.
     * You could define any algorithm suitable to you. For demonstration purposes we value the PATH & QUERY PARAMS,
     * so this implementation generates a cache file name based on those inputs.
     *
     * @param webPath as String. This API takes the webPath literally. For this reason it is recommended that the
     *        webPath be cleaned of protocol, host and port before it is passed, if we wish a more generic
     *        page level match. The decoded version of the String is expected.
     */
    public URI getLocation(String webPath, String facetNameId) throws Exception {
        String relativePath = webPath
                .replaceFirst("\\.[a-zA-Z0-9]+(\\?|$)", "")
                .replaceAll("(http(s)?)|\\?|\\&|\\.|\\:|=|/", "-")
                .replaceAll("(-)+", "-")
                .replaceFirst("^-|;(?i:jsessionid)=(.)*$", "")
                + ((facetNameId != null) ? "_" + facetNameId : "");
        String fileName = cleanChars(relativePath); // Some pseudo method that cleans out special chars etc. to give a legal file name
        return new URI("file:///" + fileName + ".html");
    }

    public void publish(String html, String webPath) throws Exception {
        URI publishFilePath = getLocation(webPath, null);
        String outputPath;
        if (springAppContext != null) {
            Resource res = springAppContext.getResource(publishFilePath.toString());
            outputPath = res.getFile().getAbsolutePath();
        } else {
            outputPath = new File(publishFilePath).getAbsolutePath();
        }

        // try-with-resources flushes and closes the writer even if the write fails
        try (FileWriter fileWriter = new FileWriter(outputPath)) {
            fileWriter.write(html);
        }
    }

    /**
     * Fetch the content from a given URI defined by getLocation
     */
    public String getContent(URI uri) {
        // TODO: Your implementation of fetching the file contents from the URI. Standard stuff...
        return null; // placeholder; the filter in Step 3 treats null as "no published content"
    }
}
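
To see what the getLocation mapping produces, the sketch below applies the same chain of regular expressions to a couple of sample paths. The class name and the sample paths are made up for illustration, and the cleanChars step is left out because it is only a pseudo method in the original.

```java
public class CacheFileNames {

    /** Same replacement chain as Publisher.getLocation, without the facet suffix and cleanChars. */
    static String toCacheName(String webPath) {
        return webPath
                .replaceFirst("\\.[a-zA-Z0-9]+(\\?|$)", "")        // drop the extension (and a "?" right after it)
                .replaceAll("(http(s)?)|\\?|\\&|\\.|\\:|=|/", "-") // URL punctuation becomes "-"
                .replaceAll("(-)+", "-")                           // collapse runs of "-"
                .replaceFirst("^-|;(?i:jsessionid)=(.)*$", "");    // strip a leading "-" or a jsessionid suffix
    }

    public static void main(String[] args) {
        System.out.println(toCacheName("/products/list.html?cat=shoes&page=2")); // products-listcat-shoes-page-2
        System.out.println(toCacheName("/index.html"));                          // index
        System.out.println(toCacheName("/about"));                               // about
    }
}
```

Note that the first replacement also consumes the "?" that follows the extension, so "list" and "cat" run together in the generated name. That is harmless as long as the same function is used for both publishing and lookup.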

STEP 3 : Filter to present HTMLUnit generated pages

import java.io.IOException;
import java.net.URI;
import java.net.URLDecoder;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

import org.apache.commons.lang.StringUtils; // assumed: any StringUtils with isEmpty(String) will do
import org.apache.wicket.protocol.http.WicketFilter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.context.support.WebApplicationContextUtils;

/**
 * If the application uses Ajax pages that generate HTML via JavaScript and we want to make the pages also crawler friendly
 * as specified in the Google specs, then we use this filter so that the pages generated by {@link Publisher}
 * can be served by the filter directly when the crawler requests them.
 *
 * Filter Init Params:
 *   - htmlPublisherBeanId - String bean id of the {@link Publisher}; used to generate a fresh page & publish it
 *     if no published version is available
 *
 * @author arjun_dhar
 */
public class SEOAjaxGenPagesWicketFilter extends WicketFilter {
    private static Logger log = LoggerFactory.getLogger(SEOAjaxGenPagesWicketFilter.class);

    /**
     * As per the Google specs, the crawler will convert all #! (Ajax pages) to URLs with the param
     * _escaped_fragment_ and expect the server to render back the pure HTML version.
     */
    public static final String GOOGLE_AJAX_CRAWL_FRAGMENT = "_escaped_fragment_=";

    private transient Publisher htmlPublisher;

    public SEOAjaxGenPagesWicketFilter() throws Exception {
        super();
    }

    private boolean isSEOCrawlRequest(final HttpServletRequest request) {
        String q = request.getQueryString();
        return q != null
                && q.contains(GOOGLE_AJAX_CRAWL_FRAGMENT)
                && !q.contains("seoajaxcycle=true"); // Avoid recursion
    }

    protected final Object getSpringBean(String beanId) {
        return WebApplicationContextUtils
                .getRequiredWebApplicationContext(getFilterConfig().getServletContext())
                .getBean(getFilterConfig().getInitParameter(beanId));
    }

    /**
     * Write the HTML to the filter response
     *
     * @param html as String
     * @param response as {@link HttpServletResponse}
     * @throws IOException
     */
    protected void writeToResponse(String html, HttpServletResponse response) throws IOException {
        HttpServletResponseWrapper wrapper = new HttpServletResponseWrapper(response);
        wrapper.setCharacterEncoding("UTF-8");
        wrapper.setContentType("text/html"); // V.Important
        wrapper.setContentLength(html.getBytes("UTF-8").length); // length in bytes, not characters
        wrapper.getWriter().write(html);
        wrapper.getWriter().flush();
        wrapper.getWriter().close();
    }

    @Override
    @SuppressWarnings("all")
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (!(request instanceof HttpServletRequest) || !isSEOCrawlRequest((HttpServletRequest) request)) {
            super.doFilter(request, response, chain);
            return;
        }

        HttpServletRequest httpRequest = (HttpServletRequest) request;
        if (htmlPublisher == null) {
            htmlPublisher = (Publisher) getSpringBean("htmlPublisherBeanId");
        }

        String queryStr = httpRequest.getQueryString() != null
                ? URLDecoder.decode(httpRequest.getQueryString(), "UTF-8")
                        .replaceFirst("(\\?|\\&)" + htmlPublisher.getIgnoreQueryParam(), "")
                : null;
        String filteredRelativePath = URLDecoder.decode(httpRequest.getRequestURI(), "UTF-8")
                + (!StringUtils.isEmpty(queryStr) ? "?" + queryStr : "");

        String html = null;
        try {
            URI uri = htmlPublisher.getLocation(filteredRelativePath, null);
            html = htmlPublisher.getContent(uri);
            if (html == null) {
                throw new Exception("No content found, forcing Exception");
            }
        } catch (Exception e) {
            log.error("[doFilter] Failed to resolve published content, will try to fetch it afresh now ...", e);
            try {
                html = (String) htmlPublisher.getAjaxToHTMLTransformer().transform(
                        httpRequest.getRequestURL() + "?"
                        + (!StringUtils.isEmpty(queryStr) ? queryStr + "&" : "")
                        + "seoajaxcycle=true" /* Avoid recursion */);
                try {
                    htmlPublisher.publish(html, filteredRelativePath);
                } catch (Exception e2) {
                    log.error("[doFilter] Failed to publish HTML file for path " + httpRequest.getRequestURI(), e2);
                }
            } catch (Exception e2) {
                // html stays null, so the request is passed on to the chain exactly once below
                log.error("[doFilter] Failed to generate HTML content for " + httpRequest.getRequestURI()
                        + "; Giving up! Passing on to chain to process", e2);
            }
        }

        if (html != null) {
            writeToResponse(html, (HttpServletResponse) response);
        } else {
            log.warn("[doFilter] Could not find a Published SEO crawler friendly page for path " + httpRequest.getRequestURI());
            super.doFilter(request, response, chain);
        }
    }
}
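
The "seoajaxcycle=true" marker deserves a note: when the filter asks HTMLUnit to render the page, HTMLUnit requests the same URL through the same filter, so without a marker the filter would call itself forever. The sketch below isolates that check on plain strings so it can be tried without a servlet container; the class and method names are hypothetical, while the two constants are the ones used in the filter.

```java
public class CrawlRequestCheck {

    static final String GOOGLE_AJAX_CRAWL_FRAGMENT = "_escaped_fragment_=";
    static final String RECURSION_GUARD = "seoajaxcycle=true";

    /** True only for crawler requests that did not originate from our own HTMLUnit render. */
    static boolean isSeoCrawlRequest(String queryString) {
        return queryString != null
                && queryString.contains(GOOGLE_AJAX_CRAWL_FRAGMENT)
                && !queryString.contains(RECURSION_GUARD);
    }

    /** Builds the URL handed to HTMLUnit: the original request plus the guard parameter. */
    static String withRecursionGuard(String requestUrl, String queryString) {
        String query = (queryString == null || queryString.isEmpty()) ? "" : queryString + "&";
        return requestUrl + "?" + query + RECURSION_GUARD;
    }

    public static void main(String[] args) {
        String crawlerQuery = "_escaped_fragment_=page=2";
        System.out.println(isSeoCrawlRequest(crawlerQuery));                            // true
        System.out.println(withRecursionGuard("http://example.com/shop", crawlerQuery));
        System.out.println(isSeoCrawlRequest(crawlerQuery + "&" + RECURSION_GUARD));    // false
    }
}
```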

Configurations

Our example uses Spring, so you can wire the Publisher bean via Spring or instantiate it however you like. In addition, the following Maven dependency and web.xml configs are needed:

Maven Dependency for HTMLUnit

...
  <dependency>
   <groupId>net.sourceforge.htmlunit</groupId>
   <artifactId>htmlunit</artifactId>
   <version>2.15</version>
  </dependency>
...
Please note, you may get some exceptions while running this if there are dependency conflicts with your other libraries. A very common conflict is caused by nekoHTML.
Add the following exclusion to the library that brings it in. I use the OWASP toolkit, which typically causes this conflict, so I add the following exclusion to the OWASP library.
   
     
   <exclusions>
    <exclusion>
     <groupId>net.sourceforge.nekohtml</groupId>
     <artifactId>nekohtml</artifactId>
    </exclusion>
   </exclusions>
    
   

Configure your web.xml to accept the Filter

  
    
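    <context-param>
     <param-name>contextConfigLocation</param-name>
     <param-value>classpath:applicationContext.xml</param-value>
    </context-param>
    <listener>
     <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
    </listener>
 ...
 <filter>
  <filter-name>seo.ajax.filter</filter-name>
  <filter-class>com.neurosys.seo.ajax.wicket.filters.SEOAjaxGenPagesWicketFilter</filter-class>
  <init-param>
   <param-name>contentExtractorBeanId</param-name>
   <param-value>seoResourceContentExtractor</param-value>
  </init-param>
  <init-param>
   <!-- Must match the init parameter name the filter looks up -->
   <param-name>htmlPublisherBeanId</param-name>
   <param-value>webToCrawlFrndlyPublisher</param-value>
  </init-param>
 </filter>
 <filter-mapping>
  <filter-name>seo.ajax.filter</filter-name>
  <url-pattern>/*</url-pattern>
 </filter-mapping>
...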


Sunday, April 20, 2014

E-Commerce Life Cycle

Follow more details on this post here
E-Commerce Life Cycle

Wednesday, March 26, 2014

Java E-Commerce platform

Java for the Elite?

The past decade has seen an explosion of technologies and solutions. On one hand we have witnessed popular communities around WORDPRESS, DRUPAL, JOOMLA, MAGENTO and more; on the other hand, within the Java community, we have witnessed many advancements in the language, platform, scripting and frameworks. However, when you look closer, something odd strikes you. The Web, and mainly B2C-driven solutions, are dominated by PHP products, which should make one wonder why we don't see a single Java-based platform among them. Especially with all the advancements and tooling, one has to wonder where all these frameworks are leading.

While there are some very interesting advancements in the Java community, I prefer to call them academic rather than useful, be it the coming of Scala over Java or frameworks like Play-2. In spite of all the innovation within the language, I do feel Java developers dwell a lot on code quality (however one may choose to define it), verbosity etc. There is a certain "high" in creating frameworks rather than committing to solutions that go the distance. Let's face it, most programmers do not like the noise and the politics of the outside world. I do feel this is a cultural issue rather than an issue with any platform or language. Further, we have survived our arrogance because Java is still desired by the big boys and in the so-called "Enterprise" world.

Motivation for an E-Commerce platform

I'm aware of the CMS efforts from the Java community, as I am aware of the E-Commerce efforts. I won't name them here; it's sufficient to say that when challenged by service competitors from the PHP community, competing with the likes of Drupal, Magento etc., Java has very little to offer. For all the frameworks we have, we lack real-world answers, and our solace seems to be "Enterprise Development, Custom Development". As if we were never good enough for the average web developer on the street, or maybe the average web developer isn't good enough to be writing and extending frameworks all his life. Somewhere the elitist attitude faded for me, giving way to just writing something useful to help the guys on the front line (guys like myself): catering to everyone from small business users on a shared web server instance without even SSH access, to an enterprise on multiple nodes. One size fits all! Something that can be run by a client without needing a team of technical experts to maintain it. Something that packs in all the features needed to compete with, and do better than, the solutions offered on the PHP front line, and yet offers the platform benefits that Java has to offer.

Not that I have anything against PHP (I actually admire things about the community), but as a Java developer I do feel handicapped and at a loss for answers. Within Java we have fragmented ourselves so much that little comes out that is truly less boilerplate and more progressive. I won't mention all the Java-based CMSs/solutions I tried here, but it's sufficient to say that nothing in the Java community matches the success of DRUPAL, WORDPRESS, JOOMLA, MAGENTO, to name a few, and you need to ask yourself "why"!

What's different?

The above motivations may seem abstract. In reality, how this translates is that there are many by-products of E-Commerce, like Search, ETL, CMS etc. These not only contribute to E-Commerce but also to other, application-specific areas of development. The issue with the "framework" mindset is that you are always providing a tool and nothing more. The higher-order glue is left to the developers. That is a good thing, but in today's world these tools are not good enough. The GAP between polished services and what frameworks provide is getting larger. E-Commerce demands solutions, and a lot of these "solutions" can be shared, improved, modularized or even exposed as services. Furthermore, this results in leaner code bases, better utilization of memory, lower learning curves and lower investment in writing business-ready applications.

When I look at companies selling Enterprise versions and giving away Community editions, the latter seem just a little more than frameworks. But if you had to pitch one to a client, you fall very short, with almost no community help from the companies that provided these Open Source solutions. We all want to make money; however, what I'm against is a monopoly over what the community contributes back. This is causing a dilemma for me in terms of licensing (a separate topic).

There is also a lot of emphasis on a reduced number of entities and data models. The entire CMS, ECOM, user data etc., you name it, can be put down in 8 database tables without any loss of functionality. I don't want to get into a technical discussion yet; however, compare that number to some elephant you install. As features grow, so do their data needs. There are ramifications to the blatant abuse of entities that most platforms adopt. These are exponential and not just limited to hardware. I feel these concerns are conveniently ignored in most projects. There are many other concepts that I won't dive into yet; I will let the code and architecture do the talking. Reducing and reusing entity objects (definitions and instances) does imply drastic improvement in many ways.

Structure

There are many things that go into making a CMS, and more so an E-Commerce platform. Hence, all these services have been broken into libraries: ETL, Reporting, Searching, Content Handling Services, Admin UI, Authorization, Authentication, Carts etc. I have developed each as a standalone library so that people can use them independently. A lot of things useful in a CMS & E-Com can be leveraged in other projects as well. I feel existing CMS definitions are very single-focused and hence limited.

The idea is that people don't have to look at it as a monolithic project and can create or contribute to the parts that are of interest to them, and each module can grow as its own project.

The liability of learning a new Framework

Setting Java & PHP aside: if something is designed as a CMS, then it is only a CMS. It's hard to extend it into an E-Commerce platform and so on. This is not just about providing a Java alternative, but about providing something that offers more flexibility for developers to build with, like LEGO blocks. Spring and Apache Wicket are beautiful examples that make this possible. This also offers a lower learning curve to adopters, with no DSL or additional learning needed as such. However, as mentioned above, we have plenty of frameworks. When you put together an E-Commerce site, you realize the gaps and the mammoth effort of putting all of these together. And what you don't want is a "CUSTOM" glued solution that's a curry of all frameworks. You want solutions and services that have been tested and optimized in a domain-specific environment and validated by business users and consumers.

A frustrating point of adopting new frameworks or alternatives is the knowledge curve required. The aim of any platform should be to allow people familiar with the core frameworks to dive right in. In addition, providing an independent module for each type of service lets people choose which aspects they would like to use, customize or contribute to. Solutions demand answers, and the ability to replace frameworks over time if need be. However, there is a masochist in every good developer. For me, learning Scala was not about a better language but just about getting to learn a new alternative. This does not always imply we will be more productive, and yet when a new framework comes out, no matter how productive, we always like the fact that we can be Elite. Nothing wrong with that, but from a platform developer's perspective the idea should be to welcome as many people onboard as possible, and thus weed out core dependencies that are not standard practice. Make it simple and useful, and let people decide what works for them.

I do hope the motivation is well placed, and I look forward to hearing views. Strong opinions are welcome. I do also hope to back this up with the platform soon...

Current Status & Direction

The project has matured and achieved sufficient functionality for any small to medium enterprise to use it. However, I am now going to focus on the following. It is worth noting that I'd be looking for a community to help me grow this:
  1. Licensing - Big, tough one!
  2. Documentation
  3. Creating the user Forum
  4. Creating a sample application for the community to demonstrate base usage and ability backed by videos and tutorials
  5. Providing OOB integrations with existing popular cloud services
  6. Additional Unit and integration test automation


Site(s) developed on this platform

https://lemillindia.com
https://nurhome.in

CMS only based sites:
http://wrap.co.in

more in the pipeline ...