Search results via Ajax & JSON: Yahoo, Live deliver; Google fails it

Warning: This will be the most technical and nerdy article I’ve posted to RGR in quite a while. Those friends of mine who didn’t understand the title of this post might as well go ahead and skip the whole thing.

On Friday, my boss brought up the idea of building a system which we could use to check and track our clients’ sites’ search engine ranking performance for various keywords. He’s been coming up with a lot of these sorts of ideas lately - sometimes I wonder if he realizes he’s only hired one of me - but this idea struck me as particularly interesting. After doing some research on it, I found other tools online which do this task, but they all required payment and registration and other unpleasantness. And yet Google and Yahoo seem to offer their search results via JSON, so how difficult or expensive could this be? So I told my boss this seemed like something we could do, then went home for the weekend. Hey, quittin’ time is quittin’ time…

But after I got some work done for our clients Monday morning, I got started on it. By lunchtime, I had Yahoo working, and I got Google hammered out when I came back. I then looked into Live Search’s options and found them sufficient, so I added support for it too.

Basically, how it works is that you enter a search query in one field and a web address in another. When you submit the form, the system uses Ajax to send the query to each of the three engines and fetch their search results in JSON format. It then runs a regular expression against the web addresses in the results to see whether any of them match the address you entered, and reports which result, if any, matched. It’s pretty slick, and I’d love to release it to the public, but it was made on my boss’s time, so I don’t know if he’d be cool with that…
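The matching step might be sketched like this. This is not the code from the actual tool; the function names (escapeRegExp, buildMatcher, findRank) are made up for illustration, and the rule for what counts as a match (same host, with or without a leading "www.") is an assumption.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a case-insensitive pattern for a site address such as "example.com".
// Assumption: a result matches if its host is the target host, with or
// without a leading "www.", over http or https.
function buildMatcher(siteAddress) {
  const host = siteAddress
    .replace(/^https?:\/\//i, '') // drop the protocol, if given
    .replace(/^www\./i, '')       // drop a leading "www."
    .replace(/\/.*$/, '');        // drop any path
  return new RegExp('^https?://(www\\.)?' + escapeRegExp(host) + '(/|$)', 'i');
}

// Return the 1-based position of the first matching URL, or -1 if none match.
function findRank(urls, siteAddress) {
  const matcher = buildMatcher(siteAddress);
  const index = urls.findIndex((url) => matcher.test(url));
  return index === -1 ? -1 : index + 1;
}
```

Anchoring the pattern at the start of the URL keeps a site like notexample.com from being counted as a hit for example.com.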

Anyway, it was interesting to play with the differences between the three major engines with regards to their support for all this stuff. Long story short: Yahoo and Live Search were great to work with, but Google’s solution just ain’t cuttin’ it. Let’s go into more depth, shall we?

The cross-domain problem

I started with Yahoo since they seemed the most developer-friendly, for reasons I’ll get into later. My first problem with getting their feeds was… getting their feeds. When I tried using jQuery’s standard $.ajax() function, I got a cryptic error about Yahoo’s address being illegal or something like that. I kept checking the format of the address I was using for the request, but I couldn’t find anything wrong with it… After a bit of searching in the manual sense, I found out that this is just the browser enforcing its same-origin policy: a script on one site isn’t allowed to make Ajax requests to a “foreign” server.

It turns out there’s a workaround, though (it usually goes by the name JSONP). Instead of doing a standard Ajax call, you inject a new <script> tag into your DOM with a src attribute which requests a script from the search service via GET, with a parameter naming a callback function. The service then returns a script which calls that callback with the JSON results. In practice, it works something like this: You add

<script src="http://search.example.com/search?q=searchquery&callback=myCallback"></script>

…to your DOM, and the script that the search service returns looks like:

myCallback({/* the search results as a JS object */});

…which then gets executed. This is a Grade AAA Prime Hack, 100% Certified. But it works and is supported by all three engines equally well.
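A minimal sketch of that script-injection trick in plain JavaScript, assuming a browser environment. The helper names (buildQueryUrl, requestJsonp) and the generated callback names are invented for this example; the name of the callback parameter differs between services, so it is passed in rather than hard-coded.

```javascript
// Turn a base URL and a plain object of parameters into an encoded GET URL.
function buildQueryUrl(baseUrl, params) {
  const pairs = Object.keys(params).map(
    (key) => encodeURIComponent(key) + '=' + encodeURIComponent(params[key])
  );
  return baseUrl + '?' + pairs.join('&');
}

let jsonpCounter = 0;

// Browser only: inject a <script> tag whose response calls a temporary
// global function with the results.
function requestJsonp(baseUrl, params, callbackParam, onResults) {
  jsonpCounter += 1;
  const callbackName = 'jsonpCallback' + jsonpCounter;
  const script = document.createElement('script');

  // The script returned by the service calls this function by name.
  window[callbackName] = function (data) {
    delete window[callbackName];           // clean up the global
    script.parentNode.removeChild(script); // and the injected tag
    onResults(data);
  };

  const allParams = Object.assign({}, params);
  allParams[callbackParam] = callbackName;
  script.src = buildQueryUrl(baseUrl, allParams);
  document.head.appendChild(script);
}
```

With the placeholder address from above, a call would look like requestJsonp('http://search.example.com/search', { q: 'searchquery' }, 'callback', handleResults), where handleResults is whatever function should receive the parsed results.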

Both Yahoo and Live Search are capable of offering results in XML format instead of JSON with the change of a single query parameter - I bet you can figure out which. But, guys… XML is a language for marking up documents, and JSON is a system for serializing data. And we’re working with data. Sorry, but the “let’s use XML for friggin’ everything!” crew annoy me.

Anyway. Let’s look at the services individually.

Yahoo!

Yahoo seems to really be putting a lot of effort into making their various services developer-friendly, and it shows. Take a look at the Everything YDN page on the Yahoo Developer Network site and check out all the stuff you can play with!

Let’s take a look at a GET query to Yahoo’s servers. Note that I’m going to leave the URLs in these examples unencoded for easier readability; they won’t actually work until you run encodeURIComponent() or something on them.

http://query.yahooapis.com/v1/public/yql?format=json&callback=myCallback&q='select * from search.web(100) where query = "bananas"'

Woah, what the crap? Is that SQL? Nope, that’s Yahoo! Query Language, an SQL-inspired language for querying Yahoo services - including Flickr, Delicious, and so on in addition to just standard web search. Like SQL, you can “select” only certain “fields” from the results, and you can even do WHERE clauses to a certain extent. No sorting, though. It’s pretty trippy. Try the interactive console for some ideas of what it can do.

Notice the parenthesized (100) after search.web? That’s where we tell Yahoo how many results we want back. As far as I can tell, there’s no hard limit to this… I once upped it to 1000, and Yahoo dutifully gave me 1000 results, which is an order of magnitude more than our project needs. I didn’t bother asking for more, but it seemed like the system was ready to give me more if I asked. Wow.
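Putting the pieces of that URL together, encoding included, could look something like the sketch below. The function name buildYahooUrl is made up, and the YQL statement is sent without surrounding single quotes, which is an assumption based on the form Yahoo echoes back in its response. It also assumes the search terms contain no double quotes, since how YQL wants those escaped isn't covered here.

```javascript
// Build an encoded YQL web-search URL for the given terms and result count.
function buildYahooUrl(searchTerms, count, callbackName) {
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError('count must be a positive integer');
  }
  // Assumption: searchTerms contains no double quotes.
  const yql = 'select * from search.web(' + count + ') where query = "' +
    searchTerms + '"';
  return 'http://query.yahooapis.com/v1/public/yql' +
    '?format=json' +
    '&callback=' + encodeURIComponent(callbackName) +
    '&q=' + encodeURIComponent(yql);
}
```

Calling buildYahooUrl('bananas', 100, 'myCallback') gives the encoded equivalent of the readable URL shown earlier.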

Yahoo’s response looks something like this:

myCallback({
  "query": {
    "count": "10",
    "created": "2009-03-31T05:21:37Z",
    "lang": "en-US",
    "updated": "2009-03-31T05:21:37Z",
    "uri": "http://query.yahooapis.com/v1/yql?q=select+*+from+search.web%2810%29+where+query+%3D+%22bananas%22",
    "diagnostics": {
      "publiclyCallable": "true",
      "url": {
        "execution-time": "285",
        "content": "http://boss.yahooapis.com/ysearch/web/v1/bananas?format=xml&start=0&count=10"
      },
      "user-time": "287",
      "service-time": "285",
      "build-version": "911"
    },
    "results": {
      "result": [
        {
          "abstract": "Information from Wikipedia on this fruit, including its description, world trade, <b>...</b> <b>Bananas</b> are a valuable source of vitamin B6, vitamin C, and potassium. <b>...</b>",
          "clickurl": "http://lrd.yahooapis.com/_ylc=X3oDMTQ4amI4Z25zBF9TAzIwMjMxNTI3MDIEYXBwaWQDb0pfTWdwbklrWW5CMWhTZnFUZEd5TkouTXNxZlNMQmkEY2xpZW50A2Jvc3MEc2VydmljZQNCT1NTBHNsawN0aXRsZQRzcmNwdmlkA3QxM1lUVWdlQXUyM1JXRVZyVEpybXdzS1N6N3VQVW5ScUdFQUFrUTM-/SIG=118hrpqt5/**http%3A//en.wikipedia.org/wiki/Banana",
          "date": "2009/03/19",
          "dispurl": "<b>en.wikipedia.org</b>/wiki/Banana",
          "size": "140417",
          "title": "Banana - Wikipedia, the free encyclopedia",
          "url": "http://en.wikipedia.org/wiki/Banana"
        },
        {
          "abstract": "<b>bananas</b>, fruit, healthy <b>...</b> has proved that just two <b>bananas</b> provide enough energy for a <b>...</b> This is because <b>bananas</b> contain tryptophan, one of the twenty <b>...</b>",
          "clickurl": "http://lrd.yahooapis.com/_ylc=X3oDMTQ4amI4Z25zBF9TAzIwMjMxNTI3MDIEYXBwaWQDb0pfTWdwbklrWW5CMWhTZnFUZEd5TkouTXNxZlNMQmkEY2xpZW50A2Jvc3MEc2VydmljZQNCT1NTBHNsawN0aXRsZQRzcmNwdmlkA3QxM1lUVWdlQXUyM1JXRVZyVEpybXdzS1N6N3VQVW5ScUdFQUFrUTM-/SIG=11c59h9ug/**http%3A//www.finetuneyou.com/Bananas.html",
          "date": "2009/03/22",
          "dispurl": "www.<b>finetuneyou.com</b>/<b>Bananas</b>.html",
          "size": "16271",
          "title": "<b>Bananas</b>",
          "url": "http://www.finetuneyou.com/Bananas.html"
        },
        /* …snip… */
      ]
    }
  }
});

Ah, it’s glorious. Nicely formatted, with all that meta-info… Well, neither query.uri nor query.diagnostics.url.content quite matches the request I actually sent, but oh well, close enough.
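Pulling the useful bits out of a response shaped like the one above might go roughly like this. The function names are invented for the example. Two defensive assumptions are baked in: that results could be missing or null when nothing is found, and that a lone result could arrive as a single object instead of an array.

```javascript
// Remove the <b> highlighting tags Yahoo puts around matched terms.
function stripBold(text) {
  return text.replace(/<\/?b>/g, '');
}

// Return an array of { title, url } objects from a parsed Yahoo response.
function yahooResults(json) {
  const results = json && json.query && json.query.results;
  if (!results || !results.result) {
    return []; // assumption: no hits means no results object
  }
  // [].concat() turns a single object into a one-item array and leaves
  // an array as it is.
  return [].concat(results.result).map((item) => ({
    title: stripBold(item.title),
    url: item.url
  }));
}
```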

Live Search

Live Search, which is what Microsoft is calling its search service this week, was surprisingly forward with its results as well. To read up on Microsoft’s documentation for this, start here. Unlike the other two services, you have to sign up for an API key before you can even make some test queries against the service, but doing so is free, quick and relatively painless. I already have a “Live ID” thanks to my Xbox Live subscription, so I didn’t even have to create a new account. A query looks like this:

http://api.search.live.net/json.aspx?Sources=web&Web.Count=50&JsonType=callback&JsonCallback=myCallback&AppId=0123456789ABCDEF&Query=banana

You can probably guess that I faked in that AppId. (Hey, I don’t want you associatin’ my good ID with whatever sicko queries you’re going to be makin’.) The number of results that we can fetch is set by the Web.Count parameter; through trial and error, I found that it seems to max out at fifty, which was sufficient for our task. (If you need more, Live Search lets you specify an offset parameter to fetch the next “page” of results.) Also note the use of TitleCase all over the place; not only in the query, but as you’re about to see, in the response as well. On a one-to-ten scale of annoyingness, that’s about a four.

if(typeof myCallback == 'function') myCallback({
  "SearchResponse": {
    "Version": "2.1",
    "Query": {
      "SearchTerms":"banana"
    },
    "Web": {
      "Total": 29500000,
      "Offset": 0,
      "Results": [
        {
          "Title": "Banana - Wikipedia, the free encyclopedia",
          "Description": "Banana is the common name for a type of fruit and also the herbaceous plants of the genus Musa which produce this commonly eaten fruit. They are native to the tropical region of ... ",
          "Url": "http:\/\/en.wikipedia.org\/wiki\/Banana",
          "CacheUrl": "http:\/\/cc.msnscache.com\/cache.aspx?q=banana&d=75747133304431&w=5ffa56e8,3e149266",
          "DisplayUrl": "http:\/\/en.wikipedia.org\/wiki\/Banana",
          "DateTime": "2009-03-27T12:31:49Z"
        },
        {
          "Title": "Guide to Bananas - History - Recipes - Nutrition - Banana.com",
          "Description": "Complete Guide to Bananas features the history of bananas, banana recipes, the purchase and storage of bananas, how to grow bananas, medicinal uses of bananas, the nutritional ... ",
          "Url": "http:\/\/www.banana.com\/",
          "CacheUrl": "http:\/\/cc.msnscache.com\/cache.aspx?q=banana&d=75708296684832&w=14416f13,fb67564f",
          "DisplayUrl": "http:\/\/www.banana.com\/",
          "DateTime": "2009-03-22T05:56:53Z"
        },
        /* …snip… */
      ]
    }
  }
} /* pageview_candidate */);

Live Search’s results do something unusual in checking for the existence of the callback function before calling it. I’m not sure I like that - if something goes wrong, raising an exception is often better than failing silently. Hmm.
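For comparison, reading a Live Search response shaped like the one above could be sketched as follows. The function names are made up. The escaped slashes in the raw response are ordinary JSON escapes, so the URLs come out looking normal once parsed, and the Offset value makes working out a result's overall position a simple sum.

```javascript
// Return the result URLs from a parsed Live Search response.
function liveUrls(json) {
  const web = json && json.SearchResponse && json.SearchResponse.Web;
  if (!web || !Array.isArray(web.Results)) {
    return []; // assumption: a missing Results array means no hits
  }
  return web.Results.map((item) => item.Url);
}

// Overall 1-based position of the result at indexInPage, using the
// Offset value included in the response.
function liveRank(json, indexInPage) {
  return json.SearchResponse.Web.Offset + indexInPage + 1;
}
```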

Google

Google. Google Google Google Google Google… tsk tsk tsk.

Whereas Yahoo is gloriously generous with the data it’s providing us, Google seems downright stingy. Most of their “AJAX Search API” documentation is geared toward drawing a pretty little search form and pretty little search results on the page, not providing raw data to work with - for info on getting that, you have to read the section annoyingly titled “Flash and other non-Javascript [sic] Environments”, even if you really are working entirely in JavaScript. Additionally, they “ask, but do not require, that each request contains a valid API Key” without providing any information as to just how that API key should be passed to the server. In the data passed back, there’s no explicit search result offset value as there is with the other services’ data; you can work it out with simple arithmetic on other values, but it’s still annoying that you have to do it at all. They also make a lot of demands about preserving the Google branding and such when the results are displayed, which I gloriously ignored since technically we’re not displaying results, and really only a couple of people in the office are ever going to use this anyway. (Perhaps the other services make demands like this too, but are less obnoxious about them.)

Worst of all… the number of results you can fetch at once maxes out at eight. Eight! To work around this limitation, I scripted the system to check through the first page of results for a match, and if none is found, to get the next eight, and so on, up to eight times (sixty-four results). The result is that querying Google takes up to eight connections to Google’s server, whereas the rest only take one (possibly two in the case of Live Search if the boss decides the first fifty results aren’t enough). Fail!
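The page-by-page workaround could be sketched like this, in callback style to suit the script-tag approach. This is an illustration, not the tool's actual code: findGoogleRank is an invented name, and fetchPage is a hypothetical function supplied by the caller that requests the results starting at a given offset and hands back an array of result URLs. How it asks Google for that offset isn't shown here.

```javascript
const PAGE_SIZE = 8; // results per request
const MAX_PAGES = 8; // eight requests, sixty-four results at most

// fetchPage(start, callback) is hypothetical: it should call
// callback(urls) with the result URLs beginning at offset `start`.
// done(rank) receives the 1-based rank of the first match, or -1.
function findGoogleRank(fetchPage, matcher, done) {
  function tryPage(page) {
    if (page >= MAX_PAGES) {
      return done(-1); // gave up after sixty-four results
    }
    const start = page * PAGE_SIZE;
    fetchPage(start, function (urls) {
      const index = urls.findIndex((url) => matcher.test(url));
      if (index !== -1) {
        return done(start + index + 1);
      }
      if (urls.length < PAGE_SIZE) {
        return done(-1); // a short page means there is nothing more to fetch
      }
      tryPage(page + 1);
    });
  }
  tryPage(0);
}
```

Each page is only requested once the previous one has come back without a match, so a site that ranks on the first page costs a single request.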

Well, anyway. A query:

http://ajax.googleapis.com/ajax/services/search/web?v=1.0&callback=myCallback&rsz=large&q=bananas

Simple enough. The “rsz” parameter is what tells the servers to send us eight results - the default is only four!

myCallback({
  "responseData": {
    "results": [
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://en.wikipedia.org/wiki/Banana",
        "url": "http://en.wikipedia.org/wiki/Banana",
        "visibleUrl": "en.wikipedia.org",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:Gdi1ltWHn3UJ:en.wikipedia.org",
        "title": "\u003cb\u003eBanana\u003c/b\u003e - Wikipedia, the free encyclopedia",
        "titleNoFormatting": "Banana - Wikipedia, the free encyclopedia",
        "content": "\u003cb\u003eBanana\u003c/b\u003e is the common name for a type of fruit and also the herbaceous plants of   the genus Musa which produce this commonly eaten fruit. \u003cb\u003e...\u003c/b\u003e"
      },
      {
        "GsearchResultClass": "GwebSearch",
        "unescapedUrl": "http://www.bananasinc.org/",
        "url": "http://www.bananasinc.org/",
        "visibleUrl": "www.bananasinc.org",
        "cacheUrl": "http://www.google.com/search?q\u003dcache:paffpacthUcJ:www.bananasinc.org",
        "title": "\u003cb\u003eBANANAS\u003c/b\u003e Home Page",
        "titleNoFormatting": "BANANAS Home Page",
        "content": "\u003cb\u003eBANANAS\u003c/b\u003e specializes in childcare, daycare \u0026amp; babysitting referrals for parents   and childcare providers in Alameda County, California."
      },
      /* …snip… */
    ],
    "cursor": {
      "pages": [
        {
          "start": "0",
          "label": 1
        },
        {
          "start": "8",
          "label": 2
        },
        /* …snip… */
      ],
      "estimatedResultCount": "16400000",
      "currentPageIndex": 0,
      "moreResultsUrl": "http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dbananas"
    }
  },
  "responseDetails": null,
  "responseStatus": 200
});

You can see that Google provides this curious responseData.cursor.pages array which I guess is supposed to be used to build a “Goooooooogle”-like pager for the results. It seems useless, but it sort of comes in handy when a URL matches and I have to calculate parseInt(json.responseData.cursor.pages[json.responseData.cursor.currentPageIndex].start, 10) + i + 1 to find out which result it was. Barf.
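Wrapped up in a function, that arithmetic reads a little less painfully. The name googleRank is made up for this sketch; the fields it reads are the ones in the response above.

```javascript
// Overall 1-based position of the result at indexInPage on the current page.
function googleRank(json, indexInPage) {
  const cursor = json.responseData.cursor;
  // "start" arrives as a string, so convert it before adding.
  const start = parseInt(cursor.pages[cursor.currentPageIndex].start, 10);
  return start + indexInPage + 1;
}
```

Without the conversion, "8" + 2 + 1 would be string concatenation and produce "821" instead of 11.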

So well done to Microsoft and especially Yahoo for making their search data so accessible like this. The potential for building some really great “mash-up”-style apps with search result data is nearly limitless. I know that Google is the go-to search engine for everyone from n00bs to l33ts, but I think developers really should take a second look at what Yahoo offers them - they’re really doing some great work in terms of developer outreach. I didn’t look into it as deeply, but it seems Microsoft has made some great steps in that direction too.

But all shame upon Google for providing poorly-formatted, restriction-heavy data. Clearly their focus is not upon us, the developers itching to use their data in a sweet new web app. They haven’t had anything to fear from the competition in a while, but if they don’t watch their back, it may come back to bite them…

About RGR

Ray Gun Robot is the personal site of Garrett Albright, a fairly decent web developer living in northern California. Find out more about me or check out some projects I’ve worked on.