Tuesday, February 22, 2011

 

The mysteries of googling

Mr Boon of copying fame (15th February) makes the interesting observation that if you ask Mr. Google about London you get 733 million hits in 0.18 seconds. If you ask him about sex you get the rather grander total of 1,030 million in 0.09 seconds. So sex is somewhat more important than London. But if you ask him about copyright you get the stunning 10,960 million hits in 0.13 seconds. From which we deduce that even in our post-capitalist world property rights are still more important than the basic functions of humanity.

And then yesterday I was prompted to air my knowledge of matters searching - to little avail at the time, but the brain was clearly chugging away overnight and a thousand flowers bloomed in the morning. Couldn't get the subject out of my head for the half hour or so it took me to wake up. Blog dump clearly called for; which follows.

We suppose that the population being searched is web pages, with any one web site being made up of one or more web pages. We assume that a web page belongs to exactly one web site and that many web sites consist of exactly one web page. All those chaps trying to sell their skills at double glazing, bonsai care or whatever.

We suppose that the search term is an ordered list of words, in the simple case, just one.

The context of a search will sometimes be available to the search engine. That is to say, information about the person, computer and place making the search. Information of this sort makes it possible for the search engine to produce a good result from a bad search term, but it also constitutes an invasion of privacy. We ignore this possibility in what follows.

We suppose that it is not practical for the search engine to search all web pages each time it is asked to do a search. Web pages have to be indexed in some more or less complicated way as a background task (a task once known as crawling, undertaken by crawlers) and individual searches do most of their business by looking at those indexes. One of the things taken into account in constructing indexes is whether one web page is pointed at by some other, respectable web page. The more the merrier. None and you don't get indexed. Furthermore, an index will only be able to answer the kinds of questions which have been taken into account during its construction, this being another way of saying that the index contains a lot less information than the original web pages. One will always be able to devise questions that the indexes cannot, on their own, answer.
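By way of illustration, here is a toy version of such an index, of the sort usually called an inverted index. It is a sketch of the general idea only, not a claim about what Mr. Google actually does; the page names, the page text and the helper functions tokenize and build_index are all made up for the purpose.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and pull out runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


# Hypothetical pages, standing in for what a crawler might have collected.
pages = {
    "glazing.example/index": "Double glazing fitted across London and Surrey",
    "bonsai.example/care": "Bonsai care for beginners",
    "bakers.example/london": "A London baker selling bread to Londoners",
}

index = build_index(pages)

# A search for 'London' is now a lookup, not a scan of every page.
london_pages = sorted(index["london"])
print(london_pages)
```

Note that this index records which pages contain a word but not where in the page the word sits, so it cannot, on its own, say whether 'London' and 'baker' appear next to each other. Which is the point about an index holding less information than the pages it was built from.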

We start off with the simple and unhelpful search term 'London'. The search engine looks at its indexes and gets a list of all the web sites containing web pages containing the word 'London'. It might think about 'london', 'Londons', 'Londoners', 'Cockneys', 'Londinium' and 'Londres'. What about all those Londons in far flung places when the search obviously comes from Epsom? The next task is to present this list in descending order of relevance, when all kinds of considerations might come into play. Does a web page which says London lots score more? Does it score more if it says London in special places, places in the HTML which you may not see when you display the page on the screen - but which you can see in Chrome anyway by right click then view source? Places you can only populate by paying Mr. Google to give you the key? But does view source really give you the full sp.? And what about web pages that you have to pay to see?
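One way of putting the 'says London lots' and 'says London in special places' questions into code is sketched below. It is an assumption for the sake of illustration, not a known ranking rule: the score function, the choice of the page title as the special place and the weight of 3 are all invented.

```python
import re


def tokenize(text):
    """Lower-case the text and pull out runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(word, title, body, title_weight=3):
    """Count occurrences of word, with title hits counting extra.

    title_weight is an arbitrary, made-up number.
    """
    word = word.lower()
    return title_weight * tokenize(title).count(word) + tokenize(body).count(word)


# Hypothetical pages: name -> (title, body).
pages = {
    "a": ("London guide", "London sights and London food"),
    "b": ("Travel notes", "A week in London"),
    "c": ("Bonsai care", "Watering and pruning"),
}

# Keep only the pages that mention the word, best score first.
ranked = sorted(
    (name for name in pages if score("London", *pages[name]) > 0),
    key=lambda name: score("London", *pages[name]),
    reverse=True,
)
print(ranked)
```

Page 'a' scores 5 (one title hit at weight 3 plus two body hits), page 'b' scores 1 and page 'c' drops out altogether.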

Things get much more interesting when you move up to two-element search terms, perhaps 'London baker'. Do you score the web page according to whether and how often the two words appear not more than so many words apart? According to how often the two words appear in the right order not more than so many words apart? According to how often the exact phrase appears? Does one take document structure into account? No good if the two words appear in two separate boxes?
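The three counting questions above can be made concrete with a small sketch. It assumes word positions are to hand, which a simple word-to-page index would not give you, and the function proximity_counts, the window of 5 words and the sample text are all made up for the purpose.

```python
import re


def tokenize(text):
    """Lower-case the text and pull out runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens, word):
    """Return the positions at which word occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]


def proximity_counts(text, first, second, window=5):
    """Count pairs of occurrences of the two words in three ways.

    near:    at most `window` words apart, either order
    ordered: `first` before `second`, at most `window` words apart
    phrase:  `first` immediately followed by `second`
    """
    tokens = tokenize(text)
    a = positions(tokens, first.lower())
    b = positions(tokens, second.lower())
    return {
        "near": sum(1 for i in a for j in b if 0 < abs(i - j) <= window),
        "ordered": sum(1 for i in a for j in b if 0 < j - i <= window),
        "phrase": sum(1 for i in a for j in b if j - i == 1),
    }


text = (
    "The baker moved to London. Every London baker knows "
    "the London market; one baker said so."
)
counts = proximity_counts(text, "London", "baker")
print(counts)
```

For this text the loosest rule finds 6 pairs, the right-order rule 3 and the exact phrase just 1, so the three rules could easily rank the same set of pages quite differently. Document structure, the business of the two separate boxes, is not attempted here.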

What about noise words? Suppose you were looking for a house in a village with a baker and you search for 'village with baker'. Should Mr. Google strip the 'with' out of the search term before visiting the indexes? Should it do it both ways?
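Doing it both ways might look something like the sketch below, which produces the search term as typed and, where it differs, the same term with the noise words stripped out. The list of noise words and the function query_variants are assumptions made for illustration; real lists are a good deal longer and I do not know what Mr. Google's looks like.

```python
# A made-up, very short list of noise words.
NOISE_WORDS = {"a", "an", "and", "of", "the", "with"}


def query_variants(term):
    """Return the search term as typed, plus a stripped version if different."""
    words = term.lower().split()
    stripped = [word for word in words if word not in NOISE_WORDS]
    variants = [words]
    # Only add the stripped version if something was removed and something is left.
    if stripped and stripped != words:
        variants.append(stripped)
    return variants


variants = query_variants("village with baker")
print(variants)
```

Each variant could then be run against the indexes and the two sets of results merged, perhaps with the as-typed version counting for more.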

And that has, I think, more or less worked this topic out of the system and I can get on with the morning's tasks. Which this particular morning revolve around the sourcing and erection of a new indoor washing line. But it would be fun to go on a tour with lectures of a Google facility. I wonder if they do such things; the Pentagon does, or at least used to.

Questions for revision: 1) how much do the answers to searches vary over time? 2) how much of this variation is down to variation in the resource applied to the search? 3) how much of this variation is down to variation in the population of web pages? 4) is there seasonal variation? 5) if not, why not?

Part of last night's discussion was about whether one could work out what is happening by experiment. Putting in all kinds of searches and seeing what you got. Tweaking one's own web pages and then searching for them, after allowing a suitable interval during which the indexes could catch up. How long does this take? Can you pay Mr. Google for fast catch up? Yet another project for the university of the fourth age. My view last night was that it might take some time - and that remains my view this morning - although it might be good fun. Quicker to do a bit of searching about searching.
