Niche Search [20:52]

06/08/2007 21 min

Episode Synopsis

Intro:
You may think Google and Yahoo have a lock on search, but it may be time
to start thinking a little differently. In this podcast we take a
look at some niche search sites.
Mike: Gordon, we love Google products and services - is there a problem?

It may be that Google does too good a job! Have you ever tried a Google
search on a person's name? A simple Google search on my first and
last name gives over 1.9 million results!

Today, three companies control almost 90% of online search:
- over 50% of all searches are done using Google
- over 25% on Yahoo
- and over 13% using Microsoft

There are some problems though - these search engines primarily return
results based on the number of sites linking to a page and the
prominence of search terms on a page. Because they work this way,
there is room for niche search sites.
Mike: With this kind of lock on search it would be almost impossible for a
startup to launch a successful general search product - right?

Yes - it would be almost impossible, but we are seeing some activity in the
niche areas. Areas like travel and finance are niches that have already
been filled, but today there seems to be some room in the
people search area.



Mike: Are there companies in this market we should be looking at?
One of the startups to watch is Spock at www.spock.com.
Spock is scheduled for its public launch the first week of August.
Among other places on the web, Spock scans social networking websites
like Facebook and LinkedIn. Search results give summary information
(age, address, etc.) about the person along with a list of website links
that refer to the person.


According to Spock, 30% of the 7 billion searches done on the web every month
(roughly 2.1 billion) are related to individuals. Spock says about half of
those searches concern celebrities, with the other half including business and
personal lookups. According to Spock, a common problem that we face
is that there are many people with the same name. Given that, how do
we distinguish a document about Michael Jackson the singer from
Michael Jackson the football player?


With billions of documents and people on the web, we need to identify and
cluster web documents accurately to the people they are related to.
Mapping these named entities from documents to the correct person is
what Spock is all about, and they're coming at the problem in an
interesting way.



Mike: I've looked at Spock - what is the Spock Challenge?
They've launched what they call the Spock Challenge – more formally
referred to as the SPOCK Entity Resolution Problem, linked here:
http://challenge.spock.com/pages/learn_more



If you go to the site you can download a couple of data sets – one
called a training set (approx 25,000 documents) and the other called
a test set (approx 75,000 documents).



Along with the document sets they include a set of target names. You assume
that each document contains only one of the target names (even though
most documents contain many names). The challenge is to partition all
the documents relevant to a target name by their referent, that is, by
the actual person each document is about.
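
To make the partitioning idea concrete, here is a minimal Python sketch. It is not Spock's method and nothing in it comes from their site: it assumes each document is just a short text string, and the names `partition_by_referent`, `STOPWORDS` and the 0.2 threshold are made up for illustration. It groups the documents for one target name by word overlap (Jaccard similarity), so pages that talk about pop music tend to land in one group and pages that talk about football in another.

```python
import re

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {"the", "of", "or", "a", "and", "com"}


def tokens(text, target_name):
    """Lowercase word set for a document, minus stopwords and the target name."""
    name_words = set(target_name.lower().split())
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words - STOPWORDS - name_words


def jaccard(a, b):
    """Overlap between two word sets: 0.0 means disjoint, 1.0 means identical."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def partition_by_referent(docs, target_name, threshold=0.2):
    """Greedy one-pass clustering of documents that mention target_name.

    Each document joins the first cluster whose pooled vocabulary is similar
    enough; otherwise it starts a new cluster (a new presumed person).
    """
    clusters = []  # list of (pooled_word_set, [doc_ids])
    for doc_id, text in docs.items():
        toks = tokens(text, target_name)
        for pooled, members in clusters:
            if jaccard(toks, pooled) >= threshold:
                members.append(doc_id)
                pooled |= toks  # grow the cluster vocabulary in place
                break
        else:
            clusters.append((set(toks), [doc_id]))
    return [members for _, members in clusters]


# The first two titles are Spock's own example; d3 and d4 are invented.
docs = {
    "d1": "Michael Jackson - The King of Pop or Wacko Jacko?",
    "d2": "Michael Jackson statistics - pro-football-reference.com",
    "d3": "Michael Jackson pop album reviews: the King of Pop returns",
    "d4": "Michael Jackson career football statistics and pro seasons",
}
print(partition_by_referent(docs, "Michael Jackson"))
```

A real entry would need far richer features than title words (dates, places, co-occurring names), but the shape of the problem is the same: one target name in, several groups of documents out, one group per person.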



Mike: When does the contest begin and end?


It already began on 4/16/07 and will end on 11/16/07. On
11/16/07, Spock will run the final round of the competition and announce
the winner. Here are the dates off the website:
4/16 Registration started
5/1 - 8/15 Proposal submissions accepted
7/1 Leader board live
11/1 Finalists announced
11/16 Final round at Spock, winner announced
Mike: What languages and tools can be used?
You can use any language and any non-commercial libraries, tools
and data to develop the solution. There is one catch - the winner grants Spock
a non-exclusive right to use the software and data. As an FYI, much of Google is actually written in Python, with the search engine core written in C++. Python provides scripting
support for the search engine, and some apps like Google Code are done
in Python.
Mike: Can you give us an example of how this works?
From their website: Consider
the following two documents with the target name "Michael
Jackson":
Michael Jackson - The King of Pop or Wacko Jacko?
Michael Jackson statistics - pro-football-reference.com

The referents of these articles are the pop star and football player,
respectively. They've also included the ground truth for the
training set so you have something to compare against.


Once you're done training, you can run your algorithm on the test set and
submit your results on this site. Spock will provide instant feedback
in the form of a percentage rank score. This way you can see how you stack up against the
other teams.


So they provide you with a lot of well-constructed data, and the ground
truth about that data. "Ground truth" data is the set of known correct
results, and you use this information to validate your search
algorithm results.



This data consists of documents about people, and the challenge is to determine all
the unique people described in the data set. This data can be your
training set. Once you have your basic algorithm working against
the training set, they let you further tune your code by running it
against a second test data set and give you instant accuracy feedback
in the form of a score. The score depends on how many correct unique
people you can identify in the data. This way you can continue to
refine your work, and see how you are doing, and how well others are
doing.
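
The notes above do not say exactly how Spock computes its score, so the following is only a sketch of one common way to check a clustering against ground truth: pairwise F1. It asks, for every pair of documents, whether your answer and the ground truth agree that the two documents are about the same person. The function names and the sample clusters are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered document pairs that a clustering puts in the same group."""
    pairs = set()
    for members in clusters:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted, truth):
    """Pairwise F1 between a predicted clustering and the ground truth.

    Assumed metric for illustration; Spock's actual scoring may differ.
    """
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    if not pred_pairs and not true_pairs:
        return 1.0  # both say every document is a different person
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth: two people, two documents each.
truth = [["d1", "d3"], ["d2", "d4"]]
print(pairwise_f1([["d1", "d3"], ["d2", "d4"]], truth))  # perfect answer
print(pairwise_f1([["d1", "d2", "d3", "d4"]], truth))    # everyone lumped together
```

Lumping every document into one person finds all the true pairs but adds many false ones, so it scores well below a perfect answer; splitting every document into its own person finds no pairs at all. That trade-off is what you tune against when you use the ground truth on the training set.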



This looks like a great academic challenge. At
the end of the contest time, you submit your code, a 3-page
description of your approach, pre-built binary executables that can
run in isolation on Spock servers, and your results (the "Software
Entry"). Spock will select the finalists based upon
submissions, and fly the finalists to visit the judges. The winner
will win $50,000, 2nd place wins $5,000 and 3rd place wins $2,000.



Mike: How do people enter?
You may enter the Contest by registering online at
www.spock.com/contestregistration. You may register as an individual or as a team. During the
registration process, you must provide your name, your age, your
email address, and the country you are from. If you are entering on
behalf of an organization, a school or a company, you must identify
its name. If you are registering as a team, you must provide the same
information for each member of your team as well as the identity of a
team leader. You will also provide a name for your team or for
yourself by which you or your team will be known to other
participants in the Contest. Spock may change the name if it feels
the name you select is not appropriate for any reason.

Mike: What are the differences between the Spock Challenge and the Netflix Challenge?
From the Netflix website: The Netflix Prize (http://www.netflixprize.com) seeks to substantially improve the accuracy of predictions
about how much someone is going to love a movie based on their movie
preferences. Improve it enough and you win one (or more) Prizes.

Winning the Netflix Prize improves Netflix's ability to connect people to the movies they love.
Netflix provides you with a lot of anonymous rating data, and a prediction
accuracy bar that is 10% better than what Cinematch can do on the same
training data set. (Accuracy is a measurement of how closely predicted
ratings of movies match subsequent actual ratings.) If you develop a
system that Netflix judges beats that bar on the qualifying test set
they provide, you get serious money and the bragging rights. But (and you
knew there would be a catch, right?) only if you share your method with
Netflix and describe to the world how you did it and why it works.

In addition to the Grand Prize, we're also offering a $50,000
Progress Prize each year the contest runs. It goes to the team whose
system we judge shows the most improvement over the previous year's
best accuracy bar on the same qualifying test set. No improvement, no
prize. And like the Grand Prize, to win you'll need to share your
method with us and describe it for the world.
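
As a rough illustration of the "accuracy bar" idea: the Netflix Prize measures accuracy as root mean squared error (RMSE) between predicted and actual ratings, where lower is better. The sketch below uses made-up ratings and hypothetical function names, and it treats "10% better" as an RMSE at least 10% lower than the baseline. It is not Netflix's scoring code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty rating lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def improvement(baseline_rmse, candidate_rmse):
    """Fractional reduction in RMSE relative to the baseline system."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


def beats_bar(baseline_rmse, candidate_rmse, required=0.10):
    """True if the candidate's RMSE is at least `required` lower than baseline."""
    return candidate_rmse <= baseline_rmse * (1 - required)


# Made-up star ratings, purely for illustration.
actual = [4, 3, 5, 2, 4]
baseline_pred = [3.0, 3.5, 4.0, 3.0, 3.0]   # stand-in for the existing system
candidate_pred = [3.8, 3.2, 4.6, 2.5, 3.7]  # stand-in for a contest entry

base = rmse(baseline_pred, actual)
cand = rmse(candidate_pred, actual)
print(f"baseline RMSE {base:.3f}, candidate RMSE {cand:.3f}")
print(f"improvement {improvement(base, cand):.1%}, beats bar: {beats_bar(base, cand)}")
```

The point of a fixed bar is that an entry is judged against a number, not against the other entries, which is the key contrast with the Spock Challenge discussed next.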


The Netflix contest started October 2, 2006 and continues through at least October 2, 2011.
So, back to your question: the Netflix Challenge will run another 4 years, while the Spock Challenge has
every intention to give out the grand prize to a team with a reasonable
solution at the end of the 6 months.
The Netflix Challenge sets an absolute standard for winning the grand
prize; the Spock Challenge intends to award it to the best reasonable solution.





Mike: How about some other companies?


Wink – www.wink.com
Similar to Spock – launched a few months ago. Wink claims that Wink People
Search now searches over two hundred million people profiles.
It searches for people across numerous social networks including MySpace,
LinkedIn, Friendster, Bebo, Live Spaces, Yahoo! 360, Xanga, Twitter
and more. Also included in the results are Web sources such as
Wikipedia and IMDB, with more coming all the time.



Zoominfo – www.zoominfo.com
Specializes in executive searches. Claims 37,131,140 people and 3,518,329
companies indexed. You can currently search on three categories –
people, jobs and companies.

Searchwikia - http://search.wikia.com
Jimmy Wales and his open-source search protocol and human collaboration project. From the press release:
"Last week Wikia acquired Grub, the original visionary
distributed search project, from LookSmart and released
it under an open source license for the first time in four years. Grub
operates under a model of users donating their personal computing
resources towards a common goal, and is available today for download
and testing at: http://www.grub.org/ .



Grub, now open source, is designed with modularity so that
developers can quickly and easily extend and add functionality,
improving the quality and performance of the entire system. By
combining Grub, which is building a massive, distributed
user-contributed processing network, with the power of a wiki to form
social consensus, the open source Search Wikia project has taken the
next major step towards a future where search is open and transparent".