Listen "Flat World Strategies: Google and Search Wikia, Search Technology Explained [23:10]"
Episode Synopsis
Intro: Right before the 2006 holidays, Jimmy Wales, creator of the online encyclopedia Wikipedia, announced the Search Wikia project. The project will rely on search results ranked by the future site's community of users. In this podcast we take a look at popular search engine technologies and discuss the Search Wikia project concept.
Question: I know this project was really just announced. Before we get into the technology involved - can you tell us what phase the project is in?
According to the BBC, Jimmy Wales is currently recruiting people to work for the company and buying hardware to get the site up and running.
Question: What makes this concept fundamentally different from what Google or Yahoo! are doing?
When Wales announced the project, he came right out and said it was needed because the existing search systems for the net were "broken". They were broken, he said, because they lacked freedom, community, accountability and transparency.
Question: This sounds a lot like Digg - am I on the right track?
Yes, you are - what you end up with is a Digg-like application, or what Wales calls a "people-powered" search site.
Question: Can you provide a bit more detail on how Google works?
Googlebot is Google's web-crawling robot. Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and by following links as it crawls the web.
Source: www.google.com
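To make those two discovery paths concrete, here is a minimal sketch of the link-following half in Java. Googlebot's actual code is not public, so the seed URL (standing in for the add URL form), the regex-based link extraction, and the page budget are all illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative breadth-first crawler: start from seed URLs (the "add URL"
// side) and discover new pages by following links (the crawling side).
public class TinyCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/")); // seed URL, an assumption
        Set<String> seen = new HashSet<>(frontier);
        int budget = 20; // hard page limit so the sketch terminates quickly

        while (!frontier.isEmpty() && budget-- > 0) {
            String url = frontier.poll();
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("fetched " + url + " (" + resp.body().length() + " chars)");

            // Naive link extraction; a real crawler parses HTML properly.
            Matcher m = HREF.matcher(resp.body());
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) frontier.add(link);
            }
        }
    }
}
```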
Question: That's Googlebot - how does the indexer work?
Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google's index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.
Source: www.google.com
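What is being described is essentially a positional inverted index: for each term, a posting list of documents plus the positions where the term occurs. Here is a minimal sketch of that data structure in plain Java; the document IDs and the crude tokenizer are assumptions for illustration, not Google's implementation.

```java
import java.util.*;

// Minimal positional inverted index: term -> (docId -> positions of the term).
public class InvertedIndex {
    // TreeMap keeps terms sorted, like an alphabetical index in the back of a book.
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    public void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+"); // crude tokenizer, assumption for the sketch
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) continue;
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Which documents contain this term, and at which positions?
    public Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Map.of());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.addDocument(1, "Wikipedia is a free encyclopedia");
        index.addDocument(2, "Search Wikia is a people powered search project");
        System.out.println(index.lookup("search")); // prints {2=[0, 6]}
    }
}
```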
Question: So now that everything is indexed, can you describe the search query?
The query processor has several parts, including the user interface (the search box), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.
PageRank
is Google's system for ranking web pages. A page with a higher PageRank is
deemed more important and is more likely to be listed above a page with a lower
PageRank.
Source: www.google.com
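PageRank is commonly computed by power iteration over the link graph: each page repeatedly passes a share of its rank to the pages it links to. The sketch below illustrates that idea; the tiny four-page graph, the 0.85 damping factor and the fixed iteration count are illustrative assumptions, not Google's production values.

```java
import java.util.Arrays;

// Power-iteration PageRank over a tiny hand-made link graph.
public class PageRankSketch {
    public static void main(String[] args) {
        // outLinks[i] = pages that page i links to (illustrative graph).
        int[][] outLinks = { {1, 2}, {2}, {0}, {2} };
        int n = outLinks.length;
        double damping = 0.85;          // commonly cited damping factor
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);     // start from a uniform distribution

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - damping) / n);
            for (int page = 0; page < n; page++) {
                // Each page splits its rank evenly among its outgoing links.
                double share = damping * rank[page] / outLinks[page].length;
                for (int target : outLinks[page]) next[target] += share;
            }
            rank = next;
        }
        System.out.println(Arrays.toString(rank)); // higher rank = deemed more important
    }
}
```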
Question: Can you run us through, step by step, a Google search query?
Sure - this is also from Google's site. Here are the steps in a typical query process:
1. The user accesses the Google web server at google.com and submits a query.
2. The web server sends the query
to the index servers. The content inside the index servers is similar to the
index in the back of a book--it tells which pages contain the words that match
any particular query term.
3. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result (see the snippet sketch below).
4. The search results are returned
to the user in a fraction of a second.
Source: www.google.com
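Step 3 mentions snippet generation: the doc servers pull the stored page text and cut a short excerpt around the query terms. Here is a hedged sketch of that idea; the window size, the handling of missing terms, and the sample text are assumptions for illustration.

```java
// Illustrative snippet generator: find the first query term in the document
// text and return a short window of text around it.
public class SnippetSketch {
    static String snippet(String text, String term, int window) {
        int hit = text.toLowerCase().indexOf(term.toLowerCase());
        if (hit < 0) return text.substring(0, Math.min(window, text.length())) + "...";
        int start = Math.max(0, hit - window / 2);
        int end = Math.min(text.length(), hit + term.length() + window / 2);
        return "..." + text.substring(start, end) + "...";
    }

    public static void main(String[] args) {
        String page = "Search Wikia is a project announced by Jimmy Wales to build "
                + "a people-powered search engine on top of open source tools.";
        System.out.println(snippet(page, "people-powered", 40));
    }
}
```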
Question: OK, so now we know how Google and Yahoo! work. How will these new Search Wikia-type search engines work?
I can give some details based on what I've looked at so far. As we've said, the Search Wikia project will not rely on computer algorithms to determine how relevant web pages are to keywords. Instead, the results generated by the search engine will be decided and edited by the users.
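Search Wikia's implementation details were not public at the time of this episode, so the following is purely a hypothetical sketch of what "results decided and edited by users" could look like: take an algorithmic result list and re-rank it with community votes. The Result record, the vote counts, and the blending rule are all assumptions for illustration (Java 16+ for records).

```java
import java.util.*;

// Hypothetical "people-powered" re-ranking: blend an algorithmic relevance
// score with votes cast by the community on each result.
public class CommunityRerankSketch {
    record Result(String url, double algorithmScore, int userVotes) {}

    public static void main(String[] args) {
        List<Result> results = new ArrayList<>(List.of(
                new Result("http://example.com/a", 0.92, 3),
                new Result("http://example.com/b", 0.75, 40),
                new Result("http://example.com/c", 0.60, 12)));

        // Illustrative blend: community votes can promote a result the algorithm undervalued.
        results.sort(Comparator.comparingDouble(
                (Result r) -> r.algorithmScore() + 0.01 * r.userVotes()).reversed());

        results.forEach(r -> System.out.println(r.url())); // b is promoted above a
    }
}
```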
There are a couple of projects called Nutch and Lucene, along with some others, that can now provide the background infrastructure needed to build a new kind of search engine, one that relies on human intelligence to do what algorithms cannot. Let's take a quick look at these projects.
Lucene: Lucene is a free and
open source information retrieval API, originally implemented in Java by Doug
Cutting. It is supported by the Apache Software Foundation and is released
under the Apache Software License.
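Since Lucene is the building block here, a small indexing-and-search example helps show what the library provides out of the box. This is a sketch against a recent Lucene release; class names and constructors have changed between Lucene versions, so treat the exact calls as assumptions, and note that lucene-core and lucene-queryparser are needed on the classpath.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Index two small documents with Lucene, then run a keyword query against them.
public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index, fine for a sketch

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String body : new String[] {
                    "Wikipedia is a community-edited encyclopedia",
                    "Search Wikia aims to be a community-edited search engine" }) {
                Document doc = new Document();
                doc.add(new TextField("body", body, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(
                    new QueryParser("body", new StandardAnalyzer()).parse("search"), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```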
We mentioned Nutch earlier. Nutch is a project to develop an open source search engine. Nutch is supported by the Apache Software Foundation and has been a subproject of Lucene since 2005.
With Search Wikia, Jimmy Wales hopes to build on Lucene and Nutch by adding the social component. What we'll end up with are more intelligent, socially driven search tools. Now, don't think Google, Yahoo!, Microsoft and all the rest are not working on these kinds of technologies. It will be interesting to watch how these new technologies and methods are implemented.
Sources:
http://search.wikia.com
http://search.wikia.com/wiki/Nutch
http://lucene.apache.org/java/docs/
http://wikipedia.org/
References:
Wikipedia creator turns to search: http://news.bbc.co.uk/2/hi/technology/6216619.stm
How Google Works: http://www.googleguide.com/google_works.html
Search Wikia website: http://search.wikia.com
Search Wikia Nutch website: http://search.wikia.com/wiki/Nutch
Lucene website: http://lucene.apache.org/java/docs/
Wikipedia website: http://wikipedia.org/