A New Probable Algorithm For A Search Engine
Discussion by CaptainRon with 6 Replies.
Last Update: March 29, 2006, 12:38 pm
Sometimes I run a search and get thousands of results. I am looking for one exact thing, and I simply can't find it in the top 50 results on Google. I move to the next page... and voilà, there it is! Why does the exact thing I searched for have to sit beyond the 50th listing?
The PageRank algorithm probably explains it.
OK, suppose the search engine gives us a choice between two methods: in one, Credibility is given priority, and in the other, Information is given priority.
We rank Information on the basis of HTML tags. If what I searched for appears in a web page's <title> tag, I think it gets the highest priority. If it appears in an <h1> or <h2> tag, it gets the next highest priority. Likewise, we weight the Information in a page by combining all the factors together.
For instance, the <title> tag containing the search term would not by itself make a page the best result. We would also look at how much plain content the page has and how it is structured into subheadings. A subheading could be identified as anything more prominent than the plain text... anyway, I will think about this more...
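A minimal sketch of the tag-based weighting described above, using Python's standard `html.parser`. The numeric weights, the names `TAG_WEIGHTS`, `TagWeightScorer` and `score_page`, and the restriction to single-word search terms are all assumptions made for illustration; the post only fixes the order (title first, then h1/h2, then plain content).

```python
from html.parser import HTMLParser

# Hypothetical weights: the post only says title > h1/h2 > plain text.
TAG_WEIGHTS = {"title": 10.0, "h1": 5.0, "h2": 3.0}
BODY_WEIGHT = 1.0


class TagWeightScorer(HTMLParser):
    """Adds up a weight for each occurrence of a single-word term, by enclosing tag."""

    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.open_tags = []   # stack of currently open weighted tags
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Count whole-word matches in this run of text.
        hits = data.lower().split().count(self.term)
        weight = TAG_WEIGHTS[self.open_tags[-1]] if self.open_tags else BODY_WEIGHT
        self.score += hits * weight


def score_page(html, term):
    """Return the combined tag-weighted score of `term` for one page."""
    scorer = TagWeightScorer(term)
    scorer.feed(html)
    scorer.close()
    return scorer.score
```

Because the score is a sum over the whole page, a match in the title alone does not automatically beat a page with rich, well-structured content, which is the combination of factors the post asks for. Punctuation handling and multi-word phrases are left out of this sketch.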
QUOTE (twitch)
Interesting. But could you explain to me the variables based on Credibility? I'm not trying to doubt you in any way, but I am rather curious as to what mechanics the indexing system uses to rank things by credibility. I suppose it could be based on the number of other resources that share similar content, but then wouldn't that lower the uniqueness factor, which could work against your PageRank and keyword search?
Hi,
I suppose Google has already mastered the credibility technique with its PageRank method. What I want is for my search to reveal the exact information I am looking for. For that purpose, I think the method described above should work well.
The search engine database would store (cache) the pages in a hierarchical format. That is, not just the HTML, but a plain structure in which the 'content' data is arranged. That way we can quickly search the different tags to find the relatively important data. I don't know whether other search engines already use this method.
For example, you say to check the title as the most important, H1 as the second most important, and so on. Well, someone could simply spam their title bar with a large number of commonly searched phrases, and then show up in a search for "free web hosting" even though the site is actually a porn site or something. Similarly, some sites put "invisible" text (the same color as the background) with tons of common terms on the page only to draw in search engines.
Ranking pages by credibility lets the pages that people actually find useful come first, followed by those the crawler "thinks" have the correct information.
I may have misinterpreted what you meant, but assuming I didn't, this should all be relevant, heh.
~Viz
For example, my friend complained about searching for something on Google for half an hour and not finding it in the end. Then I tried and found it in about 5 minutes. So you see, it's all about choosing the right words to search for.
As for examining the content of the page more, we all know why that is not a good idea. Google might be smart, but I think it's not smart enough to recognize false keywords. By false keywords I mean words at the bottom of the page, in the same colour as the background, that are there only because they help increase the rank in some search engine. That's why Google doesn't pay much attention to page content.
On the other hand, link counting is perfect, since it is not really possible for a single person to own hundreds of good web sites that would all link to his other site.
I just gave a rough sketch of what came to mind. If I get serious about this technique, I will make a more complex implementation.
To give a small explanation:
I will create a tree structure for a single page. When I say I will give more importance to the title tag, I mean it will be the ROOT of the tree. The H1 tags (or, to be precise, any bold HTML that shows up before simple text) will come as nodes, and the content they discuss will come as children of those nodes.
To simplify lookup, the content is broken up into keywords that have a proper construct (the way the MS Word grammar check does it). These keywords are entered into an index table (just for that particular page, and in the specific subnode) with their occurrence frequencies. Since I index only those keywords that follow a proper construct, this stops spammers from writing the same keyword over and over again.
After that I create a diversity factor. Even with the previous step, a spammer could rewrite a sentence with the same keywords many times over. To cut that out, the diversity factor is calculated as a function of the words in a sentence construct. It also includes non-keywords (is, that, the, them, their, etc.), so a unique paragraph with meaningful text gets properly credited.
This, along with the frequency table, makes up the index table.
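A rough sketch of the per-page tree and its index table, under stated assumptions. The post does not give a formula for the diversity factor, so this uses distinct words divided by total words (stop words included) as a stand-in. The "proper construct" grammar check is not implemented, and the names `Node`, `build_page_tree`, `diversity_factor` and the `STOP_WORDS` list are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Assumed stop-word list; the post only names a few examples.
STOP_WORDS = {"is", "that", "the", "them", "their", "a", "an", "of", "and", "to", "in"}


def words(text):
    """Lowercased word tokens of a text."""
    return re.findall(r"[a-z0-9']+", text.lower())


def diversity_factor(text):
    """Distinct words / total words, counting stop words too (assumed formula)."""
    tokens = words(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


@dataclass
class Node:
    heading: str                      # title text for the root, heading text otherwise
    text: str = ""                    # plain content discussed under this heading
    children: list = field(default_factory=list)

    def index_table(self):
        """Keyword frequencies plus the diversity factor for this node's own text."""
        freq = Counter(w for w in words(self.text) if w not in STOP_WORDS)
        return {"frequency": freq, "diversity": diversity_factor(self.text)}


def build_page_tree(title, sections):
    """sections: (heading, text) pairs already extracted from the page."""
    root = Node(heading=title)        # the <title> is the ROOT of the tree
    for heading, text in sections:
        root.children.append(Node(heading=heading, text=text))
    return root
```

With this stand-in formula, a section that repeats the same two words many times gets a diversity factor close to zero, while a paragraph of ordinary, non-repeating prose scores close to one. A real implementation would need sentence-level analysis to match the "sentence construct" idea more closely.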
This index table is finally generated for the whole page and belongs to the tree structure. Such a tree structure is generated for each and every page that is submitted, and in the end these trees become part of the giant tree called the webspace. A page-tree enters the webspace by being stored categorically. Categories are created on the basis of keywords, and a page-tree can of course belong to several keywords, but it is linked to each through a weighted node, where the weight tells how prominent that keyword is in the page-tree.
Remember that the keyword weight is a function of "where it appears in the page" plus the frequency plus the diversity factor. It can all become a complex mathematical equation if I sit down to seriously work upon it.
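One possible shape for that weight and for the keyword categories, as a sketch only. The position scores, the cap on frequency, the plain sum, and the names `keyword_weight` and `WebSpace` are all assumptions; the post says only that the weight depends on position, frequency and the diversity factor.

```python
from collections import defaultdict

# Assumed position scores; the post does not give numbers.
POSITION_SCORE = {"title": 3.0, "heading": 2.0, "body": 1.0}


def keyword_weight(position, frequency, diversity):
    """Position score plus a capped frequency term plus the diversity factor (0..1)."""
    capped_frequency = min(frequency, 5) / 5   # assumed cap so repetition stops paying off
    return POSITION_SCORE[position] + capped_frequency + diversity


class WebSpace:
    """Maps each keyword category to the pages linked to it, with link weights."""

    def __init__(self):
        self.categories = defaultdict(dict)   # keyword -> {page_url: weight}

    def add(self, keyword, page_url, weight):
        # Keep the strongest weight if the keyword occurs in several places on a page.
        current = self.categories[keyword].get(page_url, 0.0)
        self.categories[keyword][page_url] = max(current, weight)

    def search(self, keyword):
        """Pages for a keyword, most prominent first."""
        pages = self.categories.get(keyword, {})
        return sorted(pages.items(), key=lambda item: item[1], reverse=True)
```

Under these assumed numbers, a page that mentions a keyword once in its title with diverse text outranks a page that repeats the keyword dozens of times in low-diversity body text. Whether a sum, a product, or something more elaborate is the right combination is left open, as in the post.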
But the point is... in a world dominated by Google, it's impossible to outperform it. Look at Acoona... a really fine search engine with little future.