Algorithms Google PageRank

The original algorithm of PageRank was described by Lawrence Page and Sergey Brin. It’s calculated by the formula:

PageRank(A) = (1-d) + d(PageRank(T1)/C(T1) + … + PageRank(Tn)/C(Tn))

Where

- PageRank(A) is the PageRank of the A page,

- PageRank(Ti) is the PageRank of the Ti page which has a link to the A,

- C(Ti) is the number of outer links on the Ti,

- d is the damper factor with a value from 0 to 1.

So first of all we see that PageRank doesn’t value the entire site, it’s defined for each page. Next, the A PageRank is recursively defined by the PageRanks of those which have links to A.

Th eTi PageRanks, that have links to the A page influence the value of the A PageRank not equally. In this algorithm the PageRank of the T page is always divided into the amount of links C(T) that appear on T (we can call it weighted PageRank). It means that the more links the T page has, the less use A gets from T.

Then the weighted PageRanks of the Ti pages are added. So we can conclude that another incoming link to the A page ALWAYS raises its PageRank.

And at last the sum of the weighted PageRanks from all the Ti pages is multiplied to the damper factor which is from 0 to 1. This lowers the influence of the page’s PageRank that has a link to another one.

Random Surfer Model

Larry Page and Sergey Brin gave a simple intuitively-clear explanation to the PageRank algorithm in their publications. They see PageRank as a user behavior model, where a surfer clicks the links randomly not caring about the content.

A random surfer will visit the web-page with some probability that is a derivative from PageRank. The probability of that a random surfer clicks the link depends exclusively on the amount of links on the page. That’s why a PageRank of a certain page is never given to other ones it’s linked to in a whole value, but is divided by the amount of links at the page.

So the probability of that a random surfer finds a certain page is the sum of the probabilities that this surfer follows the links to this very page. Now this probability is lowered by the damper factor d. The explanation of this factor in the Random Surfer Model is that this surfer will not click the infinite amount of links. Once he gets bored he’ll go to another random page.

The probability of that a random surfer won’t stop clicking the links is defined by the damper factor d which carries the value between 0 and 1 depending on the probability level. The higher d is the more probably is that a random surfer continues clicking the links. As soon as a surfer goes to another random page after he stopped clicking the links the probability is realized in the algorithm as (1-d). No matter how many incoming links there are the probability of that a random surfer goes to the page always equals (1-d) and that’s why the page has minimal PageRank.

Another interpretation of the PageRank algorithm

Larry Page and Sergey Brin published two different versions of the PageRank algorithm in different documents. The second version of the algorithm defines the A PageRank as:

PageRank(A) = (1-d) / N + d (PageRank(T1)/C(T1) + … + PageRank(Tn)/C(Tn))

where N is the number of all pages in the web. Actually the second version has no fundamental difference from the first one. As to the Random Surfer Model, PageRank in the second version is the actual probability for a surfer to find the page after clicking a lot of links. PageRanks then form the probability distribution (stochastic distribution) between the web-pages so that the PageRanks’ sum of all the pages is the same.

And the other way round, in the first version the probability to find a page for a random surfer is weighted by the common number of web-pages. So in this version PageRank represents an expected number of times a surfer visits the page starting the random surfing procedure several times considering the entire amount of pages in the web. If there are 100 pages and the needed page has the PageRank value of 2, a random surfer will visit this page twice every 100 restarts.

As it was said before both versions of the algorithm don’t differ fundamentally from each other. The PageRank received in the second version of the algorithm needs to be multiplied to the number of all the web-pages to receive the PageRank of the first version. Page and Brin mixed both versions in their most popular document “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, where they put the task of the first algorithm version to form the distributed probability between the web-pages with the only PageRanks sum for all pages.

Hereafter we are using the first version of the algorithm. The reason is that PageRank calculations are easier here as we can’t even imagine the approximate amount of pages.

PageRank Characteristics

PageRank characteristics will be described by one little example.

Imagine a small network containing three pages A, B and C where A has links to B and C, B has a link to C and C – to A. According to Page and Brin the damper factor is usually set to 0.85 but to simplify the calculations we’ll use 0.5. The exact value of the damper factor affects PageRank of course but it doesn’t influence the fundamental principles of PageRanking. So we have the next equations to calculate the PageRank:

PageRank(A) = 0.5 + 0.5 PageRank(C)

PageRank(B) = 0.5 + 0.5 (PageRank(A) / 2)

PageRank(C) = 0.5 + 0.5 (PageRank(A) /2 + PageRank(B))

They are easily calculated. And we get the following PageRank values for our pages:

PageRank(A) = 14/13 = 1.07692308

PageRank(B) = 10/13 = 0.76923077
PageRank(C) = 15/13 = 1.15384615

Obviously the sum of these values is 3 and this is the amount of all pages. As it was told before it’s not just a special result for our simple example.

It was easy to solve the equations for our simple three pages example to calculate their PageRank. In practice the network contains billions of documents and there is no way to find the result rechecking it.

Iteration PageRank Calculation

Because of the great size of the network Google uses the approximate iteration calculation of the PageRank values. It means that every page is given the basic value and after that the PageRanks of all the pages are calculated in several calculation cycles based on the PageRank algorithm equations. The iteration calculation is illustrated in our three pages example where every page is given the basic PageRank value of 1.

Iteration

PageRank(A)

PageRank(B)

PageRank(C)

0

1

1

1

1

1

0.75

1.125

2

1.0625

0.765625

1.1484375

3

1.07421875

0.76855469

1.15283203

4

1.07641602

0.76910400

1.15365601

5

1.07682800

0.76920700

1.15381050

6

1.07690525

0.76922631

1.15383947

7

1.07691973

0.76922993

1.15384490

8

1.07692245

0.76923061

1.15384592

9

1.07692296

0.76923074

1.15384611

10

1.07692305

0.76923076

1.15384615

11

1.07692307

0.76923077

1.15384615

12

1.07692308

0.76923077

1.15384615

We can see here that we get a good approximation to the real PageRank values after several iterations. Referring to the works of Larry Page and Sergey Brin we find out that we need about 100 iterations to get the necessary approximation of the PageRank values of the entire network.

When using iteration calculations the PageRanks’ sum of all pages still equals the amount of all the network pages. So the average PageRank of a page is 1. The minimum PageRank is defined by the expression (1-d). As a result the maximum PareRank is defined by the expression dN+(1-d), where N is the number of all pages in the web. This maximum can be reached (theoretically :)) if all the pages have links to this one and this page has links only to itself.

PageRank Algorithm

Some years ago the Google Company created its unique and patented algorithm accounting the outer links to the site pages and called it PageRank. It was created to define the site ranks in search results.

Basically Google thought the theoretical attendance to be the criteria of a page’s relevance.

We also know that this attendance is defined by the probability that a user will leave the site and start watching through it from another page (the approximate coefficient – 0.15). There’s also the probability of that a user will go on watching the pages using the links at the very ones. As a result more popular pages will be visited more often than less known ones.

PageRank has a floating value with several signs after point because it uses the coefficients of different probabilities. To make is easier Google shows the PageRank value from 0 to 10.

With this data we may conclude that:

- every page of a site has a greater value than 0 (it may be 0.0000 …), because still there’s the probability that a user will visit this page;

- if a page has links to other pages it gives some part of its own PageRank to these pages and the given value depends on the amount of the outgoing links;

- the PageRank value is given in some proportions and not totally with the further lowering (i.e. if there is a link at a page with the PageRank value of 5 to a page with the PageRank value of 0, the last one will not surely get 5).

So this is how the PageRank algorithm indirectly affects pages’ positions in search results.

Оставьте комментарий к статье