This article will show how we used the approach of genetic algorithms to optimize the distribution of links within web sites to maximize the average rank of individual web pages.
We begin by explaining how about a link analysis algorithm ranking (LAR), which are used by search engines to make an 'order (or ranking) of web pages. It infers importance to a web page based on the topological structure of the graph, extracted by analyzing the links of the site in question, analyze and along the arches outgoing and incoming arcs, that is, outgoing links and incoming links of pages that make up the graph. Based on this information is a value associated with each page, which will be used for sorting.
The predecessor of link analysis algorithms ranking is InDegree which calculates the popularity of a page, taking into account the number of pages that have a link to it. Algorithms latest and most refined are:
- Kleinberg (hereinafter called HITS)
PageRank follows a random graph of the web, where every page spreads its weight to the pages to which it links, resulting in an array of weights appointed following weights of authority. The complete algorithm for calculating the PageRank relies on the use of the theory of Markov processes.
Kleinberg proposed schema of a spread of weights at two levels, leading to the values of authority and also the values of hub. A page with high hub, is a page that contains links to pages of quality (ie with high authority) and symmetrically a page with a high authority, will be a page from many good bet hub.
Salsa is a hybrid between the two previous algorithms.
To approximate these algorithms take into account the formula originally developed by the founders of Google.
- Rank[A] is the value of Page Rank What we want to calculate
- P[i] are the pages containing a link to A
- Rank[Pi] are the values of Rank pages P1 ... Pn
- Link[Pi] are the total number of links on the page that offers links
- d (damping factor) is a factor that Google decided that the original documentation becomes 0.85. It can be adjusted by Google to determine the percentage of PageRank, which must pass from one page to another and the value of the minimum rank assigned to each page in the database.
The algorithms in question are obviously much more complex and our aim is not to replicate faithfully, but to obtain a mathematical model that can approach for a "Our value of Rank that will enable us to determine the goodness of the topological structure calculated the 'genetic algorithm after it was applied.
Let us now an overview on genetic algorithms.
The genetic algorithm is an algorithm to analyze the data and belongs to a particular class of algorithms used in various fields, including the' artificial intelligence. It is a heuristic search and optimization, based on the principle of natural selection of Charles Darwin that governs the development.
A typical genetic algorithm from a number of possible solutions (individuals) calls population and arrange for their evolve during the execution: each iteration, it makes a selection of members of the public is used to generate new elements of the population , Which will replace an equal number of workers already present, and thus constitute a new iteration for the population (or generation) below. The succession of generations to evolve an optimal solution of the problem assigned.
The evolution is obtained through a recombination of partial solutions, each individual transmit part of its genetic heritage to their descendants. The introduction of mutations in the population of random departure, occasionally generates new workers, with features not included among those present in the genetic of the original species.
Once the stage of evolution, population generated at each iteration, is analyzed and are required only solutions that best solve the problem: individuals with the qualities most suitable environment in which they find are more likely to survive and reproduce. These solutions will suffer a new phase of development and so on.
In the end, we expect to find a population of solutions that can adequately solve the problem. There is no way to determine in advance whether the algorithm will actually be able to find an acceptable solution. As a rule, genetic algorithms are used for optimization problems for which no one knows algorithms complexity or linear polynomial.
Details of the operation
The solution of the problem is encoded in a data structure, usually a string, called gene.
Initially creates a number of genes at random and you define a function that returns the "goodness" of a gene as a solution to the problem, said fitness function.
The algorithm consists of the operations that tend to alter the population of genes in an attempt to improve them so that a better solution.
The evolution proceed in steps, for each of these is first performed a sort of genes on the basis of the result of the fitness function. They then performed the operations on a number of genes parameters established by the algorithm, which generally determine how many genes must undergo crossover and mutation, and to what extent.
The algorithm then evolves through the following points:
- Generation, in a random initial population
- Creation of a sequence of new populations, or generations.
In each iteration, the current members of the public are used to create the next generation and to this
end, make the additional steps:
- Each member of the current population is estimated calculating the value of their fitness
- It determines an appropriate order of those individuals on the basis of the values of fitness
- The most promising individuals are selected as parents
- From these individuals will generate an equal number of individuals of the next generation, and this can be done in two ways, ie by making random changes on a single parent (mutation) or appropriately combining the characteristics of a pair of parents (crossing )
- Individuals thus be generated to replace the parents by allowing the formation of the next generation
- Finally, the algorithm s'interrompe when one of the criteria for arrest is satisfied.
According to a coefficient determined initially, some parts of the genes better results are exchanged, assuming that this can improve the outcome of the function of fitness in the next "evolutionary step".
Single point crossover
There are various techniques of crossover. One of the simplest is the "single point crossover" which is to take two workers and cut their strings encoding in a case in point. It creates two heads and two tails. Now exchange the heads and tails, getting two new genes. The crossover is not always applied, but with a probability pc. Where is not applied children are simply copies of the parents.
Experimentally you can see that the improvement is welcome after a number of steps.This in cases of less fortunate, of course.
The mutation is the random changes to portions of genes with a value of fitness lower, based on factors defined initially. These changes aim to improve the value of the function for the gene in question.
In reality it is not proper to think of changing only the chromosomes with lower fitness, in order to provide greater capacity exploratory the algorithm (and not end up in "holes" in excellent local) are also considered useful mutations of chromosomes with the same high fitness. Ultimately the mutations mainly served to explore the area of research, did not order improvements.
A website is in fact a graph. And it is the graph of the site under consideration that will apply the 'genetic algorithm.
The representation is used for adjacency matrix.
The matrix of a nearby structure particular data commonly used in the representation of graphs. In particular, is widely used in the preparation of algorithms that operate on graphs and their representation in the general computing.
Since any graph its matrix of adjacent consists of a binary square matrix that has the indices of rows and columns of the names of the top of the graph. In the place (i, j) of the matrix is a 1 if and only if there is an arc in the graph that goes from summit to summit the j, otherwise there is a 0.
If instead of 1 in the matrix are numbers, these are interpreted as the weight given to each arc. For example, if all the vertices of the graph represents a series of dots on a map, the weight of strings can be interpreted as the distance of the points they connect. In our case indicate the quantity of links.
In the case of representation of non-oriented graphs, the matrix is symmetrical than the main diagonal.
One of the features of this matrix is allowed to obtain the number of paths from one node to node i j who must cross n nodes. To achieve all this is sufficient to power the same n-matrix and see the number that appears instead i, j.
As previously explained a gene is a binary string, then a sequence of 0 and 1. The idea behind this algorithm is a matrix that can be seen as a binary string and then a gene. If we want is access to the position in X, Y of a matrix organized as a string, just apply the following formula:
Applying a genetic algorithm to graph the site before you can get a better topological structure that improves the rank of individual pages.
Have been studied and implemented other processes of evolution than to crossover previously explained, specially designed for the SEO problem.
The fitness function is quite complex and among the parameters that take into account are the average rank of goal pages, the average number of links per page and number of pages void.
The goal pages are pages of the site that are our special interest. Typically, in most cases, these pages leaf.
For anything page is a page with the same rank equal to 0.15 (it is easily deduced from the formula of rank above), or the value determined by the user.
The algorithm is able to take account of the editorial constraints and evolve as a function of them.
The application has allowed us to make new considerations on those before, perhaps, were considered certainties of the SEO. Speaking with various seo specialist, reading the forums, one of the considerations that makes the most is that we must "balancing" the distribution of links. Or (as regards internal links) if a page from a link to an 'other, it is appropriate that the spare parts, or fancedo accounts of serve for each page, outgoing link has many, many must have entrants. The model matetematico, however, seems to say completely 'opposite. The "goal" page, we want to maximize is the node 5.
nodo 2 → 0.999591
nodo 3 → 0.999652
nodo 4 → 0.999704
nodo 5 → 0.999749
nodo 2 → 1.714694
nodo 3 → 0.993759
nodo 4 → 0.544196
nodo 5 → 0.265642
As we can see in the complex model, the page you wanted to build was even penalized. Clearly this example is rather forced, but it is a simplification of the structure of many websites. The intention of this test is obviously not to suggest a circular pattern for your sites, but to show how it is not discounted the validity of the structures of which we are convinced.
Following the results of 'application of' genetic algorithm at: http://www.fratellileonelli.it
You can see the representation of 3D graph before and after the trial. The size of green areas indicate the value of page rank, while the strings are represented in purple.
We see at a glance the difference between before and after treatment. The balls (Individual web pages) have a larger diameter, which means higher rank, and are much more interconnected.
|Url||Rank Before||Rank After||Change in %|
It is important to note that the algorithm has tried to distribute in a more homogeneous the rank of the site.
The table with the final results are highlighted in red the most significant.
The pages are in the menus as you can imagine are those with higher rank, both before and after. However, their rank was lowered for the goal pages.
Of course, this behavior can be driven by need. In this particular case the customer was important to raise the rank of pages that contain the description of the plant, therefore, l 'algorithm has evolved in this direction.
As a result you can see how they were positioned additional links calculated by the process:
If you are interested to apply (for free) the software described, simply download genesis here
The application was developed with Qt 4.6, a cross platform application and UI framework.
tel .: +39 338 5482810
18 novembre 2008