Search engines check content uniqueness when they analyze the sites they index. If a search engine finds that a substantial part of a page's text, or even the whole text, has been copied from other Web resources, the site is unlikely to rank highly in the search results. How exactly is content uniqueness checked?
When search engines check content uniqueness, they take into account the so-called Zipf's law. George Kingsley Zipf, an American linguist and philologist, studied the statistics of word usage in various languages. He discovered an empirical regularity in how frequently the words of a natural language occur in a text.
Zipf's law can be conventionally divided into two parts. The first part states that the product of a word's frequency in the text and its frequency rank is approximately constant. Words are ranked in descending order of frequency, so the most frequently used word has rank 1.
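The first part of the law can be checked on any text with a few lines of Python. The sketch below is only an illustration: the function name `rank_frequency_table` and the simple regular-expression tokenizer are assumptions made for this example, not something a search engine is known to use. On a short sample the products vary noticeably; the regularity is an empirical one and shows up more clearly on long texts.

```python
import re
from collections import Counter


def rank_frequency_table(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent word first."""
    # Assumption: a "word" is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending frequency, so rank 1 is the most frequent word.
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


sample = "the cat and the dog and the bird saw the cat"
for row in rank_frequency_table(sample):
    print(row)
```

If Zipf's law held exactly, the last column would contain the same value in every row.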
The second part of Zipf's law states that the curve relating a word frequency to the number of words that have this frequency has roughly the same shape for all texts.
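The curve from the second part of the law can be tabulated in the same way. The following is a minimal sketch; `frequency_spectrum` is a hypothetical helper name, and the tokenizer is the same simplifying assumption as before.

```python
import re
from collections import Counter


def frequency_spectrum(text):
    """Map each frequency value to the number of distinct words that occur that many times."""
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    # Count how many distinct words share each frequency value.
    spectrum = Counter(word_counts.values())
    return dict(sorted(spectrum.items()))


sample = "the cat and the dog and the bird saw the cat"
print(frequency_spectrum(sample))
```

In natural-language texts most words typically occur only once or twice, while very few words occur many times, which gives the curve its characteristic steep shape.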
On the basis of this law, search engines divide all words that occur in the text of the page being checked into groups. For example, short words (interjections, prepositions, conjunctions) are the most frequently occurring words in any text. However, these words carry little semantic meaning, so search engines do not take them into account during page ranking. Such words are called stop words.
The words that carry important semantic meaning for a particular site are called keywords. Search engines take these words into account during site ranking. The third group consists of occasional words. They have semantic meaning, but they are not important for this particular site, so they are not taken into account during site ranking.
Thus Zipf's law makes it possible to check content uniqueness using only the words that carry meaning for the site, ignoring punctuation marks, conjunctions, prepositions, and interjections. Cleaning the text of these “unnecessary” words and signs is called canonization of the text.
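Canonization can be sketched in a few lines of Python. The stop-word list below is a deliberately tiny, hypothetical one chosen for this example, and the tokenizer handles only ASCII letters and digits; real checkers use much larger, language-specific lists.

```python
import re

# Hypothetical, deliberately small stop-word list used only for illustration.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "in", "on", "at", "of", "to",
    "is", "are", "oh",
}


def canonize(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation, and remove stop words."""
    # Keeping only runs of letters and digits discards punctuation marks.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(canonize("Oh, the cat sat on the mat, and the dog barked!"))
# ['cat', 'sat', 'mat', 'dog', 'barked']
```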
Search engines check content uniqueness using complex algorithms. One such algorithm is the shingling algorithm. Before checking content uniqueness with the shingling algorithm, you should canonize the text.
The text being checked is tokenized and then divided into shingles, which are contiguous subsequences of tokens (here, words). A shingle may consist of any number of tokens; this number is called the size of the shingle. Shingles overlap: the second word of the first shingle is the first word of the second shingle, the second word of the second shingle is the first word of the third shingle, and so on. Because of this overlap, no word remains unchecked.
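The overlapping construction is easy to see in code. This is a minimal sketch; the function name `shingles` and the default size of 3 are assumptions made for the example.

```python
def shingles(tokens, size=3):
    """Return all overlapping shingles of the given size, each shifted by one token."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Each shingle starts one token later than the previous one.
    # A text shorter than the shingle size yields no shingles.
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = ["cat", "sat", "mat", "dog", "barked"]
for shingle in shingles(tokens, size=3):
    print(shingle)
# ('cat', 'sat', 'mat')
# ('sat', 'mat', 'dog')
# ('mat', 'dog', 'barked')
```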
For every shingle a checksum (a signature) is calculated. It is a number that is put in correspondence with a certain fragment of the text, in this case a shingle. The signature is calculated with a predetermined algorithm, for example a checksum or hash function such as CRC32.
Two different fragments of the text should not have the same signature (collisions are possible in principle, but they are rare with a good checksum function); this is the essence of the shingling algorithm. The whole text produces a large number of signatures: their number equals the number of words in the text minus the size of the shingle plus one. From these, only the signatures that meet a certain criterion are selected, for example signatures that are divisible by 10 or 25.
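The signature step and the selection step might look as follows. This is a sketch under stated assumptions: CRC32 from Python's standard `zlib` module stands in for the "predetermined algorithm", and the helper names `shingle_signatures` and `select_signatures` are hypothetical.

```python
import zlib


def shingle_signatures(tokens, size=3):
    """Compute one CRC32 signature per shingle of the canonized token list."""
    signatures = []
    for i in range(len(tokens) - size + 1):
        shingle = " ".join(tokens[i:i + size])
        # zlib.crc32 returns an unsigned 32-bit integer for the given bytes.
        signatures.append(zlib.crc32(shingle.encode("utf-8")))
    return signatures


def select_signatures(signatures, modulus=25):
    """Keep only the signatures that are divisible by the modulus."""
    return {s for s in signatures if s % modulus == 0}


tokens = "search engines check content uniqueness during site analysis".split()
signatures = shingle_signatures(tokens, size=3)
print(len(tokens), len(signatures))  # 8 words, 8 - 3 + 1 = 6 signatures
print(select_signatures(signatures, modulus=25))
```

Selecting by divisibility keeps roughly one signature in every 25, which shrinks the set that has to be stored and compared while applying the same rule to every document.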
When a text is checked for uniqueness, its signatures are compared with the signatures of another text. If any matches are found, the text is not entirely unique. The more matches are found, the higher the probability that the text is a copy. With this method, the check becomes more sensitive as the size of the shingles decreases, although very short shingles also match common word combinations in unrelated texts.
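The comparison itself can be sketched as the share of signatures that two texts have in common. The example below is an assumption-laden illustration, not the method of any particular search engine: it uses CRC32 signatures, skips stop-word removal and signature selection for brevity, and the names `signature_set` and `resemblance` are hypothetical.

```python
import re
import zlib


def signature_set(text, size=3):
    """Return the set of CRC32 signatures of all word shingles in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        zlib.crc32(" ".join(tokens[i:i + size]).encode("utf-8"))
        for i in range(len(tokens) - size + 1)
    }


def resemblance(text_a, text_b, size=3):
    """Share of signatures common to both texts (0.0 = no matches, 1.0 = identical sets)."""
    a = signature_set(text_a, size)
    b = signature_set(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
rewrite = "the quick brown fox leaps over the lazy dog"
unrelated = "content uniqueness is checked by search engines"

print(resemblance(original, original))
print(resemblance(original, rewrite))
print(resemblance(original, unrelated))
```

Changing a single word in the rewrite alters every shingle that contains it, so the score drops below 1 but stays well above the score for the unrelated text.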
Checking content uniqueness with the shingling algorithm finds not only full copies of documents but also near-duplicates, i.e. slightly rewritten texts. This makes the shingling algorithm very popular, and various online and desktop applications for checking content uniqueness are built on it.
However, the shingling algorithm has a considerable drawback. It is very hard to separate idioms and popular quotations (i.e. widely used word combinations) from the rest of the text. If the text being checked contains matches on such word combinations, the algorithm will report a low degree of uniqueness.
The specialists of our web design studio achieve very high content uniqueness (95% and more). This figure conforms to the quality standards of the major search engines. Therefore, when you request copywriting services from WebStudio2U, you can be sure that you get unique texts that are competently optimized for the Web. This content will become an effective tool for your business on the Internet.
TAGS
content uniqueness,
check content uniqueness,
shingling algorithm,
zipf's law,
shingle,
token