What does it do?
Identifies which written languages the website uses (e.g. English, French, Spanish).
Why is it important?
For large sites, identifying the amount of content in different languages can be difficult. This test is also used by several other tests (such as Spelling) to determine how parts of the site should be tested differently.
How is it measured?
Sitebeam checks each page in the website and attempts to classify each sentence by language.
This test never awards a score. If more than one language is found, it reports which pages use which languages.
Languages reported: British English, Chinese/Japanese/Korean, Code, French, German, Portuguese, Russian, Spanish, US English
Sitebeam can detect and classify text as some special ‘languages’ which you may not expect:
- Text that could not be identified: Sitebeam was unable to say accurately what language this is. Usually this means it doesn’t know the language, or there wasn’t enough correctly spelt text to be sure.
- Code: text which appears to be computer code or mathematics, for example HTML, C++, PHP or mathematical formulae.
- Chinese/Japanese/Korean: these are reported as a single language in cases where Sitebeam knows that the language is one of the three but is unable to determine which. Distinguishing between them can be a relatively hard problem for a computer (e.g. where they share certain characters).
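To illustrate why Chinese, Japanese and Korean are sometimes reported together, here is a minimal Python sketch based on Unicode character ranges. It is not Sitebeam's implementation: the function names and the choice of ranges are assumptions made for illustration. Han ideographs appear in all three languages, whereas kana and Hangul each point to a single language.

```python
def cjk_scripts(text):
    """Return the CJK-related scripts found in text (illustrative ranges only)."""
    scripts = set()
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:
            scripts.add("han")     # CJK Unified Ideographs: shared by all three
        elif 0x3040 <= cp <= 0x30FF:
            scripts.add("kana")    # Hiragana and Katakana: Japanese
        elif 0xAC00 <= cp <= 0xD7A3:
            scripts.add("hangul")  # Hangul syllables: Korean
    return scripts


def guess_cjk(text):
    """Hypothetical helper: name the language only when the script is decisive."""
    scripts = cjk_scripts(text)
    if "kana" in scripts:
        return "Japanese"
    if "hangul" in scripts:
        return "Korean"
    if "han" in scripts:
        # Han characters alone do not say which of the three languages this is.
        return "Chinese/Japanese/Korean"
    return "not CJK"
```

Text made up only of shared ideographs, such as `guess_cjk("東京")`, stays in the combined category, while `guess_cjk("東京です")` can be named as Japanese because it contains kana.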
A series of complex algorithms is used to detect language at the sentence level. In summary:
- Only the text of each page is analyzed. The claimed language (as specified by the lang attribute) is ignored, as are HTTP headers and the contents of images.
- The overall text for each page is classified by a combination of n-grams and character frequency distribution. For example, the presence of certain character ranges almost invariably indicates languages such as Chinese, Japanese or Russian. n-grams allow for differentiation between languages that share similar alphabets, as they recognize patterns in the order of the letters.
- A probability for each possible language for the whole page is now known. For each possible language, a set of more specific locales is calculated. For example, English could be British English or American English. This is our list of possible languages.
- Each sentence is tested in turn, and classified based on matches against a dictionary for the possible languages. In many cases, this results in ambiguous answers (i.e. we match no dictionaries, or we match several dictionaries equally). All possible matches are stored for now.
- At the end of this process, we consider each language family (e.g. English, which includes British English and American English) and attempt to determine which member of the family is more likely. We then classify every instance of this family as our preferred member, if we have sufficient certainty that this is correct. If not, we leave it as ambiguous.
- We now iteratively merge the lowest-probability and most similar languages until we have no more than 3 for the whole page. For example, if we have some possible Spanish and Portuguese text, we will likely only end up with one or the other.
- All pages are classified in this manner, leaving some ambiguities.
- Once the whole site has been analyzed, we revisit the ambiguous pages and resolve those based on the analysis of the whole website. For example, pages which could be British or American English are now resolved based on what we’ve seen elsewhere in the site.
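The page-level step described above (character ranges first, then n-grams) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Sitebeam's actual algorithm: the sample sentences, the trigram size, the 50% threshold and all function names are hypothetical, and a real profile would be built from a large corpus rather than one sentence per language.

```python
from collections import Counter

# Tiny illustrative samples (assumption); real profiles need far more text.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and then "
               "the cat sat with the other animals",
    "Spanish": "el rápido zorro marrón salta sobre el perro perezoso y luego "
               "el gato se sienta con los otros animales",
}


def trigrams(text):
    """Count overlapping 3-character sequences in lowercased, space-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def detect_by_script(text):
    """Return a language when a character range alone is decisive, else None."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    # Simplification: other languages also use Cyrillic; this sketch knows only Russian.
    return "Russian" if cyrillic / len(letters) > 0.5 else None


def score_languages(text):
    """Fraction of the text's trigrams that also occur in each language profile."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return {}
    return {
        lang: sum(n for g, n in grams.items() if g in profile) / total
        for lang, profile in PROFILES.items()
    }


def classify_page(text):
    """Try the script check first, then fall back to the best trigram score."""
    by_script = detect_by_script(text)
    if by_script:
        return by_script
    scores = score_languages(text)
    if not scores or max(scores.values()) == 0:
        return "unclassified"
    return max(scores, key=scores.get)
```

With these samples, `classify_page("the dog and the cat")` returns `"English"` and `classify_page("el gato y el perro")` returns `"Spanish"`, because each shares most of its trigrams with only one profile. The later steps (locales, per-sentence dictionaries, merging and site-wide resolution) are not covered by this sketch.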
This test doesn’t recognize some languages
This test can’t detect all languages in existence. More obscure languages may not be recognized because Sitebeam doesn’t know them. This is not a fault, and it will not affect your scores, but Sitebeam isn’t able to name the language for you.
This test incorrectly guesses a language
The most common mistake is between very similar languages, such as British and American English, or Spanish and Portuguese. It is much more difficult – sometimes impossible – for Sitebeam to classify these automatically.
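The following Python sketch shows why such closely related languages are hard to separate, and how surrounding context can help. It assumes a simple dictionary lookup per sentence followed by a majority vote; the word lists, function names and voting rule are hypothetical and much cruder than whatever Sitebeam really does.

```python
from collections import Counter

# Hypothetical, tiny word lists; real dictionaries hold many thousands of words.
COMMON = {"the", "is", "my", "a", "on", "cat", "sat", "mat", "was", "full"}
DICTIONARIES = {
    "British English": COMMON | {"colour", "favourite", "theatre"},
    "US English": COMMON | {"color", "favorite", "theater"},
}


def classify_sentence(sentence):
    """Return every language whose dictionary matches the most words."""
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    hits = {
        lang: sum(w in vocab for w in words)
        for lang, vocab in DICTIONARIES.items()
    }
    best = max(hits.values())
    if best == 0:
        return []  # no dictionary matched at all
    return [lang for lang, n in hits.items() if n == best]


def resolve(sentences):
    """Settle ambiguous sentences using the unambiguous ones as evidence."""
    candidates = [classify_sentence(s) for s in sentences]
    votes = Counter(c[0] for c in candidates if len(c) == 1)
    preferred = votes.most_common(1)[0][0] if votes else None
    resolved = []
    for c in candidates:
        if len(c) == 1:
            resolved.append(c[0])
        elif c and preferred in c:
            resolved.append(preferred)  # tie broken by the majority elsewhere
        else:
            resolved.append("ambiguous")
    return resolved
```

A sentence such as "The cat sat on the mat." matches both dictionaries equally, so on its own it cannot be assigned to either variety. It is only resolved when other sentences, for example "My favourite colour is red.", contain spellings that differ between the two.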
You can help Sitebeam by telling it exactly which languages to expect under Site settings > Test configuration. This reduces the set of possible languages it tests for, but it can still only recognize languages that it knows.