Wikipedia word frequency list

April 17th, 2009

Almost all domainers are interested in how often keywords appear in different contexts. One way to value a domain name, for example, is to look up the Google search result count for the phrase in the name. I am particularly interested in keyword frequencies in connection with algorithms for splitting domain names into their component words and with appraisal methodologies.

There are some well known word frequency lists:

I decided to create my own list of words and associated frequencies based on all articles in the English version of Wikipedia.

Wikipedia is HUGE. The English part alone is 21GB in XML format. It takes 5 hours to parse the entire file and extract statistics for all tokens that look like words.

Some statistics:

  • Total tokens (words, no numbers): 1,570,455,731
  • Unique tokens (words, no numbers): 5,800,280

If you know what a Zipf distribution is: I created a chart of all words in log/log scale, only to find out later that the same chart can be seen here.

Wikipedia Words Frequency List

The chart can be divided into four parts:

  • Rank(1-50) Count(86M-3M) Examples(the, of, and, in, to, a, is) Stop words.
  • Rank(51-3K) Count(2.4M-56K) Examples(university, January, tea, sharp) Words that form the “core” of the English dictionary: the words that are used most frequently.
  • Rank(3K-200K) Count(56K-118) Examples(officiates, polytonality, neoligism) Words that can be found in some large and comprehensive dictionaries (above rank 50K they are mostly Long Tail words).
  • Rank(200K-5.8M) Count(117-1) Examples(euprosthenops, eurotrochilus, lokottaravada) Terms from obscure niches, misspelled words, words transliterated from other languages, new words and “not words at all”.
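One way to check the Zipf shape of such a chart numerically is to fit a straight line to log(count) against log(rank); for an ideal Zipf distribution the slope is close to -1. The following is only a minimal Python sketch. The function name `zipf_slope` and the toy counts are illustrative assumptions, not the code or data behind the chart.

```python
import math

def zipf_slope(counts):
    """Least-squares slope of log(count) versus log(rank).

    `counts` is any iterable of positive word counts; rank 1 is assigned
    to the most frequent word.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Hypothetical counts that follow count = 1200 / rank exactly.
toy_counts = [1200, 600, 400, 300]
slope = zipf_slope(toy_counts)
```

On real data the slope usually differs between the head, the middle and the long tail, which is one way to see the four parts listed above.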

A Google study shows that there are 14M one-word and 315M two-word phrases (bigrams). I currently have no plans to extract two-word phrases because there are so many of them, but it would be interesting to analyze them in the context of two-word domain names.

Technical info

The process of extracting all words and counting them is not an easy task:

  • Download a copy of Wikipedia. I used the version dumped in XML format.
  • Write a parser to extract text from the <title> and <text> tags.
  • Wikipedia uses its own markup language. Write a parser to extract the data from the markup and filter out unnecessary parts.
  • Filter out numbers and special characters.
  • Tokenize.
  • Collect useful statistics.
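The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the parser used for the statistics in this post: the markup regular expression is a crude assumption that only drops templates, HTML-like tags and link brackets, and the names `count_words`, `MARKUP_RE` and `WORD_RE` are made up for the example.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Crude markup filter (an assumption): {{templates}}, <tags>, and the
# "[[target|" and "]]" parts of wiki links.
MARKUP_RE = re.compile(r"\{\{.*?\}\}|<[^>]+>|\[\[(?:[^\]|]*\|)?|\]\]", re.DOTALL)
# A token "looks like a word" here if it consists of the letters a-z only.
WORD_RE = re.compile(r"\b[a-z]+\b")

def count_words(xml_stream):
    """Stream through a dump and count word tokens in <title> and <text>."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if tag in ("title", "text") and elem.text:
            plain = MARKUP_RE.sub(" ", elem.text)
            counts.update(WORD_RE.findall(plain.lower()))
        if tag == "page":
            elem.clear()  # release the finished page to keep memory flat
    return counts

# Tiny hypothetical dump fragment used only to exercise the function.
sample = (b"<mediawiki><page><title>Tea</title><revision><text>"
          b"'''Tea''' is a [[drink|beverage]] from {{cite|x}} 1999."
          b"</text></revision></page></mediawiki>")
counts = count_words(io.BytesIO(sample))
```

Streaming with `iterparse` matters for a file of this size, because the whole dump never has to fit in memory.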

The good news is that Wikipedia is much cleaner and better organized than the rest of the web.

Some selected words and associated counts:

  • Google  197920
  • Twitter 894
  • domain  111850
  • domainer 22
  • Wikipedia 3226237
  • Wiki    176827
  • Obama   22941
  • Oprah   3885
  • Moniker 4974
  • GoDaddy 228

Keywords statistics for .IN zone file

March 31st, 2009

I am developing an algorithm for detecting keywords in domain names. Thanks to Jeff at inforum.in I obtained a copy of the .IN zone file (from Feb. 09) and decided to analyze the distribution of words in it. Splitting keywords is not an easy task. Some of the problems I am facing are:

  • Keywords come from different languages. I use an English dictionary with 100K words, and these statistics cover only domains that seem to be composed of English words. There are a lot of Indian proper names that are not in the dictionary. If a domain name contains a proper name or other unrecognizable part together with an English part, only the English part is included in the final statistics. Numbers in domains are not counted.
  • Most domain names can be split in different ways. My approach splits each name in all possible ways and chooses one by a heuristic, but a more accurate approach could use statistical methods based on keyword co-occurrence.
  • Some domains can’t be split even by humans.
  • Some domains are intentionally misspelled, and I do not use an algorithm that detects misspelled word variants.
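The "split in all possible ways, then choose one" idea can be illustrated with a short Python sketch. The post does not say which heuristic is used, so the "fewest words" rule below is only an assumed stand-in, and the function names and the four-word dictionary are hypothetical.

```python
def all_splits(name, dictionary):
    """Return every way to cut `name` into consecutive dictionary words."""
    if not name:
        return [[]]  # exactly one way to split the empty string
    results = []
    for i in range(1, len(name) + 1):
        head = name[:i]
        if head in dictionary:
            for rest in all_splits(name[i:], dictionary):
                results.append([head] + rest)
    return results

def best_split(name, dictionary):
    """Pick one split by an assumed heuristic: the fewest words wins."""
    splits = all_splits(name, dictionary)
    if not splits:
        return None  # the name cannot be split with this dictionary
    return min(splits, key=len)

# Hypothetical toy dictionary.
words = {"who", "whore", "represents", "presents"}
splits = all_splits("whorepresents", words)
chosen = best_split("whorepresents", words)
```

Both candidate splits here have two words, so a length-based heuristic cannot tell them apart; this is where word frequencies or co-occurrence statistics would help. The plain recursion can also become slow on long names, and caching results per suffix would avoid the repeated work.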

Some statistics:

  • Total number of domains: 315K
  • Number of domains that were split: 175K (purely numeric domains are not counted)
  • Total number of keywords detected: 366K, of which 29K are unique.

The following are the first 300 keywords sorted by frequency, each followed by its count. The list reads down three columns. As expected, a lot of “skip words” appear at the top.

i 7200 em 404 up 259
india 6495 test 402 ads 258
on 3008 star 401 play 254
line 2504 or 401 micro 253
the 2447 your 401 finance 252
my 2348 mail 400 film 251
in 2149 love 399 wedding 251
tech 1786 digital 395 college 248
web 1495 ate 393 center 246
indian 1470 education 393 site 246
group 1447 us 391 way 246
world 1379 market 390 creative 246
net 1264 guide 386 people 244
it 1163 shop 382 inc 243
an 1102 technologies 382 security 242
travel 1077 loan 380 tour 241
info 1056 service 377 click 239
and 1038 blue 373 pay 239
free 1015 times 366 team 238
go 969 card 365 casino 237
jobs 958 chennai 361 today 236
solutions 956 network 358 lab 236
of 944 poker 357 good 236
en 925 hosting 353 directory 235
home 906 sun 352 dream 234
media 903 game 350 vision 233
health 892 food 345 future 233
global 862 ur 344 company 233
to 858 first 344 sky 233
city 853 biz 343 reliance 232
am 851 plus 339 royal 232
me 823 phone 337 san 231
tv 798 cheap 336 call 230
life 792 bio 335 baby 229
sex 774 data 331 products 228
ad 758 books 330 photo 227
design 755 red 330 planet 227
business 734 get 328 cars 227
news 730 os 328 simply 226
hotel 720 realest 327 movies 225
hotels 711 pc 325 corp 225
care 693 zone 323 cash 223
car 692 holidays 320 print 222
at 687 win 320 mall 222
mobile 685 tar 319 deals 220
art 670 travels 319 law 220
as 663 tours 318 mind 219
club 656 ind 317 girls 216
hop 641 max 316 tourism 215
services 639 med 316 video 215
hi 632 eco 314 corporate 215
pro 630 gold 312 academy 213
one 627 soft 311 consultants 213
air 608 capital 310 foundation 213
green 604 sms 309 solar 211
new 602 consulting 305 source 210
best 595 just 304 bazaar 209
no 589 ms 303 fun 209
all 588 marketing 302 tex 209
insurance 577 cricket 302 fly 208
is 576 internet 301 pages 208
for 565 properties 301 now 207
property 563 sports 301 centre 207
power 536 point 300 cards 207
music 536 raj 300 loans 206
job 532 direct 300 kids 206
delhi 529 time 300 dr 205
live 529 porn 300 techno 205
search 524 asia 296 dating 205
ala 524 energy 295 talk 204
do 520 homes 292 he 203
you 514 im 292 open 203
school 514 we 291 log 203
bank 509 space 290 work 201
money 494 box 288 radio 201
man 493 land 287 euro 201
smart 486 bangalore 287 help 200
auto 485 career 287 sale 200
com 475 studio 282 store 198
international 473 real 282 pace 196
systems 473 tel 282 ticket 196
games 469 management 281 by 195
buy 466 host 277 shopping 195
big 453 forum 275 retail 194
credit 448 fashion 274 solution 192
de 445 stock 274 technology 191
mart 437 movie 273 golf 190
book 430 find 273 day 189
trade 426 park 271 family 189
pr 426 tore 270 pal 189
ker 426 tax 269 liberty 189
oft 425 be 268 holiday 189
top 424 computer 267 mob 189
domain 421 office 267 plan 188
hot 418 super 265 yoga 188
easy 412 medical 261 realty 188
guru 412 water 260 trip 187
software 411 con 259 eye 187
house 407 express 259 labs 187
link 405 goa 259 tickets 186

The distribution of domain lengths is shown in the next chart. It is based on all domains in the zone file.

Domains Length distribution

There are some names with the maximum length. It is really difficult for me to understand why people register names like:

“141592653589793238462643383279502884197169399375105820974944592.in” – these seem to be the digits of π after the decimal point, i.e. 3.141592…

“1i1i1ii1iii1i1i1ii1i1iii1i1i11i11i1i1i1ii1i1i1ii1i1ii1i1i1i1i1i.in” or

“angles-channels-beams-sections-structural-steels-exports-mumbai.in”   :)

The next chart shows the distribution of lengths of the detected keywords.

Keywords Length Distribution

That is all for now. I will try to update this post if new statistics are produced. Work is also underway to collect similar statistics for the .com and .net zone files.

Domain Name price vs length for different TLD-s

February 18th, 2009

The following chart shows how the average price depends on name length for the major extensions (TLD-s). I am working on an estimator for the average price that is more robust than the simple mean and should give smoother curves. For some combinations of extension and length (mostly very long names) the sample is small and the resulting point is inaccurate.

Domain name price vs length (for TLDs).

Note: price is in logarithmic scale.
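To illustrate why the simple mean gives jagged curves, here is a small Python comparison of the mean with three common robust alternatives: the median, a trimmed mean and the geometric mean. This is a sketch only. The post does not say which estimator is being developed, and the prices and the helper `trimmed_mean` are hypothetical.

```python
import statistics

def trimmed_mean(values, trim_count=1):
    """Mean after dropping the `trim_count` lowest and highest values."""
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return statistics.fmean(kept)

# Hypothetical prices for one (TLD, length) bucket with a single outlier.
prices = [60, 100, 150, 200, 480000]

simple_mean = statistics.fmean(prices)             # dominated by the outlier
median_price = statistics.median(prices)           # ignores the outlier
trimmed_price = trimmed_mean(prices, trim_count=1)
geometric_price = statistics.geometric_mean(prices)  # averages on a log scale
```

The geometric mean fits naturally with a chart drawn on a logarithmic price scale, since it is the ordinary mean of the log prices transformed back.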

Hyphens and Numbers in domain names

February 16th, 2009

Today I am studying how hyphens and numbers affect domain name prices. It is widely said that domain names containing hyphens and numbers are of lower value. The comparison is done by calculating the average price of domains with similar characteristics (domain length). The statistics cover .com only. The following chart shows price versus domain length on a logarithmic scale. I have a very small sample of domains with both hyphens and numbers, so the associated line (the purple one) is not very reliable.

Prices of domains with hyphens and numbers

As expected, domains without numbers and hyphens are of higher value. There is evidence that domains with numbers are more valuable than domains with hyphens for lengths below 12 characters, and the reverse for longer names. These are just general observations over a distribution of prices with large variation, not a rule for valuing a concrete domain.

Sedo’s Secondary Domain Market Study 2008

February 10th, 2009

Sedo has published a new market study for 2008.

PDF: http://www.sedo.com/press/domainmarketstudy2008-us.pdf

The data presented is interesting and seems to conform with my observations and with the data presented in previous posts. I previously skipped a chart that presents domain sales for each day of the year; it is comparable to the chart “Domain Sales per Quarter” in this report. The trend shown there, that .DE is one of the most traded ccTLD-s, also seems to be true. Sedo has access to a bigger sales database and more accurate statistics.

Domain Sales per month

Domain Sales per day

More charts on Domain Name count vs Length

January 31st, 2009

In the previous post I charted the number of domain names traded versus domain length. In this post I would like to examine this relation more closely. This time I used a more comprehensive database that covers sales data for the past year to date. It contains data for different TLD-s, and .COM is, as expected, the most traded TLD, followed by .NET, .DE, .ORG etc. I observed a strong peak at length 4: these domains are the most intensely traded (2-6 times more than other lengths). The following charts also examine this relation in other TLD-s.

Count of domain names vs domain length (.com)

Logarithm of Prices vs Domain length (.com)

We can see that the .net and .org TLD-s have similar peaks, but .net has its maximum at 3 and 4 letter domain names, while .org has its maximum only at 3 letter names.

Count of domain names vs length (.net .org)

Other TLD-s have no peak at 4 letter domains and only a small one at 3 letters. Also, .uk domain names seem to have no peak at small lengths, just a general bell curve.

Count of domain names vs length (.de .uk .eu)

Markets are always moving, and TLD-s other than .com follow its behaviour. My prediction is that at some point in the future the .de, .uk, .eu (others?) TLD-s will have a strong peak at 4, and .COM probably at 5, as interest in domains of length 5 increases.

Price dependence of Domain Name length

January 26th, 2009

Just some quick statistics extracted from the Domain Names database. They are based on a list of sold domains (1565, .COM only) for the last 4 months (Oct 08 to Jan 09). Some domain names contain dashes and numbers. The last column in the table is the average price for the given name length.

len count min max average
2 12 8500 300000 80129.8
3 92 141 70000 7331.2
4 221 69 200000 3806.4
5 52 60 275000 18312.8
6 76 60 160000 8433.4
7 95 59 365000 12295.7
8 120 60 100000 4062.8
9 115 60 85000 4267.5
10 111 60 125000 7469.7
11 104 60 50000 3656.8
12 103 60 480000 7736
13 93 59 320000 5635.7
14 84 60 48000 3605
15 60 60 70326 2848.1
16 64 60 80000 3238.4
17 47 60 15000 1184.4
18 31 60 30240 1988.9
19 19 60 2610 495.1
20 23 60 12100 858.7
21 9 60 15000 2003.1
22 5 60 500 256.8
23 11 60 1250 419.4
24 5 60 400 213
25 2 69 380 224.5
26 3 400 2800 1400
27 3 400 1000 800
28 3 455 2800 1418.3
29 2 360 400 380
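A table like the one above can be produced by grouping sale prices by name length. The following Python sketch shows the idea under some assumptions: the sales records are hypothetical, the function name `price_table` is made up, and the name length is taken without the ".com" extension, which the post does not state explicitly.

```python
from collections import defaultdict

def price_table(sales):
    """Build (len, count, min, max, average) rows from (name, price) pairs.

    `name` is assumed to be the domain without its extension.
    """
    by_length = defaultdict(list)
    for name, price in sales:
        by_length[len(name)].append(price)
    rows = []
    for length in sorted(by_length):
        prices = by_length[length]
        average = round(sum(prices) / len(prices), 1)
        rows.append((length, len(prices), min(prices), max(prices), average))
    return rows

# Hypothetical sales, not taken from the database used in this post.
sales = [("abcd", 69), ("wxyz", 131), ("abcde", 60)]
rows = price_table(sales)
```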

The chart shows this dependence on a logarithmic price scale.

Log Domain Name price dependence of name length

The next chart shows domain sales by length. It seems that 4-character domains are the most traded at the moment.

Domain sales by name length

Quantitative base for Domain Name Appraisal

November 20th, 2008

Fig. 1 Distribution of words length.

In this article I would like to focus on the quantitative factors that form the base for domain name price estimation, in particular the number of characters and the number of words in a domain name. It is well known that the fewer characters and words in a name, the more desirable it is.

Simple questions that I want to answer are:

  • How many domain names can be generated with fixed length and arbitrary combination of characters?
  • How many domain names can be generated that are combination of 1, 2, 3 or more dictionary words?

For this study I have compiled a dictionary of US and UK English words (101,930 words and names in total). I will not consider domain names that contain words from other languages. The distribution of word lengths is shown in Fig. 1.

It is easy to count all possible domain names made of the letters “a…z” (see the table below). I will omit numbers and “-” for now.

Name Length 1 Word 2 Words 3 Words 4 Words # Total Combinations
4 3,118 20,620 11,232 1,296 456,976
5 6,426 207,352 282,924 89,856 11,881,376
6 10,534 1,393,145 4,520,456 3,042,144 308,915,776
7 14,505 6,557,828 50,892,618 66,396,864 8,031,810,176
8 15,852 22,587,140 426,332,664 1,043,041,432 208,827,064,576
9 14,856 60,492,356 2,748,432,685 12,500,650,016 5,429,503,678,976
10 12,491 134,160,158 13,981,916,226 118,442,764,472 141,167,095,653,376

Notes:

  • The numbers are just estimates. They depend on the dictionary size and especially on the number of short words.
  • Some domain names can be split into words in multiple ways. A famous example is “who represents” and “whore presents”.
  • Not all word combinations create meaningful domain names; most of them do not.
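The counts in the table can be reproduced from a histogram of dictionary word lengths with a small dynamic program. The Python sketch below assumes that the table counts ordered concatenations of words, so a name that splits in two ways is counted twice. The numbers of 4 and 5 letter words (3,118 and 6,426) come from the "1 Word" column; the numbers of 1, 2 and 3 letter words (6, 104 and 817) are not given in the post and were back-solved from the first table row under that assumption. The function name is made up for the example.

```python
def count_concatenations(words_by_length, max_length, max_words):
    """ways[k][n] = number of ordered sequences of k dictionary words
    whose lengths add up to exactly n characters."""
    ways = [[0] * (max_length + 1) for _ in range(max_words + 1)]
    ways[0][0] = 1  # one way to build the empty name from zero words
    for k in range(1, max_words + 1):
        for n in range(1, max_length + 1):
            ways[k][n] = sum(
                count * ways[k - 1][n - length]
                for length, count in words_by_length.items()
                if length <= n
            )
    return ways

# Word-length histogram; the entries for lengths 1-3 are inferred (see above).
words_by_length = {1: 6, 2: 104, 3: 817, 4: 3118, 5: 6426}
ways = count_concatenations(words_by_length, max_length=5, max_words=4)

# Total number of letter-only names of a given length is 26 ** length.
total_4 = 26 ** 4
total_5 = 26 ** 5
```

With these inputs `ways[2][4]` is 20,620 and `ways[3][5]` is 282,924, matching the rows for lengths 4 and 5, and `26 ** 4` is 456,976 as in the last column.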

Work in progress …

Domain Name Appraisal Method

November 19th, 2008

Domain name appraisal is the process of estimating the price of a given domain name. The estimate is based on quantitative factors as well as subjective ones.

For example, if the domain name contains words and phrases that have meaning in some language, and those words are desired by multiple participants in the domain market, the estimated price is higher. Another factor that attracts potential buyers (and increases the price) is the size of the domain name, in terms of word count and character count.

Domains can be classified into categories (news, finance, business, etc.). Some categories are more profitable for domain owners than others, and prices of domain names in those categories are higher. There are also a lot of subjective factors (people can like or dislike a name without being able to explain why) that are difficult to estimate quantitatively.

Domain markets do not stand still: domain prices change with economic conditions and with the buzz around particular words and phrases. Because each domain name is unique, its value to domain market participants is also unique.

The best one can do is to quantitatively estimate the “distance” between a given domain name and domain names sold at auction (using actual sale prices, not asking prices) and to price the domain according to this distance measure. The measure can be estimated in the space of character count, word count, links to the domain, counts in search engines and news for the phrase, etc. Producing an accurate model for domain name appraisal is possible with domain market experience and the use of robust and accurate statistical methods.
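One simple way to turn such a distance into a price is a nearest-neighbour estimate: find the sold domains closest to the candidate in feature space and average their prices. The Python sketch below is only an illustration of that idea, not a finished appraisal model. The two features (character count, word count), the sold list and the function name `appraise` are all hypothetical.

```python
import math

def appraise(candidate, sold, k=3):
    """Estimate a price from the k nearest sold domains.

    `candidate` is a feature tuple and `sold` is a list of
    (features, price) pairs. Prices are averaged on a log scale.
    """
    nearest = sorted(sold, key=lambda item: math.dist(candidate, item[0]))[:k]
    mean_log_price = sum(math.log(price) for _, price in nearest) / len(nearest)
    return math.exp(mean_log_price)

# Hypothetical sold domains: ((character count, word count), sale price).
sold = [((4, 1), 3800.0), ((5, 1), 1800.0), ((12, 2), 700.0), ((20, 3), 90.0)]

estimate_k1 = appraise((4, 1), sold, k=1)
estimate_k2 = appraise((4, 1), sold, k=2)
```

In practice the features would need to be scaled so that no single one dominates the Euclidean distance, and more of the factors listed above (links, search engine counts) would be added as extra dimensions.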