April 17th, 2009
Almost all domainers are interested in how often keywords appear in different contexts. One way to value a domain name, for example, is to obtain the Google search results count for the phrase in the domain name. I am particularly interested in keyword frequencies in connection with algorithms for splitting compound domain names into words and with appraisal methodologies.
There are some well known word frequency lists:
I decided to create my own list of words and associated frequencies based on all articles in the English version of Wikipedia.
Wikipedia is HUGE. The English part alone is 21GB in XML format. It takes about 5 hours to parse the entire file and extract statistics for all tokens that look like words.
Some statistics:
- Total tokens (words, no numbers): 1,570,455,731
- Unique tokens (words, no numbers): 5,800,280
If you know what a Zipf distribution is: I plotted all words on a log/log scale, only to discover later that the same chart can be seen here.
 Wikipedia Words Frequency List
The chart can be divided into four parts:
- Rank(1-50) Count(86M-3M) Examples(the, of, and, in, to, a, is) Words that are stop words.
- Rank(51-3K) Count(2.4M-56K) Examples(university, January, tea, sharp) Words that form the “core” of the English dictionary, i.e. the words that are most frequently used.
- Rank(3K-200K) Count(56K-118) Examples(officiates, polytonality, neoligism) Words that can be found in some large and comprehensive dictionaries (above rank 50K these are mostly Long Tail words).
- Rank(200K-5.8M) Count(117-1) Examples(euprosthenops, eurotrochilus, lokottaravada) Terms from obscure niches, misspelled words, transliterated words from other languages, new words and “not words at all”
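The rank/count relation behind such a chart can be reproduced with a few lines of Python. This is only a minimal sketch: the toy counts are invented so that count = 1200 / rank holds exactly, and the function names are hypothetical. For a Zipf-like list the fitted slope on the log/log scale comes out close to -1.

```python
import math
from collections import Counter

def rank_frequency(counts):
    """Return (rank, count) pairs; rank 1 is the most frequent token."""
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, c) for rank, c in enumerate(ordered, start=1)]

def loglog_slope(points):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(r) for r, _ in points]
    ys = [math.log(c) for _, c in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented toy counts: 1200, 600, 400, 300, 240, 200 (exactly 1200 / rank)
toy = Counter({"w%d" % r: 1200 // r for r in (1, 2, 3, 4, 5, 6)})
points = rank_frequency(toy)
slope = loglog_slope(points)
```

On the real list the slope would be fitted separately for each of the four rank ranges, since the curve bends between them.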
A Google study shows that there are 14M one-word and 315M two-word phrases (bigrams). I currently have no plans to extract two-word phrases because of their large number, but it would be interesting to analyze them in the context of two-word domain names.
Technical info
The process of extracting all words and counting them is not an easy task:
- Download a copy of Wikipedia. I used the version dumped in XML format.
- Write a parser to extract text from the <title> and <text> tags.
- Wikipedia uses its own markup language. Write a parser to extract the text from the markup and filter out unnecessary parts.
- Filter out numbers and special characters.
- Tokenize.
- Collect useful statistics.
The good news is that Wikipedia is much cleaner and better organized than the rest of the web.
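The steps above can be sketched roughly as follows. This is not the parser used for the numbers in this post, only a minimal illustration under stated assumptions: the markup stripping handles just non-nested templates, wiki links, inline tags and bold/italic quotes; tokens are lowercased (an assumption); and all function names are hypothetical. The dump is streamed with `xml.etree.ElementTree.iterparse` so that the whole file never has to fit in memory.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")               # {{template}}, non-nested only
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")   # [[target|label]] -> label
TAG = re.compile(r"<[^>]+>")                           # inline tags such as <ref>
WORD = re.compile(r"[A-Za-z]+")                        # letters only, so numbers drop out

def strip_markup(text):
    """Very rough wiki markup removal; real markup needs far more rules."""
    text = TEMPLATE.sub(" ", text)
    text = LINK.sub(r"\1", text)
    text = TAG.sub(" ", text)
    return text.replace("'''", "").replace("''", "")

def count_words(xml_stream):
    """Stream the dump and count word tokens found in <title> and <text>."""
    counts = Counter()
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace, if any
        if tag in ("title", "text") and elem.text:
            words = WORD.findall(strip_markup(elem.text))
            counts.update(w.lower() for w in words)
        if tag == "page":
            elem.clear()                    # release the finished page
    return counts

# Tiny invented sample standing in for the real dump file
sample = (
    b"<mediawiki><page><title>Tea</title><revision>"
    b"<text>'''Tea''' is a [[drink|beverage]] from {{cite}} 1610.</text>"
    b"</revision></page></mediawiki>"
)
counts = count_words(io.BytesIO(sample))
```

For the real dump, `count_words(open(path, "rb"))` would be the entry point, where `path` is wherever the XML file was saved.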
Some selected words and associated counts:
- Google 197920
- Twitter 894
- domain 111850
- domainer 22
- Wikipedia 3226237
- Wiki 176827
- Obama 22941
- Oprah 3885
- Moniker 4974
- GoDaddy 228
Tags: Domain Statistics, Word Frequency list Posted in Domain Statistics, Word Frequency list | Comments Off
March 31st, 2009
I am developing an algorithm for keyword detection in domain names. Thanks to Jeff at inforum.in I obtained a copy of the .IN zone file (from Feb. 09) and decided to analyze the distribution of words in it. Keyword splitting is not an easy task. Some problems I am facing are:
- Keywords come from different languages. I use an English dictionary with 100K words, and these statistics cover only domains that seem to be composed of English words. There are a lot of Indian proper names that are not in the dictionary. If a domain name contains a proper name or other unrecognizable part together with an English part, then only the English part is included in the final statistics. Numbers in domains are not counted in the statistics.
- Most domain names can be split in different ways. My approach splits the names in all possible ways and chooses one by a heuristic, but a more accurate approach could use statistical methods based on keyword co-occurrence.
- Some domains can’t be split even by humans.
- Some domains are intentionally misspelled, and I do not use an algorithm that detects misspelled word variants.
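To make the "split in all possible ways, then choose one" idea concrete, here is a minimal sketch. The tiny dictionary and the selection rule (fewest words first, then the split whose shortest word is longest) are assumptions for illustration only, not the heuristic used for the statistics below; hyphens, numbers and unrecognized parts are not handled.

```python
from functools import lru_cache

def all_splits(name, dictionary):
    """Return every way to write `name` as a sequence of dictionary words."""
    @lru_cache(maxsize=None)
    def splits_from(start):
        if start == len(name):
            return ((),)                      # one way to split the empty rest
        results = []
        for end in range(start + 1, len(name) + 1):
            word = name[start:end]
            if word in dictionary:
                for rest in splits_from(end):
                    results.append((word,) + rest)
        return tuple(results)
    return [list(s) for s in splits_from(0)]

def best_split(name, dictionary):
    """Assumed heuristic: fewest words, then the longest shortest word."""
    candidates = all_splits(name, dictionary)
    if not candidates:
        return None                           # name is not fully covered
    return min(candidates, key=lambda s: (len(s), -min(len(w) for w in s)))

# Hypothetical mini dictionary, including short "skip words"
WORDS = frozenset({"i", "in", "d", "a", "india", "tra", "vel", "travel"})
splits = all_splits("indiatravel", WORDS)
chosen = best_split("indiatravel", WORDS)
```

Even this toy dictionary produces four candidate splits for one name, which is why short words inflate the counts and why a co-occurrence model would likely pick better than a length-based rule.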
Some statistics:
- Total number of domains: 315K
- Number of domains that were split: 175K (purely numeric domains are not counted)
- Total number of keywords detected: 366K or 29K unique.
The following are the first 300 keywords sorted by frequency. As expected, a lot of “skip words” appear at the top.
| Keyword | Count | Keyword | Count | Keyword | Count |
|---|---|---|---|---|---|
| i | 7200 | em | 404 | up | 259 |
| india | 6495 | test | 402 | ads | 258 |
| on | 3008 | star | 401 | play | 254 |
| line | 2504 | or | 401 | micro | 253 |
| the | 2447 | your | 401 | finance | 252 |
| my | 2348 | mail | 400 | film | 251 |
| in | 2149 | love | 399 | wedding | 251 |
| tech | 1786 | digital | 395 | college | 248 |
| web | 1495 | ate | 393 | center | 246 |
| indian | 1470 | education | 393 | site | 246 |
| group | 1447 | us | 391 | way | 246 |
| world | 1379 | market | 390 | creative | 246 |
| net | 1264 | guide | 386 | people | 244 |
| it | 1163 | shop | 382 | inc | 243 |
| an | 1102 | technologies | 382 | security | 242 |
| travel | 1077 | loan | 380 | tour | 241 |
| info | 1056 | service | 377 | click | 239 |
| and | 1038 | blue | 373 | pay | 239 |
| free | 1015 | times | 366 | team | 238 |
| go | 969 | card | 365 | casino | 237 |
| jobs | 958 | chennai | 361 | today | 236 |
| solutions | 956 | network | 358 | lab | 236 |
| of | 944 | poker | 357 | good | 236 |
| en | 925 | hosting | 353 | directory | 235 |
| home | 906 | sun | 352 | dream | 234 |
| media | 903 | game | 350 | vision | 233 |
| health | 892 | food | 345 | future | 233 |
| global | 862 | ur | 344 | company | 233 |
| to | 858 | first | 344 | sky | 233 |
| city | 853 | biz | 343 | reliance | 232 |
| am | 851 | plus | 339 | royal | 232 |
| me | 823 | phone | 337 | san | 231 |
| tv | 798 | cheap | 336 | call | 230 |
| life | 792 | bio | 335 | baby | 229 |
| sex | 774 | data | 331 | products | 228 |
| ad | 758 | books | 330 | photo | 227 |
| design | 755 | red | 330 | planet | 227 |
| business | 734 | get | 328 | cars | 227 |
| news | 730 | os | 328 | simply | 226 |
| hotel | 720 | realest | 327 | movies | 225 |
| hotels | 711 | pc | 325 | corp | 225 |
| care | 693 | zone | 323 | cash | 223 |
| car | 692 | holidays | 320 | print | 222 |
| at | 687 | win | 320 | mall | 222 |
| mobile | 685 | tar | 319 | deals | 220 |
| art | 670 | travels | 319 | law | 220 |
| as | 663 | tours | 318 | mind | 219 |
| club | 656 | ind | 317 | girls | 216 |
| hop | 641 | max | 316 | tourism | 215 |
| services | 639 | med | 316 | video | 215 |
| hi | 632 | eco | 314 | corporate | 215 |
| pro | 630 | gold | 312 | academy | 213 |
| one | 627 | soft | 311 | consultants | 213 |
| air | 608 | capital | 310 | foundation | 213 |
| green | 604 | sms | 309 | solar | 211 |
| new | 602 | consulting | 305 | source | 210 |
| best | 595 | just | 304 | bazaar | 209 |
| no | 589 | ms | 303 | fun | 209 |
| all | 588 | marketing | 302 | tex | 209 |
| insurance | 577 | cricket | 302 | fly | 208 |
| is | 576 | internet | 301 | pages | 208 |
| for | 565 | properties | 301 | now | 207 |
| property | 563 | sports | 301 | centre | 207 |
| power | 536 | point | 300 | cards | 207 |
| music | 536 | raj | 300 | loans | 206 |
| job | 532 | direct | 300 | kids | 206 |
| delhi | 529 | time | 300 | dr | 205 |
| live | 529 | porn | 300 | techno | 205 |
| search | 524 | asia | 296 | dating | 205 |
| ala | 524 | energy | 295 | talk | 204 |
| do | 520 | homes | 292 | he | 203 |
| you | 514 | im | 292 | open | 203 |
| school | 514 | we | 291 | log | 203 |
| bank | 509 | space | 290 | work | 201 |
| money | 494 | box | 288 | radio | 201 |
| man | 493 | land | 287 | euro | 201 |
| smart | 486 | bangalore | 287 | help | 200 |
| auto | 485 | career | 287 | sale | 200 |
| com | 475 | studio | 282 | store | 198 |
| international | 473 | real | 282 | pace | 196 |
| systems | 473 | tel | 282 | ticket | 196 |
| games | 469 | management | 281 | by | 195 |
| buy | 466 | host | 277 | shopping | 195 |
| big | 453 | forum | 275 | retail | 194 |
| credit | 448 | fashion | 274 | solution | 192 |
| de | 445 | stock | 274 | technology | 191 |
| mart | 437 | movie | 273 | golf | 190 |
| book | 430 | find | 273 | day | 189 |
| trade | 426 | park | 271 | family | 189 |
| pr | 426 | tore | 270 | pal | 189 |
| ker | 426 | tax | 269 | liberty | 189 |
| oft | 425 | be | 268 | holiday | 189 |
| top | 424 | computer | 267 | mob | 189 |
| domain | 421 | office | 267 | plan | 188 |
| hot | 418 | super | 265 | yoga | 188 |
| easy | 412 | medical | 261 | realty | 188 |
| guru | 412 | water | 260 | trip | 187 |
| software | 411 | con | 259 | eye | 187 |
| house | 407 | express | 259 | labs | 187 |
| link | 405 | goa | 259 | tickets | 186 |
The next chart shows the domain length distribution. It is based on all domains in the zone file.
 Domains Length distribution
There are some names with the maximum length. It is really difficult for me to understand why people register names like:
“141592653589793238462643383279502884197169399375105820974944592.in” – this seems to be the digits of π after the decimal point, i.e. 3.141592…
“1i1i1ii1iii1i1i1ii1i1iii1i1i11i11i1i1i1ii1i1i1ii1i1ii1i1i1i1i1i.in” or
“angles-channels-beams-sections-structural-steels-exports-mumbai.in”
The next chart shows the distribution of lengths of the detected keywords.
 Keywords Length Distribution
That is all for now. I will try to update this post if new statistics are produced. Also, some work is underway to collect similar statistics for the .com and .net zone files.
Tags: Domain Statistics, Word Frequency list, zone file Posted in Domain Statistics, Word Frequency list | Comments Off
February 18th, 2009
The following chart shows how the average price depends on name length for the major extensions (TLDs). I am working on an estimator of the average price that is more robust than the simple mean and should give smoother curves. For some combinations of extension and length the sample is small and the resulting point is inaccurate (particularly for very long names).
 Domain name price vs length (for TLDs).
Note: price is in logarithmic scale.
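To illustrate why the simple mean is fragile here, the sketch below compares it with two common alternatives, the median and the geometric mean (the mean taken on the logarithmic scale). These are only candidate estimators named for illustration; the prices are invented, with one large sale among small ones.

```python
import math
import statistics

def simple_mean(prices):
    return sum(prices) / len(prices)

def median_price(prices):
    return statistics.median(prices)

def geometric_mean(prices):
    """Mean computed on log prices, then mapped back to the price scale."""
    return math.exp(sum(math.log(p) for p in prices) / len(prices))

# Invented sample: three small sales and one outlier
prices = [100, 100, 100, 100000]
mean_value = simple_mean(prices)       # pulled far up by the outlier
median_value = median_price(prices)    # ignores the outlier entirely
geo_value = geometric_mean(prices)     # in between, natural on a log chart
```

With a single outlier among four sales the simple mean lands above 25,000 while the median stays at 100, which is the kind of jump that makes the curves look ragged where samples are small.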
Tags: Domain Name, Domain Statistics Posted in Domain Statistics | Comments Off
February 16th, 2009
Today I am studying how hyphens and numbers affect domain name prices. It is a widespread belief that domain names containing hyphens and numbers are of lower value. The comparison is made by calculating the average price of domains with similar characteristics (the same domain length). The statistics are for .com only. The following chart shows price versus domain length on a logarithmic scale. I have a very small sample of domains with both hyphens and numbers, so the associated line (the purple one) is not very reliable.
 Prices of domains with hyphens and numbers
As expected, domains without numbers and hyphens are of higher value. There is evidence that domains with numbers are more valuable than domains with hyphens for lengths below 12 characters, and the reverse for longer names. These are just general observations over a distribution of prices with large variation, not a rule for valuing a concrete domain.
Tags: Domain Name, Domain Name Appraisal Posted in Domain Appraisal, Domain Statistics | Comments Off
February 10th, 2009
Sedo published a new market study for 2008.
PDF: http://www.sedo.com/press/domainmarketstudy2008-us.pdf
The data presented is interesting and seems to agree with my observations and the data presented in previous posts. Below is a chart I skipped earlier that shows domain sales for each day of the year; it is comparable to the chart “Domain Sales per Quarter” in this report. The reported trend that .DE is one of the most traded ccTLDs also seems to be true. Sedo has access to a bigger sales database and more accurate statistics.
 Domain Sales per day
Tags: Domain Statistics Posted in Domain Statistics | Comments Off
January 31st, 2009
In a previous post I charted the number of domain names traded vs domain length. In this post I would like to examine this relation more deeply. This time I used a more comprehensive database that covers sales data for the past year to date. It contains data for different TLDs, and .COM is, as expected, the most traded TLD, followed by .NET, .DE, .ORG, etc. I observed a strong peak at length 4: these domains are the most intensely traded (2-6 times more than other lengths). The following charts also examine this relation in other TLDs.
 Count of domain names vs domain length (.com)
 Logarithm of Prices vs Domain length (.com)
We can see that the .net and .org TLDs have similar peaks, but .net has its maximum at 3- and 4-letter domain names, while .org has its maximum only at 3-letter names.
 Count of domain names vs length (.net .org)
Other TLDs have no peak at 4-letter domains and only a small one at 3-letter domains. Also, .uk domain names seem to have no peak at short lengths, just a general bell curve.
 Count of domain names vs length (.de .uk. eu)
Markets are always moving, and TLDs other than .com are following its behaviour. My prediction is that at some point in the future the .de, .uk and .eu TLDs (others?) will have a strong peak at 4, and .COM probably at 5, as interest in 5-character domains increases.
Tags: Domain Name, Domain Statistics, Domainer Posted in Domain Statistics | Comments Off
January 26th, 2009
Just some quick statistics extracted from the domain names database. They are based on a list of 1565 sold domains (.COM only) for the last 4 months (Oct 08 to Jan 09). Some domain names contain dashes and numbers. The last column in the table is the average price for the given name length.
| len | count | min | max | average |
|---|---|---|---|---|
| 2 | 12 | 8500 | 300000 | 80129.8 |
| 3 | 92 | 141 | 70000 | 7331.2 |
| 4 | 221 | 69 | 200000 | 3806.4 |
| 5 | 52 | 60 | 275000 | 18312.8 |
| 6 | 76 | 60 | 160000 | 8433.4 |
| 7 | 95 | 59 | 365000 | 12295.7 |
| 8 | 120 | 60 | 100000 | 4062.8 |
| 9 | 115 | 60 | 85000 | 4267.5 |
| 10 | 111 | 60 | 125000 | 7469.7 |
| 11 | 104 | 60 | 50000 | 3656.8 |
| 12 | 103 | 60 | 480000 | 7736 |
| 13 | 93 | 59 | 320000 | 5635.7 |
| 14 | 84 | 60 | 48000 | 3605 |
| 15 | 60 | 60 | 70326 | 2848.1 |
| 16 | 64 | 60 | 80000 | 3238.4 |
| 17 | 47 | 60 | 15000 | 1184.4 |
| 18 | 31 | 60 | 30240 | 1988.9 |
| 19 | 19 | 60 | 2610 | 495.1 |
| 20 | 23 | 60 | 12100 | 858.7 |
| 21 | 9 | 60 | 15000 | 2003.1 |
| 22 | 5 | 60 | 500 | 256.8 |
| 23 | 11 | 60 | 1250 | 419.4 |
| 24 | 5 | 60 | 400 | 213 |
| 25 | 2 | 69 | 380 | 224.5 |
| 26 | 3 | 400 | 2800 | 1400 |
| 27 | 3 | 400 | 1000 | 800 |
| 28 | 3 | 455 | 2800 | 1418.3 |
| 29 | 2 | 360 | 400 | 380 |
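A table of this shape can be produced by grouping sales by name length. The sketch below assumes the length is counted on the part before the extension (the post does not say so explicitly) and uses invented domain names and prices; `stats_by_length` is a hypothetical helper name.

```python
from collections import defaultdict

def stats_by_length(sales):
    """Group (domain, price) pairs by name length and summarize each group."""
    groups = defaultdict(list)
    for domain, price in sales:
        label = domain.rsplit(".", 1)[0]      # strip the extension (assumed)
        groups[len(label)].append(price)
    return {
        length: {
            "count": len(prices),
            "min": min(prices),
            "max": max(prices),
            "average": round(sum(prices) / len(prices), 1),
        }
        for length, prices in sorted(groups.items())
    }

# Invented sales records, not taken from the database
sales = [("ab.com", 8500), ("cd.com", 300000), ("abcd.com", 69)]
table = stats_by_length(sales)
```

Rows with only two or three sales, like the longest lengths in the table, give averages that say very little on their own.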
The chart shows this dependence on a logarithmic price scale.
 Log Domain Name price dependence of name length
The next chart shows domain sales by name length. It seems that 4-character domains are the most traded at the moment.
 Domain sales by name length
Tags: Appraisal Methodology, Domain Name, Domain Statistics Posted in Domain Appraisal, Domain Statistics | Comments Off
November 20th, 2008
 Fig. 1 Distribution of words length.
In this article I would like to focus on the quantitative factors that form the base for domain name price estimation, in particular the number of characters and the number of words in a domain name. It is well known that the fewer characters and words a name has, the more desirable it is.
Simple questions that I want to answer are:
- How many domain names can be generated with fixed length and arbitrary combination of characters?
- How many domain names can be generated that are combination of 1, 2, 3 or more dictionary words?
For this study I have compiled a dictionary of US and UK English words (101,930 words and names in total). I will not estimate domain names that contain words from other languages. The distribution of word lengths is shown in Fig. 1.
It is easy to count the number of all possible domain names with the letters “a…z” (see the table below). I will omit numbers and “-” for now.
| Name Length | 1 Word | 2 Words | 3 Words | 4 Words | # Total Combinations |
|---|---|---|---|---|---|
| 4 | 3,118 | 20,620 | 11,232 | 1,296 | 456,976 |
| 5 | 6,426 | 207,352 | 282,924 | 89,856 | 11,881,376 |
| 6 | 10,534 | 1,393,145 | 4,520,456 | 3,042,144 | 308,915,776 |
| 7 | 14,505 | 6,557,828 | 50,892,618 | 66,396,864 | 8,031,810,176 |
| 8 | 15,852 | 22,587,140 | 426,332,664 | 1,043,041,432 | 208,827,064,576 |
| 9 | 14,856 | 60,492,356 | 2,748,432,685 | 12,500,650,016 | 5,429,503,678,976 |
| 10 | 12,491 | 134,160,158 | 13,981,916,226 | 118,442,764,472 | 141,167,095,653,376 |
Notes:
- The numbers are just estimates. They depend on the dictionary size and especially on the number of short words.
- Some domain names can be split into words in multiple ways. A famous example is “who represents” and “whore presents”.
- Not all word combinations create meaningful domain names; most do not.
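The counts in the table can be reproduced from a histogram of word lengths alone. The sketch below is one way to do it, not necessarily how the table was computed: total combinations are 26 to the power of the length, and the N-word columns count ordered sequences of dictionary words whose lengths add up to the name length. The histogram entries for 4 and 5 letters come from the "1 Word" column; the entries for 1, 2 and 3 letters (6, 104 and 817) are not stated in the post and were back-solved from the length-4 row, so treat them as an assumption.

```python
def total_combinations(length, alphabet_size=26):
    """All strings of the given length over the letters a..z."""
    return alphabet_size ** length

def word_combinations(length_hist, total_length, num_words):
    """Count ordered sequences of num_words words with the given total length.

    length_hist maps word length -> number of dictionary words of that length.
    """
    # ways[n] = number of sequences built so far whose total length is n
    ways = [0] * (total_length + 1)
    ways[0] = 1
    for _ in range(num_words):
        next_ways = [0] * (total_length + 1)
        for used, count in enumerate(ways):
            if count == 0:
                continue
            for word_len, n_words in length_hist.items():
                if used + word_len <= total_length:
                    next_ways[used + word_len] += count * n_words
        ways = next_ways
    return ways[total_length]

# 4 and 5 from the table; 1, 2 and 3 back-solved (assumption, see above)
HIST = {1: 6, 2: 104, 3: 817, 4: 3118, 5: 6426}
two_word_len4 = word_combinations(HIST, 4, 2)
three_word_len5 = word_combinations(HIST, 5, 3)
```

Because sequences are counted rather than distinct strings, a name that can be split in two ways is counted twice, which is one reason the numbers are only estimates.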
Work in progress …
Tags: Appraisal Methodology, Domain Statistics Posted in Domain Appraisal, Domain Statistics | Comments Off
November 19th, 2008
Domain name appraisal is the process of estimating the price of a given domain name. The estimate is based on quantitative factors as well as subjective ones.
For example, if the domain name contains words and phrases that have meaning in some language, and those words are desired by multiple participants in the domain market, the estimated price is higher. Another factor that attracts potential buyers (and increases the price) is the size of the domain name, in terms of word count and character count.
Domains can be classified into categories (news, finance, business, etc.). Some categories are more profitable for domain owners than others, and prices of domain names in those categories are higher. There are also a lot of subjective factors (people can like or dislike a name without being able to explain why) that are difficult to estimate quantitatively.
Domain markets do not stand still: domain prices change with economic conditions and with the buzz around given words and phrases. Because each domain name is unique, its value to domain market participants is also unique.
The best one can do is to quantitatively estimate the “distance” between a given domain name and domain names sold at auction (using actual sale prices, not asking prices) and to price the domain according to this distance measure. The measure can be computed in a space of character count, word count, links to the domain, counts in search engines and news for the phrase, etc. Producing an accurate model for domain name appraisal is possible with domain market experience and the use of robust and accurate statistical methods.
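One simple way to turn the "distance" idea into a number is a nearest-neighbour estimate. The sketch below is an illustration under assumptions, not a finished appraisal model: features are just (character count, word count), the distance is plain Euclidean without any feature scaling, the estimate is the geometric mean of the k closest sale prices, and the sold-domain records are invented.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate_price(features, sold, k=3):
    """Geometric mean of the prices of the k nearest sold domains."""
    nearest = sorted(sold, key=lambda item: distance(features, item[0]))[:k]
    log_mean = sum(math.log(price) for _, price in nearest) / len(nearest)
    return math.exp(log_mean)

# Invented records: ((character count, word count), sale price)
sold = [
    ((4, 1), 1000.0),
    ((5, 1), 1000.0),
    ((12, 2), 100.0),
    ((20, 3), 10.0),
]
short_estimate = estimate_price((4, 1), sold, k=2)
long_estimate = estimate_price((20, 3), sold, k=1)
```

In practice the features would need scaling and many more dimensions (links, search counts, category), and the choice of k and of the averaging rule would have to be validated against real sales.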
Posted in Appraisal Methodology, Domain Appraisal, Domain Statistics | Comments Off