Character Frequency Analysis Info

Sorry for the long delay between postings. Between the IEEE Security & Privacy conference and moving back up to D.C. for the summer, my computer time has been limited.

Well, enough whining from me. On to the data!

With character frequency analysis, I normally divide it into three sections: first letter analysis, last letter analysis, and overall analysis. (I probably should do middle as well, but I've found it closely mirrors the overall analysis.) There are a couple of reasons for this distinction:

1) They do vary quite a bit. People tend to capitalize the first letter much more often than any other letter. Also, people tend to put numbers at the end of passwords. You get the idea.

2) While I like using Markov models (they track the conditional probability of letters appearing together; for example, if you have a 'q', the next letter will almost always be a 'u'), they can sometimes be a pain to set up. In that case, letter frequency analysis helps a great deal when performing targeted brute force attacks. I can't stress this enough. If you are using A-Z0-9 in your John the Ripper config files (or, god forbid, the default Cain & Abel character sets), you are really hurting yourself. So when I'm doing brute force, I'll use the first letter analysis to order my character set for the first character, the last letter analysis to order the character set for the last character, etc.

2-b) If anyone is interested, I can throw up a table showing the Markov probabilities. Normally I just cheat, though, and use John the Ripper's built-in Markov models. You can train it yourself by passing JtR a list of passwords, which is really nice. Eventually I should post my JtR character set file (the Markov probabilities for their "incremental", read: brute force, attack), but if someone wants it sooner, let me know and I can e-mail it to you.

3) The first character analysis can also be very useful when attacking pass-phrases where people only used the first letter of each word.
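For anyone who hasn't played with Markov models, here is a minimal Python sketch of the conditional probability idea from point 2. This is not John the Ripper's implementation, just a toy bigram counter; the function name bigram_probabilities and the sample passwords are made up for illustration.

```python
from collections import Counter, defaultdict


def bigram_probabilities(passwords):
    """Return {prev_char: {next_char: P(next_char | prev_char)}}."""
    pair_counts = defaultdict(Counter)
    for pw in passwords:
        # Count every adjacent pair of characters in the password.
        for prev_char, next_char in zip(pw, pw[1:]):
            pair_counts[prev_char][next_char] += 1

    probabilities = {}
    for prev_char, followers in pair_counts.items():
        total = sum(followers.values())
        # Normalize the counts so they sum to 1 for each leading character.
        probabilities[prev_char] = {
            ch: count / total for ch, count in followers.items()
        }
    return probabilities


if __name__ == "__main__":
    # Toy training data (hypothetical, not taken from the phpbb.com list).
    sample = ["queen", "quick1", "qwerty", "quiet"]
    probs = bigram_probabilities(sample)
    print(probs["q"])  # {'u': 0.75, 'w': 0.25}
```

Even on four toy passwords, 'u' dominates after 'q'. On a real list you would train on far more data and would probably also want to handle characters that never appeared in training.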

So here is the data. All data is for the phpbb.com list. (Edit: I'm having trouble pasting some of the non-ASCII characters into Blogger. The 'æ' is actually an 'i' with two dots.)

Quick cheat sheet:

Overall Character Frequency Charset:
aeorisn1tl2md0cp3hbuk45g9687yfwjvzxqASERBTMLNPOIDCHGKFJUW.!Y*@V-ZQX_$#,/+?;^ %~=&`\)][:<(æ>"ü|{'öä}

First Character Frequency Charset:
s1mpabctdrlfhgkjnw2ei0ov3q45796z8yuxSMPBACDTJRLFGHKNEWVIOZUQY!X*@.$#_-`[,~=/^<+\?;%{]:(&

Last Character Frequency Charset:
e1nsra326yt0d954o78lkgmihbpcxuwfzjvq!ESANRDYBT.O*LHMGKCX@PI$#U-ZWFJ_Q?+^V/,;)%~=`]&æ\>:"}[
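To make the "order your character set per position" idea concrete, here is a rough Python sketch of a brute force candidate generator built on the three charsets above. It is an illustration, not how John the Ripper or any other cracker works internally. The names FIRST, MIDDLE, LAST and candidates are made up for the example, each charset is cut down to its ten most common characters to keep the keyspace small, and using the overall charset for every middle position is my assumption (based on middle mirroring overall).

```python
from itertools import islice, product

# First ten characters of each cheat-sheet charset, most likely first.
FIRST = "s1mpabctdr"
MIDDLE = "aeorisn1tl"  # overall charset, reused for every middle position
LAST = "e1nsra326y"


def candidates(length, first=FIRST, middle=MIDDLE, last=LAST):
    """Yield fixed-length guesses using a separate charset per position."""
    if length < 1:
        return
    if length == 1:
        positions = [first]
    else:
        positions = [first] + [middle] * (length - 2) + [last]
    # product() varies the rightmost position fastest, so the most common
    # first and middle characters are exhausted against every last character
    # before moving on to less common ones.
    for combo in product(*positions):
        yield "".join(combo)


if __name__ == "__main__":
    print(list(islice(candidates(3), 3)))  # ['sae', 'sa1', 'san']
```

Note that this only tries likely characters early in each position. It does not produce guesses in true probability order, which is where the Markov approach earns its keep.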

Now for the actual percentages:
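If you want to build the same kind of tables from your own password list, a minimal Python sketch could look like the following. This is not the script I used; the function names frequency_table, analyze and charset are made up for the example, and it assumes one password per list entry with empty entries skipped.

```python
from collections import Counter


def frequency_table(chars):
    """Return [(char, percent)] sorted from most to least common."""
    counts = Counter(chars)
    total = sum(counts.values())
    # most_common() sorts by count, descending; ties keep first-seen order.
    return [(ch, 100.0 * n / total) for ch, n in counts.most_common()]


def analyze(passwords):
    """Build the first, last, and overall tables for a password list."""
    passwords = [pw for pw in passwords if pw]  # skip empty entries
    first = frequency_table(pw[0] for pw in passwords)
    last = frequency_table(pw[-1] for pw in passwords)
    overall = frequency_table(ch for pw in passwords for ch in pw)
    return first, last, overall


def charset(table):
    """Collapse a frequency table into an ordered charset string."""
    return "".join(ch for ch, _ in table)


if __name__ == "__main__":
    # Toy input (hypothetical, not taken from the phpbb.com list).
    first, last, overall = analyze(["password1", "Summer09", "secret1"])
    print(charset(last))  # 19
```

The charset() output is what goes into a cheat sheet line, and the (char, percent) pairs are what the tables list.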

Overall Character Frequency Analysis (letter/probability, in percent):

a       7.52766

e       7.0925

o       5.17

r       4.96032

i       4.69732

s       4.61079

n       4.56899

1       4.35053

t       3.87388

l       3.77728

2       3.12312

m       2.99913

d       2.76401

0       2.74381

c       2.57276

p       2.45578

3       2.43339

h       2.41319

b       2.29145

u       2.10191

k       1.96828

4       1.94265

5       1.88577

g       1.85331

9       1.79558

6       1.75647

8       1.66225

7       1.621

y       1.52483

f       1.2476

w       1.24492

j       0.836677

v       0.833626

z       0.632558

x       0.573305

q       0.346119

A       0.130466

S       0.108132

E       0.0970865

R       0.08476

B       0.0806715

T       0.0801223

M       0.0782306

L       0.0775594

N       0.0748134

P       0.073715

O       0.0729217

I       0.070908

D       0.0698096

C       0.0660872

H       0.0544319

G       0.0497332

K       0.0460719

F       0.0417393

J       0.0363083

U       0.0350268

W       0.0320367

.       0.0316706

!       0.0306942

Y       0.0255073

*       0.0241648

@       0.0238597

V       0.0235546

-       0.0197712

Z       0.0170252

Q       0.0147064

X       0.0142182

_       0.0122655

$       0.00970255

#       0.00854313

,       0.00323418

/       0.00311214

+       0.00231885

?       0.00207476

;       0.00207476

^       0.00195272

(space) 0.00189169

%       0.00170863

~       0.00152556

=       0.00140351

&       0.00134249

`       0.00115942

\       0.00115942

)       0.00115942

]       0.0010984

[       0.0010984

:       0.000549201

<       0.000427156

(       0.000427156

æ       0.000183067

>       0.000183067

"       0.000183067

ü       0.000122045

|       0.000122045

{       0.000122045

'       0.000122045

ö       6.10223e-05

ä       6.10223e-05

}       6.10223e-05


----------------------------------------

First Character Frequency Analysis:

s       7.55118

1       6.26416

m       6.16403

p       6.0229

a       5.17827

b       4.96031

c       4.85069

t       4.37507

d       4.15582

r       3.11136

l       3.09842

f       3.06432

h       2.99915

g       2.96764

k       2.9124

j       2.84766

n       2.53389

w       2.26717

2       2.11309

e       1.91844

i       1.77903

0       1.76004

o       1.33104

v       1.22573

3       1.11179

q       1.07467

4       1.02461

5       0.957276

7       0.918433

9       0.906348

6       0.883905

z       0.871821

8       0.85542

y       0.705225

u       0.56021

x       0.518345

S       0.360813

M       0.296074

P       0.282263

B       0.256799

A       0.24644

C       0.237809

D       0.227019

T       0.218387

J       0.183428

R       0.179112

L       0.173501

F       0.166164

G       0.162711

H       0.153648

K       0.143289

N       0.114804

E       0.101856

W       0.100562

V       0.0828661

I       0.0820029

O       0.0599916

Z       0.0474754

U       0.0392751

Q       0.0388435

Y       0.0332328

!       0.0258957

X       0.0224429

*       0.0220113

@       0.0202849

.       0.0151058

$       0.013811

#       0.0120846

_       0.00517913

-       0.00474754

`       0.00302116

[       0.00302116

,       0.00302116

~       0.00258957

=       0.00215797

/       0.00215797

^       0.00172638

<       0.00172638

+       0.00172638

\       0.00129478

?       0.00129478

;       0.00129478

%       0.00129478

{       0.000863189

]       0.000863189

:       0.000863189

(       0.000863189

&       0.000431594


----------------------------------------

Last Character Frequency Analysis:

e       7.34531

1       6.7933

n       5.81012

s       5.6513

r       5.35566

a       5.1869

3       4.59734

2       3.91327

6       3.77602

y       3.59302

t       3.51404

0       3.48167

d       3.3004

9       3.07425

5       2.96031

4       2.9288

o       2.91887

7       2.88262

8       2.62323

l       2.45146

k       1.75918

g       1.66552

m       1.63747

i       1.6038

h       1.54209

b       1.32154

p       1.12258

c       1.06474

x       1.06043

u       0.848515

w       0.726805

f       0.69832

z       0.612864

j       0.317654

v       0.277947

q       0.220976

!       0.130342

E       0.0975403

S       0.08934

A       0.08934

N       0.0815713

R       0.0694867

D       0.0604232

Y       0.0535177

B       0.0487702

T       0.0448858

.       0.0444542

O       0.0431594

*       0.0384119

L       0.0345276

H       0.0332328

M       0.02978

G       0.0293484

K       0.0250325

C       0.0250325

X       0.0241693

@       0.0237377

P       0.0228745

I       0.0220113

$       0.0198533

#       0.0185586

U       0.015969

-       0.0146742

Z       0.0142426

W       0.0142426

F       0.013811

J       0.0107899

_       0.00949508

Q       0.00733711

?       0.00733711

+       0.00733711

^       0.00474754

V       0.00474754

/       0.00474754

,       0.00431594

;       0.00388435

)       0.00388435

%       0.00388435

~       0.00302116

=       0.00302116

`       0.00215797

]       0.00172638

&       0.00172638

æ       0.000863189

\       0.000863189

>       0.000863189

:       0.000863189

"       0.000863189

}       0.000431594

[       0.000431594


Comments

Unknown said…
Nice data. The only problem with using the phpbb.com password set is that it comes from untrained people (i.e. random Internet people).

Users who either:
1) Have been trained on choosing passwords, or
2) Have to meet strict password requirements
will choose other patterns and will have a whole other set of characters that they use.

That's where using Markov mode REALLY comes into its own! I love it.
Matt Weir said…
Good point, Minga. I made another post to try and address your comments. I agree with you that Markov models are really the way to go, but adding a little extra knowledge to them (such as making sure the first character is uppercase and the last characters are numbers/special characters) can go a long way.
