Exploring Base64
Even if you have just started to dip your toe into security or web development, you have probably already seen Base64 encoding all over the place. I went to decode a short snippet today and realized that while I know the reasons for using Base64 encoding, and it is trivial to decode and encode with numerous tools both online and offline…
Online:
Offline:
- Linux:
cat info.b64 |base64 -d
- Windows:
certutil -decode in.b64 out.txt
… I have never really taken the time to understand the process. So today I wanted to read up on it just as a simple exercise and jot down some notes.
To start this exercise I have chosen to encode a randomly selected word from a short list, just so I won't know what the result is supposed to be before trying to manually decode it.
I used (https://randomwordgenerator.com/) to generate a list of 50 random 3-syllable words and saved them to words.txt.
After that, a quick Python script will spit out a random Base64 string for us to play with.
Script:
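#!/usr/bin/env python3
import random
import base64

# Pick one word at random from the list (one word per line)
with open('words.txt') as f:
    lines = f.read().splitlines()
line = random.choice(lines)

# Encode the word's ASCII bytes, then turn the Base64 bytes back into text
b64_bytes = base64.b64encode(line.encode('ascii'))
b64_out = b64_bytes.decode('ascii')
print(b64_out)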
Output:
> ./rand_line_b64.py
cm9tYW50aWM=
Converting from Base64 by hand
Some quick googling turns up many resources that cover how Base64 encoding is done, but it is hard to beat the RFC where it is defined.
Excerpt from RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648)
The Base 64 encoding is designed to represent arbitrary sequences of octets in a form that allows the use of both upper- and lowercase letters but that need not be human readable.
A 65-character subset of US-ASCII is used, enabling 6 bits to be represented per printable character. (The extra 65th character, “=”, is used to signify a special processing function.)
The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating 3 8-bit input groups. These 24 bits are then treated as 4 concatenated 6-bit groups, each of which is translated into a single character in the base 64 alphabet.
Each 6-bit group is used as an index into an array of 64 printable characters. The character referenced by the index is placed in the output string.
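The encoding steps in the excerpt can be sketched in a few lines of Python. This is only a minimal illustration of the RFC's description, not how the standard `base64` module is implemented; `ALPHABET` and `encode_manual` are names made up for this example.

```python
import string

# Base64 alphabet from RFC 4648: A-Z, a-z, 0-9, then "+" and "/"
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_manual(data: bytes) -> str:
    """Hypothetical helper: encode bytes to Base64 by following the RFC steps."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Concatenate up to three 8-bit groups into one bit string
        bits = "".join(f"{byte:08b}" for byte in chunk)
        # Right-pad with zero bits so the length is a multiple of 6
        bits += "0" * (-len(bits) % 6)
        # Each 6-bit group is an index into the alphabet
        chars = [ALPHABET[int(bits[j:j + 6], 2)] for j in range(0, len(bits), 6)]
        # Fill the 4-character output group with "=" padding if needed
        out.append("".join(chars).ljust(4, "="))
    return "".join(out)

print(encode_manual(b"foo"))  # Zm9v
print(encode_manual(b"fo"))   # Zm8=
```

The inputs `foo` and `fo` are taken from the test vectors in RFC 4648, so the output can be checked against the RFC itself.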
Probably the most foundational bit of information is that Base64 relies on a static Base64 alphabet, as defined in RFC 4648, consisting of uppercase letters, lowercase letters, digits, and the symbols +, /, and =, where “=” is used for padding only.
Base64 alphabet table:
Index | Binary | Char | Index | Binary | Char | Index | Binary | Char | Index | Binary | Char
---|---|---|---|---|---|---|---|---|---|---|---
0 | 000000 | A | 16 | 010000 | Q | 32 | 100000 | g | 48 | 110000 | w
1 | 000001 | B | 17 | 010001 | R | 33 | 100001 | h | 49 | 110001 | x
2 | 000010 | C | 18 | 010010 | S | 34 | 100010 | i | 50 | 110010 | y
3 | 000011 | D | 19 | 010011 | T | 35 | 100011 | j | 51 | 110011 | z
4 | 000100 | E | 20 | 010100 | U | 36 | 100100 | k | 52 | 110100 | 0
5 | 000101 | F | 21 | 010101 | V | 37 | 100101 | l | 53 | 110101 | 1
6 | 000110 | G | 22 | 010110 | W | 38 | 100110 | m | 54 | 110110 | 2
7 | 000111 | H | 23 | 010111 | X | 39 | 100111 | n | 55 | 110111 | 3
8 | 001000 | I | 24 | 011000 | Y | 40 | 101000 | o | 56 | 111000 | 4
9 | 001001 | J | 25 | 011001 | Z | 41 | 101001 | p | 57 | 111001 | 5
10 | 001010 | K | 26 | 011010 | a | 42 | 101010 | q | 58 | 111010 | 6
11 | 001011 | L | 27 | 011011 | b | 43 | 101011 | r | 59 | 111011 | 7
12 | 001100 | M | 28 | 011100 | c | 44 | 101100 | s | 60 | 111100 | 8
13 | 001101 | N | 29 | 011101 | d | 45 | 101101 | t | 61 | 111101 | 9
14 | 001110 | O | 30 | 011110 | e | 46 | 101110 | u | 62 | 111110 | +
15 | 001111 | P | 31 | 011111 | f | 47 | 101111 | v | 63 | 111111 | /
Padding | =
Each of the characters in the Base64 alphabet is case sensitive and has its own assigned index (Ex. W = 22, w = 48).
So, working from the unknown Base64 string we generated earlier, cm9tYW50aWM=, we can start to break this down.
c = Index: 28, Binary: 011100
m = Index: 38, Binary: 100110
9 = Index: 61, Binary: 111101
t = Index: 45, Binary: 101101
Y = Index: 24, Binary: 011000
W = Index: 22, Binary: 010110
5 = Index: 57, Binary: 111001
0 = Index: 52, Binary: 110100
a = Index: 26, Binary: 011010
W = Index: 22, Binary: 010110
M = Index: 12, Binary: 001100
= = Padding
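The same lookup can be reproduced with a short Python sketch, which is handy for checking the hand-made breakdown. `ALPHABET` and `describe` are names invented for this example; the index of each character is simply its position in the alphabet string.

```python
import string

# Base64 alphabet from RFC 4648: A-Z, a-z, 0-9, then "+" and "/"
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def describe(b64_text: str) -> list[str]:
    """Hypothetical helper: one line per character with its index and 6-bit binary."""
    rows = []
    for ch in b64_text:
        if ch == "=":
            # Padding is not part of the 64-character alphabet and has no index
            rows.append("= = Padding")
        else:
            idx = ALPHABET.index(ch)
            rows.append(f"{ch} = Index: {idx}, Binary: {idx:06b}")
    return rows

for row in describe("cm9tYW50aWM="):
    print(row)
```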
As noted in the RFC, each character represents a 6-bit value, and characters are processed in groups of 4: the four 6-bit values are concatenated into a 24-bit group, which is then split into 3 8-bit groups. So we can start by concatenating the binary values of our first 4 characters into a 24-bit group and then splitting the result into 3 8-bit groups as follows.
c = Index: 28, Binary: 011100
m = Index: 38, Binary: 100110
9 = Index: 61, Binary: 111101
t = Index: 45, Binary: 101101
Concatenate to 24 bits: “011100” + “100110” + “111101” + “101101” = “011100100110111101101101”
Split into 8-bit groups: “011100100110111101101101” = “01110010”, “01101111”, “01101101”
Now we can decode these 8-bit groups with the standard ASCII character table. (https://www.rapidtables.com/code/text/ascii-table.html)
01110010 = r
01101111 = o
01101101 = m
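The concatenate-and-split step for one full group of 4 characters can be sketched in Python as below. `decode_group` is a hypothetical helper for this example only, and it assumes a complete, unpadded group whose bytes are ASCII.

```python
import string

# Base64 alphabet from RFC 4648: A-Z, a-z, 0-9, then "+" and "/"
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def decode_group(group: str) -> str:
    """Hypothetical helper: decode one unpadded 4-character Base64 group."""
    # Four 6-bit values concatenated into one 24-bit string
    bits = "".join(f"{ALPHABET.index(ch):06b}" for ch in group)
    # Split the 24 bits into three 8-bit groups
    octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # Look each 8-bit value up as an ASCII character
    return "".join(chr(int(octet, 2)) for octet in octets)

print(decode_group("cm9t"))  # rom
```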
Now repeat for each remaining group of 4 characters in our Base64 string.
Y = Index: 24, Binary: 011000
W = Index: 22, Binary: 010110
5 = Index: 57, Binary: 111001
0 = Index: 52, Binary: 110100
Concatenate to 24 bits: “011000010110111001110100”
8-bit conversion:
01100001 = a
01101110 = n
01110100 = t
Repeat for our final group.
a = Index: 26, Binary: 011010
W = Index: 22, Binary: 010110
M = Index: 12, Binary: 001100
= = Padding
Concatenate to 18 bits (the “=” padding character contributes no bits): “011010010110001100”
8-bit conversion:
01101001 = i
01100011 = c
00 = leftover bits, dropped (not enough to form an 8-bit group)
So our final decoded string turns out to be romantic
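Putting the whole walkthrough together, including the padded final group, might look like the following Python sketch. It is a learning aid under simple assumptions (valid input, standard alphabet, no whitespace), not a replacement for `base64.b64decode`; `decode_manual` is a made-up name.

```python
import string

# Base64 alphabet from RFC 4648: A-Z, a-z, 0-9, then "+" and "/"
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def decode_manual(b64_text: str) -> bytes:
    """Hypothetical helper: decode a valid Base64 string step by step."""
    # Padding characters carry no data, so strip them first
    chars = b64_text.rstrip("=")
    # Every remaining character contributes its 6-bit index
    bits = "".join(f"{ALPHABET.index(ch):06b}" for ch in chars)
    # Keep only complete 8-bit groups; the leftover bits are dropped
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

print(decode_manual("cm9tYW50aWM=").decode("ascii"))  # romantic
```

Note that this sketch does no validation: a character outside the alphabet raises `ValueError` from `str.index`, and malformed padding is silently accepted.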
A quick read and a straightforward process, but it's nice to have a better understanding of the stuff we see and kind of take for granted every day.