ISA, opcodes, mnemonics, Dinosaurs and AI
Prerequisites: This post assumes basic familiarity with computer architecture concepts and assembly language.
It's the year 2081. Human civilization depends entirely on advanced AI systems, thanks to the AI revolution that began early in the 21st century. Ever since AI started reading minds, computer programmers have gone extinct, joining the dinosaurs in the long timeline of life on Earth.
Suddenly, a mysterious glitch emerges, causing the global supercomputer "FIKA" to fail, plunging earth into disaster: No internet, no electricity, cars can't fly anymore, politicians are silenced mid glorious election speech ... utter chaos! Fortunately for humanity, one person still exists who understands ancient machine languages: X.
Upon arriving at FIKA's main facility, X discovers that the high-level control software is irreversibly corrupted, and the usual repair methods are no longer possible. The only functional part is an emergency machine that accepts only binary code, connected to an ancient device called 'keyboard' and a blinking prompt showing the following:
$FIKA-EMERGENCY>
To reboot the system and save humanity, X must manually enter binary instructions to calculate the result of the impossible to solve equation: 40 + 2
Inspecting the Machine
Recognizing the urgency, X tries to gather as much information as possible to determine what instructions the machine understands. After researching and having some discussions with people who work there, X realizes a crucial piece of information: "The machine was running Intel Core i3 !!!"
This discovery was a pleasant surprise, as X was somewhat familiar with the ancient x86 architecture used by Intel, which offered a glimpse of hope. However, without internet access, X urgently contacted the government for immediate help to retrieve humanity's last surviving copy of the "Intel x86 Instruction Set Manual," carefully preserved behind bulletproof glass at the "History of Computing Museum" in London, stored beside an ancient vacuum tube.
So what is an "ISA"?
According to Wikipedia:
"ISA defines the supported instructions, data types, registers, the hardware support for managing main memory, fundamental features (such as the memory consistency, addressing modes, virtual memory), and the input/output model of implementations of the ISA"
The part we care about the most is understanding that the instruction set architecture (ISA) is the interface that defines the rules of communication between software and hardware, among other things. If we want to write code that the hardware understands, we need to look at the instructions this hardware understands, and those are listed in ISA manuals. In our example, since we have an Intel machine, we know that the CPU is based on the x86 architecture, as Intel used this architecture in its i3 series and many others.
If we consult the x86 manual which can be found here, we will notice that there are around five thousand pages, but the part we care about is in one volume. If you open Volume 2 (which includes 2A, 2B, 2C and 2D), you will see a list of all instructions supported by this architecture, along with extensive information about each instruction and other essential details.
Assuming you're already familiar with the basics of assembly language, we will proceed with the assumption that the code needed to save the world is basically made of two main instructions:
- Storing the value 40 into a register.
- Adding 2 to that same register.
This is more than enough for our case.
Opcodes and mnemonics
For this section, we will need to pay attention to two parts of the manual:
First, let's open the Volume 2 we talked about and go to page 134. On this page you will find the ADD instruction table:
Then let's move to page 772 and look at the lower half of the MOV instruction table.
There is a lot of information to unpack here, but let's focus on two main things:
- Opcode
- Mnemonic (yes, you can't find it in the image, but we will get there shortly)
A common confusion when talking about assembly is whether it is machine language, a representation of machine language, or just the closest thing to a machine language. Our aim here is to clarify this point before bringing AI back to life and saving the world.
As mentioned earlier, our task involves two main instructions.
First, we start by storing 40 into a register. For that, we should do something like:
mov al, 40
This will move the number 40 into the register al, which is the lower 8 bits of rax on 64-bit systems and eax on 32-bit systems. Now the question is: what happens when we assemble this code? Let's find out:
$ nasm -f bin file.asm -o output.bin
$ hexdump -C output.bin
00000000 b0 28 |.(|
We use nasm and specify a binary output, then inspect the generated binary using hexdump. You will notice the binary contains two bytes: b0 and 28. Now let's compare this to the MOV table above, specifically the row showing B0+ rb ib under opcode. You'll see both start with b0, which is 1011_0000 in binary.
But what does +rb mean? Looking at page 104 in the manual, we see:
"+rb, +rw, +rd, +ro — Indicated the lower 3 bits of the opcode byte is used to encode the register operand..."
So this is what tells the CPU which register the instruction targets.
Taking a look at the table on the same page, we see that when copying a value to register AL, the reg field is 0, meaning the lower 3 bits will be set to zero: 1011_0[000].
If we were copying to CL, the reg field would be 1 and the bits would become 1011_0001. Let's verify that:
mov cl, 40
As before, we assemble with nasm and check the bytes with hexdump:
$ nasm -f bin file.asm -o output.bin
$ hexdump -C output.bin
00000000 b1 28 |.(|
Notice how the first byte is now b1, which is 1011_0001, as we expected above!
So now we understand what +rb is, but what about ib? Again, on the same page, we see the following:
"ib, iw, id, io — A 1-byte (ib), 2-byte (iw), 4-byte (id) or 8-byte (io) immediate operand to the instruction that follows the opcode"
Going back to the same row in the MOV table, we see the description of the opcode B0+ rb ib: "Move imm8 to r8", which is in fact why we're looking at this row. In assembly, we're often moving data from register to register, from register to memory, from memory to register, and so on. Another possible combination is moving an immediate value to a register, which is our case: we're moving the value 40 directly into al, not loading it from any memory location.
And ib is used for that: the immediate byte that follows the opcode contains the exact value that needs to be moved to al.
And indeed, checking the hexdump output again, we see that the second byte is 28 in hexadecimal, which is 0010_1000 in binary and 40 in decimal.
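If you'd rather not do the base conversions by hand, here is a small Python sketch that prints the two bytes from the hexdump in hexadecimal, binary and decimal. The helper name nibbles is just something made up for this post.

```python
def nibbles(value: int) -> str:
    """Format one byte as two underscore-separated groups of four bits."""
    bits = format(value, "08b")
    return bits[:4] + "_" + bits[4:]


for byte in (0xB0, 0x28):
    print(f"hex {byte:02x} = binary {nibbles(byte)} = decimal {byte}")
# hex b0 = binary 1011_0000 = decimal 176
# hex 28 = binary 0010_1000 = decimal 40
```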
The whole point of this investigation is to show one important thing:
When we moved 40 to AL, we used the instruction MOV, which generated a specific opcode, but when we moved 40 to CL, we got a different opcode, since the lowest bit changed.
1011_0000 --> mov to al
1011_0001 --> mov to cl
The key idea here is that the opcode is the exact machine language the CPU is designed to fetch, decode and execute. If you ever studied computer architecture or researched a bit about how computers work, you've probably encountered the famous trio (which is a simplified version of what truly happens nowadays):
Fetch, decode, execute
So we 'fetch' the instructions from memory, but what is the thing that gets decoded? The CPU examines the bytes, decodes them, figures out the instruction (MOV), identifies the target by inspecting the lower 3 bits, and proceeds accordingly. An opcode can be longer than 1 byte: in x86, opcodes can be up to three bytes long, and a full instruction (with prefixes and operands) can reach 15 bytes, so the decoding process can be more complicated than just looking at a few bits.
If interested, check page 31 of the manual to see instruction format and how long an instruction could get.
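To make the 'decode' step a bit more concrete, here is a toy Python sketch that recognizes only the B0+ rb ib form we looked at. A real x86 decoder handles prefixes, multi-byte opcodes and much more, so treat this as an illustration under that big simplification. The function name and the returned tuple shape are invented for this post.

```python
# Index in this list == register number stored in the lower 3 bits of the opcode.
REG8_NAMES = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]


def decode_mov_r8_imm8(code: bytes):
    """Decode a two-byte `mov r8, imm8` instruction (toy decoder)."""
    opcode, imm = code[0], code[1]
    # The upper 5 bits must be 1011_0 for the B0+ rb family.
    if opcode & 0b1111_1000 != 0b1011_0000:
        raise ValueError(f"not a B0+ rb opcode: {opcode:#04x}")
    # The lower 3 bits select the destination register.
    return ("mov", REG8_NAMES[opcode & 0b111], imm)


print(decode_mov_r8_imm8(bytes([0xB0, 0x28])))  # ('mov', 'al', 40)
print(decode_mov_r8_imm8(bytes([0xB1, 0x28])))  # ('mov', 'cl', 40)
```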
So, if that's an opcode, then what are MOV, ADD, SUB and all these words we use when writing assembly? They are called mnemonics: human-readable forms of the machine opcodes. No one will remember binary numbers and write code using 0s and 1s (except for X in 2081), so a convention using regular letters was created to simplify the task of writing code. The CPU does not care whether you write MOV and ADD; it only needs the right binary format. Those who do care, though, are the people who develop assemblers: they are the ones who map these mnemonics to their corresponding opcodes. In fact, as an exercise for motivated readers, you can implement your own assembler and name the mnemonic CUSTOM_MOV or whatever you want, and make it generate the right opcodes we saw above, and things will still work fine.
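As a starting point for that exercise, here is a deliberately tiny Python sketch of such an "assembler". It supports a single instruction form (an 8-bit register plus an immediate byte) and accepts both MOV and the made-up CUSTOM_MOV. Everything here (the table names, assemble_line, the accepted syntax) is an assumption made for illustration, not how any real assembler is built.

```python
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}

# The mnemonic is just a name: both spellings map to the same opcode base.
MNEMONICS = {"mov": 0xB0, "custom_mov": 0xB0}


def assemble_line(line: str) -> bytes:
    """Assemble one `<mnemonic> <reg8>, <imm8>` line into machine code."""
    mnemonic, operands = line.split(None, 1)
    reg, imm_text = [part.strip() for part in operands.split(",")]
    imm = int(imm_text, 0)  # accepts decimal ("40") or hex ("0x28")
    if not 0 <= imm <= 0xFF:
        raise ValueError("immediate must fit in one byte")
    return bytes([MNEMONICS[mnemonic.lower()] | REG8[reg.lower()], imm])


print(assemble_line("mov al, 40").hex(" "))         # b0 28
print(assemble_line("CUSTOM_MOV al, 40").hex(" "))  # b0 28
```

Both lines produce the same two bytes, which is the whole point: the CPU never sees the mnemonic.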
One last thing to point out is that the full machine code does not only depend on the mnemonic you write, it also depends on the operands, as demonstrated when switching between AL and CL. Each combination of mnemonic + operands will generate a different instruction. Note the word instruction here; a full instruction is NOT equivalent to a mnemonic alone, it is the mnemonic and its operands. Looking back at the manual we see this clearly:
Bringing FIKA back to life
We're now ready to complete our task and save the world. We started with:
mov al, 40 --> 1011_0000 0010_1000
Now we need to figure out the remaining piece of the puzzle. What does add al, 2 translate to in binary?
According to the ADD table shown earlier, adding an immediate byte to AL has the opcode 04 ib, and we know by now that ib is the immediate byte following the opcode, which should contain the value we want to add to al: 2 in our case. Thus we can conclude that the full encoding of this instruction will be: 0000_0100 0000_0010.
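Putting both instructions together, the whole program can be built by hand in Python as a sanity check before trusting our reading of the tables. This is a sketch with a hypothetical helper name (encode_add_al_imm8). It covers only the 04 ib form, which always targets AL.

```python
def encode_add_al_imm8(imm: int) -> bytes:
    """Encode `add al, imm8` using the 04 ib form (hypothetical helper)."""
    if not 0 <= imm <= 0xFF:
        raise ValueError("immediate must fit in one byte")
    return bytes([0x04, imm])


# mov al, 40 (B0+ rb ib with rb = 0 for AL), followed by add al, 2.
program = bytes([0xB0, 40]) + encode_add_al_imm8(2)

print(program.hex(" "))                            # b0 28 04 02
print(" ".join(format(b, "08b") for b in program))
# 10110000 00101000 00000100 00000010
```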
Let's verify our expectation again with hexdump:
mov al, 40
add al, 2
$ nasm -f bin test.asm -o output.bin
$ hexdump -C output.bin
00000000 b0 28 04 02 |.(..|
With that, X is now equipped with all the knowledge necessary to save the world! All they need to do is pass the following sequence of bytes to the prompt:
$FIKA-EMERGENCY> 10110000 00101000 00000100 00000010
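Since none of us has a FIKA emergency console at hand, here is a toy Python stand-in that takes the same string of bits and "executes" it. It understands only the two instruction forms from this post (B0+ rb ib and 04 ib) and models nothing else about a real CPU. The function name run and its behavior on unknown opcodes are assumptions made for this sketch.

```python
def run(bits: str) -> int:
    """Execute a space-separated string of binary bytes; return the value of AL."""
    code = bytes(int(chunk, 2) for chunk in bits.split())
    regs = [0] * 8  # al, cl, dl, bl, ah, ch, dh, bh
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode & 0xF8 == 0xB0:    # mov r8, imm8
            regs[opcode & 0x07] = code[pc + 1]
        elif opcode == 0x04:         # add al, imm8 (wraps around at 8 bits)
            regs[0] = (regs[0] + code[pc + 1]) & 0xFF
        else:
            raise ValueError(f"unsupported opcode {opcode:#04x}")
        pc += 2  # both supported forms are exactly two bytes long
    return regs[0]


print(run("10110000 00101000 00000100 00000010"))  # 42
```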
And who could have guessed that, in an era dominated by super AGI, the hero saving the world would be someone writing literal binary code...
Read more about 42: The Answer to the Ultimate Question of Life, the Universe, and Everything is 42
Keep coding and have fun! A few decades from now, you might be "X". :]