Starting Address of Functions in C

What’s there to C?

In the post about function pointers, there was a mention of the function’s name being the function’s starting address. It was also mentioned that we will take a look at how that happens. You will have to bear with me a little bit since we will be taking a brief look at the code generated by three of the four stages of the C compilation process. We will be skipping only the preprocessing step.

First things first

Let us briefly examine the memory layout for a program when it is being executed. Ignoring some compiler-specific sections, we can say that for each program, memory will be divided into five sections, namely, stack, heap, data, bss/comm, and text. How these sections are used is as follows:

Stack – all local (automatic) variables will be on the stack, and each function call will have its own “stack frame”

Heap – any memory that is allocated by calls to malloc/calloc/realloc (dynamically allocated memory) will be on the heap.

bss/comm – conventionally, all uninitialized global variables will be in this section

Data – all initialized global variables will be in this section

Text – the actual code that forms a program (all the statements in a function) will be in the text section.

It is pretty obvious that, if the statements in a program are going to be in some part of memory, there will be addresses associated with the statements. And there will be one statement that will be the first statement in every function. The address of that first statement of a function will be the starting address of that function.

In the next section we will take a look at the following:

X86 assembly code that is generated by the compiler (output of compilation, the second stage after preprocessing)
disassembled object code (output of assembling, the third stage of compilation)
disassembled executable image (output of linking, the fourth stage of compilation) of a program.

An important point to understand is that whenever a label is created in C or assembly code (a label is a symbol followed by a colon (:)), it basically represents a memory address.

The next point of interest is what will a function call in C look like in assembly. If we are looking at X86 64-bit architecture, a function call will become the callq (call instruction, with the suffix q for 64-bit version) instruction. The operand to this instruction has to be the address of the function being called.

In the world of ARM 64 architecture, the instruction would be bl (branch and link) and again the operand has to be the address of the function being called.

The output after the compilation stage

First, a look at a simple C program.

#include <stdio.h>

int func (int a, int b)

{

int ret = a + b;

return ret;

}

int main ()

{

int i = 33, j = 44, res;

res = func (i, j);

return 0;

}

Below is a slightly truncated version of the X86 assembly code generated by the compiler. Remember, right now we are only interested in the labels and not the assembly code.

In this truncated output, we can see 6 labels – func, .LFB0, .LFE0, main, .LFB1 and .LFE1. Out of these, the labels starting with .LF have been generated by the compiler, whereas func and main are from the C source file.

Also, note that between the labels func and .LFB0 (also between main and .LFB1) there is no assembly code and so the addresses will be the same for both labels.

So now it is clear that the names of the functions func and main have become labels. Now let us find out the actual addresses (relocatable address) of these functions.

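Also, note that one single statement in C will be made up of multiple assembly statements. For example, the statement int ret = a + b in the function func has become 4 statements: the two movl instructions that load from -20(%rbp) and -24(%rbp), the addl instruction, and the movl instruction that stores %eax to -4(%rbp).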

func:

.LFB0:

pushq %rbp

movq %rsp, %rbp

movl %edi, -20(%rbp)

movl %esi, -24(%rbp)

movl -20(%rbp), %edx

movl -24(%rbp), %eax

addl %edx, %eax

movl %eax, -4(%rbp)

movl -4(%rbp), %eax

popq %rbp

ret

.LFE0:

main:

.LFB1:

pushq %rbp

movq %rsp, %rbp

subq $16, %rsp

movl $33, -4(%rbp)

movl $44, -8(%rbp)

movl -8(%rbp), %edx

movl -4(%rbp), %eax

movl %edx, %esi

movl %eax, %edi

call func

movl %eax, -12(%rbp)

movl $0, %eax

leave

ret

.LFE1:

And what does the object file created by the assembler look like?

The next stage in the compilation process is assembling. The assembler takes the assembly code output of the compilation stage (shown above) and translates it to the machine statements or instructions. The output of the assembler is called object code.

These instructions will have the format opcode <operand list>. Some instructions do not have any operands while others have one or more operands.

The output is in binary format and is not human-readable. We use a tool called objdump (there are other tools also) to convert the object code into a human-readable version. The output that we get when we use a tool like objdump is referred to as disassembled code.

The disassembled output produced by running objdump on the object code of our program is shown below. Again, it is a slightly truncated version.

A brief explanation of the disassembled code is below.

First, the line containing <func>:. This is just the label corresponding to the function, and its offset is 0 (the offset of the first instruction, on the next line). There is a similar line with a label for the main function.

All the other lines start with a hexadecimal number followed by a ‘:’ and one or more hexadecimal numbers. After that, what is shown is the disassembled assembly code.

The first hexadecimal number is the offset of the statement/instruction.

The hexadecimal number(s) that follow are the object code representing the statements of the assembly code generated from the original C program.

And if we compare the hexadecimal numbers with the corresponding assembly statements, we can see that 55 is the object code (or, more appropriately, the opcode) for the assembly statement “push %rbp”.

As already mentioned, a function call in the 64-bit X86 architecture is the callq instruction. So in the main function, the callq instruction has to be present with the address of func as its operand.

Though there is a callq 3f statement at offset 3a, there is nothing that indicates that it is actually a call to the function func. The four operand bytes are all zero, so the disassembler simply shows the offset of the next instruction (3f) as the target. That is because this code has not gone through the process of linking, which is when the real operand is filled in. But we can see that in the text section of this object file, the offset of func is 0x0 and the offset of main is 0x1a.

0000000000000000 <func>:

0: 55 push %rbp

1: 48 89 e5 mov %rsp,%rbp

4: 89 7d ec mov %edi,-0x14(%rbp)

7: 89 75 e8 mov %esi,-0x18(%rbp)

a: 8b 55 ec mov -0x14(%rbp),%edx

d: 8b 45 e8 mov -0x18(%rbp),%eax

10: 01 d0 add %edx,%eax

12: 89 45 fc mov %eax,-0x4(%rbp)

15: 8b 45 fc mov -0x4(%rbp),%eax

18: 5d pop %rbp

19: c3 retq

000000000000001a <main>:

1a: 55 push %rbp

1b: 48 89 e5 mov %rsp,%rbp

1e: 48 83 ec 10 sub $0x10,%rsp

22: c7 45 fc 21 00 00 00 movl $0x21,-0x4(%rbp)

29: c7 45 f8 2c 00 00 00 movl $0x2c,-0x8(%rbp)

30: 8b 55 f8 mov -0x8(%rbp),%edx

33: 8b 45 fc mov -0x4(%rbp),%eax

36: 89 d6 mov %edx,%esi

38: 89 c7 mov %eax,%edi

3a: e8 00 00 00 00 callq 3f <main+0x25>

3f: 89 45 f4 mov %eax,-0xc(%rbp)

42: b8 00 00 00 00 mov $0x0,%eax

47: c9 leaveq

48: c3 retq

And finally, look at the disassembled executable code

The executable code is the output of the linking stage of the compilation process. This is again in binary format and is disassembled using the same objdump command.

The look and feel of the executable is very similar to the object that we looked at above.

The main difference is that all the labels and statements now have a relocatable address instead of offsets. Thus func:, the label corresponding to the function func, now has the relocatable address 0x4004ad.

Again, the disassembled code below is a (hugely) truncated version of the disassembly of the executable code. Please note that even after the linking is completed and we get the executable file, the addresses of the functions and the statements in the functions are not physical memory addresses. They are the virtual addresses assigned by the linker (and, for a position-independent executable, only offsets from the load address); the final addresses will be available only when the program is loaded and executing.

As we saw already in the previous section, what is of interest to us is the callq instruction.

Now the callq instruction, at relocatable address 0x4004e7, has the hexadecimal value 0x4004ad as its operand, and 0x4004ad is indeed the relocatable address of the function func.

00000000004004ad <func>:

4004ad: 55 push %rbp

4004ae: 48 89 e5 mov %rsp,%rbp

4004b1: 89 7d ec mov %edi,-0x14(%rbp)

4004b4: 89 75 e8 mov %esi,-0x18(%rbp)

4004b7: 8b 55 ec mov -0x14(%rbp),%edx

4004ba: 8b 45 e8 mov -0x18(%rbp),%eax

4004bd: 01 d0 add %edx,%eax

4004bf: 89 45 fc mov %eax,-0x4(%rbp)

4004c2: 8b 45 fc mov -0x4(%rbp),%eax

4004c5: 5d pop %rbp

4004c6: c3 retq

00000000004004c7 <main>:

4004c7: 55 push %rbp

4004c8: 48 89 e5 mov %rsp,%rbp

4004cb: 48 83 ec 10 sub $0x10,%rsp

4004cf: c7 45 fc 21 00 00 00 movl $0x21,-0x4(%rbp)

4004d6: c7 45 f8 2c 00 00 00 movl $0x2c,-0x8(%rbp)

4004dd: 8b 55 f8 mov -0x8(%rbp),%edx

4004e0: 8b 45 fc mov -0x4(%rbp),%eax

4004e3: 89 d6 mov %edx,%esi

4004e5: 89 c7 mov %eax,%edi

4004e7: e8 c1 ff ff ff callq 4004ad <func>

4004ec: 89 45 f4 mov %eax,-0xc(%rbp)

4004ef: b8 00 00 00 00 mov $0x0,%eax

4004f4: c9 leaveq

4004f5: c3 retq

4004f6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

4004fd: 00 00 00

So the name of a function indeed gives the starting address of it.

NOTE:

When using the gcc/clang compilers (on Linux/UNIX platforms), the following commands can be used to generate the different outputs shown above (assembly, object file and executable). Assuming that the C source file is called pgm.c, the various commands would be:

gcc -S pgm.c – This will generate pgm.s file

gcc -c pgm.s – This will generate pgm.o file (you can use pgm.c instead of pgm.s)

gcc pgm.o -o pgm – This will generate an executable called pgm (you can use pgm.c instead of pgm.o)

Again on Linux/UNIX platforms, the program to use to disassemble the .o file and the executable file is objdump. You can use this program to disassemble the object code/executable and also to see all the global symbols (variables and functions). Note that this program will just print its output on the screen.

objdump -d pgm – disassemble the code section of the executable file pgm.

objdump -t pgm – print the entries of the symbol table, including the global symbols.


Author

Venu Kolathur

Venu Kolathur is Chief Architect and Co-Founder at Vayavya Labs and has over 38 years of industry & academic experience. He is responsible for product technology road-map, and design strategies.
