Mikrocontroller.net - User contributions [de]

ARM-ASM-Tutorial

2021-07-09T16:21:24Z

Nicolas: Literature links to static.docs.arm.com are reachable again

[[Kategorie:ARM]][[Kategorie:STM32]][[Kategorie:Entwicklungstools]][[Kategorie:Programmiersprachen]]The [[ARM]] processor architecture is widely used in all kinds of industrial applications and also a significant number of hobby and maker projects. This tutorial aims to teach the fundamentals of programming ARM processors in assembly language.

Tutorial by [[Benutzer:Erlkoenig|Niklas Gürtler]]. [https://www.mikrocontroller.net/topic/482409 Thread in Forum] for feedback and questions.

== Introduction ==
=== Why assembly? ===
Today, there is actually little reason to use assembly language for entire projects, because high-quality optimizing compilers for high-level languages (especially C and C++) are readily available as free open source software and because the ARM architecture is specifically optimized for high-level languages. However, knowledge of assembly is still useful for debugging certain problems, writing low-level software such as bootloaders and operating system kernels, and reverse engineering software for which no source code is available. Occasionally it is necessary to manually optimize a performance-critical code section. Sometimes claims are made that ARM processors can’t be programmed in assembly. This tutorial demonstrates that this is very well possible by showing how to write (small) applications entirely in ARM assembly language!

As most of the resources and tools for ARM focus on C programming and because of the complexity of the ARM ecosystem, the largest difficulty in getting started with ARM assembly is not the language itself, but rather using the tools correctly and finding relevant documentation. Therefore, this tutorial will focus on the development environment and how the written assembly code is transformed into the final program. With a good understanding of the environment, all the ARM instructions can be learned simply by reading the architecture documentation.

Because of the complex ecosystem around ARM, a general introduction of the ARM processor market is necessary.

=== About ARM ===
Arm Holdings is the company behind the ARM architecture. Arm does not manufacture any processors themselves, but designs the “blueprints” for processor cores, which are then licensed by various semiconductor companies such as ST, TI, NXP and many others, who combine the processor with various support hardware (most notably flash and RAM memories) and peripheral modules to produce a final complete processor IC. Some of these peripheral modules are even licensed from other companies – for example, the USB controller modules by Synopsys are found in many different processors from various manufacturers.

Because of this licensing model, ARM processor cores are found in a very large variety of products for which software can be developed using a single set of tools (especially compiler, assembler and debugger). This makes knowledge about the ARM architecture, particularly the ARM assembly language, useful for a large range of applications.

Since the ARM processor cores always require additional hardware modules to function, both the ARM-made processor core and the manufacturer-specific periphery modules have to be considered when developing software for ARM systems. For example, the instruction set is defined by ARM and software tools (compiler, assembler) need to be configured for the correct instruction set version, while the clock configuration is manufacturer-specific and needs to be addressed by initialization code specifically made for one processor.

=== Architecture and processor variants ===
A processor’s architecture defines the interface between hardware and software. Its most important part is the instruction set, but it also defines e.g. hardware behavior under exceptional circumstances (e.g. memory access errors, division by zero, etc.). Processor architectures evolve, so they have multiple versions and variants. They also define optional functionality that may or may not be present in a processor (e.g. a floating-point unit). For ARM, the architectures are documented exhaustively in the “ARM Architecture Reference Manuals”.

While the architecture is an abstract concept, a processor core is a concrete definition of a processor (e.g. as a silicon layout or HDL) that implements a certain architecture. Code that only uses knowledge of the architecture (e.g. an algorithm that does not access any periphery) will run on any processor implementing this architecture. Arm, as mentioned, designs processor cores for their own architectures, but some companies develop custom processors that conform to an ARM architecture, for example Apple and Qualcomm.

ARM architectures are numbered, starting with ARMv1 up until the most recent ARMv8. ARMv6 is the oldest architecture still in significant use, while ARMv7 is the most widespread one. Suffixes are appended to the version to denote variants of the architecture; e.g. ARMv7-M is for small embedded systems, while ARMv7-A is for more powerful processors. ARMv7E-M adds digital signal processing capabilities including saturating and SIMD operations.

Older ARM processors are named ARM1, ARM2 …, while after ARM11 the name “Cortex” was introduced. The Cortex-M family, including e.g. Cortex-M3 and Cortex-M4 (implementing ARMv7-M and ARMv7E-M architecture, respectively) is designed for microcontrollers, where power consumption, memory size, chip size and latency are important. The Cortex-A family, including e.g. Cortex-A8 and Cortex-A17 (both implementing ARMv7-A architecture) is intended for powerful processors (called “application processors”) for e.g. multimedia and communication products, particularly smartphones and tablets. These processors have much more processing power, typically feature high-bandwidth interfaces to the external world, and are designed to be used with high-level operating systems, most notably Linux (and Android).

An overview of ARM processors and their implemented architecture version can be found on [https://en.wikipedia.org/wiki/ARM_architecture#Cores Wikipedia].
This tutorial will focus on the Cortex-M microcontrollers, as these are much easier to program without an operating system and because assembly language is less relevant on Cortex-A processors. However, the large range of ARM-based devices necessitates flexibility in the architecture specification and software tools, which sometimes complicates their use.

There is actually not a single, but three instruction sets for ARM processors:
* The “A32” instruction set for 32bit ARM architectures, also simply called “ARM” instruction set, favors speed over program memory consumption. All instructions are 4 bytes in size.
* The “A64” instruction set is for the new 64bit ARM processors
* The “T32” instruction set for 32bit ARM architectures, also known as “Thumb”, favors program memory consumption over speed. Most instructions are 2 bytes in size, and some are 4 bytes.

The 64bit Cortex-A application processors support all three instruction sets, while the 32bit ones only A32 and T32. The Cortex-M microcontrollers only support T32. Therefore, this tutorial will only talk about “thumb2”, the second version of the “T32” instruction set.

== Prerequisites ==
First, suitable hardware and software need to be selected for demonstrating the usage of assembly language. For this tutorial, the choice of the specific microcontroller is of no great significance. However, to ensure that the example codes are easily transferable to your setup, it is recommended to use the same components.

=== Microcontroller selection ===
For the microcontroller, an [https://www.st.com/en/microcontrollers-microprocessors/stm32f103c8.html STM32F103C8] or [https://www.st.com/en/microcontrollers-microprocessors/stm32f103rb.html STM32F103RB] by STMicroelectronics will be used. Both controllers are identical except for the flash size (64 KiB vs 128 KiB) and number of pins (48 vs 64). These controllers belong to ST’s “mainstream” entry-level family and are quite popular among hobbyist developers with many existing online resources. Several development boards with these controllers are available, for example: [https://www.st.com/en/evaluation-tools/nucleo-f103rb.html Nucleo-F103], “[https://os.mbed.com/users/hudakz/code/STM32F103C8T6_Hello/ Blue Pill]” (search for “stm32f103c8t6” on AliExpress, Ebay or Amazon), [https://www.olimex.com/Products/Duino/STM32/OLIMEXINO-STM32/open-source-hardware Olimexino-STM32], [https://www.olimex.com/Products/ARM/ST/STM32-P103/ STM32-P103], [https://www.olimex.com/Products/ARM/ST/STM32-H103/ STM32-H103], [https://www.st.com/en/evaluation-tools/stm3210e-eval.html STM3210E-EVAL].

=== Processor type & documentation ===
First, the microcontroller manufacturer’s documentation is used to find out what kind of ARM processor core and architecture is used for the chosen chip. This information is used to find all the relevant documentation.

* The first source of information is the [https://www.st.com/resource/en/datasheet/stm32f103rb.pdf STM32F103RB/C8 datasheet]. According to the headline, this is a '''medium-density''' device. This term is ST-specific and denotes a product family with certain features. The very first paragraph states that this microcontroller uses a '''Cortex-M3''' processor core with 72 MHz. This document also contains the electrical characteristics and pinouts.
* The next important document is the [https://www.st.com/resource/en/reference_manual/cd00171190.pdf STM32F103 reference manual] that contains detailed descriptions of the periphery. Particularly, detailed information about periphery registers and bits can be found here.
* The [https://developer.arm.com/ip-products/processors/cortex-m/cortex-m3 ARM developer website] provides information about the Cortex-M3 processor core, particularly the [https://developer.arm.com/documentation/ddi0337/e ARM Cortex-M3 Processor Technical Reference Manual] (the manual can be downloaded as a PDF). According to chapter 1.5.3, this processor implements the '''ARMv7-M architecture'''.
* The architecture is documented in the [https://developer.arm.com/documentation/ddi0403/eb/ ARMv7M Architecture Reference Manual]. Particularly, it contains the complete documentation of the instruction set.
* The [https://developer.arm.com/documentation/qrc0001/m/ ARM and Thumb-2 Instruction Set Quick Reference Card] contains all ARM and Thumb-2 instructions, but is a bit cluttered because of its brevity.

For any serious STM32 development, you should be familiar with all these documents.

=== Debug adapter ===
There are many different ways of getting your program to run on an STM32 controller. A debug adapter is not only capable of writing software to the controller’s flash, but can also analyze the program’s behavior while it is running. This allows you to run the program one instruction at a time, analyze program flow and memory contents and find the cause of crashes. While it is not strictly necessary to use such a debugger, it can save a lot of time during development. Since entry-level models are available cheaply, not using one doesn’t even save money. Debuggers connect to a host PC via USB (some via Ethernet) and to the microcontroller (“target”) via JTAG or SWD. While these two interfaces are closely related and perform the same function, SWD uses fewer pins (2 instead of 4, excluding reset and ground). Most STM32 controllers support JTAG, and all support SWD.

Documenting all possible ways of flashing and debugging STM32 controllers is beyond the scope of this tutorial; a lot of information is already available online on that topic. Therefore, this tutorial will assume that the [https://www.st.com/en/development-tools/st-link-v2.html ST-Link] debug adapter by STMicroelectronics is used, which is cheap and popular among hobbyists. Some of the aforementioned boards even include an ST-Link adapter, which can also be used “stand-alone” to flash an externally connected microcontroller. The examples should work with other adapters as well; please consult the appropriate documentation on how to use them.

=== Development Software ===
On the software side, several tools are needed for developing microcontroller firmware. Using a complete Integrated Development Environment (IDE) saves time and simplifies repetitive steps, but hides some important steps that are necessary to gain a basic understanding of the process. Therefore, this tutorial will show the usage of the basic command line tools to demonstrate the underlying principles. Of course, for productive development, using an IDE is a sensible choice. The tools presented will work on Windows, Linux and Mac OS X (untested).

First, a text editor for writing assembly code is needed. Any good editor such as Notepad++, gedit or Kate is sufficient. When using Windows, the [https://www.st.com/en/development-tools/stsw-link004.html ST-Link Utility] can be useful, but is not strictly required.

Next, an assembler toolchain is needed to translate the written assembly code into machine code. For this, the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm GNU Arm Embedded Toolchain] is used. This is a collection of open source tools for writing software in Assembly, C and C++ for Cortex-M microcontrollers. Even though the package is maintained by ARM, the software is created by a community of open-source developers. For this tutorial, only the contained applications “binutils” (includes assembler & linker) and “GDB” (debugger) are really needed, but if you later decide to work with C or C++ code, the contained compilers will come in handy. Apart from that, this package is also shipped as part of several IDEs such as SW4STM32, Atollic TrueSTUDIO, emIDE, Embedded Studio and even Arduino – so if you (later) wish to work with one of these, your assembly code will be compatible with it.

Another component is required to talk with the debug adapter. For the ST-Link, this is done by [http://openocd.org/ OpenOCD], which communicates with the adapter via USB. Other adapters such as the J-Link ship with their own software.

Lastly, a calculator that supports binary and hexadecimal modes can be very helpful. Both the default Gnome calculator and the Windows calculator (calc.exe) are suitable.

== Setup ==
Follow the instructions in the next chapters to set up your development environment.

=== Hardware ===
The only thing that needs to be done hardware-wise is connecting the debugger with your microcontroller. If you are using a development board with an integrated debugger (such as the Nucleo-F103), this is achieved by setting the jumpers accordingly (see the board’s documentation – for e.g. the Nucleo-F103, both “CN2” jumpers need to be connected). When using an external debugger, connect the “GND”, “JTMS/SWDIO” and “JTCK/SWCLK” pins of debugger and microcontroller. Connect the debugger’s “nRESET” (or “nTRST” if it only has that) pin to the microcontroller’s “NRST” input.

If your board has jumpers or solder bridges for the “BOOT0” pin, make sure that the pin is low. Applying power to the microcontroller board is typically done via USB.

=== Software ===
==== Linux ====
Some Linux distributions ship with packages for the ARM toolchain. Unfortunately, these are often outdated and also configured slightly differently than the aforementioned package maintained by ARM. Therefore, to be consistent with the examples, it is strongly recommended to use the package by ARM.

Download the Linux binary tarball from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and extract it to some directory whose path does not contain any spaces. The extracted directory contains a subdirectory called “bin”. Copy the full path to that directory (e.g. “/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin”).

Add this path to the “PATH” environment variable. On Ubuntu/Debian systems, this can be done via:

<source lang=sh>
echo 'export PATH="${PATH}:/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin"' | sudo tee /etc/profile.d/gnu-arm-embedded.sh
</source>

OpenOCD can be installed via the package manager, e.g. (Ubuntu/Debian):
<source lang=sh>
sudo apt-get install openocd
</source>

After that, log out and back in (or just reboot). In a terminal, type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source lang=sh>
$ arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source lang=sh>
$ openocd -v
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.
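As a quick sanity check on Linux, the following sketch reports which of the tools used in this tutorial can be found via “PATH”. The function name <code>check_tools</code> is made up for this illustration; only the standard shell built-in <code>command -v</code> is used.

<source lang=sh>
# Report for each given tool whether it can be found via PATH.
# Returns a non-zero status if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools arm-none-eabi-as arm-none-eabi-ld arm-none-eabi-gdb openocd ||
    echo "At least one tool is missing; check the PATH setting."
</source>

If a tool is reported as missing even though it is installed, the “PATH” change has most likely not taken effect in the current session yet.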

==== Windows ====
[[File:ArmAsmTutorial_GccOptions.png|300px|thumb|right|Options for installing GCC]]Download the Windows installer from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and run it. Enable the options “Add path to environment variable” and “Add registry information”, and disable “Show Readme” and “Launch gccvar.bat”.

A Windows package for OpenOCD can be obtained from the [https://github.com/gnu-mcu-eclipse/openocd/releases gnu-mcu-eclipse downloads page]. Download the appropriate file, e.g. “gnu-mcu-eclipse-openocd-0.10.0-12-20190422-2015-win64.zip”. The archive contains a path like “GNU MCU Eclipse/OpenOCD/0.10.0-12-20190422-2015”. Extract the contents of the inner directory (i.e. the subdirectories “bin”, “doc”, “scripts”…) into some directory whose path does not contain any spaces, e.g. “C:\OpenOCD”. You should now have a directory “C:\OpenOCD\bin” or similar. Copy its full path.

[[File:ArmAsmTutorial_PcProperties.png|300px|thumb|right|Opening PC properties]][[File:ArmAsmTutorial_SetEnvVar.png|300px|thumb|right|Setting environment variable]]Set the “Path” environment variable to include this path: Right-Click on “This PC”, then “Properties” → “Advanced System Settings”→ “Environment Variables”. In the lower list (labeled “System variables”), select “Path”. Click “Edit” → “New”, paste the path, and click “OK” multiple times.

Open a ''new'' command window (Windows Key + R, type “cmd” + Return). Type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source>
C:\>arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source>
C:\>openocd -v
GNU MCU Eclipse OpenOCD, 64-bitOpen On-Chip Debugger 0.10.0+dev-00593-g23ad80df4 (2019-04-22-20:25)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

== Writing assembly applications ==
The full source code of the examples in the following chapters can be found on [https://github.com/Erlkoenig90/ArmAsmTutorial GitHub]. The name of the corresponding directory is given after each example code below.
=== First rudimentary program ===
After the software setup, you can begin setting up a first project. Create an empty directory for that, e.g. “prog1”.

Inside the project directory, create your first assembly file “prog1.S” (“.S” being the file name extension for assembly files in GNU context) with the following content:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

nop @ Do Nothing
b . @ Endless loop
</source>
Example name: “EmptyProgram”

When this file is sent to the assembler, it will translate the instructions into binary machine code, with 2 or 4 bytes per instruction. These bytes are concatenated to form a program image, which is later written into the controller’s flash memory. Therefore, assembly code more or less directly describes flash memory contents.

The lines starting with a dot “.” are assembler directives that control the assembler’s operation. Only some of those directives emit bytes that will end up in flash memory. The @ symbol starts a comment.

The first line lets the assembler use the new “unified” instruction syntax (“UAL” - Unified Assembler Language) instead of the old ARM syntax. The second line declares the Cortex-M3 as the target processor, which the assembler needs to know in order to recognize the instructions available on that processor. The third line instructs the assembler to use the Thumb (T32) instruction set. We can’t start putting instructions in flash memory right away, as the processor expects a certain data structure to reside at the very beginning of the memory. This is what the “.word” and “.space” directives create. These will be explained later.

The first “real” instruction is “nop”, which will be the first instruction executed after the processor starts. “nop” is short for “No OPeration” - it causes the processor to do nothing and continue with the next instruction. This next instruction is “b .”. “b” is short for “branch” and instructs the processor to jump to a certain “target” location, i.e. execute the instruction at that target next. In assembly language, the dot “.” represents the current location in program memory. Therefore, “b .” instructs the processor to jump to this very instruction, i.e. execute it again and again in an endless loop. Such an endless loop is frequently found at the end of microcontroller programs, as it prevents the processor from executing random data that is located in flash memory after the program.

To translate this assembly code, open a terminal (Linux) / command window (Windows). Enter the project directory by typing <code>cd <Path to Project Directory></code>. Call the assembler like this:

<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
</source>

This instructs the assembler to translate the source file “prog1.S” into an object file “prog1.o”. This is an intermediary file that contains binary machine code, but is not a complete program yet. The “-g” option tells the assembler to include debug information, which does not influence the program itself, but makes debugging easier. To turn this object file into a final program, call the linker like this:

<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000
</source>

This creates a file “prog1.elf” that contains the whole generated program. The “-Ttext” option instructs the linker to assume 0x8000000 as the start address of the flash memory. The linker might output a warning like this:

<pre>arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000</pre>

This is not relevant for executing the program without an operating system and can be ignored.
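Since the assembler and linker calls are repeated after every change to the source file, they can be wrapped in a small shell function. The following is only a sketch: the function name <code>build</code> and the environment variables <code>AS</code> and <code>LD</code> (which allow substituting other tool names) are assumptions made for this illustration, while the command line options are exactly the ones shown above.

<source lang=sh>
# Assemble NAME.S and link the result into NAME.elf.
# AS and LD may be set in the environment to use differently named tools.
build() {
    name="$1"
    # The linker only runs if the assembler succeeded.
    "${AS:-arm-none-eabi-as}" -g "$name.S" -o "$name.o" &&
        "${LD:-arm-none-eabi-ld}" "$name.o" -o "$name.elf" -Ttext=0x8000000
}
</source>

After defining the function, <code>build prog1</code> in the project directory performs both steps and stops after the assembler if it reports an error.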

=== Flashing the program ===
To download the compiled application to the microcontroller that has been attached via ST-Link, use OpenOCD like so:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg -c "program prog1.elf verify reset exit"
</source>
Unfortunately, the application does not do anything that can be observed from the outside, except perhaps increase the current consumption.
=== Starting the debugger ===
To check whether the program is actually running, start a debugging session to closely observe the processor’s behavior. First, run OpenOCD such that it acts as a GDB server:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg
</source>
Then, open a new terminal/command window and start a GDB session:
<source lang=sh>
arm-none-eabi-gdb prog1.elf
</source>
GDB provides its own interactive text-based user interface. First, type this command to let GDB connect to the already running OpenOCD instance:
<source>
target remote :3333
</source>
Then, stop the currently running program:
<source>
monitor reset halt
</source>
If this fails, hold your board’s reset button just before executing the command and repeat until it succeeds. GDB can also download code to flash memory by simply typing:
<source>
load
</source>
This will overwrite the previously flashed program (which, in this case, is identical anyway). After loading the program, reset the controller again:
<source>
monitor reset halt
</source>
Now, examine the contents of the CPU registers:
<source>
info reg
</source>
The output should look something like
<pre>
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
sp 0x0 0x0
lr 0x0 0
pc 0x8000000 0x8000000 <_stack+133693440>
xPSR 0x1000000 16777216
msp 0x20000400 0x20000400
psp 0x27e3fa34 0x27e3fa34
primask 0x0 0
basepri 0x0 0
faultmask 0x0 0
control 0x0 0
</pre>

At this point, the processor is ready to start executing your program. The processor is halted just before the first instruction, which is “nop”. You can let the processor execute one single instruction (i.e. the “nop”) by typing
<source>
stepi
</source>
If you type <code>info reg</code> again, you will see that PC is now “0x80000ee”, i.e. the processor is about to execute the next instruction, “b .”. When you run <code>stepi</code> again (repeatedly), nothing more will happen – the controller is stuck in the mentioned endless loop, exactly as intended. You can instruct the processor to run the program continuously, without stopping after each instruction, by typing
<source>
continue
</source>
You can interrupt the running program by pressing “Ctrl+C”. Run the commands
<source>
kill
quit
</source>
to exit GDB. You can terminate OpenOCD by pressing “Ctrl+C” in its terminal.
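The PC value observed after the first <code>stepi</code> can be cross-checked against the assembly source with a bit of shell arithmetic. This sketch assumes that “nop” is encoded as a 2-byte Thumb instruction; the variable and function names are made up for this illustration.

<source lang=sh>
FLASH_BASE=0x08000000          # start of flash, as passed to the linker via -Ttext
TABLE_SIZE=$((4 + 4 + 0xe4))   # two .word entries plus ".space 0xe4"

show_addresses() {
    nop_addr=$((${FLASH_BASE} + ${TABLE_SIZE}))
    printf 'nop: 0x%08x\n' "$nop_addr"
    printf 'b .: 0x%08x\n' "$((nop_addr + 2))"   # assumes a 2-byte nop
}

show_addresses
</source>

The second line of the output matches the “0x80000ee” shown by GDB. Note also that the second “.word” in the listing, 0x080000ed, is the computed address of “nop” plus one.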
=== Using processor registers ===
The example program hasn’t done anything useful, but any “real” program will need to process some data. On ARM, any data processing is done via the processor registers. The 32bit ARM platforms have 16 processor registers, each of which is 32bit in size. The last three of those (r13-r15) have a special meaning and can only be used with certain restrictions. The first thirteen (r0-r12) can be used freely by the application code for data processing.

All calculations (e.g. addition, multiplication, logical and/or) need to be performed on those processor registers. To process data from memory, it first has to be loaded into a register, then processed, and stored back into memory. This is typical for RISC platforms and is known as a “load-store-architecture”.

As the starting point for any calculation, some specific values need to be put into the registers. The easiest way to do that is:
<source lang=asm>
ldr r0, =123456789
</source>
The number 123456789 will be encoded as part of the program, and the instruction lets the processor copy it into the register “r0”. Any number and any register in the range r0-r13 can be used instead.

The instruction “mov” can be used to copy the contents from one register to another:
<source lang=asm>
mov r1, r0
</source>
This copies r0 to r1. Unlike on some other processor architectures, “mov” cannot be used to access memory, but only the processor registers.

In ARM, 32bit numbers are called "words" and are most frequently used. 16bit numbers are known as half-words, and 8bit numbers as bytes, as usual.
=== Accessing periphery ===
To write microcontroller programs that interact with the outside world, access to the controller’s periphery modules is required. Interaction with periphery happens mainly through periphery registers (also known as “special function registers”, SFR). Despite their name, they work quite differently from processor registers. Instead of numbers, they have addresses (in the range of 0x40000000-0x50000000) that are not contiguous (i.e. there are gaps). They cannot be used directly for data processing, but need to be explicitly read before and written after any calculation. Not all of them are 32bit; many have only 16bit, and some of those bits may not exist and can’t be accessed. The microcontroller manufacturer’s documentation uses names for these registers, but the assembler doesn’t know these. Therefore, the assembly code needs to use the numerical addresses.

The easiest way to get the microcontroller to do something that produces some visible result is to send a signal via an output pin to turn on an LED. Using a pin to send/receive arbitrary software-defined signals is called “GPIO” (General Purpose Input/Output). First, choose a pin – for example, PA8 (this one is available on all package variants). Connect an LED to this pin and to GND (“active high”). Use a series resistor to limit the current to max. 15 mA (the absolute maximum being 25 mA), e.g. 100 Ω for a 3.3 V supply and a standard LED. For higher loads (e.g. high-power LEDs or a relay) use an appropriate transistor.
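The suggested resistor value can be checked with a rough calculation. The sketch below assumes an LED forward voltage of about 2.0 V, which is a typical ballpark figure and not a value from any datasheet; the function name <code>led_current_ma</code> is likewise made up for this illustration. Voltages are given in millivolts so that integer arithmetic suffices.

<source lang=sh>
# Arguments: supply voltage [mV], LED forward voltage [mV], resistor [Ohm].
# Prints the resulting LED current in mA (millivolts / ohms = milliamps).
led_current_ma() {
    echo "$(( ($1 - $2) / $3 ))"
}

# 3.3 V supply, assumed 2.0 V forward voltage, 100 Ohm series resistor
led_current_ma 3300 2000 100
</source>

Under these assumptions the current comes out at 13 mA, which stays below the 15 mA limit mentioned above. For an LED with a different forward voltage, the resistor should be recalculated.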

As with most microcontrollers, the pins are grouped into so-called “ports”, each of which has up to 16 pins. The ports are named by letters of the alphabet, i.e. “GPIOA”, “GPIOB”, “GPIOC” etc. The number of ports and pins varies among the individual microcontroller types. The 16 pins of one port can be read or written in one single step.

==== Clock Configuration ====
Many ARM controllers feature a certain trap: Most periphery modules are disabled by default to save power. The software has to explicitly enable the needed modules. On STM32 controllers, this is done via the “RCC” (Reset and Clock Control) module. Particularly, this module allows the software to disable/enable the clock signal for each periphery module. Because MOSFET-based circuits (virtually all modern ICs) only draw power if a clock signal is applied, turning off the clock of unused modules can reduce the power usage considerably.

This is documented in the aforementioned reference manual in chapter 7. The subchapter 7.3.7 describes the periphery register “RCC_APB2ENR”, which allows you to configure the clock signal for some peripheral modules. This register has 32 bits, of which 14 are “reserved”, i.e. can’t be used and should only be written with zeroes. Each of the available 18 bits enables one specific periphery module if set to “1” or disables it if set to “0”. According to the manual, the reset value of this register is 0, so all periphery modules are disabled by default. In order to turn on the GPIOA module to which the desired pin PA8 belongs, the bit “IOPAEN” needs to be set to “1”. This is bit number two in the register. Since registers can only be accessed as a whole (individual bits can’t be addressed), a 32bit-value where bit two is “1” and all others are kept as “0” needs to be written. This value is 0x00000004.
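If no calculator with a hexadecimal mode is at hand, the value for a single bit can also be obtained in a shell by shifting a “1” to the left by the bit number. The helper name <code>bit_mask</code> is made up for this illustration.

<source lang=sh>
# Print the 32bit value in which only the given bit number is set.
bit_mask() {
    printf '0x%08X\n' "$((1 << $1))"
}

bit_mask 2   # IOPAEN is bit number two
</source>

For bit number two, this prints 0x00000004, matching the value given above.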

To write to the register, its address needs to be given in the code. The addresses of the periphery registers are grouped by the periphery modules they belong to - each periphery module (e.g. RCC, GPIOA, GPIOB, USB, …) has its own base address. The addresses of the individual registers are specified as an offset that needs to be added to this base address to obtain the full absolute address of the register. Chapter 7.3.7 specifies the offset address of RCC_APB2ENR as “0x18”. Chapter 3.3 specifies the base addresses of all periphery modules – RCC is given as “0x40021000”. So, the absolute address of RCC_APB2ENR is “0x40021000 + 0x18 = 0x40021018”.

In short: To enable GPIOA, the value '''0x00000004''' needs to be written to address '''0x40021018'''.

According to the “load-store” principle, ARM processors can’t do this in a single step. Both the value to be written and the address need to reside in processor registers in order to perform the write access. So, what needs to be done is:
* Load the value 0x00000004 into a register
* Load the value 0x40021018 into another register
* Store the value from the first register into the memory location specified by the second register.
This last step is performed by the “STR” instruction as follows:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
The square brackets are required but just serve as a reminder to the programmer that the contents of “r1” is used as an address. After the “str” instruction, the GPIOA periphery is enabled, but doesn’t do anything yet.
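
To avoid typing bare numbers, the address arithmetic described above can be left to the assembler. The following is a minimal sketch, assuming the GNU assembler’s <code>.equ</code> directive; the symbol names <code>RCC_BASE</code>, <code>RCC_APB2ENR</code> and <code>RCC_APB2ENR_IOPAEN</code> are chosen here for illustration and are not predefined anywhere. The three lines at the end are meant to replace the corresponding “ldr”/“str” lines of the program above:
<source lang=asm>
.equ RCC_BASE, 0x40021000                @ Base address of RCC (chapter 3.3)
.equ RCC_APB2ENR, RCC_BASE + 0x18        @ Base + offset = 0x40021018
.equ RCC_APB2ENR_IOPAEN, 1 << 2          @ Bit number two = 0x00000004

ldr r0, =RCC_APB2ENR_IOPAEN
ldr r1, =RCC_APB2ENR
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
</source>
The <code>.equ</code> lines only define names for numbers and should not emit any machine code themselves, so the resulting program is expected to behave like the version with literal numbers.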

==== GPIO Configuration ====
By default, all GPIO pins are configured as “input”, even if there is no software to process the input data. Since inputs are “high-impedance”, i.e. only a very small current can flow into/out of the pin, the risk of (accidental) short-circuits and damage to the microcontroller is minimized. However, this current is too small to light up an LED, so you have to configure the pin PA8 as “output”. The STM32 support multiple output modes, of which the right one for the LED is “General Purpose Output Push-Pull, 2 MHz”.

Access and configuration of GPIO pins is achieved via the registers of the GPIO periphery. The STM32 have multiple identical instances of GPIO modules, which are named GPIOA, GPIOB, … Each of those instances has a distinct base address, which is again listed in chapter 3.3 of the reference manual (e.g. “0x40010800” for GPIOA, “0x40010C00” for GPIOB etc.). The registers of the GPIO module are described in chapter 9.2, and there is one instance of each register per GPIO module. To access a specific register of a specific GPIO module, the base address of that module needs to be added to the offset address of the register. For example, “GPIOA_IDR” has address “0x40010800+0x08=0x40010808”, while “GPIOB_ODR” has address “0x40010C00+0x0C=0x40010C0C”.

Configuration of the individual GPIO pins happens through the “GPIOx_CRL” and “GPIOx_CRH” registers (“x” is a placeholder for the concrete GPIO module) – see chapters 9.2.1 and 9.2.2. Both registers are structured identically, where each pin uses 4 bits, so each of the two registers handles 8 pins in 8x4=32 bits. Pins 0-7 are configured by “GPIOx_CRL” and pins 8-15 by “GPIOx_CRH”. Pin 0 is configured by bits 0-3 of “GPIOx_CRL”, pin 1 by bits 4-7 of “GPIOx_CRL”, pin 8 by bits 0-3 of “GPIOx_CRH” and so on.

The 4 bits per pin are split into two 2-bit fields: “MODE” occupies bits 0-1, and “CNF” bits 2-3. “MODE” selects from input and output modes (with different speeds). In output mode, “CNF” determines whether the output value is configured from software (“General Purpose” mode) or driven by some other periphery module (“Alternate function” mode), and whether two transistors (“Push-pull”) or one (“open-drain”) are used to drive the output. In input mode, “CNF” selects from analog mode (for ADC), floating input and input with pull-up/down resistors (depending on the value in the “GPIOx_ODR” register).

Therefore, to configure pin PA8 into “General Purpose Output Push-Pull, 2 MHz” mode, bits 0-3 of “GPIOA_CRH” need to be set to value “2”. The default value of “4” configures the pin as “input”. To keep the other pins at their “input” configuration, the value “0x44444442” needs to be written to register “GPIOA_CRH”, which has address “0x40010804”:
<source lang=asm>
ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
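
The value “0x44444442” can also be composed from its fields, which makes the bit layout described above visible in the code. This is a sketch under the assumption that the GNU assembler’s <code>.equ</code> directive and expression operators are used; all symbol names are made up for this illustration:
<source lang=asm>
.equ GPIOA_CRH, 0x40010800 + 0x04        @ GPIOA base + CRH offset = 0x40010804
.equ MODE_OUTPUT_2MHZ, 2                 @ MODE field (bits 0-1 of the pin's 4 bits)
.equ CNF_GP_PUSH_PULL, 0                 @ CNF field (bits 2-3 of the pin's 4 bits)
.equ PIN, 8                              @ PA8 is the first pin handled by CRH

@ 4 bits per pin; pins 8-15 are at positions 0-7 within CRH
.equ PIN_CONFIG, ((CNF_GP_PUSH_PULL << 2) | MODE_OUTPUT_2MHZ) << ((PIN - 8) * 4)
.equ OTHER_PINS_INPUT, 0x44444440        @ Default value “4” for pins 9-15

ldr r0, =(OTHER_PINS_INPUT | PIN_CONFIG) @ = 0x44444442
ldr r1, =GPIOA_CRH
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>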

==== Writing GPIO pins ====
The GPIO pin still outputs the default value, which is 0 for “low”. To turn on the LED, the output has to be set to “1” for “high”. This is achieved via the GPIOA_ODR register, which has 16 bits, one for each pin (see chapter 9.2.4). To enable the LED, set bit 8 to one:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x100
ldr r1, =0x4001080C
str r0, [r1] @ Set ODR8 in GPIOA_ODR to 1 to set PA8 high

b .
</source>
Example name: “SetPin”

This program enables the GPIOA periphery clock, configures PA8 as output, and sets it to high. If you run it on your microcontroller, you should see the LED turn on – the first program to have a visible effect!

=== Data processing ===
ARM supports many instructions for mathematical operations. For example, addition can be performed as:
<source lang=asm>
ldr r0, =222
ldr r1, =111
add r2, r0, r1
</source>
This will first load the value 222 into register r0, load 111 into r1, and finally add r0 and r1 and store the result (i.e. 333) in r2. The operand for the result is (almost) always put on the left, while the input operand(s) follow on the right.

You can also overwrite an input register with the result:
<source lang=asm>
add r0, r0, r1
</source>
This will write the result to r0, overwriting the previous value. This is commonly shortened to
<source lang=asm>
add r0, r1
</source>
The output operand can be omitted, and the first input (here: r0) will be overwritten. This applies to most data processing instructions. Other frequently used data processing instructions that are used in a similar fashion are:
* '''sub''' for subtraction
* '''mul''' for multiplication
* '''and''' for bitwise and
* '''orr''' for bitwise or
* '''eor''' for bitwise exclusive or (“xor”)
* '''lsl''' for logical left shift
* '''lsr''' for logical right shift

Most of these instructions can not only take registers as input, but also immediate arguments. Such an argument is encoded directly into the instruction without needing to put it into a register first. Immediate arguments need to be prefixed by a hash sign #, and can be decimal, hexadecimal or binary. For example,
<source lang=asm>
add r0, r0, #23
</source>
adds 23 to the register r0 and stores the result in r0. This can again be shortened to
<source lang=asm>
add r0, #23
</source>

Such immediate arguments can not be arbitrarily large, because they need to fit inside the instruction, which is 16 or 32 bits in size and also needs some room for the opcode and register numbers. So, if you want to add a large number, you have to use “ldr” first as shown to load it into a register.

Try out the above examples and use GDB to examine their behavior. Use GDB’s “info reg” command to display the register contents. Don't forget to execute both the “arm-none-eabi-as” and “arm-none-eabi-ld” commands to translate the program.
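
As a starting point for such experiments, the following sketch combines several of the listed instructions into one program that can be single-stepped in GDB. It reuses the startup lines of the earlier examples; the choice of registers and input values is arbitrary, and the comments state the results to be expected from the arithmetic:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =222
ldr r1, =111
add r2, r0, r1    @ r2 = 222 + 111 = 333
sub r3, r0, r1    @ r3 = 222 - 111 = 111
mul r4, r0, r1    @ r4 = 222 * 111 = 24642
lsl r5, r1, #2    @ r5 = 111 shifted left by 2 bits = 444
and r6, r0, #0x0f @ r6 = 0xde & 0x0f = 0x0e = 14
eor r7, r0, r1    @ r7 = 0xde ^ 0x6f = 0xb1 = 177

b .
</source>
After stepping through all instructions, “info reg” should show these values in r2 to r7.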

=== Reading periphery registers ===
The last example works, but has a flaw: Even though only a few bits per register need to be modified, the code overwrites all the bits in the register at once. The bits that should not be modified are just overwritten with their respective default value. If some of those bits had been changed before – for example to enable some other periphery module – these changes would be lost. Keeping track of the state of the register throughout the program is hardly practical. Since ARM does not permit modifying individual bits, the solution is to read the whole register, modify the bits as needed, and write the result back. This is called a “read-modify-write” cycle.

Reading registers is done via the “ldr” instruction. As with “str”, the address needs to be written into a processor register beforehand, and the instruction stores the read data into a processor register as well. Starting with the “RCC_APB2ENR” register, you can read it via:
<source lang=asm>
ldr r1, =0x40021018
ldr r0, [r1]
</source>
Even though the two “ldr” instructions look similar, they work differently – the first one loads a fixed value into a register (r1), while the second loads data from the periphery register, whose address is held in r1, into r0.

The loaded value should then be modified by setting bit two to “1”. This can be done with the “orr” instruction:
<source lang=asm>
orr r0, r0, #4
</source>
After that, we can store r0 as before.

With the GPIOA_CRH register, it’s slightly more complicated: The bits 0, 2 and 3 need to be cleared, while bit 1 needs to be set to 1. The other bits (4-31) need to keep their value. To clear the bits, use the “and” instruction after loading the current periphery register value:

<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
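
The clearing step can alternatively be written with the “bic” (“bit clear”) instruction, which clears exactly those bits of the register that are set in the operand, so the mask does not have to be inverted by hand. The following sketch is intended to be equivalent to the block above:
<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
bic r0, #0xf @ Clear bits 0-3, same result as “and” with 0xfffffff0
orr r0, #2   @ Set bit 1
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>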

For the “GPIOx_ODR” registers, such tricks are not needed, as there is a special “GPIOx_BSRR” register which simplifies writing individual bits: This register can not be read, and writing zeroes to any bit has no effect on the GPIO state. However, if a 1 is written to any of the bits 0-15, the corresponding GPIO pin is set to high (i.e. the corresponding bit in ODR set to 1). If any of the bits 16-31 is written to 1, the corresponding pin is set to low. So, the pin can be set to 1 like this:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

So, the modified program is:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “SetPin2”

=== Jump instructions ===
For a traditional “hello world” experience, the LED should not only light up, but blink, i.e. turn on and off repeatedly. Setting pin PA8 to low level can be achieved by writing a 1 to bit 24 in the “GPIOA_BSRR” register:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1]
</source>

By pasting this after the instructions for turning on the LED, the LED will be turned on and off again. To get the LED to blink, those two blocks need to be repeated endlessly, i.e. at the end of the code there needs to be an instruction for jumping back to the beginning.

A simple endless loop was already explained: The “b .” instruction, which just jumps to itself repeatedly. To have it jump somewhere else, the dot needs to be replaced by the desired target address, for example:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b 0x8000104
</source>
Example name: “Blink”

The address specified is an absolute address, namely the address of the “ldr” instruction at the beginning of the block for setting the pin to high. Actually, the branch instruction “b” is not capable of jumping directly to such an absolute address - again, because a 32 bit wide address can't be encoded in a 16/32 bit wide instruction. Instead, the assembler calculates the distance between the jump target and the location of the “b” instruction, and stores it into the instruction. When jumping backwards, this distance is negative.

When executing program code, the processor always stores the address of the currently executed instruction plus four in the r15 register, which is therefore also known as PC, the program counter. When encountering a “b” instruction, the processor adds the contained distance value to the PC value to calculate the absolute address of the jump target before jumping there.

This means that “b” performs a relative jump, and even if the whole machine code section were moved somewhere else in memory, the code would still work. However, the assembly language syntax does not really represent this, as the assembler expects absolute addresses which it then transforms into relative ones.
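
The “plus four” behavior of the PC can be observed directly by copying r15 into another register. The following is a small sketch, assuming the program is linked to start at 0x08000000 as in the previous examples, so that the first instruction after the 0xec bytes of startup data is located at 0x080000ec:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

mov r0, pc @ Located at 0x080000ec; reads the PC as 0x080000ec + 4
b .
</source>
After executing the “mov” instruction, “info reg” in GDB should show 0x080000f0 in r0.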

Specifying the target address directly as shown is very impractical, as it has to be calculated manually, and if the section of code is moved or modified, the address needs to be changed. To rectify this, the assembler supports labels: You can assign a name to a certain code location, and use this name to refer to the code location instead of specifying the address as a number. A label is defined by writing its name followed by a colon:

<source lang=asm>
BlinkLoop:

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b BlinkLoop
</source>
Example name: “Blink2”

This is purely a feature of the assembler – the generated machine code will be identical to the previous example. In “b BlinkLoop”, the assembler substitutes the label with the address it represents to calculate the relative jump distance. The assembler actually provides no direct way of specifying the relative offset that will be encoded in the instruction, but it can be done like this:

<source lang=asm>
b (.+4+42*2)
</source>

The resulting instruction will contain “42” as the jump offset. As suggested by the syntax, the processor multiplies this number by 2 (since instructions can only reside at even memory addresses, storing the lowest address bit would waste one bit of the instruction) and adds to it the address of the “b” instruction plus 4. The assembly syntax is designed to represent the end result of the operation, so the assembler reverses the peculiar pre-calculations of the processor. If you want to do this calculation yourself, you have to again undo the assembler’s own calculation with the expression shown above. There is usually no reason to do that, though.
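
To relate the two notations, the following sketch shows a forward jump once with a label and once with an explicit offset. It assumes that “b” and “nop” are each assembled into a 2-byte encoding here, which should be verified in the disassembly; the label name <code>Skip</code> is arbitrary:
<source lang=asm>
b Skip      @ Assembler computes the offset from the label
nop         @ Skipped
nop         @ Skipped
Skip:

b (.+4+1*2) @ Explicit offset 1: target is the address of this “b” plus 6
nop         @ Skipped
nop         @ Skipped
b .         @ Both variants jump over exactly two “nop” instructions
</source>
Under the stated assumption, both “b” instructions should be encoded identically, as each jumps to the location 6 bytes after its own address.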

=== Counting Loops ===
The above example for a blinking LED does not really work yet – the LED blinks so fast the human eye can’t see it. The LED will just appear slightly dim. To achieve a proper blinking frequency, the code needs to be slowed down. The easiest way for that is to have the processor execute a large number of “dummy” instructions between setting the pin high and low. Simply placing many “nop” instructions isn’t possible though, as there is simply not enough program memory to store all of them. The solution is a loop that executes the same instructions a specific number of times (as opposed to the endless loops from the examples above). To do that, the processor has to count the number of loop iterations. It is actually easier to count ''down'' than up, so start by loading the desired number of iterations into a register and begin the loop by subtracting “1”:

<source lang=asm>
ldr r2, =1000000
subs r2, #1
</source>

Now, the processor should make a decision: If the register has reached zero, terminate the loop; else, continue by again subtracting “1”. The ARM math instructions can automatically perform some tests on the result to check whether it is positive/negative or zero and whether an overflow occurred. To enable those checks, append an “s” to the instruction name – hence, “subs” instead of “sub”. The result of these checks is automatically stored in the “Application Program Status Register” (APSR) – the contained bits N, Z, C, V indicate whether the result was negative, zero, set the carry bit or caused an overflow. This register is usually not accessed directly. Instead, use the conditional variant of the “b” instruction, where two letters are appended to indicate the desired condition. The jump is only performed if the condition is met; otherwise, the instruction does nothing. The available condition codes are described in the chapter “Condition Codes” of this tutorial. The conditions are formulated in terms of the mentioned bits of the APSR. For example, the “bne” instruction only performs a jump if the zero (Z) flag is ''not'' set, i.e. when the result of the last math instruction (with an “s” appended) was ''not'' zero. The “beq” instruction is the opposite of that – it only performs a jump if the result ''was'' zero.
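
The interplay of “subs” and the conditional jumps can be tried out in isolation before building a loop. In this sketch, the label names <code>NotZero</code> and <code>WasZero</code> are arbitrary; single-stepping it in GDB should end up in the endless loop at <code>WasZero</code>:
<source lang=asm>
ldr r0, =5
subs r0, #5 @ Result is 0, so the Z flag is set
bne NotZero @ Not taken, because Z is set
beq WasZero @ Taken, because Z is set

NotZero:
b .         @ Only reached if the result was not zero

WasZero:
b .         @ Reached because the result was zero
</source>
Changing the subtracted value to anything other than 5 should make the program end up at <code>NotZero</code> instead.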

So, to perform the jump back to the beginning of the loop, add a label before the “subs” instruction, and put a “bne” instruction after the “subs” that jumps to this label if the counter has not reached zero yet:
<source lang=asm>
ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop
</source>

The actual loop consists only of the two instructions “subs” and “bne”. By placing two of those loops (with two different labels!) in between the blocks that turn the pin on and off, the blink frequency is lowered sufficiently such that it becomes visible:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

BlinkLoop:
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r2, =1000000
delay2:
subs r2, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay”

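You might notice that the registers r0-r2 are loaded with the same values over and over again. To make the code both shorter and faster, take advantage of the available processor registers, and load the values that don’t change ''before'' the loop. Then, just use them inside the loop. Because each delay loop counts its register down to zero, the number of iterations is kept in r3 and copied into r4 with the “mov” instruction before each delay loop: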

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay2”
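
A counting loop is not limited to wasting time; the loop body can also do useful work on each iteration. As an illustration that is unrelated to the LED, this sketch adds up the numbers from 10 down to 1 (the label name <code>SumLoop</code> is arbitrary):
<source lang=asm>
ldr r0, =0  @ Running sum
ldr r1, =10 @ Loop counter, counts down to zero

SumLoop:
add r0, r1  @ Add the current counter value to the sum
subs r1, #1 @ Decrement counter and update the flags
bne SumLoop @ Repeat until the counter has reached zero

b .         @ r0 now contains 10+9+...+1 = 55
</source>
Note that the “add” is written without an “s” suffix, so it is not meant to change the flags, and “bne” evaluates the result of the “subs” instruction directly before it.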

=== Using RAM ===
Until now, all data in the example codes was stored in periphery or processor registers. In all but the simplest programs, larger amounts of data have to be processed for which the thirteen general-purpose processor registers aren’t enough. For this, the microcontroller features a block of SRAM that stores 20 KiB of data. Accessing data in RAM works similarly to accessing periphery registers – load the address in a processor register and use “ldr” and “str” to read and write the data. After reset, the RAM contains just random ones and zeroes, so before the first read access, some value has to be stored.

As the programmer decides what data to place where, they have to keep track of which address in memory contains what piece of data. You can use the assembler to help keep track by declaring what kind of memory blocks you need and giving them names. To do this, you must first tell the assembler that the next directives refer to data instead of instructions with the “.data” directive. Then, use the “.space” directive for each block of memory you need. To assign names to the blocks, place a label definition (using a colon) right ''before'' that. After the definitions, put a “.text” directive to make sure the instructions after that will properly go to program memory (flash):

<source lang=asm>
.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text
@ Instructions go here...
</source>

Here, a data block of 4 bytes is reserved and named “var1”. Another block of 1 byte is named “var2”. Note that just inserting these lines will not modify the assembler output – these are just instructions to the assembler itself. To access these memory blocks, you can use “var1” and “var2” just like literal addresses. Load them into registers and use these with “ldr” and “str” like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text

ldr r0, =var1 @ Get address of var1
ldr r1, =0x12345678
str r1, [r0] @ Store 0x12345678 into memory block “var1”

ldr r1, [r0] @ Read memory block “var1”
and r1, #0xFF @ Set bits 8..31 to zero
ldr r0, =var2 @ Get address of var2
strb r1, [r0] @ Store a single byte into var2

b .
</source>
Example name: “RAMVariables”

Note the use of “strb” - it works similarly to “str”, but only stores a single byte. Since the processor register r1 is of course 32 bits in size, only the lower 8 bits are stored, and the rest is ignored.
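
The counterpart for reading a single byte is “ldrb”, which fills the upper 24 bits of the target register with zeroes. The following sketch is meant to be inserted before the final “b .” of the “RAMVariables” example; it reads “var2” back and then increments “var1” with a read-modify-write sequence:
<source lang=asm>
ldr r0, =var2 @ Get address of var2
ldrb r2, [r0] @ Read a single byte: r2 = 0x00000078

ldr r0, =var1 @ Get address of var1
ldr r3, [r0]  @ Read memory block “var1” (0x12345678)
add r3, #1
str r3, [r0]  @ var1 now contains 0x12345679
</source>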

There is still something missing – nowhere in the code is there any address of the RAM. To specify where the RAM is located, pass the option <code>-Tdata=0x20000000</code> to the <code>arm-none-eabi-ld</code> call; this is the address of the first byte of RAM. This program can't be flashed directly with OpenOCD, as OpenOCD doesn't recognize the RAM as such; GDB has to be used as explained above. When a linker script is used as described in the next chapters (using the NOLOAD attribute), OpenOCD can again be used directly.

If you run this program via GDB, you can use the commands <code>x/1xw &var1</code> and <code>x/1xb &var2</code> to read the data stored in memory. After this quick introduction, a more systematic overview is in order.

== Memory Management ==
If there is one thing that sets higher and lower level programming languages apart, it’s probably memory management. Assembly programmers have to think about memory, addresses, layout of program and data structures all the time. Assembler and linker provide some help which needs to be used effectively. Therefore, this chapter will explain some more fundamentals of the ARM architecture and how the toolchain works.

=== Address space ===
In the examples so far, addresses were used for periphery register accesses and jump instructions without really explaining what they mean, so it’s time to catch up with that. To access periphery registers and memory locations in any memory type (RAM, Flash, EEPROM…), an address is required, which identifies the desired location. On most platforms, addresses are simply unsigned integers. The set of all possible addresses that can be accessed in a uniform way is called an “address space”. Some platforms such as AVR have multiple address spaces (for Flash, EEPROM, and RAM+periphery) where each memory needs to be accessed in a distinct way and the programmer needs to know which address space an address belongs to – e.g. all three memory types have a memory location with address 123.

However, the ARM architecture uses only a single large address space where addresses are 32bit unsigned integers in the range of 0-4294967295. Each address refers to one byte of 8 bits. The address space is divided into several smaller ranges, each of which refers to a specific type of memory. For the STM32F103, this is documented in the datasheet in chapter 4. All addresses in all memory types are accessed in the same way – directly via the “ldr” and “str” instructions, or by executing code from a certain location, which can be achieved by jumping to the respective address with the “b” instruction. This also makes it possible to execute from RAM – simply perform a jump to an address that refers to some code located in RAM. Note that there are large gaps between the individual ranges in address space; attempting to access those usually leads to a crash.

While the addresses of periphery are fixed and defined by the manufacturer, the layout of program code and data in memory can be set by the programmer rather freely. Up until now, the example programs defined the flash memory contents in a linear fashion by listing the instructions in the order they should appear in flash memory. However, when translating multiple assembly source files into one program, the order in which the contents from those files appear in the final program isn’t defined a priori. Also, even though in the last example the memory blocks for RAM were defined ''before'' the code, the code actually comes first in address space. What makes all this work is the linker.

=== The Linker ===
Usually the last step in translating source code into a usable program, the linker is an often overlooked, sometimes misunderstood but important and useful tool, if applied correctly. Many introductions to programming forego explaining its workings in detail, but as in any trade, embedded development requires mastery of the tools! A good understanding of the linker can save time solving strange errors and allow you to implement some less common use cases, such as using multiple RAM blocks present in some microcontrollers, executing code from RAM or defining complex memory layouts as sometimes required by RTOSes.

[[File:ArmAsmTutorial_Linking.svg|300px|thumb|right|Translation of native applications using assembler, compiler and linker]]You have already used a linker – the command <code>arm-none-eabi-ld</code> calls the GNU linker that is shipped with the GNU toolchain. Until now, only one assembly source file was translated for each program. To translate a larger program that consists of three assembly files “file1.S”, “file2.S” and “file3.S”, the assembler would be called three times to produce three object code files “file1.o”, “file2.o” and “file3.o”. The linker would then be called to combine all three into a single output file.

When translating any of these assembly files, the assembler does not know of the existence of the other files. Therefore, it can’t know whether the contents of any other file will end up in flash memory before the currently processed file, and also can’t know the final location in flash memory of the machine code it is emitting and placing in the object file (ending .o). This means that the object file does not contain any absolute addresses (except for those of periphery registers, as these were specified explicitly). For example, when loading the address of the RAM data blocks (“ldr r0, =var1”) the assembler doesn’t know the address, only the linker does. Therefore, the assembler puts a placeholder in the object file that will be overwritten by the linker. A jump (“b” instruction) to a label defined in another assembly file works similarly; the assembler uses a placeholder for the address. For the jump instructions we used inside the same file (e.g. “b BlinkLoop”), a placeholder is not necessary, as the assembler can calculate the distance between the label and the instruction and generate the relative jump itself. However, if the target resides within a different section (see below), this isn’t possible, and a placeholder becomes necessary. As the contents of object files have no fixed address and can be moved around by the linker, these files are called relocatable.
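
A jump across file boundaries could look like the following sketch, which shows the contents of two separate source files in one listing. It assumes the GNU assembler’s <code>.global</code> directive, which makes a label visible to other object files; the label name <code>OtherFile</code> is made up. It further assumes that “file1.o” is passed to the linker first, so that the two startup words end up at the beginning of flash:
<source lang=asm>
@ ---- Contents of file1.S ----
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

b OtherFile       @ Label unknown in this file: a placeholder is emitted

@ ---- Contents of file2.S ----
.syntax unified
.cpu cortex-m3
.thumb

.global OtherFile @ Make the label visible outside of file2.o
OtherFile:
b .               @ Same file and section: no placeholder needed
</source>
Without the <code>.global</code> line, the label would stay local to “file2.o”, and the linker would not be able to fill in the placeholder from “file1.o”.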

On Unix systems (including Linux), the Executable and Linkable Format (ELF) is used for both object files and executable program files. This format is also used by ARM and the GNU ARM toolchain. Because it was originally intended to be used with operating systems, some of its concepts don’t map perfectly to the embedded use case. The object (.o) files created by the assembler and the final program created by the linker (usually without a file extension, but in embedded contexts and in the example commands above, .elf is used) are all in ELF format. The specification of ELF for ARM can be found [https://developer.arm.com/architectures/system-architectures/software-standards/abi here], and the generic specification for ELF on which the ARM ELF variant is based can be found [http://www.sco.com/developers/gabi/2003-12-17/contents.html here].

ELF files are structured into sections. Each section may contain code, data, debug information (used by GDB) and other things. In an object file, the sections have no fixed address. In the final program file, they have one. Sections also have various attributes that indicate whether their contents are executable code or data, whether they are read-only and whether memory should be allocated for them. The linker combines and reorders the sections from the object files (“input sections”) and places them into sections in the final program file (“output sections”) while assigning them absolute addresses.

Symbols are another important aspect. A symbol defines a name for an address. The address of a symbol may be defined as an absolute number (e.g. 0x08000130) or as an offset relative to the beginning of a section (e.g. “start address of section .text plus 0x130”). Labels defined in assembly source code define symbols in the resulting object file. For example, the “var1” label defined in the last example results in a symbol “var1” in the “prog1.o” file whose address is set to be equal to the beginning of “.data”. The symbol “var2” is defined similarly, but with an offset of 4. After the linking process, the “prog1.elf” file contains a “.data” section with absolute address 0x20000000, and so the “var1” and “var2” symbols get absolute addresses as well.

As mentioned, the assembler puts placeholders in the object files when it doesn’t know the address of something. In ELF files, these placeholders are called “relocation entries” and they reference symbols by name. When the linker sees such a relocation entry in one of its input files, it searches for a symbol in the input files with a matching name and fills in its address. If no symbol with that name is found, it emits this dreaded error:

<pre>(.text+0x132): undefined reference to `Foo'</pre>

Google finds almost a million results for that message, but knowing how the linker operates makes it easy to understand and solve – since the symbol was not found in any object file, make sure it is spelled correctly and that the object file that contains it is actually fed to the linker.

=== Linker Scripts ===
A linker script is a text file written in a linker-specific language that controls how the linker maps input sections to output sections. The example project hasn’t explicitly specified one yet, which lets the linker use a built-in default one. This has worked so far, but results in a slightly mixed up program file (unsuitable symbols) and has some other disadvantages. Therefore, it’s time to do things properly and write a linker script. Linker scripts usually aren’t created on a per-project basis, but are provided by the microcontroller manufacturer to fit a certain controller’s memory layout. To learn how they work, a quick introduction to writing one will follow. The full documentation can be found [https://sourceware.org/binutils/docs/ld/Scripts.html here].

It’s customary to name the linker script after the controller it is intended for, so create a text file “stm32f103rb.ld” or “stm32f103c8.ld” with the following contents:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
*(.data)
} >SRAM
}
</source>
Example name: “LinkerScriptSimple”

This is the minimum viable linker script for a microcontroller. If you are using an STM32F103C8, replace the 128K by 64K. The lines inside the “MEMORY” block define the available memory regions on your microcontroller by specifying their start address and size within the address space. The names “FLASH” and “SRAM” can be chosen arbitrarily, as they have no special meaning. This memory definition has no meaning outside of the linker script, as it is just an internal helper for writing the script; it can even be left out and replaced by some manual address calculations.

The interesting part happens inside the “SECTIONS” command. Each sub-entry defines an output section that will end up in the final program file. These can be named arbitrarily, but the names “.text” and “.data” for executable code and data storage respectively are usually used. The asterisk expressions “*(.text)” and “*(.data)” tell the linker to put the contents of the input sections “.text” and “.data” at that place in the output section. In this case, the names for the input sections and output sections are identical. The input section names “.data”, “.text” (and some more) are used by the assembler and C and C++ compilers by default, so even though they can be changed, it’s best to keep them. You can however name the output sections arbitrarily, for example:

<source>
SECTIONS {
.FlashText : {
*(.text)
} >FLASH

.RamData (NOLOAD) : {
*(.data)
} >SRAM
}
</source>

The commands “>FLASH” and “>SRAM” tell the linker to calculate the address of the output sections according to the respective memory declaration above: The first output section with a “>FLASH” command will end up at address 0x8000000, the next with “>FLASH” right after that section and so on. The “>SRAM” works the same way with the start address “0x20000000”. The “NOLOAD” attribute does not change the linker’s behavior, but marks the corresponding output section as “not-loadable”, such that OpenOCD and GDB will not attempt to write it into RAM – the program has to take care of initializing any RAM data anyways when running stand-alone.

To specify the filename of the linker script, use the “-T” option:
<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -T stm32f103rb.ld
</source>
The <code>-Tdata</code> and <code>-Ttext</code> aren’t needed anymore, as the addresses are now defined in the linker script.

Since the linker script defines the sizes of the memory regions, the linker can now warn you when your program consumes too much memory (either flash or RAM):
<pre>arm-none-eabi-ld: prog1.elf section `.text' will not fit in region `FLASH'
arm-none-eabi-ld: region `FLASH' overflowed by 69244 bytes</pre>

==== Reserving memory blocks ====
Using the processor’s stack will be explained later, but you can already use the linker script to assign a memory block for it. It’s best to allocate memory for the stack at the ''beginning'' of SRAM, so put this before the “*(.data)” command:

<source>. = . + 0x400;</source>

Inside a linker script, the dot “.” refers to the current address in the output file; therefore, this command increments the address by 0x400, leaving an “empty” block of that size. The “.data” input section will be located after that, at address 0x20000400.

==== Defining symbols in linker scripts ====

As mentioned before, the controller requires a certain data structure called the “vector table” to reside at the very beginning of flash memory. It is defined in the assembler source file:

<source lang=asm>
.word 0x20000400
.word 0x080000ed
.space 0xe4
</source>

The “.word” directive tells the assembler to output the given 32bit-number. Just like processor instructions, these numbers are put into the current section (.text by default, .data if specified) and therefore end up in flash memory. The first 32bit-number, which occupies the first 4 bytes in flash memory, is the initial value of the stack pointer which will be explained later. This number should be equal to the address of the first byte ''after'' the memory block that was reserved for the stack. The reserved block starts at address 0x20000000 and has size 0x400, so the correct number is 0x20000400. However, if the size of the reserved block was modified in the linker script, the above assembly line needs to be adjusted as well. To avoid any inconsistencies, and to be able to manage everything related to the memory-layout centrally in the linker script, it is desirable to replace the number in the assembly source file with a symbol expression. To do this, define a symbol in the linker script:

<source>
.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
</source>
Example name: “LinkerScriptSymbols”

This will define a symbol “_StackEnd” to have the value of “.”, which is the current address, which at this point is 0x20000400. In the assembly source file, you can now replace the number with the symbol:

<source lang=asm>
.word _StackEnd
</source>

The assembler will put a placeholder in the object file, which the linker will overwrite with the value of 0x20000400. This modification will not change the output file, but avoids putting absolute addresses in source files. The name “_StackEnd” was chosen arbitrarily; since names that start with an underscore and a capital letter may not be used in C and C++ programs, there is no possibility of conflict if any C/C++ source is added later. Typically, all symbols that are part of the runtime environment and should be “invisible” to C/C++ code are named this way. The same rule applies to names starting with two underscores.

The second entry of the vector table is the address of the very first instruction to be executed after reset. Currently the address is hard-coded as the first address after the vector table. If you wanted to insert some other code before this first instruction, this number would have to be changed. This is obviously impractical, and therefore the number should be replaced by a label as well. Since the code executed at reset is commonly known as the “reset handler”, define it like that:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word _StackEnd
.word Reset_Handler
.space 0xe4

.type Reset_Handler, %function
Reset_Handler:

@ Put code here
</source>

The “.type” directive tells the assembler that the label refers to executable code. The exact meaning of this will be covered later. Leave the “.space” directive alone for now.
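
For reference, the previously hard-coded value 0x080000ed can be derived from the sizes of the entries in front of the first instruction. The following annotated copy of the earlier vector table is only a sketch of that arithmetic. It assumes, as in the earlier examples, that the table starts at the flash base address 0x08000000; the label “FirstInstruction” is a hypothetical name used for illustration:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400 @ Offset 0x00, 4 bytes: initial stack pointer
.word 0x080000ed @ Offset 0x04, 4 bytes: address of the first instruction (derived below)
.space 0xe4 @ Offset 0x08, 0xe4 bytes: rest of the vector table

@ The first instruction follows at offset 0x08 + 0xe4 = 0xec.
@ Its absolute address is 0x08000000 + 0xec = 0x080000ec.
@ With the lowest bit set, the table entry becomes 0x080000ed.
@ For a label marked with ".type ..., %function" this bit is set automatically.
FirstInstruction:
b .
</source>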

==== Absolute section placement ====
The vector table needs to be at the beginning of flash memory, and the examples have relied on the assembler putting the first things from the source file into flash memory first. This stops working if you use multiple source files. You can use the linker script to make sure the vector table is always at the beginning of flash memory. To do that, you first have to separate the vector table from the rest of the code so that the linker can handle it specially. This is done by placing the vector table in its own section:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:
</source>
Example name: “LinkerScriptAbsolutePlacement”

The “.section” directive instructs the assembler to put the following data into the custom section “.VectorTable”. The “a” flag marks this section as allocable, which is required to have the linker allocate memory for it. To place the vector table at the beginning of flash memory, define a new output section in the linker script:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

This puts the .VectorTable input section into the equally-named output section. It is also possible to put it into .text alongside the code:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.VectorTable)
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

Even though both variants produce the same flash image, the first one is slightly nicer to work with in GDB. The modified LED-blinker application now looks like:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>

=== Program Structure ===
Because the vector table is usually the same for all projects, it is handy to move it into a separate file, for example called “vectortable.S”:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Assemble both source files with two assembler commands, then link the resulting object files:
<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
arm-none-eabi-as -g vectortable.S -o vectortable.o
arm-none-eabi-ld prog1.o vectortable.o -o prog1.elf -T stm32f103rb.ld
</source>

This will result in the dreaded “undefined reference” error. To alleviate this, use the “.global” directive in the main source file “prog1.S”:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
@ Code here ...
</source>

This will tell the assembler to make the symbol “Reset_Handler” visible globally, such that it can be used from other files. By default, the assembler creates a ''local'' symbol for each label, which can’t be used from other source files (same as ''static'' in C). The symbol is still there in the final program file, though - it can be used for debugging purposes.

== More assembly techniques ==
After having set up the project for using the linker properly, some more aspects of assembly programming will be introduced.

=== Instruction set state ===
As mentioned before, ARM application processors support both the T32 and A32/A64 “ARM” instruction sets, and are capable of dynamically switching between them. This can be used to encode time-critical program parts in the faster A32/A64 instruction set, and less critical parts in the T32 “thumb” instruction set to save memory. Actually, reducing program size may improve performance too, because the cache memories may become more effective.

Even though the Cortex-M microcontrollers based on the ARMv7-M architecture do not support the A32/A64 instruction sets, some of the switching-logic is still there, requiring the program code to work accordingly. The switch between the instruction sets happens when jumping with the “bx” “Branch and Exchange” and “blx” “Branch with Link and Exchange” instructions. Since all instructions are of size 2 or 4, and code may only be stored at even addresses, the lowest bit of the address of any instruction is always zero. When performing a jump with “bx” or “blx”, the lowest bit of the target address is used to indicate the instruction set of the jump target: If the bit is “1”, the processor expects the code to be T32, else A32.

Another specialty of the “bx” and “blx” instructions is that they take the jump target address from a register instead of encoding it in the instruction directly. This is called an indirect jump. An example of such a jump is:

<source lang=asm>
ldr r0, =SomeLabel
bx r0
</source>

Such indirect jumps are necessary if the difference of the jump target address and the jump instruction is too large to be encoded in the instruction itself for a relative jump. Also, sometimes you want to jump to an address that has been passed from another part of the program, which e.g. happens in C/C++ code when using function pointers or virtual functions.

In these cases, you need to make sure that the address passed to “bx/blx” via a register has its lowest bit set, to indicate that the target code is T32. Otherwise, the code will crash. This can be achieved by telling the assembler that the target label refers to code (and not data) via the already mentioned “.type” directive:

<source lang=asm>
.type SomeLabel, %function
SomeLabel:
@ Some code...
</source>

That way, when you refer to the label to load its address into a register, the lowest bit will be set. Actually, using “.type” for all code labels is a good idea, even though it does not matter if you only refer to a label via the “b” instruction (including the conditional variant) which does not encode the lowest bit and does not attempt to perform an instruction set switch.
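
The difference can be made visible with a small sketch. The label names “Start”, “Target” and “PlainLabel” are hypothetical and not part of the example project:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

Start:
ldr r0, =Target @ r0 = address of Target with the lowest bit set
ldr r1, =PlainLabel @ r1 = address of PlainLabel, lowest bit clear
bx r0 @ Works: the target is executed as T32
@ A "bx r1" at this point would crash instead

.type Target, %function
Target:
b .

PlainLabel: @ No ".type" directive for this label
b .
</source>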

As was already shown, there is another case where the lowest bit matters: when specifying the address of the reset handler (and later, exception handler functions) in the vector table, the bit must be set, so the “.type” directive is necessary here too:

<source lang=asm>
.type Reset_Handler, %function
</source>

If you were writing code for a Cortex-A processor, you would use “.arm” instead of “.thumb” to have your code (or performance critical parts of it) encoded as A32. The “.type” directive would be used as well, and the assembler would clear the lowest bit in the address to ensure the code is executed as A32. For example:

<source lang=asm>
.cpu cortex-a8
.syntax unified

@ Small but slower code here
.thumb

.type Block1, %function
Block1:
ldr r0, =Block2
bx r0

@ Larger but faster code here
.arm

.type Block2, %function
Block2:
@ ...
</source>

The directive “.code 32” has the same meaning as “.arm”, and “.code 16” the same as “.thumb” (although the name is slightly misleading, as T32 instructions can be 32 bit as well). There is also “.type Label, %object” to declare some label refers to data in flash or RAM; this is optional, but helps in working with analysis tools (see below).
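
As a minimal sketch of the data variant, the following fragment declares a 4-byte variable in “.data”. The name “Counter” is hypothetical. The additional “.size” directive is not used elsewhere in this tutorial; it is a GNU assembler directive for ELF targets that records the size of the object in the symbol table, where analysis tools can display it:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.data
.type Counter, %object
Counter:
.space 4 @ Reserve 4 bytes for the variable
.size Counter, . - Counter @ Record the symbol size (4 bytes)

.text
ldr r0, =Counter @ Address of the variable; the lowest bit is not modified
ldr r1, [r0] @ Read the variable
</source>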

=== Constants ===
The previous examples contain a lot of numbers (esp. addresses), the meaning of which is not obvious to the reader - so called “magic numbers”. As code is typically read many times more than written/modified, readability is important, even for assembly code. Therefore, it is common practice to define constants that assign names to numbers such as addresses, and use names instead of the number directly.

The assembler actually does not provide any dedicated mechanism for defining constants. Instead, symbols as introduced before are used. You can define a symbol in any of the following ways:
<source lang=asm>
RCC_APB2ENR = 0x40021018
.set GPIOA_CRH, 0x40010804
.equ GPIOA_ODR, 0x4001080C
</source>

and then use it in place of the number:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

Replacing (almost) all numbers in the source code for the LED blinker by constants yields a source code like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_10MHz = 1
GPIOx_CRx_GP_PP_2MHz = 2
GPIOx_CRx_GP_PP_50MHz = 3

GPIOx_CRx_GP_OD_10MHz = 1|4
GPIOx_CRx_GP_OD_2MHz = 2|4
GPIOx_CRx_GP_OD_50MHz = 3|4

GPIOx_CRx_AF_PP_10MHz = 1|8
GPIOx_CRx_AF_PP_2MHz = 2|8
GPIOx_CRx_AF_PP_50MHz = 3|8

GPIOx_CRx_AF_OD_10MHz = 1|4|8
GPIOx_CRx_AF_OD_2MHz = 2|4|8
GPIOx_CRx_AF_OD_50MHz = 3|4|8

GPIOx_CRx_IN_ANLG = 0
GPIOx_CRx_IN_FLOAT = 4
GPIOx_CRx_IN_PULL = 8

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkConstants”

This is much more readable than before. In fact, you could even leave out the comments, as the code becomes more self-documenting. The addresses of peripheral registers are defined individually, but the bits for the GPIO registers are the same for each GPIO module, so the names include an “x” to denote that they apply to all GPIO modules.

The “CRL”/“CRH” registers get a special treatment. Since the individual bits have little direct meaning, it would be pointless to name them. Instead, 15 symbols are defined to denote the 15 possible modes of operation per pin (combinations of input/output, open-drain vs. push-pull, analog vs. digital, floating vs. pull-resistors, and output driver slew rate). Each of the 15 symbols has a 4 bit value that needs to be written into the appropriate 4 bits of the register. To configure e.g. PA10 as General Purpose Open-Drain with 10 MHz slew rate:

<source lang=asm>
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffff0ff
orr r0, #(GPIOx_CRx_GP_OD_10MHz<<8)
str r0, [r1]
</source>

C-like arithmetic operators can be used in constant expressions, like + - * / and bitwise operators like | (or), & (and), << (left shift) and >> (right shift). Note that these calculations are always done by the assembler. In the example, or | is used to combine bit values.
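
The mask and the shift amount in such a sequence can also be derived from the pin number by constant expressions. The following is a minimal sketch; the symbols “Pin”, “PinShift” and “PinMask” are hypothetical names, and “bic” (bit clear) is used in place of “and” with an inverted mask, which has the same effect:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

GPIOA_CRH = 0x40010804
GPIOx_CRx_GP_OD_10MHz = 1|4

Pin = 10 @ Pin number within the port (8-15 for CRH)
PinShift = (Pin - 8) * 4 @ Bit offset of the pin's 4-bit field: (10 - 8) * 4 = 8
PinMask = 0xf << PinShift @ 0x00000f00

ldr r1, =GPIOA_CRH
ldr r0, [r1]
bic r0, #PinMask @ Clear the 4 bits belonging to the pin
orr r0, #(GPIOx_CRx_GP_OD_10MHz << PinShift) @ Insert 5 << 8 = 0x500
str r0, [r1]
</source>

To configure a different pin in the range 8-15, only the value of “Pin” has to be changed.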

Since these constants are actually symbols, they can collide with assembler labels, so you must not define a symbol with the same name as any label.

A different kind of constants are register aliases. Using the “.req” directive, you can define a name for a processor register:

<source lang=asm>
MyData .req r7
ldr MyData, =123
add MyData, #3
</source>

This can be useful for large assembly blocks where the meaning of register data is not obvious. It also allows you to re-assign registers without having to modify many lines of code.
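
The following sketch shows how an alias can be removed again with the GNU assembler's “.unreq” directive, which is not used elsewhere in this tutorial. Afterwards, the name can be assigned to another register:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

MyData .req r7
ldr MyData, =123
add MyData, #3 @ r7 = 126
.unreq MyData @ "MyData" no longer names r7

MyData .req r6 @ The name can now be reused for another register
ldr MyData, =456 @ r6 = 456
</source>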
=== The Stack ===
In computer science, a stack is a dynamic data structure where data can be added and removed flexibly. Like a stack of books, the last element that was put on top must be taken and removed first (LIFO-structure - Last In, First Out). Adding an element is usually called “push”, and reading & removing “pop”.

Many processor architectures including ARM feature circuitry to deal with such a structure efficiently. Like most others, ARM does not provide a dedicated memory area for this - it just facilitates using an area that the programmer reserved for this purpose as a stack. Therefore, a part of the SRAM needs to be reserved for the stack.

On ARM, the program stores processor registers on the stack, i.e. 32bit per element. The stack is commonly used when the contents of some register will be needed again later after it has been overwritten by some complex operation that needs many registers. These accesses always come in pairs:

* Some operation that writes to r0
* '''Push''' (save) r0 to the stack
* Some operation that overwrites r0
* '''Pop''' (restore) r0 from the stack
* Use the value in r0 which is the same as initially assigned

ARM’s instructions for accessing the stack are unsurprisingly called “push” and “pop”. They can save/restore any of the registers r0-r12 and r14, for example:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ … Some code that overwrites r0 …

pop { r0 } @ Restore value 1000000
@ Continue using r0 ...
</source>

It is also possible to save/restore multiple registers in one go:

<source lang=asm>
ldr r0, =1000000
ldr r1, =1234567
@ Use r0 and r1 ...
push { r0, r1 } @ Save values 1000000 and 1234567

@ … Some code that overwrites r0 and r1 …

pop { r0, r2 } @ Restore 1000000 into r0 and 1234567 into r2
@ Continue using r0 and r2...
</source>

It does not matter to which register the data is read back - in the previous example, the value that was held in r1 is restored into r2. In larger applications, many store-restore pairs will be nested:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ Inner Code Block:

ldr r0, =123
@ Use r0 …

push { r0 } @ Save value 123

@ Inner-Inner Code Block that overwrites r0

pop { r0 } @ Restore value 123
@ Continue using r0 ...

pop { r0 } @ Restore value 1000000 into r0

@ Continue using r0 …
</source>

The “inner” push-pop pair works with value 123, and the “outer” push-pop pair works with value 1000000. Assuming that the stack was empty at the beginning, it will contain 1000000 after the first “push”, and both 1000000 and 123 after the second push. After the first “pop” it contains only 1000000 again, and is empty after the second “pop”.

At the beginning of a push-pop pair, the current contents of the stack is irrelevant - it may be empty or contain many elements. After the “pop”, the stack will be restored to its previous state. This makes it possible to (almost) arbitrarily nest push-pop-pairs - after any inner push-pop-pair has completed, the stack is in the same state as before entering the inner pair, so the “pop” part of the outer pair doesn’t even notice the stack was manipulated in between. This is why it is important to make sure that each “push” has a matching “pop”, and vice-versa.
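
The LIFO order can be made visible with a short sketch that pushes two registers separately and pops them in the same register order, which exchanges their contents. The values 111 and 222 are arbitrary:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

ldr r0, =111
ldr r1, =222
push { r0 } @ Stack contains: 111
push { r1 } @ Stack contains: 111, 222
pop { r0 } @ r0 = 222 (pushed last, popped first)
pop { r1 } @ r1 = 111, stack is in its initial state again
</source>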

As mentioned, an area of memory has to be reserved for the stack. Access to the stack memory is managed via the stack pointer (SP). The stack pointer resides in the processor register r13, and “sp” is an alias for that. As the name implies, the stack pointer contains a 32bit memory address - specifically, the address of the first byte in the stack that contains any saved data.

When storing a 32bit register value using “push”, the stack pointer is '''first''' decremented by 4 before the value is written at the newly calculated address. To restore a value, the value at the address currently stored in the stack pointer is read from memory, after which the stack pointer is incremented by 4. This is called a “full-descending” stack (see the ARM Architecture Reference Manual, chapter B1.5.6). On ARMv7-A (Cortex-A), this behaviour can be changed, but on ARMv7-M, it is dictated by the exception handling logic, which will be explained later.

An implication of this is that if the stack is empty, the stack pointer contains the address of the first byte '''after''' the stack memory area. If the stack is completely full, it contains the address of the very first byte '''inside''' the stack memory area. This means that the stack grows '''downward'''. Since the stack is empty at program start, the stack pointer therefore needs to be initialized to the first address after the memory area. Before executing the first instruction, the processor loads the first 4 bytes from the flash into the stack pointer. This is why “_StackEnd” was defined and used to place the address of the first byte after the stack memory region into the first 4 bytes of flash.
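
To illustrate the two steps, a single-register “push” and “pop” can be written out as ordinary store and load instructions with address write-back. This is only a sketch to show the effect on the stack pointer; in real code, “push” and “pop” are preferable:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

ldr r0, =1000000
str r0, [sp, #-4]! @ Like "push { r0 }": SP = SP - 4, then store r0 at the new SP
mov r0, #0 @ Overwrite r0
ldr r0, [sp], #4 @ Like "pop { r0 }": load r0 from SP, then SP = SP + 4
@ r0 contains 1000000 again, and SP has its initial value
</source>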

The stack pointer must always be a multiple of 4 (see chapter B5.1.3 in the ARM Architecture Reference Manual). It is a common error (which is even present in the example projects by ST!) to initialize the stack pointer to the last address ''inside'' the stack memory area (e.g. 0x200003FF instead of 0x20000400), which is not divisible by four. This can cause the application to crash or “just” slow it down. Actually, the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4127.html ARM ABI requires] the stack pointer to be a multiple of 8 for public software interfaces, which is important for e.g. the “printf” C function. So, when calling any external code, make sure the stack pointer is a multiple of 8.

In the previous examples, the stack memory area was defined with a size of 0x400, i.e. 1KiB. Choosing an appropriate stack size is critical for an application; if it is too small, the application will crash, if it is too large, memory is wasted that could be used otherwise. Traditionally, the stack is configured to reside at the ''end'' of available memory, e.g. 0x20005000 for the STM32F103. As the linker starts allocating memory for data (using “.data” in assembly or global/static variables in C) at the beginning of the memory, the stack is as far away from that regular data as possible, minimizing the chance of a collision. However, if the stack grows continuously, the stack pointer might end up pointing into the regular data area (“.data” or C globals) or heap memory (used by “malloc” in C). In that case, writing to the stack silently overwrites some of the regular data. This can result in all kinds of hard-to-find errors. Therefore, the example code puts the stack area at the ''beginning'' of RAM, and the regular data after that - if the stack grows too large, the stack pointer will reach values below 0x20000000, and any access will result in an immediate “clean” crash. This usually makes it easy to find the code location that uses too much stack memory and, if necessary, to increase the stack size. Using the Cortex-M3’s memory protection unit (MPU) enables even more sophisticated strategies, but that is out of scope for this tutorial.
=== Function calls ===
Many programming languages feature a “function” concept. Also known as “procedures” or “subprograms”, functions are the most basic building blocks of larger applications, and applying them correctly is key for clean, reusable code. The assembler does not know about functions directly, so you have to build them yourself. A function is a block of code (i.e. a sequence of instructions) that you can jump to, that does some work, and that then jumps back to the place from which the first jump originated. This ability to jump back is the main difference from any other block of assembly code. To make this explicit, such a jump to a function is known as a “call” (as in “calling a function”). The location in code that starts the jump to the function is known as the “caller”, and the called function as the “callee”. From the perspective of the caller, calling a function resembles a “user-defined” instruction - it performs some operation after which the code of the caller continues as before. To make the jump back possible, the address of the ''next'' instruction after the one that started the function call needs to be saved, so that the function can jump back to that location (without calling the function directly again).

This is done via the Link Register (LR), which is the processor register r14. Function calls are performed with the “bl” instruction. This instruction performs a jump, much like the well-known “b”, but also saves the address of the next instruction in LR. When the function is finished, it returns to the caller by jumping to the address stored in LR. As already mentioned, jumping to a location from a register is called an indirect jump, which is performed by the “bx” instruction. So, to return from a function, use “bx lr”:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

bl EnableClockGPIOA @ Call function to enable GPIOA's peripheral clock

@ Some more code ...
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1]

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

</source>

Here, the code to enable the clock for GPIOA was packaged into a function. To enable this clock, only a single line is now required - “bl EnableClockGPIOA”.

When calling a function, the “bl” instruction automatically makes sure to set the lowest bit in LR such that the subsequent “bx lr” will not crash because of an attempted instruction set switch, which is not possible on Cortex-M. If you need to call a function indirectly, use “blx” with a register, and remember to ensure that the lowest bit is set, typically via “.type YourFunction, %function”. Usually, all the code of an application resides within functions, with the possible exception of the Reset_Handler. The order in which functions are defined in the source files does not matter, as the linker will always automatically fill in the correct addresses. If you want to put functions in separate source files, remember to use “.global FunctionName” to make sure the symbol is visible to other files.
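
A minimal sketch of such an indirect call follows; the function name “DoNothing” is hypothetical:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =DoNothing @ Address with the lowest bit set, thanks to ".type"
blx r0 @ Indirect call: saves the return address in LR, then jumps
b . @ Execution continues here after the function has returned

.type DoNothing, %function
DoNothing:
bx lr @ Return to caller
</source>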
==== Using the stack for functions ====
In large applications it is common for functions to call other functions in a deeply nested fashion. However, a function implemented as shown can’t do that - using “bl” would overwrite the LR, and so the return address of the outer function would be lost, and that function couldn’t ever return. The solution is to use the stack: At the beginning of a function that calls other functions, use “push” to save the LR, and at the end use “pop” to restore it. For example, the blink program could be restructured like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r5, =5 @ Number of LED flashes.
bl Blink
b .

.type Blink, %function
Blink:
push { lr }
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

bl Delay

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

bl Delay

subs r5, #1
bne BlinkLoop

pop { lr }
bx lr

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

.type Delay, %function
Delay:
mov r4, r3
DelayLoop:
subs r4, #1
bne DelayLoop @ Iterate delay loop
bx lr

</source>
Example name: “BlinkFunctions”

The Reset_Handler just became much prettier. There now are functions for enabling the GPIOA clock, configuring PA8 as output, and one that delays execution so that the LED blinking is visible. The “Blink” function performs the blinking, but only for 5 flashes, after which it returns (an endless blink-loop wouldn’t be good for demonstrating returns). As you see, LR is saved on the stack to allow “Blink” to call further functions.

The two lines

<source lang=asm>
pop { lr }
bx lr
</source>

are longer than necessary. It is possible to load the return address from the stack directly into the program counter, PC:

<source lang=asm>
pop { pc }
</source>

This way, the return address that was saved on the stack is directly used for the jump back. Just the same way, you can use “push” and “pop” to save and restore any other registers while your function is running.
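
As a minimal sketch of this, a function that needs r4 and r5 for its own purposes could save them together with LR at the start and restore everything with a single “pop” at the end. The function name “UseMoreRegisters” and the values are made up for illustration:

<source lang=asm>
.type UseMoreRegisters, %function
UseMoreRegisters:
push { r4, r5, lr } @ Save r4, r5 and the return address on the stack
ldr r4, =1 @ r4 and r5 can now be overwritten freely
ldr r5, =2
add r4, r4, r5
pop { r4, r5, pc } @ Restore r4 and r5, and return by loading the saved return address into PC
</source>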
==== Calling Convention ====
Building a large program as shown in the last example is a bad idea. The “Delay” function requires the iteration count to reside in r3 and overwrites r4. The “Blink” function relies on “Delay” not overwriting r0-r3 and r5, and requires the number of flashes to be given via r5. Such requirements can quickly grow into an intricate web of interdependencies that makes it impossible to write larger functions that call several sub-functions, or to restructure anything. Therefore, it is common to use a calling convention, which defines which registers a function may overwrite, which it must preserve, how it should use the stack, and how information is passed to the function and back to the caller.

When building an entire application out of your own assembly code, you can invent your own calling convention. However, it is always a good idea to use existing standards: The AAPCS (“Procedure Call Standard for the ARM Architecture”) defines a calling convention for ARM. This convention is also followed by C and C++ compilers, so using it makes your code automatically compatible with those. The Cortex-M interrupt mechanism follows it too, which would make it awkward to use code that follows some other convention in interrupt handlers. The specification of the calling convention is quite complex, so here is a quick summary of the basics:

* Functions may only modify the registers r0-r3 and r12. If more registers are needed, they have to be saved and restored using the stack. The APSR may be modified too.
* The LR is used as shown for the return address.
* When returning (via “bx lr”) the stack should be exactly in the same state as during the jump to the function (via “bl”).
* The registers r0-r3 may be used to pass additional information to a function, called parameters, and the function may overwrite them.
* The register r0 may be used to pass a result value back to the caller, which is called the return value.

This means that when you call a function, you must assume registers r0-r3 and r12 may be overwritten but the others keep their values. In other words, the registers r0-r3 and r12 are (if at all) saved ''outside'' the function (“caller-save”), and the registers r4-r11 are (if at all) saved ''inside'' the function (“callee-save”).

A function that does not call any other functions is called a “leaf-function” (as it is a leaf in the call tree). If such a function is simple, it might not need to touch the stack at all, as the return address is just kept in a register (LR) and it might only overwrite the registers r0-r3 and r12, which the caller can make sure contain no important data. This makes small functions efficient, as register accesses are faster than memory accesses, such as those to the stack.
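
As an illustration of parameters, return values and leaf functions, here is a small sketch. The function names “Sum3” and “CallSum3” and the numbers are made up for this example; “Sum3” takes three parameters in r0-r2, returns its result in r0, and never touches the stack:

<source lang=asm>
@ Parameters: r0, r1, r2 = numbers to add
@ Returns: r0 = sum of the three numbers
.type Sum3, %function
Sum3:
add r0, r0, r1
add r0, r0, r2
bx lr @ Leaf function: the return address is still in LR

.type CallSum3, %function
CallSum3:
push { lr } @ Not a leaf function: “bl” overwrites LR, so save it
ldr r0, =10 @ First parameter
ldr r1, =20 @ Second parameter
ldr r2, =12 @ Third parameter
bl Sum3 @ Afterwards, r0 contains 42; r1-r3 and r12 must be assumed to be overwritten
pop { pc } @ r0 still holds the sum and thereby becomes the return value of “CallSum3”
</source>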

If all your functions follow the calling convention, you can call any function from anywhere and be sure about what it overwrites, even if it calls many other functions on its own. Restructuring the LED blinker could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =5
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r7, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
mov r7, r0 @ Number of LED flashes.

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r7, #1
bne BlinkLoop

pop { r4-r7, pc }

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>
Example name: “BlinkFunctionCallingConvention”

The three small functions at the end only use registers r0 and r1, which they are free to overwrite. The “Delay” function expects the number of iterations as a parameter in r0, which it then modifies. Therefore, the “Blink” function fills r0 before every call to “Delay”. Alternatively, “Delay” could use a fixed iteration count, i.e. the “ldr” could be moved into “Delay”. As the “Blink” function must assume that “Delay” overwrites r0-r3 and r12, it keeps its own data in r4-r7, which are guaranteed to be retained according to the calling convention. Since “Blink”, in turn, must preserve these registers for the function that called it, it uses “push” and “pop” to save and restore them. Note the shortened syntax “r4-r7” in the instructions. The number of LED flashes is passed in r0 as a parameter; as this register will be overwritten, this number is moved to r7.
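
The alternative with a fixed iteration count could look like the following sketch, which reuses the “DelayLoopIterations” constant from the example. With this variant, “Blink” would no longer need to load r0 before each call:

<source lang=asm>
@ Parameters: none. Overwrites r0, which is permitted by the calling convention.
.type Delay, %function
Delay:
ldr r0, =DelayLoopIterations @ Fixed number of iterations
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>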

Alternatively, “Blink” could re-load the constants each time they are used in r1/r2, such that only one register (r4) needs to be saved as it is needed to count the number of flashes:

<source lang=asm>
.type Blink, %function
Blink:
push { r4, lr }

mov r4, r0

BlinkLoop:
ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r4, #1
bne BlinkLoop

pop { r4, pc }
</source>
Example name: “BlinkFunctionCallingConvention2”

A third variant would not use any of the callee-save registers (r4-r11) at all, and instead just save r0 before the function calls and restore it as needed:

<source lang=asm>
.type Blink, %function
Blink:
push { lr }

BlinkLoop:
push { r0 }

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

pop { r0 }
subs r0, #1
bne BlinkLoop

pop { pc }

</source>
Example name: “BlinkFunctionCallingConvention3”

The frequent stack accesses would however make this slower. Be sure to always document the meaning (and units, if applicable) of parameters e.g. via comments.
=== Conditional Execution ===
As mentioned, the conditional variants of the “b” instruction (e.g. “bne”) can be used to execute certain blocks of code only if a certain condition is met. First, more ways to formulate conditions will be shown. Next, the ARM instruction “it” will be introduced, which makes executing small blocks of code conditionally more efficient.
==== Conditions ====
All conditions for conditional execution depend on the outcome of some mathematical operation. When instructions such as “adds”, “subs”, “ands” are used, they update the flags in the APSR register depending on the outcome, which are then read by the conditional variants of “b” to decide whether to actually perform the jump.

Often it is necessary to compare two numbers without actually doing a calculation. This can be done with the “cmp” instruction to which you can pass two registers or a register and a literal:

<source lang=asm>
cmp r0, #42
cmp r0, r1
</source>

The “cmp” instruction is very similar to “subs” - it subtracts the second operand from the first, but doesn’t save the result anywhere, i.e. the registers keep their values. Just the flags in the APSR are updated according to the result, just as with “subs”. For example, if both operands were equal, the result of the subtraction is zero, and the zero flag will be set. So, to test whether two numbers are equal:

<source lang=asm>
cmp r0, #42
beq TheAnswer

@ This is executed if r0 is not 42

TheAnswer:
@ This is executed if r0 is 42
</source>

The “bne” instruction is the opposite of “beq”.

The “tst” instruction works similarly to “cmp”, but instead of subtracting, it performs a bitwise “and” operation - like the “ands” instruction, but without keeping the result. This way, you can test whether a bit in a register is set:

<source lang=asm>
tst r0, #4
beq BitNotSet

@ This is executed if bit 2 in r0 is set

BitNotSet:
@ This is executed if bit 2 in r0 is not set
</source>

Another useful use case for “tst” is to pass the same register twice. ANDing a value with itself yields the value itself, so “tst” in this case effectively checks the properties of the input (negative/positive, zero):

<source lang=asm>
tst r0, r0
beq ValueZero

@ This is executed if r0 is not zero

ValueZero:
@ This is executed if r0 is zero
</source>

There is also the “teq” instruction which performs an exclusive or operation.
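
One possible use of “teq”, shown here as a sketch with made-up label names: the exclusive or of two numbers has bit 31 set exactly if their sign bits differ, and “teq” copies that bit into the N flag. This allows checking whether two signed numbers have the same sign:

<source lang=asm>
teq r0, r1 @ Calculates r0 EOR r1 and updates the flags, but discards the result
bmi SignsDiffer

@ This is executed if r0 and r1 have the same sign bit

SignsDiffer:
@ This is executed if the sign bits of r0 and r1 differ
</source>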

As mentioned, the suffixes “eq” and “ne” are called condition codes. ARM has 14 of those which define how the flags in the APSR form the condition. The details about how a subtraction (by “subs” or “cmp”) sets the flags in the APSR and how their interpretation by the different condition codes correlates to the mathematical result are somewhat complicated, involving the way the 2’s complement format works and relying on the fact that subtracting works by adding a negated number. Instead of diving into all the details, a table with a more high-level view and a practical interpretation of the condition should be more helpful:

{| class="wikitable"
! style="text-align:left;" | Code
! style="text-align:left;" | Meaning
! style="text-align:left;" | Unsigned / Signed
! style="text-align:left;" | Flags
! style="text-align:left;" | Condition after “cmp/subs r0, r1”
! style="text-align:left;" | Condition after “tst r0, r0”
|-
| style="text-align:center;" | EQ
| Equal
| style="text-align:center;" | U+S
| Z==1
| r0 = r1
| r0 = 0
|-
| style="text-align:center;" | NE
| Not equal
| style="text-align:center;" | U+S
| Z==0
| r0 ≠ r1
| r0 ≠ 0
|-
| style="text-align:center;" | MI
| Negative
| style="text-align:center;" | S
| N==1
| ---
| r0 < 0
|-
| style="text-align:center;" | PL
| Positive or Zero
| style="text-align:center;" | S
| N==0
| ---
| r0 ≥ 0
|-
| style="text-align:center;" | VS
| Overflow
| style="text-align:center;" | S
| V==1
| r0-r1 out of range¹
| ---
|-
| style="text-align:center;" | VC
| No Overflow
| style="text-align:center;" | S
| V==0
| r0-r1 in range¹
| ---
|-
| style="text-align:center;" | HS
| Unsigned higher or same
| style="text-align:center;" | U
| C==1
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LO
| Unsigned lower
| style="text-align:center;" | U
| C==0
| r0 < r1
| ---
|-
| style="text-align:center;" | HI
| Unsigned higher
| style="text-align:center;" | U
| C==1 and Z==0
| r0 > r1
| ---
|-
| style="text-align:center;" | LS
| Unsigned lower or same
| style="text-align:center;" | U
| C==0 or Z==1
| r0 ≤ r1
| ---
|-
| style="text-align:center;" | GE
| Signed greater or equal
| style="text-align:center;" | S
| N==V
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LT
| Signed less than
| style="text-align:center;" | S
| N!=V
| r0 < r1
| ---
|-
| style="text-align:center;" | GT
| Signed greater than
| style="text-align:center;" | S
| Z==0 and N==V
| r0 > r1
| ---
|-
| style="text-align:center;" | LE
| Signed less or equal
| style="text-align:center;" | S
| Z==1 or N!=V
| r0 ≤ r1
| ---
|}
¹: Range meaning the numbers from -(2^31) to (2^31-1), inclusive

To determine which condition code you need, first think about whether the number is unsigned (range 0 to 2^32-1) or is using two's complement to represent signed numbers (range -2^31 to 2^31-1). Ignore all rows in the table with the wrong format.

If you want to compare two numbers, use the “cmp” instruction, and search for the desired condition within the “cmp”-condition column of the table. If you want to test a single number's properties, use the “tst”-column. Use the condition code from the first column with the conditional “b” instruction (“bne”, “beq”, “bmi”, “bpl”, “bhs”, ... ) right after the appropriate “cmp”/“tst” instruction.

Note that all the condition codes have a corresponding inverse code that has exactly the negated meaning. Most also have a swapped partner code, using which is equivalent to swapping the operands for cmp.
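
The difference between signed and unsigned condition codes, and the idea of a swapped partner code, can be seen in the following sketch. The values and label names are chosen only for illustration:

<source lang=asm>
ldr r0, =0xffffffff @ 4294967295 if interpreted as unsigned, -1 if interpreted as signed
ldr r1, =1

cmp r0, r1
bgt SignedGreater @ Not taken: as signed numbers, -1 > 1 is false
bhi UnsignedHigher @ Taken: as unsigned numbers, 4294967295 > 1 is true

SignedGreater:
@ Not reached with these values
b .

UnsignedHigher:
cmp r1, r0 @ The same comparison with swapped operands
blo SwappedPartner @ Taken as well: “lo” is the swapped partner of “hi” (1 < 4294967295)
b .

SwappedPartner:
b .
</source>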
==== The IT instruction ====
Jumping is inefficient, so having many conditional jumps may slow down your program. The ARM architecture offers a way to make a few instructions conditional without requiring a jump via the “it” (if-then) instruction. It is used in place of a conditional jump after an instruction that sets the flags (“cmp”, “tst”, “adds”…) and also needs a condition code. The instruction right after the “it” will then only be executed if the condition is met, and skipped otherwise. You have to repeat the condition code by appending it to that instruction; this is just to make the code clearer and avoid confusion.

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

it hi @ Make the next instruction conditional
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

This checks if r0 is higher than r1 (it isn’t), and only sets the pin PA8 to high if this condition is met. Up to 4 instructions can be made conditional like this; for each one, an additional “t” has to be appended to the “it” instruction:

<source lang=asm>
cmp r0, r1 @ Perform comparison

ittt hi @ Make the next three instructions conditional
ldrhi r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldrhi r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

You can also add instructions that will be executed if the condition was ''not'' met (like an “else”-case in high-level-languages), by appending “e” instead of “t” to the “it” instruction. Since the “t” in “it” is fixed, the first instruction is always executed if the condition is met; only the next three instructions can be either a “then” case (“t”) or “else” case (“e”). You also have to provide the inverted condition code for the “else”-instructions:

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

ite hi @ Make the next two instructions conditional (if-then-else)
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
strls r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low
</source>

There are several restrictions on which instructions may appear within an it-block. Most importantly, instructions that set the flags are forbidden here, as is the “b” instruction except for the last instruction in an “it” block. Directly jumping to one of the conditional instructions is forbidden too.
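
A sketch of the one permitted position for a jump, using a made-up label name: the “b” instruction is placed last in the block, so the block is complete once the jump is executed.

<source lang=asm>
cmp r0, #0
itt eq @ Make the next two instructions conditional
moveq r1, #1 @ Only executed if r0 is zero
beq ValueWasZero @ Allowed, as the jump is the last instruction of the block

@ This is executed if r0 is not zero

ValueWasZero:
@ This is executed if r0 is zero (r1 has been set to 1)
</source>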

In T32 code, only the conditional “b” instruction is capable of encoding a condition code together with some operation, so the “it” instruction is provided to make any instruction conditional. On A32, most instructions include a condition code and can therefore be conditional, and the “it” instruction is actually ignored by the assembler here. You can and should still put “it” into code intended for A32, as this makes it compatible with T32. This is one of the reasons why A32 is more time-efficient, and T32 more space-efficient.

Conditional instructions sometimes make surprisingly compact programs. For example, the Euclidean algorithm for calculating the greatest common divisor (gcd) of two numbers can be written in ARM assembly like this:

<source lang=asm>
gcd:
cmp r0, r1
ite gt
subgt r0, r0, r1
suble r1, r1, r0
bne gcd
</source>

While the C equivalent is actually longer:

<source lang=c>
int gcd(int a, int b) {
while (a != b) {
if (a > b)
a = a - b;
else
b = b - a;
}
return a;
}
</source>

The usage of conditional instructions is also [https://developer.arm.com/docs/dui0801/latest/condition-codes/example-showing-the-benefits-of-conditional-instructions-in-a32-and-t32-code faster] than using conditional jumps. Note that the final “bne” instruction is independent of the “if-then” block; it just directly uses the result of “cmp”.
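
For comparison, a version of the same algorithm that uses only conditional jumps might look like the following sketch. The label names are made up; as before, the two numbers are expected in r0 and r1, and the result ends up in r0:

<source lang=asm>
gcd_jumps:
cmp r0, r1
beq gcd_done @ a == b: finished, the result is in r0
blt gcd_less @ “beq” did not change the flags, so the result of “cmp” is still valid
sub r0, r0, r1 @ a > b: a = a - b
b gcd_jumps
gcd_less:
sub r1, r1, r0 @ a < b: b = b - a
b gcd_jumps
gcd_done:
bx lr
</source>

This needs more instructions and up to three jumps per iteration instead of one.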
=== 8/16 bit arithmetic ===
So far, all numbers were 32 bits wide. However, especially to save space, smaller numbers with 8 or 16 bits are often needed. Cortex-M3 doesn't provide any instructions for calculating with 8 or 16 bit numbers directly. Instead, after loading such a number from memory into a processor register, it has to be extended to 32bit to allow the 32bit instructions to work properly. When storing the result back, only the lower 8/16 bits are used. If 8/16bit overflow behavior is required (i.e. wrap-around at -128/127 for 8bit signed, 0/255 for 8bit unsigned, -32768/32767 for 16bit signed, 0/65535 for 16bit unsigned) for calculations, the numbers have to be truncated after each calculation. This actually makes it slightly less efficient to deal with smaller numbers.

A 16bit value (“halfword”) can be read from memory with the ldrh instruction:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
</source>

“ldrh” loads 16bit from memory, writes them into the lower 16 bits of the destination register (here: r1), and sets the upper 16bits to zero. If the value is signed, it has to be sign-extended so that it can be used with 32bit-calculations:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
sxth r1, r1
</source>

The “sxth” instruction copies the sign bit (i.e. bit 15) into the upper 16 bits (“sign-extension”); this makes sure that negative 16bit-numbers keep their value when interpreted as 32 bits. The “ldrsh” instruction combines both “ldrh” and “sxth”. “ldrb”, “sxtb”, “ldrsb” are for loading and sign-extending 8bit-values and the combination of both, respectively.
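
A short sketch of the combined and 8bit variants, using the same placeholder symbol “SomeAddress” as above:

<source lang=asm>
ldr r0, =SomeAddress
ldrsh r1, [r0] @ Load 16 bits and sign-extend them to 32 bits (same as “ldrh” followed by “sxth”)
ldrb r2, [r0] @ Load 8 bits and set the upper 24 bits to zero (unsigned value)
ldrsb r3, [r0] @ Load 8 bits and sign-extend them to 32 bits (signed value)
</source>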

To simulate 8/16bit overflow behaviour after a mathematical operation, use uxtb/uxth for unsigned 8/16 bit numbers, or sxtb/sxth for signed 8/16 bit numbers:

<source lang=asm>
add r0, #1
uxth r0, r0
</source>

The “uxth”/“uxtb” instructions copy the lower 16/8 bits of a register into another one, setting the upper 16/24 bits to zero. This way, if r0 contained 65535 before, the result will be 0 instead of 65536 after using “uxth”.
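
The signed 8bit case works the same way with “sxtb”. The following sketch assumes that “SomeAddress” refers to a writable byte in RAM that currently contains the value 127:

<source lang=asm>
ldr r1, =SomeAddress
ldrsb r0, [r1] @ Load signed 8bit value, here assumed to be 127
add r0, #1 @ r0 is now 128, which does not fit into a signed 8bit number
sxtb r0, r0 @ Simulate 8bit overflow: r0 becomes -128 (0xffffff80)
strb r0, [r1] @ Store the lower 8 bits (0x80) back into memory
</source>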

This is a common trap when coding in C - when using e.g. the "uint16_t" type for local variables such as loop counters, this implicitly requests 16bit overflow behavior, requiring the truncation after each calculation, even though the overflow may actually never happen. This is why e.g. "uint_fast16_t" should be used for local variables, as this is 32 bit on ARM, which is faster.
=== Alignment ===
There are certain restrictions on the address when accessing data in memory using the “str”/“ldr” variants:
* The “ldrd”/“strd”/“ldm”/“stm” instructions, which can load/store multiple registers at once, always require the address to be a multiple of 4. If it isn’t, the program will crash.
* The “ldr”/“str” instructions require the address to be a multiple of 4, and “strh”/“ldrh” require it to be a multiple of 2. If it isn’t, the behaviour depends on the ARM version:
** On ARMv6-M and before, the program will crash.
** On ARMv7-M:
*** If the CCR.UNALIGN_TRP is set to zero (the default), the access will be slow
*** If the CCR.UNALIGN_TRP bit is set to one, the program will crash, emulating the ARMv6-M behaviour

For “strb”/“ldrb” there are no such requirements.

The number of which the address needs to be a multiple of is called the “alignment” (e.g. 2-byte-alignment, 4-byte-alignment, …). An access with an address that is a multiple of 2/4 as specified above is called an “aligned access”; others are called “unaligned access” (which are slow or cause a crash).
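
Whether an address is a multiple of 4 can be checked at runtime by testing its lowest two bits, which are both zero for such addresses. A minimal sketch, assuming the address is in r0 and using a made-up label name:

<source lang=asm>
tst r0, #3 @ Check the lowest two bits of the address
bne Unaligned

ldr r1, [r0] @ This is an aligned access, as r0 is a multiple of 4

Unaligned:
@ This is executed if r0 is not a multiple of 4
</source>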

Even though slow accesses may be acceptable, it is still a good idea to make sure all accesses are always correctly aligned in case the code is ported to an ARM version or operating system that requires it. The addresses of peripheral registers are already aligned correctly, so there is no need to worry. When placing data in RAM however, you should make sure that the addresses of the individual elements that are accessed via one of the “ldr” variants are aligned properly. For example, if a previous example code was modified like this:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

The address of “var1” will not be a multiple of 4, and an access via “ldr” would be unaligned. This could be improved by adding a space of 3 bytes in between:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.space 3
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

This would require you to keep in mind all the other things in memory that were declared before, which is impractical especially if multiple assembly files are used. Therefore, the assembler offers the “.align” directive:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.align 2
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

When using “.align X”, the assembler makes sure that the next address will be a multiple of 2^X, so in this case, a multiple of 2^2=4. The assembler will therefore insert 0 to 2^X-1 bytes of space. The section containing the directive in the object code file will also be marked to require that alignment, such that the linker will automatically place it at the appropriate location in address space.
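
The same directive also works for halfwords, for which “.align 1” (a multiple of 2^1=2) is sufficient. A sketch with made-up variable names that mixes all three sizes:

<source lang=asm>
.data
flag8:
.space 1 @ 1 byte, accessed via “ldrb”/“strb”, no alignment needed
.align 1 @ Next address is a multiple of 2
value16:
.space 2 @ 2 bytes, accessed via “ldrh”/“strh”
.align 2 @ Next address is a multiple of 4
value32:
.space 4 @ 4 bytes, accessed via “ldr”/“str”
</source>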
=== Offset addressing ===
The various “ldr”/“str” instructions can optionally perform some calculation on the address before executing the memory access. What is shown for “ldr” here works for “str” and the variants for halfwords and bytes equivalently. There are several variants for this. The first one adds a fixed offset that is encoded within the instruction itself to the address:

<source lang=asm>
ldr r0, [r1, #8]
</source>

This adds 8 to r1 and uses the result as the address to access. The number can also be negative. This variant is useful for accessing members of a heterogeneous container organized like a C struct or the registers in a peripheral module. For example, you can load the base address of a peripheral module into a register, and then access the various registers using offset-addressing without having to load each address individually:

<source lang=asm>
GPIOA = 0x40010800
GPIOx_CRH = 0x04
GPIOx_BSRR = 0x10

GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA

ldr r1, =GPIOA

ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r0, [r1, #GPIOx_BSRR] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “OffsetAddressing”

This way, you can avoid repeated loads of similar addresses. This variant is also capable of writing the newly calculated address back into the address register by appending a “!”:

<source lang=asm>
ldr r0, [r1, #8]!
</source>

This will add 8 to r1, write the result into r1, and also use it as an address from which to load 4 bytes and store them into r0. The variant

<source lang=asm>
ldr r0, [r1], #8
</source>

works just the opposite - r1 is used as an address from which to load the data, and “r1+8” is written back to r1. The next variant adds two registers to obtain the memory address:

<source lang=asm>
ldr r0, [r1, r2]
</source>

This loads the data from the address calculated by “r1+r2”. The second register (here: r2) can also be optionally shifted left by a fixed number of bits in the range 0-3:

<source lang=asm>
ldr r0, [r1, r2, lsl #2]
</source>

This shifts r2 left by two bits (i.e. multiplies it by 4), adds it to r1, and uses that as the address (r2 itself is not modified).
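
To summarize the variants, the following sketch shows which address is accessed and what happens to the address register in each case. The start values 0x20000000 and 3 are arbitrary and only serve as an illustration:

<source lang=asm>
ldr r1, =0x20000000 @ Assumed base address, for illustration only
ldr r2, =3

ldr r0, [r1, #8] @ Loads from 0x20000008; r1 stays 0x20000000
ldr r0, [r1, r2, lsl #2] @ Loads from 0x20000000 + 3*4 = 0x2000000c; r1 and r2 are unchanged
ldr r0, [r1, #8]! @ Loads from 0x20000008; r1 becomes 0x20000008
ldr r0, [r1], #8 @ Loads from 0x20000008; r1 becomes 0x20000010 afterwards
</source>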
=== Iterating arrays ===
The offset addressing mechanism is perfectly suited to iterating arrays. This could be used to make an array defining a sequence of LED flashes that is iterated by the LED blinker application. Such an array would contain the duration of each on- and off-cycle (as passed to the “Delay” function) and be placed in flash memory:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
.word 2500000, 1000000, 2500000, 1000000, 2500000, 1000000
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
BlinkTableEnd:
</source>
Example name: “BlinkPattern”

The “.word” directive is used to place a sequence of 32bit numbers into flash memory. The label “BlinkTable” will refer to the start address of the array, and “BlinkTableEnd” to the first address ''after'' the array. These two addresses are loaded into registers before the loop. The “.align” directive is used to make sure the 32bit-words are stored at properly aligned addresses. Inside the loop, the “ldr” instruction is used to load a 32bit-word from the array and pass it to the “Delay” function. The r7 register is advanced by 4 bytes to the next 32bit-word. This is done twice, for the on- and off-time. At the end of the loop, the address register is compared with the address of “BlinkTableEnd” - until that address has been reached, the loop will continue.

Another possibility is to keep the base address of the array in a register, and increment another register that contains the offset:

<source lang=asm>
.type Blink, %function
Blink:
push { r4-r9, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =0
ldr r9, =18

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

cmp r8, r9
blo BlinkLoop

pop { r4-r9, pc }
</source>
Example name: “BlinkPattern2”

Here, r8 is incremented in steps of 1 to denote the index in the array. The “lsl” syntax for “ldr” is used to multiply r8 by 4 (since each word is 4 bytes in size) and add it to r7, which contains the array’s base address. At the end of the loop, r8 is compared with 18, which is the number of entries in the array. This variant is actually less efficient, as it needs to keep both the base address and the index in registers and also has to increment the index in each iteration.
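
The post-increment variant works for “str” as well, which makes it possible to copy an array with very few instructions. The following is a sketch of a leaf function following the calling convention; the name “CopyWords” and its parameter layout are made up for illustration:

<source lang=asm>
@ Parameters: r0 = destination address, r1 = source address (both multiples of 4),
@ r2 = number of 32bit-words to copy (must be at least 1)
.type CopyWords, %function
CopyWords:
ldr r3, [r1], #4 @ Load one word from the source and advance the source address
str r3, [r0], #4 @ Store it to the destination and advance the destination address
subs r2, #1 @ Count down the remaining words
bne CopyWords
bx lr
</source>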
=== Literal loads ===

Regardless of architecture, any processor obviously needs to work with addresses in its own address space a lot. ARM can do calculations with its 32bit addresses just fine, but there is a bottleneck: The instruction set itself. To work with any address, it needs to be initially loaded into a processor register, but ARM instructions are only 16 or 32bit in size - not enough space for an arbitrary 32bit number plus the instruction encoding. Allowing even larger instructions (e.g. 40 bit) would complicate matters, so ARM instead uses several tricks to deal with this problem, which will be discussed here.

The “ldr r0, =1234” syntax allows you to load any arbitrary 32bit number. It is not actually a machine code instruction itself, but is translated by the assembler into one. In this chapter, the actual instructions for loading immediate numbers are discussed.

==== The “mov”-instruction ====
The most basic way of loading an immediate number into a register is the “mov” instruction:

<source lang=asm>
mov r0, #1234
</source>

This allows you to load any 16bit number (0 to 2^16-1) into a register. “mov” also includes some clever encodings that allow you to load certain commonly-used patterns:

* Any 32bit number that consists of one byte of arbitrary bits (i.e. 8 adjacent arbitrary bits) at any location, and zeros otherwise, e.g. 0x00000045, 0x00045000, 0x7f800000.
* Any 32bit number that consists of the same byte repeated 2 or 4 times in fixed places, as in 0x23002300, 0x00230023, 0x23232323
* The bit-wise negated result of any of these two patterns, e.g. 0xffffffba, 0xfffbafff, 0x807fffff or 0xdcffdcff. The assembler will actually use the “mvn” instruction for this, which works identically to “mov”, but negates the value.

By specifying a number that falls into one of these patterns, the assembler will automatically use the appropriate encoding. The first two ways of encoding numbers are not only available with “mov”, but also several other mathematical instructions that expect some immediate value: “add”, “and”, “bic”, “cmn”, “cmp”, “eor”, “mov”, “mvn”, “orn”, “orr”, “rsb”, “sbc”, “sub”, “teq”, “tst”. In the ARM Architecture Reference Manual, check the description of the instructions and look out for “ThumbExpandImm” to see whether it supports the first two patterns above.
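
A few examples of immediate values that fit these patterns, as a sketch (the values are arbitrary and taken from the lists above):

<source lang=asm>
mov r0, #0x7f800000 @ One byte (0xff) placed at bits 23-30, zeros otherwise
add r0, #0x00045000 @ One byte (0x45) shifted left by 12 bits
and r0, #0x00230023 @ Byte 0x23 repeated 2 times
orr r0, #0x23002300 @ Byte 0x23 repeated 2 times, in the other fixed places
cmp r0, #0x23232323 @ Byte 0x23 repeated 4 times

@ A value such as 0x12345678 matches none of the patterns and does not fit
@ into 16 bits, so the assembler would reject "mov r0, #0x12345678".
</source>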

You can also use the “mvn” instruction directly, e.g.:

<source lang=asm>
mov r0, #0xf807ffff
mvn r0, #0x07f80000
</source>

Both lines are identical and write the number 0xf807ffff into r0.
==== The “movt” instruction ====
While supporting many common patterns, this does not allow arbitrary 32 bit numbers. One way to load any 32bit number is to split the number into two 16bit halves, and use both “mov” and “movt” to combine these two half-words into one register:

<source lang=asm>
mov r0, #0xabcd
movt r0, #0x1234
</source>

The “movt” instruction loads the given number into the upper 16 bits of the register, so this example loads 0x1234abcd into r0. The order is important, as “mov” overwrites the upper 16 bits with zeros, but “movt” keeps the lower 16 bits. If a single “mov” can't fit the desired number, the combination of “mov” and “movt” is the fastest way of loading any 32bit number. As two 32bit instructions are needed, this consumes 8 bytes of program memory. If you want to load the address of a symbol into a register, you need to tell the assembler to split it automatically. This can be achieved by prefixing the symbol with “:lower16:” or “:upper16:”, e.g.:

<source lang=asm>
movw r0, #:lower16:GPIOA_BSRR
movt r0, #:upper16:GPIOA_BSRR
</source>

Note that “movw” needs to be specified in this case to explicitly tell the assembler to use the “mov” variant that accepts 16bit numbers (which it otherwise does automatically when a direct value is given).
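
As a sketch of what the two prefixes do, the address 0x40010810 (the value used for GPIOA_BSRR in the examples of this tutorial) can also be split by hand. The result in r0 should be the same as with “:lower16:” and “:upper16:”:

<source lang=asm>
GPIOA_BSRR = 0x40010810

movw r0, #0x0810 @ Lower 16 bits of the address; upper 16 bits of r0 are cleared
movt r0, #0x4001 @ Upper 16 bits of the address; lower 16 bits of r0 are kept
@ r0 now contains 0x40010810
</source>

Letting the assembler do the split is preferable, since the two halves then always follow the symbol definition.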
==== PC-relative loads ====
The other way of loading arbitrary 32bit values into registers is to place the value directly in flash memory, and load it from there using “ldr”:

<source lang=asm>
@ Some code …
mov r0, … address of Literal ...
ldr r1, [r0]
@ More code …
Literal:
.word 0x12345678
</source>

However, there is a Chicken-And-Egg problem - the address of “Literal” is a 32bit number itself, so how to load it into r0? Luckily, there is a register that contains a number close to the one needed - the program counter (PC, r15) indicates the address of the instruction currently being executed. By reading it and adding a small offset that fits into the instruction itself, the address of “Literal” can be obtained, provided that “Literal” is located close enough. Consider this example of the EnableClockGPIOA function:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
add r1, pc, #12
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
.word RCC_APB2ENR
</source>

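The 32bit-value “RCC_APB2ENR” is stored in flash memory. The “add” instruction adds the offset 12 to the value of PC to obtain the address of said 32bit-value, which is then loaded via “ldr”. When read by an instruction, PC yields the address of that instruction plus 4. With the encodings chosen by the assembler here, the six instructions of the function occupy 14 bytes, which the second “.align” pads to 16, so the literal is located 16 bytes after the “add” instruction, i.e. at PC+12. This offset is not easy to calculate by hand and even depends on the alignment of the “add” instruction itself (hence the “.align” to ensure a consistent example). The assembler is capable of doing the calculation on its own, for which the “adr” instruction is used: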

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
adr r1, LiteralRCC_APB2ENR
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

The label LiteralRCC_APB2ENR refers to the address of the 32bit-value in memory. “adr” is actually a variant of “add” that instructs the assembler to calculate the offset and place it into the instruction itself, which then lets the processor add it to PC and write the result to r1. This address is then used by “ldr”.

The “adr” instruction is useful when the address of some literal is explicitly needed; for example, in the blinker program, it can be used to obtain the addresses of the array:

<source lang=asm>
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8
</source>

However, for loading a single value, the address is actually not needed. In this case, “adr” and “ldr” can be combined:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, LiteralRCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

This special variant of “ldr” lets the assembler calculate the offset as with “adr”, adds it to “PC” at runtime and loads the data found at the address into r1. This is much easier than the first variant, as all calculations are done automatically. It is still somewhat cumbersome having to write three lines just to obtain a single 32bit value. Therefore, the assembler offers this already introduced syntax:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

This is a special command for the assembler. If possible, the assembler will use the “mov” or “mvn” instruction to load the value. If the value won’t fit, it will be put into flash memory, and an “ldr” instruction as above will be used. In this case, the “ldr rX, =...” syntax is equivalent to the combination of specifying a label for the value, the “.word” directive and “ldr rX, <Label>”. Therefore, this syntax is usually the best way to load immediates.
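
The following sketch illustrates the three possible outcomes. The values are arbitrary, and the instruction actually chosen may differ between assembler versions, so it is worth checking the result in a disassembly:

<source lang=asm>
ldr r0, =0x00ff0000 @ Fits "mov": assembled like mov r0, #0x00ff0000
ldr r1, =0xffffff00 @ Fits "mvn": assembled like mvn r1, #0x000000ff
ldr r2, =0x12345678 @ Fits neither: the value is placed in flash and loaded via PC-relative "ldr"
</source>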

The assembler places the literals at the end of the file. If the file is long, the offset will be too large for the “ldr” and “adr” instructions and the assembler will emit an error. You can instruct the assembler to place all literals that have been declared so far at a specific point using the “.ltorg” directive. It is recommended to place an “.ltorg” after each function (after the “bx lr”) - just make sure that execution will never reach there. If a single function is so long that an “.ltorg” at the end is too far away from “ldr”/“adr” at the beginning, you can place an “.ltorg” somewhere in the middle and jump over it with “b”.
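
Both placements can be sketched as follows. The function and label names are hypothetical and only serve as illustration:

<source lang=asm>
.type ShortFunction, %function
ShortFunction:
ldr r0, =0x12345678 @ The literal is placed at the next ".ltorg"
bx lr
.ltorg @ Literals of ShortFunction; execution never reaches this point

.type LongFunction, %function
LongFunction:
ldr r0, =0x87654321
@ ... first part of a very long function ...
b LongFunctionContinue @ Jump over the literals
.ltorg @ Literals declared so far are placed here
LongFunctionContinue:
@ ... second part of the function ...
bx lr
.ltorg
</source>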

In summary, the following rules can help make literal loads more efficient:
* Avoid literal loads if possible; try to calculate needed values from other values that have already been loaded, possibly by using offset addressing in “ldr”/“str”
* When accessing multiple registers of a single peripheral module, load its base address once and use offset addressing to access the individual registers
* If you need a pointer to a location in flash memory, try using “adr”
* If speed is important, use “movw”+“movt” to load the value
* Otherwise, use “ldr rX, =...” to have the assembler choose the optimal encoding
* Place “.ltorg” after each function
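
The second rule can be sketched as follows. The symbols GPIOA_BASE, GPIO_CRH_OFFSET and GPIO_BSRR_OFFSET are hypothetical names introduced only for this illustration; their values are derived from the register addresses used in the examples of this tutorial:

<source lang=asm>
GPIOA_BASE = 0x40010800 @ Hypothetical symbol: base address of GPIOA
GPIO_CRH_OFFSET = 0x04 @ Hypothetical symbol: GPIOA_CRH - GPIOA_BASE
GPIO_BSRR_OFFSET = 0x10 @ Hypothetical symbol: GPIOA_BSRR - GPIOA_BASE

ldr r1, =GPIOA_BASE @ Only one literal load for the whole module
ldr r0, [r1, #GPIO_CRH_OFFSET] @ Read GPIOA_CRH
and r0, #0xfffffff0
orr r0, #2
str r0, [r1, #GPIO_CRH_OFFSET] @ Write GPIOA_CRH
mov r0, #0x100
str r0, [r1, #GPIO_BSRR_OFFSET] @ Write GPIOA_BSRR without another literal load
</source>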

The “ldr … =” instruction can also be used to load any immediate 32bit value into the PC to cause a jump to that address, simply by specifying “pc” as the target register. If you perform an ordinary branch (via “b” or “bl”) to some function whose address is too far away from the current code location, the linker will insert a “wrapper” function that does exactly that to perform the “far” jump. That function is called a “veneer”.
=== The SysTick timer ===
An important aspect of many embedded systems is controlling the timing of technical processes. In the blinker example, the timing of the LED flashes was handled by having the processor execute dummy instructions to pass time. It is, however, virtually impossible to accurately predict the runtime of any piece of code on a complex processor such as an ARM one, and the runtime may vary between runs and depends on the actual microcontroller and its configuration. For a simple LED blinker this may be acceptable, but not for e.g. a closed-loop controller for some mechanical actuator. Therefore, almost all microcontrollers and also application processors feature one or more hardware timers, which allow measuring time independently of the execution speed of the software. Timer features vary widely among different processors, but the basic idea is to increment or decrement some digital counter at each clock cycle and trigger some event when it reaches a certain value.

All ARMv7-M processors feature the so-called “SysTick”-Timer as part of the processor core itself. This is a rather simple 24bit-timer that counts from a configurable value back to zero, then resets to that value and triggers an event. This timer is frequently used as a time base for RTOS or other runtime libraries. The timer uses three periphery registers: “RVR” contains the value from which to count down. “CVR” contains the current value, and “CSR” contains some status and control bits. The timer can be used for the “Delay” function like this:

<source lang=asm>
SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
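ldr r1, =SCS @ Base address of the System Control Space
add r0, r0, r0, lsl #1 @ r0 = r0 * 3

str r0, [r1, #SCS_SYST_RVR] @ Set the value to count down from
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR] @ Reset the current counter value

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR] @ Start the timer (bit 0 and bit 2 set)

DelayLoop:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #0x10000 @ Check bit 16
beq DelayLoop @ Repeat while bit 16 is not set

ldr r0, =0
str r0, [r1, #SCS_SYST_CSR] @ Disable the timer

bx lr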
</source>

The SysTick is part of the “System Control Space”, SCS. The SCS base address is defined as a symbol, and the relative addresses of the registers as well. The count value is stored in “RVR”, after which “CVR” has to be set to zero. The timer is started by writing “5” into the “CSR” register. The loop repeatedly reads the “CSR” register and continues until bit 16 is set. The “tst” instruction is used to perform an “and” operation with the register contents and an immediate value without keeping the result while just updating the flags. At the end, the “CSR” register is set to zero to disable the timer. The “add” instruction at the beginning is used to multiply the count value by 3: r0 is shifted left by one, i.e. multiplied by two, and then added to itself, as in r0*2^1+r0. This is a common trick to quickly multiply by constants. By including this multiplication, the duration is the same as with the previous “Delay” variant, which, on this microcontroller, uses about 3 cycles per loop iteration.
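
The same trick works for other constants. A few sketches, with arbitrarily chosen registers:

<source lang=asm>
add r0, r0, r0, lsl #2 @ r0 = r0 + r0*4 = r0*5
rsb r1, r1, r1, lsl #3 @ r1 = r1*8 - r1 = r1*7
add r2, r2, r2, lsl #2 @ r2 = r2*5 ...
lsl r2, r2, #1 @ ... and doubled, so r2*10 in total
</source>

Note that “rsb” subtracts its first source operand from the second one, which is why the shifted value comes last.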

Managing timing this way (or any other kind of “Delay” function) is still not very accurate. The time needed to call the function, start the timer, return, and set the pins is added to the actual duration and may also vary each time. The timing errors accumulate over time - a clock implemented this way will quickly go wrong. The proper way to achieve accurate timing is to start the timer once, let it run continuously, and react to its events. The internal clock source used by the microcontroller is also quite inaccurate (up to 2.5% deviation), which can be improved upon by a quartz crystal (typical accuracy of e.g. 0.005%), which will be covered later. Reacting to events instead of calling a function that executes dummy code requires restructuring the program code, without using any kind of “Delay” function.

To do that, the timer is started once at program startup and kept running. After setting the LED pin, wait for the timer event, and repeat. In the last example, the values 3000000 and 7500000 are used for the timer register (3x1000000 and 3x2500000, respectively). Changing the timer value while it is running continuously is problematic, so one fixed value should be used; to achieve variable blink durations, multiple timer events need to be counted. The greatest common divisor of the two numbers is 1500000, so to achieve the two different times, 2 and 5 timer events need to be registered, respectively. Since these numbers fit into a single byte, the table entries and corresponding access instructions are changed to byte. A function “StartSysTick” is implemented to start the timer once, and a function “WaitSysTick” to wait for a given number of timer events:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =TimerValue
bl StartSysTick
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldrb r0, [r7], #1 @ Load number of timer events from table and increment address
bl WaitSysTick

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldrb r0, [r7], #1 @ Load number of timer events from table and increment address
bl WaitSysTick

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2, 2
BlinkTableEnd:
.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

bx lr

@ r0 = Number of timer events to wait for
.type WaitSysTick, %function
WaitSysTick:
ldr r1, =SCS

WaitSysTickLoop:
ldr r2, [r1, #SCS_SYST_CSR]
tst r2, #0x10000
beq WaitSysTickLoop

subs r0, #1
bne WaitSysTickLoop

bx lr

</source>
Example name: “BlinkSysTick”

This way, the blinker frequency will be as stable and accurate as possible with the given clock source.
=== Exceptions & Interrupts ===
Exceptions and interrupts play an important role in low-level development. They provide a facility for hardware to notify the software of events, such as received data blocks or a timer event. On ARM, interrupts are a sub-group of exceptions - there are some “system-level” exceptions mostly for dealing with processor errors and providing operating system support, while interrupts are “special” exceptions for events signaled by peripheral modules. When writing “regular” microcontroller software, you will mostly work with interrupts.

Exceptions (and interrupts) interrupt normal program flow, and cause the processor to execute some other piece of code which is called the exception handler or Interrupt Service Routine (ISR) (even for the “system-level” exceptions that are not interrupts). After dealing with the indicated event, the ISR typically returns and normal program flow resumes. As exceptions can interrupt the program at any time, data (and peripherals) may be in an inconsistent state, so special care must be taken to avoid corrupting program state in an ISR. The ARMv7-M processors (including the Cortex-M3) provide sophisticated support for exceptions, with configurable priorities and nested exception calls. This chapter will only cover the basics of using exceptions.

On ARMv7-M, exception handlers are implemented as regular functions, for example:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
@ Handle event ...
bx lr
</source>

Like any other function, it has a label, returns with “bx lr”, and is also made globally visible to other source files using “.global”. The “.type … %function” is required here for the same reason as for the already-mentioned “Reset_Handler”. Exception handlers can be located anywhere in flash memory, among the other regular functions. To tell the processor where the exception handlers for the various exception types are located, the vector table needs to be adjusted. Until now, the vector table was defined as:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Recall that the first 32bit-word in flash memory contains the initial stack pointer (defined via “.word _StackEnd”) and the second word contains the address of the first instruction of the program (defined via “.word Reset_Handler”). Actually, resetting the controller is an exception too, and the code to be executed after reset (or start-up) is the handler for the reset exception (hence the name “Reset_Handler”). The next 228 bytes of flash memory contain 57 32bit-addresses of the handlers of the other exceptions, including interrupts. The “.space” directive just fills those with zeroes. To tell the processor the address of an exception handler, the appropriate entry in this table needs to be set to that address. In chapter 10.1.2, table 63 of the controller’s reference manual, the format of the vector table, and which exception’s address should go where, is defined. Only the interrupts up until position 42 actually exist on the STM32F103RB/C8, as defined in chapter 2.3.5 of the datasheet; everything from “TIM8_BRK” is only present on larger controllers. According to the table, the SysTick exception handler’s address needs to be put at location 0x3C relative to the beginning of flash memory. Since the first 8 bytes are already occupied, 0x34 bytes of space are needed after those first 8 bytes.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0x34
.word SysTick_Handler
.space 0xac
</source>

With this modification, the SysTick_Handler function is now declared as the handler for the SysTick exception. By default, the SysTick timer does not trigger an exception. To do that, you have to set bit 1 (value 2) in the SCS_SYST_CSR register, which is why the value 7 instead of 5 is written to that register when starting the timer. By placing the logic for the blinker into the timer's ISR, you get an interrupt-based blinker:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
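ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ BlinkStep is a 1-byte variable, so store a single byte
ldr r1, BlinkTable @ Load first word of the table; its lowest byte is the first entry
strb r1, [r0, #(TimerEvents-Variables)] @ Store only that byte into TimerEvents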

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

ldr r0, =TimerValue
bl StartSysTick
SleepLoop:
wfi
b SleepLoop

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq Return

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs Return

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne Return

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs SkipRestart

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

SkipRestart:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

Return:
bx lr

.align 2
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr

</source>
Example name: “BlinkSysTickInterrupt”

The regular program flow now consists only of initializing the peripherals, the timer, and the first step of the blinker (setting the pin high). After that, the processor should do nothing but wait for exceptions, which is achieved by a simple endless loop. The “wfi” instruction suspends the processor; when an exception occurs, the processor will wake up, execute the ISR, and resume execution after the “wfi”. Therefore, “wfi” is usually put in an endless loop as shown. This technique can reduce the processor’s power consumption significantly, as it is only running when something needs to be done, as indicated via interrupts. The ISR first checks whether the interrupt flag in the timer register is set - this is necessary, since exceptions can sometimes occur “spuriously”, i.e. without an actual event causing them. The decision whether to set or reset the pin state is taken based on the lowest bit of the table index, such that the output alternates between 1 and 0.

The code inside the ISR needs to know which step in the blinking sequence is currently active, and how many timer events have already occurred inside the current step. Therefore, two 1-byte-variables are stored in RAM. To access them, offset addressing is used, where r0 contains the base address of the variables in memory, and the offsets inside “ldrb” and “strb” are set accordingly. The last number of the blink sequence table is omitted, since it is actually superfluous, because no action is taken after the last delay has elapsed. Because the table size is now odd, an “.align” directive after it is required. Always putting “.align” after outputting data is a good idea anyway.

Since exceptions can occur at any point in regular program flow, the processor registers may contain some data that will be used after the exception handler returns. Therefore, if the exception handler writes anything into the registers, they need to be restored when returning from the exception. Upon exception entry, the Cortex-M3/4 processors automatically store the registers r0-r3, r12, r14 (LR), the return address and the program status register (including the flags) on the stack. The link register is filled with a special “dummy” value (called EXC_RETURN), and when the exception handler returns via “bx lr” using this value, the processor restores said registers to their previous state. This effectively means that you can implement exception handlers like any other function, i.e. freely overwrite r0-r3, r12 and the flags and push/pop r4-r11 and the LR if needed.
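
As a sketch of the last point, a handler that needs r4 and calls another function could look like this. “HandleTimerEvent” is a hypothetical function introduced only for this illustration; “Variables” is the label from the blinker example:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
push { r4, lr } @ r4 is not saved automatically; lr holds the special return value
ldr r4, =Variables @ r4 keeps its value across the function call below
bl HandleTimerEvent @ Hypothetical function; may overwrite r0-r3, r12 and lr
ldrb r0, [r4]
@ ... further processing using r0 and r4 ...
pop { r4, pc } @ Restore r4 and return from the exception
</source>

Loading the special value from the stack into PC via “pop” has the same effect as “bx lr”, just like in a regular function that saved LR on the stack.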
=== Macros ===
The assembler provides a few mechanisms to make assembly-language development easier. One of those is the macro facility: macros allow you to define snippets of assembly code that you can then insert easily whenever you need them. While looking similar to function invocations, the code inside the macro is actually copied each time the macro is used, so don’t overuse them. Macros are started with “.macro” and end at the next “.endm” directive. For example, the following macro sets the LED pin to 0 or 1:

<source lang=asm>
.macro SETLED value
ldr r0, =GPIOA_BSRR
ldr r1, =(((!\value) << 24) | (\value<<8))
str r1, [r0]
.endm

SETLED 0
SETLED 1
</source>

The macro name is defined as “SETLED”, and a single parameter with name “value” is given. By typing “\value”, the value of the parameter is substituted in the macro body. Some bit-shifting is used to calculate the right bit pattern to write into BSRR to set or reset the pin accordingly.
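
Macros can also take several parameters, separated by commas. The following sketch defines a hypothetical macro “WRITEREG” (not part of the tutorial's examples) that writes a constant to a register address; the symbols are those defined in the blinker examples:

<source lang=asm>
.macro WRITEREG address, value
ldr r0, =\address
ldr r1, =\value
str r1, [r0]
.endm

@ Set PA8 high, then low
WRITEREG GPIOA_BSRR, GPIOx_BSRR_BS8
WRITEREG GPIOA_BSRR, GPIOx_BSRR_BR8
</source>

Keep in mind that each use of this macro overwrites r0 and r1, which is not visible at the place where the macro is used.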
=== Weak symbols ===
As explained before, labels defined in assembly files get translated into symbols in the object code files, which are resolved by the linker. Sometimes it is desirable to provide a “default” or “fallback” implementation of some function (or data block) which is only used when no other implementation is given. This can be achieved by marking the “fallback” variant with “.weak”:

<source lang=asm>
.type Function1, %function
.global Function1
.weak Function1
Function1:
@ Default implementation …
…
bl Function1 @ Call the function
</source>

With this code alone, “Function1” will be used normally. If you put another function with the same name, but without “.weak”, in a different assembly source file, the linker will use that second variant instead.
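
Such an overriding definition could be sketched like this; the file is hypothetical and would be assembled and linked together with the file containing the weak variant:

<source lang=asm>
@ Hypothetical second source file
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Function1, %function
.global Function1
Function1:
@ Specific implementation, used instead of the weak one
bx lr
</source>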
=== Symbol aliases ===
It is also possible to define aliases for symbols using “.thumb_set”, which sets the address accordingly. For example:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

When trying to call “Function2”, the linker will automatically fill in the address of “Function1”. This can also be combined with “.weak” to define a weak alias:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.weak Function2
.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

If you now define another “Function2” in a different assembly source file, that will be used. If not, “Function1” will be called, which is the target of the alias definition. This is useful if you want to define one default implementation for several different functions, for each of which you need one “.weak” and one “.thumb_set” directive.
=== Improved vector table ===
The techniques from the last three sections can be used to improve the definition of the vector table. The way it was defined before is not very flexible; to insert new entries, you have to calculate the new gap sizes and offsets. First, define a default handler ISR that is called by exceptions for which no other ISR is defined, and a macro that defines an alias for one exception with the default handler as the target, and finally a table of all exceptions by using the macro:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.macro defisr name
.global \name
.weak \name
.thumb_set \name, Default_Handler
.word \name
.endm

.global VectorTable
.section .VectorTable, "a"
.type VectorTable, %object
VectorTable:
.word _StackEnd
defisr Reset_Handler
defisr NMI_Handler
defisr HardFault_Handler
defisr MemManage_Handler
defisr BusFault_Handler
defisr UsageFault_Handler
.word 0
.word 0
.word 0
.word 0
defisr SVC_Handler
defisr DebugMon_Handler
.word 0
defisr PendSV_Handler
defisr SysTick_Handler
defisr WWDG_IRQHandler
defisr PVD_IRQHandler
defisr TAMPER_IRQHandler
defisr RTC_IRQHandler
defisr FLASH_IRQHandler
defisr RCC_IRQHandler
defisr EXTI0_IRQHandler
defisr EXTI1_IRQHandler
defisr EXTI2_IRQHandler
defisr EXTI3_IRQHandler
defisr EXTI4_IRQHandler
defisr DMA1_Channel1_IRQHandler
defisr DMA1_Channel2_IRQHandler
defisr DMA1_Channel3_IRQHandler
defisr DMA1_Channel4_IRQHandler
defisr DMA1_Channel5_IRQHandler
defisr DMA1_Channel6_IRQHandler
defisr DMA1_Channel7_IRQHandler
defisr ADC1_2_IRQHandler
defisr USB_HP_CAN1_TX_IRQHandler
defisr USB_LP_CAN1_RX0_IRQHandler
defisr CAN1_RX1_IRQHandler
defisr CAN1_SCE_IRQHandler
defisr EXTI9_5_IRQHandler
defisr TIM1_BRK_IRQHandler
defisr TIM1_UP_IRQHandler
defisr TIM1_TRG_COM_IRQHandler
defisr TIM1_CC_IRQHandler
defisr TIM2_IRQHandler
defisr TIM3_IRQHandler
defisr TIM4_IRQHandler
defisr I2C1_EV_IRQHandler
defisr I2C1_ER_IRQHandler
defisr I2C2_EV_IRQHandler
defisr I2C2_ER_IRQHandler
defisr SPI1_IRQHandler
defisr SPI2_IRQHandler
defisr USART1_IRQHandler
defisr USART2_IRQHandler
defisr USART3_IRQHandler
defisr EXTI15_10_IRQHandler
defisr RTCAlarm_IRQHandler
defisr USBWakeUp_IRQHandler

.text

.type Default_Handler, %function
.global Default_Handler
Default_Handler:
bkpt
b.n Default_Handler
</source>

There are a few empty entries in the table that are not used by the processor. At the beginning, there is still the definition for the initial stack pointer and the “Reset_Handler”. If you replace your “vectortable.S” by this code, you get a “proper” vector table. The “SysTick_Handler” will continue to work as before, and if you need to define any other ISR, for example for USART1, just define a function by the exact name “USART1_IRQHandler”. The address of this function will automatically be put in the vector table. If an exception without a corresponding ISR occurs, the “Default_Handler” will be called, which uses the “bkpt” instruction to force a breakpoint via the attached debugger. This helps with debugging missed exceptions while avoiding the need to define several individual dummy handler functions.
=== .include ===
Having to put the register and bit definitions (“RCC_APB2ENR”, “RCC_APB2ENR_IOPAEN”, ...) in each assembly source file is redundant and error-prone. Instead, you can put them into a separate file (e.g. called “stm32f103.inc”) and use the “.include” directive to reference it:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

@ Normal code ...
</source>

The assembler will read the code from the included file and pretend it was written instead of the “.include” line. This can help improve code structure. While working on the project structure, you can also restructure the definitions for the GPIO registers to facilitate offset addressing:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
</source>

The next example incorporates these changes in addressing the registers.
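
The next example also addresses the RCC register relative to a base address and uses a bit number for IOPAEN. A sketch of what “stm32f103.inc” might contain is shown here; the RCC values are an assumption, inferred from the absolute address 0x40021018 and the bit value 4 used in the earlier examples:

<source lang=asm>
@ Possible contents of "stm32f103.inc" (a sketch, not a complete register list)
RCC = 0x40021000
RCC_APB2ENR = 0x18 @ Offset relative to RCC
RCC_APB2ENR_IOPAEN = 2 @ Bit number, used as (1 << RCC_APB2ENR_IOPAEN)

GPIOA = 0x40010800
GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
</source>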
=== Local Labels ===
Having to invent unique labels for all jump targets inside functions (e.g. for conditional code and loops) can be tedious. When using a disassembler (see below), each label will appear as its own function. Therefore, the GNU assembler supports local labels. These are labels whose names consist only of a number. Local label names need not be unique; several labels called e.g. “1” may exist in one file. To perform a jump to a local label, use the number and append an “f” or “b” to indicate whether to jump forward or backward; the nearest label with that number in the given direction is used. Local labels cannot be exported with the “.global” directive. The interrupt-based blinker can be modified like this using local labels:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
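ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ BlinkStep is a 1-byte variable, so store a single byte
ldr r1, BlinkTable @ Load first word of the table; its lowest byte is the first entry
strb r1, [r0, #(TimerEvents-Variables)] @ Store only that byte into TimerEvents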

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

1:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>
Example name: “BlinkLocalLabels”
=== Initializing RAM ===
The blinker program uses 2 byte-variables in memory, which have to be initialized to some value at startup. For large programs with many variables, this quickly becomes hard to maintain and also inefficient. Assembler and linker can help produce an “image” of what the RAM contents should look like after initialization, and place this image in flash memory alongside the normal program data. At startup, this image can simply be copied 1:1 into RAM in a loop. Most programs contain many variables that will be initialized with zero, so placing a (possibly large) block of zeroes in flash memory is wasteful; therefore, an additional loop is used to initialize all zero-variables to zero. Both techniques are also employed by C and C++ compilers, so implementing the initialization code is required there too. First, change the declaration of your variables by using “.byte”, “.hword” and “.word” and include the desired initialization value. Variables that should be initialized with zero are placed after a “.bss” directive, which puts them into the section of the same name. They don’t get an initialization value, but just reserved space by using “.space”:

<source lang=asm>
.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1
</source>

From the assembler’s point of view, the initialization data - in this case, just one byte of value “2” - will directly end up in RAM. However, this is not possible on microcontrollers, as the RAM contents are undefined after power-up and are not initialized automatically. To get the data into RAM anyway, change the linker script as follows:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
. = ALIGN(4);
} >FLASH

.stack (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
} >SRAM

.data : {
_DataStart = .;
*(.data);
. = ALIGN(4);
_DataEnd = .;
} >SRAM AT >FLASH

_DataLoad = LOADADDR(.data);

.bss (NOLOAD) : {
_BssStart = .;
*(.bss);
. = ALIGN(4);
_BssEnd = .;
} >SRAM
}
</source>
Example name: “BlinkInitRAM”

The stack is put in its own section with the “NOLOAD” attribute, since it doesn’t need initializing. The data is now put in the “.data” section. The initial data for that section is put into flash memory via the “>SRAM AT >FLASH” construct. The addresses of symbols inside the “.data” section are still the addresses in RAM, so accesses to the symbols from assembly code still work. The symbol “_DataStart” is assigned the beginning of the initialized data in RAM, and “_DataEnd” the end. The “LOADADDR” function is used to get the beginning of the initialization data in flash and assign it to “_DataLoad”. The “.bss” section contains all the variables that should be zero-initialized, and the symbols “_BssStart” and “_BssEnd” are set to its beginning and end address, respectively. It is marked with “NOLOAD” as well, since we don’t want to store (potentially many) zeroes in the linked program file and will initialize it explicitly (see below). As the beginning and size of the stack are already multiples of 4, the beginning of “.data” is as well. The size of “.data” might not be a multiple of 4, however, so a “. = ALIGN(4);” command is inserted right before the definition of “_DataEnd”. This adds 0-3 dummy bytes by incrementing the location counter to make sure the address is a multiple of 4. The same thing is done right before “_BssEnd” and also at the end of the “.text” section, to make sure that “_BssEnd” and “_DataLoad” are multiples of 4 as well.
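
The rounding performed by “ALIGN(4)” can be illustrated with a few lines of C. This is only a sketch of the arithmetic, not of how the linker is implemented; the function names “align_up” and “padding_bytes” are made up for this illustration.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two).
   This mirrors what ". = ALIGN(4);" does to the location counter. */
uint32_t align_up(uint32_t addr, uint32_t align)
{
    return (addr + align - 1u) & ~(align - 1u);
}

/* Number of dummy bytes the alignment inserts (0 to align-1). */
uint32_t padding_bytes(uint32_t addr, uint32_t align)
{
    return align_up(addr, align) - addr;
}
```

With the single byte of “TimerEvents” placed at 0x20000400, the location counter stands at 0x20000401 afterwards; aligning it yields 0x20000404 for “_DataEnd”, i.e. 3 dummy bytes are added.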

The only thing left is the actual initialization of the RAM. To do that, change the “Reset_Handler” as follows:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load address of GPIOA_BSRR
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b
.ltorg
</source>

The explicit initialization of the variables was removed. Instead, the addresses “_DataStart”, “_DataEnd” and “_DataLoad” that were defined in the linker script are loaded. Then, a short loop repeatedly loads a word from flash (starting at “_DataLoad”) and stores it into RAM (starting at “_DataStart”). The address pointers are incremented by the “ldr”/“str” instructions after the access. The pointer for the RAM location is compared with the end of the RAM area (“_DataEnd”) to decide whether to jump back to the beginning of the loop. To start the loop, a jump directly to the comparison is performed; this avoids the need for a second comparison before the loop, as the one at the end of the loop also handles an empty area. The second loop performs the zero-initialization of the area between “_BssStart” and “_BssEnd”; it works similarly, but does not need to load any data.
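
For readers more familiar with C, the two loops can be sketched roughly as follows. This is a model for illustration only: ordinary arrays stand in for flash and RAM, and the names “copy_data”, “zero_bss” and “demo_init_ram” are assumptions made for this sketch, not part of the example project.

```c
#include <stdint.h>

/* First loop: copy the initialization image word by word,
   like "ldr r3, [r2], #4" / "str r3, [r0], #4" until r0 reaches r1. */
void copy_data(uint32_t *dst, const uint32_t *dst_end, const uint32_t *load)
{
    while (dst < dst_end) {
        *dst++ = *load++;
    }
}

/* Second loop: store zero words between start and end. */
void zero_bss(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}

/* Hypothetical self-check: returns 1 if both areas hold the expected values. */
int demo_init_ram(void)
{
    const uint32_t flash_image[2] = { 0x00000002u, 0x12345678u };
    uint32_t data[2] = { 0xAAAAAAAAu, 0xAAAAAAAAu };
    uint32_t bss[3]  = { 0xAAAAAAAAu, 0xAAAAAAAAu, 0xAAAAAAAAu };

    copy_data(data, data + 2, flash_image);
    zero_bss(bss, bss + 3);
    zero_bss(bss, bss); /* empty area: the loop body never runs */

    return data[0] == 2u && data[1] == 0x12345678u
        && bss[0] == 0u && bss[1] == 0u && bss[2] == 0u;
}
```

The “while” condition is checked before the first iteration, which corresponds to the initial jump to the comparison in the assembly version.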

Unfortunately, the program as shown can’t be assembled: as the two variables now reside in two different sections (“.data” and “.bss”), their distance is no longer known to the assembler, so the offset addressing in the “SysTick_Handler” doesn’t work anymore. Therefore, each variable has to be addressed directly:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load address of GPIOA_BSRR
str r1, [r0, #GPIOx_BSRR]

2:
bx lr
</source>

=== Peripheral interrupts ===
Interrupts, i.e. exceptions raised by peripheral modules, need a little extra code compared to the “core” exceptions including the SysTick. The Cortex-M’s interrupt controller (the NVIC) contains several registers for configuring these interrupts. It is possible to configure the priority and manually trigger interrupts, but for most applications, the only necessary step is enabling the desired interrupt. This is done via the registers “NVIC_ISER0” through “NVIC_ISER15”, which are documented in the ARMv7M Architecture Reference Manual in chapter B3.4.4. Each of those registers contains 32 bits, each of which enables one interrupt. The STM32F103RB/C8 has 43 interrupts, so only two of the possible 16 registers are present. The number of interrupts is given in chapter 2.3.5 of the controller’s datasheet. So, to enable some interrupt x, the bit “x mod 32” in register NVIC_ISER''y'' with y=x/32 (integer division, rounded down) has to be set. This register’s address is 0xE000E100+y*4. Given an interrupt’s number in r0, the following function does just that:

<source lang=asm>
NVIC_ISER0 = 0xE000E100

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>
Example name: “BlinkTIM1”

The “and” instruction calculates “x mod 32”, and the following left-shift (“lsls”) calculates the value where bit “x mod 32” is one, and all others are zero. To calculate the offset address “y*4”, i.e. “(x/32)*4”, the interrupt number is first shifted right by 5 bits and then shifted back left by 2 bits. This is the same as shifting 3 bits right and zeroing out the lower 2 bits, but the two shift instructions actually consume less program memory space. Finally, the calculated value is written into the register by using offset addressing. Since writing a zero bit to an NVIC_ISER register has no effect, no read-modify-write sequence is needed.
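
The same address and bit calculation, written as C for comparison. This is a sketch of the arithmetic only; the function names “iser_address” and “iser_mask” are made up here, and the code does not access any hardware.

```c
#include <stdint.h>

#define NVIC_ISER0 0xE000E100u

/* Address of the NVIC_ISERy register responsible for the given IRQ,
   with y = irq / 32: shift right by 5, then left by 2 to get y * 4. */
uint32_t iser_address(uint32_t irq)
{
    return NVIC_ISER0 + ((irq >> 5) << 2);
}

/* Value in which only bit "irq mod 32" is set. */
uint32_t iser_mask(uint32_t irq)
{
    return 1u << (irq & 0x1Fu);
}
```

For TIM1_UP_IRQn = 25, this gives address 0xE000E100 (NVIC_ISER0) and the value 0x02000000. For an interrupt number above 31, e.g. 37, the result is 0xE000E104 (NVIC_ISER1) and the value 0x20.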

In addition to enabling the interrupt in the processor core’s NVIC, it also has to be enabled in the peripheral module. Many peripheral modules support several different events, each of which has to be enabled individually in the peripheral’s registers. Depending on the controller, these can be mapped to one single processor interrupt (and hence, one single ISR) or multiple ones, and need to be configured in the NVIC appropriately.

This example uses the STM32’s peripheral timer TIM1 instead of the SysTick timer:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500
TimerPrescaler=1000

.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl EnableClockTIM1
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TIM1_UP_IRQn
bl EnableIRQ
bl StartTIM1
1:
wfi
b 1b
.ltorg

.type TIM1_UP_IRQHandler, %function
.global TIM1_UP_IRQHandler
TIM1_UP_IRQHandler:
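ldr r0, =TIM1
ldr r2, =(~(1 << TIMx_SR_UIF)) @ r2 = 0xFFFFFFFE: all bits set except UIF

ldr r1, [r0, #TIMx_SR]
bics r1, r2 @ r1 = TIMx_SR & ~r2, i.e. only the UIF bit remains
beq 2f @ UIF not set: no update event happened, so return

str r2, [r0, #TIMx_SR] @ Clear UIF by writing 0 to it; writing 1 leaves the other flags unchanged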

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type EnableClockTIM1, %function
EnableClockTIM1:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_TIM1EN)
str r0, [r1, #RCC_APB2ENR] @ Set TIM1EN bit in RCC_APB2ENR to 1 to enable TIM1
bx lr @ Return to caller
.ltorg

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ No parameters: configures TIM1 with TimerPrescaler and TimerValue and starts it
.type StartTIM1, %function
StartTIM1:
ldr r0, =TIM1
ldr r1, =(1 << TIMx_CR1_URS)
str r1, [r0, #TIMx_CR1]

ldr r1, =TimerPrescaler
str r1, [r0, #TIMx_PSC]

ldr r1, =TimerValue
str r1, [r0, #TIMx_ARR]

ldr r1, =(1 << TIMx_DIER_UIE)
str r1, [r0, #TIMx_DIER]

ldr r1, =(1 << TIMx_EGR_UG)
str r1, [r0, #TIMx_EGR]

dsb

ldr r1, =(1 << TIMx_CR1_CEN)
str r1, [r0, #TIMx_CR1]

bx lr
.ltorg

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>

The corresponding stm32f103.inc file with the added definitions for the timer registers is:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

RCC = 0x40021000
RCC_APB2ENR = 0x18
RCC_APB2ENR_IOPAEN = 2
RCC_APB2ENR_TIM1EN = 11

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

TIM1 = 0x40012C00

TIMx_CR1 = 0
TIMx_CR1_ARPE = 7
TIMx_CR1_URS = 2
TIMx_CR1_CEN = 0

TIMx_DIER = 0xC
TIMx_DIER_UIE = 0

TIMx_SR = 0x10
TIMx_SR_UIF = 0

TIMx_EGR = 0x14
TIMx_EGR_UG = 0

TIMx_PSC = 0x28
TIMx_ARR = 0x2C

TIM1_UP_IRQn = 25

NVIC_ISER0 = 0xE000E100
</source>

The source code enables the timer’s clock in the RCC before configuring it. The timer supports both a freely configurable prescaler for dividing the clock and a freely configurable maximum value, both of which are set by the StartTIM1 function. The TIMx_DIER_UIE bit is set to enable the interrupt for the so-called “update event”, which is triggered whenever the timer reaches the maximum value. A delicate sequence of register accesses is required to start the timer with the right configuration but without triggering the interrupt right away: To apply the modified settings immediately, the “TIMx_EGR_UG” bit is set to trigger an “artificial” update event. To prevent this from also triggering the interrupt, the “TIMx_CR1_URS” bit is set before and cleared afterwards. The timer is started by setting the “TIMx_CR1_CEN” bit at the end. Before that, a “dsb” instruction is inserted. This “Data Synchronization Barrier” waits until all preceding write accesses have been completely processed, which matters because the processor’s pipeline is usually working on several instructions at once. Because the timer configuration needs to be truly finished before starting the timer, this instruction is required. There are some other situations where the processor is too fast for the peripherals and needs to be temporarily halted by a “dsb”. If some peripheral-accessing code works in step-by-step mode while debugging, but not when executing normally, a well-placed “dsb” might help.
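
To get a feeling for the resulting blink timing, the interval between two update events can be estimated in C. The sketch relies on the behaviour of an up-counting STM32 timer, where the prescaler divides by PSC + 1 and one counter period lasts ARR + 1 counts. The timer input clock is an assumption: 8 MHz is used below as an example value, and the actual frequency depends on the clock configuration of the program. The function names are made up for this illustration.

```c
#include <stdint.h>

/* Timer input clock cycles between two update events. */
uint64_t cycles_per_update(uint16_t psc, uint16_t arr)
{
    return ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
}

/* Interval between two update events in microseconds (rounded down)
   for a timer input clock of clock_hz. */
uint64_t update_period_us(uint32_t clock_hz, uint16_t psc, uint16_t arr)
{
    return cycles_per_update(psc, arr) * 1000000u / clock_hz;
}
```

With TimerPrescaler = 1000 and TimerValue = 1500, one update event occurs every 1502501 timer input clock cycles, which would be about 188 ms at the assumed 8 MHz.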

The ISR “TIM1_UP_IRQHandler” is used for the timer. It checks the “TIMx_SR_UIF” bit to verify that an update event has actually happened. In that case, the register is overwritten with the value 0xFFFFFFFE, i.e. all bits are written with “1” except the UIF bit. Writing a one has no effect on a bit in this register, and writing a zero clears the respective bit. Therefore, this write access clears the UIF bit but keeps the others. These interrupt flags must always be cleared as soon as possible in the ISR, or the peripheral might trigger the interrupt again immediately. The rest of the ISR stays the same.
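
The effect of this write access can be modelled in C. This is a simplified model of the register behaviour described above, not code that runs on the controller; it ignores that the hardware may set flags at any time. The names “sr_after_write” and “uif_clear_value” are made up for this sketch.

```c
#include <stdint.h>

#define TIMx_SR_UIF 0u

/* Model of a status register whose flags are cleared by writing 0,
   while writing 1 leaves a flag unchanged. Returns the new register value. */
uint32_t sr_after_write(uint32_t sr, uint32_t written)
{
    return sr & written;
}

/* Value the ISR writes to clear only UIF: all ones except the UIF bit. */
uint32_t uif_clear_value(void)
{
    return ~(1u << TIMx_SR_UIF);
}
```

If, for example, bits 0 (UIF) and 1 are set in the register, writing 0xFFFFFFFE leaves only bit 1 set.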
=== Analysis tools ===
When working on a low level directly with linker scripts and assembly code, it is frequently necessary to verify the translation output directly, as you can’t rely on a compiler doing it right automatically, and flashing the program each time to see whether it works isn’t the most efficient way. This was, in fact, important in creating the example code for this tutorial. The “binutils” package, which the assembler and linker are part of, offers a few tools that help with analyzing the assembler’s and linker’s output.
==== Disassembler ====
As the name implies, a disassembler is the opposite of an assembler - it turns binary machine code back into a (more or less) readable textual representation. If you feed an ELF file generated by the assembler or linker into the disassembler, it will read the header information to tell apart data (i.e. constants) and code, get the names of symbols (and therefore, labels) and can even tell which instructions were generated from which assembly source file, if it was assembled with debug information (i.e. the “-g” flag was used). If you disassemble a binary flash image, the disassembler doesn’t have all that information, so it will produce much less readable output and attempt to decode data bytes as instructions.

The disassembler from binutils is called “objdump”. Invoking it on the blinker looks like this:
<pre>
$ arm-none-eabi-objdump -d -s prog1.elf

prog1.elf: file format elf32-littlearm

Contents of section .VectorTable:
8000000 00040020 ed000008 ed010008 ed010008 ... ............
8000010 ed010008 ed010008 ed010008 00000000 ................
8000020 00000000 00000000 00000000 ed010008 ................
8000030 ed010008 00000000 ed010008 49010008 ............I...
8000040 ed010008 ed010008 ed010008 ed010008 ................
8000050 ed010008 ed010008 ed010008 ed010008 ................
8000060 ed010008 ed010008 ed010008 ed010008 ................
8000070 ed010008 ed010008 ed010008 ed010008 ................
8000080 ed010008 ed010008 ed010008 ed010008 ................
8000090 ed010008 ed010008 ed010008 ed010008 ................
80000a0 ed010008 ed010008 ed010008 ed010008 ................
80000b0 ed010008 ed010008 ed010008 ed010008 ................
80000c0 ed010008 ed010008 ed010008 ed010008 ................
80000d0 ed010008 ed010008 ed010008 ed010008 ................
80000e0 ed010008 ed010008 ed010008 ............
Contents of section .text:
80000ec 0f481049 104a03e0 52f8043b 40f8043b .H.I.J..R..;@..;
80000fc 8842f9d3 0d480e49 4ff00002 01e040f8 .B...H.IO.....@.
800010c 042b8842 fbd300f0 47f800f0 4bf84ff4 .+.B....G...K.O.
800011c 80710848 01600848 00f058f8 30bffde7 .q.H.`.H..X.0...
800012c 00040020 04040020 f0010008 04040020 ... ... .......
800013c 08040020 10080140 60e31600 4ff0e020 ... ...@`...O..
800014c 006910f4 803f1dd0 1a480178 b1f1110f .i...?...H.x....
800015c 18d21948 0378013b 1cbf0370 12e001f1 ...H.x.;...p....
800016c 0101b1f1 110f02d2 144a535c 03701148 .........JS\.p.H
800017c 017011f0 01010cbf 4ff48071 4ff08071 .p......O..qO..q
800018c 0f480160 70470202 02020202 05020502 .H.`pG..........
800019c 05020202 02020200 0a490868 40f00400 .........I.h@...
80001ac 08607047 08490868 20f00f00 40f00200 .`pG.I.h ...@...
80001bc 08607047 04040020 00040020 92010008 .`pG... ... ....
80001cc 10080140 18100240 04080140 4ff0e021 ...@...@...@O..!
80001dc 48614ff0 00008861 4ff00700 08617047 HaO....aO....apG
80001ec 00befde7 ....
Contents of section .data:
20000400 02000000 ....
Contents of section .ARM.attributes:
0000 41200000 00616561 62690001 16000000 A ...aeabi......
0010 05436f72 7465782d 4d330006 0a074d09 .Cortex-M3....M.
0020 02 .
Contents of section .debug_line:
0000 98000000 02001e00 00000201 fb0e0d00 ................
0010 01010101 00000001 00000100 70726f67 ............prog
0020 312e5300 00000000 000502ec 00000803 1.S.............
0030 15012121 22212f2f 21222121 30212f21 ..!!"!//!"!!0!/!
0040 222f302f 21232130 21036120 2f2f362f "/0/!#!0!.a //6/
0050 030c2e32 030a2e2f 212f2222 222f2221 ...2.../!/"""/"!
0060 21222121 222f2f22 21212321 222f212f !"!!"//"!!#!"/!/
0070 30212303 0d9e2121 2f212421 212f2f21 0!#...!!/!$!!//!
0080 03422035 030c2e03 0d2e0311 2e36030b .B 5.........6..
0090 2e30212f 222f2202 01000101 3b000000 .0!/"/".....;...
00a0 02002400 00000201 fb0e0d00 01010101 ..$.............
00b0 00000001 00000100 76656374 6f727461 ........vectorta
00c0 626c652e 53000000 00000005 02ec0100 ble.S...........
00d0 0803d000 01210201 000101 .....!.....
Contents of section .debug_info:
0000 22000000 02000000 00000401 00000000 "...............
0010 ec000008 ec010008 00000000 08000000 ................
0020 12000000 01802200 00000200 14000000 ......".........
0030 04019c00 0000ec01 0008f001 00082100 ..............!.
0040 00000800 00001200 00000180 ............
Contents of section .debug_abbrev:
0000 01110010 06110112 01030e1b 0e250e13 .............%..
0010 05000000 01110010 06110112 01030e1b ................
0020 0e250e13 05000000 .%......
Contents of section .debug_aranges:
0000 1c000000 02000000 00000400 00000000 ................
0010 ec000008 00010000 00000000 00000000 ................
0020 1c000000 02002600 00000400 00000000 ......&.........
0030 ec010008 04000000 00000000 00000000 ................
Contents of section .debug_str:
0000 70726f67 312e5300 2f746d70 2f746573 prog1.S./tmp/tes
0010 7400474e 55204153 20322e32 392e3531 t.GNU AS 2.29.51
0020 00766563 746f7274 61626c65 2e5300 .vectortable.S.

Disassembly of section .text:

080000ec <Reset_Handler>:
80000ec: 480f ldr r0, [pc, #60] ; (800012c <Reset_Handler+0x40>)
80000ee: 4910 ldr r1, [pc, #64] ; (8000130 <Reset_Handler+0x44>)
80000f0: 4a10 ldr r2, [pc, #64] ; (8000134 <Reset_Handler+0x48>)
80000f2: e003 b.n 80000fc <Reset_Handler+0x10>
80000f4: f852 3b04 ldr.w r3, [r2], #4
80000f8: f840 3b04 str.w r3, [r0], #4
80000fc: 4288 cmp r0, r1
80000fe: d3f9 bcc.n 80000f4 <Reset_Handler+0x8>
8000100: 480d ldr r0, [pc, #52] ; (8000138 <Reset_Handler+0x4c>)
8000102: 490e ldr r1, [pc, #56] ; (800013c <Reset_Handler+0x50>)
8000104: f04f 0200 mov.w r2, #0
8000108: e001 b.n 800010e <Reset_Handler+0x22>
800010a: f840 2b04 str.w r2, [r0], #4
800010e: 4288 cmp r0, r1
8000110: d3fb bcc.n 800010a <Reset_Handler+0x1e>
8000112: f000 f847 bl 80001a4 <EnableClockGPIOA>
8000116: f000 f84b bl 80001b0 <ConfigurePA8>
800011a: f44f 7180 mov.w r1, #256 ; 0x100
800011e: 4808 ldr r0, [pc, #32] ; (8000140 <Reset_Handler+0x54>)
8000120: 6001 str r1, [r0, #0]
8000122: 4808 ldr r0, [pc, #32] ; (8000144 <Reset_Handler+0x58>)
8000124: f000 f858 bl 80001d8 <StartSysTick>
8000128: bf30 wfi
800012a: e7fd b.n 8000128 <Reset_Handler+0x3c>
800012c: 20000400 .word 0x20000400
8000130: 20000404 .word 0x20000404
8000134: 080001f0 .word 0x080001f0
8000138: 20000404 .word 0x20000404
800013c: 20000408 .word 0x20000408
8000140: 40010810 .word 0x40010810
8000144: 0016e360 .word 0x0016e360

08000148 <SysTick_Handler>:
8000148: f04f 20e0 mov.w r0, #3758153728 ; 0xe000e000
800014c: 6900 ldr r0, [r0, #16]
800014e: f410 3f80 tst.w r0, #65536 ; 0x10000
8000152: d01d beq.n 8000190 <SysTick_Handler+0x48>
8000154: 481a ldr r0, [pc, #104] ; (80001c0 <ConfigurePA8+0x10>)
8000156: 7801 ldrb r1, [r0, #0]
8000158: f1b1 0f11 cmp.w r1, #17
800015c: d218 bcs.n 8000190 <SysTick_Handler+0x48>
800015e: 4819 ldr r0, [pc, #100] ; (80001c4 <ConfigurePA8+0x14>)
8000160: 7803 ldrb r3, [r0, #0]
8000162: 3b01 subs r3, #1
8000164: bf1c itt ne
8000166: 7003 strbne r3, [r0, #0]
8000168: e012 bne.n 8000190 <SysTick_Handler+0x48>
800016a: f101 0101 add.w r1, r1, #1
800016e: f1b1 0f11 cmp.w r1, #17
8000172: d202 bcs.n 800017a <SysTick_Handler+0x32>
8000174: 4a14 ldr r2, [pc, #80] ; (80001c8 <ConfigurePA8+0x18>)
8000176: 5c53 ldrb r3, [r2, r1]
8000178: 7003 strb r3, [r0, #0]
800017a: 4811 ldr r0, [pc, #68] ; (80001c0 <ConfigurePA8+0x10>)
800017c: 7001 strb r1, [r0, #0]
800017e: f011 0101 ands.w r1, r1, #1
8000182: bf0c ite eq
8000184: f44f 7180 moveq.w r1, #256 ; 0x100
8000188: f04f 7180 movne.w r1, #16777216 ; 0x1000000
800018c: 480f ldr r0, [pc, #60] ; (80001cc <ConfigurePA8+0x1c>)
800018e: 6001 str r1, [r0, #0]
8000190: 4770 bx lr

08000192 <BlinkTable>:
8000192: 0202 0202 0202 0205 0205 0205 0202 0202 ................
80001a2: .

080001a3 <BlinkTableEnd>:
...

080001a4 <EnableClockGPIOA>:
80001a4: 490a ldr r1, [pc, #40] ; (80001d0 <ConfigurePA8+0x20>)
80001a6: 6808 ldr r0, [r1, #0]
80001a8: f040 0004 orr.w r0, r0, #4
80001ac: 6008 str r0, [r1, #0]
80001ae: 4770 bx lr

080001b0 <ConfigurePA8>:
80001b0: 4908 ldr r1, [pc, #32] ; (80001d4 <ConfigurePA8+0x24>)
80001b2: 6808 ldr r0, [r1, #0]
80001b4: f020 000f bic.w r0, r0, #15
80001b8: f040 0002 orr.w r0, r0, #2
80001bc: 6008 str r0, [r1, #0]
80001be: 4770 bx lr
80001c0: 20000404 .word 0x20000404
80001c4: 20000400 .word 0x20000400
80001c8: 08000192 .word 0x08000192
80001cc: 40010810 .word 0x40010810
80001d0: 40021018 .word 0x40021018
80001d4: 40010804 .word 0x40010804

080001d8 <StartSysTick>:
80001d8: f04f 21e0 mov.w r1, #3758153728 ; 0xe000e000
80001dc: 6148 str r0, [r1, #20]
80001de: f04f 0000 mov.w r0, #0
80001e2: 6188 str r0, [r1, #24]
80001e4: f04f 0007 mov.w r0, #7
80001e8: 6108 str r0, [r1, #16]
80001ea: 4770 bx lr

080001ec <Default_Handler>:
80001ec: be00 bkpt 0x0000
80001ee: e7fd b.n 80001ec <Default_Handler>
</pre>

This is a lot of information. The “-d” flag tells objdump to disassemble code sections, and the “-s” flag lets it output the contents of all sections. At first, it prints the contents of “.VectorTable”. Each line is prefixed with the address where this data is found in memory. Then, the 32-bit data blocks from the vector table are output. The disassembler prints the bytes in the order they appear in memory, which, since the Cortex-M3 uses little endian, is reversed - for example, the printed “ed000008” actually refers to the value “0x080000ed”, which is the address of the “Reset_Handler” (0x080000ec) with the lowest bit set to one, as it is a thumb function. Most of the entries in the vector table refer to the default handler at 0x080001ec (stored as 0x080001ed), except for the zero-entries, the initial stack pointer, the Reset_Handler and the SysTick_Handler. The contents of the “.text” section are the hexadecimal representation of the machine code, and hardly readable. The “.data” section contains a single “two” - this is the “02” put into “TimerEvents”. The contents of “.ARM.attributes” and the various “.debug” sections are not very interesting, as they do not end up on the controller, and are only read by the various analysis tools to provide nicer output.
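
Decoding a vector table entry from the hex dump by hand can be expressed as a small C sketch. The helper names “le32” and “thumb_entry_address” are made up for this illustration.

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian word from four bytes given in memory order. */
uint32_t le32(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8)
         | ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Address of the first instruction of a thumb function: clear the lowest bit. */
uint32_t thumb_entry_address(uint32_t vector)
{
    return vector & ~1u;
}
```

Applied to the bytes “ed 00 00 08” this gives 0x080000ed and thus the “Reset_Handler” at 0x080000ec; the entry “49 01 00 08” decodes to 0x08000149, i.e. the “SysTick_Handler” at 0x08000148.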

After that comes the actual disassembly. This is a list of all the instructions in the code section. The list is grouped by the symbols found in the input file. For C code, each symbol usually matches one function, so each block in the disassembly represents one C function. In assembly code, if you put non-local labels into a function, that function will be split into multiple blocks by the disassembler, making it harder to read - this is the main reason for using local labels. Each instruction is translated into one line inside the blocks. The first column is the address where that instruction is found. The next column contains the hexadecimal representation of the 2 or 4 bytes that make up the machine code of that instruction, i.e. the actual content of flash memory. After that comes a textual representation of that instruction, as inferred by the disassembler. If the instruction contains some number, the disassembler sometimes outputs a semicolon followed by some interpretation of that number. If the instruction employs PC-relative addressing, that interpretation will be the absolute address. As many instructions have multiple spellings, there can be discrepancies between the original code and the disassembly. The disassembler will also output data, such as the “BlinkTable” and the literal pools, as such. Using the “.type” directive is helpful in that case so that the disassembler does not attempt to interpret the data bytes as code.
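
The absolute addresses printed for PC-relative loads can be reproduced with the following C sketch. It uses the rule that in thumb code the PC reads as the address of the current instruction plus 4, rounded down to a multiple of 4 for this kind of load. The function name “literal_address” is made up for this illustration.

```c
#include <stdint.h>

/* Absolute address accessed by a thumb "ldr rX, [pc, #offset]" instruction
   located at insn_addr. */
uint32_t literal_address(uint32_t insn_addr, uint32_t offset)
{
    return ((insn_addr + 4u) & ~3u) + offset;
}
```

For the first three instructions of the “Reset_Handler” at 0x80000ec, 0x80000ee and 0x80000f0 with the offsets 60, 64 and 64, this yields 0x800012c, 0x8000130 and 0x8000134, matching the addresses in the disassembly.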

objdump can also be used to disassemble raw binary files that can be obtained by reading back the flash memory of some controller. To do this, use this command line:

<source lang=sh>
$ arm-none-eabi-objdump -b binary -m arm -D prog1.bin -Mforce-thumb --adjust-vma=0x08000000
</source>

The address of the binary in flash memory is specified so that the printed instruction addresses are correct. However, as the disassembler can’t tell data and code apart, the result will be of limited use. If you have to analyze a binary without having an ELF file or the source code, a more sophisticated disassembler such as IDA Pro is helpful. If you have the code and only need the disassembler to identify potential problems with the project (esp. the linker script), objdump is usually sufficient.
==== readelf ====
The “readelf” program is a powerful utility that can read and output various information from ELF files. The most useful option is the “-S” flag, which lets readelf print a summary of the sections in the respective file, e.g.:

<pre>
$ arm-none-eabi-readelf -S prog1.elf
There are 15 section headers, starting at offset 0x11268:

Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .VectorTable PROGBITS 08000000 010000 0000ec 00 A 0 0 1
[ 2] .text PROGBITS 080000ec 0100ec 000104 00 AX 0 0 4
[ 3] .stack NOBITS 20000000 020000 000400 00 WA 0 0 1
[ 4] .data PROGBITS 20000400 010400 000004 00 WA 0 0 1
[ 5] .bss NOBITS 20000404 010404 000004 00 WA 0 0 1
[ 6] .ARM.attributes ARM_ATTRIBUTES 00000000 010404 000021 00 0 0 1
[ 7] .debug_line PROGBITS 00000000 010425 0000db 00 0 0 1
[ 8] .debug_info PROGBITS 00000000 010500 00004c 00 0 0 1
[ 9] .debug_abbrev PROGBITS 00000000 01054c 000028 00 0 0 1
[10] .debug_aranges PROGBITS 00000000 010578 000040 00 0 0 8
[11] .debug_str PROGBITS 00000000 0105b8 00002f 01 MS 0 0 1
[12] .symtab SYMTAB 00000000 0105e8 0006a0 10 13 45 4
[13] .strtab STRTAB 00000000 010c88 000550 00 0 0 1
[14] .shstrtab STRTAB 00000000 0111d8 000090 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
y (purecode), p (processor specific)
</pre>

For each section, one line is output. The sections “.strtab”, “.shstrtab”, “.symtab” and “NULL” are an integral part of ELF and always present. The “.debug” sections are present if the source was assembled with the “-g” flag. The “.ARM.attributes” section defines for which ARM processor the contained code was translated. These sections don’t end up on the microcontroller. The remaining sections were defined in the linker script: “.VectorTable” contains the addresses of the exception handlers, “.text” contains the program code and constant data for flash memory, “.stack” the stack in RAM, “.data” contains variables in RAM and “.bss” contains zero-initialized variables in RAM. For these sections, the column “Type” contains either “PROGBITS” or “NOBITS”, which tells you whether the section in the ELF file actually contains some data - this is only the case for “.VectorTable”, “.text” and “.data”. The sections “.bss” and “.stack” only reserve memory that is written at runtime, but the ELF file doesn’t contain data to be written in these sections. The column “Addr” defines where this section begins in the address space. The most useful column is “Size”: If you sum up the sizes of the sections “.VectorTable”, “.text” and “.data”, you obtain the used flash memory. By summing up “.data”, “.stack” and “.bss”, you get the used amount of RAM. Note that “.data” is counted twice, as the initialization data is stored in flash.
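
As a worked example, the sums for the readelf output above can be written as a short C sketch. The constant and function names are made up for this illustration; the values are copied from the “Size” column.

```c
#include <stdint.h>

/* Section sizes in bytes, taken from the "Size" column of the readelf output. */
enum {
    SIZE_VECTOR_TABLE = 0xec,
    SIZE_TEXT         = 0x104,
    SIZE_STACK        = 0x400,
    SIZE_DATA         = 0x4,
    SIZE_BSS          = 0x4
};

/* Flash holds the vector table, the code and the initialization image of .data. */
uint32_t flash_usage(void)
{
    return SIZE_VECTOR_TABLE + SIZE_TEXT + SIZE_DATA;
}

/* RAM holds the stack, the initialized data and the zero-initialized data. */
uint32_t ram_usage(void)
{
    return SIZE_STACK + SIZE_DATA + SIZE_BSS;
}
```

For this program, the result is 500 bytes of flash and 1032 bytes of RAM.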
==== nm ====
The “nm” utility prints the symbols defined in an ELF file, for example:

<pre>
$ arm-none-eabi-nm prog1.elf
080001ec W ADC1_2_IRQHandler
20000404 b BlinkStep
08000192 t BlinkTable
080001a3 t BlinkTableEnd
20000408 B _BssEnd
20000404 B _BssStart
…
</pre>

This can be helpful in analyzing errors in linker scripts where symbols might get assigned wrong addresses.
==== addr2line ====
The “addr2line” utility reads the debug information from an ELF file to determine which line in which source file produced the instruction found at a particular given address. For example:

<source lang=sh>
$ arm-none-eabi-addr2line 0x080000f0 -e prog1.elf
/tmp/test/prog1.S:24
</source>

Here, line 24 of “prog1.S” contains the assembler command that produced the instruction that ends up at address 0x080000f0.

==== objcopy ====
The “objcopy” utility allows you to translate program files between different formats. It is useful to convert the ELF files to both the Intel Hex format and a simple binary representation. For example,

<source lang=sh>
arm-none-eabi-objcopy -O ihex prog1.elf prog1.hex
</source>

produces a “.hex” file that contains an image of the flash contents in hexadecimal form. With

<source lang=sh>
arm-none-eabi-objcopy -O binary prog1.elf prog1.bin
</source>

a binary file is created which contains an exact 1:1 image of the flash contents. Some flashing tools require these formats instead of ELF, and viewing the binary file with a hex editor can be interesting as well.

=== Interfacing C and C++ code ===
Since assembly is rarely used to implement entire complex projects, but mostly for a few time-critical or especially low-level routines that are part of larger code bases written in a high-level language, interfacing C and assembly code is an important topic, which will be covered here. While it is possible to write the main project structure in assembly and integrate some C modules, it is usually done the other way round. Most of the code shown so far is already ready to be included in C programs. Most of this topic works the same way for C++, apart from C++ exceptions (not to be confused with ARM processor exceptions) - but these are rarely used on embedded targets anyway.

If you compile C, C++ and assembly code into individual .o object files, you can link these together using “ld” as before. However, C and C++ code usually requires access to the respective standard library, and “ld” doesn’t link these by default - therefore it is necessary to replace the call to “ld” with a call to “gcc” or “g++” for C or C++, respectively. This will call “ld” internally and pass the required libraries.
==== Environment setup for C and C++ ====
Many C projects use a reset handler and vector table implemented in assembly, although writing them in C is possible too. As required by the C standard, C programs start with the “main()” function, so the (assembly) reset handler should set up the environment such that it is ready for C, and then call “main”. The C code might then later call some assembly functions or inline assembly. When using C++ code, or some GCC extensions for C code, it is required to call some additional functions before calling “main”. C++ uses this mechanism to call the constructors of global objects. The C and C++ compilers emit a table of pointers to the functions that should be called at startup. This table has to be put into flash memory by modifying the linker script as follows:

<source>
.text : {
*(.text)
. = ALIGN(4);

_InitArrayStart = .;
*(SORT(.preinit_array*))
*(SORT(.init_array*))
_InitArrayEnd = .;
} >FLASH
</source>

The table of function pointers is sorted to keep the order needed by the compiler. The symbols “_InitArrayStart” and “_InitArrayEnd” mark beginning and end of that table. A reset handler that performs the memory initialization as before and calls the table of initialization functions could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

ldr r4, =_InitArrayStart
ldr r5, =_InitArrayEnd

b 2f
1: ldr r0, [r4], #4
blx r0
2: cmp r4, r5
blo 1b

bl main
1: bkpt
b 1b
.ltorg
</source>

Note that registers r4 and r5 are used for iterating over the table, since the calling convention requires the called functions to preserve them. The “blx” instruction is needed to perform the indirect function call. When everything is set up, the main function is called. For embedded programs, the main function should never return (i.e. it should contain an endless loop). If it does return, that’s an error, and to make it easier to find, an endless loop with a forced breakpoint is put right after the call to “main”.
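
The table walk at the end of the reset handler can also be sketched in C, which may help when comparing it against startup code written in C. The following is a minimal model that runs on a PC, not the tutorial's startup code: the names “InitFunction”, “CallInitArray” and the two sample initializers are hypothetical, and on a real target the two pointers would come from the linker script symbols “_InitArrayStart” and “_InitArrayEnd” instead of a local array.

<source lang=c>
#include <stddef.h>

/* Each table entry is a pointer to a function taking and returning nothing. */
typedef void (*InitFunction)(void);

/* Call every entry in the half-open range [start, end), in table order.
   This mirrors the ldr/blx/cmp/blo loop in the reset handler. */
void CallInitArray(const InitFunction *start, const InitFunction *end) {
    for (const InitFunction *entry = start; entry < end; ++entry) {
        (*entry)();
    }
}

/* Hypothetical initializers standing in for compiler-generated ones. */
static int callLog[2];
static size_t callCount = 0;

static void InitFirst(void)  { callLog[callCount++] = 1; }
static void InitSecond(void) { callLog[callCount++] = 2; }

/* Runs the model table once and returns the number of functions called. */
size_t RunInitDemo(void) {
    static const InitFunction table[] = { InitFirst, InitSecond };
    callCount = 0;
    CallInitArray(table, table + 2);
    return callCount;
}

/* Returns the n-th recorded call (1 for InitFirst, 2 for InitSecond). */
int InitDemoLogEntry(size_t n) {
    return callLog[n];
}
</source>

The loop condition compares pointers with “<”, just as the assembly version uses the unsigned comparison “blo”, so an empty table (start equal to end) results in no calls at all.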
==== Calling functions ====
To call assembly functions from C code and vice-versa, the assembly functions should observe the calling convention, as mentioned before. C functions can be called just like assembly functions from assembly code, by placing the parameters in registers r0-r3 and on the stack, calling the function using “bl” and retrieving the return value from r0. To call an assembly function from C code, you need to declare it in C first just like a C function. For example, to call a function that takes 2 integer arguments and returns an integer:

<source lang=c>
int AssemblyFunction (int a, int b);
</source>

If you now define a function named “AssemblyFunction” in your assembly code and export it via “.global”, you can call it from C code just like any function.
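
As an illustration of the C side, the sketch below declares and calls such a function. So that the snippet compiles on its own, a C stand-in definition takes the place of the assembly implementation; its behaviour (adding both arguments) and the caller name “UseAssemblyFunction” are assumptions made up for this example. In a real project the stand-in would be deleted and the symbol provided by the exported label in the assembly file.

<source lang=c>
/* Declaration as seen by the C code; the definition normally lives in assembly. */
int AssemblyFunction(int a, int b);

/* Hypothetical caller: a is passed in r0, b in r1, the result comes back in r0. */
int UseAssemblyFunction(void) {
    return AssemblyFunction(3, 4);
}

/* Stand-in definition so that this sketch links without the assembly file.
   Assumed behaviour: return the sum of both arguments. */
int AssemblyFunction(int a, int b) {
    return a + b;
}
</source>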
==== Accessing global variables ====
Global variables defined in C can be accessed from assembly code just like variables defined in assembly code, by using the variable’s name. To access an assembly variable from C code, you need to declare it first by specifying the type. For example, to declare an integer variable:

<source lang=c>
extern int AssemblyVariable;
</source>

If you now define a variable named “AssemblyVariable” in your assembly code and export it via “.global”, you can access it from C code just like any variable. The “extern” is required to make sure the C code doesn’t attempt to define another variable of the same name.
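
The difference between declaring and defining the variable can be seen in the following sketch. The helper “IncrementAssemblyVariable” and the initial value 41 are hypothetical; the last line is a stand-in for the assembly file and would be removed on a real target, where the storage is reserved by the assembly code.

<source lang=c>
/* Declaration only: no storage is reserved by this line. */
extern int AssemblyVariable;

/* Hypothetical helper that updates the shared variable and returns the new value. */
int IncrementAssemblyVariable(void) {
    AssemblyVariable += 1;
    return AssemblyVariable;
}

/* Stand-in definition replacing the assembly file in this sketch.
   Exactly one definition of the symbol may exist in the whole program. */
int AssemblyVariable = 41;
</source>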
=== Clock configuration ===
By default, STM32 controllers use an internal RC oscillator with 8 MHz as a clock source for the core and peripherals. This oscillator is too inaccurate for implementing a clock or using serial interfaces such as UART, USB or CAN. To obtain a more accurate clock, an external quartz crystal is usually used. Many STM32 boards feature an 8 MHz crystal. To use it, some initialization code is required that activates the microcontroller’s built-in crystal oscillator circuit and switches the clock input to it. The STM32 controllers also include a PLL which can multiply an input clock by a configurable factor before feeding it to the processor core and peripherals. This way, a precise and fast clock can be achieved - the STM32F103 supports up to 72 MHz core frequency. Unfortunately, flash memory is not capable of keeping up with such a high frequency. Therefore, when enabling a fast clock, the flash memory needs to be configured to use wait states depending on the frequency.

The following function configures the flash wait states, enables the crystal oscillator, configures the PLL to multiply the input clock by a factor of 9, and selects the PLL output as the system clock. The prescaler for the internal bus APB1 is set to 2. Assuming an 8 MHz crystal, this achieves the maximum performance possible with this microcontroller - 72 MHz for the core and APB2 domain, 36 MHz for APB1. If a different crystal is used, the PLL factors have to be adjusted.

<source lang=asm>
RCC = 0x40021000

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

.type ConfigureSysClock, %function
.global ConfigureSysClock
ConfigureSysClock:
@ Turn on HSE
ldr r0, =RCC
ldr r1, =((1 << RCC_CR_HSION)|(1 << RCC_CR_HSEON))
str r1, [r0, #RCC_CR]

@ Configure (but not start yet) PLL
@ Mul = 9, Prediv = 1, APB1 Prescaler = 2, APB2 Prescaler = 1, AHB Prescaler = 1
ldr r2, =(((9-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_USBPRE)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
orr r1, #(1 << RCC_CR_PLLON)

@ Wait for HSE ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_HSERDY)
beq 1b

@ Turn on PLL
str r1, [r0, #RCC_CR]

@ Pre-Calculate value for RCC_CFGR
orr r2, #(2 << RCC_CFGR_SW)

@ Wait for PLL ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_PLLRDY)
beq 1b

@ Set flash wait states to 2
ldr r0, =FLASH
ldr r3, =((1<<FLASH_ACR_PRFTBE)|(2<<FLASH_ACR_LATENCY))
str r3, [r0, #FLASH_ACR]
ldr r0, =RCC

@ Switch system clock to PLL
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
bic r1, #(1 << RCC_CR_HSION)

@ Wait for switch to PLL
1: ldr r3, [r0, #RCC_CFGR]
and r3, #(3 << RCC_CFGR_SWS)
cmp r3, #(2 << RCC_CFGR_SWS)
bne 1b

@ Turn off HSI to save power
str r1, [r0, #RCC_CR]

bx lr
.ltorg
</source>

Many projects perform the clock configuration in the reset handler before calling the main function. If you want to follow that practice, place a “bl ConfigureSysClock” as the first instruction in the “Reset_Handler” - this way, all the setup will run with the higher clock frequency, making start-up faster. This and the completed startup code from the previous chapters are implemented in the “startup.S” file in the example repository. If you use it, put your code in the “main” function, where RAM and system clock will already be initialized. This is shown in the “BlinkStartup” example.
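
The field values used in the “ConfigureSysClock” listing can be cross-checked on a PC, which is handy when adapting the code to a different crystal. The following C sketch is only an illustration under stated assumptions: the function names are hypothetical, only integer PLL factors from 2 to 16 without the optional input divider are considered, and the flash wait-state thresholds (24 MHz and 48 MHz) are taken from the STM32F103 reference manual.

<source lang=c>
#include <assert.h>
#include <stdint.h>

/* Bit position copied from the assembly listing. */
#define RCC_CFGR_PLLMUL 18

/* Returns the integer PLL factor that turns crystalHz into targetHz,
   or 0 if no factor in the range 2..16 gives exactly that frequency. */
unsigned PllFactorFor(uint32_t crystalHz, uint32_t targetHz) {
    if (crystalHz == 0 || targetHz % crystalHz != 0) {
        return 0;
    }
    uint32_t factor = targetHz / crystalHz;
    return (factor >= 2 && factor <= 16) ? (unsigned)factor : 0;
}

/* Encodes a PLL factor for RCC_CFGR: the field stores (factor - 2). */
uint32_t PllMulField(unsigned factor) {
    assert(factor >= 2 && factor <= 16);
    return (uint32_t)(factor - 2) << RCC_CFGR_PLLMUL;
}

/* Flash wait states required for a given system clock frequency. */
unsigned FlashWaitStates(uint32_t sysclkHz) {
    if (sysclkHz <= 24000000u) return 0;
    if (sysclkHz <= 48000000u) return 1;
    return 2;
}
</source>

For the 8 MHz crystal assumed in the text, “PllFactorFor(8000000, 72000000)” yields 9, and “PllMulField(9)” corresponds to the “(9-2)<<RCC_CFGR_PLLMUL” term in the listing. A 12 MHz crystal would need a factor of 6, while a 16 MHz crystal cannot reach 72 MHz with an integer factor alone.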

=== Project template & makefile ===
To quickly start your own project, a project template is supplied in the examples repository under the directory [https://github.com/Erlkoenig90/ArmAsmTutorial/tree/master/ProjectTemplate-STM32F103RB ProjectTemplate-STM32F103RB]. Put your own application code in the program.S file. The startup.S and vectortable.S contain the reset handler with RAM initialization and the vector table with default handler, respectively. A linker script is included too.

The project also contains a makefile. This allows you to quickly translate your project without having to type the assembler and linker commands. Simply type
<source lang=sh>
make
</source>
to translate the code and produce program.elf, program.bin and program.hex files. All “.S” files in the directory will be automatically translated. Writing makefiles is a complex topic on its own with a lot of information already available on the web, so no further explanations on that will be made here.

ARM-ASM-Tutorial

2021-07-09T06:32:57Z

Nicolas: Reverted revision 104573 by Nicolas (talk).

[[Kategorie:ARM]][[Kategorie:STM32]][[Kategorie:Entwicklungstools]][[Kategorie:Programmiersprachen]]The [[ARM]] processor architecture is widely used in all kinds of industrial applications and also a significant number of hobby and maker projects. This tutorial aims to teach the fundamentals of programming ARM processors in assembly language.

Tutorial by [[Benutzer:Erlkoenig|Niklas Gürtler]]. [https://www.mikrocontroller.net/topic/482409 Thread in Forum] for feedback and questions.

== Introduction ==
=== Why assembly? ===
Today, there is actually little reason to use assembly language for entire projects, because high-quality optimizing compilers for high-level languages (especially C and C++) are readily available as free open source software and because the ARM architecture is specifically optimized for high-level languages. However, knowledge of assembly is still useful for debugging certain problems, writing low-level software such as bootloaders and operating system kernels, and reverse engineering software for which no source code is available. Occasionally it is necessary to manually optimize some performance-critical code section. Sometimes claims are made that ARM processors can’t be programmed in assembly. Therefore, this tutorial will show that this is very well possible by demonstrating how to write (small) applications entirely in the ARM assembly language!

As most of the resources and tools for ARM focus on C programming and because of the complexity of the ARM ecosystem, the largest difficulty in getting started with ARM assembly is not the language itself, but rather using the tools correctly and finding relevant documentation. Therefore, this tutorial will focus on the development environment and how the written assembly code is transformed into the final program. With a good understanding of the environment, all the ARM instructions can be learned simply by reading the architecture documentation.

Because of the complex ecosystem around ARM, a general introduction of the ARM processor market is necessary.

=== About ARM ===
Arm Holdings is the company behind the ARM architecture. Arm does not manufacture any processors themselves, but designs the “blueprints” for processor cores, which are then licensed by various semiconductor companies such as ST, TI, NXP and many others, who combine the processor with various support hardware (most notably flash and RAM memories) and peripheral modules to produce a final complete processor IC. Some of these peripheral modules are even licensed from other companies – for example, the USB controller modules by Synopsys are found in many different processors from various manufacturers.

Because of this licensing model, ARM processor cores are found in a very large variety of products for which software can be developed using a single set of tools (especially compiler, assembler and debugger). This makes knowledge about the ARM architecture, particularly the ARM assembly language, useful for a large range of applications.

Since the ARM processor cores always require additional hardware modules to function, both the ARM-made processor core and the manufacturer-specific periphery modules have to be considered when developing software for ARM systems. For example, the instruction set is defined by ARM and software tools (compiler, assembler) need to be configured for the correct instruction set version, while the clock configuration is manufacturer-specific and needs to be addressed by initialization code specifically made for one processor.

=== Architecture and processor variants ===
A processor’s architecture defines the interface between hardware and software. Its most important part is the instruction set, but it also defines e.g. hardware behavior under exceptional circumstances (e.g. memory access errors, division by zero, etc.). Processor architectures evolve, so they have multiple versions and variants. They also define optional functionality that may or may not be present in a processor (e.g. a floating-point unit). For ARM, the architectures are documented exhaustively in the “ARM Architecture Reference Manuals”.

While the architecture is an abstract concept, a processor core is a concrete definition of a processor (e.g. as a silicon layout or HDL) that implements a certain architecture. Code that only uses knowledge of the architecture (e.g. an algorithm that does not access any periphery) will run on any processor implementing this architecture. Arm, as mentioned, designs processor cores for their own architectures, but some companies develop custom processors that conform to an ARM architecture, for example Apple and Qualcomm.

ARM architectures are numbered, starting with ARMv1 up until the most recent ARMv8. ARMv6 is the oldest architecture still in significant use, while ARMv7 is the most widespread one. Suffixes are appended to the version to denote variants of the architecture; e.g. ARMv7-M is for small embedded systems while ARMv7-A for more powerful processors. ARMv7E-M adds digital signal processing capabilities including saturating and SIMD operations.

Older ARM processors are named ARM1, ARM2 …, while after ARM11 the name “Cortex” was introduced. The Cortex-M family, including e.g. Cortex-M3 and Cortex-M4 (implementing ARMv7-M and ARMv7E-M architecture, respectively) is designed for microcontrollers, where power consumption, memory size, chip size and latency are important. The Cortex-A family, including e.g. Cortex-A8 and Cortex-A17 (both implementing ARMv7-A architecture) is intended for powerful processors (called “application processors”) for e.g. multimedia and communication products, particularly smartphones and tablets. These processors have much more processing power, typically feature high-bandwidth interfaces to the external world, and are designed to be used with high-level operating systems, most notably Linux (and Android).

An overview of ARM processors and their implemented architecture version can be found on [https://en.wikipedia.org/wiki/ARM_architecture#Cores Wikipedia].
This tutorial will focus on the Cortex-M microcontrollers, as these are much easier to program without an operating system and because assembly language is less relevant on Cortex-A processors. However, the large range of ARM-based devices necessitates flexibility in the architecture specification and software tools, which sometimes complicates their use.

There is actually not a single, but three instruction sets for ARM processors:
* The “A32” instruction set for 32bit ARM architectures, also simply called “ARM” instruction set, favors speed over program memory consumption. All instructions are 4 bytes in size.
* The “A64” instruction set is for the new 64bit ARM processors.
* The “T32” instruction set for 32bit ARM architectures, also known as “Thumb”, favors program memory consumption over speed. Most instructions are 2 bytes in size, and some are 4 bytes.

The 64bit Cortex-A application processors support all three instruction sets, while the 32bit ones only A32 and T32. The Cortex-M microcontrollers only support T32. Therefore, this tutorial will only talk about “thumb2”, the second version of the “T32” instruction set.

== Prerequisites ==
First, suitable hardware and software need to be selected for demonstrating the usage of assembly language. For this tutorial, the choice of the specific microcontroller is of no great significance. However, to ensure that the example codes are easily transferable to your setup, it is recommended to use the same components.

=== Microcontroller selection ===
For the microcontroller, an [https://www.st.com/en/microcontrollers-microprocessors/stm32f103c8.html STM32F103C8] or [https://www.st.com/en/microcontrollers-microprocessors/stm32f103rb.html STM32F103RB] by STMicroelectronics will be used. Both controllers are identical except for the flash size (64 KiB vs 128 KiB) and number of pins (48 vs 64). These controllers belong to ST’s “mainstream” entry-level family and are quite popular among hobbyist developers with many existing online resources. Several development boards with these controllers are available, for example: [https://www.st.com/en/evaluation-tools/nucleo-f103rb.html Nucleo-F103], “[https://os.mbed.com/users/hudakz/code/STM32F103C8T6_Hello/ Blue Pill]” (search for “stm32f103c8t6” on AliExpress, Ebay or Amazon), [https://www.olimex.com/Products/Duino/STM32/OLIMEXINO-STM32/open-source-hardware Olimexino-STM32], [https://www.olimex.com/Products/ARM/ST/STM32-P103/ STM32-P103], [https://www.olimex.com/Products/ARM/ST/STM32-H103/ STM32-H103], [https://www.st.com/en/evaluation-tools/stm3210e-eval.html STM3210E-EVAL].

=== Processor type & documentation ===
First, the microcontroller manufacturer’s documentation is used to find out what kind of ARM processor core and architecture is used for the chosen chip. This information is used to find all the relevant documentation.

* The first source of information is the [https://www.st.com/resource/en/datasheet/stm32f103rb.pdf STM32F103RB/C8 datasheet]. According to the headline, this is a '''medium-density''' device. This term is ST-specific and denotes a product family with certain features. The very first paragraph states that this microcontroller uses a '''Cortex-M3''' processor core with 72 MHz. This document also contains the electrical characteristics and pinouts.
* The next important document is the [https://www.st.com/resource/en/reference_manual/cd00171190.pdf STM32F103 reference manual] that contains detailed descriptions of the periphery. Particularly, detailed information about periphery registers and bits can be found here.
* The [https://developer.arm.com/ip-products/processors/cortex-m/cortex-m3 ARM developer website] provides information about the Cortex-M3 processor core, particularly the [https://static.docs.arm.com/100165/0201/arm_cortexm3_processor_trm_100165_0201_01_en.pdf ARM Cortex-M3 Processor Technical Reference Manual]. According to chapter 1.5.3, this processor implements the '''ARMv7-M architecture'''.
* The architecture is documented in the [https://static.docs.arm.com/ddi0403/ed/DDI0403E_d_armv7m_arm.pdf ARMv7-M Architecture Reference Manual]. Particularly, it contains the complete documentation of the instruction set.

For any serious STM32 development, you should be familiar with all these documents.

=== Debug adapter ===
There are many different ways of getting your program to run on an STM32 controller. A debug adapter is not only capable of writing software to the controller’s flash, but can also analyze the program’s behavior while it is running. This allows you to run the program one instruction at a time, analyze program flow and memory contents and find the cause of crashes. While it is not strictly necessary to use such a debugger, it can save a lot of time during development. Since entry-level models are available cheaply, not using one doesn’t even save money. Debuggers connect to a host PC via USB (some via Ethernet) and to the microcontroller (“target”) via JTAG or SWD. While these two interfaces are closely related and perform the same function, SWD uses fewer pins (2 instead of 4, excluding reset and ground). Most STM32 controllers support JTAG, and all support SWD.

Documenting all possible ways of flashing and debugging STM32 controllers is beyond the scope of this tutorial; a lot of information is already available online on that topic. Therefore, this tutorial will assume that the [https://www.st.com/en/development-tools/st-link-v2.html ST-Link] debug adapter by STMicroelectronics is used, which is cheap and popular among hobbyists. Some of the aforementioned boards even include an ST-Link adapter, which can also be used “stand-alone” to flash an externally connected microcontroller. The examples should work with other adapters as well; please consult the appropriate documentation on how to use them.

=== Development Software ===
On the software part, several tools are needed for developing microcontroller firmware. Using a complete Integrated Development Environment (IDE) saves time and simplifies repetitive steps but hides some important steps that are necessary to gain a basic understanding of the process. Therefore, this tutorial will show the usage of the basic command line tools to demonstrate the underlying principles. Of course, for productive development, using an IDE is a sensible choice. The tools presented will work on Windows, Linux and Mac OS X (untested).

First, a text editor for writing assembly code is needed. Any good editor such as Notepad++, gedit or Kate is sufficient. When using Windows, the [https://www.st.com/en/development-tools/stsw-link004.html ST-Link Utility] can be useful, but is not strictly required.

Next, an assembler toolchain is needed to translate the written assembly code into machine code. For this, the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm GNU Arm Embedded Toolchain] is used. This is a collection of open source tools for writing software in Assembly, C and C++ for Cortex-M microcontrollers. Even though the package is maintained by ARM, the software is created by a community of open-source developers. For this tutorial, only the contained applications “binutils” (includes assembler & linker) and “GDB” (debugger) are really needed, but if you later decide to work with C or C++ code, the contained compilers will come in handy. Apart from that, this package is also shipped as part of several IDEs such as SW4STM32, Atollic TrueSTUDIO, emIDE, Embedded Studio and even Arduino – so if you (later) wish to work with one of these, your assembly code will be compatible with it.

Another component is required to talk with the debug adapter. For the ST-Link, this is done by [http://openocd.org/ OpenOCD], which communicates with the adapter via USB. Other adapters such as the J-Link ship with their own software.

Lastly, a calculator that supports binary and hexadecimal modes can be very helpful. Both the default Gnome calculator and the Windows calculator (calc.exe) are suitable.

== Setup ==
Follow the instructions in the next chapters to set up your development environment.

=== Hardware ===
The only thing that needs to be done hardware-wise is connecting the debugger with your microcontroller. If you are using a development board with an integrated debugger (such as the Nucleo-F103), this is achieved by setting the jumpers accordingly (see the board’s documentation – for e.g. the Nucleo-F103, both “CN2” jumpers need to be connected). When using an external debugger, connect the “GND”, “JTMS/SWDIO” and “JTCK/SWCLK” pins of debugger and microcontroller. Connect the debugger’s “nRESET” (or “nTRST” if it only has that) pin to the microcontroller’s “NRST” input.

If your board has jumpers or solder bridges for the “BOOT0” pin, make sure that the pin is low. Applying power to the microcontroller board is typically done via USB.

=== Software ===
==== Linux ====
Some Linux distributions ship with packages for the ARM toolchain. Unfortunately, these are often outdated and also configured slightly differently than the aforementioned package maintained by ARM. Therefore, to be consistent with the examples, it is strongly recommended to use the package by ARM.

Download the Linux binary tarball from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and extract it to some directory whose path does not contain any spaces. The extracted directory contains a subdirectory called “bin”. Copy the full path to that directory (e.g. “/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin”).

Add this path to the “PATH” environment variable. On Ubuntu/Debian systems, this can be done via:

<source lang=sh>
echo 'export PATH="${PATH}:/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin"' | sudo tee /etc/profile.d/gnu-arm-embedded.sh
</source>

OpenOCD can be installed via the package manager, e.g. (Ubuntu/Debian):
<source lang=sh>
sudo apt-get install openocd
</source>

After that, log out and back in (or just reboot). In a terminal, type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source lang=sh>
$ arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source lang=sh>
$ openocd -v
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

==== Windows ====
[[File:ArmAsmTutorial_GccOptions.png|300px|thumb|right|Options for installing GCC]]Download the Windows installer from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and run it. Enable the options “Add path to environment variable” and “Add registry information”, and disable “Show Readme” and “Launch gccvar.bat”.

A Windows package for OpenOCD can be obtained from the [https://github.com/gnu-mcu-eclipse/openocd/releases gnu-mcu-eclipse downloads page]. Download the appropriate file, e.g. “gnu-mcu-eclipse-openocd-0.10.0-12-20190422-2015-win64.zip”. The archive contains a path like “GNU MCU Eclipse/OpenOCD/0.10.0-12-20190422-2015”. Extract the contents of the inner directory (i.e. the subdirectories “bin”, “doc”, “scripts”…) into some directory whose path does not contain any spaces, e.g. “C:\OpenOCD”. You should now have a directory “C:\OpenOCD\bin” or similar. Copy its full path.

[[File:ArmAsmTutorial_PcProperties.png|300px|thumb|right|Opening PC properties]][[File:ArmAsmTutorial_SetEnvVar.png|300px|thumb|right|Setting environment variable]]Set the “Path” environment variable to include this path: Right-Click on “This PC”, then “Properties” → “Advanced System Settings”→ “Environment Variables”. In the lower list (labeled “System variables”), select “Path”. Click “Edit” → “New”, paste the path, and click “OK” multiple times.

Open a ''new'' command window (Windows Key + R, type “cmd” + Return). Type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source>
C:\>arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source>
C:\>openocd -v
GNU MCU Eclipse OpenOCD, 64-bit
Open On-Chip Debugger 0.10.0+dev-00593-g23ad80df4 (2019-04-22-20:25)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

== Writing assembly applications ==
The full source code of the examples in the following chapters can be found on [https://github.com/Erlkoenig90/ArmAsmTutorial GitHub]. The name of the corresponding directory is given after each example code below.
=== First rudimentary program ===
After the software setup, you can begin setting up a first project. Create an empty directory for that, e.g. “prog1”.

Inside the project directory, create your first assembly file “prog1.S” (“.S” being the file name extension for assembly files in GNU context) with the following content:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

nop @ Do Nothing
b . @ Endless loop
</source>
Example name: “EmptyProgram”

When this file is sent to the assembler, it will translate the instructions into binary machine code, with 2 or 4 bytes per instruction. These bytes are concatenated to form a program image, which is later written into the controller’s flash memory. Therefore, assembly code more or less directly describes flash memory contents.
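
Whether an instruction occupies 2 or 4 bytes can be read off its first halfword: according to the ARMv7-M Architecture Reference Manual, a halfword whose upper five bits are 0b11101, 0b11110 or 0b11111 starts a 4-byte instruction, while any other value is a complete 2-byte instruction. The following C sketch (hypothetical function name, intended to run on a PC) applies that rule:

<source lang=c>
#include <stdint.h>

/* Returns the size in bytes (2 or 4) of the Thumb instruction that
   starts with the given halfword. */
unsigned ThumbInstructionSize(uint16_t firstHalfword) {
    unsigned top5 = firstHalfword >> 11;   /* bits [15:11] */
    return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}
</source>

For example, the “nop” from the program above is encoded as 0xBF00 and “b .” as 0xE7FE, so both are 2 bytes long.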

The lines starting with a dot “.” are assembler directives that control the assembler’s operation. Only some of those directives emit bytes that will end up in flash memory. The @ symbol starts a comment.

The first line lets the assembler use the new “unified” instruction syntax (“UAL” - Unified Assembler Language) instead of the old ARM syntax. The second line declares the used processor Cortex-M3, which the assembler needs to know in order to recognize the instructions available on that processor. The third line instructs the assembler to use the Thumb (T32) instruction set. We can’t start putting instructions in flash memory right away, as the processor expects a certain data structure to reside at the very beginning of the memory. This is what the “.word” and “.space” directives create. These will be explained later.

The first “real” instruction is “nop”, which will be the first instruction executed after the processor starts. “nop” is short for “No OPeration” - it causes the processor to do nothing and continue with the next instruction. This next instruction is “b .”. “b” is short for “branch” and instructs the processor to jump to a certain “target” location, i.e. execute the instruction at that target next. In assembly language, the dot “.” represents the current location in program memory. Therefore, “b .” instructs the processor to jump to this very instruction, i.e. execute it again and again in an endless loop. Such an endless loop is frequently found at the end of microcontroller programs, as it prevents the processor from executing random data that is located in flash memory after the program.
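
The three data directives at the top of the listing determine where “nop” ends up in flash. As a small preview, the following C sketch (hypothetical function names, runnable on a PC) redoes that arithmetic under the assumption that flash starts at 0x08000000: two 4-byte words plus 0xe4 reserved bytes place the first instruction at offset 0xec, and the second “.word” holds that address with bit 0 set, which marks the target as Thumb code on Cortex-M.

<source lang=c>
#include <stdint.h>

/* Address of the first instruction: flash base + two 32-bit words + padding. */
uint32_t FirstInstructionAddress(uint32_t flashBase, uint32_t paddingBytes) {
    return flashBase + 2u * 4u + paddingBytes;
}

/* Value to store in the vector table for a Thumb code address (bit 0 set). */
uint32_t ThumbVectorEntry(uint32_t codeAddress) {
    return codeAddress | 1u;
}
</source>

With these inputs, “FirstInstructionAddress(0x08000000, 0xe4)” gives 0x080000ec, and “ThumbVectorEntry” turns that into the 0x080000ed found in the second “.word”.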

To translate this assembly code, open a terminal (Linux) / command window (Windows). Enter the project directory by typing <code>cd <Path to Project Directory></code>. Call the assembler like this:

<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
</source>

This instructs the assembler to translate the source file “prog1.S” into an object file “prog1.o”. This is an intermediary file that contains binary machine code, but is not a complete program yet. The “-g” option tells the assembler to include debug information, which does not influence the program itself, but makes debugging easier. To turn this object file into a final program, call the linker like this:

<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000
</source>

This creates a file “prog1.elf” that contains the whole generated program. The “-Ttext” option instructs the linker to assume 0x8000000 as the start address of the flash memory. The linker might output a warning like this:

<pre>arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000</pre>

This is not relevant for executing the program without an operating system and can be ignored.

=== Flashing the program ===
To download the compiled application to the microcontroller that has been attached via ST-Link, use OpenOCD like so:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg -c "program prog1.elf verify reset exit"
</source>
Unfortunately, the application does not do anything that can be observed from the outside, except perhaps increase the current consumption.
=== Starting the debugger ===
To check whether the program is actually running, start a debugging session to closely observe the processor’s behavior. First, run OpenOCD such that it acts as a GDB server:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg
</source>
Then, open a new terminal/command window and start a GDB session:
<source lang=sh>
arm-none-eabi-gdb prog1.elf
</source>
GDB provides its own interactive text-based user interface. First, type this command to let GDB connect to the already running OpenOCD instance:
<source>
target remote :3333
</source>
Then, stop the currently running program:
<source>
monitor reset halt
</source>
If this fails, hold your board’s reset button just before executing the command and repeat until it succeeds. GDB can also download code to flash memory by simply typing:
<source>
load
</source>
This will overwrite the previously flashed program (which, in this case, is identical anyway). After loading the program, reset the controller again:
<source>
monitor reset halt
</source>
Now, examine the contents of the CPU registers:
<source>
info reg
</source>
The output should look something like
<pre>
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
sp 0x0 0x0
lr 0x0 0
pc 0x8000000 0x8000000 <_stack+133693440>
xPSR 0x1000000 16777216
msp 0x20000400 0x20000400
psp 0x27e3fa34 0x27e3fa34
primask 0x0 0
basepri 0x0 0
faultmask 0x0 0
control 0x0 0
</pre>

At this point, the processor is ready to start executing your program. The processor is halted just before the first instruction, which is “nop”. You can let the processor execute one single instruction (i.e. the “nop”) by typing
<source>
stepi
</source>
If you type <code>info reg</code> again, you will see that PC is now “0x80000ee”, i.e. the processor is about to execute the next instruction, “b .”. When you type <code>stepi</code> again (repeatedly), nothing more will happen – the controller is stuck in the mentioned endless loop, exactly as intended. You can instruct the processor to run the program continuously, without stopping after each instruction, by typing
<source>
continue
</source>
You can interrupt the running program by pressing “Ctrl+C”. Run the commands
<source>
kill
quit
</source>
to exit GDB. You can terminate OpenOCD by pressing “Ctrl+C” in its terminal.
=== Using processor registers ===
The example program hasn’t done anything useful, but any “real” program will need to process some data. On ARM, any data processing is done via the processor registers. The 32bit ARM platforms have 16 processor registers, each of which is 32bit in size. The last three of those (r13-r15) have a special meaning and can only be used with certain restrictions. The first thirteen (r0-r12) can be used freely by the application code for data processing.

All calculations (e.g. addition, multiplication, logical and/or) need to be performed on those processor registers. To process data from memory, it first has to be loaded into a register, then processed, and stored back into memory. This is typical for RISC platforms and is known as a “load-store-architecture”.

As the starting point for any calculation, some specific values need to be put into the registers. The easiest way to do that is:
<source lang=asm>
ldr r0, =123456789
</source>
The number 123456789 will be encoded as part of the program, and the instruction lets the processor copy it into the register “r0”. Any number and any register in the range r0-r12 can be used instead.

The instruction “mov” can be used to copy the contents from one register to another:
<source lang=asm>
mov r1, r0
</source>
This copies r0 to r1. Unlike some other processor architectures, “mov” can not be used to access memory, but only the processor registers.

In ARM, 32bit numbers are called "words" and are most frequently used. 16bit numbers are known as half-words, and 8bit numbers as bytes, as usual.
=== Accessing periphery ===
To write microcontroller programs that interact with the outside world, access to the controller’s periphery modules is required. Interaction with periphery happens mainly through periphery registers (also known as “special function registers”, SFR). Despite their name, they work quite differently from processor registers. Instead of numbers, they have addresses (in the range of 0x40000000-0x50000000) that are not contiguous (i.e. there are gaps). They cannot be directly used for data processing, but need to be explicitly read and written before and after any calculations. Not all of them are 32bit; many have only 16bit, and some of those bits may not exist and can’t be accessed. The microcontroller manufacturer’s documentation uses names for these registers, but the assembler doesn’t know these. Therefore, the assembly code needs to use the numerical addresses.

The easiest way to get the microcontroller to do something that produces some visible result is to send a signal via an output pin to turn on an LED. Using a pin to send/receive arbitrary software-defined signals is called “GPIO” (General Purpose Input/Output). First, choose a pin – for example, PA8 (this one is available on all package variants). Connect an LED to this pin and to GND (“active high”). Use a series resistor to limit the current to max. 15mA (the absolute maximum being 25mA), e.g. 100Ω for a 3.3V supply and a standard LED. For higher loads (e.g. high-power LEDs or a relay) use an appropriate transistor.

As with most microcontrollers, the pins are grouped into so-called “ports”, each of which has up to 16 pins. The ports are named by letters of the alphabet, i.e. “GPIOA”, “GPIOB”, “GPIOC” etc. The number of ports and pins varies among the individual microcontroller types. The 16 pins of one port can be read or written in one single step.

==== Clock Configuration ====
Many ARM controllers feature a certain trap: Most periphery modules are disabled by default to save power. The software has to explicitly enable the needed modules. On STM32 controllers, this is done via the “RCC” (Reset and Clock Control) module. Particularly, this module allows the software to disable/enable the clock signal for each periphery module. Because MOSFET-based circuits (virtually all modern ICs) draw most of their power while a clock signal is applied, turning off the clock of unused modules can reduce the power usage considerably.

This is documented in the aforementioned reference manual in chapter 7. The subchapter 7.3.7 describes the periphery register “RCC_APB2ENR” which allows you to configure the clock signal for some peripheral modules. This register has 32 bits, of which 14 are “reserved”, i.e. can’t be used and should only be written with zeroes. Each of the available 18 bits enables one specific periphery module if set to “1” or disables it if set to “0”. According to the manual, the reset value of this register is 0, so all periphery modules are disabled by default. In order to turn on the GPIOA module to which the desired pin PA8 belongs, the bit “IOPAEN” needs to be set to “1”. This is bit number two in the register. Since registers can only be accessed as a whole (individual bits can’t be addressed), a 32bit-value where bit two is “1” and all others are kept as “0” needs to be written. This value is 0x00000004.

To write to the register, its address needs to be given in the code. The addresses of the periphery registers are grouped by the periphery modules they belong to - each periphery module (e.g. RCC, GPIOA, GPIOB, USB, …) has its own base address. The addresses of the individual registers are specified as an offset that needs to be added to this base address to obtain the full absolute address of the register. Chapter 7.3.7 specifies the offset address of RCC_APB2ENR as “0x18”. Chapter 3.3 specifies the base addresses of all periphery modules – RCC is given as “0x40021000”. So, the absolute address of RCC_APB2ENR is “0x40021000+0x18=0x40021018”.

In short: To enable GPIOA, the value '''0x00000004''' needs to be written to address '''0x40021018'''.

According to the “load-store” principle, ARM processors can’t do this in a single step. Both the value to be written and the address need to reside in processor registers in order to perform the write access. So, what needs to be done is:
* Load the value 0x00000004 into a register
* Load the value 0x40021018 into another register
* Store the value from the first register into the memory location specified by the second register.
This last step is performed by the “STR” instruction as follows:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
The square brackets are required but just serve as a reminder to the programmer that the contents of “r1” is used as an address. After the “str” instruction, the GPIOA periphery is enabled, but doesn’t do anything yet.
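
The address calculation described above can also be left to the assembler. The following is a minimal sketch, assuming the GNU assembler's <code>.equ</code> directive for defining constants; the symbol names <code>RCC_BASE</code>, <code>RCC_APB2ENR</code> and <code>IOPAEN</code> are chosen here for illustration only and are not known to the assembler by themselves. It is a fragment that would take the place of the three instructions before “b .” in the program above:
<source lang=asm>
.equ RCC_BASE, 0x40021000 @ Base address of RCC (reference manual, chapter 3.3)
.equ RCC_APB2ENR, RCC_BASE + 0x18 @ Base address plus register offset: 0x40021018
.equ IOPAEN, 1 << 2 @ Only bit two set: 0x00000004

ldr r0, =IOPAEN
ldr r1, =RCC_APB2ENR
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
</source>
Since the assembler evaluates the expressions during translation, the generated machine code should be the same as with the numbers written out.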

==== GPIO Configuration ====
By default, all GPIO pins are configured as “input”, even if there is no software to process the input data. Since inputs are “high-impedance”, i.e. only a very small current can flow into/out of the pin, the risk of (accidental) short-circuits and damage to the microcontroller is minimized. However, this current is too small to light up an LED, so you have to configure the pin PA8 as “output”. The STM32 support multiple output modes, of which the right one for the LED is “General Purpose Output Push-Pull, 2 MHz”.

Access and configuration of GPIO pins is achieved via the registers of the GPIO periphery. The STM32 have multiple identical instances of GPIO modules, which are named GPIOA, GPIOB, … Each of those instances has a distinct base address, which are again described in chapter 3.3 of the reference manual (e.g. “0x40010800” for GPIOA, “0x40010C00” for GPIOB etc.). The registers of the GPIO module are described in chapter 9.2, and there is one instance of each register per GPIO module. To access a specific register of a specific GPIO module, the base address of that module needs to be added to the offset address of the register. For example, “GPIOA_IDR” has address “0x40010800+0x08=0x40010808”, while “GPIOB_ODR” has address “0x40010C00+0x0C=0x40010C0C”.

Configuration of the individual GPIO pins happens through the “GPIOx_CRL” and “GPIOx_CRH” registers (“x” is a placeholder for the concrete GPIO module) – see chapters 9.2.1 and 9.2.2. Both registers are structured identically, where each pin uses 4 bits, so each of the two registers handles 8 pins in 8x4=32 bits. Pins 0-7 are configured by “GPIOx_CRL” and pins 8-15 by “GPIOx_CRH”. Pin 0 is configured by bits 0-3 of “GPIOx_CRL”, pin 1 by bits 4-7 of “GPIOx_CRL”, pin 8 by bits 0-3 of “GPIOx_CRH” and so on.

The 4 bits per pin are split into two 2-bit fields: “MODE” occupies bits 0-1, and “CNF” bits 2-3. “MODE” selects from input and output modes (with different speeds). In output mode, “CNF” determines whether the output value is configured from software (“General Purpose” mode) or driven by some other periphery module (“Alternate function” mode), and whether two transistors (“Push-pull”) or one (“open-drain”) are used to drive the output. In input mode, “CNF” selects from analog mode (for ADC), floating input and input with pull-up/down resistors (depending on the value in the “GPIOx_ODR” register).

Therefore, to configure pin PA8 into “General Purpose Output Push-Pull, 2 MHz” mode, bits 0-3 of “GPIOA_CRH” need to be set to value “2”. The default value of “4” configures the pin as “input”. To keep the other pins at their “input” configuration, the value “0x44444442” needs to be written to register “GPIOA_CRH”, which has address “0x40010804”:
<source lang=asm>
ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
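
The value 0x44444442 can also be derived step by step from the field layout described above. The following is a sketch under the assumption that the GNU assembler's <code>.equ</code> directive and its integer expressions are used; all symbol names (<code>CRH_DEFAULT</code>, <code>PIN</code>, <code>PIN_SHIFT</code>, <code>PIN_CONFIG</code>, <code>CRH_VALUE</code>) are hypothetical and only serve this illustration:
<source lang=asm>
.equ CRH_DEFAULT, 0x44444444 @ All eight pins at the default value 4 (input)
.equ PIN, 8 @ Pin number within the port (8-15 for GPIOx_CRH)
.equ PIN_SHIFT, (PIN - 8) * 4 @ Position of the pin's 4-bit field: 0 for pin 8
.equ PIN_CONFIG, (0 << 2) | 2 @ CNF = 0 (bits 2-3), MODE = 2 (bits 0-1)
.equ CRH_VALUE, (CRH_DEFAULT & ~(0xF << PIN_SHIFT)) | (PIN_CONFIG << PIN_SHIFT)

ldr r0, =CRH_VALUE @ Evaluates to 0x44444442 for PIN = 8
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
With this approach, changing <code>PIN</code> to 10 would yield 0x44444244, i.e. the same configuration for pin PA10 instead.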

==== Writing GPIO pins ====
The GPIO pin still outputs the default value, which is 0 for “low”. To turn on the LED, the output has to be set to “1” for “high”. This is achieved via the GPIOA_ODR register, which has 16bits, one for each pin (see chapter 9.2.4). To enable the LED, set bit 8 to one:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x100
ldr r1, =0x4001080C
str r0, [r1] @ Set ODR8 in GPIOA_ODR to 1 to set PA8 high

b .
</source>
Example name: “SetPin”

This program enables the GPIOA periphery clock, configures PA8 as output, and sets it to high. If you run it on your microcontroller, you should see the LED turn on – the first program to have a visible effect!

=== Data processing ===
ARM supports many instructions for mathematical operations. For example, addition can be performed as:
<source lang=asm>
ldr r0, =222
ldr r1, =111
add r2, r0, r1
</source>
This will first load the value 222 into register r0, load 111 into r1, and finally add r0 and r1 and store the result (i.e. 333) in r2. The operand for the result is (almost) always put on the left, while the input operand(s) follow on the right.

You can also overwrite an input register with the result:
<source lang=asm>
add r0, r0, r1
</source>
This will write the result to r0, overwriting the previous value. This is commonly shortened to
<source lang=asm>
add r0, r1
</source>
The output operand can be omitted, and the first input (here: r0) will be overwritten. This applies to most data processing instructions. Other frequently used data processing instructions that are used in a similar fashion are:
* '''sub''' for subtraction
* '''mul''' for multiplication
* '''and''' for bitwise and
* '''orr''' for bitwise or
* '''eor''' for bitwise exclusive or (“xor”)
* '''lsl''' for logical left shift
* '''lsr''' for logical right shift
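
As a short illustration of the listed instructions, the following fragment applies each of them to small example values. The choice of registers and numbers is arbitrary; the results given in the comments can be checked with GDB’s “info reg” command:
<source lang=asm>
ldr r0, =12 @ 12 is 0b1100
ldr r1, =10 @ 10 is 0b1010
ldr r2, =2

sub r3, r0, r1 @ r3 = 12 - 10 = 2
mul r4, r0, r1 @ r4 = 12 * 10 = 120
and r5, r0, r1 @ r5 = 0b1100 and 0b1010 = 0b1000 = 8
orr r6, r0, r1 @ r6 = 0b1100 or 0b1010 = 0b1110 = 14
eor r7, r0, r1 @ r7 = 0b1100 xor 0b1010 = 0b0110 = 6
lsl r8, r0, r2 @ r8 = 12 shifted left by 2 bits = 48
lsr r9, r0, r2 @ r9 = 12 shifted right by 2 bits = 3
</source>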

Most of these instructions can not only take registers as input, but also immediate arguments. Such an argument is encoded directly into the instruction without needing to put it into a register first. Immediate arguments need to be prefixed by a hash sign #, and can be decimal, hexadecimal or binary. For example,
<source lang=asm>
add r0, r0, #23
</source>
adds 23 to the register r0 and stores the result in r0. This can again be shortened to
<source lang=asm>
add r0, #23
</source>

Such immediate arguments can not be arbitrarily large, because they need to fit inside the instruction, which is 16 or 32 bit in size and also needs some room for the instruction and register numbers as well. So, if you want to add a large number, you have to use “ldr” first as shown to load it into a register.
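
The two cases can be compared in a short fragment. The number 123456789 is used here only as an example of a value that does not fit into the instruction; writing it directly as an immediate argument of “add” is expected to be rejected by the assembler with an error:
<source lang=asm>
add r0, #255 @ Small constant: encoded directly in the instruction

ldr r1, =123456789 @ Large constant: load it into a register first ...
add r0, r1 @ ... and then add the register
</source>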

Try out the above examples and use GDB to examine their behavior. Use GDB’s “info reg” command to display the register contents. Don't forget to execute both the “arm-none-eabi-as” and “arm-none-eabi-ld” commands to translate the program.

=== Reading periphery registers ===
The last example works, but has a flaw: Even though only a few bits per register need to be modified, the code overwrites all the bits in the register at once. The bits that should not be modified are just overwritten with their respective default value. If some of those bits had been changed before – for example to enable some other periphery module – these changes would be lost. Keeping track of the state of the register throughout the program is hardly practical. Since ARM does not permit modifying individual bits, the solution is to read the whole register, modify the bits as needed, and write the result back. This is called a “read-modify-write” cycle.

Reading registers is done via the “ldr” instruction. As with “str”, the address needs to be written into a processor register beforehand, and the instruction stores the read data into a processor register as well. Starting with the “RCC_APB2ENR” register, you can read it via:
<source lang=asm>
ldr r1, =0x40021018
ldr r0, [r1]
</source>
Even though the two “ldr” instructions look similar, they work differently – the first one loads a fixed value into a register (r1), while the second loads data from the periphery register into r0.

The loaded value should then be modified by setting bit two to “1”. This can be done with the “orr” instruction:
<source lang=asm>
orr r0, r0, #4
</source>
After that, we can store r0 as before.

With the GPIOA_CRH register, it’s slightly more complicated: The bits 0, 2 and 3 need to be cleared, while bit 1 needs to be set to 1. The other bits (4-31) need to keep their value. To clear the bits, use the “and” instruction after loading the current periphery register value:

<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
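
As an alternative sketch, the ARM instruction set also offers the “bic” (“bit clear”) instruction, which clears exactly those bits that are set in its last operand. It is not used in the examples of this tutorial section, but it lets the mask be written in terms of the bits to clear rather than the bits to keep:
<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
bic r0, #0xf @ Clear bits 0-3, same effect as "and" with 0xfffffff0
orr r0, #2 @ Set bit 1
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>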

For the “GPIOx_ODR” registers, such tricks are not needed, as there is a special “GPIOx_BSRR” register which simplifies writing individual bits: This register can not be read, and writing zeroes to any bit has no effect on the GPIO state. However, if a 1 is written to any of the bits 0-15, the corresponding GPIO pin is set to high (i.e. the corresponding bit in ODR set to 1). If any of the bits 16-31 is written to 1, the corresponding pin is set to low. So, the pin can be set to 1 like this:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

So, the modified program is:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “SetPin2”
=== Jump instructions ===
For a traditional “hello world” experience, the LED should not only light up, but blink, i.e. turn on and off repeatedly. Setting pin PA8 to low level can be achieved by writing a 1 to bit 24 in the “GPIOA_BSRR” register:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1]
</source>

By pasting this after the instructions for turning on the LED, the LED will be turned on and off again. To get the LED to blink, those two blocks need to be repeated endlessly, i.e. at the end of the code there needs to be an instruction for jumping back to the beginning.

A simple endless loop was already explained: The “b .” instruction, which just executes itself repeatedly. To have it jump somewhere else, the dot needs to be replaced with the desired target address, for example:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b 0x8000104
</source>
Example name: “Blink”

The address specified is an absolute address, which is the address of the “ldr” instruction at the beginning of the block for setting the pin to high. Actually, the branch instruction “b” is not capable of jumping directly to such an absolute address - again, because a 32 bit wide address can't be encoded in a 16/32 bit wide instruction. Instead, the assembler calculates the distance between the jump target and the location of the “b” instruction, and stores it into the instruction. When jumping backwards, this distance is negative.

When executing program code, the processor always stores the address of the currently executed instruction plus four in the r15 register, which is therefore also known as PC, the program counter. When encountering a “b” instruction, the processor adds the contained distance value to the PC value to calculate the absolute address of the jump target before jumping there.

This means that “b” performs a relative jump, and even if the whole machine code section were moved somewhere else in memory, the code would still work. However, the assembly language syntax does not really represent this, as the assembler expects absolute addresses which it then transforms into relative ones.

Specifying the target address directly as shown is very impractical, as it has to be calculated manually, and if the section of code is moved or modified, the address needs to be changed. To rectify this, the assembler supports labels: You can assign a name to a certain code location, and use this name to refer to the code location instead of specifying the address as a number. A label is defined by writing its name followed by a colon:

<source lang=asm>
BlinkLoop:

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b BlinkLoop
</source>
Example name: “Blink2”

This is purely a feature of the assembler – the generated machine code will be identical to the previous example. In “b BlinkLoop”, the assembler substitutes the label for the address it represents to calculate the relative jump distance. The assembler actually provides no way of directly specifying the relative offset that will be encoded in the instruction, but it can be done like this:

<source lang=asm>
b (.+4+42*2)
</source>

The resulting instruction will contain “42” as the jump offset. As suggested by the syntax, the processor multiplies this number by 2 (since instructions can only reside at even memory addresses, it would waste one bit of memory to specify the number directly) and adds to it the address of the “b” instruction plus 4. The assembly syntax is designed to represent the end result of the operation, so the assembler reverses the peculiar pre-calculations of the processor. If you want to do this calculation yourself, you have to again undo the assembler’s own calculation with the expression shown above. There is usually no reason to do that, though.
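
The arithmetic can be followed with concrete numbers. In the following sketch, the address 0x8000120 is purely hypothetical and only assumed for the sake of the calculation; each of the two instructions is considered on its own, as if it were located at that address:
<source lang=asm>
@ Forward jump, instruction assumed to be at address 0x8000120:
@ PC value = 0x8000120 + 4 = 0x8000124
@ Encoded offset 42, multiplied by 2 = 84 = 0x54
@ Jump target = 0x8000124 + 0x54 = 0x8000178
b (.+4+42*2)

@ Backward jump, instruction again assumed to be at address 0x8000120:
@ PC value = 0x8000120 + 4 = 0x8000124
@ Encoded offset -16, multiplied by 2 = -32 = -0x20
@ Jump target = 0x8000124 - 0x20 = 0x8000104
b (.+4-16*2)
</source>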

=== Counting Loops ===
The above example for a blinking LED does not really work yet – the LED blinks so fast the human eye can’t see it. The LED will just appear slightly dim. To achieve a proper blinking frequency, the code needs to be slowed down. The easiest way to do that is to have the processor execute a large number of “dummy” instructions between setting the pin high and low. Simply placing many “nop” instructions isn’t possible though, as there is simply not enough program memory to store all of them. The solution is a loop that executes the same instructions a specific number of times (as opposed to the endless loops from the examples above). To do that, the processor has to count the number of loop iterations. It is actually easier to count ''down'' than up, so start by loading the desired number of iterations into a register and begin the loop by subtracting “1”:

<source lang=asm>
ldr r2, =1000000
subs r2, #1
</source>

Now, the processor should make a decision: If the register has reached zero, terminate the loop; else, continue by again subtracting “1”. The ARM math instructions can automatically perform some tests on the result to check whether it is positive/negative or zero and whether an overflow occurred. To enable those checks, append an “s” to the instruction name – hence, “subs” instead of “sub”. The result of these checks is automatically stored in the “Application Program Status Register” (APSR) – the contained bits N, Z, C, V indicate whether the result was negative, zero, set the carry bit or caused an overflow. This register is usually not accessed directly. Instead, use the conditional variant of the “b” instruction, where two letters are appended to indicate the desired condition. The jump is only performed if the condition is met; otherwise, the instruction does nothing. The available condition codes are described in the chapter “Condition Codes” of this tutorial. The conditions are formulated in terms of the mentioned bits of the APSR. For example, the “bne” instruction only performs a jump if the zero (Z) flag is ''not'' set, i.e. when the result of the last math instruction (with an “s” appended) was ''not'' zero. The “beq” instruction is the opposite of that – it only performs a jump if the result ''was'' zero.
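
The interplay of the “s” suffix and the conditional jumps can be observed in a small fragment, which is meant as an illustration only; the labels “skip” and “done” and the values 111 and 222 are arbitrary. Stepping through it with GDB’s “stepi” should show that both “ldr r1” instructions are jumped over:
<source lang=asm>
ldr r0, =2
subs r0, #1 @ r0 is now 1: result is not zero, Z flag is cleared
bne skip @ Jump is performed, because the result was not zero
ldr r1, =111 @ Skipped
skip:
subs r0, #1 @ r0 is now 0: result is zero, Z flag is set
beq done @ Jump is performed, because the result was zero
ldr r1, =222 @ Skipped
done:
b .
</source>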

So, to perform the jump back to the beginning of the loop, add a label before the “subs” instruction, and put a “bne” instruction after the “subs” that jumps to this label if the counter has not reached zero yet:
<source lang=asm>
ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop
</source>

The actual loop consists only of the two instructions “subs” and “bne”. By placing two of those loops (with two different labels!) in between the blocks that turn the pin on and off, the blink frequency is lowered sufficiently such that it becomes visible:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

BlinkLoop:
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r2, =1000000
delay2:
subs r2, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay”

You might notice that the registers r0-r2 are loaded with the same values over and over again. To make the code both shorter and faster, take advantage of the available processor registers, and load the values that don’t change ''before'' the loop. Then, just use them inside the loop:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay2”

=== Using RAM ===
Until now, all data in the example codes was stored in periphery or processor registers. In all but the simplest programs, larger amounts of data have to be processed for which the thirteen general-purpose processor registers aren’t enough. For this, the microcontroller features a block of SRAM that stores 20 KiB of data. Accessing data in RAM works similarly to accessing periphery registers – load the address in a processor register and use “ldr” and “str” to read and write the data. After reset, the RAM contains just random ones and zeroes, so before the first read access, some value has to be stored.

As the programmer decides what data to place where, they have to keep track of which address in memory contains what piece of data. You can use the assembler to help keep track by declaring what kind of memory blocks you need and giving them names. To do this, you must first tell the assembler that the next directives refer to data instead of instructions with the “.data” directive. Then, use the “.space” directive for each block of memory you need. To assign names to the blocks, place a label definition (using a colon) right ''before'' that. After the definitions, put a “.text” directive to make sure the instructions after that will properly go to program memory (flash):

<source lang=asm>
.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text
@ Instructions go here...
</source>

Here, a data block of 4 bytes is reserved and named “var1”. Another block of 1 byte is named “var2”. Note that just inserting these lines will not modify the assembler output – these are just instructions to the assembler itself. To access these memory blocks, you can use “var1” and “var2” just like literal addresses. Load them into registers and use these with “ldr” and “str” like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text

ldr r0, =var1 @ Get address of var1
ldr r1, =0x12345678
str r1, [r0] @ Store 0x12345678 into memory block “var1”

ldr r1, [r0] @ Read memory block “var1”
and r1, #0xFF @ Set bits 8..31 to zero
ldr r0, =var2 @ Get address of var2
strb r1, [r0] @ Store a single byte into var2

b .
</source>
Example name: “RAMVariables”

Note the use of “strb” - it works similarly to “str”, but only stores a single byte. Since the processor register r1 is of course 32bit in size, only the lower 8 bits are stored, and the rest is ignored.
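
Reading smaller units works analogously. The ARM instruction set provides “ldrb” for loading a single byte and “ldrh” for loading a half-word; these are not used in the example above, so the following fragment is only a sketch of how they could be appended to it before the final “b .”. The values in the comments assume that the stores of the example have been executed and that the controller uses the little-endian byte order, as the STM32 does:
<source lang=asm>
ldr r0, =var1 @ Get address of var1
ldrh r2, [r0] @ Read the lower 16 bits of var1: r2 = 0x5678

ldr r0, =var2 @ Get address of var2
ldrb r3, [r0] @ Read the single byte var2: r3 = 0x78, bits 8..31 are zero
</source>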

There is still something missing – nowhere in the code is there any address of the RAM. To tell the linker where the RAM is located, pass the option <code>-Tdata=0x20000000</code> to the <code>arm-none-eabi-ld</code> call; this is the address of the first byte of RAM. With the file names from the first example, the call would be e.g. <code>arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000 -Tdata=0x20000000</code>. This program can't be flashed directly with OpenOCD, as OpenOCD doesn't recognize the RAM as such; GDB has to be used as explained above. When a linker script is used as described in the next chapters (using the NOLOAD attribute), OpenOCD can again be used directly.

If you run this program via GDB, you can use the commands <code>x/1xw &var1</code> and <code>x/1xb &var2</code> to read the data stored in memory. After this quick introduction, a more abstract overview is in order.

== Memory Management ==
If there is one thing that sets higher and lower level programming languages apart, it’s probably memory management. Assembly programmers have to think about memory, addresses, layout of program and data structures all the time. Assembler and linker provide some help which needs to be used effectively. Therefore, this chapter will explain some more fundamentals of the ARM architecture and how the toolchain works.

=== Address space ===
In the examples so far, addresses were used for periphery register accesses and jump instructions without really explaining what they mean, so it’s time to catch up with that. To access periphery registers and memory locations in any memory type (RAM, Flash, EEPROM…), an address is required, which identifies the desired location. On most platforms, addresses are simply unsigned integers. The set of all possible addresses that can be accessed in a uniform way is called an “address space”. Some platforms such as AVR have multiple address spaces (for Flash, EEPROM, and RAM+periphery) where each memory needs to be accessed in a distinct way and the programmer needs to know which address space an address belongs to – e.g. all three memory types have a memory location with address 123.

However, the ARM architecture uses only a single large address space where addresses are 32bit unsigned integers in the range of 0-4294967295. Each address refers to one byte of 8 bits. The address space is divided into several smaller ranges, each of which refers to a specific type of memory. For the STM32F103, this is documented in the datasheet in chapter 4. All addresses in all memory types are accessed in the same way – directly via the “ldr” and “str” instructions, or by executing code from a certain location, which can be achieved by jumping to the respective address with the “b” instruction. This also makes it possible to execute from RAM – simply perform a jump to an address that refers to some code located in RAM. Note that there are large gaps between the individual ranges in address space; attempting to access those usually leads to a crash.
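
The uniform access can be tried out with a short fragment. It is a sketch that assumes one of the example programs from the previous chapter is flashed, so that the first word in flash memory (at 0x08000000) is 0x20000400, and that the RAM starts at 0x20000000 as on the STM32F103:
<source lang=asm>
ldr r1, =0x08000000 @ First address of the flash memory
ldr r0, [r1] @ Read from flash: first word of the program, 0x20000400

ldr r1, =0x20000000 @ First address of the RAM
str r0, [r1] @ Write the same word to RAM
ldr r2, [r1] @ Read it back from RAM: r2 is now equal to r0
</source>
The same “ldr” instruction is used for both memory types; only the address decides which memory is accessed.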

While the addresses of periphery are fixed and defined by the manufacturer, the layout of program code and data in memory can be set by the programmer rather freely. Up until now, the example programs defined the flash memory contents in a linear fashion by listing the instructions in the order they should appear in flash memory. However, when translating multiple assembly source files into one program, the order in which the contents from those files appear in the final program isn’t defined a priori. Also, even though in the last example the memory blocks for RAM were defined ''before'' the code, the code actually comes first in address space. What makes all this work is the Linker.

=== The Linker ===
Usually the last step in translating source code into a usable program, the linker is an often overlooked, sometimes misunderstood but important and useful tool, if applied correctly. Many introductions to programming forego explaining its workings in detail, but as in any trade, embedded development requires mastery of the tools! A good understanding of the linker can save time solving strange errors and allow you to implement some less common use cases, such as using multiple RAM blocks present in some microcontrollers, executing code from RAM or defining complex memory layouts as sometimes required by RTOSes.

[[File:ArmAsmTutorial_Linking.svg|300px|thumb|right|Translation of native applications using assembler, compiler and linker]]You have already used a linker – the command <code>arm-none-eabi-ld</code> calls the GNU linker that is shipped with the GNU toolchain. Until now, only one assembly source file was translated for each program. To translate a larger program that consists of three assembly files “file1.S”, “file2.S” and “file3.S”, the assembler would be called three times to produce three object code files “file1.o”, “file2.o” and “file3.o”. The linker would then be called to combine all three into a single output file.

When translating any of these assembly files, the assembler does not know of the existence of the other files. Therefore, it can’t know whether the contents of any other file will end up in flash memory before the currently processed file, and also can’t know the final location in flash memory of the machine code it is emitting and placing in the object file (ending .o). This means that the object file does not contain any absolute addresses (except for those of periphery registers, as these were specified explicitly). For example, when loading the address of the RAM data blocks (“ldr r0, =var1”) the assembler doesn’t know the address, only the linker does. Therefore, the assembler puts a placeholder in the object file that will be overwritten by the linker. A jump (“b” instruction) to a label defined in another assembly file works similarly; the assembler uses a placeholder for the address. For the jump instructions we used inside the same file (e.g. “b BlinkLoop”), a placeholder is not necessary, as the assembler can calculate the distance between the label and the instruction and generate the relative jump itself. However, if the target resides within a different section (see below), this isn’t possible, and a placeholder becomes necessary. As the contents of object files have no fixed address and can be moved around by the linker, these files are called relocatable.
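
To illustrate the difference, here is a minimal sketch that contains one reference the assembler has to leave to the linker and one it can resolve on its own. The labels “var1” and “Loop” are chosen only for this illustration.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.data
var1:
.space 4 @ 4 bytes in RAM; the final address is assigned by the linker

.text
Loop:
ldr r0, =var1 @ Address of var1 is unknown to the assembler: placeholder in the object file
ldr r1, [r0]
b Loop @ Same file and same section: the distance is known, so the relative jump is encoded directly
</source>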

On Unix Systems (including Linux), the Executable and Linkable Format (ELF) is used for both object files and executable program files. This format is also used by ARM, and the GNU ARM toolchain. Because it was originally intended to be used with operating systems, some of its concepts don’t perfectly map to the embedded use case. The object (.o) files created by the assembler and the final program created by the linker (usually no ending, but in embedded contexts and also in above example commands, .elf is used) are all in ELF format. The specification of ELF for ARM can be found [https://developer.arm.com/architectures/system-architectures/software-standards/abi here], and the generic specification for ELF on which the ARM ELF variant is based can be found [http://www.sco.com/developers/gabi/2003-12-17/contents.html here].

ELF files are structured into sections. Each section may contain code, data, debug information (used by GDB) and other things. In an object file, the sections have no fixed address. In the final program file, they have one. Sections also have various attributes that indicate whether their contents are executable code or data, whether they are read-only and whether memory should be allocated for them. The linker combines and reorders the sections from the object files (“input sections”) and places them into sections in the final program file (“output sections”) while assigning them absolute addresses.

Symbols are another important aspect. A symbol defines a name for an address. The address of a symbol may be defined as an absolute number (e.g. 0x08000130) or as an offset relative to the beginning of a section (e.g. “start address of section .text plus 0x130”). Labels defined in assembly source code define symbols in the resulting object file. For example, the “var1” label defined in the last example results in a symbol “var1” in the “prog1.o” file whose address is set to be equal to the beginning of “.data”. The symbol “var2” is defined similarly, but with an offset of 4. After the linking process, the “prog1.elf” file contains a “.data” section with absolute address 0x20000000, and so the “var1” and “var2” symbols get absolute addresses as well.

As mentioned, the assembler puts placeholders in the object files when it doesn’t know the address of something. In ELF files, these placeholders are called “relocation entries” and they reference symbols by name. When the linker sees such a relocation entry in one of its input files, it searches for a symbol in the input files with a matching name and fills in its address. If no symbol with that name was found, it emits this dreaded error:

<pre>(.text+0x132): undefined reference to `Foo'</pre>

Google finds almost a million results for that message, but knowing how the linker operates makes it easy to understand and solve – since the symbol was not found in any object file, make sure it is spelled correctly and that the object file that contains it is actually fed to the linker.
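
A minimal way to provoke this error is sketched below, assuming that no file passed to the linker defines a label named “Foo” (a name made up for this illustration). The assembler accepts the file without complaint, because it assumes that the symbol will be provided by some other file; only the linker notices that it is missing.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

b Foo @ Foo is not defined in this file: the assembler emits a relocation entry for it
</source>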

=== Linker Scripts ===
A linker script is a text file written in a linker-specific language that controls how the linker maps input sections to output sections. The example project hasn’t explicitly specified one yet, which lets the linker use a built-in default one. This has worked so far, but results in a slightly mixed up program file (unsuitable symbols) and has some other disadvantages. Therefore, it’s time to do things properly and write a linker script. Linker scripts aren’t usually created on a per-project basis, but are typically provided by the microcontroller manufacturer to fit a certain controller’s memory layout. To learn how they work, a quick introduction into writing one will follow. The full documentation can be found [https://sourceware.org/binutils/docs/ld/Scripts.html here].

It’s customary to name the linker script after the controller it is intended for, so create a text file “stm32f103rb.ld” or “stm32f103c8.ld” with the following contents:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
*(.data)
} >SRAM
}
</source>
Example name: “LinkerScriptSimple”

This is the minimum viable linker script for a microcontroller. If you are using a STM32F103C8, replace the 128K by 64K. The lines inside the “MEMORY” block define the available memory regions on your microcontroller by specifying their start address and size within the address space. The names “FLASH” and “SRAM” can be chosen arbitrarily, as they have no special meaning. This memory definition has no meaning outside of the linker script, as it is just an internal helper for writing the script; it can even be left out and replaced by some manual address calculations.

The interesting part happens inside the “SECTIONS” command. Each sub-entry defines an output section that will end up in the final program file. These can be named arbitrarily, but the names “.text” and “.data” for executable code and data storage respectively are usually used. The asterisk expressions “*(.text)” and “*(.data)” tell the linker to put the contents of the input sections “.text” and “.data” at that place in the output section. In this case, the names for the input sections and output sections are identical. The input section names “.data”, “.text” (and some more) are used by the assembler and C and C++ compilers by default, so even though they can be changed, it’s best to keep them. You can however name the output sections arbitrarily, for example:

<source>
SECTIONS {
.FlashText : {
*(.text)
} >FLASH

.RamData (NOLOAD) : {
*(.data)
} >SRAM
}
</source>

The commands “>FLASH” and “>SRAM” tell the linker to calculate the address of the output sections according to the respective memory declaration above: The first output section with a “>FLASH” command will end up at address 0x8000000, the next with “>FLASH” right after that section and so on. The “>SRAM” works the same way with the start address “0x20000000”. The “NOLOAD” attribute does not change the linker’s behavior, but marks the corresponding output section as “not-loadable”, such that OpenOCD and GDB will not attempt to write it into RAM – the program has to take care of initializing any RAM data anyways when running stand-alone.

To specify the filename of the linker script, use the “-T” option:
<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -T stm32f103rb.ld
</source>
The <code>-Tdata</code> and <code>-Ttext</code> aren’t needed anymore, as the addresses are now defined in the linker script.

Since the linker script defines the sizes of the memory regions, the linker can now warn you when your program consumes too much memory (either flash or RAM):
<pre>arm-none-eabi-ld: prog1.elf section `.text' will not fit in region `FLASH'
arm-none-eabi-ld: region `FLASH' overflowed by 69244 bytes</pre>

==== Reserving memory blocks ====
Using the processor’s stack will be explained later, but you can already use the linker script to assign a memory block for it. It’s best to allocate memory for the stack at the ''beginning'' of SRAM, so put this before the “*(.data)” command:

<source>. = . + 0x400;</source>

Inside a linker script, the dot “.” refers to the current address in the output file; therefore, this command increments the address by 0x400, leaving an “empty” block of that size. The “.data” input section will be located after that, at address 0x20000400.

==== Defining symbols in linker scripts ====

As mentioned before, the controller requires a certain data structure called the “vector table” to reside at the very beginning of flash memory. It is defined in the assembler source file:

<source lang=asm>
.word 0x20000400
.word 0x080000ed
.space 0xe4
</source>

The “.word” directive tells the assembler to output the given 32bit-number. Just like processor instructions, these numbers are put into the current section (.text by default, .data if specified) and therefore end up in flash memory. The first 32bit-number, which occupies the first 4 bytes in flash memory, is the initial value of the stack pointer which will be explained later. This number should be equal to the address of the first byte ''after'' the memory block that was reserved for the stack. The reserved block starts at address 0x20000000 and has size 0x400, so the correct number is 0x20000400. However, if the size of the reserved block was modified in the linker script, the above assembly line needs to be adjusted as well. To avoid any inconsistencies, and to be able to manage everything related to the memory-layout centrally in the linker script, it is desirable to replace the number in the assembly source file with a symbol expression. To do this, define a symbol in the linker script:

<source>
.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
</source>
Example name: “LinkerScriptSymbols”

This will define a symbol “_StackEnd” to have the value of “.”, which is the current address, which at this point is 0x20000400. In the assembly source file, you can now replace the number with the symbol:

<source lang=asm>
.word _StackEnd
</source>

The assembler will put a placeholder in the object file, which the linker will overwrite with the value of 0x20000400. This modification will not change the output file, but avoids putting absolute addresses in source files. The name “_StackEnd” was chosen arbitrarily; since names that start with an underscore and a capital letter may not be used in C and C++ programs, there is no possibility of conflict if any C/C++ source is added later. Typically, all symbols that are part of the runtime environment and should be “invisible” to C/C++ code are named this way. The same rule applies to names starting with two underscores.

The second entry of the vector table is the address of the very first instruction to be executed after reset. Currently the address is hard-coded as the first address after the vector table. If you wanted to insert some other code before this first instruction, this number would have to be changed. This is obviously impractical, and therefore the number should be replaced by a label as well. Since the code executed at reset is commonly known as the “reset handler”, define it like that:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word _StackEnd
.word Reset_Handler
.space 0xe4

.type Reset_Handler, %function
Reset_Handler:

@ Put code here
</source>

The “.type” directive tells the assembler that the label refers to executable code. The exact meaning of this will be covered later. Leave the “.space” directive alone for now.

==== Absolute section placement ====
The vector table needs to be at the beginning of flash memory, and the examples have relied on the assembler putting the first things from the source file into flash memory first. This stops working if you use multiple source files. You can use the linker script to make sure the vector table is always at the beginning of flash memory. To do that, you first have to separate the vector table from the rest of the code so that the linker can handle it specially. This is done by placing the vector table in its own section:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:
</source>
Example name: “LinkerScriptAbsolutePlacement”

The “.section” directive instructs the assembler to put the following data into the custom section “.VectorTable”. The “a” flag marks this section as allocatable, which is required to have the linker allocate memory for it. To place the vector table at the beginning of flash memory, define a new output section in the linker script:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

This puts the .VectorTable input section into the equally-named output section. It is also possible to put it into .text alongside the code:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.VectorTable)
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

Even though both variants produce the same flash image, the first one is slightly nicer to work with in GDB. The modified LED-blinker application now looks like:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>

=== Program Structure ===
Because the vector table is usually the same for all projects, it is handy to move it into a separate file, for example called “vectortable.S”:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Assemble and link this source code with two assembler commands followed by one linker command:
<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
arm-none-eabi-as -g vectortable.S -o vectortable.o
arm-none-eabi-ld prog1.o vectortable.o -o prog1.elf -T stm32f103rb.ld
</source>

This will result in the dreaded “undefined reference” error. To fix this, use the “.global” directive in the main source file “prog1.S”:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
@ Code here ...
</source>

This will tell the assembler to make the symbol “Reset_Handler” visible globally, such that it can be used from other files. By default, the assembler creates a ''local'' symbol for each label, which can’t be used from other source files (same as ''static'' in C). The symbol is still there in the final program file, though - it can be used for debugging purposes.
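
The same mechanism works for data labels. As a minimal sketch, assume a hypothetical additional source file “counter.S” that defines a 4-byte variable named “Counter” (both names are made up for this illustration); its object file has to be passed to the linker as well.

<source lang=asm>
@ File counter.S (hypothetical)
.syntax unified
.cpu cortex-m3
.thumb

.data
.global Counter @ Make the symbol visible to other files
Counter:
.space 4

@ File prog1.S
.text
ldr r0, =Counter @ Resolved by the linker, since Counter is a global symbol
ldr r1, [r0]
</source>

Without the “.global” line, linking would fail with an “undefined reference” to “Counter”, just as it did for “Reset_Handler”.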

== More assembly techniques ==
After having set up the project for using the linker properly, some more aspects of assembly programming will be introduced.

=== Instruction set state ===
As mentioned before, ARM application processors support both the T32 and A32/A64 “ARM” instruction sets, and are capable of dynamically switching between them. This can be used to encode time-critical program parts in the faster A32/A64 instruction set, and less critical parts in the T32 “thumb” instruction set to save memory. Actually, reducing program size may improve performance too, because the cache memories may become more effective.

Even though the Cortex-M microcontrollers based on the ARMv7-M architecture do not support the A32/A64 instruction sets, some of the switching-logic is still there, requiring the program code to work accordingly. The switch between the instruction sets happens when jumping with the “bx” “Branch and Exchange” and “blx” “Branch with Link and Exchange” instructions. Since all instructions are of size 2 or 4, and code may only be stored at even addresses, the lowest bit of the address of any instruction is always zero. When performing a jump with “bx” or “blx”, the lowest bit of the target address is used to indicate the instruction set of the jump target: If the bit is “1”, the processor expects the code to be T32, else A32.

Another specialty of the “bx” and “blx” instructions is that they take the jump target address from a register instead of encoding it in the instruction directly. This is called an indirect jump. An example of such a jump is:

<source lang=asm>
ldr r0, =SomeLabel
bx r0
</source>

Such indirect jumps are necessary if the difference of the jump target address and the jump instruction is too large to be encoded in the instruction itself for a relative jump. Also, sometimes you want to jump to an address that has been passed from another part of the program, which e.g. happens in C/C++ code when using function pointers or virtual functions.

In these cases, you need to make sure that the address passed to “bx/blx” via a register has its lowest bit set, to indicate that the target code is T32. Otherwise, the code will crash. This can be achieved by telling the assembler that the target label refers to code (and not data) via the already mentioned “.type” directive:

<source lang=asm>
.type SomeLabel, %function
SomeLabel:
@ Some code...
</source>

That way, when you refer to the label to load its address into a register, the lowest bit will be set. Actually, using “.type” for all code labels is a good idea, even though it does not matter if you only refer to a label via the “b” instruction (including the conditional variant) which does not encode the lowest bit and does not attempt to perform an instruction set switch.
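
The following sketch shows how this plays out when a code address is stored in memory and used later, similar to a function pointer in C. All label names (“Dispatch”, “HandlerA”, “HandlerB”, “HandlerTable”) are made up for this illustration. Because the handlers are declared with “.type”, the addresses stored via “.word” have their lowest bit set, so the “bx” works as intended.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Dispatch, %function
Dispatch:
ldr r1, =HandlerTable @ Address of the table itself (data, lowest bit is 0)
ldr r0, [r1] @ Load first table entry: address of HandlerA with lowest bit set
bx r0 @ Indirect jump; lowest bit = 1 indicates T32 code

.type HandlerA, %function
HandlerA:
b HandlerA

.type HandlerB, %function
HandlerB:
b HandlerB

.balign 4 @ Make sure the table starts at a multiple of 4
HandlerTable:
.word HandlerA
.word HandlerB
</source>

If the “.type” lines for the handlers were omitted, the table would contain even addresses and the “bx” would attempt a switch to A32, which fails on a Cortex-M.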

As was already shown, there is another case where the lowest bit matters: when specifying the address of the reset handler (and later, exception handler functions) in the vector table, the bit must be set, so the “.type” directive is necessary here too:

<source lang=asm>
.type Reset_Handler, %function
</source>

If you were writing code for a Cortex-A processor, you would use “.arm” instead of “.thumb” to have your code (or performance critical parts of it) encoded as A32. The “.type” directive would be used as well, and the assembler would clear the lowest bit in the address to ensure the code is executed as A32. For example:

<source lang=asm>
.cpu cortex-a8
.syntax unified

@ Small but slower code here
.thumb

.type Block1, %function
Block1:
ldr r0, =Block2
bx r0

@ Larger but faster code here
.arm

.type Block2, %function
Block2:
@ ...
</source>

The directive “.code 32” has the same meaning as “.arm”, and “.code 16” the same as “.thumb” (although the name is slightly misleading, as T32 instructions can be 32 bit as well). There is also “.type Label, %object” to declare some label refers to data in flash or RAM; this is optional, but helps in working with analysis tools (see below).

=== Constants ===
The previous examples contain a lot of numbers (esp. addresses), the meaning of which is not obvious to the reader - so called “magic numbers”. As code is typically read many times more than written/modified, readability is important, even for assembly code. Therefore, it is common practice to define constants that assign names to numbers such as addresses, and use names instead of the number directly.

The assembler actually does not provide any dedicated mechanism for defining constants. Instead, symbols as introduced before are used. You can define a symbol in any of the following ways:
<source lang=asm>
RCC_APB2ENR = 0x40021018
.set GPIOA_CRH, 0x40010804
.equ GPIOA_ODR, 0x4001080C
</source>

and then use it in place of the number:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

Replacing (almost) all numbers in the source code for the LED blinker by constants yields a source code like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_10MHz = 1
GPIOx_CRx_GP_PP_2MHz = 2
GPIOx_CRx_GP_PP_50MHz = 3

GPIOx_CRx_GP_OD_10MHz = 1|4
GPIOx_CRx_GP_OD_2MHz = 2|4
GPIOx_CRx_GP_OD_50MHz = 3|4

GPIOx_CRx_AF_PP_10MHz = 1|8
GPIOx_CRx_AF_PP_2MHz = 2|8
GPIOx_CRx_AF_PP_50MHz = 3|8

GPIOx_CRx_AF_OD_10MHz = 1|4|8
GPIOx_CRx_AF_OD_2MHz = 2|4|8
GPIOx_CRx_AF_OD_50MHz = 3|4|8

GPIOx_CRx_IN_ANLG = 0
GPIOx_CRx_IN_FLOAT = 4
GPIOx_CRx_IN_PULL = 8

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkConstants”

This is much more readable than before. In fact, you could even leave out the comments, as the code becomes more self-documenting. The addresses of periphery registers are defined individually, but the bits for the GPIO registers are the same for each GPIO module, so the names include an “x” to denote that they apply to all GPIO modules.

The “CRL”/“CRH” registers get a special treatment. Since the individual bits have little direct meaning, it would be pointless to name them. Instead, 15 symbols are defined to denote the 15 possible modes of operation per pin (combinations of input/output, open-drain vs. push-pull, analog vs. digital, floating vs. pull-resistors, and output driver slew rate). Each of the 15 symbols has a 4 bit value that needs to be written into the appropriate 4 bits of the register. To configure e.g. PA10 as General Purpose Open-Drain with 10 MHz slew rate:

<source lang=asm>
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffff0ff
orr r0, #(GPIOx_CRx_GP_OD_10MHz<<8)
str r0, [r1]
</source>

C-like arithmetic operators can be used in constant expressions, like + - * / and bitwise operators like | (or), & (and), << (left shift) and >> (right shift). Note that these calculations are always done by the assembler. In the example, or | is used to combine bit values.
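
As a sketch of how such expressions can remove further magic numbers, the shift and mask for a pin can be derived from the pin number. The symbols “PinNumber”, “PinShift” and “PinMask” are made up for this illustration, and the “bic” (bit clear) instruction is used in place of “and” with an inverted mask. Parentheses are used throughout, since operator precedence in the GNU assembler is not identical to C.

<source lang=asm>
GPIOA_CRH = 0x40010804
GPIOx_CRx_GP_OD_10MHz = 1|4

PinNumber = 10 @ PA10; the CRH register covers pins 8-15
PinShift = ((PinNumber - 8) * 4) @ 4 configuration bits per pin, so the shift is 8
PinMask = (0xf << PinShift) @ Evaluates to 0xf00

ldr r1, =GPIOA_CRH
ldr r0, [r1]
bic r0, r0, #PinMask @ Clear the 4 configuration bits of PA10
orr r0, r0, #(GPIOx_CRx_GP_OD_10MHz << PinShift)
str r0, [r1]
</source>

To configure a different pin in the range 8-15, only “PinNumber” needs to be changed.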

Since these constants are actually symbols, they can collide with assembler labels, so you must not define a symbol with the same name as any label.

A different kind of constants are register aliases. Using the “.req” directive, you can define a name for a processor register:

<source lang=asm>
MyData .req r7
ldr MyData, =123
add MyData, #3
</source>

This can be useful for large assembly blocks where the meaning of register data is not obvious. It also allows you to re-assign registers without having to modify many lines of code.
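
An alias can be removed again with the “.unreq” directive, which is useful if the same name should refer to a different register in another part of the file. A minimal sketch, with the names “Counter” and “CountLoop” made up for this illustration:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

Counter .req r4 @ Counter is now another name for r4
ldr Counter, =1000000
CountLoop:
subs Counter, #1
bne CountLoop
.unreq Counter @ Remove the alias; the name Counter may now be redefined
</source>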
=== The Stack ===
In computer science, a stack is a dynamic data structure where data can be added and removed flexibly. Like a stack of books, the last element that was put on top must be taken and removed first (LIFO-structure - Last In, First Out). Adding an element is usually called “push”, and reading & removing “pop”.

Many processor architectures including ARM feature circuitry to deal with such a structure efficiently. Like most others, ARM does not provide a dedicated memory area for this - it just facilitates using an area that the programmer reserved for this purpose as a stack. Therefore, a part of the SRAM needs to be reserved for the stack.

On ARM, the program stores processor registers on the stack, i.e. 32bit per element. The stack is commonly used when the contents of some register will be needed again later after it has been overwritten by some complex operation that needs many registers. These accesses always come in pairs:

* Some operation that writes to r0
* '''Push''' (save) r0 to the stack
* Some operation that overwrites r0
* '''Pop''' (restore) r0 from the stack
* Use the value in r0 which is the same as initially assigned

ARM’s instructions for accessing the stack are unsurprisingly called “push” and “pop”. They can save/restore any of the registers r0-r12 and r14, for example:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ … Some code that overwrites r0 …

pop { r0 } @ Restore value 1000000
@ Continue using r0 ...
</source>

It is also possible to save/restore multiple registers in one go:

<source lang=asm>
ldr r0, =1000000
ldr r1, =1234567
@ Use r0 and r1 ...
push { r0, r1 } @ Save values 1000000 and 1234567

@ … Some code that overwrites r0 and r1 …

pop { r0, r2 } @ Restore 1000000 into r0 and 1234567 into r2
@ Continue using r0 and r2...
</source>

It does not matter to which register the data is read back - in the previous example, the value that was held in r1 is restored into r2. In larger applications, many store-restore pairs will be nested:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ Inner Code Block:

ldr r0, =123
@ Use r0 …

push { r0 } @ Save value 123

@ Inner-Inner Code Block that overwrites r0

pop { r0 } @ Restore value 123
@ Continue using r0 ...

pop { r0 } @ Restore value 1000000 into r0

@ Continue using r0 …
</source>

The “inner” push-pop pair works with value 123, and the “outer” push-pop pair works with value 1000000. Assuming that the stack was empty at the beginning, it will contain 1000000 after the first “push”, and both 1000000 and 123 after the second push. After the first “pop” it contains only 1000000 again, and is empty after the second “pop”.

At the beginning of a push-pop pair, the current contents of the stack is irrelevant - it may be empty or contain many elements. After the “pop”, the stack will be restored to its previous state. This makes it possible to (almost) arbitrarily nest push-pop-pairs - after any inner push-pop-pair has completed, the stack is in the same state as before entering the inner pair, so the “pop” part of the outer pair doesn’t even notice the stack was manipulated in between. This is why it is important to make sure that each “push” has a matching “pop”, and vice-versa.

As mentioned, an area of memory has to be reserved for the stack. Access to the stack memory is managed via the stack pointer (SP). The stack pointer resides in the processor register r13, and “sp” is an alias for that. As the name implies, the stack pointer contains a 32bit memory address - specifically, the address of the first byte in the stack that contains any saved data.

When storing a 32bit register value using “push”, the stack pointer is '''first''' decremented by 4 before the value is written at the newly calculated address. To restore a value, the word at the address currently stored in the stack pointer is read from memory, after which the stack pointer is incremented by 4. This is called a “full-descending” stack (see the ARM Architecture Reference Manual, chapter B1.5.6). On ARMv7-A (Cortex-A), this behaviour can be changed, but on ARMv7-M, it is dictated by the exception handling logic, which will be explained later.
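
To make the order of operations explicit, the effect of a single-register “push” and “pop” can be sketched with ordinary instructions. This is only an illustration of the behaviour described above, not a recommendation to replace “push” and “pop”.

<source lang=asm>
@ Has the same effect as: push { r0 }
sub sp, #4 @ First decrement the stack pointer by 4 ...
str r0, [sp] @ ... then store r0 at the new address

@ Has the same effect as: pop { r0 }
ldr r0, [sp] @ First read the word at the address in the stack pointer ...
add sp, #4 @ ... then increment the stack pointer by 4
</source>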

An implication of this is that if the stack is empty, the stack pointer contains the address of the first byte '''after''' the stack memory area. If the stack is completely full, it contains the address of the very first byte '''inside''' the stack memory area. This means that the stack grows '''downward'''. Since the stack is empty at program start, the stack pointer therefore needs to be initialized to the first address after the memory area. Before executing the first instruction, the processor loads the first 4 bytes from the flash into the stack pointer. This is why “_StackEnd” was defined and used to place the address of the first byte after the stack memory region into the first 4 bytes of flash.

The stack pointer must always be a multiple of 4 (see chapter B5.1.3 in the ARM Architecture Reference Manual). It is a common error (which is even present in the example projects by ST!) to initialize the stack pointer to the last address ''inside'' the stack memory area (e.g. 0x200003FF instead of 0x20000400), which is not divisible by four. This can cause the application to crash or “just” slow it down. Actually, the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4127.html ARM ABI requires] the stack pointer to be a multiple of 8 for public software interfaces, which is important for e.g. the “printf” C function. So, when calling any external code, make sure the stack pointer is a multiple of 8.
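
One simple way to keep the stack pointer at a multiple of 8 is to always push an even number of registers. The sketch below assumes that the stack pointer was a multiple of 8 beforehand (as is the case for the initial value 0x20000400) and that only r0 actually needs to be saved; r1 is pushed merely as padding.

<source lang=asm>
push { r0, r1 } @ 2 registers = 8 bytes: the stack pointer stays a multiple of 8

@ ... code that calls external code ...

pop { r0, r1 } @ Restore both registers; the stack pointer is back at its old value
</source>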

In the previous examples, the stack memory area was defined with a size of 0x400, i.e. 1KiB. Choosing an appropriate stack size is critical for an application; if it is too small, the application will crash, if it is too large, memory is wasted that could be used otherwise. Traditionally, the stack is configured to reside at the ''end'' of available memory, e.g. 0x20005000 for the STM32F103. As the linker starts allocating memory for data (using “.data” in assembly or global/static variables in C) at the beginning of the memory, the stack is as far away from that regular data as possible, minimizing the chance of a collision. However, if the stack grows continuously, the stack pointer might end up pointing into the regular data area (“.data” or C globals) or heap memory (used by “malloc” in C). In that case, writing to the stack silently overwrites some of the regular data. This can result in all kinds of hard to find errors. Therefore, the example code puts the stack area at the ''beginning'' of RAM, and the regular data after that - if the stack grows too large, the stack pointer will reach values below 0x20000000, and any access will result in an immediate “clean” crash. It is then usually easy to find the code location that allocates too much stack memory, and possibly increase the stack size. Using the Cortex-M3’s memory protection unit (MPU) enables even more sophisticated strategies, but that is out of scope for this tutorial.
=== Function calls ===
Many programming languages feature a “function” concept. Also known as “procedures” or “subprograms”, functions are the most basic building blocks of larger applications, and applying them correctly is key for clean, reusable code. The assembler does not know about functions directly, so you have to build them yourself. A function is a block of code (i.e. a sequence of instructions) that you can jump to, that does some work, and that then jumps back to the place from which the first jump originated. This ability to jump back is the main difference from any other block of assembly code. To make this explicit, such a jump to a function is known as a “call” (as in “calling a function”). The location in code that starts the jump to the function is known as the “caller”, and the called function as the “callee”. From the perspective of the caller, calling a function resembles a “user-defined” instruction - it performs some operation after which the code of the caller continues as before. To make the jump back possible, the address of the ''next'' instruction after the one that started the function call needs to be saved, so that the function can jump back to that location (without calling the function directly again).

This is done via the Link Register (LR), which is the processor register r14. Function calls are performed with the “bl” instruction. This instruction performs a jump, much like the well-known “b”, but also saves the address of the next instruction in LR. When the function is finished, it returns to the caller by jumping to the address stored in LR. As already mentioned, jumping to a location from a register is called an indirect jump, which is performed by the “bx” instruction. So, to return from a function, use “bx lr”:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

bl EnableClockGPIOA @ Call function to enable GPIOA's peripheral clock

@ Some more code ...
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1]

b . @ Endless loop, so that execution does not run on into the function below

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

</source>

Here, the code to enable the clock for GPIOA was packaged into a function. To enable this clock, only a single line is now required - “bl EnableClockGPIOA”.

When calling a function, the “bl” instruction automatically makes sure to set the lowest bit in LR such that the subsequent “bx lr” will not crash because of an attempted instruction set switch, which is not possible on Cortex-M. If you need to call a function indirectly, use “blx” with a register, and remember to ensure that the lowest bit is set, typically via “.type YourFunction, %function”. Usually, all the code of an application resides within functions, with the possible exception of the Reset_Handler. The order in which functions are defined in the source files does not matter, as the linker will always automatically fill in the correct addresses. If you want to put functions in separate source files, remember to use “.global FunctionName” to make sure the symbol is visible to other files.
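
As a minimal sketch of such an indirect call, reusing the “EnableClockGPIOA” function from the example above (the choice of r4 as the register holding the address is arbitrary):

<source lang=asm>
ldr r4, =EnableClockGPIOA @ Load the function's address; due to ".type EnableClockGPIOA, %function", its lowest bit is set
blx r4 @ Jump to the address in r4 and save the return address in LR
</source>

The function itself does not need to change for this; it still returns via “bx lr”.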
==== Using the stack for functions ====
In large applications it is common for functions to call other functions in a deeply nested fashion. However, a function implemented as shown can’t do that - using “bl” would overwrite the LR, and so the return address of the outer function would be lost, and that function couldn’t ever return. The solution is to use the stack: At the beginning of a function that calls other functions, use “push” to save the LR, and at the end use “pop” to restore it. For example, the blink program could be restructured like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r5, =5 @ Number of LED flashes.
bl Blink
b .

.type Blink, %function
Blink:
push { lr }
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

bl Delay

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

bl Delay

subs r5, #1
bne BlinkLoop

pop { lr }
bx lr

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

.type Delay, %function
Delay:
mov r4, r3
DelayLoop:
subs r4, #1
bne DelayLoop @ Iterate delay loop
bx lr

</source>
Example name: “BlinkFunctions”

The Reset_Handler just became much prettier. There now are functions for enabling the GPIOA clock, configuring PA8 as output, and one that delays execution so that the LED blinking is visible. The “Blink” function performs the blinking, but only for 5 flashes, after which it returns (an endless blink-loop wouldn’t be good for demonstrating returns). As you see, LR is saved on the stack to allow “Blink” to call further functions.

The two lines

<source lang=asm>
pop { lr }
bx lr
</source>

are longer than necessary. It is possible to load the return address from the stack directly into the program counter, PC:

<source lang=asm>
pop { pc }
</source>

This way, the return address that was saved on the stack is directly used for the jump back. Just the same way, you can use “push” and “pop” to save and restore any other registers while your function is running.
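
As a sketch of this, the following hypothetical function (the names “ProcessData” and “Helper” are made up for illustration) saves r4 together with LR, so that it can use r4 for its own purposes without the caller noticing. Pushing an even number of registers also keeps the stack pointer a multiple of 8, provided it was one when the function was entered:

<source lang=asm>
.type ProcessData, %function
ProcessData:
push { r4, lr } @ Save the caller's r4 and the return address (2 registers = 8 bytes)
ldr r4, =1000 @ r4 can now be used freely inside this function
bl Helper @ Hypothetical other function; this overwrites LR
pop { r4, pc } @ Restore the caller's r4 and return
</source>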
==== Calling Convention ====
Building a large program as shown in the last example is a bad idea, however. The “Delay” function requires the iteration count (1000000) to reside in r3 and overwrites r4. The “Blink” function relies on “Delay” not overwriting r0-r3 and r5, and requires the number of flashes to be given via r5. Such requirements can quickly grow into an intricate web of interdependencies that makes it impossible to write larger functions that call several sub-functions, or to restructure anything. Therefore, it is common to use a calling convention, which defines which registers a function may overwrite, which it should keep, how it should use the stack, and how to pass information to the function and back to the caller.

When building an entire application out of your own assembly code, you can invent your own calling convention. However, it is always a good idea to use existing standards: The AAPCS (Procedure Call Standard for the ARM Architecture) defines a calling convention for ARM. This convention is also followed by C and C++ compilers, so using it makes your code automatically compatible with those. The Cortex-M interrupt mechanism follows it too, which would make it awkward to adapt code that uses some other convention to interrupts. The specification of the calling convention is quite complex, so here is a quick summary of the basics:

* Functions may only modify the registers r0-r3 and r12. If more registers are needed, they have to be saved and restored using the stack. The APSR may be modified too.
* The LR is used as shown for the return address.
* When returning (via “bx lr”) the stack should be exactly in the same state as during the jump to the function (via “bl”).
* The registers r0-r3 may be used to pass additional information to a function, called parameters, and the function may overwrite them.
* The register r0 may be used to pass a result value back to the caller, which is called the return value.

This means that when you call a function, you must assume registers r0-r3 and r12 may be overwritten but the others keep their values. In other words, the registers r0-r3 and r12 are (if at all) saved ''outside'' the function (“caller-save”), and the registers r4-r11 are (if at all) saved ''inside'' the function (“callee-save”).

A function that does not call any other functions is called a “leaf-function” (as it is a leaf in the call tree). If such a function is simple, it might not need to touch the stack at all, as the return address is just kept in a register (LR) and it might only overwrite the registers r0-r3 and r12, which the caller can make sure to contain no important data. This makes small functions efficient, as register accesses are faster than memory accesses, such as to the stack.
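
A minimal sketch of such a leaf function and a call to it (the function “AddThree” is made up for illustration): it receives three parameters in r0-r2, returns their sum in r0, and touches neither the stack nor any callee-save register.

<source lang=asm>
@ Somewhere inside another function (which has saved LR before):
ldr r0, =10
ldr r1, =20
ldr r2, =12
bl AddThree @ Afterwards, r0 contains 42

@ ...

@ Parameters: r0, r1, r2 = Numbers to add
@ Return value: r0 = Sum of the three numbers
.type AddThree, %function
AddThree:
add r0, r0, r1
add r0, r0, r2
bx lr
</source>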

If all your functions follow the calling convention, you can call any function from anywhere and be sure about what it overwrites, even if it calls many other functions on its own. Restructuring the LED blinker could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =5
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r7, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
mov r7, r0 @ Number of LED flashes.

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r7, #1
bne BlinkLoop

pop { r4-r7, pc }

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>
Example name: “BlinkFunctionCallingConvention”

The three small functions at the end only use registers r0 and r1, which they are free to overwrite. The “Delay” function expects the number of iterations as a parameter in r0, which it then modifies. Therefore, the “Blink” function fills r0 before every call to “Delay”. Alternatively, “Delay” could use a fixed iteration count, i.e. the “ldr” could be moved into “Delay”. As the “Blink” function must assume that “Delay” overwrites r0-r3 and r12, it keeps its own data in r4-r7, which are guaranteed to be retained according to the calling convention. Since “Blink”, in turn, must preserve these registers for the function that called it, it uses “push” and “pop” to save and restore them. Note the shortened syntax “r4-r7” in the instructions. The number of LED flashes is passed in r0 as a parameter; as this register will be overwritten, this number is moved to r7.

Alternatively, “Blink” could re-load the constants each time they are used in r1/r2, such that only one register (r4) needs to be saved as it is needed to count the number of flashes:

<source lang=asm>
.type Blink, %function
Blink:
push { r4, lr }

mov r4, r0

BlinkLoop:
ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r4, #1
bne BlinkLoop

pop { r4, pc }
</source>
Example name: “BlinkFunctionCallingConvention2”

A third variant would not use any of the callee-save-registers (r4-r11) at all, and instead just save r0 before the function calls and restore it as needed:

<source lang=asm>
.type Blink, %function
Blink:
push { lr }

BlinkLoop:
push { r0 }

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

pop { r0 }
subs r0, #1
bne BlinkLoop

pop { pc }

</source>
Example name: “BlinkFunctionCallingConvention3”

The frequent stack accesses would however make this slower. Be sure to always document the meaning (and units, if applicable) of parameters e.g. via comments.
=== Conditional Execution ===
As mentioned, the conditional variants of the “b” instruction (e.g. “bne”) can be used to execute certain blocks of code only if a certain condition is met. First, more ways to formulate conditions will be shown. Next, the ARM instruction “it” will be introduced, which makes executing small blocks of code conditionally more efficient.
==== Conditions ====
All conditions for conditional execution depend on the outcome of some mathematical operation. When instructions such as “adds”, “subs”, “ands” are used, they update the flags in the APSR register depending on the outcome, which are then read by the conditional variants of “b” to decide whether to actually perform the jump.

Often it is necessary to compare two numbers without actually doing a calculation. This can be done with the “cmp” instruction to which you can pass two registers or a register and a literal:

<source lang=asm>
cmp r0, #42
cmp r0, r1
</source>

The “cmp” instruction is very similar to “subs” - it subtracts the second operand from the first, but doesn’t save the result anywhere, i.e. the registers keep their values. Just the flags in the APSR are updated according to the result, just as with “subs”. For example, if both operands were equal, the result of the subtraction is zero, and the zero flag will be set. So, to test whether two numbers are equal:

<source lang=asm>
cmp r0, #42
beq TheAnswer

@ This is executed if r0 is not 42

TheAnswer:
@ This is executed if r0 is 42
</source>

The “bne” instruction is the opposite of “beq”.

The “tst” instruction works similarly to “cmp”, but instead of subtracting, it performs a bitwise “and” operation - like the “ands” instruction, but without keeping the result. This way, you can test whether a bit in a register is set:

<source lang=asm>
tst r0, #4
beq BitNotSet

@ This is executed if bit 2 in r0 is set

BitNotSet:
@ This is executed if bit 2 in r0 is not set
</source>

Another useful application of “tst” is to pass the same register twice. Applying “and” to a value and itself yields that same value, so “tst” in this case effectively checks the properties of the input (negative/positive, zero):

<source lang=asm>
tst r0, r0
beq ValueZero

@ This is executed if r0 is not zero

ValueZero:
@ This is executed if r0 is zero
</source>

There is also the “teq” instruction which performs an exclusive or operation.
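
For illustration, a sketch of two possible uses of “teq” (the labels are made up): since the exclusive or of two equal values is zero, it can test for equality, and since bit 31 of the result is set exactly when the sign bits of both operands differ, it can also check whether two signed numbers have different signs. The “mi” condition code used for the latter is listed in the table below.

<source lang=asm>
teq r0, r1
beq ValuesEqual @ Jump if r0 and r1 are equal (result of the exclusive or is zero)
bmi SignsDiffer @ Jump if exactly one of r0 and r1 is negative (bit 31 of the result is set)

@ This is executed if r0 and r1 are different, but have the same sign

ValuesEqual:
@ ...

SignsDiffer:
@ ...
</source>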

As mentioned, the suffixes “eq” and “ne” are called condition codes. ARM has 14 of those which define how the flags in the APSR form the condition. The details about how a subtraction (by “subs” or “cmp”) sets the flags in the APSR and how their interpretation by the different condition codes correlates to the mathematical result are somewhat complicated, involving the way the 2’s complement format works and relying on the fact that subtracting works by adding a negated number. Instead of diving into all the details, a table with a more high-level view and a practical interpretation of the condition should be more helpful:

{| class="wikitable"
! style="text-align:left;" | Code
! style="text-align:left;" | Meaning
! style="text-align:left;" | Unsigned / Signed
! style="text-align:left;" | Flags
! style="text-align:left;" | Condition after “cmp/subs r0, r1”
! style="text-align:left;" | Condition after “tst r0, r0”
|-
| style="text-align:center;" | EQ
| Equal
| style="text-align:center;" | U+S
| Z==1
| r0 = r1
| r0 = 0
|-
| style="text-align:center;" | NE
| Not equal
| style="text-align:center;" | U+S
| Z==0
| r0 ≠ r1
| r0 ≠ 0
|-
| style="text-align:center;" | MI
| Negative
| style="text-align:center;" | S
| N==1
| ---
| r0 < 0
|-
| style="text-align:center;" | PL
| Positive or Zero
| style="text-align:center;" | S
| N==0
| ---
| r0 ≥ 0
|-
| style="text-align:center;" | VS
| Overflow
| style="text-align:center;" | S
| V==1
| r0-r1 out of range¹
| ---
|-
| style="text-align:center;" | VC
| No Overflow
| style="text-align:center;" | S
| V==0
| r0-r1 in range¹
| ---
|-
| style="text-align:center;" | HS
| Unsigned higher or same
| style="text-align:center;" | U
| C==1
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LO
| Unsigned lower
| style="text-align:center;" | U
| C==0
| r0 < r1
| ---
|-
| style="text-align:center;" | HI
| Unsigned higher
| style="text-align:center;" | U
| C==1 and Z==0
| r0 > r1
| ---
|-
| style="text-align:center;" | LS
| Unsigned lower or same
| style="text-align:center;" | U
| C==0 or Z==1
| r0 ≤ r1
| ---
|-
| style="text-align:center;" | GE
| Signed greater or equal
| style="text-align:center;" | S
| N==V
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LT
| Signed less than
| style="text-align:center;" | S
| N!=V
| r0 < r1
| ---
|-
| style="text-align:center;" | GT
| Signed greater than
| style="text-align:center;" | S
| Z==0 and N==V
| r0 > r1
| ---
|-
| style="text-align:center;" | LE
| Signed less or equal
| style="text-align:center;" | S
| Z==1 or N!=V
| r0 ≤ r1
| ---
|}
¹ Range meaning the numbers from -2^31 to 2^31-1, inclusive

To determine which condition code you need, first think about whether the number is unsigned (range 0 to 2^32-1) or is using two's complement to represent signed numbers (range -2^31 to 2^31-1). Ignore all rows in the table with the wrong format.

If you want to compare two numbers, use the “cmp” instruction, and search for the desired condition within the “cmp”-condition column of the table. If you want to test a single number's properties, use the “tst”-column. Use the condition code from the first column with the conditional “b” instruction (“bne”, “beq”, “bmi”, “bpl”, “bhs”, ... ) right after the appropriate “cmp”/“tst” instruction.

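Note that all the condition codes have a corresponding inverse code that has exactly the negated meaning (e.g. “ls” is the inverse of “hi”). Most also have a swapped partner code; using that one is equivalent to swapping the operands of “cmp” (e.g. “lo” is the swapped partner of “hi”, since r0 < r1 means the same as r1 > r0).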
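
The difference between the unsigned and signed condition codes can be seen in the following sketch (the values and labels are chosen for illustration only). The bit pattern 0xffffffff means 4294967295 when interpreted as unsigned, but -1 when interpreted as signed, so the same “cmp” leads to different outcomes depending on the condition code used:

<source lang=asm>
ldr r0, =0xffffffff @ 4294967295 if unsigned, -1 if signed
ldr r1, =1

cmp r0, r1 @ Perform comparison
bhi UnsignedHigher @ Unsigned view: 4294967295 > 1, so this jump is taken

@ This is not executed in this example

UnsignedHigher:
cmp r0, r1 @ Same comparison again
blt SignedLess @ Signed view: -1 < 1, so this jump is taken as well

@ This is not executed in this example

SignedLess:
@ ...
</source>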
==== The IT instruction ====
Jumping is inefficient, so having many conditional jumps may slow down your program. The ARM architecture offers a way to make a few instructions conditional without requiring a jump via the “it” (if-then) instruction. It is used in place of a conditional jump after an instruction that sets the flags (“cmp”, “tst”, “adds”…) and also needs a condition code. The instruction right after the “it” will then only be executed when the condition is met, and skipped otherwise. You have to repeat the condition code by appending it to that instruction; this is just to make the code clearer and avoid confusion.

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

it hi @ Make the next instruction conditional
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

This checks if r0 is higher than r1 (it isn’t), and only sets the pin PA8 to high if this condition is met. Up to 4 instructions can be made conditional like this; for each one, an additional “t” has to be appended to the “it” instruction:

<source lang=asm>
cmp r0, r1 @ Perform comparison

ittt hi @ Make the next three instructions conditional
ldrhi r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldrhi r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

You can also add instructions that will be executed if the condition was ''not'' met (like an “else”-case in high-level-languages), by appending “e” instead of “t” to the “it” instruction. Since the “t” in “it” is fixed, the first instruction is always executed if the condition is met; only the next three instructions can be either a “then” case (“t”) or “else” case (“e”). You also have to provide the inverted condition code for the “else”-instructions:

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

ite hi @ Make the next two instructions conditional (if-then-else)
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
strls r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low
</source>

There are several restrictions on which instructions may appear within an it-block. Most importantly, instructions that set the flags are forbidden here, as is the “b” instruction except for the last instruction in an “it” block. Directly jumping to one of the conditional instructions is forbidden too.

In T32 code, only the conditional “b” instruction is capable of encoding a condition code together with some operation, so the “it” instruction is provided to make any instruction conditional. On A32, most instructions include a condition code and can therefore be conditional, and the “it” instruction is actually ignored by the assembler here. You can and should still put “it” into code intended for A32, as this makes it compatible with T32. This is one of the reasons why A32 is more time-efficient, and T32 more space-efficient.

Conditional instructions sometimes make for surprisingly compact programs. For example, the Euclidean algorithm for calculating the greatest common divisor (gcd) of two numbers can be written in ARM assembly like this:

<source lang=asm>
gcd:
cmp r0, r1
ite gt
subgt r0, r0, r1
suble r1, r1, r0
bne gcd
</source>

While the C equivalent is actually longer:

<source lang=c>
int gcd(int a, int b) {
    while (a != b) {
        if (a > b)
            a = a - b;
        else
            b = b - a;
    }
    return a;
}
</source>

The usage of conditional instructions is also [https://developer.arm.com/docs/dui0801/latest/condition-codes/example-showing-the-benefits-of-conditional-instructions-in-a32-and-t32-code faster] than using conditional jumps. Note that the final “bne” instruction is independent of the “if-then” block; it just directly uses the result of “cmp”.
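
To use the loop above as a function that follows the calling convention, a return instruction has to be appended. A sketch of this, under the assumption that both parameters are positive (if one of them were zero, the loop would never terminate):

<source lang=asm>
@ Somewhere inside another function (which has saved LR before):
ldr r0, =48
ldr r1, =18
bl gcd @ Afterwards, r0 contains 6

@ ...

@ Parameters: r0, r1 = Positive numbers
@ Return value: r0 = Greatest common divisor of both numbers
.type gcd, %function
gcd:
cmp r0, r1
ite gt
subgt r0, r0, r1
suble r1, r1, r0
bne gcd @ Repeat until both numbers were equal
bx lr @ Return to caller
</source>

Only r0 and r1 are modified, so nothing needs to be saved on the stack.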
=== 8/16 bit arithmetic ===
So far, all numbers were 32 bits wide. However, especially to save space, smaller numbers with 8 or 16 bits are often needed. The Cortex-M3 doesn't provide any instructions for calculating with 8 or 16 bit numbers directly. Instead, after loading such a number from memory into a processor register, it has to be extended to 32bit to allow the 32bit instructions to work properly. When storing the result back, only the lower 8/16 bits are used. If 8/16bit overflow behavior is required for calculations (i.e. wrapping around at the range limits -128/127 for 8bit signed, 0/255 for 8bit unsigned, -32768/32767 for 16bit signed, 0/65535 for 16bit unsigned), the numbers have to be truncated after each calculation. This actually makes it slightly less efficient to deal with smaller numbers.

A 16bit value (“halfword”) can be read from memory with the ldrh instruction:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
</source>

“ldrh” loads 16bit from memory, writes them into the lower 16 bits of the destination register (here: r1), and sets the upper 16bits to zero. If the value is signed, it has to be sign-extended so that it can be used with 32bit-calculations:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
sxth r1, r1
</source>

The “sxth” instruction copies the sign bit (i.e. bit 15) into the upper 16 bits (“sign-extension”); this makes sure that negative 16bit-numbers keep their value when interpreted as 32 bits. The “ldrsh” instruction combines both “ldrh” and “sxth”. “ldrb”, “sxtb”, “ldrsb” are for loading and sign-extending 8bit-values and the combination of both, respectively.
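
As a short sketch, the combined and the 8bit variants can be used like this (“SomeAddress” is a placeholder as in the examples above):

<source lang=asm>
ldr r0, =SomeAddress
ldrsh r1, [r0] @ Load 16 bits and sign-extend them to 32 bits (same as “ldrh” + “sxth”)
ldrb r2, [r0] @ Load 8 bits; the upper 24 bits of r2 are set to zero
ldrsb r3, [r0] @ Load 8 bits and sign-extend them to 32 bits (same as “ldrb” + “sxtb”)
</source>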

To simulate 8/16bit overflow behaviour after a mathematical operation, use uxtb/uxth for unsigned 8/16 bit numbers, or sxtb/sxth for signed 8/16 bit numbers:

<source lang=asm>
add r0, #1
uxth r0, r0
</source>

The “uxth”/“uxtb” instructions copy the lower 16/8 bits of a register into another one, setting the upper 16/24 bits to zero. This way, if r0 contained 65535 before, the result will be 0 instead of 65536 after using “uxth”.
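
The signed case works the same way. A sketch for a signed 8bit number that overflows and is then stored back to memory (“SomeAddress” is again a placeholder):

<source lang=asm>
ldr r0, =127 @ Largest signed 8bit number
add r0, #1 @ r0 is now 128, which is out of range for signed 8bit
sxtb r0, r0 @ r0 is now -128, just like after a real 8bit overflow

ldr r1, =SomeAddress
strb r0, [r1] @ Store the lower 8 bits (0x80) to memory
</source>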

This is a common trap when coding in C - when using e.g. the "uint16_t" type for local variables such as loop counters, this implicitly requests 16bit overflow behavior, requiring truncation after each calculation, even though the overflow may actually never happen. This is why e.g. "uint_fast16_t" should be used for local variables, as this is 32 bit on ARM, which is faster.
=== Alignment ===
There are certain restrictions on the address when accessing data in memory using the “str”/“ldr” variants:
* The “ldrd”/“strd”/“ldm”/“stm” instructions, which can load/store multiple registers at once, always require the address to be a multiple of 4. If it isn’t, the program will crash.
* The “ldr”/“str” instructions require the address to be a multiple of 4, and “strh”/“ldrh” require it to be a multiple of 2. If it isn’t, the behaviour depends on the ARM version:
** On ARMv6-M and before, the program will crash.
** On ARMv7-M:
*** If the CCR.UNALIGN_TRP bit is set to zero (the default), the access will work, but be slower than an aligned one
*** If the CCR.UNALIGN_TRP bit is set to one, the program will crash, emulating the ARMv6-M behaviour

For “strb”/“ldrb” there are no such requirements.

The number that the address needs to be a multiple of is called the “alignment” (e.g. 2-byte-alignment, 4-byte-alignment, …). An access with an address that is a multiple of 2/4 as specified above is called an “aligned access”; others are called “unaligned access” (which are slow or cause a crash).
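
During development, it can be helpful to turn unaligned accesses into crashes so that they are noticed early. A sketch for ARMv7-M, assuming that the Configuration and Control Register (CCR) is located at address 0xE000ED14 and that UNALIGN_TRP is bit 3 of it, as described in the ARMv7-M Architecture Reference Manual; the symbol names are chosen just for this example:

<source lang=asm>
SCB_CCR = 0xE000ED14 @ Address of the CCR (symbol name is an assumption)
SCB_CCR_UNALIGN_TRP = 8 @ Bit 3 of the CCR

ldr r1, =SCB_CCR
ldr r0, [r1]
orr r0, r0, #SCB_CCR_UNALIGN_TRP
str r0, [r1] @ Set UNALIGN_TRP bit in CCR to 1 to make unaligned accesses crash
</source>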

Even though slow accesses may be acceptable, it is still a good idea to make sure all accesses are always correctly aligned in case the code is ported to an ARM version or operating system that requires it. The addresses of peripheral registers are already aligned correctly, so there is no need to worry about those. When placing data in RAM however, you should make sure that the addresses of the individual elements that are accessed via one of the “ldr” variants are aligned properly. For example, if a previous example code was modified like this:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

The address of “var1” will not be a multiple of 4, and an access via “ldr” would be unaligned. This could be improved by adding a space of 3 bytes in between:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.space 3
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

This would require you to keep in mind all the other things in memory that were declared before, which is impractical especially if multiple assembly files are used. Therefore, the assembler offers the “.align” directive:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.align 2
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

When using “.align X”, the assembler makes sure that the next address will be a multiple of 2^X, so in this case, a multiple of 2^2=4. The assembler will therefore insert 0 to 2^X-1 bytes of space. The section containing the directive in the object code file will also be marked to require that alignment, such that the linker will automatically place it at the appropriate location in address space.
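
As a sketch, a data section that mixes elements of different sizes could use the directive like this (the variable names are made up for illustration):

<source lang=asm>
.data
flag:
.space 1 @ 1 byte, accessed via “ldrb”/“strb”, so no alignment is needed
.align 1 @ Make sure the next address is a multiple of 2^1=2
counter:
.space 2 @ 2 bytes, accessed via “ldrh”/“strh”
.align 2 @ Make sure the next address is a multiple of 2^2=4
total:
.space 4 @ 4 bytes, accessed via “ldr”/“str”
</source>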
=== Offset addressing ===
The various “ldr”/“str” instructions can optionally perform some calculation on the address before executing the memory access. What is shown for “ldr” here works for “str” and the variants for halfwords and bytes equivalently. There are several variants for this. The first one adds a fixed offset that is encoded within the instruction itself to the address:

<source lang=asm>
ldr r0, [r1, #8]
</source>

This adds 8 to r1 and uses the result as the address to access (r1 itself is not modified). The number can also be negative. This variant is useful for accessing members of a heterogeneous container organized like a C struct or the registers in a peripheral module. For example, you can load the base address of a peripheral module into a register, and then access the various registers using offset-addressing without having to load each address individually:

<source lang=asm>
GPIOA = 0x40010800
GPIOx_CRH = 0x04
GPIOx_BSRR = 0x10

GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA

ldr r1, =GPIOA

ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r0, [r1, #GPIOx_BSRR] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “OffsetAddressing”

This way, you can avoid repeated loads of similar addresses. This variant is also capable of writing the newly calculated address back into the address register by appending a “!”:

<source lang=asm>
ldr r0, [r1, #8]!
</source>

This will add 8 to r1, write the result into r1, and also use it as an address from which to load 4 bytes and store them into r0. The variant

<source lang=asm>
ldr r0, [r1], #8
</source>

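works the other way around - r1 is used as the address from which to load the data, and “r1+8” is written back to r1 afterwards. These two variants are known as “pre-indexed” and “post-indexed” addressing, respectively. The next variant adds two registers to obtain the memory address: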

<source lang=asm>
ldr r0, [r1, r2]
</source>

This loads the data from the address calculated by “r1+r2”. The second register (here: r2) can also be optionally shifted left by a fixed number of bits in the range 0-3:

<source lang=asm>
ldr r0, [r1, r2, lsl #2]
</source>

This shifts r2 left by two bits (i.e. multiplies it by 4), adds it to r1, and uses that as the address (r2 itself is not modified).
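
The shift amount typically matches the element size when indexing into an array. A sketch, assuming a hypothetical label “SomeArray” that marks the beginning of an array in memory:

<source lang=asm>
ldr r1, =SomeArray @ Base address of the array (hypothetical label)
ldr r2, =3 @ Index of the desired element

ldr r0, [r1, r2, lsl #2] @ 32bit elements: load from address SomeArray+3*4
ldrh r0, [r1, r2, lsl #1] @ 16bit elements: load from address SomeArray+3*2
ldrb r0, [r1, r2] @ 8bit elements: load from address SomeArray+3
</source>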
=== Iterating arrays ===
The offset addressing mechanism is perfectly suited to iterating arrays. This could be used to make an array defining a sequence of LED flashes that is iterated by the LED blinker application. Such an array would contain the duration of each on- and off-cycle (as passed to the “Delay” function) and be placed in flash memory:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
.word 2500000, 1000000, 2500000, 1000000, 2500000, 1000000
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
BlinkTableEnd:
</source>
Example name: “BlinkPattern”

The “.word” directive is used to place a sequence of 32bit numbers into flash memory. The label “BlinkTable” will refer to the start address of the array, and “BlinkTableEnd” to the first address ''after'' the array. These two addresses are loaded into registers before the loop. The “.align” directive is used to make sure the 32bit-words are stored at properly aligned addresses. Inside the loop, the “ldr” instruction is used to load a 32bit-word from the array and pass it to the “Delay” function. The post-indexed addressing advances the r7 register by 4 bytes to the next 32bit-word. This is done twice, for the on- and off-time. At the end of the loop, the address register is compared with the address of “BlinkTableEnd” - until that address has been reached, the loop will continue.
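
The same pointer-walking pattern works for any array of words. As a minimal sketch, the hypothetical function “SumWords” (not part of the example projects) adds up all 32bit-words between a start address and an end address, using the same post-increment “ldr” and the same end-address comparison:

<source lang=asm>
@ Hypothetical helper, for illustration only
@ Parameters: r0 = Address of first word, r1 = First address after the array
@ Returns: r0 = Sum of all words
.type SumWords, %function
SumWords:
mov r2, #0 @ Running sum
b SumCheck @ Compare first, so an empty array is handled too
SumLoop:
ldr r3, [r0], #4 @ Load word and advance address by 4 bytes
add r2, r3 @ Add to running sum
SumCheck:
cmp r0, r1
blo SumLoop @ Continue until the end address is reached
mov r0, r2 @ Return sum in r0
bx lr
</source>

Called with r0 = BlinkTable and r1 = BlinkTableEnd, it would return the total number of delay iterations of the whole pattern. Only r0-r3 are used, so nothing needs to be saved on the stack.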

Another possibility is to keep the base address of the array in a register, and increment another register that contains the offset:

<source lang=asm>
.type Blink, %function
Blink:
push { r4-r9, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =0
ldr r9, =18

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

cmp r8, r9
blo BlinkLoop

pop { r4-r9, pc }
</source>
Example name: “BlinkPattern2”

Here, r8 is incremented in steps of 1 to denote the index in the array. The “lsl” syntax for “ldr” is used to multiply r8 by 4 (since each word is 4 bytes in size) and add it to r7, which contains the array’s base address. At the end of the loop, r8 is compared with 18, which is the number of entries in the array. This variant is actually less efficient, as it needs to keep both the base address and the index in registers and also has to increment the index in each iteration.
=== Literal loads ===

Regardless of architecture, any processor obviously needs to work with addresses in its own address space a lot. ARM can do calculations with its 32bit addresses just fine, but there is a bottleneck: The instruction set itself. To work with any address, it needs to be initially loaded into a processor register, but ARM instructions are only 16 or 32bit in size - not enough space for an arbitrary 32bit number plus the instruction encoding. Allowing even larger instructions (e.g. 40 bit) would complicate matters, so ARM instead uses several tricks to deal with this problem, which will be discussed here.

The “ldr r0, =1234” syntax allows you to load arbitrary 32bit numbers. However, it is not a machine code instruction itself; instead, the assembler translates it into a suitable one. In this chapter, the actual instructions for loading immediate numbers are discussed.

==== The “mov”-instruction ====
The most basic way of loading an immediate number into a register is the “mov” instruction:

<source lang=asm>
mov r0, #1234
</source>

This allows you to load any 16bit number (0 to 2^16-1) into a register. “mov” also includes some clever encodings that allow you to load certain commonly-used patterns:

* Any 32bit number that consists of one byte of arbitrary bits (i.e. 8 adjacent arbitrary bits) at any location, and zeros otherwise, e.g. 0x00000045, 0x00045000, 0x7f800000.
* Any 32bit number that consists of the same byte repeated 2 or 4 times in fixed places, as in 0x23002300, 0x00230023, 0x23232323
* The bit-wise negated result of either of these two patterns, e.g. 0xffffffba, 0xfffbafff, 0x807fffff or 0xdcffdcff. The assembler will actually use the “mvn” instruction for this, which works identically to “mov”, but negates the value.

If you specify a number that falls into one of these patterns, the assembler will automatically use the appropriate encoding. The first two ways of encoding numbers are not only available with “mov”, but also with several other instructions that expect an immediate value: “add”, “and”, “bic”, “cmn”, “cmp”, “eor”, “mvn”, “orn”, “orr”, “rsb”, “sbc”, “sub”, “teq”, “tst”. In the ARM Architecture Reference Manual, check the description of an instruction and look out for “ThumbExpandImm” to see whether it supports the first two patterns above.

You can also use the “mvn” instruction directly, e.g.:

<source lang=asm>
mov r0, #0xf807ffff
mvn r0, #0x07f80000
</source>

Both lines are equivalent and write the number 0xf807ffff into r0.
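
A few more examples may help to recognize which immediates can be encoded directly. The values below are chosen arbitrarily for illustration; the last line is commented out because its value matches none of the patterns, so the assembler would reject it:

<source lang=asm>
mov r0, #0xabcd @ 16bit number
mov r0, #0x00ff0000 @ One byte shifted to an arbitrary position
mov r0, #0x00ab00ab @ Same byte repeated twice
mov r0, #0xffffff00 @ Negated pattern, assembled as "mvn r0, #0xff"
orr r1, r0, #0x80000000 @ Shifted byte as immediate of "orr"
cmp r1, #0x45454545 @ Byte repeated four times as immediate of "cmp"
@ mov r0, #0x12345678 @ Not encodable in a single "mov"
</source>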
==== The “movt” instruction ====
While supporting many common patterns, this does not allow arbitrary 32 bit numbers. One way to load any 32bit number is to split the number into two 16bit halves, and use both “mov” and “movt” to combine these two half-words into one register:

<source lang=asm>
mov r0, #0xabcd
movt r0, #0x1234
</source>

The “movt” instruction loads the given number into the upper 16 bits of the register, so this example loads 0x1234abcd into r0. The order is important, as “mov” overwrites the upper 16 bits with zeros, but “movt” keeps the lower 16 bits. If a single “mov” can't fit the desired number, the combination of “mov” and “movt” is the fastest way of loading any 32bit number. As two 32bit instructions are needed, this consumes 8 bytes of program memory. If you want to load the address of a symbol into a register, you need to tell the assembler to split it automatically. This can be achieved by prefixing the symbol with “:lower16:” or “:upper16:”, e.g.:

<source lang=asm>
movw r0, #:lower16:GPIOA_BSRR
movt r0, #:upper16:GPIOA_BSRR
</source>

Note that “movw” needs to be specified in this case to explicitly tell the assembler to use the “mov” variant that accepts 16bit numbers (which it otherwise does automatically when a direct value is given).
==== PC-relative loads ====
The other way of loading arbitrary 32bit values into registers is to place the value directly in flash memory, and load it from there using “ldr”:

<source lang=asm>
@ Some code …
mov r0, … address of Literal ...
ldr r1, [r0]
@ More code …
Literal:
.word 0x12345678
</source>

However, there is a chicken-and-egg problem - the address of “Literal” is a 32bit number itself, so how to load it into r0? Luckily, there is a register that contains a number close to the one needed - the program counter (PC, r15) indicates the address of the instruction currently being executed (when read by an instruction, it yields that address plus 4). By reading it and adding a small offset that fits into the instruction itself, the address of “Literal” can be obtained, provided that “Literal” is located close enough. Consider this example of the EnableClockGPIOA function:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
add r1, pc, #12
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
.word RCC_APB2ENR
</source>

The 32bit-value “RCC_APB2ENR” is stored in flash memory. The “add” instruction adds the offset 12 to the value read from PC to obtain the address of said 32bit-value, which is then loaded via “ldr”. The offset 12 is actually not easy to calculate, since PC reads as the address of the “add” instruction plus 4, and the result even depends on the alignment of the “add” instruction itself (hence the “.align” to ensure a consistent example). The assembler is capable of doing the calculation on its own, for which the “adr” instruction is used:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
adr r1, LiteralRCC_APB2ENR
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

The label LiteralRCC_APB2ENR refers to the address of the 32bit-value in memory. “adr” is actually a variant of “add” that instructs the assembler to calculate the offset and place it into the instruction itself, which then lets the processor add it to PC and write the result to r1. This address is then used by “ldr”.
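
The offset 12 from the hand-written variant can be verified by annotating each instruction with its position relative to the (4-byte aligned) start of the function. The instruction sizes below assume that the assembler selects the 16bit encodings wherever they exist, and that reading PC yields the address of the current instruction plus 4:

<source lang=asm>
@ Columns in the comments: offset from function start, size in bytes
add r1, pc, #12 @ 0, 2: PC reads as 0+4 = 4, and 4+12 = 16
ldr r1, [r1] @ 2, 2
ldr r0, [r1] @ 4, 2
orr r0, r0, #RCC_APB2ENR_IOPAEN @ 6, 4: "orr" with immediate only exists as 32bit instruction
str r0, [r1] @ 10, 2
bx lr @ 12, 2: code ends at offset 14

.align 2 @ Pads from offset 14 to 16
.word RCC_APB2ENR @ 16, 4: matches the calculated address 4+12 = 16
</source>

In general, the processor additionally rounds the value of PC down to a multiple of 4 before adding the offset; here it is already aligned. This is exactly the bookkeeping that “adr” takes over.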

The “adr” instruction is useful when the address of some literal is explicitly needed; for example, in the blinker program, it can be used to obtain the addresses of the array:

<source lang=asm>
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8
</source>

However, for loading a single value, the address is actually not needed. In this case, “adr” and “ldr” can be combined:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, LiteralRCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

This special variant of “ldr” lets the assembler calculate the offset as with “adr”, adds it to “PC” at runtime and loads the data found at the resulting address into r1. This is much easier than the first variant, as all calculations are done automatically. It is still somewhat cumbersome having to write three lines just to obtain a single 32bit value. Therefore, the assembler offers this already introduced syntax:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

This is a special command for the assembler. If possible, the assembler will use the “mov” or “mvn” instruction to load the value. If the value won’t fit, it will be put into flash memory, and a “ldr” instruction as above will be used. In this case, the “ldr rX, =...” syntax is equivalent to the combination of specifying a label for the value, the “.word” directive and “ldr rX, <Label>”. Therefore, this syntax is usually the best way to load immediates.

The assembler places the literals at the end of the file. If the file is long, the offset will be too large for the “ldr” and “adr” instructions and the assembler will emit an error. You can instruct the assembler to place all literals that have been declared so far at a specific point using the “.ltorg” directive. It is recommended to place an “.ltorg” after each function (after the “bx lr”) - just make sure that execution will never reach it. If a single function is so long that an “.ltorg” at the end is too far away from the “ldr”/“adr” at the beginning, you can place an “.ltorg” somewhere in the middle and jump over it with “b”.
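
As a sketch of both placements, the following hypothetical function (name, label and constants are made up for illustration) has one literal pool in the middle that is skipped with “b”, and one after the final “bx lr”:

<source lang=asm>
.type LongFunction, %function
LongFunction:
ldr r0, =0x12345678 @ Literal is placed in the first pool below
@ ... many instructions ...
b LongFunctionContinue @ Jump over the literal pool
.ltorg @ First pool: contains 0x12345678
LongFunctionContinue:
ldr r1, =0x9abcdef0 @ Literal is placed in the second pool below
@ ... more instructions ...
bx lr
.ltorg @ Second pool: contains 0x9abcdef0, never reached by execution
</source>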

In summary, the following rules can help make literal loads more efficient:
* Avoid literal loads if possible; try to calculate needed values from other values that have already been loaded, possibly by using offset-addressing in “ldr”/“str”
* When accessing multiple registers of a single periphery module, load its base address once and use offset addressing to access the individual registers
* If you need a pointer to a location in flash memory, try using “adr”
* If speed is important, use “movw”+“movt” to load the value
* Else, use “ldr rX, =...” to have the assembler choose the optimal encoding
* Place “.ltorg” after each function
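
The second rule can be sketched as follows. The symbols “GPIOA”, “GPIOx_CRH” and “GPIOx_BSRR” are hypothetical names introduced here; their values are derived from the “GPIOA_CRH” and “GPIOA_BSRR” addresses used so far. The fragment assumes that the clock for GPIOA has already been enabled:

<source lang=asm>
GPIOA = 0x40010800 @ Assumed base address symbol (GPIOA_CRH - 4)
GPIOx_CRH = 0x04 @ Hypothetical symbols for the register offsets
GPIOx_BSRR = 0x10

ldr r1, =GPIOA @ Single literal load for the whole module
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Configure PA8 as output
ldr r0, =GPIOx_BSRR_BS8 @ Small value, fits into a "mov"
str r0, [r1, #GPIOx_BSRR] @ Set PA8 high, reusing the base address in r1
</source>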

The “ldr … =” instruction can also be used to load any immediate 32bit value into the PC to cause a jump to that address, simply by specifying “pc” as the target register. If you perform an ordinary branch (via “b” or “bl”) to some function whose address is too far away from the current code location, the linker will insert a “wrapper” function that does exactly that to perform the “far” jump. That function is called a “veneer”.
=== The SysTick timer ===
An important aspect of many embedded systems is controlling the timing of technical processes. In the blinker example, the timing of the LED flashes was handled by having the processor execute dummy instructions to pass time. It is however virtually impossible to accurately predict the runtime of any piece of code on a complex processor such as the ARM ones, and the runtime may vary among multiple runs and depend on the actual microcontroller and its configuration. For a simple LED blinker this may be acceptable, but not for e.g. a closed-loop controller for some mechanical actuator. Therefore, almost all microcontrollers and also application processors feature one or more hardware timers, which allow measuring time independently of the execution speed of the software. Timer features vary widely among different processors, but the basic idea is to increment or decrement some digital counter at each clock cycle and trigger some event when it reaches a certain value.

All ARMv7-M processors feature the so-called “SysTick”-Timer as part of the processor core itself. This is a rather simple 24bit-timer that counts from a configurable value back to zero, then resets to that value and triggers an event. This timer is frequently used as a time base for RTOS or other runtime libraries. The timer uses three periphery registers: “RVR” contains the value from which to count down. “CVR” contains the current value, and “CSR” contains some status and control bits. The timer can be used for the “Delay” function like this:

<source lang=asm>
SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
ldr r1, =SCS
add r0, r0, r0, lsl #1

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

DelayLoop:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #0x10000
beq DelayLoop

ldr r0, =0
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>

The SysTick is part of the “System Control Space”, SCS. The SCS base address is defined as a symbol, and the relative addresses of the registers as well. The count value is stored in “RVR”, after which “CVR” has to be set to zero. The timer is started by writing “5” into the “CSR” register. The loop repeatedly reads the “CSR” register and continues until bit 16 is set. The “tst” instruction is used to perform an “and” operation with the register contents and an immediate value without keeping the result while just updating the flags. At the end, the “CSR” register is set to zero to disable the timer. The “add” instruction at the beginning is used to multiply the count value by 3: r0 is shifted left by one, i.e. multiplied by two, and then added to itself, as in r0*2^1+r0. This is a common trick to quickly multiply by constants. By including this multiplication, the duration is the same as with the previous “Delay” variant, which, on this microcontroller, uses about 3 cycles per loop iteration.
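
The numbers 5 and 0x10000 become more readable when the individual bits of “CSR” are given names. The bit positions below follow the ARMv7-M definition of the register; the symbol names themselves are hypothetical and not used elsewhere in this tutorial. The fragment assumes that r1 contains the SCS base address and could replace the corresponding lines in “Delay”:

<source lang=asm>
SCS_SYST_CSR_ENABLE = 0x1 @ Bit 0: Enable the counter
SCS_SYST_CSR_TICKINT = 0x2 @ Bit 1: Request the SysTick exception when reaching zero
SCS_SYST_CSR_CLKSOURCE = 0x4 @ Bit 2: Use the processor clock
SCS_SYST_CSR_COUNTFLAG = 0x10000 @ Bit 16: Counter has reached zero since the last read

ldr r0, =(SCS_SYST_CSR_ENABLE | SCS_SYST_CSR_CLKSOURCE) @ Same as "ldr r0, =5"
str r0, [r1, #SCS_SYST_CSR]

DelayLoop:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #SCS_SYST_CSR_COUNTFLAG @ Same as "tst r0, #0x10000"
beq DelayLoop
</source>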

Managing timing this way (or any other kind of “Delay” function) is still not very accurate. The time needed to call the function, start the timer, return, and set the pins is added to the actual duration and may also vary each time. The timing errors accumulate over time - a clock implemented this way will quickly go wrong. The proper way to achieve accurate timing is to start the timer once, let it run continuously, and react to its events. The internal clock source used by the microcontroller is also quite inaccurate (up to 2.5% deviation), which can be improved upon by a quartz crystal (typical accuracy of e.g. 0.005%), which will be covered later. Reacting to events instead of calling a function that executes dummy code requires restructuring the program code, without using any kind of “Delay” function.

To do that, the timer is started once at program startup and kept running. After setting the LED pin, wait for the timer event, and repeat. In the last example, the values 3000000 and 7500000 are used for the timer register (3x1000000 and 3x2500000, respectively). Changing the timer value while it is running continuously is problematic, so one fixed value should be used; to achieve variable blinker durations, multiple timer events need to be counted. The greatest common divisor of the two numbers is 1500000, so to achieve the two different times, 2 and 5 timer events need to be registered, respectively. Since these numbers fit into a single byte, the table entries and the corresponding access instructions are changed to bytes. A function “StartSysTick” is implemented to start the timer once, and a function “WaitSysTick” to wait for a given number of timer events:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =TimerValue
bl StartSysTick
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2, 2
BlinkTableEnd:
.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

bx lr

@ r0 = Number of timer events to wait for
.type WaitSysTick, %function
WaitSysTick:
ldr r1, =SCS

WaitSysTickLoop:
ldr r2, [r1, #SCS_SYST_CSR]
tst r2, #0x10000
beq WaitSysTickLoop

subs r0, #1
bne WaitSysTickLoop

bx lr

</source>
Example name: “BlinkSysTick”

This way, the blinker frequency will be as stable and accurate as possible with the given clock source.
=== Exceptions & Interrupts ===
Exceptions and interrupts play an important role in low-level development. They provide a facility for hardware to notify the software of events, such as received data blocks or a timer event. On ARM, interrupts are a sub-group of exceptions – there are some “system-level” exceptions mostly for dealing with processor errors and providing operating system support, while interrupts are “special” exceptions for events signaled by periphery modules. When writing “regular” microcontroller software, you will mostly work with interrupts.

Exceptions (and interrupts) interrupt normal program flow, and cause the processor to execute some other piece of code which is called the exception handler or Interrupt Service Routine (ISR) (even for the “system-level” exceptions that are not interrupts). After dealing with the indicated event, the ISR typically returns and normal program flow resumes. As exceptions can interrupt the program anytime, data (and periphery) may be in any kind of inconsistent state, so special care must be taken to avoid corrupting program state in an ISR. ARMv7-M processors (including the Cortex-M3) provide sophisticated support for exceptions, with configurable priorities and nested exception calls. This chapter will only cover the basics of using exceptions.

On ARMv7-M, exception handlers are implemented as regular functions, for example:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
@ Handle event ...
bx lr
</source>

Like any other function, it has a label, returns with “bx lr”, and is also made globally visible to other source files using “.global”. The “.type … %function” is required here for the same reason as for the already-mentioned “Reset_Handler”. Exception handlers can be located anywhere in flash memory, among the other regular functions. To tell the processor where the exception handlers for the various exception types are located, the vector table needs to be adjusted. Until now, the vector table was defined as:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Recall that the first 32bit-word in flash memory contains the initial stack pointer (defined via “.word _StackEnd”) and the second word contains the address of the first instruction of the program (defined via “.word Reset_Handler”). Actually, resetting the controller is an exception too, and the code to be executed after reset (or start-up) is the handler for the reset exception (hence the name “Reset_Handler”). The next 228 bytes of flash memory contain 57 32bit-addresses of the handlers of the other exceptions, including interrupts. The “.space” directive just fills those with zeroes. To tell the processor the address of an exception handler, the appropriate entry in this table needs to be set to that address. In chapter 10.1.2, table 63 of the controller’s reference manual, the format of the vector table, and which exception’s address should go where, is defined. Only the interrupts up until position 42 actually exist on the STM32F103RB/C8, as defined in chapter 2.3.5 of the datasheet; everything from “TIM8_BRK” is only present on larger controllers. According to the table, the SysTick exception handler’s address needs to be put at location 0x3C relative to the beginning of flash memory. Since the first 8 bytes are already occupied, 0x34 bytes of space are needed after those first 8 bytes.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0x34
.word SysTick_Handler
.space 0xac
</source>

With this modification, the SysTick_Handler function is now declared as the handler for the SysTick exception. By default, the SysTick timer does not trigger an exception. To do that, you have to additionally set bit 1 (“TICKINT”) in the SCS_SYST_CSR register, i.e. write the value 7 instead of 5. By placing the logic for the blinker into the timer's ISR, you get an interrupt-based blinker:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
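ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ Start with step 0; "strb", as the variable is 1 byte in size
ldrb r1, BlinkTable @ Load first table entry (number of timer events for step 0)
strb r1, [r0, #(TimerEvents-Variables)]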

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

ldr r0, =TimerValue
bl StartSysTick
SleepLoop:
wfi
b SleepLoop

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq Return

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs Return

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne Return

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs SkipRestart

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

SkipRestart:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

Return:
bx lr

.align 2
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr

</source>
Example name: “BlinkSysTickInterrupt”

The regular program flow now consists only of initializing the periphery, timer, and the first step of the blinker (setting the pin high). After that, the processor should do nothing but wait for exceptions, which is achieved by a simple endless loop. The “wfi” instruction suspends the processor; when an exception occurs, the processor will wake up, execute the ISR, and return execution after the “wfi”. Therefore, “wfi” is usually put in an endless loop as shown. This technique can reduce the processor’s power consumption significantly, as it is only running when something needs to be done, as indicated via interrupts. The ISR first checks whether the interrupt flag in the timer register is set - this is necessary, since exceptions can sometimes occur “spuriously”, i.e. without an actual event causing it. The decision whether to set or reset the pin state is taken based on the lowest bit of the table index, such that the output alternates between 1 and 0.

The code inside the ISR needs to know which step in the blinking sequence is currently active, and how many timer events are still outstanding in the current step. Therefore, two 1-byte-variables are stored in RAM. To access them, offset addressing is used, where r0 contains the base address of the variables in memory, and the offsets inside “ldrb” and “strb” are set accordingly. The last number of the blink sequence table is omitted, since it is actually superfluous, because no action is taken after the last delay has elapsed. Because the table size is now odd, an “.align” directive after it is required. Always putting “.align” after outputting data is a good idea anyway.

Since exceptions can occur at any point in regular program flow, the processor registers may contain some data that will be used after the exception handler returns. Therefore, if the exception handler writes anything into the registers, they need to be restored when returning from the exception. Upon exception entry, the Cortex-M3/4 processors automatically store the registers r0-r3, r12 and r14 (LR), the return address, and the program status register (which includes the APSR with the flags) on the stack. The link register is filled with a special “dummy” value, and when the exception handler returns via “bx lr” using this value, the processor restores said registers to their previous state. This effectively means that you can implement exception handlers like any other function, i.e. freely overwrite r0-r3, r12 and the flags, and push/pop r4-r11 and the LR if needed.
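
As a minimal sketch of the last point, the following handler uses r4 and calls another function, so it saves r4 and LR like a regular function would. “HandleTimerEvent” is a hypothetical function that is not part of the example project; “Variables” and “BlinkStep” are taken from the example above:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
push { r4, lr } @ Save r4 and the special return value in LR
ldr r4, =Variables @ r4 keeps its value across the function call below
bl HandleTimerEvent @ Hypothetical function; "bl" overwrites LR
ldrb r0, [r4, #(BlinkStep-Variables)] @ r4 is still valid here
pop { r4, pc } @ Restore r4 and return from the exception
</source>

Loading the special value into PC via “pop” triggers the exception return in the same way as “bx lr” does.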
=== Macros ===
The assembler provides a few mechanisms to make assembly-language development easier. One of these is the macro mechanism, which allows you to define snippets of assembly code that you can then insert easily whenever you need them. While looking similar to function invocations, the code inside the macro is actually copied each time the macro is used, so don’t overuse macros. A macro starts with “.macro” and ends at the next “.endm” directive. For example, the following macro sets the LED pin to 0 or 1:

<source lang=asm>
.macro SETLED value
ldr r0, =GPIOA_BSRR
ldr r1, =(((!\value) << 24) | (\value<<8))
str r1, [r0]
.endm

SETLED 0
SETLED 1
</source>

The macro name is defined as “SETLED”, and a single parameter with name “value” is given. By typing “\value”, the value of the parameter is substituted in the macro body. Some bit-shifting is used to calculate the right bit pattern to write into BSRR to set or reset the pin accordingly.
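
To see what the assembler makes of this, the two invocations can be expanded by hand. Assuming that “!” is evaluated as logical negation (0 becomes 1, and 1 becomes 0), the result is equivalent to:

<source lang=asm>
@ Expansion of "SETLED 0"
ldr r0, =GPIOA_BSRR
ldr r1, =((1 << 24) | (0 << 8)) @ = 0x1000000 = GPIOx_BSRR_BR8, sets the pin low
str r1, [r0]

@ Expansion of "SETLED 1"
ldr r0, =GPIOA_BSRR
ldr r1, =((0 << 24) | (1 << 8)) @ = 0x100 = GPIOx_BSRR_BS8, sets the pin high
str r1, [r0]
</source>

Note that each use of the macro overwrites r0 and r1, which is easy to overlook since the invocation itself does not mention any registers.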
=== Weak symbols ===
As explained before, labels defined in assembly files get translated into symbols in the object code files, which are resolved by the linker. Sometimes it is desirable to provide a “default” or “fallback” implementation of some function (or data block) which is only used when no other implementation is given. This can be achieved by marking the “fallback” variant with “.weak”:

<source lang=asm>
.type Function1, %function
.global Function1
.weak Function1
Function1:
@ Default implementation …
…
bl Function1 @ Call the function
</source>

With this code alone, “Function1” will be used normally. If you put another function with the same name in a different assembly source file, that second variant will be used.
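
Such an overriding definition could look like the following sketch of a second source file (the file itself and the function body are made up for illustration). It contains no “.weak”, so the linker prefers it over the default implementation:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Function1, %function
.global Function1
Function1:
@ Replacement implementation ...
bx lr
</source>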
=== Symbol aliases ===
It is also possible to define aliases for symbols using “.thumb_set”, which sets the address accordingly. For example:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

When trying to call “Function2”, the linker will automatically fill in the address of “Function1”. This can also be combined with “.weak” to define a weak alias:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.weak Function2
.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

If you now define another “Function2” in a different assembly source file, that will be used. If not, “Function1” will be called, which is the target of the alias definition. This is useful if you want to define one default implementation for several different functions, for each of which you need one “.weak” and one “.thumb_set” directive.
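
For several functions sharing one default implementation, the pair of directives is simply repeated. In this sketch, “Function3” is a hypothetical additional name introduced only for illustration:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Common default implementation
bx lr

.weak Function2
.thumb_set Function2, Function1
.weak Function3 @ Hypothetical third name
.thumb_set Function3, Function1
</source>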
=== Improved vector table ===
The techniques from the last three sections can be used to improve the definition of the vector table. The way it was defined before is not very flexible; to insert new entries, you have to calculate the new gap sizes and offsets. First, define a default handler ISR that is called by exceptions for which no other ISR is defined, and a macro that defines an alias for one exception with the default handler as the target, and finally a table of all exceptions by using the macro:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.macro defisr name
.global \name
.weak \name
.thumb_set \name, Default_Handler
.word \name
.endm

.global VectorTable
.section .VectorTable, "a"
.type VectorTable, %object
VectorTable:
.word _StackEnd
defisr Reset_Handler
defisr NMI_Handler
defisr HardFault_Handler
defisr MemManage_Handler
defisr BusFault_Handler
defisr UsageFault_Handler
.word 0
.word 0
.word 0
.word 0
defisr SVC_Handler
defisr DebugMon_Handler
.word 0
defisr PendSV_Handler
defisr SysTick_Handler
defisr WWDG_IRQHandler
defisr PVD_IRQHandler
defisr TAMPER_IRQHandler
defisr RTC_IRQHandler
defisr FLASH_IRQHandler
defisr RCC_IRQHandler
defisr EXTI0_IRQHandler
defisr EXTI1_IRQHandler
defisr EXTI2_IRQHandler
defisr EXTI3_IRQHandler
defisr EXTI4_IRQHandler
defisr DMA1_Channel1_IRQHandler
defisr DMA1_Channel2_IRQHandler
defisr DMA1_Channel3_IRQHandler
defisr DMA1_Channel4_IRQHandler
defisr DMA1_Channel5_IRQHandler
defisr DMA1_Channel6_IRQHandler
defisr DMA1_Channel7_IRQHandler
defisr ADC1_2_IRQHandler
defisr USB_HP_CAN1_TX_IRQHandler
defisr USB_LP_CAN1_RX0_IRQHandler
defisr CAN1_RX1_IRQHandler
defisr CAN1_SCE_IRQHandler
defisr EXTI9_5_IRQHandler
defisr TIM1_BRK_IRQHandler
defisr TIM1_UP_IRQHandler
defisr TIM1_TRG_COM_IRQHandler
defisr TIM1_CC_IRQHandler
defisr TIM2_IRQHandler
defisr TIM3_IRQHandler
defisr TIM4_IRQHandler
defisr I2C1_EV_IRQHandler
defisr I2C1_ER_IRQHandler
defisr I2C2_EV_IRQHandler
defisr I2C2_ER_IRQHandler
defisr SPI1_IRQHandler
defisr SPI2_IRQHandler
defisr USART1_IRQHandler
defisr USART2_IRQHandler
defisr USART3_IRQHandler
defisr EXTI15_10_IRQHandler
defisr RTCAlarm_IRQHandler
defisr USBWakeUp_IRQHandler

.text

.type Default_Handler, %function
.global Default_Handler
Default_Handler:
bkpt
b.n Default_Handler
</source>

There are a few empty entries in the table that are not used by the processor. At the beginning, there is still the definition for the initial stack pointer and the “Reset_Handler”. If you replace your “vectortable.S” by this code, you get a “proper” vector table. The “SysTick_Handler” will continue to work as before, and if you need to define any other ISR, for example for USART1, just define a function by the exact name “USART1_IRQHandler”. The address of this function will automatically be put in the vector table. If an exception without a corresponding ISR occurs, the “Default_Handler” will be called, which uses the “bkpt” instruction to force a breakpoint via the attached debugger. This helps in debugging missed exceptions without having to define several individual dummy handler functions.
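
As a minimal sketch of such an override, the following file defines a strong “USART1_IRQHandler” that replaces the weak alias created by the macro. The handler body is only a placeholder and an assumption for illustration; a real ISR would service the USART there:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text

@ Strong definition: takes precedence over the weak alias from "defisr USART1_IRQHandler"
.type USART1_IRQHandler, %function
.global USART1_IRQHandler
USART1_IRQHandler:
@ Placeholder body (assumption): handle the USART1 event here
bx lr
</source>

No change to the vector table file is needed; the linker resolves the table entry to this function instead of “Default_Handler”.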
=== .include ===
Having to put the register and bit definitions (“RCC_APB2ENR”, “RCC_APB2ENR_IOPAEN”, ...) in each assembly source file is redundant and error-prone. Instead, you can put them into a separate file (e.g. called “stm32f103.inc”) and use the “.include” directive to reference it:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

@ Normal code ...
</source>

The assembler will read the code from the included file and pretend it was written instead of the “.include” line. This can help improve code structure. While working on the project structure, you can also restructure the definitions for the GPIO registers to facilitate offset addressing:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
</source>
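
One benefit of separating base address and register offset is that the same code can serve any port. The following sketch assumes a hypothetical helper “SetPin8” and adds a “GPIOB” definition that is not part of the include file shown above (0x40010C00 is the base address of GPIOB on the STM32F103):

<source lang=asm>
GPIOB = 0x40010C00 @ Assumption: added for illustration only

@ r0 = base address of a GPIO port (e.g. GPIOA or GPIOB)
.type SetPin8, %function
SetPin8:
ldr r1, =GPIOx_BSRR_BS8
str r1, [r0, #GPIOx_BSRR] @ The BSRR offset is the same for every port
bx lr
.ltorg

@ Usage from within another function:
@ ldr r0, =GPIOB
@ bl SetPin8
</source>

As with GPIOA, the port's clock has to be enabled and the pin configured as an output before this has any visible effect.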

The next example incorporates these changes in addressing the registers.
=== Local Labels ===
Having to invent unique labels for all jump targets inside functions (e.g. for conditional code and loops) can be tedious. In addition, when using a disassembler (see below), each label will appear as its own function. Therefore, the GNU assembler supports local labels. These are labels whose names consist only of a number. Local labels need not be unique; several labels called e.g. “1” may exist in one file. To perform a jump to a local label, use the number and append an “f” or “b” to indicate whether to jump forward or backward; the nearest label with that number in the given direction is used. Local labels cannot be exported with the “.global” directive. The interrupt-based blinker can be modified like this using local labels:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
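ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ Byte store: BlinkStep is a 1-byte variable
ldr r1, BlinkTable @ Loads a whole word; only its lowest byte (the first table entry) is used
strb r1, [r0, #(TimerEvents-Variables)] @ Byte store: TimerEvents is a 1-byte variable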

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA; the BSRR offset is added in the store below
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

1:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA; the BSRR offset is added in the store below
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>
Example name: “BlinkLocalLabels”
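
Because local labels may be reused, two functions in the same file can both use “1” and “2” without clashing. The following is a minimal sketch with two hypothetical delay functions (“DelayShort” and “DelayLong” are not part of the blinker example); each reference binds to the nearest matching label in the given direction:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text

@ r0 = number of loop iterations (hypothetical helper)
.type DelayShort, %function
DelayShort:
cmp r0, #0
beq 2f @ Forward: the next "2" below, i.e. the one in this function
1:
subs r0, #1
bne 1b @ Backward: the closest "1" above
2:
bx lr

@ r0 = number of outer loop iterations (hypothetical helper)
.type DelayLong, %function
DelayLong:
cmp r0, #0
beq 2f @ Refers to the "2" of DelayLong, not the one in DelayShort
1:
ldr r1, =1000
3:
subs r1, #1
bne 3b @ Inner loop
subs r0, #1
bne 1b @ Outer loop: the closest "1" above, which belongs to DelayLong
2:
bx lr
.ltorg
</source>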
=== Initializing RAM ===
The blinker program uses two byte-sized variables in memory, which have to be initialized to some value at startup. For large programs with many variables, initializing each one with explicit code quickly becomes hard to maintain and is also inefficient. The assembler and linker can help by producing an “image” of what the RAM contents should look like after initialization, and placing this image in flash memory alongside the normal program data. At startup, this image can simply be copied 1:1 into RAM in a loop. Most programs contain many variables that are initialized with zero, so placing a (possibly large) block of zeroes in flash memory would be wasteful; therefore, an additional loop is used to set all zero-initialized variables to zero. Both techniques are also employed by C and C++ compilers, so the same initialization code is required when using those languages. First, change the declaration of your variables by using “.byte”, “.hword” and “.word” and include the desired initialization value. Variables that should be initialized with zero are placed after a “.bss” directive, which puts them into the section of the same name. They don’t get an initialization value but just reserved space by using “.space”:

<source lang=asm>
.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1
</source>
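
The blinker only needs byte-sized variables. As a sketch of how the other directives could be used, the following declares some hypothetical variables of different sizes (all “Sample...” names are assumptions and not part of the blinker). They are ordered from largest to smallest so that no padding is needed between them:

<source lang=asm>
.data
.balign 4 @ Keep the word below on a 4-byte boundary
SampleCount: @ Hypothetical 32-bit variable, initial value 1000
.word 1000
SampleGain: @ Hypothetical 16-bit variable, initial value 0x0123
.hword 0x0123
SampleFlags: @ Hypothetical 8-bit variable, initial value 1
.byte 1

.bss
.balign 4
SampleBuffer: @ Hypothetical 16-byte buffer, starts out as all zeroes
.space 16
</source>

Such variables would then be read with the matching access width, i.e. “ldr” for “SampleCount”, “ldrh” for “SampleGain” and “ldrb” for “SampleFlags”.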

From the assembler’s point of view, the initialization data (for the blinker, just one byte of value “2”) ends up directly in RAM. However, this is not possible on microcontrollers, as the RAM contains undefined data after power-up and isn’t automatically initialized. To have the data stored in flash and copied to RAM at startup instead, change the linker script as follows:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
. = ALIGN(4);
} >FLASH

.stack (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
} >SRAM

.data : {
_DataStart = .;
*(.data);
. = ALIGN(4);
_DataEnd = .;
} >SRAM AT >FLASH

_DataLoad = LOADADDR(.data);

.bss (NOLOAD) : {
_BssStart = .;
*(.bss);
. = ALIGN(4);
_BssEnd = .;
} >SRAM
}
</source>
Example name: “BlinkInitRAM”

The stack got put in its own section with the “NOLOAD” attribute, since it doesn’t need initializing. The data is now put in the “.data” section. The initial data for that section is put into flash memory via the “>SRAM AT >FLASH” construct. The addresses of symbols inside the “.data” section are still the addresses in RAM, so accesses to the symbols from assembly code still work. The symbol “_DataStart” is assigned the beginning of the initialized data in RAM, and “_DataEnd” the end. The “LOADADDR” function is used to get the beginning of the initialization data in flash, and assign it to “_DataLoad”. The “.bss” section contains all the variables that should be zero-initialized, and the symbols “_BssStart” and “_BssEnd” are set to its beginning and end address, respectively. It is marked with “NOLOAD” as well, since we don't want to store (potentially many) zeroes in the linked program file; instead, we will explicitly initialize it (see below).

As the beginning and size of the stack are already multiples of 4, the beginning of “.data” is as well. The size of “.data” might not be a multiple of 4 however, so a “. = ALIGN(4)” command is inserted right before the definition of “_DataEnd”. This adds 0-3 dummy bytes by incrementing the location counter to make sure the address is a multiple of 4. The same thing is done right before “_BssEnd” and also at the end of the “.text” section, to make sure that “_BssEnd” and “_DataLoad” are multiples of 4 as well. This matters because the initialization loops copy and clear whole 32-bit words.

The only thing left is the actual initialization of the RAM. To do that, change the “Reset_Handler” as follows:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA; the BSRR offset is added in the store below
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b
.ltorg
</source>

The explicit initialization of the variables was removed. Instead, the addresses for “_DataStart”, “_DataEnd” and “_DataLoad” that were defined in the linker script are loaded. Then, a short loop repeatedly loads a word from flash (starting at “_DataLoad”) and stores it into RAM (starting at “_DataStart”). The address pointers are incremented by the “ldr”/“str” instructions after the access (post-indexed addressing). The pointer for the RAM location is compared with the end of the RAM area (“_DataEnd”) to decide whether to jump back to the beginning of the loop. To start the loop, a jump directly to the comparison is performed; this way, the single comparison at the end of the loop also handles the case of an empty “.data” section, and no separate check is needed before the loop. The second loop performs the zero-initialization of the area between “_BssStart” and “_BssEnd”; it works similarly, but does not need to load any data.

Unfortunately, the program as shown can’t be assembled: as the two variables now reside in two different sections (“.data” and “.bss”), the offset addressing in the “SysTick_Handler” doesn’t work anymore, because the assembler cannot know the distance between symbols in different sections. Therefore, the address of each variable has to be loaded directly:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA; the BSRR offset is added in the store below
str r1, [r0, #GPIOx_BSRR]

2:
bx lr
</source>

=== Peripheral interrupts ===
Interrupts, i.e. exceptions triggered by peripheral modules, need a little extra code compared to the “core” exceptions including the SysTick. The Cortex-M’s interrupt controller (the NVIC) contains several registers for configuring these interrupts. It is possible to configure the priority and manually trigger interrupts, but for most applications, the only necessary step is enabling the desired interrupt. This is done via the registers “NVIC_ISER0” through “NVIC_ISER15”, which are documented in the ARMv7M Architecture Reference Manual in chapter B3.4.4. Each of those registers contains 32 bits with which 32 of the interrupts can be enabled. The STM32F103RB/C8 has 43 interrupts, so only two of the possible 16 registers are present. The number of interrupts is given in chapter 2.3.5 of the controller’s datasheet. So, to enable some interrupt x, the bit “x mod 32” in register NVIC_ISER''y'' with y=x/32 (rounded down) has to be set. This register’s address is 0xE000E100+y*4. Given an interrupt’s number in r0, the following function does just that:

<source lang=asm>
NVIC_ISER0 = 0xE000E100

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>
Example name: “BlinkTIM1”

The “and” instruction calculates “x mod 32”, and the following left-shift (“lsls”) calculates the value where bit “x mod 32” is one, and all others are zero. To calculate the offset address “y*4”, i.e. “(x/32)*4”, the register is first shifted right by 5 bits and then shifted back left by 2 bits. This is the same as shifting 3 bits right and zeroing out the lower 2 bits, but two shift instructions actually consume less program memory space. Finally, the calculated value is written into the register by using offset addressing. Because writing a zero to a bit in these registers has no effect, the value can be stored directly without reading the register first.
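
To make the calculation concrete, here is a worked trace for interrupt number 37, which is USART1 according to its position in the vector table above. The symbol “USART1_IRQn” is an assumption introduced for this illustration; the fragment is meant to be placed inside a function such as the “Reset_Handler”:

<source lang=asm>
USART1_IRQn = 37 @ Assumption: defined here for illustration only

ldr r0, =USART1_IRQn
bl EnableIRQ
@ Inside EnableIRQ, for r0 = 37:
@ and r3, r0, #0x1F -> r3 = 37 mod 32 = 5
@ lsls r2, r2, r3 -> r2 = 1 << 5 = 0x00000020
@ lsrs r3, r0, #5 -> r3 = 37 / 32 = 1
@ lsls r3, r3, #2 -> r3 = 1 * 4 = 4
@ str r2, [r1, r3] -> writes 0x00000020 to 0xE000E104 (NVIC_ISER1)
</source>

For an interrupt number below 32, such as 25, the offset is 0 and the value 0x02000000 (bit 25) is written to NVIC_ISER0 at 0xE000E100.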

In addition to enabling the interrupt in the processor core’s NVIC, it also has to be enabled in the peripheral module. Many peripheral modules support several different events, each of which has to be enabled in the peripheral’s registers individually. Depending on the controller, these can be mapped to one single processor interrupt (and hence, one single ISR) or multiple ones, and need to be configured in the NVIC appropriately.

This example uses the STM32’s peripheral timer TIM1 instead of the SysTick timer:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500
TimerPrescaler=1000

.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl EnableClockTIM1
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TIM1_UP_IRQn
bl EnableIRQ
bl StartTIM1
1:
wfi
b 1b
.ltorg

.type TIM1_UP_IRQHandler, %function
.global TIM1_UP_IRQHandler
TIM1_UP_IRQHandler:
ldr r0, =TIM1
ldr r2, =(~(1 << TIMx_SR_UIF))

ldr r1, [r0, #TIMx_SR]
bics r1, r2
beq 2f

str r2, [r0, #TIMx_SR]

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type EnableClockTIM1, %function
EnableClockTIM1:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_TIM1EN)
str r0, [r1, #RCC_APB2ENR] @ Set TIM1EN bit in RCC_APB2ENR to 1 to enable TIM1
bx lr @ Return to caller
.ltorg

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ Configures and starts TIM1; takes no parameters (prescaler and reload value come from TimerPrescaler and TimerValue)
.type StartTIM1, %function
StartTIM1:
ldr r0, =TIM1
ldr r1, =(1 << TIMx_CR1_URS)
str r1, [r0, #TIMx_CR1]

ldr r1, =TimerPrescaler
str r1, [r0, #TIMx_PSC]

ldr r1, =TimerValue
str r1, [r0, #TIMx_ARR]

ldr r1, =(1 << TIMx_DIER_UIE)
str r1, [r0, #TIMx_DIER]

ldr r1, =(1 << TIMx_EGR_UG)
str r1, [r0, #TIMx_EGR]

dsb

ldr r1, =(1 << TIMx_CR1_CEN)
str r1, [r0, #TIMx_CR1]

bx lr
.ltorg

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>

The corresponding stm32f103.inc file with the added definitions for the timer registers is:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

RCC = 0x40021000
RCC_APB2ENR = 0x18
RCC_APB2ENR_IOPAEN = 2
RCC_APB2ENR_TIM1EN = 11

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

TIM1 = 0x40012C00

TIMx_CR1 = 0
TIMx_CR1_ARPE = 7
TIMx_CR1_URS = 2
TIMx_CR1_CEN = 0

TIMx_DIER = 0xC
TIMx_DIER_UIE = 0

TIMx_SR = 0x10
TIMx_SR_UIF = 0

TIMx_EGR = 0x14
TIMx_EGR_UG = 0

TIMx_PSC = 0x28
TIMx_ARR = 0x2C

TIM1_UP_IRQn = 25

NVIC_ISER0 = 0xE000E100
</source>

The source code enables the timer’s clock in the RCC before configuring it. The timer supports both a freely configurable prescaler for dividing the clock and a freely configurable maximum value, both of which are set by the StartTIM1 function. The TIMx_DIER_UIE bit is set to enable the interrupt for the so-called “update event”, which is triggered whenever the timer reaches the maximum value. A delicate sequence of register accesses is required to start the timer with the right configuration but without triggering the interrupt right away: To apply the modified settings immediately, the “TIMx_EGR_UG” bit is set to trigger an “artificial” update event. To prevent this from also triggering the interrupt, the “TIMx_CR1_URS” bit is set beforehand and cleared again afterwards. The timer is started by setting the “TIMx_CR1_CEN” bit at the end. Before that, a “dsb” instruction is inserted. This “Data Synchronization Barrier” waits until all preceding write accesses have been completely processed, since the processor’s pipeline usually works on several instructions at once. Because the timer configuration needs to be truly finished before starting the timer, this instruction is required. There are some other situations where the processor is too fast for the peripherals and needs to be temporarily halted by a “dsb”. If some code that accesses peripherals works in step-by-step mode while debugging, but not when executing normally, a well-placed “dsb” might help.

The ISR “TIM1_UP_IRQHandler” is used for the timer. It checks the “TIMx_SR_UIF” bit to verify an update event has actually happened. In that case, the register is overwritten with the value 0xFFFFFFFE, i.e. all bits are written with “1” except the UIF bit. Writing ones has no effect on the bits in this register, and writing a zero clears the respective bit. Therefore, this write access clears the UIF bit but keeps the others. These interrupt flags must always be cleared as soon as possible in the ISR, or the peripheral might trigger the interrupt again immediately. The rest of the ISR stays the same.
=== Analysis tools ===
When working on a low level directly with linker scripts and assembly code, it is frequently necessary to directly verify the translation output, as you can’t rely on a compiler doing it right automatically, and flashing the program each time to see whether it works isn’t the most efficient way. This was, in fact, important in creating the example code for this tutorial. The “binutils” package, of which the assembler and linker are a part, offers a few tools that help with analyzing the assembler’s and linker’s output.
==== Disassembler ====
As the name implies, a disassembler is the opposite of an assembler - it turns binary machine code back into a (more or less) readable textual representation. If you feed an ELF file generated by the assembler or linker into the disassembler, it will read the header information to tell apart data (i.e. constants) and code, get names of symbols (and therefore, labels) and can even tell which instructions were generated from which assembly source file, if it was assembled with debug information (i.e. the “-g” flag was used). If you disassemble a binary flash image, the disassembler doesn’t have all that information and will produce a much less readable output and attempt to decode data bytes as instructions.

The disassembler from binutils is called “objdump”. Invoking it on the blinker looks like this:
<pre>
$ arm-none-eabi-objdump -d -s prog1.elf

prog1.elf: file format elf32-littlearm

Contents of section .VectorTable:
8000000 00040020 ed000008 ed010008 ed010008 ... ............
8000010 ed010008 ed010008 ed010008 00000000 ................
8000020 00000000 00000000 00000000 ed010008 ................
8000030 ed010008 00000000 ed010008 49010008 ............I...
8000040 ed010008 ed010008 ed010008 ed010008 ................
8000050 ed010008 ed010008 ed010008 ed010008 ................
8000060 ed010008 ed010008 ed010008 ed010008 ................
8000070 ed010008 ed010008 ed010008 ed010008 ................
8000080 ed010008 ed010008 ed010008 ed010008 ................
8000090 ed010008 ed010008 ed010008 ed010008 ................
80000a0 ed010008 ed010008 ed010008 ed010008 ................
80000b0 ed010008 ed010008 ed010008 ed010008 ................
80000c0 ed010008 ed010008 ed010008 ed010008 ................
80000d0 ed010008 ed010008 ed010008 ed010008 ................
80000e0 ed010008 ed010008 ed010008 ............
Contents of section .text:
80000ec 0f481049 104a03e0 52f8043b 40f8043b .H.I.J..R..;@..;
80000fc 8842f9d3 0d480e49 4ff00002 01e040f8 .B...H.IO.....@.
800010c 042b8842 fbd300f0 47f800f0 4bf84ff4 .+.B....G...K.O.
800011c 80710848 01600848 00f058f8 30bffde7 .q.H.`.H..X.0...
800012c 00040020 04040020 f0010008 04040020 ... ... .......
800013c 08040020 10080140 60e31600 4ff0e020 ... ...@`...O..
800014c 006910f4 803f1dd0 1a480178 b1f1110f .i...?...H.x....
800015c 18d21948 0378013b 1cbf0370 12e001f1 ...H.x.;...p....
800016c 0101b1f1 110f02d2 144a535c 03701148 .........JS\.p.H
800017c 017011f0 01010cbf 4ff48071 4ff08071 .p......O..qO..q
800018c 0f480160 70470202 02020202 05020502 .H.`pG..........
800019c 05020202 02020200 0a490868 40f00400 .........I.h@...
80001ac 08607047 08490868 20f00f00 40f00200 .`pG.I.h ...@...
80001bc 08607047 04040020 00040020 92010008 .`pG... ... ....
80001cc 10080140 18100240 04080140 4ff0e021 ...@...@...@O..!
80001dc 48614ff0 00008861 4ff00700 08617047 HaO....aO....apG
80001ec 00befde7 ....
Contents of section .data:
20000400 02000000 ....
Contents of section .ARM.attributes:
0000 41200000 00616561 62690001 16000000 A ...aeabi......
0010 05436f72 7465782d 4d330006 0a074d09 .Cortex-M3....M.
0020 02 .
Contents of section .debug_line:
0000 98000000 02001e00 00000201 fb0e0d00 ................
0010 01010101 00000001 00000100 70726f67 ............prog
0020 312e5300 00000000 000502ec 00000803 1.S.............
0030 15012121 22212f2f 21222121 30212f21 ..!!"!//!"!!0!/!
0040 222f302f 21232130 21036120 2f2f362f "/0/!#!0!.a //6/
0050 030c2e32 030a2e2f 212f2222 222f2221 ...2.../!/"""/"!
0060 21222121 222f2f22 21212321 222f212f !"!!"//"!!#!"/!/
0070 30212303 0d9e2121 2f212421 212f2f21 0!#...!!/!$!!//!
0080 03422035 030c2e03 0d2e0311 2e36030b .B 5.........6..
0090 2e30212f 222f2202 01000101 3b000000 .0!/"/".....;...
00a0 02002400 00000201 fb0e0d00 01010101 ..$.............
00b0 00000001 00000100 76656374 6f727461 ........vectorta
00c0 626c652e 53000000 00000005 02ec0100 ble.S...........
00d0 0803d000 01210201 000101 .....!.....
Contents of section .debug_info:
0000 22000000 02000000 00000401 00000000 "...............
0010 ec000008 ec010008 00000000 08000000 ................
0020 12000000 01802200 00000200 14000000 ......".........
0030 04019c00 0000ec01 0008f001 00082100 ..............!.
0040 00000800 00001200 00000180 ............
Contents of section .debug_abbrev:
0000 01110010 06110112 01030e1b 0e250e13 .............%..
0010 05000000 01110010 06110112 01030e1b ................
0020 0e250e13 05000000 .%......
Contents of section .debug_aranges:
0000 1c000000 02000000 00000400 00000000 ................
0010 ec000008 00010000 00000000 00000000 ................
0020 1c000000 02002600 00000400 00000000 ......&.........
0030 ec010008 04000000 00000000 00000000 ................
Contents of section .debug_str:
0000 70726f67 312e5300 2f746d70 2f746573 prog1.S./tmp/tes
0010 7400474e 55204153 20322e32 392e3531 t.GNU AS 2.29.51
0020 00766563 746f7274 61626c65 2e5300 .vectortable.S.

Disassembly of section .text:

080000ec <Reset_Handler>:
80000ec: 480f ldr r0, [pc, #60] ; (800012c <Reset_Handler+0x40>)
80000ee: 4910 ldr r1, [pc, #64] ; (8000130 <Reset_Handler+0x44>)
80000f0: 4a10 ldr r2, [pc, #64] ; (8000134 <Reset_Handler+0x48>)
80000f2: e003 b.n 80000fc <Reset_Handler+0x10>
80000f4: f852 3b04 ldr.w r3, [r2], #4
80000f8: f840 3b04 str.w r3, [r0], #4
80000fc: 4288 cmp r0, r1
80000fe: d3f9 bcc.n 80000f4 <Reset_Handler+0x8>
8000100: 480d ldr r0, [pc, #52] ; (8000138 <Reset_Handler+0x4c>)
8000102: 490e ldr r1, [pc, #56] ; (800013c <Reset_Handler+0x50>)
8000104: f04f 0200 mov.w r2, #0
8000108: e001 b.n 800010e <Reset_Handler+0x22>
800010a: f840 2b04 str.w r2, [r0], #4
800010e: 4288 cmp r0, r1
8000110: d3fb bcc.n 800010a <Reset_Handler+0x1e>
8000112: f000 f847 bl 80001a4 <EnableClockGPIOA>
8000116: f000 f84b bl 80001b0 <ConfigurePA8>
800011a: f44f 7180 mov.w r1, #256 ; 0x100
800011e: 4808 ldr r0, [pc, #32] ; (8000140 <Reset_Handler+0x54>)
8000120: 6001 str r1, [r0, #0]
8000122: 4808 ldr r0, [pc, #32] ; (8000144 <Reset_Handler+0x58>)
8000124: f000 f858 bl 80001d8 <StartSysTick>
8000128: bf30 wfi
800012a: e7fd b.n 8000128 <Reset_Handler+0x3c>
800012c: 20000400 .word 0x20000400
8000130: 20000404 .word 0x20000404
8000134: 080001f0 .word 0x080001f0
8000138: 20000404 .word 0x20000404
800013c: 20000408 .word 0x20000408
8000140: 40010810 .word 0x40010810
8000144: 0016e360 .word 0x0016e360

08000148 <SysTick_Handler>:
8000148: f04f 20e0 mov.w r0, #3758153728 ; 0xe000e000
800014c: 6900 ldr r0, [r0, #16]
800014e: f410 3f80 tst.w r0, #65536 ; 0x10000
8000152: d01d beq.n 8000190 <SysTick_Handler+0x48>
8000154: 481a ldr r0, [pc, #104] ; (80001c0 <ConfigurePA8+0x10>)
8000156: 7801 ldrb r1, [r0, #0]
8000158: f1b1 0f11 cmp.w r1, #17
800015c: d218 bcs.n 8000190 <SysTick_Handler+0x48>
800015e: 4819 ldr r0, [pc, #100] ; (80001c4 <ConfigurePA8+0x14>)
8000160: 7803 ldrb r3, [r0, #0]
8000162: 3b01 subs r3, #1
8000164: bf1c itt ne
8000166: 7003 strbne r3, [r0, #0]
8000168: e012 bne.n 8000190 <SysTick_Handler+0x48>
800016a: f101 0101 add.w r1, r1, #1
800016e: f1b1 0f11 cmp.w r1, #17
8000172: d202 bcs.n 800017a <SysTick_Handler+0x32>
8000174: 4a14 ldr r2, [pc, #80] ; (80001c8 <ConfigurePA8+0x18>)
8000176: 5c53 ldrb r3, [r2, r1]
8000178: 7003 strb r3, [r0, #0]
800017a: 4811 ldr r0, [pc, #68] ; (80001c0 <ConfigurePA8+0x10>)
800017c: 7001 strb r1, [r0, #0]
800017e: f011 0101 ands.w r1, r1, #1
8000182: bf0c ite eq
8000184: f44f 7180 moveq.w r1, #256 ; 0x100
8000188: f04f 7180 movne.w r1, #16777216 ; 0x1000000
800018c: 480f ldr r0, [pc, #60] ; (80001cc <ConfigurePA8+0x1c>)
800018e: 6001 str r1, [r0, #0]
8000190: 4770 bx lr

08000192 <BlinkTable>:
8000192: 0202 0202 0202 0205 0205 0205 0202 0202 ................
80001a2: .

080001a3 <BlinkTableEnd>:
...

080001a4 <EnableClockGPIOA>:
80001a4: 490a ldr r1, [pc, #40] ; (80001d0 <ConfigurePA8+0x20>)
80001a6: 6808 ldr r0, [r1, #0]
80001a8: f040 0004 orr.w r0, r0, #4
80001ac: 6008 str r0, [r1, #0]
80001ae: 4770 bx lr

080001b0 <ConfigurePA8>:
80001b0: 4908 ldr r1, [pc, #32] ; (80001d4 <ConfigurePA8+0x24>)
80001b2: 6808 ldr r0, [r1, #0]
80001b4: f020 000f bic.w r0, r0, #15
80001b8: f040 0002 orr.w r0, r0, #2
80001bc: 6008 str r0, [r1, #0]
80001be: 4770 bx lr
80001c0: 20000404 .word 0x20000404
80001c4: 20000400 .word 0x20000400
80001c8: 08000192 .word 0x08000192
80001cc: 40010810 .word 0x40010810
80001d0: 40021018 .word 0x40021018
80001d4: 40010804 .word 0x40010804

080001d8 <StartSysTick>:
80001d8: f04f 21e0 mov.w r1, #3758153728 ; 0xe000e000
80001dc: 6148 str r0, [r1, #20]
80001de: f04f 0000 mov.w r0, #0
80001e2: 6188 str r0, [r1, #24]
80001e4: f04f 0007 mov.w r0, #7
80001e8: 6108 str r0, [r1, #16]
80001ea: 4770 bx lr

080001ec <Default_Handler>:
80001ec: be00 bkpt 0x0000
80001ee: e7fd b.n 80001ec <Default_Handler>
</pre>

This is a lot of information. The “-d” flag tells objdump to disassemble code sections, and the “-s” flag lets it output the raw contents of all sections. At first, it prints the contents of “.VectorTable”. Each line is prefixed with the address at which this data is found in memory. Then, the 32-bit data blocks from the vector table are output. The disassembler prints the bytes in the order they appear in memory, which, since the Cortex-M3 uses little endian, is reversed - for example, the printed “ed000008” actually refers to the address “0x080000ed”, which is the address of the “Reset_Handler” with the lowest bit set to one, as it is a thumb function. Most of the entries in the vector table contain the address of the default handler, 0x080001ec (stored as 0x080001ed, again with the lowest bit set), except for the zero entries, the initial stack pointer, the “Reset_Handler” and the “SysTick_Handler”. The contents of the “.text” section are the hexadecimal representation of the machine code, and hardly readable. The “.data” section contains a single “two” - this is the “02” put into “TimerEvents”. The contents of “.ARM.attributes” and the various “.debug” sections are not very interesting, as they do not end up on the controller, and are only read by the various analysis tools to provide nicer output.

After that comes the actual disassembly. This is a list of all the instructions in the code section. The list is grouped by the symbols found in the input file. For C code, each symbol usually matches one function, so each block in the disassembly represents one C function. In assembly code, if you put non-local labels into a function, that function will be split into multiple blocks by the disassembler, making it harder to read - the main reason for using local labels.

Each instruction is translated into one line inside the blocks. The first column is the address where that instruction is found. The next column contains the hexadecimal representation of the 2 or 4 bytes that make up the machine code of that instruction, i.e. the actual content of flash memory. After that comes a textual representation of that instruction, as inferred by the disassembler. If the instruction contains some number, the disassembler sometimes outputs a semicolon followed by some interpretation of that number. If the instruction employs PC-relative addressing, that interpretation will be the absolute address. As many instructions have multiple spellings, there can be discrepancies between the original code and the disassembly. The disassembler will also output data, such as the “BlinkTable” and the literal pools, as such. Using the “.type” directive is helpful in that case so that the disassembler does not attempt to interpret the data bytes as code.

objdump can also be used to disassemble raw binary files that can be obtained by reading back the flash memory of some controller. To do this, use this command line:

<source lang=sh>
$ arm-none-eabi-objdump -b binary -m arm -D prog1.bin -Mforce-thumb --adjust-vma=0x08000000
</source>

The address of the binary in flash memory is specified so that the printed instruction addresses are correct. However, as the disassembler can’t tell data and code apart, the result will be of limited use. If you have to analyze a binary without having an ELF file or the source code, a more sophisticated disassembler such as IDA Pro is helpful. If you have the code and only need the disassembler to identify potential problems with the project (esp. the linker script), objdump is usually sufficient.
==== readelf ====
The “readelf” program is a powerful utility that can read and output various information from ELF files. The most useful option is the “-S” flag, which lets readelf print a summary of the sections in the respective file, e.g.:

<pre>
$ arm-none-eabi-readelf -S prog1.elf
There are 15 section headers, starting at offset 0x11268:

Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .VectorTable PROGBITS 08000000 010000 0000ec 00 A 0 0 1
[ 2] .text PROGBITS 080000ec 0100ec 000104 00 AX 0 0 4
[ 3] .stack NOBITS 20000000 020000 000400 00 WA 0 0 1
[ 4] .data PROGBITS 20000400 010400 000004 00 WA 0 0 1
[ 5] .bss NOBITS 20000404 010404 000004 00 WA 0 0 1
[ 6] .ARM.attributes ARM_ATTRIBUTES 00000000 010404 000021 00 0 0 1
[ 7] .debug_line PROGBITS 00000000 010425 0000db 00 0 0 1
[ 8] .debug_info PROGBITS 00000000 010500 00004c 00 0 0 1
[ 9] .debug_abbrev PROGBITS 00000000 01054c 000028 00 0 0 1
[10] .debug_aranges PROGBITS 00000000 010578 000040 00 0 0 8
[11] .debug_str PROGBITS 00000000 0105b8 00002f 01 MS 0 0 1
[12] .symtab SYMTAB 00000000 0105e8 0006a0 10 13 45 4
[13] .strtab STRTAB 00000000 010c88 000550 00 0 0 1
[14] .shstrtab STRTAB 00000000 0111d8 000090 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
y (purecode), p (processor specific)
</pre>

For each section, one line is output. The sections “.strtab”, “.shstrtab”, “.symtab” and “NULL” are an integral part of ELF and always present. The “.debug” sections are present if the source was assembled with the “-g” flag. The “.ARM.attributes” section defines for which ARM processor the contained code was translated. These sections don’t end up on the microcontroller. The remaining sections were defined in the linker script: “.VectorTable” contains the addresses of the exception handlers, “.text” contains the program code and constant data for flash memory, “.stack” the stack in RAM, “.data” contains variables in RAM and “.bss” contains zero-initialized variables in RAM. For these sections, the column “Type” contains either “PROGBITS” or “NOBITS”, which tells you whether the section in the ELF file actually contains some data - this is only the case for “.VectorTable”, “.text” and “.data”. The sections “.bss” and “.stack” only reserve memory that is written at runtime, but the ELF file doesn’t contain data to be written in these sections. The column “Addr” defines where this section begins in the address space. The most useful column is “Size”: If you sum up the sizes of the sections “.VectorTable”, “.text” and “.data”, you obtain the amount of used flash memory. By summing up “.data”, “.stack” and “.bss”, you get the used amount of RAM. Note that “.data” is counted twice, as its initialization data is stored in flash while the variables themselves live in RAM.
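
The two sums can be written down as a small C sketch that runs on the PC. The sizes are copied from the “Size” column of the listing above; the struct layout and function names are only chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the "readelf -S" output, reduced to the fields needed here. */
struct section {
    const char *name;
    uint32_t size;   /* "Size" column */
    int in_flash;    /* nonzero if the contents are stored in flash */
    int in_ram;      /* nonzero if the section occupies RAM at runtime */
};

/* Values copied from the listing above. */
static const struct section prog1_sections[] = {
    { ".VectorTable", 0x0000ec, 1, 0 },
    { ".text",        0x000104, 1, 0 },
    { ".stack",       0x000400, 0, 1 },
    { ".data",        0x000004, 1, 1 },  /* image in flash, variables in RAM */
    { ".bss",         0x000004, 0, 1 },
};
#define PROG1_SECTION_COUNT (sizeof prog1_sections / sizeof prog1_sections[0])

/* Sum of all sections whose contents are written to flash. */
uint32_t flash_usage(const struct section *s, size_t n)
{
    assert(s != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i].in_flash)
            total += s[i].size;
    return total;
}

/* Sum of all sections that occupy RAM while the program runs. */
uint32_t ram_usage(const struct section *s, size_t n)
{
    assert(s != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i].in_ram)
            total += s[i].size;
    return total;
}
```

With the values above, this gives 0x1f4 (500) bytes of flash and 0x408 (1032) bytes of RAM.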
==== nm ====
The “nm” utility prints the symbols defined in an ELF file, for example:

<pre>
$ arm-none-eabi-nm prog1.elf
080001ec W ADC1_2_IRQHandler
20000404 b BlinkStep
08000192 t BlinkTable
080001a3 t BlinkTableEnd
20000408 B _BssEnd
20000404 B _BssStart
…
</pre>

This can be helpful in analyzing errors in linker scripts where symbols might get assigned wrong addresses.
==== addr2line ====
The “addr2line” utility reads the debug information from an ELF file to determine which line in which source file produced the instruction found at a particular given address. For example:

<source lang=sh>
$ arm-none-eabi-addr2line 0x080000f0 -e prog1.elf
/tmp/test/prog1.S:24
</source>

Here, line 24 of “prog1.S” contains the assembler command that produced the instruction that ends up at address 0x080000f0.

==== objcopy ====
The “objcopy” utility allows you to translate program files between different formats. It is useful to convert the ELF files to both the Intel Hex format and a simple binary representation. For example,

<source lang=sh>
arm-none-eabi-objcopy -O ihex prog1.elf prog1.hex
</source>

produces a “.hex” file that contains an image of the flash contents in hexadecimal form. With

<source lang=sh>
arm-none-eabi-objcopy -O binary prog1.elf prog1.bin
</source>

a binary file is created which contains an exact 1:1 image of the flash contents. Some flashing tools require these formats instead of ELF, and viewing the binary file with a hex editor can be interesting as well.
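
When looking at a “.hex” file in a text editor, it helps to know that every line (record) ends with a checksum byte. The following C sketch computes it under the usual Intel Hex rule: the checksum is the two's complement of the low byte of the sum of all other bytes in the record. The function name is hypothetical, and the two sample records in the usage note are generic ones, not taken from “prog1.hex”.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Checksum of one Intel Hex record. "record" holds the decoded bytes
 * between the leading ':' and the checksum itself: byte count,
 * address (2 bytes), record type, and the data bytes. */
uint8_t ihex_checksum(const uint8_t *record, size_t len)
{
    assert(record != NULL);
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + record[i]);   /* keep only the low 8 bits */
    return (uint8_t)(0u - sum);             /* two's complement */
}
```

The end-of-file record “:00000001FF” consists of the bytes 00 00 00 01, for which the function returns 0xFF. A record “:020000040800F2”, which selects the upper address half 0x0800, consists of 02 00 00 04 08 00 and yields 0xF2.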

=== Interfacing C and C++ code ===
Since assembly is rarely used to implement entire complex projects, but mostly for a few time-critical or especially low-level routines that are part of larger code bases written in a high-level language, interfacing C and assembly code is an important topic, which will be covered here. While it is possible to write the main project structure in assembly and integrate some C modules, it is usually done the other way round. Most of the code shown so far is already suitable for inclusion in C programs. Most of this topic works the same way for C++, apart from C++ exceptions (not to be confused with ARM processor exceptions) - but these are rarely used on embedded targets anyway.

If you compile C, C++ and assembly code into individual .o object files, you can link these together using “ld” as before. However, C and C++ code usually requires access to the respective standard library, and “ld” doesn’t link these by default - therefore it is necessary to replace the call to “ld” with a call to “gcc” or “g++” for C or C++, respectively. This will call “ld” internally and pass the required libraries.
==== Environment setup for C and C++ ====
Many C projects use a reset handler and vector table implemented in assembly, although writing them in C is possible too. As required by the C standard, C programs start with the “main()” function, so the (assembly) reset handler should set up the environment such that it is ready for C, and then call “main”. The C code might then later call some assembly functions or inline assembly. When using C++ code, or some GCC extensions for C code, it is required to call some additional functions before calling “main”. C++ uses this mechanism to call the constructors of global objects. The C and C++ compilers emit a table of function pointers to functions that should be called at startup. This table has to be put into flash memory by modifying the linker script as follows:

<source>
.text : {
*(.text)
. = ALIGN(4);

_InitArrayStart = .;
*(SORT(.preinit_array*))
*(SORT(.init_array*))
_InitArrayEnd = .;
} >FLASH
</source>

The table of function pointers is sorted to keep the order needed by the compiler. The symbols “_InitArrayStart” and “_InitArrayEnd” mark beginning and end of that table. A reset handler that performs the memory initialization as before and calls the table of initialization functions could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

ldr r4, =_InitArrayStart
ldr r5, =_InitArrayEnd

b 2f
1: ldr r0, [r4], #4
blx r0
2: cmp r4, r5
blo 1b

bl main
1: bkpt
b 1b
.ltorg
</source>

Note that for iterating the table, registers r4 and r5 are used, since the calling convention requires the called functions to preserve those, whereas r0-r3 may be overwritten. The “blx” instruction is needed to perform the indirect function call. When everything is set up, the main function is called. For embedded programs, the main function should never return (i.e. contain an endless loop). If it does, that’s an error, and to make it easier to find, an endless loop with a forced breakpoint is put right after the call to “main”.
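
For readers more familiar with C, the third loop of the reset handler corresponds roughly to the following sketch. It is written so that it runs on a PC: instead of the linker symbols “_InitArrayStart” and “_InitArrayEnd”, it uses an ordinary array, and the two “ctor” functions are hypothetical stand-ins for the initialization functions a compiler would emit.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_func)(void);

/* Call every function pointer in the half-open range [start, end),
 * in table order, like the "ldr r0, [r4], #4 / blx r0" loop. */
void run_init_array(const init_func *start, const init_func *end)
{
    assert(start != NULL && end != NULL && start <= end);
    for (const init_func *p = start; p < end; ++p)
        (*p)();
}

/* Hypothetical stand-ins for compiler-generated initializers;
 * each one records that it ran and in which order. */
static int init_log[2];
static int init_count;
static void ctor_a(void) { init_log[init_count++] = 1; }
static void ctor_b(void) { init_log[init_count++] = 2; }

/* Stand-in for the table placed between _InitArrayStart and _InitArrayEnd. */
static const init_func demo_table[] = { ctor_a, ctor_b };
```

Calling <code>run_init_array(demo_table, demo_table + 2)</code> runs “ctor_a” and then “ctor_b”. An empty table (start equal to end) results in no calls, which matches the assembly loop that checks the condition before the first iteration.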
==== Calling functions ====
To call assembly functions from C code and vice versa, the assembly functions should observe the calling convention, as mentioned before. C functions can be called just like assembly functions from assembly code, by placing the parameters in registers r0-r3 and on the stack, calling the function using “bl” and retrieving the return value from r0. To call an assembly function from C code, you need to declare it in C first just like a C function. For example, to call a function that takes 2 integer arguments and returns an integer:

<source lang=c>
int AssemblyFunction (int a, int b);
</source>

If you now define a function named “AssemblyFunction” in your assembly code and export it via “.global”, you can call it from C code just like any function.
==== Accessing global variables ====
Global variables defined in C can be accessed from assembly code just like variables defined in assembly code, by using the variable’s name. To access an assembly variable from C code, you need to declare it first by specifying the type. For example, to declare an integer variable:

<source lang=c>
extern int AssemblyVariable;
</source>

If you now define a variable named “AssemblyVariable” in your assembly code and export it via “.global”, you can access it from C code just like any variable. The “extern” is required to make sure the C code doesn’t attempt to define another variable of the same name.
=== Clock configuration ===
By default, STM32 controllers use an internal RC-oscillator with 8 MHz as a clock source for the core and periphery. This oscillator is too inaccurate for implementing a clock or using serial interfaces such as UART, USB or CAN. To obtain a more accurate clock, an external quartz crystal is usually applied. Many STM32 boards feature an 8 MHz crystal. To use it, some initialization code is required that activates the microcontroller’s built-in crystal-oscillator circuit and switches the clock input to that. The STM32 controllers also include a PLL which can multiply some input clock by a configurable factor before feeding it to the processor core and peripherals. This way, a precise and fast clock can be achieved - the STM32F103 supports up to 72 MHz core frequency. Unfortunately, flash memory is not capable of keeping up with such a high frequency. Therefore, when enabling a fast clock, the flash memory needs to be configured to use wait states depending on the frequency.
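
The dependency between clock frequency and wait states can be expressed as a small lookup. The following C sketch returns the value for the LATENCY field of FLASH_ACR; the thresholds follow the FLASH_ACR description in the STM32F103 reference manual and should be double-checked there, especially for other STM32 families. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of flash wait states (LATENCY field of FLASH_ACR) for a
 * given system clock on the STM32F103:
 *   up to 24 MHz       -> 0 wait states
 *   above 24 to 48 MHz -> 1 wait state
 *   above 48 to 72 MHz -> 2 wait states */
uint32_t flash_latency_for(uint32_t sysclk_hz)
{
    /* 72 MHz is the maximum core frequency of this controller. */
    assert(sysclk_hz > 0u && sysclk_hz <= 72000000u);

    if (sysclk_hz <= 24000000u)
        return 0u;
    if (sysclk_hz <= 48000000u)
        return 1u;
    return 2u;
}
```

For the default 8 MHz internal oscillator this yields 0, and for the 72 MHz maximum it yields 2.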

The following function configures the flash wait states, enables the crystal oscillator, configures the PLL to multiply the input clock by a factor of 9, and uses the PLL output as the system clock. The prescaler for the internal bus APB1 is set to 2. Assuming an 8 MHz crystal, this achieves the maximum performance possible with this microcontroller - 72 MHz for the core and APB2 domain, 36 MHz for APB1. If a different crystal is used, the PLL factors have to be adjusted.

<source lang=asm>
RCC = 0x40021000

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

.type ConfigureSysClock, %function
.global ConfigureSysClock
ConfigureSysClock:
@ Turn on HSE
ldr r0, =RCC
ldr r1, =((1 << RCC_CR_HSION)|(1 << RCC_CR_HSEON))
str r1, [r0, #RCC_CR]

@ Configure (but not start yet) PLL
@ Mul = 9, Prediv = 1, APB1 Prescaler = 2, APB2 Prescaler = 1, AHB Prescaler = 1
ldr r2, =(((9-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_USBPRE)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
orr r1, #(1 << RCC_CR_PLLON)

@ Wait for HSE ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_HSERDY)
beq 1b

@ Turn on PLL
str r1, [r0, #RCC_CR]

@ Pre-Calculate value for RCC_CFGR
orr r2, #(2 << RCC_CFGR_SW)

@ Wait for PLL ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_PLLRDY)
beq 1b

@ Set flash wait states to 2
ldr r0, =FLASH
ldr r3, =((1<<FLASH_ACR_PRFTBE)|(2<<FLASH_ACR_LATENCY))
str r3, [r0, #FLASH_ACR]
ldr r0, =RCC

@ Switch system clock to PLL
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
bic r1, #(1 << RCC_CR_HSION)

@ Wait for switch to PLL
1: ldr r3, [r0, #RCC_CFGR]
and r3, #(3 << RCC_CFGR_SWS)
cmp r3, #(2 << RCC_CFGR_SWS)
bne 1b

@ Turn off HSI to save power
str r1, [r0, #RCC_CR]

bx lr
.ltorg
</source>
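
To check the constant that the assembler computes for RCC_CFGR, and to see how it has to change for a different PLL factor, the expression can be evaluated on the PC. The following C sketch uses the same bit positions as the listing above; the function names are chosen for this illustration only. The USBPRE bit is simply copied from the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions, same values as in the assembly listing. */
enum {
    RCC_CFGR_PLLMUL = 18,
    RCC_CFGR_USBPRE = 22,
    RCC_CFGR_PLLSRC = 16,
    RCC_CFGR_PPRE1  = 8
};

/* RCC_CFGR value written before the PLL is started. The PLLMUL field
 * stores the multiplication factor minus 2, hence "(9-2)" in the listing. */
uint32_t rcc_cfgr_value(unsigned pll_factor)
{
    assert(pll_factor >= 2u && pll_factor <= 16u);
    return ((uint32_t)(pll_factor - 2u) << RCC_CFGR_PLLMUL)
         | (UINT32_C(1) << RCC_CFGR_USBPRE)
         | (UINT32_C(1) << RCC_CFGR_PLLSRC)   /* PLL input = crystal (HSE) */
         | (UINT32_C(4) << RCC_CFGR_PPRE1);   /* APB1 prescaler = 2 */
}

/* System clock produced by the PLL from the crystal frequency. */
uint32_t sysclk_hz(uint32_t crystal_hz, unsigned pll_factor)
{
    return crystal_hz * pll_factor;
}

/* APB1 clock with the prescaler of 2 configured above. */
uint32_t apb1_hz(uint32_t sysclk)
{
    return sysclk / 2u;
}
```

For a factor of 9, the register value is 0x005D0400. With an 8 MHz crystal, the resulting system clock is 72 MHz and the APB1 clock is 36 MHz, as stated above.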

Many projects perform the clock configuration in the reset handler before calling the main function. If you want to follow that practice, place a “bl ConfigureSysClock” as the first instruction in the “Reset_Handler” - this way, all the setup will run with the higher clock frequency, making start-up faster. This and the completed startup code from the previous chapters are implemented in the “startup.S” file in the example repository. If you use it, put your code in the “main” function, where RAM and system clock will already be initialized. This is shown in the “BlinkStartup” example.

=== Project template & makefile ===
To quickly start your own project, a project template is supplied in the examples repository under the directory [https://github.com/Erlkoenig90/ArmAsmTutorial/tree/master/ProjectTemplate-STM32F103RB ProjectTemplate-STM32F103RB]. Put your own application code in the program.S file. The startup.S and vectortable.S contain the reset handler with RAM initialization and the vector table with default handler, respectively. A linker script is included too.

The project also contains a makefile. This allows you to quickly translate your project without having to type the assembler and linker commands. Simply type
<source lang=sh>
make
</source>
to translate the code and produce the program.elf, program.bin and program.hex files. All “.S” files in the directory will be automatically translated. Writing makefiles is a complex topic on its own with a lot of information already available on the web, so no further explanations on that will be made here.

ARM-ASM-Tutorial

2021-07-09T06:23:36Z

Nicolas:

[[Kategorie:ARM]][[Kategorie:STM32]][[Kategorie:Entwicklungstools]][[Kategorie:Programmiersprachen]]The [[ARM]] processor architecture is widely used in all kinds of industrial applications and also a significant number of hobby and maker projects. This tutorial aims to teach the fundamentals of programming ARM processors in assembly language.

Tutorial by [[Benutzer:Erlkoenig|Niklas Gürtler]]. [https://www.mikrocontroller.net/topic/482409 Thread in Forum] for feedback and questions.

== Introduction ==
=== Why assembly? ===
Today, there is actually little reason to use assembly language for entire projects, because high-quality optimizing compilers for high-level languages (especially C and C++) are readily available as free open source software and because the ARM architecture is specifically optimized for high-level languages. However, knowledge in assembly is still useful for debugging certain problems, writing low-level software such as bootloaders and operating system kernels, and reverse engineering software for which no source code is available. Occasionally it is necessary to manually optimize some performance-critical code section. Sometimes claims are made that ARM processors can’t be programmed in assembly. Therefore, this tutorial will show that this is very well possible by showing how to write entire (small) applications entirely in the ARM assembly language!

As most of the resources and tools for ARM focus on C programming and because of the complexity of the ARM ecosystem, the largest difficulty in getting started with ARM assembly is not the language itself, but rather using the tools correctly and finding relevant documentation. Therefore, this tutorial will focus on the development environment and how the written assembly code is transformed into the final program. With a good understanding of the environment, all the ARM instructions can be learned simply by reading the architecture documentation.

Because of the complex ecosystem around ARM, a general introduction to the ARM processor market is necessary.

=== About ARM ===
Arm Holdings is the company behind the ARM architecture. Arm does not manufacture any processors themselves, but designs the “blueprints” for processor cores, which are then licensed by various semiconductor companies such as ST, TI, NXP and many others, who combine the processor with various support hardware (most notably flash and RAM memories) and peripheral modules to produce a final complete processor IC. Some of these peripheral modules are even licensed from other companies – for example, the USB controller modules by Synopsys are found in many different processors from various manufacturers.

Because of this licensing model, ARM processor cores are found in a very large variety of products for which software can be developed using a single set of tools (especially compiler, assembler and debugger). This makes knowledge about the ARM architecture, particularly the ARM assembly language, useful for a large range of applications.

Since the ARM processor cores always require additional hardware modules to function, both the ARM-made processor core and the manufacturer-specific periphery modules have to be considered when developing software for ARM systems. For example, the instruction set is defined by ARM and software tools (compiler, assembler) need to be configured for the correct instruction set version, while the clock configuration is manufacturer-specific and needs to be addressed by initialization code specifically made for one processor.

=== Architecture and processor variants ===
A processor’s architecture defines the interface between hardware and software. Its most important part is the instruction set, but it also defines e.g. hardware behavior under exceptional circumstances (e.g. memory access errors, division by zero, etc.). Processor architectures evolve, so they have multiple versions and variants. They also define optional functionality that may or may not be present in a processor (e.g. a floating-point unit). For ARM, the architectures are documented exhaustively in the “ARM Architecture Reference Manuals”.

While the architecture is an abstract concept, a processor core is a concrete definition of a processor (e.g. as a silicon layout or HDL) that implements a certain architecture. Code that only uses knowledge of the architecture (e.g. an algorithm that does not access any periphery) will run on any processor implementing this architecture. Arm, as mentioned, designs processor cores for their own architectures, but some companies develop custom processors that conform to an ARM architecture, for example Apple and Qualcomm.

ARM architectures are numbered, starting with ARMv1 up until the most recent ARMv8. ARMv6 is the oldest architecture still in significant use, while ARMv7 is the most widespread one. Suffixes are appended to the version to denote variants of the architecture; e.g. ARMv7-M is for small embedded systems while ARMv7-A for more powerful processors. ARMv7E-M adds digital signal processing capabilities including saturating and SIMD operations.

Older ARM processors are named ARM1, ARM2 …, while after ARM11 the name “Cortex” was introduced. The Cortex-M family, including e.g. Cortex-M3 and Cortex-M4 (implementing ARMv7-M and ARMv7E-M architecture, respectively) is designed for microcontrollers, where power consumption, memory size, chip size and latency are important. The Cortex-A family, including e.g. Cortex-A8 and Cortex-A17 (both implementing ARMv7-A architecture) is intended for powerful processors (called “application processors”) for e.g. multimedia and communication products, particularly smartphones and tablets. These processors have much more processing power, typically feature high-bandwidth interfaces to the external world, and are designed to be used with high-level operating systems, most notably Linux (and Android).

An overview of ARM processors and their implemented architecture version can be found on [https://en.wikipedia.org/wiki/ARM_architecture#Cores Wikipedia].
This tutorial will focus on the Cortex-M microcontrollers, as these are much easier to program without an operating system and because assembly language is less relevant on Cortex-A processors. However, the large range of ARM-based devices necessitates flexibility in the architecture specification and software tools, which sometimes complicates their use.

There is actually not a single, but three instruction sets for ARM processors:
* The “A32” instruction set for 32bit ARM architectures, also simply called “ARM” instruction set, favors speed over program memory consumption. All instructions are 4 bytes in size.
* The “A64” instruction set is for the new 64bit ARM processors.
* The “T32” instruction set for 32bit ARM architectures, also known as “Thumb”, favors program memory consumption over speed. Most instructions are 2 bytes in size, and some are 4 bytes.

The 64bit Cortex-A application processors support all three instruction sets, while the 32bit ones only A32 and T32. The Cortex-M microcontrollers only support T32. Therefore, this tutorial will only talk about “thumb2”, the second version of the “T32” instruction set.

== Prerequisites ==
First, suitable hardware and software need to be selected for demonstrating the usage of assembly language. For this tutorial, the choice of the specific microcontroller is of no great significance. However, to ensure that the example codes are easily transferable to your setup, it is recommended to use the same components.

=== Microcontroller selection ===
For the microcontroller, an [https://www.st.com/en/microcontrollers-microprocessors/stm32f103c8.html STM32F103C8] or [https://www.st.com/en/microcontrollers-microprocessors/stm32f103rb.html STM32F103RB] by STMicroelectronics will be used. Both controllers are identical except for the flash size (64 KiB vs 128 KiB) and number of pins (48 vs 64). These controllers belong to ST’s “mainstream” entry-level family and are quite popular among hobbyist developers with many existing online resources. Several development boards with these controllers are available, for example: [https://www.st.com/en/evaluation-tools/nucleo-f103rb.html Nucleo-F103], “[https://os.mbed.com/users/hudakz/code/STM32F103C8T6_Hello/ Blue Pill]” (search for “stm32f103c8t6” on AliExpress, Ebay or Amazon), [https://www.olimex.com/Products/Duino/STM32/OLIMEXINO-STM32/open-source-hardware Olimexino-STM32], [https://www.olimex.com/Products/ARM/ST/STM32-P103/ STM32-P103], [https://www.olimex.com/Products/ARM/ST/STM32-H103/ STM32-H103], [https://www.st.com/en/evaluation-tools/stm3210e-eval.html STM3210E-EVAL].

=== Processor type & documentation ===
First, the microcontroller manufacturer’s documentation is used to find out what kind of ARM processor core and architecture is used for the chosen chip. This information is used to find all the relevant documentation.

* The first source of information is the [https://www.st.com/resource/en/datasheet/stm32f103rb.pdf STM32F103RB/C8 datasheet]. According to the headline, this is a '''medium-density''' device. This term is ST-specific and denotes a product family with certain features. The very first paragraph states that this microcontroller uses a '''Cortex-M3''' processor core with 72 MHz. This document also contains the electrical characteristics and pinouts.
* The next important document is the [https://www.st.com/resource/en/reference_manual/cd00171190.pdf STM32F103 reference manual] that contains detailed descriptions of the periphery. Particularly, detailed information about periphery registers and bits can be found here.
* The [https://developer.arm.com/ip-products/processors/cortex-m/cortex-m3 ARM developer website] provides information about the Cortex-M3 processor core, particularly the [https://static.docs.arm.com/100165/0201/arm_cortexm3_processor_trm_100165_0201_01_en.pdf ARM Cortex-M3 Processor Technical Reference Manual]. According to chapter 1.5.3, this processor implements the '''ARMv7-M architecture'''.
* The architecture is documented in the [https://static.docs.arm.com/ddi0403/ed/DDI0403E_d_armv7m_arm.pdf ARMv7M Architecture Reference Manual]. Particularly, it contains the complete documentation of the instruction set.

For any serious STM32 development, you should be familiar with all these documents.

=== Debug adapter ===
There are many different ways of getting your program to run on an STM32 controller. A debug adapter is not only capable of writing software to the controller’s flash, but can also analyze the program’s behavior while it is running. This allows you to run the program one instruction at a time, analyze program flow and memory contents and find the cause of crashes. While it is not strictly necessary to use such a debugger, it can save a lot of time during development. Since entry-level models are available cheaply, not using one doesn’t even save money. Debuggers connect to a host PC via USB (some via Ethernet) and to the microcontroller (“target”) via JTAG or SWD. While these two interfaces are closely related and perform the same function, SWD uses fewer pins (2 instead of 4, excluding reset and ground). Most STM32 controllers support JTAG, and all support SWD.

Documenting all possible ways of flashing and debugging STM32 controllers is beyond the scope of this tutorial; a lot of information is already available online on that topic. Therefore, this tutorial will assume that the [https://www.st.com/en/development-tools/st-link-v2.html ST-Link] debug adapter by STMicroelectronics is used, which is cheap and popular among hobbyists. Some of the aforementioned boards even include an ST-Link adapter, which can also be used “stand-alone” to flash an externally connected microcontroller. The examples should work with other adapters as well; please consult the appropriate documentation on how to use them.

=== Development Software ===
On the software part, several tools are needed for developing microcontroller firmware. Using a complete Integrated Development Environment (IDE) saves time and simplifies repetitive steps but hides some important steps that are necessary to gain a basic understanding of the process. Therefore, this tutorial will show the usage of the basic command line tools to demonstrate the underlying principles. Of course, for productive development, using an IDE is a sensible choice. The tools presented will work on Windows, Linux and Mac OS X (untested).

First, a text editor for writing assembly code is needed. Any good editor such as Notepad++, gedit or Kate is sufficient. When using Windows, the [https://www.st.com/en/development-tools/stsw-link004.html ST-Link Utility] can be useful, but is not strictly required.

Next, an assembler toolchain is needed to translate the written assembly code into machine code. For this, the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm GNU Arm Embedded Toolchain] is used. This is a collection of open source tools for writing software in Assembly, C and C++ for Cortex-M microcontrollers. Even though the package is maintained by ARM, the software is created by a community of open-source developers. For this tutorial, only the contained applications “binutils” (includes assembler & linker) and “GDB” (debugger) are really needed, but if you later decide to work with C or C++ code, the contained compilers will come in handy. Apart from that, this package is also shipped as part of several IDEs such as SW4STM32, Atollic TrueSTUDIO, emIDE, Embedded Studio and even Arduino – so if you (later) wish to work with one of these, your assembly code will be compatible with it.

Another component is required to talk with the debug adapter. For the ST-Link, this is done by [http://openocd.org/ OpenOCD], which communicates with the adapter via USB. Other adapters such as the J-Link ship with their own software.

Lastly, a calculator that supports binary and hexadecimal modes can be very helpful. Both the default Gnome calculator and the Windows calculator (calc.exe) are suitable.

== Setup ==
Follow the instructions in the next chapters to set up your development environment.

=== Hardware ===
The only thing that needs to be done hardware-wise is connecting the debugger with your microcontroller. If you are using a development board with an integrated debugger (such as the Nucleo-F103), this is achieved by setting the jumpers accordingly (see the board’s documentation – for e.g. the Nucleo-F103, both “CN2” jumpers need to be connected). When using an external debugger, connect the “GND”, “JTMS/SWDIO” and “JTCK/SWCLK” pins of debugger and microcontroller. Connect the debugger’s “nRESET” (or “nTRST” if it only has that) pin to the microcontroller’s “NRST” input.

If your board has jumpers or solder bridges for the “BOOT0” pin, make sure that the pin is low. Applying power to the microcontroller board is typically done via USB.

=== Software ===
==== Linux ====
Some Linux distributions ship with packages for the ARM toolchain. Unfortunately, these are often outdated and also configured slightly differently than the aforementioned package maintained by ARM. Therefore, to be consistent with the examples, it is strongly recommended to use the package by ARM.

Download the Linux binary tarball from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and extract it to some directory whose path does not contain any spaces. The extracted directory contains a subdirectory called “bin”. Copy the full path to that directory (e.g. “/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin”).

Add this path to the “PATH” environment variable. On Ubuntu/Debian systems, this can be done via:

<source lang=sh>
echo 'export PATH="${PATH}:/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin"' | sudo tee /etc/profile.d/gnu-arm-embedded.sh
</source>

OpenOCD can be installed via the package manager, e.g. (Ubuntu/Debian):
<source lang=sh>
sudo apt-get install openocd
</source>

After that, log out and back in (or just reboot). In a terminal, type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source lang=sh>
$ arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source lang=sh>
$ openocd -v
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

==== Windows ====
[[File:ArmAsmTutorial_GccOptions.png|300px|thumb|right|Options for installing GCC]]Download the Windows installer from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and run it. Enable the options “Add path to environment variable” and “Add registry information”, and disable “Show Readme” and “Launch gccvar.bat”.

A Windows package for OpenOCD can be obtained from the [https://github.com/gnu-mcu-eclipse/openocd/releases gnu-mcu-eclipse downloads page]. Download the appropriate file, e.g. “gnu-mcu-eclipse-openocd-0.10.0-12-20190422-2015-win64.zip”. The archive contains a path like “GNU MCU Eclipse/OpenOCD/0.10.0-12-20190422-2015”. Extract the contents of the inner directory (i.e. the subdirectories “bin”, “doc”, “scripts”…) into some directory whose path does not contain any spaces, e.g. “C:\OpenOCD”. You should now have a directory “C:\OpenOCD\bin” or similar. Copy its full path.

[[File:ArmAsmTutorial_PcProperties.png|300px|thumb|right|Opening PC properties]][[File:ArmAsmTutorial_SetEnvVar.png|300px|thumb|right|Setting environment variable]]Set the “Path” environment variable to include this path: Right-Click on “This PC”, then “Properties” → “Advanced System Settings”→ “Environment Variables”. In the lower list (labeled “System variables”), select “Path”. Click “Edit” → “New”, paste the path, and click “OK” multiple times.

Open a ''new'' command window (Windows Key + R, type “cmd” + Return). Type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source>
C:\>arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source>
C:\>openocd -v
GNU MCU Eclipse OpenOCD, 64-bitOpen On-Chip Debugger 0.10.0+dev-00593-g23ad80df4 (2019-04-22-20:25)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

== Writing assembly applications ==
The full source code of the examples in the following chapters can be found on [https://github.com/Erlkoenig90/ArmAsmTutorial GitHub]. The name of the corresponding directory is given after each example code below.
=== First rudimentary program ===
After the software setup, you can begin setting up a first project. Create an empty directory for that, e.g. “prog1”.

Inside the project directory, create your first assembly file “prog1.S” (“.S” being the file name extension for assembly files in GNU context) with the following content:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

nop @ Do Nothing
b . @ Endless loop
</source>
Example name: “EmptyProgram”

When this file is sent to the assembler, it will translate the instructions into binary machine code, with 2 or 4 bytes per instruction. These bytes are concatenated to form a program image, which is later written into the controller’s flash memory. Therefore, assembly code more or less directly describes flash memory contents.

The lines starting with a dot “.” are assembler directives that control the assembler’s operation. Only some of those directives emit bytes that will end up in flash memory. The @ symbol starts a comment.

The first line lets the assembler use the new “unified” instruction syntax (“UAL” - Unified Assembler Language) instead of the old ARM syntax. The second line declares the target processor, a Cortex-M3, which the assembler needs to know in order to recognize the instructions available on that processor. The third line instructs the assembler to use the Thumb (T32) instruction set. We can’t start putting instructions in flash memory right away, as the processor expects a certain data structure to reside at the very beginning of the memory. This is what the “.word” and “.space” directives create. These will be explained later.

The first “real” instruction is “nop”, which will be the first instruction executed after the processor starts. “nop” is short for “No OPeration” - it causes the processor to do nothing and continue with the next instruction. This next instruction is “b .”. “b” is short for “branch” and instructs the processor to jump to a certain “target” location, i.e. execute the instruction at that target next. In assembly language, the dot “.” represents the current location in program memory. Therefore, “b .” instructs the processor to jump to this very instruction, i.e. execute it again and again in an endless loop. Such an endless loop is frequently found at the end of microcontroller programs, as it prevents the processor from executing random data that is located in flash memory after the program.
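As a preview of the explanation that follows later, the start of the program can be annotated with the flash offset of each item. The interpretation of the two “.word” values is inferred from the debugger output shown further below (“msp” reads 0x20000400, and “b .” sits at 0x80000ee), so treat the comments as a sketch rather than a full description:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400 @ Offset 0x00: initial value of the stack pointer
.word 0x080000ed @ Offset 0x04: address of "nop" (0x080000ec), bit 0 set for Thumb code
.space 0xe4 @ Offset 0x08-0xeb: 0xe4 zero bytes filling the rest of the structure

nop @ Offset 0xec, i.e. address 0x080000ec
b . @ Offset 0xee, i.e. address 0x080000ee
</source>
The offsets add up as 4 + 4 + 0xe4 = 0xec bytes, which is why the first instruction ends up at address 0x080000ec when flash memory starts at 0x08000000.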

To translate this assembly code, open a terminal (Linux) / command window (Windows). Enter the project directory by typing <code>cd <Path to Project Directory></code>. Call the assembler like this:

<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
</source>

This instructs the assembler to translate the source file “prog1.S” into an object file “prog1.o”. This is an intermediary file that contains binary machine code, but is not a complete program yet. The “-g” option tells the assembler to include debug information, which does not influence the program itself, but makes debugging easier. To turn this object file into a final program, call the linker like this:

<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000
</source>

This creates a file “prog1.elf” that contains the whole generated program. The “-Ttext” option instructs the linker to assume 0x8000000 as the start address of the flash memory. The linker might output a warning like this:

<pre>arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000</pre>

This is not relevant for executing the program without an operating system and can be ignored.

=== Flashing the program ===
To download the compiled application to the microcontroller that has been attached via ST-Link, use OpenOCD like so:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg -c "program prog1.elf verify reset exit"
</source>
Unfortunately, the application does not do anything that can be observed from the outside, except perhaps increase the current consumption.
=== Starting the debugger ===
To check whether the program is actually running, start a debugging session to closely observe the processor’s behavior. First, run OpenOCD such that it acts as a GDB server:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg
</source>
Then, open a new terminal/command window and start a GDB session:
<source lang=sh>
arm-none-eabi-gdb prog1.elf
</source>
GDB provides its own interactive text-based user interface. First, type this command to let GDB connect to the already running OpenOCD instance:
<source>
target remote :3333
</source>
Then, stop the currently running program:
<source>
monitor reset halt
</source>
If this fails, hold your board’s reset button just before executing the command and repeat until it succeeds. GDB can also download code to flash memory by simply typing:
<source>
load
</source>
This will overwrite the previously flashed program (which, in this case, is identical anyway). After loading the program, reset the controller again:
<source>
monitor reset halt
</source>
Now, examine the contents of the CPU registers:
<source>
info reg
</source>
The output should look something like
<pre>
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
sp 0x0 0x0
lr 0x0 0
pc 0x8000000 0x8000000 <_stack+133693440>
xPSR 0x1000000 16777216
msp 0x20000400 0x20000400
psp 0x27e3fa34 0x27e3fa34
primask 0x0 0
basepri 0x0 0
faultmask 0x0 0
control 0x0 0
</pre>

At this point, the processor is ready to start executing your program. The processor is halted just before the first instruction, which is “nop”. You can let the processor execute one single instruction (i.e. the “nop”) by typing
<source>
stepi
</source>
If you type <code>info reg</code> again, you will see that PC is now “0x80000ee”, i.e. the processor is about to execute the next instruction, “b .”. When you run <code>stepi</code> again (repeatedly), nothing more will happen – the controller is stuck in the mentioned endless loop, exactly as intended. You can instruct the processor to run the program continuously, without stopping after each instruction, by typing
<source>
continue
</source>
You can interrupt the running program by pressing “Ctrl+C”. Run the commands
<source>
kill
quit
</source>
to exit GDB. You can terminate OpenOCD by pressing “Ctrl+C” in its terminal.
=== Using processor registers ===
The example program hasn’t done anything useful, but any “real” program will need to process some data. On ARM, any data processing is done via the processor registers. The 32bit ARM platforms have 16 processor registers, each of which is 32bit in size. The last three of those (r13-r15) have a special meaning and can only be used with certain restrictions. The first thirteen (r0-r12) can be used freely by the application code for data processing.

All calculations (e.g. addition, multiplication, logical and/or) need to be performed on those processor registers. To process data from memory, it first has to be loaded into a register, then processed, and stored back into memory. This is typical for RISC platforms and is known as a “load-store-architecture”.

As the starting point for any calculation, some specific values need to be put into the registers. The easiest way to do that is:
<source lang=asm>
ldr r0, =123456789
</source>
The number 123456789 will be encoded as part of the program, and the instruction lets the processor copy it into the register “r0”. Any number and any register in the range r0-r12 can be used instead.

The instruction “mov” can be used to copy the contents from one register to another:
<source lang=asm>
mov r1, r0
</source>
This copies r0 to r1. Unlike some other processor architectures, “mov” can not be used to access memory, but only the processor registers.

In ARM, 32bit numbers are called "words" and are most frequently used. 16bit numbers are known as half-words, and 8bit numbers as bytes, as usual.
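To illustrate the three sizes, the following sketch places one value of each kind into the program image, in the same way the “.word” lines at the start of the example program do. The values are arbitrary, and the fragment is meant as a standalone data illustration, not as something to paste in front of instructions:
<source lang=asm>
.word 0x12345678 @ One word: 32 bits, 4 bytes
.hword 0x1234 @ One half-word: 16 bits, 2 bytes
.byte 0x12 @ One byte: 8 bits
.byte 0x34 @ A second byte, so the fragment totals 8 bytes
</source>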
=== Accessing periphery ===
To write microcontroller programs that interact with the outside world, access to the controller’s periphery modules is required. Interaction with periphery happens mainly through periphery registers (also known as “special function registers”, SFR). Despite their name, they work quite differently from processor registers. Instead of numbers, they have addresses (in the range of 0x40000000-0x50000000) that are not contiguous (i.e. there are gaps). They cannot be used directly for data processing, but need to be explicitly read and written before and after any calculations. Not all of them are 32 bits wide; many have only 16 bits, and some of those bits may not exist and can’t be accessed. The microcontroller manufacturer’s documentation uses names for these registers, but the assembler doesn’t know these. Therefore, the assembly code needs to use the numerical addresses.

The easiest way to get the microcontroller to do something that produces some visible result is to send a signal via an output pin to turn on an LED. Using a pin to send/receive arbitrary software-defined signals is called “GPIO” (General Purpose Input/Output). First, choose a pin – for example, PA8 (this one is available on all package variants). Connect an LED to this pin and to GND (“active high”). Use a series resistor to limit the current to max. 15mA (the absolute maximum being 25mA), e.g. 100Ω for a 3.3V supply and a standard LED. For higher loads (e.g. high-power LEDs or a relay) use an appropriate transistor.

As with most microcontrollers, the pins are grouped into so-called “ports”, each of which has up to 16 pins. The ports are named by letters of the alphabet, i.e. “GPIOA”, “GPIOB”, “GPIOC” etc. The number of ports and pins varies among the individual microcontroller types. The 16 pins of one port can be read or written in one single step.

==== Clock Configuration ====
Many ARM controllers feature a certain trap: Most periphery modules are disabled by default to save power. The software has to explicitly enable the needed modules. On STM32 controllers, this is done via the “RCC” (Reset and Clock Control) module. Particularly, this module allows the software to disable/enable the clock signal for each periphery module. Because MOSFET-based circuits (virtually all modern ICs) draw most of their power only while a clock signal is applied, turning off the clock of unused modules can reduce the power usage considerably.

This is documented in the aforementioned reference manual in chapter 7. The subchapter 7.3.7 describes the periphery register “RCC_APB2ENR” which allows you to configure the clock signal for some peripheral modules. This register has 32 bits, of which 14 are “reserved”, i.e. can’t be used and should only be written with zeroes. Each of the available 18 bits enables one specific periphery module if set to “1” or disables it if set to “0”. According to the manual, the reset value of this register is 0, so all periphery modules are disabled by default. In order to turn on the GPIOA module to which the desired pin PA8 belongs, the bit “IOPAEN” needs to be set to “1”. This is bit number two in the register. Since registers can only be accessed as a whole (individual bits can’t be addressed), a 32bit-value where bit two is “1” and all others are kept as “0” needs to be written. This value is 0x00000004.

To write to the register, its address needs to be given in the code. The addresses of the periphery registers are grouped by the periphery modules they belong to - each periphery module (e.g. RCC, GPIOA, GPIOB, USB, …) has its own base address. The addresses of the individual registers are specified as an offset that needs to be added to this base address to obtain the full absolute address of the register. Chapter 7.3.7 specifies the offset address of RCC_APB2ENR as “0x18”. Chapter 3.3 specifies the base addresses of all periphery modules – RCC is given as “0x40021000”. So, the absolute address of RCC_APB2ENR is “0x40021000 + 0x18 = 0x40021018”.

In short: To enable GPIOA, the value '''0x00000004''' needs to be written to address '''0x40021018'''.

According to the “load-store” principle, ARM processors can’t do this in a single step. Both the value to be written and the address need to reside in processor registers in order to perform the write access. So, what needs to be done is:
* Load the value 0x00000004 into a register
* Load the value 0x40021018 into another register
* Store the value from the first register into the memory location specified by the second register.
This last step is performed by the “STR” instruction as follows:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
The square brackets are required but just serve as a reminder to the programmer that the contents of “r1” is used as an address. After the “str” instruction, the GPIOA periphery is enabled, but doesn’t do anything yet.
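The address and bit arithmetic from the reference manual can also be left to the assembler. The following sketch assumes the “.equ” directive of the GNU assembler for defining named constants; the names RCC_BASE, RCC_APB2ENR and RCC_APB2ENR_IOPAEN are chosen here for illustration and are not known to the assembler by themselves:
<source lang=asm>
.equ RCC_BASE, 0x40021000 @ Base address of RCC (chapter 3.3)
.equ RCC_APB2ENR, RCC_BASE + 0x18 @ Base + offset = 0x40021018
.equ RCC_APB2ENR_IOPAEN, 1 << 2 @ Bit number two = 0x00000004

ldr r0, =RCC_APB2ENR_IOPAEN
ldr r1, =RCC_APB2ENR
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
</source>
These three instructions should behave exactly like the ones in the program above, since the assembler replaces the names with the same numbers before encoding them.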

==== GPIO Configuration ====
By default, all GPIO pins are configured as “input”, even if there is no software to process the input data. Since inputs are “high-impedance”, i.e. only a very small current can flow into/out of the pin, the risk of (accidental) short-circuits and damage to the microcontroller is minimized. However, this current is too small to light up an LED, so you have to configure the pin PA8 as “output”. The STM32 support multiple output modes, of which the right one for the LED is “General Purpose Output Push-Pull, 2 MHz”.

Access and configuration of GPIO pins is achieved via the registers of the GPIO periphery. The STM32 have multiple identical instances of GPIO modules, which are named GPIOA, GPIOB, … Each of those instances has a distinct base address, which is again described in chapter 3.3 of the reference manual (e.g. “0x40010800” for GPIOA, “0x40010C00” for GPIOB etc.). The registers of the GPIO module are described in chapter 9.2, and there is one instance of each register per GPIO module. To access a specific register of a specific GPIO module, the base address of that module needs to be added to the offset address of the register. For example, “GPIOA_IDR” has address “0x40010800 + 0x08 = 0x40010808”, while “GPIOB_ODR” has address “0x40010C00 + 0x0C = 0x40010C0C”.

Configuration of the individual GPIO pins happens through the “GPIOx_CRL” and “GPIOx_CRH” registers (“x” is a placeholder for the concrete GPIO module) – see chapters 9.2.1 and 9.2.2. Both registers are structured identically, where each pin uses 4 bits, so each of the two registers handles 8 pins in 8x4=32 bits. Pins 0-7 are configured by “GPIOx_CRL” and pins 8-15 by “GPIOx_CRH”. Pin 0 is configured by bits 0-3 of “GPIOx_CRL”, pin 1 by bits 4-7 of “GPIOx_CRL”, pin 8 by bits 0-3 of “GPIOx_CRH” and so on.

The 4 bits per pin are split into two 2-bit fields: “MODE” occupies bits 0-1, and “CNF” bits 2-3. “MODE” selects from input and output modes (with different speeds). In output mode, “CNF” determines whether the output value is configured from software (“General Purpose” mode) or driven by some other periphery module (“Alternate function” mode), and whether two transistors (“Push-pull”) or one (“open-drain”) are used to drive the output. In input mode, “CNF” selects from analog mode (for ADC), floating input and input with pull-up/down resistors (depending on the value in the “GPIOx_ODR” register).

Therefore, to configure pin PA8 into “General Purpose Output Push-Pull, 2 MHz” mode, bits 0-3 of “GPIOA_CRH” need to be set to value “2”. The default value of “4” configures the pin as “input”. To keep the other pins at their “input” configuration, the value “0x44444442” needs to be written to register “GPIOA_CRH”, which has address “0x40010804”:
<source lang=asm>
ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
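How the value “0x44444442” is composed from the “MODE” and “CNF” fields can be spelled out with assembler expressions. This is only a sketch: the constant names are made up for this example, and “.equ” is assumed as the GNU assembler’s way of defining them:
<source lang=asm>
.equ GPIOA_CRH, 0x40010800 + 0x04 @ GPIOA base + offset = 0x40010804
.equ MODE_OUTPUT_2MHZ, 2 @ Value for the MODE field (bits 0-1)
.equ CNF_GP_PUSHPULL, 0 @ Value for the CNF field (bits 2-3)
.equ PA8_CONFIG, (CNF_GP_PUSHPULL << 2) | MODE_OUTPUT_2MHZ @ = 2
.equ CRH_VALUE, 0x44444440 | PA8_CONFIG @ Pins 9-15 stay at 4 (input) = 0x44444442

ldr r0, =CRH_VALUE
ldr r1, =GPIOA_CRH
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
Pin 8 is the first pin handled by “GPIOA_CRH”, so its four bits sit at positions 0-3 and no further shifting is needed.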

==== Writing GPIO pins ====
The GPIO pin still outputs the default value, which is 0 for “low”. To turn on the LED, the output has to be set to “1” for “high”. This is achieved via the GPIOA_ODR register, which has 16bits, one for each pin (see chapter 9.2.4). To enable the LED, set bit 8 to one:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x100
ldr r1, =0x4001080C
str r0, [r1] @ Set ODR8 in GPIOA_ODR to 1 to set PA8 high

b .
</source>
Example name: “SetPin”

This program enables the GPIOA periphery clock, configures PA8 as output, and sets it to high. If you run it on your microcontroller, you should see the LED turn on – the first program to have a visible effect!

=== Data processing ===
ARM supports many instructions for mathematical operations. For example, addition can be performed as:
<source lang=asm>
ldr r0, =222
ldr r1, =111
add r2, r0, r1
</source>
This will first load the value 222 into register r0, load 111 into r1, and finally add r0 and r1 and store the result (i.e. 333) in r2. The operand for the result is (almost) always put on the left, while the input operand(s) follow on the right.

You can also overwrite an input register with the result:
<source lang=asm>
add r0, r0, r1
</source>
This will write the result to r0, overwriting the previous value. This is commonly shortened to
<source lang=asm>
add r0, r1
</source>
The output operand can be omitted, and the first input (here: r0) will be overwritten. This applies to most data processing instructions. Other frequently used data processing instructions that are used in a similar fashion are:
* '''sub''' for subtraction
* '''mul''' for multiplication
* '''and''' for bitwise and
* '''orr''' for bitwise or
* '''eor''' for bitwise exclusive or (“xor”)
* '''lsl''' for logical left shift
* '''lsr''' for logical right shift
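A short sketch using each of the listed instructions with the three-operand form shown for “add”. The input values 14 and 3 are arbitrary; the results in the comments follow from ordinary integer arithmetic:
<source lang=asm>
ldr r0, =14
ldr r1, =3
sub r2, r0, r1 @ r2 = 14 - 3 = 11
mul r3, r0, r1 @ r3 = 14 * 3 = 42
and r4, r0, r1 @ r4 = 0b1110 and 0b0011 = 2
orr r5, r0, r1 @ r5 = 0b1110 or 0b0011 = 15
eor r6, r0, r1 @ r6 = 0b1110 xor 0b0011 = 13
lsl r7, r0, r1 @ r7 = 14 shifted left by 3 bits = 112
lsr r8, r0, r1 @ r8 = 14 shifted right by 3 bits = 1
</source>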

Most of these instructions can not only take registers as input, but also immediate arguments. Such an argument is encoded directly into the instruction without needing to put it into a register first. Immediate arguments need to be prefixed by a hash sign #, and can be decimal, hexadecimal or binary. For example,
<source lang=asm>
add r0, r0, #23
</source>
adds 23 to the register r0 and stores the result in r0. This can again be shortened to
<source lang=asm>
add r0, #23
</source>

Such immediate arguments can not be arbitrarily large, because they need to fit inside the instruction, which is 16 or 32 bit in size and also needs some room for the instruction and register numbers as well. So, if you want to add a large number, you have to use “ldr” first as shown to load it into a register.
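As a minimal sketch of that workaround, the following lines add the large number from the earlier “ldr” example to r0. The choice of r1 as a temporary register is arbitrary:
<source lang=asm>
ldr r1, =123456789 @ Too large for an immediate argument, so load it into a register first
add r0, r0, r1 @ r0 = r0 + 123456789
</source>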

Try out the above examples and use GDB to examine their behavior. Use GDB’s “info reg” command to display the register contents. Don't forget to execute both the “arm-none-eabi-as” and “arm-none-eabi-ld” commands to translate the program.
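For trying this out, the instructions need to be placed behind the same header as in the previous programs. One possible test program, assembled from the snippets above and ending in the usual endless loop, could look like this:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =222
ldr r1, =111
add r2, r0, r1 @ r2 = 222 + 111 = 333
add r0, #23 @ r0 = 222 + 23 = 245
b .
</source>
Stepping through it with “stepi” and checking “info reg” after each step should show r2 change to 333 and then r0 change to 245.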

=== Reading periphery registers ===
The last example works, but has a flaw: Even though only a few bits per register need to be modified, the code overwrites all the bits in the register at once. The bits that should not be modified are just overwritten with their respective default value. If some of those bits had been changed before – for example to enable some other periphery module – these changes would be lost. Keeping track of the state of the register throughout the program is hardly practical. Since ARM does not permit modifying individual bits, the solution is to read the whole register, modify the bits as needed, and write the result back. This is called a “read-modify-write” cycle.

Reading registers is done via the “ldr” instruction. As with “str”, the address needs to be written into a processor register beforehand, and the instruction stores the read data into a processor register as well. Starting with the “RCC_APB2ENR” register, you can read it via:
<source lang=asm>
ldr r1, =0x40021018
ldr r0, [r1]
</source>
Even though the two “ldr” instructions look similar, they work differently – the first one loads a fixed value into a register (r1), while the second loads data from the periphery register into r0.

The loaded value should then be modified by setting bit two to “1”. This can be done with the “orr” instruction:
<source lang=asm>
orr r0, r0, #4
</source>
After that, we can store r0 as before.

With the GPIOA_CRH register, it’s slightly more complicated: The bits 0, 2 and 3 need to be cleared, while bit 1 needs to be set to 1. The other bits (4-31) need to keep their value. To clear the bits, use the “and” instruction after loading the current periphery register value:

<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>

For the “GPIOx_ODR” registers, such tricks are not needed, as there is a special “GPIOx_BSRR” register which simplifies writing individual bits: This register can not be read, and writing zeroes to any bit has no effect on the GPIO state. However, if a 1 is written to any of the bits 0-15, the corresponding GPIO pin is set to high (i.e. the corresponding bit in ODR set to 1). If any of the bits 16-31 is written to 1, the corresponding pin is set to low. So, the pin can be set to 1 like this:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

So, the modified program is:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “SetPin2”
=== Jump instructions ===
For a traditional “hello world” experience, the LED should not only light up, but blink, i.e. turn on and off repeatedly. Setting pin PA8 to low level can be achieved by writing a 1 to bit 24 in the “GPIOA_BSRR” register:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1]
</source>

By pasting this after the instructions for turning on the LED, it will be turned on and off again. To get the LED to blink, those two blocks need to be repeated endlessly, i.e. at the end of the code there needs to be an instruction for jumping back to the beginning.

A simple endless loop was already explained: The “b .” instruction, which just executes itself repeatedly. To have it jump somewhere else, the dot needs to be replaced by the desired target address, for example:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b 0x8000104
</source>
Example name: “Blink”

The address specified is an absolute address, which is the address of the “ldr” instruction at the beginning of the block for setting the pin to high. Actually, the branch instruction “b” is not capable of jumping directly to such an absolute address - again, because a 32 bit wide address can't be encoded in a 16/32 bit wide instruction. Instead, the assembler calculates the distance between the jump target and the location of the “b” instruction, and stores it into the instruction. When jumping backwards, this distance is negative.

When executing program code, the processor always stores the address of the currently executed instruction plus four in the r15 register, which is therefore also known as PC, the program counter. When encountering a “b” instruction, the processor adds the contained distance value to the PC value to calculate the absolute address of the jump target before jumping there.

This means that “b” performs a relative jump, and even if the whole machine code section were moved somewhere else in memory, the code would still work. However, the assembly language syntax does not really represent this, as the assembler expects absolute addresses which it then transforms into relative ones.
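The “plus four” behavior of the program counter can be observed directly by copying r15 into another register. In this sketch, “pc” is used as the assembler’s name for r15; the address in the comment assumes that the “mov” takes the place of the “nop” in the first example program, i.e. sits at 0x080000ec:
<source lang=asm>
mov r0, pc @ r0 = address of this instruction + 4, here 0x080000f0
b .
</source>
After one “stepi”, “info reg” should show the value in r0 while PC itself is displayed as the address of the following “b .” instruction.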

Specifying the target address directly as shown is very impractical, as it has to be calculated manually, and if the section of code is moved or modified, the address needs to be changed. To rectify this, the assembler supports labels: You can assign a name to a certain code location, and use this name to refer to the code location instead of specifying the address as a number. A label is defined by writing its name followed by a colon:

<source lang=asm>
BlinkLoop:

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b BlinkLoop
</source>
Example name: “Blink2”

This is purely a feature of the assembler – the generated machine code will be identical to the previous example. In “b BlinkLoop”, the assembler substitutes the label for the address it represents to calculate the relative jump distance. The assembler actually provides no direct way of specifying the relative offset that will be encoded in the instruction, but it can be done like this:

<source lang=asm>
b (.+4+42*2)
</source>

The resulting instruction will contain “42” as the jump offset. As suggested by the syntax, the processor multiplies this number by 2 (since instructions can only reside at even memory addresses, it would waste one bit of memory to specify the number directly) and adds to it the address of the “b” instruction plus 4. The assembly syntax is designed to represent the end result of the operation, so the assembler reverses the peculiar pre-calculations of the processor. If you want to do this calculation yourself, you have to again undo the assembler’s own calculation with the expression shown above. There is usually no reason to do that, though.
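The same rule can be checked against the endless loop used so far. To jump to itself, a “b” instruction needs an encoded offset of -2, because the address of the instruction plus 4 plus (-2)·2 is again the address of the instruction. As a sketch, both of the following lines should therefore assemble to the same machine code:
<source lang=asm>
b . @ Endless loop, written the usual way
b (.+4+(-2)*2) @ The same loop, written with the explicit offset -2
</source>
Only the first of the two lines would ever be executed; the second is there purely for comparing the encodings.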

=== Counting Loops ===
The above example for a blinking LED does not really work yet – the LED blinks so fast the human eye can’t see it. The LED will just appear slightly dim. To achieve a proper blinking frequency, the code needs to be slowed down. The easiest way for that is to have the processor execute a large number of “dummy” instructions between setting the pin high and low. Simply placing many “nop” instructions isn’t possible though, as there is simply not enough program memory to store all of them. The solution is a loop that executes the same instructions a specific number of times (as opposed to the endless loops from the examples above). To do that, the processor has to count the number of loop iterations. It is actually easier to count ''down'' than up, so start by loading the desired number of iterations into a register and begin the loop by subtracting “1”:

<source lang=asm>
ldr r2, =1000000
subs r2, #1
</source>

Now, the processor should make a decision: If the register has reached zero, terminate the loop; else, continue by again subtracting “1”. The ARM math instructions can automatically perform some tests on the result to check whether it is positive/negative or zero and whether an overflow occurred. To enable those checks, append an “s” to the instruction name – hence, “subs” instead of “sub”. The result of these checks is automatically stored in the “Application Program Status Register” (APSR) – the contained bits N, Z, C, V indicate whether the result was negative, zero, set the carry bit or caused an overflow. This register is usually not accessed directly. Instead, use the conditional variant of the “b” instruction, where two letters are appended to indicate the desired condition. The jump is only performed if the condition is met; otherwise, the instruction does nothing. The available condition codes are described in the chapter “Condition Codes” of this tutorial. The conditions are formulated in terms of the mentioned bits of the APSR. For example, the “bne” instruction only performs a jump if the zero (Z) flag is ''not'' set, i.e. when the result of the last math instruction (with an “s” appended) was ''not'' zero. The “beq” instruction is the opposite of that – it only performs a jump if the result ''was'' zero.

So, to perform the jump back to the beginning of the loop, add a label before the “subs” instruction, and put a “bne” instruction after the “subs” that jumps to this label if the counter has not reached zero yet:
<source lang=asm>
ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop
</source>
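For comparison, the same countdown can be written with “beq”, at the cost of one extra instruction per iteration. The label name “delay1_done” is made up for this sketch:
<source lang=asm>
ldr r2, =1000000
delay1:
subs r2, #1
beq delay1_done @ Counter has reached zero: leave the loop
b delay1 @ Otherwise: next iteration
delay1_done:
</source>
The “bne” form above avoids the additional unconditional branch and is therefore used in the following examples.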

The actual loop consists only of the two instructions “subs” and “bne”. By placing two of those loops (with two different labels!) in between the blocks that turn the pins on and off, the blink frequency is lowered sufficiently such that it becomes visible:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

BlinkLoop:
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r2, =1000000
delay2:
subs r2, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay”

You might notice that the registers r0-r2 are loaded with the same values over and over again. To make the code both shorter and faster, take advantage of the available processor registers, and load the values that don’t change ''before'' the loop. Then, just use them inside the loop:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay2”

=== Using RAM ===
Until now, all data in the example code was stored in periphery or processor registers. In all but the simplest programs, larger amounts of data have to be processed, for which the thirteen general-purpose processor registers aren’t enough. For this, the microcontroller features a block of SRAM that stores 20 KiB of data. Accessing data in RAM works similarly to accessing periphery registers – load the address into a processor register and use “ldr” and “str” to read and write the data. After reset, the RAM contains just random ones and zeroes, so some value has to be stored before the first read access.

As the programmer decides what data to place where, they have to keep track of which address in memory contains what piece of data. You can use the assembler to help keep track by declaring what kind of memory blocks you need and giving them names. To do this, you must first tell the assembler that the next directives refer to data instead of instructions with the “.data” directive. Then, use the “.space” directive for each block of memory you need. To assign names to the blocks, place a label definition (using a colon) right ''before'' that. After the definitions, put a “.text” directive to make sure the instructions after that will properly go to program memory (flash):

<source lang=asm>
.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text
@ Instructions go here...
</source>

Here, a data block of 4 bytes is reserved and named “var1”. Another block of 1 byte is named “var2”. Note that just inserting these lines will not modify the assembler output – these are just instructions to the assembler itself. To access these memory blocks, you can use “var1” and “var2” just like literal addresses. Load them into registers and use these with “ldr” and “str” like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text

ldr r0, =var1 @ Get address of var1
ldr r1, =0x12345678
str r1, [r0] @ Store 0x12345678 into memory block “var1”

ldr r1, [r0] @ Read memory block “var1”
and r1, #0xFF @ Set bits 8..31 to zero
ldr r0, =var2 @ Get address of var2
strb r1, [r0] @ Store a single byte into var2

b .
</source>
Example name: “RAMVariables”

Note the use of “strb” - it works similarly to “str”, but only stores a single byte. Since the processor register r1 is of course 32 bits in size, only the lower 8 bits are stored, and the rest is ignored.
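
The counterpart for reading is “ldrb”, which loads a single byte and fills the upper 24 bits of the destination register with zeroes. As a minimal sketch, assuming the “var2” block from the example above has already been written, the byte could be read back like this (the choice of r2 is arbitrary):

<source lang=asm>
ldr r0, =var2 @ Get address of var2
ldrb r2, [r0] @ Read a single byte; bits 8..31 of r2 become zero
</source>

After the “RAMVariables” example has run, r2 should contain 0x78, the lowest byte of 0x12345678.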

There is still something missing – nowhere in the code is the address of the RAM specified. To tell the linker where the RAM is located, pass the option <code>-Tdata=0x20000000</code> to the <code>arm-none-eabi-ld</code> call; 0x20000000 is the address of the first byte of RAM. This program can't be flashed directly with OpenOCD, as OpenOCD doesn't recognize the RAM as such; GDB has to be used as explained above. When a linker script is used as described in the next chapters (using the NOLOAD attribute), OpenOCD can again be used directly.

If you run this program via GDB, you can use the commands <code>x/1xw &var1</code> and <code>x/1xb &var2</code> to read the data stored in memory. After this quick introduction, a more abstract overview is in order.

== Memory Management ==
If there is one thing that sets higher and lower level programming languages apart, it’s probably memory management. Assembly programmers have to think about memory, addresses, layout of program and data structures all the time. Assembler and linker provide some help which needs to be used effectively. Therefore, this chapter will explain some more fundamentals of the ARM architecture and how the toolchain works.

=== Address space ===
In the examples so far, addresses were used for periphery register accesses and jump instructions without really explaining what they mean, so it’s time to catch up with that. To access periphery registers and memory locations in any memory type (RAM, Flash, EEPROM…), an address is required, which identifies the desired location. On most platforms, addresses are simply unsigned integers. The set of all possible addresses that can be accessed in a uniform way is called an “address space”. Some platforms such as AVR have multiple address spaces (for Flash, EEPROM, and RAM+periphery) where each memory needs to be accessed in a distinct way and the programmer needs to know which address space an address belongs to – e.g. all three memory types have a memory location with address 123.

However, the ARM architecture uses only a single large address space where addresses are 32bit unsigned integers in the range of 0-4294967295. Each address refers to one byte of 8 bits. The address space is divided into several smaller ranges, each of which refers to a specific type of memory. For the STM32F103, this is documented in the datasheet in chapter 4. All addresses in all memory types are accessed in the same way – directly via the “ldr” and “str” instructions, or by executing code from a certain location, which can be achieved by jumping to the respective address with the “b” instruction. This also makes it possible to execute from RAM – simply perform a jump to an address that refers to some code located in RAM. Note that there are large gaps between the individual ranges in address space; attempting to access those usually leads to a crash.
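
To illustrate the uniform access, here is a small sketch that reads a word from flash and writes it to RAM with the same instructions used for periphery registers. The two addresses are assumptions based on the STM32F103 memory layout used throughout this tutorial (flash at 0x08000000, SRAM at 0x20000000):

<source lang=asm>
ldr r0, =0x08000000 @ Address of the first word in flash (start of the vector table)
ldr r1, [r0] @ Read it, exactly like reading a periphery register
ldr r0, =0x20000000 @ Address of the first word in SRAM
str r1, [r0] @ Write the value to RAM with the ordinary “str”
</source>

The processor does not need to be told which kind of memory is meant; the address alone selects it.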

While the addresses of periphery are fixed and defined by the manufacturer, the layout of program code and data in memory can be set by the programmer rather freely. Up until now, the example programs defined the flash memory contents in a linear fashion by listing the instructions in the order they should appear in flash memory. However, when translating multiple assembly source files into one program, the order in which the contents from those files appear in the final program isn’t defined a priori. Also, even though in the last example the memory blocks for RAM were defined ''before'' the code, the code actually comes first in address space. What makes all this work is the Linker.

=== The Linker ===
Usually the last step in translating source code into a usable program, the linker is an often overlooked, sometimes misunderstood but important and useful tool, if applied correctly. Many introductions to programming forgo explaining its workings in detail, but as in any trade, embedded development requires mastery of the tools! A good understanding of the linker can save time solving strange errors and allow you to implement some less common use cases, such as using multiple RAM blocks present in some microcontrollers, executing code from RAM or defining complex memory layouts as sometimes required by RTOSes.

[[File:ArmAsmTutorial_Linking.svg|300px|thumb|right|Translation of native applications using assembler, compiler and linker]]You have already used a linker – the command <code>arm-none-eabi-ld</code> calls the GNU linker that is shipped with the GNU toolchain. Until now, only one assembly source file was translated for each program. To translate a larger program that consists of three assembly files “file1.S”, “file2.S” and “file3.S”, the assembler would be called three times to produce three object code files “file1.o”, “file2.o” and “file3.o”. The linker would then be called to combine all three into a single output file.

When translating any of these assembly files, the assembler does not know of the existence of the other files. Therefore, it can’t know whether the contents of any other file will end up in flash memory before the currently processed file, and also can’t know the final location in flash memory of the machine code it is emitting and placing in the object file (ending .o). This means that the object file does not contain any absolute addresses (except for those of periphery registers, as these were specified explicitly). For example, when loading the address of the RAM data blocks (“ldr r0, =var1”) the assembler doesn’t know the address, only the linker does. Therefore, the assembler puts a placeholder in the object file that will be overwritten by the linker. A jump (“b” instruction) to a label defined in another assembly file works similarly; the assembler uses a placeholder for the address. For the jump instructions we used inside the same file (e.g. “b BlinkLoop”), a placeholder is not necessary, as the assembler can calculate the distance between the label and the instruction and generate the relative jump itself. However, if the target resides within a different section (see below), this isn’t possible, and a placeholder becomes necessary. As the contents of object files have no fixed address and can be moved around by the linker, these files are called relocatable.

On Unix systems (including Linux), the Executable and Linkable Format (ELF) is used for both object files and executable program files. This format is also used by ARM and the GNU ARM toolchain. Because it was originally intended to be used with operating systems, some of its concepts don’t map perfectly to the embedded use case. The object (.o) files created by the assembler, and also the final program created by the linker (usually no ending, but in embedded contexts and also in the above example commands, .elf is used) are all in ELF format. The specification of ELF for ARM can be found [https://developer.arm.com/architectures/system-architectures/software-standards/abi here], and the generic specification for ELF on which the ARM ELF variant is based can be found [http://www.sco.com/developers/gabi/2003-12-17/contents.html here].

ELF files are structured into sections. Each section may contain code, data, debug information (used by GDB) and other things. In an object file, the sections have no fixed address. In the final program file, they have one. Sections also have various attributes that indicate whether their contents are executable code or data, whether they are read-only and whether memory should be allocated for them. The linker combines and reorders the sections from the object files (“input sections”) and places them into sections in the final program file (“output sections”) while assigning them absolute addresses.

Another important aspect is symbols. A symbol defines a name for an address. The address of a symbol may be defined as an absolute number (e.g. 0x08000130) or as an offset relative to the beginning of a section (e.g. “start address of section .text plus 0x130”). Labels defined in assembly source code define symbols in the resulting object file. For example, the “var1” label defined in the last example results in a symbol “var1” in the “prog1.o” file whose address is set to be equal to the beginning of “.data”. The symbol “var2” is defined similarly, but with an offset of 4. After the linking process, the “prog1.elf” file contains a “.data” section with absolute address 0x20000000, and so the “var1” and “var2” symbols get absolute addresses as well.

As mentioned, the assembler puts placeholders in the object files when it doesn’t know the address of something. In ELF files, these placeholders are called “relocation entries” and they reference symbols by name. When the linker sees such a relocation entry in one of its input files, it searches for a symbol in the input files with a matching name and fills in its address. If no symbol with that name is found, it emits this dreaded error:

<pre>(.text+0x132): undefined reference to `Foo'</pre>

Google finds almost a million results for that message, but knowing how the linker operates makes it easy to understand and solve – since the symbol was not found in any object file, make sure it is spelled correctly and that the object file that contains it is actually fed to the linker.

=== Linker Scripts ===
A linker script is a text file written in a linker-specific language that controls how the linker maps input sections to output sections. The example project hasn’t explicitly specified one yet, which lets the linker use a built-in default one. This has worked so far, but results in a slightly mixed up program file (unsuitable symbols) and has some other disadvantages. Therefore, it’s time to do things properly and write a linker script. Linker scripts usually aren’t created on a per-project basis, but provided by the microcontroller manufacturer to fit a certain controller’s memory layout. To learn how they work, a quick introduction to writing one will follow. The full documentation can be found [https://sourceware.org/binutils/docs/ld/Scripts.html here].

It’s customary to name the linker script after the controller it is intended for, so create a text file “stm32f103rb.ld” or “stm32f103c8.ld” with the following contents:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
*(.data)
} >SRAM
}
</source>
Example name: “LinkerScriptSimple”

This is the minimum viable linker script for a microcontroller. If you are using an STM32F103C8, replace the 128K by 64K. The lines inside the “MEMORY” block define the available memory regions on your microcontroller by specifying their start address and size within the address space. The names “FLASH” and “SRAM” can be chosen arbitrarily, as they have no special meaning. This memory definition has no meaning outside of the linker script, as it is just an internal helper for writing the script; it can even be left out and replaced by some manual address calculations.

The interesting part happens inside the “SECTIONS” command. Each sub-entry defines an output section that will end up in the final program file. These can be named arbitrarily, but the names “.text” and “.data” for executable code and data storage respectively are usually used. The asterisk expressions “*(.text)” and “*(.data)” tell the linker to put the contents of the input sections “.text” and “.data” at that place in the output section. In this case, the names for the input sections and output sections are identical. The input section names “.data”, “.text” (and some more) are used by the assembler and C and C++ compilers by default, so even though they can be changed, it’s best to keep them. You can however name the output sections arbitrarily, for example:

<source>
SECTIONS {
.FlashText : {
*(.text)
} >FLASH

.RamData (NOLOAD) : {
*(.data)
} >SRAM
}
</source>

The commands “>FLASH” and “>SRAM” tell the linker to calculate the address of the output sections according to the respective memory declaration above: The first output section with a “>FLASH” command will end up at address 0x8000000, the next with “>FLASH” right after that section and so on. The “>SRAM” command works the same way with the start address “0x20000000”. The “NOLOAD” attribute does not change the linker’s behavior, but marks the corresponding output section as “not-loadable”, such that OpenOCD and GDB will not attempt to write it into RAM – the program has to take care of initializing any RAM data anyway when running stand-alone.

To specify the filename of the linker script, use the “-T” option:
<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -T stm32f103rb.ld
</source>
The <code>-Tdata</code> and <code>-Ttext</code> options aren’t needed anymore, as the addresses are now defined in the linker script.

Since the linker script defines the sizes of the memory regions, the linker can now warn you when your program consumes too much memory (either flash or RAM):
<pre>arm-none-eabi-ld: prog1.elf section `.text' will not fit in region `FLASH'
arm-none-eabi-ld: region `FLASH' overflowed by 69244 bytes</pre>

==== Reserving memory blocks ====
Using the processor’s stack will be explained later, but you can already use the linker script to assign a memory block for it. It’s best to allocate memory for the stack at the ''beginning'' of SRAM, so put this before the “*(.data)” command:

<source>. = . + 0x400;</source>

Inside a linker script, the dot “.” refers to the current address in the output file; therefore, this command increments the address by 0x400, leaving an “empty” block of that size. The “.data” input section will be located after that, at address 0x20000400.

==== Defining symbols in linker scripts ====

As mentioned before, the controller requires a certain data structure called the “vector table” to reside at the very beginning of flash memory. It is defined in the assembler source file:

<source lang=asm>
.word 0x20000400
.word 0x080000ed
.space 0xe4
</source>

The “.word” directive tells the assembler to output the given 32bit-number. Just like processor instructions, these numbers are put into the current section (.text by default, .data if specified) and therefore end up in flash memory. The first 32bit-number, which occupies the first 4 bytes in flash memory, is the initial value of the stack pointer which will be explained later. This number should be equal to the address of the first byte ''after'' the memory block that was reserved for the stack. The reserved block starts at address 0x20000000 and has size 0x400, so the correct number is 0x20000400. However, if the size of the reserved block was modified in the linker script, the above assembly line needs to be adjusted as well. To avoid any inconsistencies, and to be able to manage everything related to the memory-layout centrally in the linker script, it is desirable to replace the number in the assembly source file with a symbol expression. To do this, define a symbol in the linker script:

<source>
.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
</source>
Example name: “LinkerScriptSymbols”

This will define a symbol “_StackEnd” to have the value of “.”, which is the current address, which at this point is 0x20000400. In the assembly source file, you can now replace the number with the symbol:

<source lang=asm>
.word _StackEnd
</source>

The assembler will put a placeholder in the object file, which the linker will overwrite with the value of 0x20000400. This modification will not change the output file, but avoids putting absolute addresses in source files. The name “_StackEnd” was chosen arbitrarily; since names that start with an underscore and a capital letter may not be used in C and C++ programs, there is no possibility of conflict if any C/C++ source is added later. Typically, all symbols that are part of the runtime environment and should be “invisible” to C/C++ code are named this way. The same rule applies to names starting with two underscores.

The second entry of the vector table is the address of the very first instruction to be executed after reset. Currently the address is hard-coded as the first address after the vector table. If you wanted to insert some other code before this first instruction, this number would have to be changed. This is obviously impractical, and therefore the number should be replaced by a label as well. Since the code executed at reset is commonly known as the “reset handler”, define it like that:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word _StackEnd
.word Reset_Handler
.space 0xe4

.type Reset_Handler, %function
Reset_Handler:

@ Put code here
</source>

The “.type” directive tells the assembler that the label refers to executable code. The exact meaning of this will be covered later. Leave the “.space” directive alone for now.

==== Absolute section placement ====
The vector table needs to be at the beginning of flash memory, and the examples have relied on the assembler putting the first things from the source file into flash memory first. This stops working if you use multiple source files. You can use the linker script to make sure the vector table is always at the beginning of flash memory. To do that, you first have to separate the vector table from the rest of the code so that the linker can handle it specially. This is done by placing the vector table in its own section:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:
</source>
Example name: “LinkerScriptAbsolutePlacement”

The “.section” directive instructs the assembler to put the following data into the custom section “.VectorTable”. The “a” flag marks this section as allocatable, which is required to have the linker allocate memory for it. To place the vector table at the beginning of flash memory, define a new output section in the linker script:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

This puts the .VectorTable input section into the equally-named output section. It is also possible to put it into .text alongside the code:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.VectorTable)
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

Even though both variants produce the same flash image, the first one is slightly nicer to work with in GDB. The modified LED-blinker application now looks like:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>

=== Program Structure ===
Because the vector table is usually the same for all projects, it is handy to move it into a separate file, for example called “vectortable.S”:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Assemble both source files with one assembler command each, then link the resulting object files:
<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
arm-none-eabi-as -g vectortable.S -o vectortable.o
arm-none-eabi-ld prog1.o vectortable.o -o prog1.elf -T stm32f103rb.ld
</source>

This will result in the dreaded “undefined reference” error. To fix this, use the “.global” directive in the main source file “prog1.S”:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
@ Code here ...
</source>

This will tell the assembler to make the symbol “Reset_Handler” visible globally, such that it can be used from other files. By default, the assembler creates a ''local'' symbol for each label, which can’t be used from other source files (same as ''static'' in C). The symbol is still there in the final program file, though - it can be used for debugging purposes.
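
The same mechanism works for data labels. The following sketch assumes a hypothetical additional file “variables.S” and a hypothetical symbol name “Counter”; neither is part of the example projects:

<source lang=asm>
@ --- variables.S (hypothetical file) ---
.syntax unified
.cpu cortex-m3
.thumb

.data
.global Counter @ Make the symbol visible to other files
Counter:
.space 4 @ Reserve 4 bytes for “Counter”

@ --- prog1.S ---
ldr r0, =Counter @ Not defined in this file, so the assembler emits a relocation entry
mov r1, #0
str r1, [r0] @ Initialize “Counter” with zero
</source>

Without the “.global” line, linking the two object files would again fail with an “undefined reference” error, this time for “Counter”.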

== More assembly techniques ==
After having set up the project for using the linker properly, some more aspects of assembly programming will be introduced.

=== Instruction set state ===
As mentioned before, ARM application processors support both the T32 and A32/A64 “ARM” instruction sets, and are capable of dynamically switching between them. This can be used to encode time-critical program parts in the faster A32/A64 instruction set, and less critical parts in the T32 “thumb” instruction set to save memory. Actually, reducing program size may improve performance too, because the cache memories may become more effective.

Even though the Cortex-M microcontrollers based on the ARMv7-M architecture do not support the A32/A64 instruction sets, some of the switching-logic is still there, requiring the program code to work accordingly. The switch between the instruction sets happens when jumping with the “bx” “Branch and Exchange” and “blx” “Branch with Link and Exchange” instructions. Since all instructions are of size 2 or 4, and code may only be stored at even addresses, the lowest bit of the address of any instruction is always zero. When performing a jump with “bx” or “blx”, the lowest bit of the target address is used to indicate the instruction set of the jump target: If the bit is “1”, the processor expects the code to be T32, else A32.

Another specialty of the “bx” and “blx” instructions is that they take the jump target address from a register instead of encoding it in the instruction directly. This is called an indirect jump. An example of such a jump is:

<source lang=asm>
ldr r0, =SomeLabel
bx r0
</source>

Such indirect jumps are necessary if the difference of the jump target address and the jump instruction is too large to be encoded in the instruction itself for a relative jump. Also, sometimes you want to jump to an address that has been passed from another part of the program, which e.g. happens in C/C++ code when using function pointers or virtual functions.

In these cases, you need to make sure that the address passed to “bx/blx” via a register has the lowest bit set, to indicate that the target code is T32. Otherwise, the code will crash. This can be achieved by telling the assembler that the target label refers to code (and not data) via the already mentioned “.type” directive:

<source lang=asm>
.type SomeLabel, %function
SomeLabel:
@ Some code...
</source>

That way, when you refer to the label to load its address into a register, the lowest bit will be set. Actually, using “.type” for all code labels is a good idea, even though it does not matter if you only refer to a label via the “b” instruction (including the conditional variant) which does not encode the lowest bit and does not attempt to perform an instruction set switch.
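
To illustrate what “.type” saves you from, here is a sketch of the manual alternative. It assumes a hypothetical label “PlainLabel” that was declared without “.type” or any other directive marking it as a function, so that its address is loaded with the lowest bit cleared:

<source lang=asm>
ldr r0, =PlainLabel @ Lowest bit is 0, since the label is not marked as a function
orr r0, r0, #1 @ Set the lowest bit by hand to indicate T32
bx r0

PlainLabel:
@ Some code...
</source>

This is easy to forget and clutters the code, which is why declaring the label with “.type” is the preferable approach.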

As was already shown, there is another case where the lowest bit matters: when specifying the address of the reset handler (and later, exception handler functions) in the vector table, the bit must be set, so the “.type” directive is necessary here too:

<source lang=asm>
.type Reset_Handler, %function
</source>

If you were writing code for a Cortex-A processor, you would use “.arm” instead of “.thumb” to have your code (or performance critical parts of it) encoded as A32. The “.type” directive would be used as well, and the assembler would clear the lowest bit in the address to ensure the code is executed as A32. For example:

<source lang=asm>
.cpu cortex-a8
.syntax unified

@ Small but slower code here
.thumb

.type Block1, %function
Block1:
ldr r0, =Block2
bx r0

@ Larger but faster code here
.arm

.type Block2, %function
Block2:
@ ...
</source>

The directive “.code 32” has the same meaning as “.arm”, and “.code 16” the same as “.thumb” (although the name is slightly misleading, as T32 instructions can be 32 bit as well). There is also “.type Label, %object” to declare some label refers to data in flash or RAM; this is optional, but helps in working with analysis tools (see below).

=== Constants ===
The previous examples contain a lot of numbers (esp. addresses), the meaning of which is not obvious to the reader - so called “magic numbers”. As code is typically read many times more than written/modified, readability is important, even for assembly code. Therefore, it is common practice to define constants that assign names to numbers such as addresses, and use names instead of the number directly.

The assembler actually does not provide any dedicated mechanism for defining constants. Instead, symbols as introduced before are used. You can define a symbol in any of the following ways:
<source lang=asm>
RCC_APB2ENR = 0x40021018
.set GPIOA_CRH, 0x40010804
.equ GPIOA_ODR, 0x4001080C
</source>

and then use it in place of the number:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

Replacing (almost) all numbers in the source code for the LED blinker by constants yields a source code like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_10MHz = 1
GPIOx_CRx_GP_PP_2MHz = 2
GPIOx_CRx_GP_PP_50MHz = 3

GPIOx_CRx_GP_OD_10MHz = 1|4
GPIOx_CRx_GP_OD_2MHz = 2|4
GPIOx_CRx_GP_OD_50MHz = 3|4

GPIOx_CRx_AF_PP_10MHz = 1|8
GPIOx_CRx_AF_PP_2MHz = 2|8
GPIOx_CRx_AF_PP_50MHz = 3|8

GPIOx_CRx_AF_OD_10MHz = 1|4|8
GPIOx_CRx_AF_OD_2MHz = 2|4|8
GPIOx_CRx_AF_OD_50MHz = 3|4|8

GPIOx_CRx_IN_ANLG = 0
GPIOx_CRx_IN_FLOAT = 4
GPIOx_CRx_IN_PULL = 8

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkConstants”

This is much more readable than before. In fact, you could even leave out the comments, as the code becomes more self-documenting. The addresses of periphery registers are defined individually, but the bits for the GPIO registers are the same for each GPIO module, so the names include an “x” to denote that they apply to all GPIO modules.

The “CRL”/“CRH” registers get a special treatment. Since the individual bits have little direct meaning, it would be pointless to name them. Instead, 15 symbols are defined to denote the 15 possible modes of operation per pin (combinations of input/output, open-drain vs. push-pull, analog vs. digital, floating vs. pull-resistors, and output driver slew rate). Each of the 15 symbols has a 4 bit value that needs to be written into the appropriate 4 bits of the register. To configure e.g. PA10 as General Purpose Open-Drain with 10 MHz slew rate:

<source lang=asm>
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffff0ff
orr r0, #(GPIOx_CRx_GP_OD_10MHz<<8)
str r0, [r1]
</source>

C-like arithmetic operators can be used in constant expressions, like + - * / and bitwise operators like | (or), & (and), << (left shift) and >> (right shift). Note that these calculations are always done by the assembler while translating the source code, not by the processor at run time. In the example, the or operator | is used to combine bit values.
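
As a sketch of how such expressions can replace hand-calculated numbers, the PA10 configuration could derive both the shift and the mask from the pin number. The symbols “PinNumber”, “PinShift” and “PinMask” are hypothetical helpers introduced only for this illustration, and the “bic” (bit clear) instruction is used in place of “and” with an inverted mask:

<source lang=asm>
GPIOA_CRH = 0x40010804
GPIOx_CRx_GP_OD_10MHz = 1|4

PinNumber = 10 @ Hypothetical helper symbols
PinShift = (PinNumber - 8) * 4 @ Evaluates to 8
PinMask = 0xf << PinShift @ Evaluates to 0xf00

ldr r1, =GPIOA_CRH
ldr r0, [r1]
bic r0, #PinMask @ Clear the 4 configuration bits of PA10
orr r0, #(GPIOx_CRx_GP_OD_10MHz << PinShift)
str r0, [r1]
</source>

Changing “PinNumber” to another pin in the range 8-15 then adjusts the mask and the shift consistently.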

Since these constants are actually symbols, they can collide with assembler labels, so you must not define a symbol with the same name as any label.

A different kind of constants are register aliases. Using the “.req” directive, you can define a name for a processor register:

<source lang=asm>
MyData .req r7
ldr MyData, =123
add MyData, #3
</source>

This can be useful for large assembly blocks where the meaning of register data is not obvious. It also allows you to re-assign registers without having to modify many lines of code.
=== The Stack ===
In computer science, a stack is a dynamic data structure where data can be added and removed flexibly. Like a stack of books, the last element that was put on top must be taken and removed first (LIFO-structure - Last In, First Out). Adding an element is usually called “push”, and reading & removing “pop”.

Many processor architectures including ARM feature circuitry to deal with such a structure efficiently. Like most others, ARM does not provide a dedicated memory area for this - it just facilitates using an area that the programmer reserved for this purpose as a stack. Therefore, a part of the SRAM needs to be reserved for the stack.

On ARM, the program stores processor registers on the stack, i.e. 32bit per element. The stack is commonly used when the contents of some register will be needed again later after it has been overwritten by some complex operation that needs many registers. These accesses always come in pairs:

* Some operation that writes to r0
* '''Push''' (save) r0 to the stack
* Some operation that overwrites r0
* '''Pop''' (restore) r0 from the stack
* Use the value in r0 which is the same as initially assigned

ARM’s instructions for accessing the stack are unsurprisingly called “push” and “pop”. They can save/restore any of the registers r0-r12 and r14, for example:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ … Some code that overwrites r0 …

pop { r0 } @ Restore value 1000000
@ Continue using r0 ...
</source>

It is also possible to save/restore multiple registers in one go:

<source lang=asm>
ldr r0, =1000000
ldr r1, =1234567
@ Use r0 and r1 ...
push { r0, r1 } @ Save values 1000000 and 1234567

@ … Some code that overwrites r0 and r1 …

pop { r0, r2 } @ Restore 1000000 into r0 and 1234567 into r2
@ Continue using r0 and r2...
</source>

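It does not matter to which register the data is read back - in the previous example, the value that was held in r1 is restored into r2. Note that “push” and “pop” always transfer the listed registers in ascending order of register number, with the lowest-numbered register at the lowest address, regardless of the order in which they are written in the list. In larger applications, many store-restore pairs will be nested: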

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ Inner Code Block:

ldr r0, =123
@ Use r0 …

push { r0 } @ Save value 123

@ Inner-Inner Code Block that overwrites r0

pop { r0 } @ Restore value 123
@ Continue using r0 ...

pop { r0 } @ Restore value 1000000 into r0

@ Continue using r0 …
</source>

The “inner” push-pop pair works with value 123, and the “outer” push-pop pair works with value 1000000. Assuming that the stack was empty at the beginning, it will contain 1000000 after the first “push”, and both 1000000 and 123 after the second push. After the first “pop” it contains only 1000000 again, and is empty after the second “pop”.

At the beginning of a push-pop pair, the current contents of the stack is irrelevant - it may be empty or contain many elements. After the “pop”, the stack will be restored to its previous state. This makes it possible to (almost) arbitrarily nest push-pop-pairs - after any inner push-pop-pair has completed, the stack is in the same state as before entering the inner pair, so the “pop” part of the outer pair doesn’t even notice the stack was manipulated in between. This is why it is important to make sure that each “push” has a matching “pop”, and vice-versa.

As mentioned, an area of memory has to be reserved for the stack. Access to the stack memory is managed via the stack pointer (SP). The stack pointer resides in the processor register r13, and “sp” is an alias for that. As the name implies, the stack pointer contains a 32bit memory address - specifically, the address of the first byte in the stack that contains any saved data.

When storing a 32bit register value using “push”, the stack pointer is '''first''' decremented by 4 before the value is written at the newly calculated address. To restore a value, the word at the address currently stored in the stack pointer is read from memory, after which the stack pointer is incremented by 4. This is called a “full-descending” stack (see the ARM Architecture Reference Manual, chapter B1.5.6). On ARMv7-A (Cortex-A), this behaviour can be changed, but on ARMv7-M, it is dictated by the exception handling logic, which will be explained later.

An implication of this is that if the stack is empty, the stack pointer contains the address of the first byte '''after''' the stack memory area. If the stack is completely full, it contains the address of the very first byte '''inside''' the stack memory area. This means that the stack grows '''downward'''. Since the stack is empty at program start, the stack pointer therefore needs to be initialized to the first address after the memory area. Before executing the first instruction, the processor loads the first 4 bytes from the flash into the stack pointer. This is why “_StackEnd” was defined and used to place the address of the first byte after the stack memory region into the first 4 bytes of flash.
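The full-descending behaviour can be tried out with a small C model. This is only a sketch under assumptions: the names “model_push”, “model_pop” and “model_sp” are made up, the stack area is assumed to start at 0x20000000 with a size of 0x400 bytes, and no overflow or underflow checks are modelled.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 256u                  /* 0x400 bytes */
#define STACK_BASE  UINT32_C(0x20000000)  /* assumed start of the stack area */

static uint32_t stack_mem[STACK_WORDS];
/* Empty stack: SP holds the first address after the stack area. */
static uint32_t sp = STACK_BASE + 4u * STACK_WORDS;

/* push: decrement SP first, then store at the new address */
void model_push(uint32_t value)
{
    sp -= 4u;
    stack_mem[(sp - STACK_BASE) / 4u] = value;
}

/* pop: load from the address in SP, then increment SP */
uint32_t model_pop(void)
{
    uint32_t value = stack_mem[(sp - STACK_BASE) / 4u];
    sp += 4u;
    return value;
}

uint32_t model_sp(void) { return sp; }

void stack_demo(void)
{
    uint32_t start = model_sp();
    model_push(1000000);              /* outer push */
    model_push(123);                  /* inner push */
    assert(model_sp() == start - 8u); /* the stack grew downward by 2 words */
    assert(model_pop() == 123);       /* inner pop: last in, first out */
    assert(model_pop() == 1000000);   /* outer pop */
    assert(model_sp() == start);      /* stack is back in its initial state */
}
```

With these assumptions the initial stack pointer is 0x20000400, and after the first push it is 0x200003FC, i.e. the address of the last word inside the stack area.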

The stack pointer must always be a multiple of 4 (see chapter B5.1.3 in the ARM Architecture Reference Manual). It is a common error (which is even present in the example projects by ST!) to initialize the stack pointer to the last address ''inside'' the stack memory area (e.g. 0x200003FF instead of 0x20000400), which is not divisible by four. This can cause the application to crash or “just” slow it down. Actually, the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4127.html ARM ABI requires] the stack pointer to be a multiple of 8 for public software interfaces, which is important for e.g. the “printf” C function. So, when calling any external code, make sure the stack pointer is a multiple of 8.

In the previous examples, the stack memory area was defined with a size of 0x400, i.e. 1KiB. Choosing an appropriate stack size is critical for an application; if it is too small, the application will crash, if it is too large, memory is wasted that could be used otherwise. Traditionally, the stack is configured to reside at the ''end'' of available memory, e.g. 0x20005000 for the STM32F103. As the linker starts allocating memory for data (using “.data” in assembly or global/static variables in C) at the beginning of the memory, the stack is as far away from that regular data as possible, minimizing the chance of a collision. However, if the stack grows continuously, the stack pointer might end up pointing into the regular data area (“.data” or C globals) or heap memory (used by “malloc” in C). In that case, writing to the stack silently overwrites some of the regular data. This can result in all kinds of hard to find errors. Therefore, the example code puts the stack area at the ''beginning'' of RAM, and the regular data after that - if the stack grows too large, the stack pointer will reach values below 0x20000000, and any access will result in an immediate “clean” crash. It is then usually easy to find the code location that allocates too much stack memory, and to increase the stack size if necessary. Using the Cortex-M3’s memory protection unit (MPU) enables even more sophisticated strategies, but that is out of scope for this tutorial.
=== Function calls ===
Many programming languages feature a “function” concept. Also known as “procedures” or “subprograms”, functions are the most basic building blocks of larger applications, and applying them correctly is key for clean, reusable code. The assembler does not know about functions directly, so you have to build them yourself. A function is a block of code (i.e. a sequence of instructions) that you can jump to, which does some work and then jumps back to the place from which the first jump originated. This ability to jump back is the main difference from any other block of assembly code. To make this explicit, such a jump to a function is known as a “call” (as in “calling a function”). The location in code that starts the jump to the function is known as the “caller”, and the called function as “callee”. From the perspective of the caller, calling a function resembles a “user-defined” instruction - it performs some operation after which the code of the caller continues as before. To make the jump back possible, the address of the ''next'' instruction after the one that started the function call needs to be saved, so that the function can jump back to that location (without calling the function directly again).

This is done via the Link Register (LR), which is the processor register r14. Function calls are performed with the “bl” instruction. This instruction performs a jump, much like the well-known “b”, but also saves the address of the next instruction in LR. When the function is finished, it returns to the caller by jumping to the address stored in LR. As already mentioned, jumping to a location from a register is called an indirect jump, which is performed by the “bx” instruction. So, to return from a function, use “bx lr”:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

bl EnableClockGPIOA @ Call function to enable GPIOA's peripheral clock

@ Some more code ...
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1]
b . @ Endless loop, so execution does not fall through into EnableClockGPIOA

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

</source>

Here, the code to enable the clock for GPIOA was packaged into a function. To enable this clock, only a single line is now required - “bl EnableClockGPIOA”.

When calling a function, the “bl” instruction automatically makes sure to set the lowest bit in LR such that the subsequent “bx lr” will not crash because of an attempted instruction set switch, which is not possible on Cortex-M. If you need to call a function indirectly, use “blx” with a register, and remember to ensure that the lowest bit is set, typically via “.type YourFunction, %function”. Usually, all the code of an application resides within functions, with the possible exception of the Reset_Handler. The order in which functions are defined in the source files does not matter, as the linker will always automatically fill in the correct addresses. If you want to put functions in separate source files, remember to use “.global FunctionName” to make sure the symbol is visible to other files.
==== Using the stack for functions ====
In large applications it is common for functions to call other functions in a deeply nested fashion. However, a function implemented as shown can’t do that - using “bl” would overwrite the LR, and so the return address of the outer function would be lost, and that function couldn’t ever return. The solution is to use the stack: At the beginning of a function that calls other functions, use “push” to save the LR, and at the end use “pop” to restore it. For example, the blink program could be restructured like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r5, =5 @ Number of LED flashes.
bl Blink
b .

.type Blink, %function
Blink:
push { lr }
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

bl Delay

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

bl Delay

subs r5, #1
bne BlinkLoop

pop { lr }
bx lr

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

.type Delay, %function
Delay:
mov r4, r3
DelayLoop:
subs r4, #1
bne DelayLoop @ Iterate delay loop
bx lr

</source>
Example name: “BlinkFunctions”

The Reset_Handler just became much prettier. There now are functions for enabling the GPIOA clock, configuring PA8 as output, and one that delays execution so that the LED blinking is visible. The “Blink” function performs the blinking, but only for 5 flashes, after which it returns (an endless blink-loop wouldn’t be good for demonstrating returns). As you see, LR is saved on the stack to allow “Blink” to call further functions.

The two lines

<source lang=asm>
pop { lr }
bx lr
</source>

are longer than necessary. It is possible to load the return address from the stack directly into the program counter, PC:

<source lang=asm>
pop { pc }
</source>

This way, the return address that was saved on the stack is directly used for the jump back. Just the same way, you can use “push” and “pop” to save and restore any other registers while your function is running.
==== Calling Convention ====
Actually building a large program as shown in the last example is a bad idea. The “Delay” function requires the iteration count (1000000) to reside in r3 and overwrites r4. The “Blink” function relies on “Delay” not overwriting r0-r3 and r5, and requires the number of flashes to be given via r5. Such requirements can quickly grow into an intricate web of interdependencies that makes it impossible to write larger functions that call several sub-functions or to restructure anything. Therefore, it is common to use a calling convention, which defines which registers a function may overwrite, which it should keep, how it should use the stack, and how to pass information to the function and back to the caller.

When building an entire application out of your own assembly code, you can invent your own calling convention. However, it is always a good idea to use existing standards: The AAPCS defines a calling convention for ARM. This convention is also followed by C and C++ compilers, so using it makes your code automatically compatible with those. The Cortex-M interrupt mechanism follows it too, which would make it awkward to adapt code that uses some other convention to Interrupts. The specification of the calling convention is quite complex, so here is a quick summary of the basics:

* Functions may only modify the registers r0-r3 and r12. If more registers are needed, they have to be saved and restored using the stack. The APSR may be modified too.
* The LR is used as shown for the return address.
* When returning (via “bx lr”) the stack should be exactly in the same state as during the jump to the function (via “bl”).
* The registers r0-r3 may be used to pass additional information to a function, called parameters, and the function may overwrite them.
* The register r0 may be used to pass a result value back to the caller, which is called the return value.

This means that when you call a function, you must assume registers r0-r3 and r12 may be overwritten but the others keep their values. In other words, the registers r0-r3 and r12 are (if at all) saved ''outside'' the function (“caller-save”), and the registers r4-r11 are (if at all) saved ''inside'' the function (“callee-save”).

A function that does not call any other functions is called a “leaf-function” (as it is a leaf in the call tree). If such a function is simple, it might not need to touch the stack at all, as the return address is just kept in a register (LR) and it might only overwrite the registers r0-r3 and r12, which the caller can make sure to contain no important data. This makes small functions efficient, as register accesses are faster than memory accesses, such as to the stack.

If all your functions follow the calling convention, you can call any function from anywhere and be sure about what it overwrites, even if it calls many other functions on its own. Restructuring the LED blinker could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =5
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r7, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
mov r7, r0 @ Number of LED flashes.

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r7, #1
bne BlinkLoop

pop { r4-r7, pc }

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>
Example name: “BlinkFunctionCallingConvention”

The three small functions at the end only use registers r0 and r1, which they are free to overwrite. The “Delay” function expects the number of iterations as a parameter in r0, which it then modifies. Therefore, the “Blink” function fills r0 before every call to “Delay”. Alternatively, “Delay” could use a fixed iteration count, i.e. the “ldr” could be moved into “Delay”. As the “Blink” function must assume that “Delay” overwrites r0-r3 and r12, it keeps its own data in r4-r7, which are guaranteed to be retained according to the calling convention. Since “Blink”, in turn, must preserve these registers for the function that called it, it uses “push” and “pop” to save and restore them. Note the shortened syntax “r4-r7” in the instructions. The number of LED flashes is passed in r0 as a parameter; as this register will be overwritten, this number is moved to r7.

Alternatively, “Blink” could re-load the constants each time they are used in r1/r2, such that only one register (r4) needs to be saved as it is needed to count the number of flashes:

<source lang=asm>
.type Blink, %function
Blink:
push { r4, lr }

mov r4, r0

BlinkLoop:
ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r4, #1
bne BlinkLoop

pop { r4, pc }
</source>
Example name: “BlinkFunctionCallingConvention2”

A third variant would not use any of the callee-save-registers (r4-r11) at all, and instead just save r0 before the function calls and restore it as needed:

<source lang=asm>
.type Blink, %function
Blink:
push { lr }

BlinkLoop:
push { r0 }

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

pop { r0 }
subs r0, #1
bne BlinkLoop

pop { pc }

</source>
Example name: “BlinkFunctionCallingConvention3”

The frequent stack accesses would however make this slower. Be sure to always document the meaning (and units, if applicable) of parameters e.g. via comments.
=== Conditional Execution ===
As mentioned, the conditional variants of the “b” instruction (e.g. “bne”) can be used to execute certain blocks of code only if a certain condition is met. First, more ways to formulate conditions will be shown. Next, the ARM instruction “it” will be introduced, which makes executing small blocks of code conditionally more efficient.
==== Conditions ====
All conditions for conditional execution depend on the outcome of some mathematical operation. When instructions such as “adds”, “subs”, “ands” are used, they update the flags in the APSR register depending on the outcome, which are then read by the conditional variants of “b” to decide whether to actually perform the jump.

Often it is necessary to compare two numbers without actually doing a calculation. This can be done with the “cmp” instruction to which you can pass two registers or a register and a literal:

<source lang=asm>
cmp r0, #42
cmp r0, r1
</source>

The “cmp” instruction is very similar to “subs” - it subtracts the second operand from the first, but doesn’t save the result anywhere, i.e. the registers keep their values. Just the flags in the APSR are updated according to the result, just as with “subs”. For example, if both operands were equal, the result of the subtraction is zero, and the zero flag will be set. So, to test whether two numbers are equal:

<source lang=asm>
cmp r0, #42
beq TheAnswer

@ This is executed if r0 is not 42

TheAnswer:
@ This is executed if r0 is 42
</source>

The “bne” instruction is the opposite of “beq”.

The “tst” instruction works similarly to “cmp”, but instead of subtracting, it performs a bitwise “and” operation - like the “ands” instruction, but without keeping the result. This way, you can test whether a bit in a register is set:

<source lang=asm>
tst r0, #4
beq BitNotSet

@ This is executed if bit 2 in r0 is set

BitNotSet:
@ This is executed if bit 2 in r0 is not set
</source>

Another useful application of “tst” is to pass the same register twice. A bitwise “and” of a value with itself yields the value unchanged, so “tst” in this case effectively checks the properties of the input (negative/positive, zero):

<source lang=asm>
tst r0, r0
beq ValueZero

@ This is executed if r0 is not zero

ValueZero:
@ This is executed if r0 is zero
</source>

There is also the “teq” instruction which performs an exclusive or operation.

As mentioned, the suffixes “eq” and “ne” are called condition codes. ARM has 14 of those which define how the flags in the APSR form the condition. The details about how a subtraction (by “subs” or “cmp”) sets the flags in the APSR and how their interpretation by the different condition codes correlates to the mathematical result are somewhat complicated, involving the way the 2’s complement format works and relying on the fact that subtracting works by adding a negated number. Instead of diving into all the details, a table with a more high-level view and a practical interpretation of the condition should be more helpful:

{| class="wikitable"
! style="text-align:left;" | Code
! style="text-align:left;" | Meaning
! style="text-align:left;" | Unsigned / Signed
! style="text-align:left;" | Flags
! style="text-align:left;" | Condition after “cmp/subs r0, r1”
! style="text-align:left;" | Condition after “tst r0, r0”
|-
| style="text-align:center;" | EQ
| Equal
| style="text-align:center;" | U+S
| Z==1
| r0 = r1
| r0 = 0
|-
| style="text-align:center;" | NE
| Not equal
| style="text-align:center;" | U+S
| Z==0
| r0 ≠ r1
| r0 ≠ 0
|-
| style="text-align:center;" | MI
| Negative
| style="text-align:center;" | S
| N==1
| ---
| r0 < 0
|-
| style="text-align:center;" | PL
| Positive or Zero
| style="text-align:center;" | S
| N==0
| ---
| r0 ≥ 0
|-
| style="text-align:center;" | VS
| Overflow
| style="text-align:center;" | S
| V==1
| r0-r1 out of range¹
| ---
|-
| style="text-align:center;" | VC
| No Overflow
| style="text-align:center;" | S
| V==0
| r0-r1 in range¹
| ---
|-
| style="text-align:center;" | HS
| Unsigned higher or same
| style="text-align:center;" | U
| C==1
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LO
| Unsigned lower
| style="text-align:center;" | U
| C==0
| r0 < r1
| ---
|-
| style="text-align:center;" | HI
| Unsigned higher
| style="text-align:center;" | U
| C==1 and Z==0
| r0 > r1
| ---
|-
| style="text-align:center;" | LS
| Unsigned lower or same
| style="text-align:center;" | U
| C==0 or Z==1
| r0 ≤ r1
| ---
|-
| style="text-align:center;" | GE
| Signed greater or equal
| style="text-align:center;" | S
| N==V
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LT
| Signed less than
| style="text-align:center;" | S
| N!=V
| r0 < r1
| ---
|-
| style="text-align:center;" | GT
| Signed greater than
| style="text-align:center;" | S
| Z==0 and N==V
| r0 > r1
| ---
|-
| style="text-align:center;" | LE
| Signed less or equal
| style="text-align:center;" | S
| Z==1 or N!=V
| r0 ≤ r1
| ---
|}
¹ Range: the numbers from -2^31 to 2^31-1, inclusive.

To determine which condition code you need, first think about whether the number is unsigned (range 0 to 2^32-1) or is using two's complement to represent signed numbers (range -2^31 to 2^31-1). Ignore all rows in the table with the wrong format.

If you want to compare two numbers, use the “cmp” instruction, and search for the desired condition within the “cmp”-condition column of the table. If you want to test a single number's properties, use the “tst”-column. Use the condition code from the first column with the conditional “b” instruction (“bne”, “beq”, “bmi”, “bpl”, “bhs”, ... ) right after the appropriate “cmp”/“tst” instruction.

Note that all the condition codes have a corresponding inverse code that has exactly the negated meaning (e.g. “hi” and “ls”). Most also have a swapped partner code, which gives the same result as swapping the operands of “cmp” (e.g. “hi” and “lo”).
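To experiment with the table without hardware, the flag logic can be modelled in C. This is only a sketch: the type “Flags” and all function names are made up for illustration, “model_cmp” computes the flags the way a subtraction sets them, and “model_tst” models only N and Z.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } Flags;

/* Flags as set by "cmp a, b", i.e. a - b without keeping the result */
Flags model_cmp(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f;
    f.n = (r >> 31) != 0u;                    /* result negative if read as signed */
    f.z = (r == 0u);                          /* result is zero */
    f.c = (a >= b);                           /* carry set means "no borrow" */
    f.v = (((a ^ b) & (a ^ r)) >> 31) != 0u;  /* signed overflow */
    return f;
}

/* Flags as set by "tst a, b"; C and V are not modelled here */
Flags model_tst(uint32_t a, uint32_t b)
{
    uint32_t r = a & b;
    Flags f = { (r >> 31) != 0u, r == 0u, false, false };
    return f;
}

/* Condition codes, exactly as in the "Flags" column of the table */
bool cond_hi(Flags f) { return f.c && !f.z; }
bool cond_ls(Flags f) { return !f.c || f.z; }
bool cond_ge(Flags f) { return f.n == f.v; }
bool cond_lt(Flags f) { return f.n != f.v; }
bool cond_gt(Flags f) { return !f.z && f.n == f.v; }
bool cond_le(Flags f) { return f.z || f.n != f.v; }

void flags_demo(void)
{
    /* 0xFFFFFFFF is 4294967295 when unsigned, but -1 when signed. */
    Flags f = model_cmp(1u, 0xFFFFFFFFu);
    assert(cond_ls(f) && !cond_hi(f));  /* unsigned: 1 <= 4294967295 */
    assert(cond_gt(f) && !cond_le(f));  /* signed:   1 > -1 */
}
```

The demo shows why the signedness has to be decided first: the same “cmp” result satisfies “ls” for unsigned numbers and “gt” for signed numbers.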
==== The IT instruction ====
Jumping is inefficient, so having many conditional jumps may slow down your program. The ARM architecture offers a way to make a few instructions conditional without requiring a jump via the “it” (if-then) instruction. It is used in place of a conditional jump after an instruction that set the flags (“cmp”, “tst”, “adds”…) and also needs a condition code. The next instruction right after the “it” will then only be executed when the condition is met, and skipped otherwise. You have to repeat the condition code as a suffix of that instruction; this is just to make the code clearer and avoid confusion.

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

it hi @ Make the next instruction conditional
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

This checks if r0 is higher than r1 (it isn’t), and only sets the pin PA8 to high if this condition is met. Up to 4 instructions can be made conditional like this; for each additional one, another “t” has to be appended to the “it” instruction:

<source lang=asm>
cmp r0, r1 @ Perform comparison

ittt hi @ Make the next three instructions conditional
ldrhi r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldrhi r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

You can also add instructions that will be executed if the condition was ''not'' met (like an “else”-case in high-level-languages), by appending “e” instead of “t” to the “it” instruction. Since the “t” in “it” is fixed, the first instruction is always executed if the condition is met; only the next three instructions can be either a “then” case (“t”) or “else” case (“e”). You also have to provide the inverted condition code for the “else”-instructions:

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

ite hi @ Make the next two instructions conditional (if-then-else)
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
strls r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low
</source>

There are several restrictions on which instructions may appear within an it-block. Most importantly, instructions that set the flags are forbidden here, as is the “b” instruction except for the last instruction in an “it” block. Directly jumping to one of the conditional instructions is forbidden too.

In T32 code, only the conditional “b” instruction is capable of encoding a condition code together with some operation, so the “it” instruction is provided to make any instruction conditional. On A32, most instructions include a condition code and can therefore be conditional, and the “it” instruction is actually ignored by the assembler here. You can and should still put “it” into code intended for A32, as this makes it compatible with T32. This is one of the reasons why A32 is more time-efficient, and T32 more space-efficient.

Conditional instructions sometimes make surprisingly compact programs. For example, the Euclidean algorithm for calculating the greatest common divisor (gcd) of two numbers can be written in ARM assembly like this:

<source lang=asm>
gcd:
cmp r0, r1
ite gt
subgt r0, r0, r1
suble r1, r1, r0
bne gcd
</source>

While the C equivalent is actually longer:

<source lang=c>
int gcd(int a, int b) {
    while (a != b) {
        if (a > b)
            a = a - b;
        else
            b = b - a;
    }
    return a;
}
</source>

The usage of conditional instructions is also [https://developer.arm.com/docs/dui0801/latest/condition-codes/example-showing-the-benefits-of-conditional-instructions-in-a32-and-t32-code faster] than using conditional jumps. Note that the final “bne” instruction is independent of the “if-then” block; it just directly uses the result of “cmp”.
=== 8/16 bit arithmetic ===
So far, all numbers were 32 bits wide. However, especially to save space, smaller numbers with 8 or 16 bits are often needed. Cortex-M3 doesn't provide any instructions for calculating with 8 or 16 bit numbers directly. Instead, after loading such a number from memory into a processor register, it has to be extended to 32bit to allow the 32bit instructions to work properly. When storing the result back, only the lower 8/16 bit are used. If 8/16bit overflow behavior is required (i.e. wrap-around at -128/127 for 8bit signed, 0/255 for 8bit unsigned, -32768/32767 for 16bit signed, 0/65535 for 16bit unsigned) for calculations, the numbers have to be truncated after each calculation. This actually makes it slightly less efficient to deal with smaller numbers.

A 16bit value (“halfword”) can be read from memory with the ldrh instruction:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
</source>

“ldrh” loads 16bit from memory, writes them into the lower 16 bits of the destination register (here: r1), and sets the upper 16bits to zero. If the value is signed, it has to be sign-extended so that it can be used with 32bit-calculations:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
sxth r1, r1
</source>

The “sxth” instruction copies the sign bit (i.e. bit 15) into the upper 16 bits (“sign-extension”); this makes sure that negative 16bit-numbers keep their value when interpreted as 32 bits. The “ldrsh” instruction combines both “ldrh” and “sxth”. “ldrb”, “sxtb”, “ldrsb” are for loading and sign-extending 8bit-values and the combination of both, respectively.

To simulate 8/16bit overflow behaviour after a mathematical operation, use uxtb/uxth for unsigned 8/16 bit numbers, or sxtb/sxth for signed 8/16 bit numbers:

<source lang=asm>
add r0, #1
uxth r0, r0
</source>

The “uxth”/“uxtb” instructions copy the lower 16/8 bits of a register into another one, setting the upper 16/24 bits to zero. This way, if r0 contained 65535 before, the result will be 0 instead of 65536 after using “uxth”.
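
The signed case works the same way. As a minimal sketch (the start value 127 is chosen only for illustration, and r1 is assumed to hold a valid address), incrementing the largest signed 8bit number wraps around to the smallest one:

<source lang=asm>
mov r0, #127
add r0, #1 @ r0 = 128, which does not fit into a signed 8bit number
sxtb r0, r0 @ r0 = 0xffffff80 = -128, as with a real 8bit overflow
strb r0, [r1] @ Only the lower 8 bits (0x80) are stored
</source>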

This is a common trap when coding in C - when using e.g. the "uint16_t" type for local variables such as loop counters, this implicitly requests 16bit overflow behavior, requiring the truncation after each calculation, even though the overflow may actually never happen. This is why e.g. "uint_fast16_t" should be used for local variables, as this is 32 bit on ARM, which is faster.
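
To illustrate the cost, the following sketch shows roughly what a compiler may have to generate for a simple counting loop in both cases. The labels and the register assignment (r0 = counter, r1 = limit) are assumptions for illustration; actual compiler output varies, and an optimizer can sometimes prove the truncation unnecessary:

<source lang=asm>
@ Counter declared as uint16_t
Loop16:
@ Loop body ...
add r0, #1
uxth r0, r0 @ Enforce 16bit wrap-around after each increment
cmp r0, r1
blo Loop16

@ Counter declared as uint_fast16_t (32 bit on ARM)
Loop32:
@ Loop body ...
add r0, #1 @ No truncation required
cmp r0, r1
blo Loop32
</source>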
=== Alignment ===
There are certain restrictions on the address when accessing data in memory using the “str”/“ldr” variants:
* The “ldrd”/“strd”/“ldm”/“stm” instructions, which can load/store multiple registers at once, always require the address to be a multiple of 4. If it isn’t, the program will crash.
* The “ldr”/“str” instructions require the address to be a multiple of 4, and “strh”/“ldrh” require it to be a multiple of 2. If it isn’t, the behaviour depends on the ARM version:
** On ARMv6-M and before, the program will crash.
** On ARMv7-M:
*** If the CCR.UNALIGN_TRP bit is set to zero (the default), the access will be slow
*** If the CCR.UNALIGN_TRP bit is set to one, the program will crash, emulating the ARMv6-M behaviour

For “strb”/”ldrb” there are no such requirements.

The number of which the address needs to be a multiple of is called the “alignment” (e.g. 2-byte-alignment, 4-byte-alignment, …). An access with an address that is a multiple of 2/4 as specified above is called an “aligned access”; others are called “unaligned access” (which are slow or cause a crash).
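
During development, it can help to make unaligned accesses fail immediately instead of being silently slow. A minimal sketch for ARMv7-M follows. The symbol names “SCB_CCR” and “SCB_CCR_UNALIGN_TRP” are chosen here for illustration; the address 0xe000ed14 and bit 3 are taken from the description of the CCR in the ARMv7-M Architecture Reference Manual:

<source lang=asm>
SCB_CCR = 0xe000ed14
SCB_CCR_UNALIGN_TRP = 8

ldr r1, =SCB_CCR
ldr r0, [r1]
orr r0, r0, #SCB_CCR_UNALIGN_TRP
str r0, [r1] @ Set UNALIGN_TRP in CCR to 1 to trap unaligned accesses
</source>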

Even though slow accesses may be acceptable, it is still a good idea to make sure all accesses are always correctly aligned in case the code is ported to an ARM version or operating system that requires it. The addresses of periphery registers are already aligned correctly, so there is no need to worry. When placing data in RAM however, you should make sure that the addresses of the individual elements that are accessed via one of the “ldr” variants are aligned properly. For example, if a previous example code was modified like this:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

The address of “var1” will not be a multiple of 4, and an access via “ldr” would be unaligned. This could be improved by adding a space of 3 bytes in between:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.space 3
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

This would require you to keep in mind all the other things in memory that were declared before, which is impractical especially if multiple assembly files are used. Therefore, the assembler offers the “.align” directive:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.align 2
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

When using “.align X”, the assembler makes sure that the next address will be a multiple of 2^X, so in this case, a multiple of 2^2=4. The assembler will therefore insert 0 to 2^X-1 bytes of space. The section containing the directive in the object code file will also be marked to require that alignment, such that the linker will automatically place it at the appropriate location in address space.
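
The same directive also handles the 2-byte alignment needed by “ldrh”/“strh”. In the sketch below, the variable names are made up for illustration. As the meaning of the “.align” argument differs between target architectures in the GNU assembler, the “.p2align” (power of two) and “.balign” (byte count) directives can be used to state the intent unambiguously:

<source lang=asm>
.data
flag:
.space 1 @ 1 byte, no alignment needed
.align 1 @ Align to 2^1=2 bytes
counter16:
.space 2 @ Accessed via ldrh/strh
.p2align 2 @ Align to 2^2=4 bytes, same as ".align 2" on ARM
value32:
.space 4 @ Accessed via ldr/str
.balign 4 @ Align to 4 bytes, same as ".p2align 2"
</source>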
=== Offset addressing ===
The various “ldr”/“str” instructions can optionally perform some calculation on the address before executing the memory access. What is shown for “ldr” here works equivalently for “str” and the variants for halfwords and bytes. There are several variants of this. The first one adds a fixed offset, which is encoded within the instruction itself, to the address:

<source lang=asm>
ldr r0, [r1, #8]
</source>

This adds 8 to r1 and uses the result as the address to access. The number can also be negative. This variant is useful for accessing members of a heterogeneous container organized like a C struct or the registers in a periphery module. For example, you can load the base address of a periphery module into a register, and then access the various registers using offset-addressing without having to load each address individually:

<source lang=asm>
GPIOA=0x40010800
GPIOx_CRH = 0x04
GPIOx_BSRR = 0x10

GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA

ldr r1, =GPIOA

ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r0, [r1, #GPIOx_BSRR] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “OffsetAddressing”

This way, you can avoid repeated loads of similar addresses. This variant is also capable of writing the newly calculated address back into the address register by appending a “!”:

<source lang=asm>
ldr r0, [r1, #8]!
</source>

This will add 8 to r1, write the result into r1, and also use it as an address from which to load 4 bytes and store them into r0. The variant

<source lang=asm>
ldr r0, [r1], #8
</source>

works just the opposite - r1 is used as an address from which to load the data, and “r1+8” is written back to r1. The next variant adds two registers to obtain the memory address:

<source lang=asm>
ldr r0, [r1, r2]
</source>

This loads the data from the address calculated by “r1+r2”. The second register (here: r2) can also be optionally shifted left by a fixed number of bits in the range 0-3:

<source lang=asm>
ldr r0, [r1, r2, lsl #2]
</source>

This shifts r2 left by two bits (i.e. multiplies it by 4), adds it to r1, and uses that as the address (r2 itself is not modified).
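
A few more combinations follow as a sketch of how these addressing modes carry over to stores and to the halfword and byte variants. The registers are assumed to contain valid addresses and indices:

<source lang=asm>
str r0, [r1, #-4]! @ Subtract 4 from r1, write it back, and store r0 at the new address
ldrh r0, [r1], #2 @ Load a halfword from address r1, then add 2 to r1
ldrb r0, [r1, r2] @ Load a byte from address r1+r2
ldrh r0, [r1, r2, lsl #1] @ Load a halfword from address r1+r2*2
</source>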
=== Iterating arrays ===
The offset addressing mechanism is perfectly suited to iterating arrays. This could be used to make an array defining a sequence of LED flashes that is iterated by the LED blinker application. Such an array would contain the duration of each on- and off-cycle (as passed to the “Delay” function) and be placed in flash memory:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
.word 2500000, 1000000, 2500000, 1000000, 2500000, 1000000
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
BlinkTableEnd:
</source>
Example name: “BlinkPattern”

The “.word” directive is used to place a sequence of 32bit-numbers into flash memory. The label “BlinkTable” will refer to the start address of the array, and “BlinkTableEnd” to the first address ''after'' the array. These two addresses are loaded into registers before the loop. The “.align” directive is used to make sure the 32bit-words are stored at properly aligned addresses. Inside the loop, the “ldr” instruction is used to load a 32bit-word from the array and pass it to the “Delay” function. The r7 register is advanced by 4 bytes to the next 32bit-word. This is done twice, for the on- and off-time. At the end of the loop, the address register is compared with the address of “BlinkTableEnd” - until that address has been reached, the loop will continue.
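
The loop above always executes its body at least once. As a hedged sketch of the same post-increment technique with the check at the top, so that an empty array is handled too, the following hypothetical function “SumWords” (not part of the example project) adds up all 32bit-words between two addresses:

<source lang=asm>
@ Parameters: r0 = start address, r1 = first address after the array
@ Returns: r0 = sum of all words
.type SumWords, %function
SumWords:
mov r2, #0 @ Running sum
SumWordsLoop:
cmp r0, r1
bhs SumWordsDone @ Stop once the end address has been reached
ldr r3, [r0], #4 @ Load word and advance address by 4
add r2, r3
b SumWordsLoop
SumWordsDone:
mov r0, r2
bx lr
</source>

Only r0-r3 are used, so nothing needs to be saved on the stack.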

Another possibility is to keep the base address of the array in a register, and increment another register that contains the offset:

<source lang=asm>
.type Blink, %function
Blink:
push { r4-r9, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =0
ldr r9, =18

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

cmp r8, r9
blo BlinkLoop

pop { r4-r9, pc }
</source>
Example name: “BlinkPattern2”

Here, r8 is incremented in steps of 1 to denote the index in the array. The “lsl” syntax for “ldr” is used to multiply r8 by 4 (since each word is 4 bytes in size) and add it to r7, which contains the array’s base address. At the end of the loop, r8 is compared with 18, which is the number of entries in the array. This variant is actually less efficient, as it needs to keep both the base address and the index in registers and also has to increment the index in each iteration.
=== Literal loads ===

Regardless of architecture, any processor obviously needs to work with addresses in its own address space a lot. ARM can do calculations with its 32bit addresses just fine, but there is a bottleneck: The instruction set itself. To work with any address, it needs to be initially loaded into a processor register, but ARM instructions are only 16 or 32bit in size - not enough space for an arbitrary 32bit number plus the instruction encoding. Allowing even larger instructions (e.g. 40 bit) would complicate matters, so ARM instead uses several tricks to deal with this problem, which will be discussed here.

The “ldr r0, =1234” syntax allows you to load arbitrary 32bit numbers. It is not a machine code instruction itself, however, but is translated by the assembler into one. This chapter discusses the actual instructions for loading immediate numbers.

==== The “mov”-instruction ====
The most basic way of loading an immediate number into a register is the “mov” instruction:

<source lang=asm>
mov r0, #1234
</source>

This allows you to load any 16bit number (0 to 2^16-1) into a register. “mov” also includes some clever encodings that allow you to load certain commonly-used patterns:

* Any 32bit number that consists of one byte of arbitrary bits (i.e. 8 adjacent arbitrary bits) at any location, and zeros otherwise, e.g. 0x00000045, 0x00045000, 0x7f800000.
* Any 32bit number that consists of the same byte repeated 2 or 4 times in fixed places, as in 0x23002300, 0x00230023, 0x23232323
* The bit-wise negated result of any of these two patterns, e.g. 0xffffffba, 0xfffbafff, 0x807fffff or 0xdcffdcff. The assembler will actually use the “mvn” instruction for this, which works identically to “mov”, but negates the value.

By specifying a number that falls into one of these patterns, the assembler will automatically use the appropriate encoding. The first two ways of encoding numbers are not only available with “mov”, but also several other mathematical instructions that expect some immediate value: “add”, “and”, “bic”, “cmn”, “cmp”, “eor”, “mov”, “mvn”, “orn”, “orr”, “rsb”, “sbc”, “sub”, “teq”, “tst”. In the ARM Architecture Reference Manual, check the description of the instructions and look out for “ThumbExpandImm” to see whether it supports the first two patterns above.
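
For illustration, the following lines all use immediates that match one of the first two patterns and should therefore each assemble into a single instruction (the register choice is arbitrary):

<source lang=asm>
mov r0, #0x00045000 @ One byte (0x45) shifted to an arbitrary position
tst r4, #0x7f800000 @ One byte (0xff) shifted to an arbitrary position
add r1, r1, #0x23002300 @ Byte repeated in fixed places
orr r2, r2, #0x00230023
cmp r3, #0x23232323
@ mov r5, #0x12345678 @ Matches no pattern; rejected by the assembler
</source>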

You can also use the “mvn” instruction directly, e.g.:

<source lang=asm>
mov r0, #0xf807ffff
mvn r0, #0x07f80000
</source>

Both lines are equivalent and write the number 0xf807ffff into r0.
==== The “movt” instruction ====
While supporting many common patterns, this does not allow arbitrary 32 bit numbers. One way to load any 32bit number is to split the number into two 16bit halves, and use both “mov” and “movt” to combine these two half-words into one register:

<source lang=asm>
mov r0, #0xabcd
movt r0, #0x1234
</source>

The “movt” instruction loads the given number into the upper 16 bits of the register, so this example loads 0x1234abcd into r0. The order is important, as “mov” overwrites the upper 16 bits with zeros, but “movt” keeps the lower 16 bits. If a single “mov” can't fit the desired number, the combination of “mov” and “movt” is the fastest way of loading any 32bit number. As two 32bit instructions are needed, this consumes 8 bytes of program memory. If you want to load the address of a symbol into a register, you need to tell the assembler to split it automatically. This can be achieved by prefixing the symbol with “:lower16:” or “:upper16:”, e.g.:

<source lang=asm>
movw r0, #:lower16:GPIOA_BSRR
movt r0, #:upper16:GPIOA_BSRR
</source>

Note that “movw” needs to be specified in this case to explicitly tell the assembler to use the “mov” variant that accepts 16bit numbers (which it otherwise does automatically when a direct value is given).
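
To avoid writing the two lines each time, they can be wrapped in an assembler macro. This is only a sketch; the macro name “Load32” is made up here and is not part of the example project:

<source lang=asm>
.macro Load32 reg, value
movw \reg, #:lower16:\value
movt \reg, #:upper16:\value
.endm

Load32 r0, GPIOA_BSRR @ Expands to a movw/movt pair
</source>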
==== PC-relative loads ====
The other way of loading arbitrary 32bit values into registers is to place the value directly in flash memory, and load it from there using “ldr”:

<source lang=asm>
@ Some code …
mov r0, … address of Literal ...
ldr r1, [r0]
@ More code …
Literal:
.word 0x12345678
</source>

However, there is a chicken-and-egg problem: the address of “Literal” is a 32bit number itself, so how can it be loaded into r0? Luckily, there is a register that contains a number close to the one needed - the program counter (PC, r15) indicates the address of the instruction currently being executed (when read by an instruction in Thumb code, it yields the address of that instruction plus 4). By reading it and adding a small offset that fits into the instruction itself, the address of “Literal” can be obtained, provided that “Literal” is located close enough. Consider this example of the EnableClockGPIOA function:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
add r1, pc, #12
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
.word RCC_APB2ENR
</source>

The 32bit-value “RCC_APB2ENR” is stored in flash memory. The “add” instruction adds the offset 12 to the value of the PC to obtain the address of said 32bit-value, which is then loaded via “ldr”. Reading the PC yields the address of the “add” instruction plus 4 (rounded down to a multiple of 4), and the 32bit-value is located 16 bytes after the start of the function, hence the offset 16-4=12. This offset is not easy to calculate by hand and even depends on the alignment of the “add” instruction itself (hence the “.align” to ensure a consistent example). The assembler is capable of doing the calculation on its own, for which the “adr” instruction is used:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
adr r1, LiteralRCC_APB2ENR
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

The label LiteralRCC_APB2ENR refers to the address of the 32bit-value in memory. “adr” is actually a variant of “add” that instructs the assembler to calculate the offset and place it into the instruction itself, which then lets the processor add it to PC and write the result to r1. This address is then used by “ldr”.

The “adr” instruction is useful when the address of some literal is explicitly needed; for example, in the blinker program, it can be used to obtain the addresses of the array:

<source lang=asm>
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8
</source>

However, for loading a single value, the address is actually not needed. In this case, “adr” and “ldr” can be combined:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, LiteralRCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

This special variant of “ldr” lets the assembler calculate the offset as with “adr”, adds it to “PC” at runtime, and loads the data found at the resulting address into r1. This is much easier than the first variant, as all calculations are done automatically. It is still somewhat cumbersome to have to write three lines just to obtain a single 32bit value. Therefore, the assembler offers this already introduced syntax:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

This is a special command for the assembler. If possible, the assembler will use the “mov” or “mvn” instruction to load the value. If the value won’t fit, it will be put into flash memory, and a “ldr” instruction as above will be used. In this case, the “ldr rX, =...” syntax is equivalent to the combination of specifying a label for the value, the “.word” directive and “ldr rX, <Label>”. Therefore, this syntax is usually the best way to load immediates.

The assembler places the literals at the end of the file. If the file is long, the offset will be too large for the “ldr” and “adr” instructions and the assembler will emit an error. You can instruct the assembler to place all literals that have been declared so far at a specific point using the “.ltorg” directive. It is recommended to place an “.ltorg” after each function (after the “bx lr”) - just make sure that execution will never reach there. If a single function is so long that an “.ltorg” at the end is too far away from “ldr”/“adr” at the beginning, you can place an “.ltorg” somewhere in the middle and jump over it with “b”.
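
As a sketch of both placements, with made-up labels and values: the first “.ltorg” follows the end of a function, the second sits in the middle of a long function and is skipped by a branch:

<source lang=asm>
.type ShortFunction, %function
ShortFunction:
ldr r0, =0x12345678
bx lr
.ltorg @ Literals of ShortFunction are placed here

.type LongFunction, %function
LongFunction:
ldr r0, =0x9abcdef0
b SkipLiterals @ Jump over the literal pool
.ltorg @ Literals declared so far are placed here
SkipLiterals:
@ Many more instructions ...
bx lr
</source>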

In summary, the following rules can help make literal loads more efficient:
* Avoid literal loads if possible; try to calculate needed values from other values that have already been loaded, possibly by using offset-addressing in “ldr”/“str”
* When accessing multiple registers of a single periphery module, load its base address once and use offset addressing to access the individual registers
* If you need a pointer to a location in flash memory, try using “adr”
* If speed is important, use “movw”+“movt” to load the value
* Else, use “ldr rX, =...” to have the assembler choose the optimal encoding
* Place “.ltorg” after each function

The “ldr … =” instruction can also be used to load any immediate 32bit value into the PC to cause a jump to that address, simply by specifying “pc” as the target register. If you perform an ordinary branch (via “b” or “bl”) to some function whose address is too far away from the current code location, the linker will insert a “wrapper” function that does exactly that to perform the “far” jump. That function is called a “veneer”.
=== The SysTick timer ===
An important aspect of many embedded systems is to control the timing of technical processes. In the blinker example, the timing of the LED flashes was handled by having the processor execute dummy instructions to pass time. It is however virtually impossible to accurately predict the runtime of any piece of code on a complex processor such as the ARM ones, and the runtime may vary among multiple runs and depending on the actual microcontroller and its configuration. For a simple LED blinker this may be acceptable, but not for e.g. a closed loop controller for some mechanical actuator. Therefore, almost all microcontrollers and also application processors feature one or more hardware timers, which allow measuring time independently of the execution speed of the software. Timer features vary widely among different processors, but the basic idea is to increment or decrement some digital counter at each clock cycle and trigger some event when it reaches a certain value.

All ARMv7-M processors feature the so-called “SysTick”-Timer as part of the processor core itself. This is a rather simple 24bit-timer that counts from a configurable value back to zero, then resets to that value and triggers an event. This timer is frequently used as a time base for RTOS or other runtime libraries. The timer uses three periphery registers: “RVR” contains the value from which to count down. “CVR” contains the current value, and “CSR” contains some status and control bits. The timer can be used for the “Delay” function like this:

<source lang=asm>
SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
ldr r1, =SCS
add r0, r0, r0, lsl #1

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

DelayLoop:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #0x10000
beq DelayLoop

ldr r0, =0
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>

The SysTick is part of the “System Control Space”, SCS. The SCS base address is defined as a symbol, and the relative addresses of the registers as well. The count value is stored in “RVR”, after which “CVR” has to be set to zero. The timer is started by writing “5” into the “CSR” register. The loop repeatedly reads the “CSR” register and continues until bit 16 is set. The “tst” instruction is used to perform an “and” operation with the register contents and an immediate value without keeping the result while just updating the flags. At the end, the “CSR” register is set to zero to disable the timer. The “add” instruction at the beginning is used to multiply the count value by 3: r0 is shifted left by one, i.e. multiplied by two, and then added to itself, as in r0*2^1+r0. This is a common trick to quickly multiply by constants. By including this multiplication, the duration is the same as with the previous “Delay” variant, which, on this microcontroller, uses about 3 cycles per loop iteration.
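
The same trick extends to other constants. A short sketch, in which each line is meant to be read on its own, starting from the original value of r0:

<source lang=asm>
add r0, r0, r0, lsl #2 @ r0*5 = r0*2^2+r0
rsb r0, r0, r0, lsl #3 @ r0*7 = r0*2^3-r0
add r0, r0, r0, lsl #3 @ r0*9 = r0*2^3+r0
</source>

The “rsb” (“reverse subtract”) instruction subtracts its first operand from the shifted second one, which makes constants of the form 2^n-1 possible.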

Managing timing this way (or any other kind of “Delay” function) is still not very accurate. The time needed to call the function, start the timer, return, and set the pins is added to the actual duration and may also vary each time. The timing errors accumulate over time - a clock implemented this way will quickly go wrong. The proper way to achieve accurate timing is to start the timer once, let it run continuously, and react to its events. The internal clock source used by the microcontroller is also quite inaccurate (up to 2.5% deviation), which can be improved upon by a quartz crystal (typical accuracy of e.g. 0.005%), which will be covered later. Reacting to events instead of calling a function that executes dummy code requires restructuring the program code, without using any kind of “Delay” function.

To do that, the timer is started once at program startup and kept running. After setting the LED pin, wait for the timer event, and repeat. In the last example, the values 3000000 and 7500000 are used for the timer register (3x1000000 and 3x2500000, respectively). Changing the timer value while it is running continuously is problematic, so one fixed value should be used; to achieve variable blinker durations, multiple timer events need to be counted. The greatest common divisor of the two numbers is 1500000, so to achieve the two different times, 2 and 5 timer events need to be counted, respectively. Since these numbers fit into a single byte, the table entries and corresponding access instructions are changed to byte. A function “StartSysTick” is implemented to start the timer once, and a function “WaitSysTick” to wait for a given number of timer events:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =TimerValue
bl StartSysTick
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldrb r0, [r7], #1 @ Load number of timer events from table and increment address
bl WaitSysTick

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldrb r0, [r7], #1 @ Load number of timer events from table and increment address
bl WaitSysTick

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2, 2
BlinkTableEnd:
.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

bx lr

@ r0 = Number of timer events to wait for
.type WaitSysTick, %function
WaitSysTick:
ldr r1, =SCS

WaitSysTickLoop:
ldr r2, [r1, #SCS_SYST_CSR]
tst r2, #0x10000
beq WaitSysTickLoop

subs r0, #1
bne WaitSysTickLoop

bx lr

</source>
Example name: “BlinkSysTick”

This way, the blinker frequency will be as stable and accurate as possible with the given clock source.
=== Exceptions & Interrupts ===
Exceptions and interrupts play an important role in low-level development. They provide a facility for hardware to notify the software of events, such as received data blocks or a timer event. On ARM, interrupts are a sub-group of exceptions – there are some “system-level” exceptions mostly for dealing with processor errors and providing operating system support, while interrupts are “special” exceptions for events signaled by periphery modules. When writing “regular” microcontroller software, you will mostly work with interrupts.

Exceptions (and interrupts) interrupt normal program flow, and cause the processor to execute some other piece of code which is called the exception handler or Interrupt Service Routine (ISR) (even for the “system-level” exceptions that are not interrupts). After dealing with the indicated event, the ISR typically returns and normal program flow resumes. As exceptions can interrupt the program at any time, data (and periphery) may be in any kind of inconsistent state, so special care must be taken to avoid corrupting program state in an ISR. The ARMv7-M processors (including the Cortex-M3) provide sophisticated support for exceptions, with configurable priorities and nested exception calls. This chapter will only cover the basics of using exceptions.

On ARMv7-M, exception handlers are implemented as regular functions, for example:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
@ Handle event ...
bx lr
</source>

Like any other function, it has a label, returns with “bx lr”, and is also made globally visible to other source files using “.global”. The “.type … %function” is required here for the same reason as for the already-mentioned “Reset_Handler”. Exception handlers can be located anywhere in flash memory, among the other regular functions. To tell the processor where the exception handlers for the various exception types are located, the vector table needs to be adjusted. Until now, the vector table was defined as:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Recall that the first 32bit-word in flash memory contains the initial stack pointer (defined via “.word _StackEnd”) and the second word contains the address of the first instruction of the program (defined via “.word Reset_Handler”). Actually, resetting the controller is an exception too, and the code to be executed after reset (or start-up) is the handler for the reset exception (hence the name “Reset_Handler”). The next 228 bytes of flash memory contain 57 32bit-addresses of the handlers of the other exceptions, including interrupts. The “.space” directive just fills those with zeroes. To tell the processor the address of an exception handler, the appropriate entry in this table needs to be set to that address. In chapter 10.1.2, table 63 of the controller’s reference manual, the format of the vector table, and which exception’s address should go where, is defined. Only the interrupts up until position 42 actually exist on the STM32F103RB/C8, as defined in chapter 2.3.5 of the datasheet; everything from “TIM8_BRK” is only present on larger controllers. According to the table, the SysTick exception handler’s address needs to be put at location 0x3C relative to the beginning of flash memory. Since the first 8 bytes are already occupied, 0x34 bytes of space are needed after those first 8 bytes.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0x34
.word SysTick_Handler
.space 0xac
</source>
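
The two gap sizes can also be left to the assembler. In the following sketch of the same table, the symbol names “VectorTableSize” and “SysTickOffset” are assumptions introduced for illustration; the values are the ones derived above (8+0xe4=0xec bytes in total, SysTick entry at 0x3c):

<source lang=asm>
VectorTableSize = 0xec
SysTickOffset = 0x3c

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space SysTickOffset - 8 @ 0x34 bytes up to the SysTick entry
.word SysTick_Handler
.space VectorTableSize - SysTickOffset - 4 @ 0xac bytes for the remaining entries
</source>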

With this modification, the SysTick_Handler function is now declared as the handler for the SysTick exception. By default, the SysTick timer does not trigger an exception. To enable that, you have to set bit 1 (“TICKINT”) in the SCS_SYST_CSR register, i.e. write 7 instead of 5 when starting the timer. By placing the logic for the blinker into the timer's ISR, you get an interrupt-based blinker:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
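ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ Start with the first table entry (1-byte variable)
ldrb r1, BlinkTable @ Load number of timer events for the first entry
strb r1, [r0, #(TimerEvents-Variables)] @ 1-byte variable, so store a single byte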

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

ldr r0, =TimerValue
bl StartSysTick
SleepLoop:
wfi
b SleepLoop

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq Return

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs Return

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne Return

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs SkipRestart

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

SkipRestart:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

Return:
bx lr

.align 2
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr

</source>
Example name: “BlinkSysTickInterrupt”

The regular program flow now consists only of initializing the periphery, the timer, and the first step of the blinker (setting the pin high). After that, the processor should do nothing but wait for exceptions, which is achieved by a simple endless loop. The “wfi” instruction suspends the processor; when an exception occurs, the processor wakes up, executes the ISR, and resumes execution after the “wfi”. Therefore, “wfi” is usually put in an endless loop as shown. This technique can reduce the processor’s power consumption significantly, as the processor only runs when something needs to be done, as indicated by interrupts. The ISR first checks whether the interrupt flag in the timer register is set. This is necessary because exceptions can sometimes occur “spuriously”, i.e. without an actual event causing them. The decision whether to set or reset the pin is based on the lowest bit of the table index, such that the output alternates between 1 and 0.

The code inside the ISR needs to know which step in the blinking sequence is currently active, and how many timer events still have to occur before the current step ends. Therefore, two 1-byte variables are stored in RAM. To access them, offset addressing is used, where r0 contains the base address of the variables in memory, and the offsets inside “ldrb” and “strb” are set accordingly. The last number of the blink sequence table is omitted because it is superfluous: no action is taken after the last delay has elapsed. Because the table size is now odd, an “.align” directive is required after it, so that the code following the table starts at a properly aligned address again. Putting “.align” after any data output is a good idea anyway.

Since exceptions can occur at any point in regular program flow, the processor registers may contain data that will be used after the exception handler returns. Therefore, if the exception handler writes anything into the registers, they need to be restored when returning from the exception. Upon exception entry, the Cortex-M3/4 processors automatically store the registers r0-r3, r12, r14 (LR), the return address and the program status register (including the flags) on the stack. The link register is then filled with a special “dummy” value, and when the exception handler returns via “bx lr” using this value, the processor restores said registers to their previous state. This effectively means that you can implement exception handlers like any other function, i.e. freely overwrite r0-r3, r12 and the flags, and push/pop r4-r11 and the LR if needed.
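
As a minimal sketch of the last point, the following handler needs a register that is not saved by the hardware and also calls another function, so it saves r4 and the LR on entry. The names “Example_Handler” and “Helper” are made up for this illustration and are not part of the example projects. Loading the saved LR value into the PC via “pop” performs the exception return, just like “bx lr” would:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Example_Handler, %function
.global Example_Handler
Example_Handler:
push { r4, lr } @ r4 is not saved automatically; the "bl" below overwrites LR
movs r4, #5 @ r4 keeps its value across the function call
bl Helper
adds r0, r0, r4 @ r0-r3 and r12 may be overwritten freely
pop { r4, pc } @ Restore r4 and return from the exception

@ Hypothetical helper function, returns 1 in r0
.type Helper, %function
Helper:
movs r0, #1
bx lr
</source>

Pushing an even number of registers keeps the stack pointer aligned to a multiple of 8 bytes, which is what the ARM calling convention expects at function calls.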
=== Macros ===
The assembler provides a few mechanisms to make assembly-language development easier. One of these is macros, which allow you to define snippets of assembly code that you can then insert easily wherever you need them. Although a macro invocation looks similar to a function call, the code inside the macro is actually copied each time the macro is used, so don’t overuse them. A macro starts with “.macro” and ends at the next “.endm” directive. For example, the following macro sets the LED pin to 0 or 1:

<source lang=asm>
.macro SETLED value
ldr r0, =GPIOA_BSRR
ldr r1, =(((!\value) << 24) | (\value<<8))
str r1, [r0]
.endm

SETLED 0
SETLED 1
</source>

The macro name is defined as “SETLED”, and a single parameter with name “value” is given. By typing “\value”, the value of the parameter is substituted in the macro body. Some bit-shifting is used to calculate the right bit pattern to write into BSRR to set or reset the pin accordingly.
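
Macros can also take several parameters, separated by commas. As a hedged sketch, the hypothetical macro “SETPIN” below (not part of the example projects) generalizes “SETLED” to any pin of GPIOA. It assumes that “value” is exactly 0 or 1 and that “pin” is in the range 0-15:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

GPIOA_BSRR = 0x40010810

.macro SETPIN pin, value
ldr r0, =GPIOA_BSRR
@ Bits 0-15 of BSRR set a pin, bits 16-31 reset it
ldr r1, =(((1 - \value) << (\pin + 16)) | (\value << \pin))
str r1, [r0]
.endm

.text
@ Writes 0x100 (bit 8), setting PA8 high
SETPIN 8, 1
@ Writes 0x1000000 (bit 24), setting PA8 low
SETPIN 8, 0
</source>

Each of the two invocations expands to its own copy of the three instructions.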
=== Weak symbols ===
As explained before, labels defined in assembly files get translated into symbols in the object code files, which are resolved by the linker. Sometimes it is desirable to provide a “default” or “fallback” implementation of some function (or data block) which is only used when no other implementation is given. This can be achieved by marking the “fallback” variant with “.weak”:

<source lang=asm>
.type Function1, %function
.global Function1
.weak Function1
Function1:
@ Default implementation …
…
bl Function1 @ Call the function
</source>

With this code alone, “Function1” will be used normally. If you put another, non-weak function with the same name in a different assembly source file, that second variant will be used instead.
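
To illustrate, a second source file (here hypothetically called “override.S”) could contain a regular definition of the same symbol. When both object files are linked together, the linker is expected to resolve all references to “Function1”, including the “bl” shown above, to this definition and to ignore the weak one:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Function1, %function
.global Function1
Function1:
@ Replacement implementation; there is no ".weak" here, so it takes precedence
bx lr
</source>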
=== Symbol aliases ===
It is also possible to define aliases for symbols using “.thumb_set”, which sets the address accordingly. For example:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

When trying to call “Function2”, the linker will automatically fill in the address of “Function1”. This can also be combined with “.weak” to define a weak alias:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.weak Function2
.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

If you now define another “Function2” in a different assembly source file, that will be used. If not, “Function1” will be called, which is the target of the alias definition. This is useful if you want to define one default implementation for several different functions, for each of which you need one “.weak” and one “.thumb_set” directive.
=== Improved vector table ===
The techniques from the last three sections can be used to improve the definition of the vector table. The way it was defined before is not very flexible; to insert new entries, you have to calculate the new gap sizes and offsets. First, define a default handler ISR that is called by exceptions for which no other ISR is defined, and a macro that defines an alias for one exception with the default handler as the target, and finally a table of all exceptions by using the macro:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.macro defisr name
.global \name
.weak \name
.thumb_set \name, Default_Handler
.word \name
.endm

.global VectorTable
.section .VectorTable, "a"
.type VectorTable, %object
VectorTable:
.word _StackEnd
defisr Reset_Handler
defisr NMI_Handler
defisr HardFault_Handler
defisr MemManage_Handler
defisr BusFault_Handler
defisr UsageFault_Handler
.word 0
.word 0
.word 0
.word 0
defisr SVC_Handler
defisr DebugMon_Handler
.word 0
defisr PendSV_Handler
defisr SysTick_Handler
defisr WWDG_IRQHandler
defisr PVD_IRQHandler
defisr TAMPER_IRQHandler
defisr RTC_IRQHandler
defisr FLASH_IRQHandler
defisr RCC_IRQHandler
defisr EXTI0_IRQHandler
defisr EXTI1_IRQHandler
defisr EXTI2_IRQHandler
defisr EXTI3_IRQHandler
defisr EXTI4_IRQHandler
defisr DMA1_Channel1_IRQHandler
defisr DMA1_Channel2_IRQHandler
defisr DMA1_Channel3_IRQHandler
defisr DMA1_Channel4_IRQHandler
defisr DMA1_Channel5_IRQHandler
defisr DMA1_Channel6_IRQHandler
defisr DMA1_Channel7_IRQHandler
defisr ADC1_2_IRQHandler
defisr USB_HP_CAN1_TX_IRQHandler
defisr USB_LP_CAN1_RX0_IRQHandler
defisr CAN1_RX1_IRQHandler
defisr CAN1_SCE_IRQHandler
defisr EXTI9_5_IRQHandler
defisr TIM1_BRK_IRQHandler
defisr TIM1_UP_IRQHandler
defisr TIM1_TRG_COM_IRQHandler
defisr TIM1_CC_IRQHandler
defisr TIM2_IRQHandler
defisr TIM3_IRQHandler
defisr TIM4_IRQHandler
defisr I2C1_EV_IRQHandler
defisr I2C1_ER_IRQHandler
defisr I2C2_EV_IRQHandler
defisr I2C2_ER_IRQHandler
defisr SPI1_IRQHandler
defisr SPI2_IRQHandler
defisr USART1_IRQHandler
defisr USART2_IRQHandler
defisr USART3_IRQHandler
defisr EXTI15_10_IRQHandler
defisr RTCAlarm_IRQHandler
defisr USBWakeUp_IRQHandler

.text

.type Default_Handler, %function
.global Default_Handler
Default_Handler:
bkpt
b.n Default_Handler
</source>

There are a few empty entries in the table that are not used by the processor. At the beginning, there is still the definition for the initial stack pointer and the “Reset_Handler”. If you replace your “vectortable.S” by this code, you get a “proper” vector table. The “SysTick_Handler” will continue to work as before, and if you need to define any other ISR, for example for USART1, just define a function with the exact name “USART1_IRQHandler”. The address of this function will automatically be put in the vector table. If an exception without a corresponding ISR occurs, the “Default_Handler” will be called, which uses the “bkpt” instruction to force a breakpoint via the attached debugger. This helps with debugging missed exceptions while avoiding the need to define several individual dummy handler functions.
=== .include ===
Having to put the register and bit definitions (“RCC_APB2ENR”, “RCC_APB2ENR_IOPAEN”, ...) in each assembly source file is redundant and error-prone. Instead, you can put them into a separate file (e.g. called “stm32f103.inc”) and use the “.include” directive to reference it:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

@ Normal code ...
</source>

The assembler will read the code from the included file and pretend it was written instead of the “.include” line. This can help improve code structure. While working on the project structure, you can also restructure the definitions for the GPIO registers to facilitate offset addressing:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
</source>

The next example incorporates these changes in addressing the registers.
=== Local Labels ===
Having to invent unique labels for all jump targets inside functions (e.g. for conditional code and loops) can be tedious. In addition, when using a disassembler (see below), each label will appear as its own function. Therefore, the GNU assembler supports local labels. These are labels whose name consists only of a number. Local labels need not be unique; several labels called e.g. “1” may exist in one file. To jump to a local label, use the number and append an “f” or a “b” to indicate whether to jump forward or backward; the nearest label with that number in the given direction is used. Local labels cannot be exported with the “.global” directive. The interrupt-based blinker can be modified like this using local labels:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =Variables
ldr r1, =0
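strb r1, [r0, #(BlinkStep-Variables)] @ Variables are single bytes, so store only one byte
ldr r1, BlinkTable @ Lowest byte of the loaded word is the first table entry
strb r1, [r0, #(TimerEvents-Variables)]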

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

1:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>
Example name: “BlinkLocalLabels”
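
To make the resolution rule concrete, here is a small sketch that is separate from the blinker; the function names “Delay” and “Clamp” are invented for this illustration. Both functions use the label “1”. A “1b” refers to the nearest definition of “1” above it, while a “1f” refers to the nearest one below:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
@ r0 = number of loop iterations (must be greater than 0)
.type Delay, %function
Delay:
1: subs r0, #1
bne 1b @ Jumps backward to the "1" directly above
bx lr

@ r0 = unsigned value; returns r0 limited to at most 100
.type Clamp, %function
Clamp:
cmp r0, #100
bls 1f @ Jumps forward to the "1" below, not to the one inside Delay
movs r0, #100
1: bx lr
</source>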
=== Initializing RAM ===
The blinker program uses two byte variables in memory, which have to be initialized to some value at startup. For large programs with many variables, this quickly becomes hard to maintain and also inefficient. Assembler and linker can help by producing an “image” of what the RAM contents should look like after initialization, and placing this image in flash memory alongside the normal program data. At startup, this image can simply be copied 1:1 into RAM in a loop. Most programs contain many variables that are initialized with zero, so placing a (possibly large) block of zeroes in flash memory would be wasteful; therefore, an additional loop is used to set all such variables to zero. Both techniques are also employed by C and C++ compilers, so implementing the initialization code is required there too. First, change the declaration of your variables by using “.byte”, “.hword” and “.word” and include the desired initialization value. Variables that should be initialized with zero are placed after a “.bss” directive, which puts them into the section of the same name. They don’t get an initialization value, but just reserved space by using “.space”:

<source lang=asm>
.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1
</source>

From the assembler’s point of view, the initialization data - in this case, just one byte of value “2” - will directly end up in RAM. However, this is not possible on microcontrollers, as the RAM contains random data on startup and isn’t automatically initialized. To get the data into flash memory and still have the symbols refer to RAM, change the linker script as follows:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
. = ALIGN(4);
} >FLASH

.stack (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
} >SRAM

.data : {
_DataStart = .;
*(.data);
. = ALIGN(4);
_DataEnd = .;
} >SRAM AT >FLASH

_DataLoad = LOADADDR(.data);

.bss (NOLOAD) : {
_BssStart = .;
*(.bss);
. = ALIGN(4);
_BssEnd = .;
} >SRAM
}
</source>
Example name: “BlinkInitRAM”

The stack is now put in its own section with the “NOLOAD” attribute, since it doesn’t need initializing. The data is put in the “.data” section. The initial data for that section is placed in flash memory via the “>SRAM AT >FLASH” construct. The addresses of symbols inside the “.data” section are still the addresses in RAM, so accesses to the symbols from assembly code still work. The symbol “_DataStart” is assigned the beginning of the initialized data in RAM, and “_DataEnd” its end. The “LOADADDR” function is used to get the beginning of the initialization data in flash, which is assigned to “_DataLoad”. The “.bss” section contains all the variables that should be zero-initialized, and the symbols “_BssStart” and “_BssEnd” are set to its beginning and end address, respectively. It is marked with “NOLOAD” as well, because we don't want to store (potentially many) zeroes in the linked program file; it is initialized explicitly instead (see below). As the beginning and size of the stack are already multiples of 4, the beginning of “.data” is as well. The size of “.data” might not be a multiple of 4, however, so an “.=ALIGN(4)” command is inserted right before the definition of “_DataEnd”. This adds 0-3 dummy bytes by incrementing the location counter to make sure the address is a multiple of 4. The same thing is done right before “_BssEnd” and also at the end of the “.text” section, to make sure that “_BssEnd” and “_DataLoad” are multiples of 4 as well. The alignment matters because the startup code copies and clears memory one 32-bit word at a time.

The only thing left is the actual initialization of the RAM. To do that, change the “Reset_Handler” as follows:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b
.ltorg
</source>

The explicit initialization of the variables was removed. Instead, the addresses for “_DataStart”, “_DataEnd” and “_DataLoad” that were defined in the linker script are loaded. Then, a short loop repeatedly loads a word from flash (starting at “_DataLoad”) and stores it into RAM (starting at “_DataStart”). The address pointers are incremented by the “ldr”/“str” instructions after the access. The pointer for the RAM location is compared with the end of the RAM area (“_DataEnd”) to decide whether to jump back to the beginning of the loop. To start the loop, a jump directly to the comparison is performed; this way, the comparison is written only once and is still executed before the first iteration, so the loop body is skipped entirely if the area is empty. The second loop performs the zero-initialization of the area between “_BssStart” and “_BssEnd”; it works similarly, but does not need to load any data.

Unfortunately, the program as shown can’t be assembled: as the two variables now reside in two different sections (“.data” and “.bss”), the offset addressing in the “SysTick_Handler” doesn’t work anymore. Therefore, each variable has to be addressed directly:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr
</source>

=== Peripheral interrupts ===
Interrupts, i.e. exceptions triggered by periphery modules, need a little extra code compared to the “core” exceptions such as the SysTick. The Cortex-M’s interrupt controller (the NVIC) contains several registers for configuring these interrupts. It is possible to configure the priority and to trigger interrupts manually, but for most applications, the only necessary step is enabling the desired interrupt. This is done via the registers “NVIC_ISER0” through “NVIC_ISER15”, which are documented in the ARMv7-M Architecture Reference Manual in chapter B3.4.4. Each of those registers contains 32 bits, each of which enables one interrupt. The STM32F103RB/C8 has 43 interrupts, so only two of the possible 16 registers are present. The number of interrupts is given in chapter 2.3.5 of the controller’s datasheet. So, to enable some interrupt x, the bit “x mod 32” in register NVIC_ISER''y'' with y=x/32 (rounded down) has to be set. This register’s address is 0xE000E100+y*4. Given an interrupt’s number in r0, the following function does just that:

<source lang=asm>
NVIC_ISER0 = 0xE000E100

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>
Example name: “BlinkTIM1”

The “and” instruction calculates “x mod 32”, and the following left-shift (“lsls”) calculates the value where bit “x mod 32” is one, and all others are zero. To calculate the offset address “y*4”, i.e. “(x/32)*4”, the register is first shifted right by 5 bits and then shifted back left by 2 bits. This is the same as shifting 3 bits right and zeroing out the lower 2 bits; but two shift instructions actually consume less program memory space. Finally, the calculated value is written into the register by using offset addressing.
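
The same address calculation works for the other NVIC register groups that use one bit per interrupt. As a sketch that is not part of the example projects, the hypothetical function “DisableIRQ” below uses the Interrupt Clear-Enable Registers, which start at 0xE000E180 (“NVIC_ICER0”) according to the ARMv7-M Architecture Reference Manual. Writing a one there disables the respective interrupt:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

NVIC_ICER0 = 0xE000E180

.text
@ r0 = IRQ Number
.type DisableIRQ, %function
DisableIRQ:
ldr r1, =NVIC_ICER0

movs r2, #1
and r3, r0, #0x1F @ r3 = x mod 32
lsls r2, r2, r3 @ r2 = 1 << (x mod 32)

lsrs r3, r0, #5 @ r3 = x / 32
lsls r3, r3, #2 @ r3 = (x / 32) * 4

str r2, [r1, r3] @ Write the bit mask into NVIC_ICERy

bx lr
.ltorg
</source>

For example, USART1 has interrupt number 37 in the vector table above, so both functions access bit 5 (37 mod 32) of the second register (y = 1), i.e. the word at offset 4.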

In addition to enabling the interrupt in the processor core’s NVIC, it also has to be enabled in the periphery module. Many periphery modules support several different events, each of which has to be enabled in the periphery’s register individually. Depending on the controller, these can be mapped to one single processor interrupt (and hence, one single ISR) or multiple ones, and need to be configured in the NVIC appropriately.

This example uses the STM32’s periphery timer TIM1 instead of the SysTick timer:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500
TimerPrescaler=1000

.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl EnableClockTIM1
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TIM1_UP_IRQn
bl EnableIRQ
bl StartTIM1
1:
wfi
b 1b
.ltorg

.type TIM1_UP_IRQHandler, %function
.global TIM1_UP_IRQHandler
TIM1_UP_IRQHandler:
ldr r0, =TIM1
ldr r2, =(~(1 << TIMx_SR_UIF))

ldr r1, [r0, #TIMx_SR]
bics r1, r2
beq 2f

str r2, [r0, #TIMx_SR] @ clearing bit UIF in status register by writing bit mask ~(1<<TIMx_SR_UIF)

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type EnableClockTIM1, %function
EnableClockTIM1:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_TIM1EN)
str r0, [r1, #RCC_APB2ENR] @ Set TIM1EN bit in RCC_APB2ENR to 1 to enable TIM1
bx lr @ Return to caller
.ltorg

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ Configures and starts TIM1; takes no arguments
.type StartTIM1, %function
StartTIM1:
ldr r0, =TIM1
ldr r1, =(1 << TIMx_CR1_URS)
str r1, [r0, #TIMx_CR1]

ldr r1, =TimerPrescaler
str r1, [r0, #TIMx_PSC]

ldr r1, =TimerValue
str r1, [r0, #TIMx_ARR]

ldr r1, =(1 << TIMx_DIER_UIE)
str r1, [r0, #TIMx_DIER]

ldr r1, =(1 << TIMx_EGR_UG)
str r1, [r0, #TIMx_EGR]

dsb

ldr r1, =(1 << TIMx_CR1_CEN)
str r1, [r0, #TIMx_CR1]

bx lr
.ltorg

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>

The corresponding stm32f103.inc file with the added definitions for the timer registers is:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

RCC = 0x40021000
RCC_APB2ENR = 0x18
RCC_APB2ENR_IOPAEN = 2
RCC_APB2ENR_TIM1EN = 11

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

TIM1 = 0x40012C00

TIMx_CR1 = 0
TIMx_CR1_ARPE = 7
TIMx_CR1_URS = 2
TIMx_CR1_CEN = 0

TIMx_DIER = 0xC
TIMx_DIER_UIE = 0

TIMx_SR = 0x10
TIMx_SR_UIF = 0

TIMx_EGR = 0x14
TIMx_EGR_UG = 0

TIMx_PSC = 0x28
TIMx_ARR = 0x2C

TIM1_UP_IRQn = 25

NVIC_ISER0 = 0xE000E100
</source>

The source code enables the timer’s clock in the RCC before configuring it. The timer supports both a freely configurable prescaler for dividing the clock and a freely configurable maximum value, both of which are set by the StartTIM1 function. The “TIMx_DIER_UIE” bit is set to enable the interrupt for the so-called “update event”, which is triggered whenever the timer reaches the maximum value. A delicate sequence of register accesses is required to start the timer with the right configuration but without triggering the interrupt right away: To apply the modified settings immediately, the “TIMx_EGR_UG” bit is set to trigger an “artificial” update event. To prevent this from also triggering the interrupt, the “TIMx_CR1_URS” bit is set beforehand and cleared again afterwards. The timer is started by setting the “TIMx_CR1_CEN” bit at the end. Before that, a “dsb” instruction is inserted. This “Data Synchronization Barrier” waits until all preceding write accesses have been completely processed - normally, the processor’s pipeline works on several instructions at once. Because the timer configuration needs to be truly finished before starting the timer, this instruction is required. There are some other situations where the processor is too fast for the periphery and needs to be temporarily halted by a “dsb”. If some periphery-accessing code works in step-by-step mode while debugging, but not when executing normally, a well-placed “dsb” might help.
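
The resulting blink timing can be estimated from the two constants. The counter is clocked at the timer clock divided by “TIMx_PSC”+1 and counts from 0 up to “TIMx_ARR”, so one update event takes (TIMx_PSC+1)*(TIMx_ARR+1) timer clock cycles. The following sketch lets the assembler do this calculation. It assumes a timer clock of 8 MHz (the internal oscillator after reset, without any clock configuration), and the symbols “TimerClockHz”, “UpdateCycles” and “UpdatePeriod_us” are made up for this illustration:

<source lang=asm>
TimerClockHz = 8000000 @ Assumption: TIM1 runs from the 8 MHz internal oscillator
TimerValue = 1500
TimerPrescaler = 1000

@ Clock cycles between two update events: 1001 * 1501 = 1502501
UpdateCycles = (TimerPrescaler + 1) * (TimerValue + 1)
@ Duration in microseconds (integer division): 1502501 / 8 = 187812
UpdatePeriod_us = UpdateCycles / (TimerClockHz / 1000000)
</source>

Under this assumption, each count in the blink table corresponds to roughly 188 ms.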

The ISR “TIM1_UP_IRQHandler” is used for the timer. It checks the “TIMx_SR_UIF” bit to verify an update event has actually happened. In that case, the register is overwritten with the value 0xFFFFFFFE, i.e. all bits are written with “1” except the UIF bit. Writing ones has no effect on the bits in this register, and writing a zero clears the respective bit. Therefore, this write access clears the UIF bit but keeps the others. These interrupt flags must always be cleared as soon as possible in the ISR, or the periphery might trigger the interrupt again immediately. The rest of the ISR stays the same.
=== Analysis tools ===
When working on a low level directly with linker scripts and assembly code, it is frequently necessary to verify the translation output directly, as you can’t rely on a compiler doing it right automatically, and flashing the program each time to see whether it works isn’t the most efficient way. This was, in fact, important in creating the example code for this tutorial. The “binutils” package, which the assembler and linker are part of, offers a few tools that help with analyzing the assembler’s and linker’s output.
==== Disassembler ====
As the name implies, a disassembler is the opposite of an assembler - it turns binary machine code back into a (more or less) readable textual representation. If you feed an ELF file generated by the assembler or linker into the disassembler, it will read the header information to tell apart data (i.e. constants) and code, get the names of symbols (and therefore, labels), and can even tell which instructions were generated from which assembly source file if it was assembled with debug information (i.e. the “-g” flag was used). If you disassemble a binary flash image, the disassembler doesn’t have all that information and will produce much less readable output and attempt to decode data bytes as instructions.

The disassembler from binutils is called “objdump”. Invoking it on the blinker looks like this:
<pre>
$ arm-none-eabi-objdump -d -s prog1.elf

prog1.elf: file format elf32-littlearm

Contents of section .VectorTable:
8000000 00040020 ed000008 ed010008 ed010008 ... ............
8000010 ed010008 ed010008 ed010008 00000000 ................
8000020 00000000 00000000 00000000 ed010008 ................
8000030 ed010008 00000000 ed010008 49010008 ............I...
8000040 ed010008 ed010008 ed010008 ed010008 ................
8000050 ed010008 ed010008 ed010008 ed010008 ................
8000060 ed010008 ed010008 ed010008 ed010008 ................
8000070 ed010008 ed010008 ed010008 ed010008 ................
8000080 ed010008 ed010008 ed010008 ed010008 ................
8000090 ed010008 ed010008 ed010008 ed010008 ................
80000a0 ed010008 ed010008 ed010008 ed010008 ................
80000b0 ed010008 ed010008 ed010008 ed010008 ................
80000c0 ed010008 ed010008 ed010008 ed010008 ................
80000d0 ed010008 ed010008 ed010008 ed010008 ................
80000e0 ed010008 ed010008 ed010008 ............
Contents of section .text:
80000ec 0f481049 104a03e0 52f8043b 40f8043b .H.I.J..R..;@..;
80000fc 8842f9d3 0d480e49 4ff00002 01e040f8 .B...H.IO.....@.
800010c 042b8842 fbd300f0 47f800f0 4bf84ff4 .+.B....G...K.O.
800011c 80710848 01600848 00f058f8 30bffde7 .q.H.`.H..X.0...
800012c 00040020 04040020 f0010008 04040020 ... ... .......
800013c 08040020 10080140 60e31600 4ff0e020 ... ...@`...O..
800014c 006910f4 803f1dd0 1a480178 b1f1110f .i...?...H.x....
800015c 18d21948 0378013b 1cbf0370 12e001f1 ...H.x.;...p....
800016c 0101b1f1 110f02d2 144a535c 03701148 .........JS\.p.H
800017c 017011f0 01010cbf 4ff48071 4ff08071 .p......O..qO..q
800018c 0f480160 70470202 02020202 05020502 .H.`pG..........
800019c 05020202 02020200 0a490868 40f00400 .........I.h@...
80001ac 08607047 08490868 20f00f00 40f00200 .`pG.I.h ...@...
80001bc 08607047 04040020 00040020 92010008 .`pG... ... ....
80001cc 10080140 18100240 04080140 4ff0e021 ...@...@...@O..!
80001dc 48614ff0 00008861 4ff00700 08617047 HaO....aO....apG
80001ec 00befde7 ....
Contents of section .data:
20000400 02000000 ....
Contents of section .ARM.attributes:
0000 41200000 00616561 62690001 16000000 A ...aeabi......
0010 05436f72 7465782d 4d330006 0a074d09 .Cortex-M3....M.
0020 02 .
Contents of section .debug_line:
0000 98000000 02001e00 00000201 fb0e0d00 ................
0010 01010101 00000001 00000100 70726f67 ............prog
0020 312e5300 00000000 000502ec 00000803 1.S.............
0030 15012121 22212f2f 21222121 30212f21 ..!!"!//!"!!0!/!
0040 222f302f 21232130 21036120 2f2f362f "/0/!#!0!.a //6/
0050 030c2e32 030a2e2f 212f2222 222f2221 ...2.../!/"""/"!
0060 21222121 222f2f22 21212321 222f212f !"!!"//"!!#!"/!/
0070 30212303 0d9e2121 2f212421 212f2f21 0!#...!!/!$!!//!
0080 03422035 030c2e03 0d2e0311 2e36030b .B 5.........6..
0090 2e30212f 222f2202 01000101 3b000000 .0!/"/".....;...
00a0 02002400 00000201 fb0e0d00 01010101 ..$.............
00b0 00000001 00000100 76656374 6f727461 ........vectorta
00c0 626c652e 53000000 00000005 02ec0100 ble.S...........
00d0 0803d000 01210201 000101 .....!.....
Contents of section .debug_info:
0000 22000000 02000000 00000401 00000000 "...............
0010 ec000008 ec010008 00000000 08000000 ................
0020 12000000 01802200 00000200 14000000 ......".........
0030 04019c00 0000ec01 0008f001 00082100 ..............!.
0040 00000800 00001200 00000180 ............
Contents of section .debug_abbrev:
0000 01110010 06110112 01030e1b 0e250e13 .............%..
0010 05000000 01110010 06110112 01030e1b ................
0020 0e250e13 05000000 .%......
Contents of section .debug_aranges:
0000 1c000000 02000000 00000400 00000000 ................
0010 ec000008 00010000 00000000 00000000 ................
0020 1c000000 02002600 00000400 00000000 ......&.........
0030 ec010008 04000000 00000000 00000000 ................
Contents of section .debug_str:
0000 70726f67 312e5300 2f746d70 2f746573 prog1.S./tmp/tes
0010 7400474e 55204153 20322e32 392e3531 t.GNU AS 2.29.51
0020 00766563 746f7274 61626c65 2e5300 .vectortable.S.

Disassembly of section .text:

080000ec <Reset_Handler>:
80000ec: 480f ldr r0, [pc, #60] ; (800012c <Reset_Handler+0x40>)
80000ee: 4910 ldr r1, [pc, #64] ; (8000130 <Reset_Handler+0x44>)
80000f0: 4a10 ldr r2, [pc, #64] ; (8000134 <Reset_Handler+0x48>)
80000f2: e003 b.n 80000fc <Reset_Handler+0x10>
80000f4: f852 3b04 ldr.w r3, [r2], #4
80000f8: f840 3b04 str.w r3, [r0], #4
80000fc: 4288 cmp r0, r1
80000fe: d3f9 bcc.n 80000f4 <Reset_Handler+0x8>
8000100: 480d ldr r0, [pc, #52] ; (8000138 <Reset_Handler+0x4c>)
8000102: 490e ldr r1, [pc, #56] ; (800013c <Reset_Handler+0x50>)
8000104: f04f 0200 mov.w r2, #0
8000108: e001 b.n 800010e <Reset_Handler+0x22>
800010a: f840 2b04 str.w r2, [r0], #4
800010e: 4288 cmp r0, r1
8000110: d3fb bcc.n 800010a <Reset_Handler+0x1e>
8000112: f000 f847 bl 80001a4 <EnableClockGPIOA>
8000116: f000 f84b bl 80001b0 <ConfigurePA8>
800011a: f44f 7180 mov.w r1, #256 ; 0x100
800011e: 4808 ldr r0, [pc, #32] ; (8000140 <Reset_Handler+0x54>)
8000120: 6001 str r1, [r0, #0]
8000122: 4808 ldr r0, [pc, #32] ; (8000144 <Reset_Handler+0x58>)
8000124: f000 f858 bl 80001d8 <StartSysTick>
8000128: bf30 wfi
800012a: e7fd b.n 8000128 <Reset_Handler+0x3c>
800012c: 20000400 .word 0x20000400
8000130: 20000404 .word 0x20000404
8000134: 080001f0 .word 0x080001f0
8000138: 20000404 .word 0x20000404
800013c: 20000408 .word 0x20000408
8000140: 40010810 .word 0x40010810
8000144: 0016e360 .word 0x0016e360

08000148 <SysTick_Handler>:
8000148: f04f 20e0 mov.w r0, #3758153728 ; 0xe000e000
800014c: 6900 ldr r0, [r0, #16]
800014e: f410 3f80 tst.w r0, #65536 ; 0x10000
8000152: d01d beq.n 8000190 <SysTick_Handler+0x48>
8000154: 481a ldr r0, [pc, #104] ; (80001c0 <ConfigurePA8+0x10>)
8000156: 7801 ldrb r1, [r0, #0]
8000158: f1b1 0f11 cmp.w r1, #17
800015c: d218 bcs.n 8000190 <SysTick_Handler+0x48>
800015e: 4819 ldr r0, [pc, #100] ; (80001c4 <ConfigurePA8+0x14>)
8000160: 7803 ldrb r3, [r0, #0]
8000162: 3b01 subs r3, #1
8000164: bf1c itt ne
8000166: 7003 strbne r3, [r0, #0]
8000168: e012 bne.n 8000190 <SysTick_Handler+0x48>
800016a: f101 0101 add.w r1, r1, #1
800016e: f1b1 0f11 cmp.w r1, #17
8000172: d202 bcs.n 800017a <SysTick_Handler+0x32>
8000174: 4a14 ldr r2, [pc, #80] ; (80001c8 <ConfigurePA8+0x18>)
8000176: 5c53 ldrb r3, [r2, r1]
8000178: 7003 strb r3, [r0, #0]
800017a: 4811 ldr r0, [pc, #68] ; (80001c0 <ConfigurePA8+0x10>)
800017c: 7001 strb r1, [r0, #0]
800017e: f011 0101 ands.w r1, r1, #1
8000182: bf0c ite eq
8000184: f44f 7180 moveq.w r1, #256 ; 0x100
8000188: f04f 7180 movne.w r1, #16777216 ; 0x1000000
800018c: 480f ldr r0, [pc, #60] ; (80001cc <ConfigurePA8+0x1c>)
800018e: 6001 str r1, [r0, #0]
8000190: 4770 bx lr

08000192 <BlinkTable>:
8000192: 0202 0202 0202 0205 0205 0205 0202 0202 ................
80001a2: .

080001a3 <BlinkTableEnd>:
...

080001a4 <EnableClockGPIOA>:
80001a4: 490a ldr r1, [pc, #40] ; (80001d0 <ConfigurePA8+0x20>)
80001a6: 6808 ldr r0, [r1, #0]
80001a8: f040 0004 orr.w r0, r0, #4
80001ac: 6008 str r0, [r1, #0]
80001ae: 4770 bx lr

080001b0 <ConfigurePA8>:
80001b0: 4908 ldr r1, [pc, #32] ; (80001d4 <ConfigurePA8+0x24>)
80001b2: 6808 ldr r0, [r1, #0]
80001b4: f020 000f bic.w r0, r0, #15
80001b8: f040 0002 orr.w r0, r0, #2
80001bc: 6008 str r0, [r1, #0]
80001be: 4770 bx lr
80001c0: 20000404 .word 0x20000404
80001c4: 20000400 .word 0x20000400
80001c8: 08000192 .word 0x08000192
80001cc: 40010810 .word 0x40010810
80001d0: 40021018 .word 0x40021018
80001d4: 40010804 .word 0x40010804

080001d8 <StartSysTick>:
80001d8: f04f 21e0 mov.w r1, #3758153728 ; 0xe000e000
80001dc: 6148 str r0, [r1, #20]
80001de: f04f 0000 mov.w r0, #0
80001e2: 6188 str r0, [r1, #24]
80001e4: f04f 0007 mov.w r0, #7
80001e8: 6108 str r0, [r1, #16]
80001ea: 4770 bx lr

080001ec <Default_Handler>:
80001ec: be00 bkpt 0x0000
80001ee: e7fd b.n 80001ec <Default_Handler>
</pre>

This is a lot of information. The “-d” flag tells objdump to disassemble the code sections, and the “-s” flag makes it dump the raw contents of the sections. First, it prints the contents of “.VectorTable”. Each line is prefixed with the address at which the data is found in memory, followed by the 32-bit words of the vector table. objdump prints the bytes in the order in which they appear in memory. Since the Cortex-M3 uses little endian, each word appears byte-reversed: for example, the printed “ed000008” is the value “0x080000ed”, which is the address of the “Reset_Handler” (0x080000ec) with the lowest bit set to one, as it is a thumb function. Most of the entries in the vector table contain the address of the default handler, 0x080001ec, except for the zero entries and the SysTick_Handler. The contents of the “.text” section are the hexadecimal representation of the machine code, and hardly readable. The “.data” section contains a single “two” - this is the “02” put into “TimerEvents”. The contents of “.ARM.attributes” and the various “.debug” sections are not very interesting, as they do not end up on the controller and are only read by the various analysis tools to provide nicer output.
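The byte-order rule can be illustrated with a small host-side C sketch. It reassembles one vector table entry from the four bytes in the order objdump prints them and separates the thumb bit from the handler address. The type and function names are made up for this illustration and do not appear in the tutorial's code.

```c
#include <stdint.h>

/* Hypothetical helper type: one decoded vector table entry. */
typedef struct {
    uint32_t raw;      /* 32-bit value as the processor reads it */
    uint32_t address;  /* handler address with bit 0 cleared */
    int      thumb;    /* 1 if bit 0 (the thumb bit) is set */
} VectorEntry;

/* bytes[0] is the byte at the lowest address, i.e. the first one printed. */
VectorEntry DecodeVectorEntry (const uint8_t bytes[4]) {
    VectorEntry e;
    /* Little endian: the first byte is the least significant one. */
    e.raw = (uint32_t) bytes[0]
          | ((uint32_t) bytes[1] << 8)
          | ((uint32_t) bytes[2] << 16)
          | ((uint32_t) bytes[3] << 24);
    e.thumb = (int) (e.raw & 1u);
    e.address = e.raw & ~(uint32_t) 1u;
    return e;
}
```

For the bytes ed 00 00 08 this gives the raw value 0x080000ed, the address 0x080000ec and a set thumb bit, matching the “Reset_Handler” entry discussed above.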

After that comes the actual disassembly. This is a list of all the instructions in the code section. The list is grouped by the symbols found in the input file. For C code, each symbol usually matches one function, so each block in the disassembly represents one C function. In assembly code, if you put non-local labels into a function, that function will be split into multiple blocks by the disassembler, making it harder to read - the main reason for using local labels. Each instruction is shown as one line inside its block. The first column is the address where that instruction is found. The next column contains the hexadecimal representation of the 2 or 4 bytes that make up the machine code of that instruction, i.e. the actual content of flash memory. After that comes a textual representation of that instruction, as inferred by the disassembler. If the instruction contains some number, the disassembler sometimes outputs a semicolon followed by some interpretation of that number. If the instruction employs PC-relative addressing, that interpretation will be the absolute address. As many instructions have multiple spellings, there can be discrepancies between the original code and the disassembly. The disassembler will also output data, such as the “BlinkTable” and the literal pools, as such. Using the “.type” directive is helpful in that case so that the disassembler does not attempt to interpret the data bytes as code.
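Two of the points above can be reproduced by hand, and the following C sketch does so under stated assumptions. The first function decides from the first halfword whether an instruction occupies 2 or 4 bytes, using the Thumb-2 rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction. The second function computes the absolute address that objdump prints for a 16-bit “ldr rX, [pc, #offset]”. The function names are invented for this example.

```c
#include <stdint.h>

/* Returns 4 if the halfword starts a 32-bit Thumb-2 instruction, else 2. */
int ThumbInstructionSize (uint16_t firstHalfword) {
    uint16_t top5 = (uint16_t) (firstHalfword >> 11);
    return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}

/* Absolute address of the literal loaded by "ldr rX, [pc, #offset]".
   In thumb code the PC reads as the instruction address plus 4, and for
   this addressing mode it is rounded down to a multiple of 4 first. */
uint32_t PcRelativeLiteralAddress (uint32_t instructionAddress, uint32_t offset) {
    uint32_t pc = instructionAddress + 4u;
    return (pc & ~(uint32_t) 3u) + offset;
}
```

Applied to the listing above, the halfword 0x480f yields a size of 2 and 0xf852 a size of 4. The instruction at 0x080000ec with offset 60 yields 0x0800012c, and the one at 0x080000ee with offset 64 yields 0x08000130, which are the addresses shown after the semicolons.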

objdump can also be used to disassemble raw binary files, such as those obtained by reading back the flash memory of a controller. To do this, use this command line:

<source lang=sh>
$ arm-none-eabi-objdump -b binary -m arm -D prog1.bin -Mforce-thumb --adjust-vma=0x08000000
</source>

The address of the binary in flash memory is specified so that the printed instruction addresses are correct. However, as the disassembler can’t tell data and code apart, the result will be of limited use. If you have to analyze a binary without having an ELF file or the source code, a more sophisticated disassembler such as IDA Pro is helpful. If you have the code and only need the disassembler to identify potential problems with the project (esp. the linker script), objdump is usually sufficient.
==== readelf ====
The “readelf” program is a powerful utility that can read and output various information from ELF files. The most useful option is the “-S” flag, which lets readelf print a summary of the sections in the respective file, e.g.:

<pre>
$ arm-none-eabi-readelf -S prog1.elf
There are 15 section headers, starting at offset 0x11268:

Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .VectorTable PROGBITS 08000000 010000 0000ec 00 A 0 0 1
[ 2] .text PROGBITS 080000ec 0100ec 000104 00 AX 0 0 4
[ 3] .stack NOBITS 20000000 020000 000400 00 WA 0 0 1
[ 4] .data PROGBITS 20000400 010400 000004 00 WA 0 0 1
[ 5] .bss NOBITS 20000404 010404 000004 00 WA 0 0 1
[ 6] .ARM.attributes ARM_ATTRIBUTES 00000000 010404 000021 00 0 0 1
[ 7] .debug_line PROGBITS 00000000 010425 0000db 00 0 0 1
[ 8] .debug_info PROGBITS 00000000 010500 00004c 00 0 0 1
[ 9] .debug_abbrev PROGBITS 00000000 01054c 000028 00 0 0 1
[10] .debug_aranges PROGBITS 00000000 010578 000040 00 0 0 8
[11] .debug_str PROGBITS 00000000 0105b8 00002f 01 MS 0 0 1
[12] .symtab SYMTAB 00000000 0105e8 0006a0 10 13 45 4
[13] .strtab STRTAB 00000000 010c88 000550 00 0 0 1
[14] .shstrtab STRTAB 00000000 0111d8 000090 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
y (purecode), p (processor specific)
</pre>

For each section, one line is output. The sections “.strtab”, “.shstrtab”, “.symtab” and “NULL” are an integral part of ELF and always present. The “.debug” sections are present if the source was assembled with the “-g” flag. The “.ARM.attributes” section defines for which ARM processor the contained code was translated. These sections don’t end up on the microcontroller. The remaining sections were defined in the linker script: “.VectorTable” contains the addresses of the exception handlers, “.text” contains the program code and constant data for flash memory, “.stack” the stack in RAM, “.data” contains variables in RAM and “.bss” contains zero-initialized variables in RAM. For these sections, the column “Type” contains either “PROGBITS” or “NOBITS”, which tells you whether the section in the ELF file actually contains some data - this is only the case for “.VectorTable”, “.text” and “.data”. The sections “.bss” and “.stack” only reserve memory that is written at runtime, but the ELF file doesn’t contain data to be written to these sections. The column “Addr” defines where the section begins in the address space. The most useful column is “Size”: If you sum up the sizes of the sections “.VectorTable”, “.text” and “.data”, you obtain the amount of flash memory used. By summing up “.data”, “.stack” and “.bss”, you get the amount of RAM used. Note that “.data” is counted twice, as its initialization data is stored in flash and copied to RAM at startup.
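As a worked example of this summation, the following C sketch adds up the “Size” column values from the readelf output above. The struct and function names are chosen for this illustration only.

```c
#include <stdint.h>

/* Hypothetical result type: bytes used in flash and in RAM. */
typedef struct {
    uint32_t flash;
    uint32_t ram;
} MemoryUsage;

/* All arguments are section sizes in bytes, as printed by "readelf -S". */
MemoryUsage SumSections (uint32_t vectorTable, uint32_t text, uint32_t data,
                         uint32_t stack, uint32_t bss) {
    MemoryUsage u;
    u.flash = vectorTable + text + data;  /* .data: initial values live in flash */
    u.ram = data + stack + bss;           /* .data: the variables themselves live in RAM */
    return u;
}
```

With the values from the listing (0xec, 0x104, 0x4, 0x400 and 0x4), this gives 500 bytes of flash and 1032 bytes of RAM.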
==== nm ====
The “nm” utility prints the symbols defined in an ELF file, for example:

<pre>
$ arm-none-eabi-nm prog1.elf
080001ec W ADC1_2_IRQHandler
20000404 b BlinkStep
08000192 t BlinkTable
080001a3 t BlinkTableEnd
20000408 B _BssEnd
20000404 B _BssStart
…
</pre>

This can be helpful in analyzing errors in linker scripts where symbols might get assigned wrong addresses.
==== addr2line ====
The “addr2line” utility reads the debug information from an ELF file to determine which line in which source file produced the instruction found at a particular given address. For example:

<source lang=sh>
$ arm-none-eabi-addr2line 0x080000f0 -e prog1.elf
/tmp/test/prog1.S:24
</source>

Here, line 24 of “prog1.S” contains the assembler command that produced the instruction that ends up at address 0x080000f0.

==== objcopy ====
The “objcopy” utility allows you to translate program files between different formats. It is useful to convert the ELF files to both the Intel Hex format and a simple binary representation. For example,

<source lang=sh>
arm-none-eabi-objcopy -O ihex prog1.elf prog1.hex
</source>

produces a “.hex” file that contains an image of the flash contents in hexadecimal form. With

<source lang=sh>
arm-none-eabi-objcopy -O binary prog1.elf prog1.bin
</source>

a binary file is created which contains an exact 1:1 image of the flash contents. Some flashing tools require these formats instead of ELF, and viewing the binary file with a hex editor can be interesting as well.
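When inspecting a “.hex” file by hand, it helps to know that each line (record) of the Intel Hex format ends with a checksum byte: the two's complement of the low byte of the sum of all preceding bytes of the record. The following C sketch computes it. The function name is an assumption made for this example, and the record used in the usage note is a generic “extended linear address” record rather than output copied from the tutorial's project.

```c
#include <stddef.h>
#include <stdint.h>

/* record points to the decoded bytes of one line without the leading ':'
   and without the checksum: length, address (2 bytes), type, data. */
uint8_t IntelHexChecksum (const uint8_t *record, size_t length) {
    uint8_t sum = 0;
    for (size_t i = 0; i < length; i++) {
        sum = (uint8_t) (sum + record[i]);
    }
    /* Two's complement, so that sum + checksum == 0 modulo 256. */
    return (uint8_t) (0u - sum);
}
```

For the record “:020000040800F2”, which sets the upper 16 address bits to 0x0800 and is therefore typical for images located at 0x08000000, the bytes 02 00 00 04 08 00 yield the checksum 0xF2.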

=== Interfacing C and C++ code ===
Since assembly is rarely used to implement entire complex projects, but mostly for a few time-critical or especially low-level routines that are part of larger code bases written in a high-level language, interfacing C and assembly code is an important topic, which will be covered here. While it is possible to write the main project structure in assembly and integrate some C modules, it is usually done the other way round. Most of the code shown so far is already suitable for inclusion in C programs. Most of this topic works the same way for C++, apart from C++ exceptions (not to be confused with ARM processor exceptions) - but these are rarely used on embedded targets anyway.

If you compile C, C++ and assembly code into individual .o object files, you can link these together using “ld” as before. However, C and C++ code usually requires access to the respective standard library, and “ld” doesn’t link these by default - therefore it is necessary to replace the call to “ld” with a call to “gcc” or “g++” for C or C++, respectively. This will call “ld” internally and pass the required libraries.
==== Environment setup for C and C++ ====
Many C projects use a reset handler and vector table implemented in assembly, although writing them in C is possible too. As required by the C standard, C programs start with the “main()” function, so the (assembly) reset handler should set up the environment such that it is ready for C, and then call “main”. The C code might then later call some assembly functions or inline assembly. When using C++ code, or some GCC extensions for C code, it is required to call some additional functions before calling “main”. C++ uses this mechanism to call the constructors of global objects. The C and C++ compilers emit a table of pointers to the functions that should be called at startup. This table has to be put into flash memory by modifying the linker script as follows:

<source>
.text : {
*(.text)
. = ALIGN(4);

_InitArrayStart = .;
*(SORT(.preinit_array*))
*(SORT(.init_array*))
_InitArrayEnd = .;
} >FLASH
</source>

The table of function pointers is sorted to keep the order needed by the compiler. The symbols “_InitArrayStart” and “_InitArrayEnd” mark beginning and end of that table. A reset handler that performs the memory initialization as before and calls the table of initialization functions could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

ldr r4, =_InitArrayStart
ldr r5, =_InitArrayEnd

b 2f
1: ldr r0, [r4], #4
blx r0
2: cmp r4, r5
blo 1b

bl main
1: bkpt
b 1b
.ltorg
</source>

Note that for iterating the table, registers r4 and r5 are used, since the calling convention requires the called functions to preserve them. The “blx” instruction is needed to perform the indirect function call. When everything is set up, the main function is called. For embedded programs, the main function should never return (i.e. contain an endless loop). If it does, that’s an error, and to make it easier to find, an endless loop with a forced breakpoint is put right after the call to “main”.
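For readers who find the three loops easier to follow in a high-level language, here is a rough C rendering of what the reset handler does. It is a host-side sketch, not a replacement for the assembly code: ordinary arrays stand in for flash and RAM, and all names (CopyData, ZeroBss, RunInitArray, DemoStartup and the two sample initializers) are invented for this illustration.

```c
#include <stdint.h>

typedef void (*InitFunction) (void);

/* Copy the initial values of ".data" from their load address to RAM. */
void CopyData (uint32_t *dest, const uint32_t *destEnd, const uint32_t *src) {
    while (dest < destEnd) *dest++ = *src++;
}

/* Fill ".bss" with zeroes. */
void ZeroBss (uint32_t *start, const uint32_t *end) {
    while (start < end) *start++ = 0;
}

/* Call every function in the table, in table order. */
void RunInitArray (const InitFunction *start, const InitFunction *end) {
    for (const InitFunction *p = start; p < end; p++) (*p) ();
}

/* Two sample initializers that record the order in which they ran. */
static int g_order;
static void FirstInit (void)  { g_order = g_order * 10 + 1; }
static void SecondInit (void) { g_order = g_order * 10 + 2; }

/* Returns 1 if all three steps behaved as expected on the host. */
int DemoStartup (void) {
    static const uint32_t loadImage[2] = { 2u, 0xCAFEu };  /* plays the role of _DataLoad */
    uint32_t data[2] = { 0, 0 };                            /* plays the role of .data */
    uint32_t bss[2] = { 0xFFFFFFFFu, 0xFFFFFFFFu };         /* plays the role of .bss */
    const InitFunction initArray[2] = { FirstInit, SecondInit };

    g_order = 0;
    CopyData (data, data + 2, loadImage);
    ZeroBss (bss, bss + 2);
    RunInitArray (initArray, initArray + 2);

    return data[0] == 2u && data[1] == 0xCAFEu
        && bss[0] == 0 && bss[1] == 0
        && g_order == 12;
}
```

On a real target, the array bounds would be the linker script symbols instead of local arrays, and “main” would be called after the last step.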
==== Calling functions ====
To call assembly functions from C code and vice versa, the assembly functions should observe the calling convention, as mentioned before. C functions can be called just like assembly functions from assembly code, by placing the parameters in registers r0-r3 and on the stack, calling the function using “bl” and retrieving the return value from r0. To call an assembly function from C code, you need to declare it in C first just like a C function. For example, to call a function that takes 2 integer arguments and returns an integer:

<source lang=c>
int AssemblyFunction (int a, int b);
</source>

If you now define a function named “AssemblyFunction” in your assembly code and export it via “.global”, you can call it from C code just like any function.
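A minimal sketch of the C side of such a call is shown here. Since this snippet is meant to compile on its own, it ends with a C stand-in for the assembly function; on the target, that stand-in would be removed and the definition would live in a “.S” file. The name CallAssembly and the choice of adding the two arguments are arbitrary assumptions for this example.

```c
/* Declaration, as it would appear in a header or at the top of a C file. */
int AssemblyFunction (int a, int b);

/* The compiler places 3 in r0 and 4 in r1, emits a "bl AssemblyFunction"
   and takes the result from r0. */
int CallAssembly (void) {
    return AssemblyFunction (3, 4);
}

/* Stand-in definition for host builds only. In a real project, this would be
   assembly code exported with ".global AssemblyFunction". */
int AssemblyFunction (int a, int b) {
    return a + b;
}
```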
==== Accessing global variables ====
Global variables defined in C can be accessed from assembly code just like variables defined in assembly code, by using the variable’s name. To access an assembly variable from C code, you need to declare it first by specifying the type. For example, to declare an integer variable:

<source lang=c>
extern int AssemblyVariable;
</source>

If you now define a variable named “AssemblyVariable” in your assembly code and export it via “.global”, you can access it from C code just like any variable. The “extern” is required to make sure the C code only declares the variable and doesn’t define another variable of the same name.
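The following sketch shows the C side of such an access. To keep the snippet self-contained, it ends with a C stand-in definition of the variable; in a real project, that line would be dropped and the variable would be defined in assembly code. The function name and the initial value 41 are assumptions made for this example.

```c
/* Declaration only: no memory is reserved here. */
extern int AssemblyVariable;

/* Reads the variable, increments it and returns the old value. */
int ReadAndIncrement (void) {
    int old = AssemblyVariable;
    AssemblyVariable = old + 1;
    return old;
}

/* Stand-in definition for host builds only. On the target, the variable
   would be defined in a ".S" file and exported with ".global". */
int AssemblyVariable = 41;
```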
=== Clock configuration ===
By default, STM32 controllers use an internal RC-oscillator with 8 MHz as a clock source for the core and periphery. This oscillator is too inaccurate for implementing a clock or using serial interfaces such as UART, USB or CAN. To obtain a more accurate clock, an external quartz crystal is usually applied. Many STM32 boards feature an 8 MHz crystal. To use it, some initialization code is required that activates the microcontroller’s built-in crystal-oscillator circuit and switches the clock input to that. The STM32 controllers also include a PLL which can multiply some input clock by a configurable factor before feeding it to the processor core and peripherals. This way, a precise and fast clock can be achieved - the STM32F103 supports up to 72 MHz core frequency. Unfortunately, flash memory is not capable of keeping up with such a high frequency. Therefore, when enabling a fast clock, the flash memory needs to be configured to use wait states depending on the frequency.
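How many wait states are needed can be expressed as a small lookup. The sketch below uses the thresholds given in the STM32F103 reference manual for the LATENCY field of FLASH_ACR (up to 24 MHz, up to 48 MHz, up to 72 MHz); please verify them against the manual for your exact device. The function name is made up for this example.

```c
#include <stdint.h>

/* Returns the number of flash wait states for a given system clock in Hz,
   or -1 if the frequency is above the 72 MHz limit of the STM32F103. */
int FlashWaitStates (uint32_t sysclkHz) {
    if (sysclkHz <= 24000000u) return 0;
    if (sysclkHz <= 48000000u) return 1;
    if (sysclkHz <= 72000000u) return 2;
    return -1;
}
```

For the default 8 MHz clock this gives 0 wait states, and for the maximum of 72 MHz it gives 2.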

The following function configures the flash wait states, enables the crystal oscillator, configures the PLL to multiply the input clock by a factor of 9, and uses the PLL output as the system clock. The prescaler for the internal bus APB1 is set to 2. Assuming an 8 MHz crystal, this achieves the maximum performance possible with this microcontroller - 72 MHz for the core and APB2 domain, 36 MHz for APB1. If a different crystal is used, the PLL factors have to be adjusted.

<source lang=asm>
RCC = 0x40021000

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

.type ConfigureSysClock, %function
.global ConfigureSysClock
ConfigureSysClock:
@ Turn on HSE
ldr r0, =RCC
ldr r1, =((1 << RCC_CR_HSION)|(1 << RCC_CR_HSEON))
str r1, [r0, #RCC_CR]

@ Configure (but not start yet) PLL
@ Mul = 9, Prediv = 1, APB1 Prescaler = 2, APB2 Prescaler = 1, AHB Prescaler = 1
ldr r2, =(((9-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_USBPRE)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
orr r1, #(1 << RCC_CR_PLLON)

@ Wait for HSE ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_HSERDY)
beq 1b

@ Turn on PLL
str r1, [r0, #RCC_CR]

@ Pre-Calculate value for RCC_CFGR
orr r2, #(2 << RCC_CFGR_SW)

@ Wait for PLL ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_PLLRDY)
beq 1b

@ Set flash wait states to 2
ldr r0, =FLASH
ldr r3, =((1<<FLASH_ACR_PRFTBE)|(2<<FLASH_ACR_LATENCY))
str r3, [r0, #FLASH_ACR]
ldr r0, =RCC

@ Switch system clock to PLL
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
bic r1, #(1 << RCC_CR_HSION)

@ Wait for switch to PLL
1: ldr r3, [r0, #RCC_CFGR]
and r3, #(3 << RCC_CFGR_SWS)
cmp r3, #(2 << RCC_CFGR_SWS)
bne 1b

@ Turn off HSI to save power
str r1, [r0, #RCC_CR]

bx lr
.ltorg
</source>
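The constants loaded into r2 in the function above can be cross-checked on a host with a few lines of C. The sketch below mirrors the bit positions from the assembly listing and additionally computes the resulting clock frequencies. The function names are invented for this illustration, and the PLL multiplier encoding (factor minus 2) is taken from the expression “(9-2)” in the listing.

```c
#include <stdint.h>

/* Bit positions copied from the assembly listing. */
enum {
    RCC_CFGR_PLLMUL = 18,
    RCC_CFGR_USBPRE = 22,
    RCC_CFGR_PLLSRC = 16,
    RCC_CFGR_PPRE1  = 8,
    RCC_CFGR_SW     = 0
};

/* Value written to RCC_CFGR before the PLL is started. */
uint32_t RccCfgrBeforeSwitch (uint32_t pllFactor) {
    return ((pllFactor - 2u) << RCC_CFGR_PLLMUL)
         | (1u << RCC_CFGR_USBPRE)
         | (1u << RCC_CFGR_PLLSRC)
         | (4u << RCC_CFGR_PPRE1);
}

/* Same value with SW = 2, which selects the PLL as the system clock. */
uint32_t RccCfgrAfterSwitch (uint32_t pllFactor) {
    return RccCfgrBeforeSwitch (pllFactor) | (2u << RCC_CFGR_SW);
}

/* System clock for a given crystal frequency and PLL factor (prediv = 1). */
uint32_t SysclkHz (uint32_t crystalHz, uint32_t pllFactor) {
    return crystalHz * pllFactor;
}

/* APB1 clock with the prescaler of 2 configured above. */
uint32_t Apb1Hz (uint32_t sysclkHz) {
    return sysclkHz / 2u;
}
```

For a factor of 9, the two register values are 0x005D0400 and 0x005D0402. An 8 MHz crystal gives a system clock of 72 MHz and an APB1 clock of 36 MHz, as stated in the text.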

Many projects perform the clock configuration in the reset handler before calling the main function. If you want to follow that practice, place a “bl ConfigureSysClock” as the first instruction in the “Reset_Handler” - this way, all the setup will run with the higher clock frequency, making start-up faster. This and the completed startup code from the previous chapters are implemented in the “startup.S” file in the example repository. If you use it, put your code in the “main” function, where RAM and system clock will already be initialized. This is shown in the “BlinkStartup” example.

=== Project template & makefile ===
To quickly start your own project, a project template is supplied in the examples repository under the directory [https://github.com/Erlkoenig90/ArmAsmTutorial/tree/master/ProjectTemplate-STM32F103RB ProjectTemplate-STM32F103RB]. Put your own application code in the program.S file. The startup.S and vectortable.S contain the reset handler with RAM initialization and the vector table with default handler, respectively. A linker script is included too.

The project also contains a makefile. This allows you to quickly translate your project without having to type the assembler and linker commands. Simply type
<source lang=sh>
make
</source>
This translates the code and produces the program.elf, program.bin and program.hex files. All “.S” files in the directory will be automatically translated. Writing makefiles is a complex topic on its own with a lot of information already available on the web, so no further explanations on that will be made here.

ARM-ASM-Tutorial

2021-07-08T14:10:30Z

Nicolas:

[[Kategorie:ARM]][[Kategorie:STM32]][[Kategorie:Entwicklungstools]][[Kategorie:Programmiersprachen]]The [[ARM]] processor architecture is widely used in all kinds of industrial applications and also a significant number of hobby and maker projects. This tutorial aims to teach the fundamentals of programming ARM processors in assembly language.

Tutorial by [[Benutzer:Erlkoenig|Niklas Gürtler]]. [https://www.mikrocontroller.net/topic/482409 Thread in Forum] for feedback and questions.

== Introduction ==
=== Why assembly? ===
Today, there is actually little reason to use assembly language for entire projects, because high-quality optimizing compilers for high-level languages (especially C and C++) are readily available as free open source software and because the ARM architecture is specifically optimized for high-level languages. However, knowledge in assembly is still useful for debugging certain problems, writing low-level software such as bootloaders and operating system kernels, and reverse engineering software for which no source code is available. Occasionally it is necessary to manually optimize some performance-critical code section. Sometimes claims are made that ARM processors can’t be programmed in assembly. Therefore, this tutorial will show that this is very well possible by showing how to write entire (small) applications entirely in the ARM assembly language!

As most of the resources and tools for ARM focus on C programming and because of the complexity of the ARM ecosystem, the largest difficulty in getting started with ARM assembly is not the language itself, but rather using the tools correctly and finding relevant documentation. Therefore, this tutorial will focus on the development environment and how the written assembly code is transformed into the final program. With a good understanding of the environment, all the ARM instructions can be learned simply by reading the architecture documentation.

Because of the complex ecosystem around ARM, a general introduction of the ARM processor market is necessary.

=== About ARM ===
Arm Holdings is the company behind the ARM architecture. Arm does not manufacture any processors themselves, but designs the “blueprints” for processor cores, which are then licensed by various semiconductor companies such as ST, TI, NXP and many others, who combine the processor with various support hardware (most notably flash and RAM memories) and peripheral modules to produce a final complete processor IC. Some of these peripheral modules are even licensed from other companies – for example, the USB controller modules by Synopsys are found in many different processors from various manufacturers.

Because of this licensing model, ARM processor cores are found in a very large variety of products for which software can be developed using a single set of tools (especially compiler, assembler and debugger). This makes knowledge about the ARM architecture, particularly the ARM assembly language, useful for a large range of applications.

Since the ARM processor cores always require additional hardware modules to function, both the ARM-made processor core and the manufacturer-specific periphery modules have to be considered when developing software for ARM systems. For example, the instruction set is defined by ARM and software tools (compiler, assembler) need to be configured for the correct instruction set version, while the clock configuration is manufacturer-specific and needs to be addressed by initialization code specifically made for one processor.

=== Architecture and processor variants ===
A processor’s architecture defines the interface between hardware and software. Its most important part is the instruction set, but it also defines e.g. hardware behavior under exceptional circumstances (e.g. memory access errors, division by zero, etc.). Processor architectures evolve, so they have multiple versions and variants. They also define optional functionality that may or may not be present in a processor (e.g. a floating-point unit). For ARM, the architectures are documented exhaustively in the “ARM Architecture Reference Manuals”.

While the architecture is an abstract concept, a processor core is a concrete definition of a processor (e.g. as a silicon layout or HDL) that implements a certain architecture. Code that only uses knowledge of the architecture (e.g. an algorithm that does not access any periphery) will run on any processor implementing this architecture. Arm, as mentioned, designs processor cores for their own architectures, but some companies develop custom processors that conform to an ARM architecture, for example Apple and Qualcomm.

ARM architectures are numbered, starting with ARMv1 up until the most recent ARMv8. ARMv6 is the oldest architecture still in significant use, while ARMv7 is the most widespread one. Suffixes are appended to the version to denote variants of the architecture; e.g. ARMv7-M is for small embedded systems while ARMv7-A for more powerful processors. ARMv7E-M adds digital signal processing capabilities including saturating and SIMD operations.

Older ARM processors are named ARM1, ARM2 …, while after ARM11 the name “Cortex” was introduced. The Cortex-M family, including e.g. Cortex-M3 and Cortex-M4 (implementing ARMv7-M and ARMv7E-M architecture, respectively) is designed for microcontrollers, where power consumption, memory size, chip size and latency are important. The Cortex-A family, including e.g. Cortex-A8 and Cortex-A17 (both implementing ARMv7-A architecture) is intended for powerful processors (called “application processors”) for e.g. multimedia and communication products, particularly smartphones and tablets. These processors have much more processing power, typically feature high-bandwidth interfaces to the external world, and are designed to be used with high-level operating systems, most notably Linux (and Android).

An overview of ARM processors and their implemented architecture version can be found on [https://en.wikipedia.org/wiki/ARM_architecture#Cores Wikipedia].
This tutorial will focus on the Cortex-M microcontrollers, as these are much easier to program without an operating system and because assembly language is less relevant on Cortex-A processors. However, the large range of ARM-based devices necessitates flexibility in the architecture specification and software tools, which sometimes complicates their use.

There is actually not a single, but three instruction sets for ARM processors:
* The “A32” instruction set for 32bit ARM architectures, also simply called “ARM” instruction set, favors speed over program memory consumption. All instructions are 4 bytes in size.
* The “A64” instruction set is for the new 64bit ARM processors.
* The “T32” instruction set for 32bit ARM architectures, also known as “Thumb”, favors program memory consumption over speed. Most instructions are 2 bytes in size, and some are 4 bytes.

The 64bit Cortex-A application processors support all three instruction sets, while the 32bit ones support only A32 and T32. The Cortex-M microcontrollers only support T32. Therefore, this tutorial will only talk about “thumb2”, the second version of the “T32” instruction set.

== Prerequisites ==
First, suitable hardware and software need to be selected for demonstrating the usage of assembly language. For this tutorial, the choice of the specific microcontroller is of no great significance. However, to ensure that the example code is easily transferable to your setup, it is recommended to use the same components.

=== Microcontroller selection ===
For the microcontroller, an [https://www.st.com/en/microcontrollers-microprocessors/stm32f103c8.html STM32F103C8] or [https://www.st.com/en/microcontrollers-microprocessors/stm32f103rb.html STM32F103RB] by STMicroelectronics will be used. Both controllers are identical except for the flash size (64 KiB vs 128 KiB) and number of pins (48 vs 64). These controllers belong to ST’s “mainstream” entry-level family and are quite popular among hobbyist developers with many existing online resources. Several development boards with these controllers are available, for example: [https://www.st.com/en/evaluation-tools/nucleo-f103rb.html Nucleo-F103], “[https://os.mbed.com/users/hudakz/code/STM32F103C8T6_Hello/ Blue Pill]” (search for “stm32f103c8t6” on AliExpress, Ebay or Amazon), [https://www.olimex.com/Products/Duino/STM32/OLIMEXINO-STM32/open-source-hardware Olimexino-STM32], [https://www.olimex.com/Products/ARM/ST/STM32-P103/ STM32-P103], [https://www.olimex.com/Products/ARM/ST/STM32-H103/ STM32-H103], [https://www.st.com/en/evaluation-tools/stm3210e-eval.html STM3210E-EVAL].

=== Processor type & documentation ===
First, the microcontroller manufacturer’s documentation is used to find out what kind of ARM processor core and architecture is used for the chosen chip. This information is used to find all the relevant documentation.

* The first source of information is the [https://www.st.com/resource/en/datasheet/stm32f103rb.pdf STM32F103RB/C8 datasheet]. According to the headline, this is a '''medium-density''' device. This term is ST-specific and denotes a product family with certain features. The very first paragraph states that this microcontroller uses a '''Cortex-M3''' processor core with 72 MHz. This document also contains the electrical characteristics and pinouts.
* The next important document is the [https://www.st.com/resource/en/reference_manual/cd00171190.pdf STM32F103 reference manual] that contains detailed descriptions of the periphery. Particularly, detailed information about periphery registers and bits can be found here.
* The [https://developer.arm.com/ip-products/processors/cortex-m/cortex-m3 ARM developer website] provides information about the Cortex-M3 processor core, particularly the [https://static.docs.arm.com/100165/0201/arm_cortexm3_processor_trm_100165_0201_01_en.pdf ARM Cortex-M3 Processor Technical Reference Manual]. According to chapter 1.5.3, this processor implements the '''ARMv7-M architecture'''.
* The architecture is documented in the [https://static.docs.arm.com/ddi0403/ed/DDI0403E_d_armv7m_arm.pdf ARMv7M Architecture Reference Manual]. Particularly, it contains the complete documentation of the instruction set.

For any serious STM32 development, you should be familiar with all these documents.

=== Debug adapter ===
There are many different ways of getting your program to run on an STM32 controller. A debug adapter is not only capable of writing software to the controller’s flash, but can also analyze the program’s behavior while it is running. This allows you to run the program one instruction at a time, analyze program flow and memory contents and find the cause of crashes. While it is not strictly necessary to use such a debugger, it can save a lot of time during development. Since entry-level models are available cheaply, not using one doesn’t even save money. Debuggers connect to a host PC via USB (some via Ethernet) and to the microcontroller (“target”) via JTAG or SWD. While these two interfaces are closely related and perform the same function, SWD uses fewer pins (2 instead of 4, excluding reset and ground). Most STM32 controllers support JTAG, and all support SWD.

Documenting all possible ways of flashing and debugging STM32 controllers is beyond the scope of this tutorial; a lot of information is already available online on that topic. Therefore, this tutorial will assume that the [https://www.st.com/en/development-tools/st-link-v2.html ST-Link] debug adapter by STMicroelectronics is used, which is cheap and popular among hobbyists. Some of the aforementioned boards even include an ST-Link adapter, which can also be used “stand-alone” to flash an externally connected microcontroller. The examples should work with other adapters as well; please consult the appropriate documentation on how to use them.

=== Development Software ===
On the software part, several tools are needed for developing microcontroller firmware. Using a complete Integrated Development Environment (IDE) saves time and simplifies repetitive steps but hides some important steps that are necessary to gain a basic understanding of the process. Therefore, this tutorial will show the usage of the basic command line tools to demonstrate the underlying principles. Of course, for productive development, using an IDE is a sensible choice. The tools presented will work on Windows, Linux and Mac OS X (untested).

First, a text editor for writing assembly code is needed. Any good editor such as Notepad++, gedit or Kate is sufficient. When using Windows, the [https://www.st.com/en/development-tools/stsw-link004.html ST-Link Utility] can be useful, but is not strictly required.

Next, an assembler toolchain is needed to translate the written assembly code into machine code. For this, the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm GNU Arm Embedded Toolchain] is used. This is a collection of open source tools for writing software in Assembly, C and C++ for Cortex-M microcontrollers. Even though the package is maintained by ARM, the software is created by a community of open-source developers. For this tutorial, only the contained applications “binutils” (includes assembler & linker) and “GDB” (debugger) are really needed, but if you later decide to work with C or C++ code, the contained compilers will come in handy. Apart from that, this package is also shipped as part of several IDEs such as SW4STM32, Atollic TrueSTUDIO, emIDE, Embedded Studio and even Arduino – so if you (later) wish to work with one of these, your assembly code will be compatible with it.

Another component is required to talk with the debug adapter. For the ST-Link, this is done by [http://openocd.org/ OpenOCD], which communicates with the adapter via USB. Other adapters such as the J-Link ship with their own software.

Lastly, a calculator that supports binary and hexadecimal modes can be very helpful. Both the default Gnome calculator and the Windows calculator (calc.exe) are suitable.

== Setup ==
Follow the instructions in the next chapters to set up your development environment.

=== Hardware ===
The only thing that needs to be done hardware-wise is connecting the debugger with your microcontroller. If you are using a development board with an integrated debugger (such as the Nucleo-F103), this is achieved by setting the jumpers accordingly (see the board’s documentation – for e.g. the Nucleo-F103, both “CN2” jumpers need to be connected). When using an external debugger, connect the “GND”, “JTMS/SWDIO” and “JTCK/SWCLK” pins of debugger and microcontroller. Connect the debugger’s “nRESET” (or “nTRST” if it only has that) pin to the microcontroller’s “NRST” input.

If your board has jumpers or solder bridges for the “BOOT0” pin, make sure that the pin is low. Applying power to the microcontroller board is typically done via USB.

=== Software ===
==== Linux ====
Some Linux distributions ship with packages for the ARM toolchain. Unfortunately, these are often outdated and also configured slightly differently than the aforementioned package maintained by ARM. Therefore, to be consistent with the examples, it is strongly recommended to use the package by ARM.

Download the Linux binary tarball from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and extract it to some directory whose path does not contain any spaces. The extracted directory contains a subdirectory called “bin”. Copy the full path to that directory (e.g. “/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin”).

Add this path to the “PATH” environment variable. On Ubuntu/Debian systems, this can be done via:

<source lang=sh>
echo 'export PATH="${PATH}:/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin"' | sudo tee /etc/profile.d/gnu-arm-embedded.sh
</source>

OpenOCD can be installed via the package manager, e.g. (Ubuntu/Debian):
<source lang=sh>
sudo apt-get install openocd
</source>

After that, log out and back in (or just reboot). In a terminal, type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source lang=sh>
$ arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source lang=sh>
$ openocd -v
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

==== Windows ====
[[File:ArmAsmTutorial_GccOptions.png|300px|thumb|right|Options for installing GCC]]Download the Windows installer from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and run it. Enable the options “Add path to environment variable” and “Add registry information”, and disable “Show Readme” and “Launch gccvar.bat”.

A Windows package for OpenOCD can be obtained from the [https://github.com/gnu-mcu-eclipse/openocd/releases gnu-mcu-eclipse downloads page]. Download the appropriate file, e.g. “gnu-mcu-eclipse-openocd-0.10.0-12-20190422-2015-win64.zip”. The archive contains a path like “GNU MCU Eclipse/OpenOCD/0.10.0-12-20190422-2015”. Extract the contents of the inner directory (i.e. the subdirectories “bin”, “doc”, “scripts”…) into some directory whose path does not contain any spaces, e.g. “C:\OpenOCD”. You should now have a directory “C:\OpenOCD\bin” or similar. Copy its full path.

[[File:ArmAsmTutorial_PcProperties.png|300px|thumb|right|Opening PC properties]][[File:ArmAsmTutorial_SetEnvVar.png|300px|thumb|right|Setting environment variable]]Set the “Path” environment variable to include this path: Right-Click on “This PC”, then “Properties” → “Advanced System Settings”→ “Environment Variables”. In the lower list (labeled “System variables”), select “Path”. Click “Edit” → “New”, paste the path, and click “OK” multiple times.

Open a ''new'' command window (Windows Key + R, type “cmd” + Return). Type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source>
C:\>arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source>
C:\>openocd -v
GNU MCU Eclipse OpenOCD, 64-bitOpen On-Chip Debugger 0.10.0+dev-00593-g23ad80df4 (2019-04-22-20:25)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

== Writing assembly applications ==
The full source code of the examples in the following chapters can be found on [https://github.com/Erlkoenig90/ArmAsmTutorial GitHub]. The name of the corresponding directory is given after each code example below.
=== First rudimentary program ===
After the software setup, you can begin setting up a first project. Create an empty directory for that, e.g. “prog1”.

Inside the project directory, create your first assembly file “prog1.S” (“.S” being the file name extension for assembly files in GNU context) with the following content:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

nop @ Do Nothing
b . @ Endless loop
</source>
Example name: “EmptyProgram”

When this file is sent to the assembler, it will translate the instructions into binary machine code, with 2 or 4 bytes per instruction. These bytes are concatenated to form a program image, which is later written into the controller’s flash memory. Therefore, assembly code more or less directly describes flash memory contents.

The lines starting with a dot “.” are assembler directives that control the assembler’s operation. Only some of those directives emit bytes that will end up in flash memory. The @ symbol starts a comment.

The first line lets the assembler use the new “unified” instruction syntax (“UAL” - Unified Assembler Language) instead of the old ARM syntax. The second line declares the Cortex-M3 as the processor in use, which the assembler needs to know in order to recognize the instructions available on that processor. The third line instructs the assembler to use the Thumb (T32) instruction set. We can’t start putting instructions in flash memory right away, as the processor expects a certain data structure to reside at the very beginning of the memory. This is what the “.word” and “.space” directives create. These will be explained later.
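
As a brief preview, the data directives of the example can be annotated as follows. This is only a sketch: the comments are not part of the original example, they rely on the general Cortex-M3 start-up convention (initial stack pointer in the first word, address of the first instruction in the second word), and the description of the remaining space as unused table entries is an assumption.
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400   @ Flash offset 0x00: initial value of the stack pointer
.word 0x080000ed   @ Flash offset 0x04: address of the first instruction, plus 1
.space 0xe4        @ Flash offset 0x08: 0xe4 zero bytes (assumed: unused table entries)

@ 4 + 4 + 0xe4 = 0xec bytes precede the code, so "nop" ends up at 0x080000ec
nop                @ Do Nothing
b .                @ Endless loop
</source>
The sum 0xec matches the second “.word”: 0x080000ec is the address of “nop”, and the lowest bit is set to 1 to mark Thumb code.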

The first “real” instruction is “nop”, which will be the first instruction executed after the processor starts. “nop” is short for “No OPeration” - it causes the processor to do nothing and continue with the next instruction. This next instruction is “b .”. “b” is short for “branch” and instructs the processor to jump to a certain “target” location, i.e. execute the instruction at that target next. In assembly language, the dot “.” represents the current location in program memory. Therefore, “b .” instructs the processor to jump to this very instruction, i.e. execute it again and again in an endless loop. Such an endless loop is frequently found at the end of microcontroller programs, as it prevents the processor from executing random data that is located in flash memory after the program.

To translate this assembly code, open a terminal (Linux) / command window (Windows). Enter the project directory by typing <code>cd <Path to Project Directory></code>. Call the assembler like this:

<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
</source>

This instructs the assembler to translate the source file “prog1.S” into an object file “prog1.o”. This is an intermediary file that contains binary machine code, but is not a complete program yet. The “-g” option tells the assembler to include debug information, which does not influence the program itself, but makes debugging easier. To turn this object file into a final program, call the linker like this:

<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000
</source>

This creates a file “prog1.elf” that contains the whole generated program. The “-Ttext” option instructs the linker to assume 0x8000000 as the start address of the flash memory. The linker might output a warning like this:

<pre>arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000</pre>

This is not relevant for executing the program without an operating system and can be ignored.

=== Flashing the program ===
To download the assembled and linked application to the microcontroller that has been attached via ST-Link, use OpenOCD like so:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg -c "program prog1.elf verify reset exit"
</source>
Unfortunately, the application does not do anything that can be observed from the outside, except perhaps increase the current consumption.
=== Starting the debugger ===
To check whether the program is actually running, start a debugging session to closely observe the processor’s behavior. First, run OpenOCD such that it acts as a GDB server:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg
</source>
Then, open a new terminal/command window and start a GDB session:
<source lang=sh>
arm-none-eabi-gdb prog1.elf
</source>
GDB provides its own interactive text-based user interface. First, type this command to let GDB connect to the already running OpenOCD instance:
<source>
target remote :3333
</source>
Then, stop the currently running program:
<source>
monitor reset halt
</source>
If this fails, hold your board’s reset button just before executing the command and repeat until it succeeds. GDB can also download code to flash memory by simply typing:
<source>
load
</source>
This will overwrite the previously flashed program (which, in this case, is identical anyway). After loading the program, reset the controller again:
<source>
monitor reset halt
</source>
Now, examine the contents of the CPU registers:
<source>
info reg
</source>
The output should look something like
<pre>
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
sp 0x0 0x0
lr 0x0 0
pc 0x8000000 0x8000000 <_stack+133693440>
xPSR 0x1000000 16777216
msp 0x20000400 0x20000400
psp 0x27e3fa34 0x27e3fa34
primask 0x0 0
basepri 0x0 0
faultmask 0x0 0
control 0x0 0
</pre>

At this point, the processor is ready to start executing your program. The processor is halted just before the first instruction, which is “nop”. You can let the processor execute one single instruction (i.e. the “nop”) by typing
<source>
stepi
</source>
If you type <code>info reg</code> again, you will see that PC is now “0x80000ee”, i.e. the processor is about to execute the next instruction, “b .”. When you run <code>stepi</code> again (repeatedly), nothing more will happen – the controller is stuck in the mentioned endless loop, exactly as intended. You can instruct the processor to run the program continuously, without stopping after each instruction, by typing
<source>
continue
</source>
You can interrupt the running program by pressing “Ctrl+C”. Run the commands
<source>
kill
quit
</source>
to exit GDB. You can terminate OpenOCD by pressing “Ctrl+C” in its terminal.
=== Using processor registers ===
The example program hasn’t done anything useful, but any “real” program will need to process some data. On ARM, any data processing is done via the processor registers. The 32bit ARM platforms have 16 processor registers, each of which is 32bit in size. The last three of those (r13-r15) have a special meaning and can only be used with certain restrictions. The first thirteen (r0-r12) can be used freely by the application code for data processing.

All calculations (e.g. addition, multiplication, logical and/or) need to be performed on those processor registers. To process data from memory, it first has to be loaded into a register, then processed, and stored back into memory. This is typical for RISC platforms and is known as a “load-store-architecture”.

As the starting point for any calculation, some specific values need to be put into the registers. The easiest way to do that is:
<source lang=asm>
ldr r0, =123456789
</source>
The number 123456789 will be encoded as part of the program, and the instruction lets the processor copy it into the register “r0”. Any number and any register in the range r0-r13 can be used instead.

The instruction “mov” can be used to copy the contents from one register to another:
<source lang=asm>
mov r1, r0
</source>
This copies r0 to r1. Unlike some other processor architectures, “mov” can not be used to access memory, but only the processor registers.
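
The two instructions can be tried out together in a small program. The following is a minimal sketch that reuses the header of the first example; the constant 0xcafe and the choice of registers are arbitrary and only serve as illustration.
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =123456789   @ r0 = 123456789 (0x75bcd15)
ldr r1, =0xcafe      @ r1 = 0xcafe (51966)
mov r2, r0           @ r2 = 123456789, r0 is unchanged
mov r0, r1           @ r0 = 0xcafe, the old value of r0 is overwritten
b .                  @ Endless loop
</source>
Stepping through this with <code>stepi</code> and checking <code>info reg</code> after each step should show the registers changing one at a time.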

In ARM, 32bit numbers are called "words" and are most frequently used. 16bit numbers are known as half-words, and 8bit numbers as bytes, as usual.
=== Accessing periphery ===
To write microcontroller programs that interact with the outside world, access to the controller’s periphery modules is required. Interaction with periphery happens mainly through periphery registers (also known as “special function registers”, SFR). Despite their name, they work quite differently from processor registers. Instead of numbers, they have addresses (in the range of 0x40000000-0x50000000) that are not contiguous (i.e. there are gaps). They cannot be used directly for data processing, but need to be explicitly read and written before and after any calculations. Not all of them are 32bit; many have only 16bit, and some of those bits may not exist and can’t be accessed. The microcontroller manufacturer’s documentation uses names for these registers, but the assembler doesn’t know these. Therefore, the assembly code needs to use the numerical addresses.

The easiest way to get the microcontroller to do something that produces some visible result is to send a signal via an output pin to turn on an LED. Using a pin to send/receive arbitrary software-defined signals is called “GPIO” (General Purpose Input/Output). First, choose a pin – for example, PA8 (this one is available on all package variants). Connect an LED to this pin and to GND (“active high”). Use a series resistor to limit the current to max. 15mA (the absolute maximum being 25mA), e.g. 100Ω for a 3.3V supply and a standard LED. For higher loads (e.g. high-power LEDs or a relay) use an appropriate transistor.

As with most microcontrollers, the pins are grouped into so-called “ports”, each of which has up to 16 pins. The ports are named by letters of the alphabet, i.e. “GPIOA”, “GPIOB”, “GPIOC” etc. The number of ports and pins varies among the individual microcontroller types. The 16 pins of one port can be read or written in one single step.

==== Clock Configuration ====
Many ARM controllers feature a certain trap: Most periphery modules are disabled by default to save power. The software has to explicitly enable the needed modules. On STM32 controllers, this is done via the “RCC” (Reset and Clock Control) module. Particularly, this module allows the software to disable/enable the clock signal for each periphery module. Because CMOS circuits (virtually all modern ICs) draw most of their power while switching, i.e. while a clock signal is applied, turning off the clock of unused modules can reduce the power usage considerably.

This is documented in the aforementioned reference manual in chapter 7. The subchapter 7.3.7 describes the periphery register “RCC_APB2ENR” which allows you to configure the clock signal for some peripheral modules. This register has 32 bits, of which 14 are “reserved”, i.e. can’t be used and should only be written with zeroes. Each of the available 18 bits enables one specific periphery module if set to “1” or disables it if set to “0”. According to the manual, the reset value of this register is 0, so all periphery modules are disabled by default. In order to turn on the GPIOA module to which the desired pin PA8 belongs, the bit “IOPAEN” needs to be set to “1”. This is bit number two in the register. Since registers can only be accessed as a whole (individual bits can’t be addressed), a 32bit-value where bit two is “1” and all others are kept as “0” needs to be written. This value is 0x00000004.

To write to the register, its address needs to be given in the code. The addresses of the periphery registers are grouped by the periphery modules they belong to - each periphery module (e.g. RCC, GPIOA, GPIOB, USB, …) has its own base address. The addresses of the individual registers are specified as an offset that needs to be added to this base address to obtain the full absolute address of the register. Chapter 7.3.7 specifies the offset address of RCC_APB2ENR as “0x18”. Chapter 3.3 specifies the base addresses of all periphery modules – RCC is given as “0x40021000”. So, the absolute address of RCC_APB2ENR is “0x40021000+0x18=0x40021018”.

In short: To enable GPIOA, the value '''0x00000004''' needs to be written to address '''0x40021018'''.

According to the “load-store” principle, ARM processors can’t do this in a single step. Both the value to be written and the address need to reside in processor registers in order to perform the write access. So, what needs to be done is:
* Load the value 0x00000004 into a register
* Load the value 0x40021018 into another register
* Store the value from the first register into the memory location specified by the second register.
This last step is performed by the “STR” instruction as follows:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
The square brackets are required but just serve as a reminder to the programmer that the contents of “r1” is used as an address. After the “str” instruction, the GPIOA periphery is enabled, but doesn’t do anything yet.
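
The address arithmetic described above (base address plus offset) can also be left to the assembler. The following sketch uses the “.equ” directive to define constants; the symbol names RCC_BASE, RCC_APB2ENR and IOPAEN are hypothetical and merely chosen to resemble the names in the reference manual. The assembler does not know them unless they are defined like this.
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

@ Hypothetical symbol names, defined here for illustration only
.equ RCC_BASE, 0x40021000          @ Base address of RCC (chapter 3.3)
.equ RCC_APB2ENR, RCC_BASE + 0x18  @ Base + offset = 0x40021018
.equ IOPAEN, 1 << 2                @ Bit two set = 0x00000004

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =IOPAEN
ldr r1, =RCC_APB2ENR
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
Since the symbols are plain constants, the generated machine code should be the same as with the numbers written out directly.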

==== GPIO Configuration ====
By default, all GPIO pins are configured as “input”, even if there is no software to process the input data. Since inputs are “high-impedance”, i.e. only a very small current can flow into/out of the pin, the risk of (accidental) short-circuits and damage to the microcontroller is minimized. However, this current is too small to light up an LED, so you have to configure the pin PA8 as “output”. The STM32 support multiple output modes, of which the right one for the LED is “General Purpose Output Push-Pull, 2 MHz”.

Access and configuration of GPIO pins is achieved via the registers of the GPIO periphery. The STM32 have multiple identical instances of GPIO modules, which are named GPIOA, GPIOB, … Each of those instances has a distinct base address, which are again described in chapter 3.3 of the reference manual (e.g. “0x40010800” for GPIOA, “0x40010C00” for GPIOB etc.). The registers of the GPIO module are described in chapter 9.2, and there is one instance of each register per GPIO module. To access a specific register of a specific GPIO module, the base address of that module needs to be added to the offset address of the register. For example, “GPIOA_IDR” has address “0x40010800+0x08=0x40010808”, while “GPIOB_ODR” has address “0x40010C00+0x0C=0x40010C0C”.

Configuration of the individual GPIO pins happens through the “GPIOx_CRL” and “GPIOx_CRH” registers (“x” is a placeholder for the concrete GPIO module) – see chapters 9.2.1 and 9.2.2. Both registers are structured identically, where each pin uses 4 bits, so each of the two registers handles 8 pins in 8x4=32 bits. Pins 0-7 are configured by “GPIOx_CRL” and pins 8-15 by “GPIOx_CRH”. Pin 0 is configured by bits 0-3 of “GPIOx_CRL”, pin 1 by bits 4-7 of “GPIOx_CRL”, pin 8 by bits 0-3 of “GPIOx_CRH” and so on.

The 4 bits per pin are split into two 2-bit fields: “MODE” occupies bits 0-1, and “CNF” bits 2-3. “MODE” selects from input and output modes (with different speeds). In output mode, “CNF” determines whether the output value is configured from software (“General Purpose” mode) or driven by some other periphery module (“Alternate function” mode), and whether two transistors (“Push-pull”) or one (“open-drain”) are used to drive the output. In input mode, “CNF” selects from analog mode (for ADC), floating input and input with pull-up/down resistors (depending on the value in the “GPIOx_ODR” register).

Therefore, to configure pin PA8 into “General Purpose Output Push-Pull, 2 MHz” mode, bits 0-3 of “GPIOA_CRH” need to be set to value “2”. The default value of “4” configures the pin as “input”. To keep the other pins at their “input” configuration, the value “0x44444442” needs to be written to register “GPIOA_CRH”, which has address “0x40010804”:
<source lang=asm>
ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
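To make the origin of the value “0x44444442” more visible, it can be composed from its bit fields with assembler expressions. This is a sketch with hypothetical symbol names (MODE_OUTPUT_2MHZ, CNF_GP_PUSH_PULL, PIN_OUTPUT, GPIOA_CRH_VALUE); the fragment is meant as a drop-in replacement for the three instructions above and still has to follow the code that enables GPIOA.
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

@ Hypothetical symbol names; the field values are the ones discussed above
.equ MODE_OUTPUT_2MHZ, 2    @ MODE field (bits 0-1): output, 2 MHz
.equ CNF_GP_PUSH_PULL, 0    @ CNF field (bits 2-3): general purpose push-pull
.equ PIN_OUTPUT, (CNF_GP_PUSH_PULL << 2) | MODE_OUTPUT_2MHZ    @ = 2

@ PA8 uses bits 0-3 of GPIOA_CRH; PA9-PA15 keep the default value 4
.equ GPIOA_CRH_VALUE, 0x44444440 | PIN_OUTPUT                  @ = 0x44444442

ldr r0, =GPIOA_CRH_VALUE
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>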

==== Writing GPIO pins ====
The GPIO pin still outputs the default value, which is 0 for “low”. To turn on the LED, the output has to be set to “1” for “high”. This is achieved via the GPIOA_ODR register, which has 16bits, one for each pin (see chapter 9.2.4). To enable the LED, set bit 8 to one:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x100
ldr r1, =0x4001080C
str r0, [r1] @ Set ODR8 in GPIOA_ODR to 1 to set PA8 high

b .
</source>
Example name: “SetPin”

This program enables the GPIOA periphery clock, configures PA8 as output, and sets it to high. If you run it on your microcontroller, you should see the LED turn on – the first program to have a visible effect!

=== Data processing ===
ARM supports many instructions for mathematical operations. For example, addition can be performed as:
<source lang=asm>
ldr r0, =222
ldr r1, =111
add r2, r0, r1
</source>
This will first load the value 222 into register r0, load 111 into r1, and finally add r0 and r1 and store the result (i.e. 333) in r2. The operand for the result is (almost) always put on the left, while the input operand(s) follow on the right.

You can also overwrite an input register with the result:
<source lang=asm>
add r0, r0, r1
</source>
This will write the result to r0, overwriting the previous value. This is commonly shortened to
<source lang=asm>
add r0, r1
</source>
The output operand can be omitted, and the first input (here: r0) will be overwritten. This applies to most data processing instructions. Other frequently used data processing instructions that are used in a similar fashion are:
* '''sub''' for subtraction
* '''mul''' for multiplication
* '''and''' for bitwise and
* '''orr''' for bitwise or
* '''eor''' for bitwise exclusive or (“xor”)
* '''lsl''' for logical left shift
* '''lsr''' for logical right shift
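
The instructions listed above can be tried out in one go. The following sketch uses arbitrarily chosen small input values so that the results are easy to check by hand, and again reuses the header of the first example.
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =12          @ r0 = 12 = 0b1100
ldr r1, =10          @ r1 = 10 = 0b1010
ldr r2, =2           @ r2 = 2, used as shift amount

sub r3, r0, r1       @ r3 = 12 - 10 = 2
mul r4, r0, r1       @ r4 = 12 * 10 = 120
and r5, r0, r1       @ r5 = 0b1100 & 0b1010 = 0b1000 = 8
orr r6, r0, r1       @ r6 = 0b1100 | 0b1010 = 0b1110 = 14
eor r7, r0, r1       @ r7 = 0b1100 ^ 0b1010 = 0b0110 = 6
lsl r8, r0, r2       @ r8 = 12 << 2 = 48
lsr r9, r0, r2       @ r9 = 12 >> 2 = 3

b .                  @ Endless loop
</source>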

Most of these instructions can not only take registers as input, but also immediate arguments. Such an argument is encoded directly into the instruction without needing to put it into a register first. Immediate arguments need to be prefixed by a hash sign #, and can be decimal, hexadecimal or binary. For example,
<source lang=asm>
add r0, r0, #23
</source>
adds 23 to the register r0 and stores the result in r0. This can again be shortened to
<source lang=asm>
add r0, #23
</source>

Such immediate arguments cannot be arbitrarily large, because they need to fit inside the instruction, which is 16 or 32 bit in size and also needs some room for the operation code and the register numbers. So, if you want to add a large number, you have to use “ldr” first as shown to load it into a register.
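
The difference between small and large constants can be sketched as follows. The values 1000 and 0x12345678 are arbitrary examples; r1 serves as a temporary register for the large constant.
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =1000        @ r0 = 1000
add r0, r0, #23      @ Small constant fits into the instruction: r0 = 1023
ldr r1, =0x12345678  @ Large constant does not fit: load it into r1 first
add r0, r0, r1       @ r0 = 1023 + 0x12345678 = 0x12345a77
b .                  @ Endless loop
</source>
Writing <code>add r0, r0, #0x12345678</code> instead should be rejected by the assembler with an error message, since this value cannot be encoded as an immediate argument.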

Try out the above examples and use GDB to examine their behavior. Use GDB’s “info reg” command to display the register contents. Don't forget to execute both the “arm-none-eabi-as” and “arm-none-eabi-ld” commands to translate the program.

=== Reading periphery registers ===
The last example works, but has a flaw: Even though only a few bits per register need to be modified, the code overwrites all the bits in the register at once. The bits that should not be modified are just overwritten with their respective default value. If some of those bits had been changed before – for example to enable some other periphery module – these changes would be lost. Keeping track of the state of the register throughout the program is hardly practical. Since ARM does not permit modifying individual bits, the solution is to read the whole register, modify the bits as needed, and write the result back. This is called a “read-modify-write” cycle.

Reading registers is done via the “ldr” instruction. As with “str”, the address needs to be written into a processor register beforehand, and the instruction stores the read data into a processor register as well. Starting with the “RCC_APB2ENR” register, you can read it via:
<source lang=asm>
ldr r1, =0x40021018
ldr r0, [r1]
</source>
Even though the two “ldr” instructions look similar, they work differently – the first one loads a fixed value into a register (r1), while the second loads data from the periphery register, whose address is held in r1, into r0.

The loaded value should then be modified by setting bit two to “1”. This can be done with the “orr” instruction:
<source lang=asm>
orr r0, r0, #4
</source>
After that, we can store r0 as before.

With the GPIOA_CRH register, it’s slightly more complicated: The bits 0, 2 and 3 need to be cleared, while bit 1 needs to be set to 1. The other bits (4-31) need to keep their value. To clear the bits, use the “and” instruction after loading the current periphery register value:

<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
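As a possible alternative that is not used in the example above, ARM also provides the “bic” (“bit clear”) instruction, which clears exactly those bits that are set to 1 in its last operand. The following fragment is a sketch of the same read-modify-write cycle written with “bic”:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

ldr r1, =0x40010804
ldr r0, [r1]
bic r0, #0xf         @ Clear bits 0-3, same effect as "and r0, #0xfffffff0"
orr r0, #2           @ Set bit 1
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
Which variant to use is mostly a matter of taste; with “bic”, the mask directly lists the bits to be cleared.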

For the “GPIOx_ODR” registers, such tricks are not needed, as there is a special “GPIOx_BSRR” register which simplifies writing individual bits: This register can not be read, and writing zeroes to any bit has no effect on the GPIO state. However, if a 1 is written to any of the bits 0-15, the corresponding GPIO pin is set to high (i.e. the corresponding bit in ODR set to 1). If any of the bits 16-31 is written to 1, the corresponding pin is set to low. So, the pin can be set to 1 like this:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>
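Since the set and reset bits live in the same register, several pins can be changed with one single write. The following sketch sets PA8 high and PA9 low at once; it assumes, purely for illustration, that PA9 has been configured as output in the same way as PA8, which the examples in this tutorial do not do.
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

@ Assumption for illustration: PA9 is configured as output like PA8
ldr r1, =0x40010810
ldr r0, =0x02000100  @ Bit 8 (BS8) and bit 25 (BR9, i.e. bit 16 + 9)
str r0, [r1] @ Set PA8 high and PA9 low with a single write to GPIOA_BSRR
</source>
The tutorial’s examples only use PA8, so the single-pin write shown before is all that is needed there.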

So, the modified program is:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “SetPin2”
=== Jump instructions ===
For a traditional “hello world” experience, the LED should not only light up, but blink, i.e. turn on and off repeatedly. Setting pin PA8 to low level can be achieved by writing a 1 to bit 24 in the “GPIOA_BSRR” register:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1]
</source>

By pasting this after the instructions for turning on the LED, the LED will be turned on and then off again. To get the LED to blink, those two blocks need to be repeated endlessly, i.e. at the end of the code there needs to be an instruction for jumping back to the beginning.

A simple endless loop was already explained: The “b .” instruction, which just executes itself repeatedly. To have it jump somewhere else, the dot needs to be replaced by the desired target address, for example:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b 0x8000104
</source>
Example name: “Blink”

The address specified is an absolute address, which is the address of the “ldr” instruction at the beginning of the block for setting the pin to high. Actually, the branch instruction “b” is not capable of jumping directly to such an absolute address - again, because a 32 bit wide address can't be encoded in a 16/32 bit wide instruction. Instead, the assembler calculates the distance between the location of the “b” instruction and the jump target, and stores it into the instruction. When jumping backwards, this distance is negative.

When executing program code, the processor always stores the address of the currently executed instruction plus four in the r15 register, which is therefore also known as PC, the program counter. When encountering a “b” instruction, the processor adds the contained distance value to the PC value to calculate the absolute address of the jump target before jumping there.

This means that “b” performs a relative jump, and even if the whole machine code section were moved somewhere else in memory, the code would still work. However, the assembly language syntax does not really represent this, as the assembler expects absolute addresses which it then transforms into relative ones.

Specifying the target address directly as shown is very impractical, as it has to be calculated manually, and if the section of code is moved or modified, the address needs to be changed. To rectify this, the assembler supports labels: You can assign a name to a certain code location, and use this name to refer to the code location instead of specifying the address as a number. A label is defined by writing its name followed by a colon:

<source lang=asm>
BlinkLoop:

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b BlinkLoop
</source>
Example name: “Blink2”

This is purely a feature of the assembler – the generated machine code will be identical to the previous example. In “b BlinkLoop”, the assembler substitutes the label with the address it represents to calculate the relative jump distance. The assembler provides no dedicated syntax for specifying the relative offset that will be encoded in the instruction, but it can be done like this:

<source lang=asm>
b (.+4+42*2)
</source>

The resulting instruction will contain “42” as the jump offset. As suggested by the syntax, the processor multiplies this number by 2 (since instructions can only reside at even memory addresses, it would waste one bit of memory to specify the number directly) and adds to it the address of the “b” instruction plus 4. The assembly syntax is designed to represent the end result of the operation, so the assembler reverses the peculiar pre-calculations of the processor. If you want to do this calculation yourself, you have to again undo the assembler’s own calculation with the expression shown above. There is usually no reason to do that, though.
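
As a worked example of this arithmetic, the following sketch places a backward branch four bytes behind its target. It is only an illustration and not part of the example projects: the label name “Target” is made up, and the quoted machine code assumes that the assembler picks the 16-bit encodings for “nop” and “b”.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

Target:
nop @ 2 bytes, located at Target+0
nop @ 2 bytes, located at Target+2
b Target @ Located at Target+4, so the PC reads as Target+8
@ Distance in bytes: Target - (Target+8) = -8
@ Stored offset: -8 / 2 = -4, which gives the 16-bit machine code 0xe7fc
</source>

For comparison, “b .” has a distance of -4 bytes and therefore a stored offset of -2, which yields the machine code 0xe7fe.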

=== Counting Loops ===
The above example for a blinking LED does not really work yet – the LED blinks so fast the human eye can’t see it. The LED will just appear slightly dim. To achieve a proper blinking frequency, the code needs to be slowed down. The easiest way for that is to have the processor execute a large number of “dummy” instructions between setting the pin high and low. Simply placing many “nop” instructions isn’t possible though, as there is simply not enough program memory to store all of them. The solution is a loop that executes the same instructions a specific number of times (as opposed to the endless loops from the examples above). To do that, the processor has to count the number of loop iterations. It is actually easier to count ''down'' than up, so start by loading the desired number of iterations into a register and begin the loop by subtracting “1”:

<source lang=asm>
ldr r2, =1000000
subs r2, #1
</source>

Now, the processor should make a decision: If the register has reached zero, terminate the loop; else, continue by again subtracting “1”. The ARM math instructions can automatically perform some tests on the result to check whether it is positive/negative or zero and whether an overflow occurred. To enable those checks, append an “s” to the instruction name – hence, “subs” instead of “sub”. The result of these checks is automatically stored in the “Application Program Status Register” (APSR) – the contained bits N, Z, C, V indicate whether the result was negative, zero, set the carry bit or caused an overflow. This register is usually not accessed directly. Instead, use the conditional variant of the “b” instruction, where two letters are appended to indicate the desired condition. The jump is only performed if the condition is met; otherwise, the instruction does nothing. The available condition codes are described in the chapter “Condition Codes” of this tutorial. The conditions are formulated in terms of the mentioned bits of the APSR. For example, the “bne” instruction only performs a jump if the zero (Z) flag is ''not'' set, i.e. when the result of the last math instruction (with an “s” appended) was ''not'' zero. The “beq” instruction is the opposite of that – it only performs a jump if the result ''was'' zero.
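
To make the behaviour of the zero flag more concrete, here is a small sketch that counts from 2 down to 0 without a loop. The label name “WasZero” and the start value are chosen freely for illustration only:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

ldr r2, =2
subs r2, #1 @ r2 = 1; the result is not zero, so Z is cleared
beq WasZero @ Not taken, because Z is clear
subs r2, #1 @ r2 = 0; the result is zero, so Z is set
beq WasZero @ Taken, because Z is set
b . @ Never reached

WasZero:
b . @ Execution ends up here
</source>

Replacing both “beq” instructions by “bne” would reverse the outcome: the first jump would then be taken right away.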

So, to perform the jump back to the beginning of the loop, add a label before the “subs” instruction, and put a “bne” instruction after the “subs” that jumps to this label if the counter has not reached zero yet:
<source lang=asm>
ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop
</source>

The actual loop consists only of the two instructions “subs” and “bne”. By placing two of those loops (with two different labels!) in between the blocks that turn the pins on and off, the blink frequency is lowered sufficiently such that it becomes visible:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

BlinkLoop:
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r2, =1000000
delay2:
subs r2, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay”

You might notice that the registers r0-r2 are loaded with the same values over and over again. To make the code both shorter and faster, take advantage of the available processor registers, and load the values that don’t change ''before'' the loop. Then, just use them inside the loop:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay2”
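
The two delay loops need distinct label names. As a possible alternative, the GNU assembler also accepts numeric local labels, which may be defined more than once: “1:” defines such a label, and “1b” refers to the nearest definition of “1” when searching backwards (“1f” would search forwards). The following sketch shows only the loop part of “BlinkDelay2” rewritten that way and assumes that the registers r0-r3 were loaded as before:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
1:
subs r4, #1
bne 1b @ Iterate delay loop, jumps to the nearest “1:” above

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
1:
subs r4, #1
bne 1b @ Iterate delay loop, now refers to the second “1:”

b BlinkLoop
</source>

Whether this is more readable than descriptive names such as “delay1” is a matter of taste; the generated machine code should be the same.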

=== Using RAM ===
Until now, all data in the example codes was stored in periphery or processor registers. In all but the most simple programs, larger amounts of data have to be processed for which the thirteen general-purpose processor registers aren’t enough. For this, the microcontroller features a block of SRAM that stores 20 KiB of data. Accessing data in RAM works similarly to accessing periphery registers – load the address into a processor register and use “ldr” and “str” to read and write the data. After reset, the RAM contains just random ones and zeroes, so before the first read access, some value has to be stored.

As the programmer decides what data to place where, they have to keep track which address in memory contains what piece of data. You can use the assembler to help keeping track by declaring what kind of memory blocks you need and giving them names. To do this, you must first tell the assembler that the next directives refer to data instead of instructions with the “.data” directive. Then, use the “.space” directive for each block of memory you need. To assign names to the blocks, place a label definition (using a colon) right ''before'' that. After the definitions, put a “.text” directive to make sure the instructions after that will properly go to program memory (flash):

<source lang=asm>
.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text
@ Instructions go here...
</source>

Here, a data block of 4 bytes is reserved and named “var1”. Another block of 1 byte is named “var2”. Note that just inserting these lines will not modify the assembler output – these are just instructions to the assembler itself. To access these memory blocks, you can use “var1” and “var2” just like literal addresses. Load them into registers and use these with “ldr” and “str” like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text

ldr r0, =var1 @ Get address of var1
ldr r1, =0x12345678
str r1, [r0] @ Store 0x12345678 into memory block “var1”

ldr r1, [r0] @ Read memory block “var1”
and r1, #0xFF @ Set bits 8..31 to zero
ldr r0, =var2 @ Get address of var2
strb r1, [r0] @ Store a single byte into var2

b .
</source>
Example name: “RAMVariables”

Note the use of “strb” - it works similar to “str”, but only stores a single byte. Since the processor register r1 is of course 32bit in size, only the lower 8 bits are stored, and the rest is ignored.
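
Reading the byte back works analogously. The following lines are a sketch that could be placed before the final “b .” of the “RAMVariables” example. They use “ldrb”, the counterpart of “strb”, which loads a single byte and fills bits 8..31 of the destination register with zeroes; the choice of r2 is arbitrary:

<source lang=asm>
ldr r0, =var2 @ Get address of var2
ldrb r2, [r0] @ Load the single byte stored in var2; r2 now contains 0x00000078
</source>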

There is still something missing – nowhere in the code is there any address of the RAM. To tell the linker where the RAM is located, pass the option <code>-Tdata=0x20000000</code> to the <code>arm-none-eabi-ld</code> call; this is the address of the first byte of RAM. This program can't be flashed directly with OpenOCD, as OpenOCD doesn't recognize the RAM as such; GDB has to be used as explained above. When a linker script is used as described in the next chapters (using the NOLOAD attribute), OpenOCD can again be used directly.

If you run this program via GDB, you can use the commands <code>x/1xw &var1</code> and <code>x/1xb &var2</code> to read the data stored in memory. After this quick introduction, a more systematic overview is in order.

== Memory Management ==
If there is one thing that sets higher and lower level programming languages apart, it’s probably memory management. Assembly programmers have to think about memory, addresses, layout of program and data structures all the time. Assembler and linker provide some help which needs to be used effectively. Therefore, this chapter will explain some more fundamentals of the ARM architecture and how the toolchain works.

=== Address space ===
In the examples so far, addresses were used for periphery register accesses and jump instructions without really explaining what they mean, so it’s time to catch up with that. To access periphery registers and memory locations in any memory type (RAM, Flash, EEPROM…), an address is required, which identifies the desired location. On most platforms, addresses are simply unsigned integers. The set of all possible addresses that can be accessed in a uniform way is called an “address space”. Some platforms such as AVR have multiple address spaces (for Flash, EEPROM, and RAM+periphery) where each memory needs to be accessed in a distinct way and the programmer needs to know which address space an address belongs to – e.g. all three memory types have a memory location with address 123.

However, the ARM architecture uses only a single large address space where addresses are 32bit unsigned integers in the range of 0-4294967295. Each address refers to one byte of 8 bits. The address space is divided into several smaller ranges, each of which refers to a specific type of memory. For the STM32F103, this is documented in the datasheet in chapter 4. All addresses in all memory types are accessed in the same way – directly via the “ldr” and “str” instructions, or by executing code from a certain location, which can be achieved by jumping to the respective address with the “b” instruction. This also makes it possible to execute from RAM – simply perform a jump to an address that refers to some code located in RAM. Note that there are large gaps between the individual ranges in address space; attempting to access those usually leads to a crash.
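
As a small illustration of this uniform access, the following sketch copies one word from flash memory to SRAM with the same instructions that were used for periphery registers. It assumes the memory layout of the examples so far, i.e. flash memory starting at 0x08000000 with the vector table at its beginning, and SRAM starting at 0x20000000:

<source lang=asm>
ldr r0, =0x08000000 @ Address of the first word in flash memory
ldr r1, [r0] @ Read it; in the examples so far this is 0x20000400
ldr r0, =0x20000000 @ Address of the first word in SRAM
str r1, [r0] @ Store the value into SRAM
</source>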

While the addresses of periphery are fixed and defined by the manufacturer, the layout of program code and data in memory can be set by the programmer rather freely. Up until now, the example programs defined the flash memory contents in a linear fashion by listing the instructions in the order they should appear in flash memory. However, when translating multiple assembly source files into one program, the order in which the contents from those files appear in the final program isn’t defined a priori. Also, even though in the last example the memory blocks for RAM were defined ''before'' the code, the code actually comes first in address space. What makes all this work is the Linker.

=== The Linker ===
Usually the last step in translating source code into a usable program, the linker is an often overlooked, sometimes misunderstood but important and useful tool, if applied correctly. Many introductions into programming forego explaining its workings in detail, but as any trade, embedded development requires mastery of the tools! A good understanding of the linker can save time solving strange errors and allow you to implement some less common use cases, such as using multiple RAM blocks present in some microcontrollers, executing code from RAM or defining complex memory layouts as sometimes required by RTOSes.

[[File:ArmAsmTutorial_Linking.svg|300px|thumb|right|Translation of native applications using assembler, compiler and linker]]You have already used a linker – the command <code>arm-none-eabi-ld</code> calls the GNU linker that is shipped with the GNU toolchain. Until now, only one assembly source file was translated for each program. To translate a larger program that consists of three assembly files “file1.S”, “file2.S” and “file3.S”, the assembler would be called three times to produce three object code files “file1.o”, “file2.o” and “file3.o”. The linker would then be called to combine all three into a single output file.

When translating any of these assembly files, the assembler does not know of the existence of the other files. Therefore, it can’t know whether the contents of any other file will end up in flash memory before the currently processed file, and also can’t know the final location in flash memory of the machine code it is emitting and placing in the object file (ending .o). This means that the object file does not contain any absolute addresses (except for those of periphery registers, as these were specified explicitly). For example, when loading the address of the RAM data blocks (“ldr r0, =var1”) the assembler doesn’t know the address, only the linker does. Therefore, the assembler puts a placeholder in the object file that will be overwritten by the linker. A jump (“b” instruction) to a label defined in another assembly file works similarly; the assembler uses a placeholder for the address. For the jump instructions we used inside the same file (e.g. “b BlinkLoop”), a placeholder is not necessary, as the assembler can calculate the distance between the label and the instruction and generate the relative jump itself. However, if the target resides within a different section (see below), this isn’t possible, and a placeholder becomes necessary. As the contents of object files have no fixed address and can be moved around by the linker, these files are called relocatable.

On Unix systems (including Linux), the Executable and Linkable Format (ELF) is used for both object files and executable program files. This format is also used by ARM, and the GNU ARM toolchain. Because it was originally intended to be used with operating systems, some of its concepts don’t perfectly map to the embedded use case. The object (.o) files created by the assembler and compiler, and also the final program created by the linker (usually no ending, but in embedded contexts and also in above example commands, .elf is used) are all in ELF format. The specification of ELF for ARM can be found [https://developer.arm.com/architectures/system-architectures/software-standards/abi here], and the generic specification for ELF on which the ARM ELF variant is based can be found [http://www.sco.com/developers/gabi/2003-12-17/contents.html here].

ELF files are structured into sections. Each section may contain code, data, debug information (used by GDB) and other things. In an object file, the sections have no fixed address. In the final program file, they have one. Sections also have various attributes that indicate whether their contents are executable code or data, whether they are read-only and whether memory should be allocated for them. The linker combines and reorders the sections from the object files (“input sections”) and places them into sections in the final program file (“output sections”) while assigning them absolute addresses.

Another important aspect are symbols. A symbol defines a name for an address. The address of a symbol may be defined as an absolute number (e.g. 0x08000130) or as an offset relative to the beginning of a section (e.g. “start address of section .text plus 0x130”). Labels defined in assembly source code define symbols in the resulting object file. For example, the “var1” label defined in the last example results in a symbol “var1” in the “prog1.o” file whose address is set to be equal to the beginning of “.data”. The symbol “var2” is defined similarly, but with an offset of 4. After the linking process, the “prog1.elf” file contains a “.data” section with absolute address 0x20000000, and so the “var1” and “var2” symbols get absolute addresses as well.
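
Symbols with an absolute value can also be created directly in assembly source code. As a sketch, using the GNU assembler’s “.equ” directive and the freely chosen symbol name “GPIOA_BSRR”, the periphery register address from the blink examples could be given a name like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.equ GPIOA_BSRR, 0x40010810 @ Define a symbol with an absolute value, not relative to any section

ldr r0, =GPIOA_BSRR @ Same effect as “ldr r0, =0x40010810”
ldr r1, =0x100
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

Unlike “var1”, such a symbol keeps its value during linking, because there is no section start address that would have to be added.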

As mentioned, the assembler puts placeholders in the object files when it doesn’t know the address of something. In ELF files, these placeholders are called “relocation entries” and they reference symbols by name. When the linker sees such a relocation entry in one of its input files, it searches for a symbol in the input files with a matching name and fills in its address. If no symbol with that name was found, it emits this dreaded error:

<pre>(.text+0x132): undefined reference to `Foo'</pre>

Google finds almost a million results for that message, but knowing how the linker operates makes it easy to understand and solve – since the symbol was not found in any object file, make sure it is spelled correctly and that the object file that contains it is actually fed to the linker.

=== Linker Scripts ===
A linker script is a text file written in a linker-specific language that controls how the linker maps input sections to output sections. The example project hasn’t explicitly specified one yet, which lets the linker use a built-in default one. This has worked so far, but results in a slightly mixed up program file (unsuitable symbols) and has some other disadvantages. Therefore, it’s time to do things properly and write a linker script. Linker scripts aren’t usually created on a per-project basis, but usually provided by the microcontroller manufacturer to fit a certain controller’s memory layout. To learn how they work, a quick introduction into writing one will follow. The full documentation can be found [https://sourceware.org/binutils/docs/ld/Scripts.html here].

It’s customary to name a linker script after the controller it is intended for, so create a text file “stm32f103rb.ld” or “stm32f103c8.ld” with the following contents:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
*(.data)
} >SRAM
}
</source>
Example name: “LinkerScriptSimple”

This is the minimum viable linker script for a microcontroller. If you are using an STM32F103C8, replace the 128K by 64K. The lines inside the “MEMORY” block define the available memory regions on your microcontroller by specifying their start address and size within the address space. The names “FLASH” and “SRAM” can be chosen arbitrarily, as they have no special meaning. This memory definition has no meaning outside of the linker script, as it is just an internal helper for writing the script; it can even be left out and replaced by some manual address calculations.

The interesting part happens inside the “SECTIONS” command. Each sub-entry defines an output section that will end up in the final program file. These can be named arbitrarily, but the names “.text” and “.data” for executable code and data storage respectively are usually used. The asterisk expressions “*(.text)” and “*(.data)” tell the linker to put the contents of the input sections “.text” and “.data” at that place in the output section. In this case, the names for the input sections and output sections are identical. The input section names “.data”, “.text” (and some more) are used by the assembler and C and C++ compilers by default, so even though they can be changed, it’s best to keep them. You can however name the output sections arbitrarily, for example:

<source>
SECTIONS {
.FlashText : {
*(.text)
} >FLASH

.RamData (NOLOAD) : {
*(.data)
} >SRAM
}
</source>

The commands “>FLASH” and “>SRAM” tell the linker to calculate the address of the output sections according to the respective memory declaration above: The first output section with a “>FLASH” command will end up at address 0x8000000, the next with “>FLASH” right after that section and so on. The “>SRAM” works the same way with the start address “0x20000000”. The “NOLOAD” attribute does not change the linker’s behavior, but marks the corresponding output section as “not-loadable”, such that OpenOCD and GDB will not attempt to write it into RAM – the program has to take care of initializing any RAM data anyways when running stand-alone.

To specify the filename of the linker script, use the “-T” option:
<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -T stm32f103rb.ld
</source>
The <code>-Tdata</code> and <code>-Ttext</code> options aren’t needed anymore, as the addresses are now defined in the linker script.

Since the linker script defines the sizes of the memory regions, the linker can now warn you when your program consumes too much memory (either flash or RAM):
<pre>arm-none-eabi-ld: prog1.elf section `.text' will not fit in region `FLASH'
arm-none-eabi-ld: region `FLASH' overflowed by 69244 bytes</pre>

==== Reserving memory blocks ====
Using the processor’s stack will be explained later, but you can already use the linker script to assign a memory block for it. It’s best to allocate memory for the stack at the ''beginning'' of SRAM, so put this before the “*(.data)” command:

<source>. = . + 0x400;</source>

Inside a linker script, the dot “.” refers to the current address in the output file; therefore, this command increments the address by 0x400, leaving an “empty” block of that size. The “.data” input section will be located after that, at address 0x20000400.

==== Defining symbols in linker scripts ====

As mentioned before, the controller requires a certain data structure called the “vector table” to reside at the very beginning of flash memory. It is defined in the assembler source file:

<source lang=asm>
.word 0x20000400
.word 0x080000ed
.space 0xe4
</source>

The “.word” directive tells the assembler to output the given 32bit-number. Just like processor instructions, these numbers are put into the current section (.text by default, .data if specified) and therefore end up in flash memory. The first 32bit-number, which occupies the first 4 bytes in flash memory, is the initial value of the stack pointer which will be explained later. This number should be equal to the address of the first byte ''after'' the memory block that was reserved for the stack. The reserved block starts at address 0x20000000 and has size 0x400, so the correct number is 0x20000400. However, if the size of the reserved block was modified in the linker script, the above assembly line needs to be adjusted as well. To avoid any inconsistencies, and to be able to manage everything related to the memory-layout centrally in the linker script, it is desirable to replace the number in the assembly source file with a symbol expression. To do this, define a symbol in the linker script:

<source>
.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
</source>
Example name: “LinkerScriptSymbols”

This will define a symbol “_StackEnd” to have the value of “.”, which is the current address, which at this point is 0x20000400. In the assembly source file, you can now replace the number with the symbol:

<source lang=asm>
.word _StackEnd
</source>

The assembler will put a placeholder in the object file, which the linker will overwrite with the value of 0x20000400. This modification will not change the output file, but avoids putting absolute addresses in source files. The name “_StackEnd” was chosen arbitrarily; since names that start with an underscore and a capital letter may not be used in C and C++ programs, there is no possibility of conflict if any C/C++ source is added later. Typically, all symbols that are part of the runtime environment and should be “invisible” to C/C++ code are named this way. The same rule applies to names starting with two underscores.

The second entry of the vector table is the address of the very first instruction to be executed after reset. Currently the address is hard-coded as the first address after the vector table. If you wanted to insert some other code before this first instruction, this number would have to be changed. This is obviously impractical, and therefore the number should be replaced by a label as well. Since the code executed at reset is commonly known as the “reset handler”, define it like that:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word _StackEnd
.word Reset_Handler
.space 0xe4

.type Reset_Handler, %function
Reset_Handler:

@ Put code here
</source>

The “.type” directive tells the assembler that the label refers to executable code. The exact meaning of this will be covered later. Leave the “.space” directive alone for now.

==== Absolute section placement ====
The vector table needs to be at the beginning of flash memory, and the examples have relied on the assembler putting the first things from the source file into flash memory first. This stops working if you use multiple source files. You can use the linker script to make sure the vector table is always at the beginning of flash memory. To do that, you first have to separate the vector table from the rest of the code so that the linker can handle it specially. This is done by placing the vector table in its own section:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:
</source>
Example name: “LinkerScriptAbsolutePlacement”

The “.section” directive instructs the assembler to put the following data into the custom section “.VectorTable”. The “a” flag marks this section as allocatable, which is required to have the linker allocate memory for it. To place the vector table at the beginning of flash memory, define a new output section in the linker script:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

This puts the .VectorTable input section into the equally-named output section. It is also possible to put it into .text alongside the code:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.VectorTable)
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

Even though both variants produce the same flash image, the first one is slightly nicer to work with in GDB. The modified LED-blinker application now looks like:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>

=== Program Structure ===
Because the vector table is usually the same for all projects, it is handy to move it into a separate file, for example called “vectortable.S”:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Assemble both source files with one assembler command each, then link the resulting object files:
<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
arm-none-eabi-as -g vectortable.S -o vectortable.o
arm-none-eabi-ld prog1.o vectortable.o -o prog1.elf -T stm32f103rb.ld
</source>

This will result in the dreaded “undefined reference” error. To fix this, use the “.global” directive in the main source file “prog1.S”:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
@ Code here ...
</source>

This will tell the assembler to make the symbol “Reset_Handler” visible globally, such that it can be used from other files. By default, the assembler creates a ''local'' symbol for each label, which can’t be used from other source files (same as ''static'' in C). The symbol is still there in the final program file, though - it can be used for debugging purposes.
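
The same mechanism works for data. The following sketch assumes a hypothetical second source file “data.S” and a made-up symbol name “Counter”; neither is part of the example projects. Since the GNU assembler treats symbols that are not defined in the current file as external, no special declaration is needed in the file that uses the symbol:

<source lang=asm>
@ File “data.S” (hypothetical)
.data
.global Counter
Counter:
.space 4 @ Reserve 4 bytes and make the symbol “Counter” visible to other files

@ Excerpt from file “prog1.S”
.text
ldr r0, =Counter @ The assembler emits a placeholder, the linker fills in the address
ldr r1, =0
str r1, [r0] @ Store 0 into memory block “Counter”
</source>

The object file generated from “data.S” then has to be passed to the linker as well; otherwise, the “undefined reference” error appears again.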

== More assembly techniques ==
After having set up the project for using the linker properly, some more aspects of assembly programming will be introduced.

=== Instruction set state ===
As mentioned before, ARM application processors support both the T32 and A32/A64 “ARM” instruction sets, and are capable of dynamically switching between them. This can be used to encode time-critical program parts in the faster A32/A64 instruction set, and less critical parts in the T32 “thumb” instruction set to save memory. Actually, reducing program size may improve performance too, because the cache memories may become more effective.

Even though the Cortex-M microcontrollers based on the ARMv7-M architecture do not support the A32/A64 instruction sets, some of the switching-logic is still there, requiring the program code to work accordingly. The switch between the instruction sets happens when jumping with the “bx” “Branch and Exchange” and “blx” “Branch with Link and Exchange” instructions. Since all instructions are of size 2 or 4, and code may only be stored at even addresses, the lowest bit of the address of any instruction is always zero. When performing a jump with “bx” or “blx”, the lowest bit of the target address is used to indicate the instruction set of the jump target: If the bit is “1”, the processor expects the code to be T32, else A32.

Another specialty of the “bx” and “blx” instructions is that they take the jump target address from a register instead of encoding it in the instruction directly. This is called an indirect jump. An example of such a jump is:

<source lang=asm>
ldr r0, =SomeLabel
bx r0
</source>

Such indirect jumps are necessary if the difference of the jump target address and the jump instruction is too large to be encoded in the instruction itself for a relative jump. Also, sometimes you want to jump to an address that has been passed from another part of the program, which e.g. happens in C/C++ code when using function pointers or virtual functions.

In these cases, you need to make sure that the address passed to “bx/blx” via a register has its lowest bit set, to indicate that the target code is T32. Otherwise, the code will crash. This can be achieved by telling the assembler that the target label refers to code (and not data) via the already mentioned “.type” directive:

<source lang=asm>
.type SomeLabel, %function
SomeLabel:
@ Some code...
</source>

That way, when you refer to the label to load its address into a register, the lowest bit will be set. Actually, using “.type” for all code labels is a good idea, even though it does not matter if you only refer to a label via the “b” instruction (including the conditional variant) which does not encode the lowest bit and does not attempt to perform an instruction set switch.

As was already shown, there is another case where the lowest bit matters: when specifying the address of the reset handler (and later, exception handler functions) in the vector table, the bit must be set, so the “.type” directive is necessary here too:

<source lang=asm>
.type Reset_Handler, %function
</source>

If you were writing code for a Cortex-A processor, you would use “.arm” instead of “.thumb” to have your code (or performance critical parts of it) encoded as A32. The “.type” directive would be used as well, and the assembler would clear the lowest bit in the address to ensure the code is executed as A32. For example:

<source lang=asm>
.cpu cortex-a8
.syntax unified

@ Small but slower code here
.thumb

.type Block1, %function
Block1:
ldr r0, =Block2
bx r0

@ Larger but faster code here
.arm

.type Block2, %function
Block2:
@ ...
</source>

The directive “.code 32” has the same meaning as “.arm”, and “.code 16” the same as “.thumb” (although the name is slightly misleading, as T32 instructions can be 32 bit as well). There is also “.type Label, %object” to declare some label refers to data in flash or RAM; this is optional, but helps in working with analysis tools (see below).
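
As a minimal sketch of the “%object” variant, the following fragment marks a small table of constants as data. The label “DelayTable” and its contents are made up for illustration only; the “.size” line is optional and merely records the size of the object in the symbol table:

<source lang=asm>
.balign 4 @ Make sure the following words start at a multiple of 4
.type DelayTable, %object
DelayTable: @ Hypothetical table of three 32bit constants
.word 100000
.word 500000
.word 1000000
.size DelayTable, . - DelayTable @ Size in bytes (here: 12)
</source>

Because the label is declared as an object and not as a function, the assembler does not set the lowest bit when the address of “DelayTable” is loaded into a register.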

=== Constants ===
The previous examples contain a lot of numbers (esp. addresses), the meaning of which is not obvious to the reader - so-called “magic numbers”. As code is typically read far more often than it is written or modified, readability is important, even for assembly code. Therefore, it is common practice to define constants that assign names to numbers such as addresses, and use the names instead of the numbers directly.

The assembler actually does not provide any dedicated mechanism for defining constants. Instead, symbols as introduced before are used. You can define a symbol in any of the following ways:
<source lang=asm>
RCC_APB2ENR = 0x40021018
.set GPIOA_CRH, 0x40010804
.equ GPIOA_ODR, 0x4001080C
</source>

and then use it in place of the number:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

Replacing (almost) all numbers in the source code for the LED blinker by constants yields a source code like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_10MHz = 1
GPIOx_CRx_GP_PP_2MHz = 2
GPIOx_CRx_GP_PP_50MHz = 3

GPIOx_CRx_GP_OD_10MHz = 1|4
GPIOx_CRx_GP_OD_2MHz = 2|4
GPIOx_CRx_GP_OD_50MHz = 3|4

GPIOx_CRx_AF_PP_10MHz = 1|8
GPIOx_CRx_AF_PP_2MHz = 2|8
GPIOx_CRx_AF_PP_50MHz = 3|8

GPIOx_CRx_AF_OD_10MHz = 1|4|8
GPIOx_CRx_AF_OD_2MHz = 2|4|8
GPIOx_CRx_AF_OD_50MHz = 3|4|8

GPIOx_CRx_IN_ANLG = 0
GPIOx_CRx_IN_FLOAT = 4
GPIOx_CRx_IN_PULL = 8

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkConstants”

This is much more readable than before. In fact, you could even leave out the comments, as the code becomes more self-documenting. The addresses of peripheral registers are defined individually, but the bits for the GPIO registers are the same for each GPIO module, so the names include an “x” to denote that they apply to all GPIO modules.

The “CRL”/“CRH” registers get a special treatment. Since the individual bits have little direct meaning, it would be pointless to name them. Instead, 15 symbols are defined to denote the 15 possible modes of operation per pin (combinations of input/output, open-drain vs. push-pull, analog vs. digital, floating vs. pull-resistors, and output driver slew rate). Each of the 15 symbols has a 4 bit value that needs to be written into the appropriate 4 bits of the register. To configure e.g. PA10 as General Purpose Open-Drain with 10 MHz slew rate:

<source lang=asm>
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffff0ff
orr r0, #(GPIOx_CRx_GP_OD_10MHz<<8)
str r0, [r1]
</source>

C-like arithmetic operators can be used in constant expressions, like + - * / and bitwise operators like | (or), & (and), << (left shift) and >> (right shift). Note that these calculations are always done by the assembler. In the example, or | is used to combine bit values.
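
The same operators can also be used to derive shift amounts and masks instead of writing them out by hand. The following is only a sketch: the symbols “GPIOx_CRx_FieldWidth”, “PA10_CRH_Shift” and “PA10_CRH_Mask” are made-up names, not part of the examples above. It configures PA10 in the same way as the previous snippet, but uses “bic” (bit clear) with the mask instead of “and” with the inverted mask:

<source lang=asm>
GPIOA_CRH = 0x40010804
GPIOx_CRx_GP_OD_10MHz = 1|4

GPIOx_CRx_FieldWidth = 4 @ Each pin occupies 4 bits in CRL/CRH
PA10_CRH_Shift = (10 - 8) * GPIOx_CRx_FieldWidth @ = 8, as CRH starts at pin 8
PA10_CRH_Mask = 0xf << PA10_CRH_Shift @ = 0xf00

ldr r1, =GPIOA_CRH
ldr r0, [r1]
bic r0, r0, #PA10_CRH_Mask @ Clear the 4 bits belonging to PA10
orr r0, r0, #(GPIOx_CRx_GP_OD_10MHz << PA10_CRH_Shift)
str r0, [r1]
</source>

All expressions are evaluated by the assembler, so the resulting machine code only contains the final numbers.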

Since these constants are actually symbols, they can collide with assembler labels, so you must not define a symbol with the same name as any label.

A different kind of constants are register aliases. Using the “.req” directive, you can define a name for a processor register:

<source lang=asm>
MyData .req r7
ldr MyData, =123
add MyData, #3
</source>

This can be useful for large assembly blocks where the meaning of register data is not obvious. It also allows you to re-assign registers without having to modify many lines of code.
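
A small sketch of how such an alias might be used for a delay loop is shown below. The names “Counter” and “CountDown” are chosen freely for this illustration. The GNU assembler's “.unreq” directive removes the alias again, so that the name can be assigned to a different register later:

<source lang=asm>
Counter .req r4 @ From here on, "Counter" means r4

ldr Counter, =1000000
CountDown:
subs Counter, #1
bne CountDown @ Iterate until Counter reaches zero

.unreq Counter @ The alias is no longer defined after this line
</source>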
=== The Stack ===
In computer science, a stack is a dynamic data structure where data can be added and removed flexibly. Like a stack of books, the last element that was put on top must be taken and removed first (LIFO-structure - Last In, First Out). Adding an element is usually called “push”, and reading & removing “pop”.

Many processor architectures including ARM feature circuitry to deal with such a structure efficiently. Like most others, ARM does not provide a dedicated memory area for this - it just facilitates using an area that the programmer reserved for this purpose as a stack. Therefore, a part of the SRAM needs to be reserved for the stack.

On ARM, the program stores processor registers on the stack, i.e. 32bit per element. The stack is commonly used when the contents of some register will be needed again later after it has been overwritten by some complex operation that needs many registers. These accesses always come in pairs:

* Some operation that writes to r0
* '''Push''' (save) r0 to the stack
* Some operation that overwrites r0
* '''Pop''' (restore) r0 from the stack
* Use the value in r0 which is the same as initially assigned

ARM’s instructions for accessing the stack are unsurprisingly called “push” and “pop”. They can save/restore any of the registers r0-r12 and r14, for example:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ … Some code that overwrites r0 …

pop { r0 } @ Restore value 1000000
@ Continue using r0 ...
</source>

It is also possible to save/restore multiple registers in one go:

<source lang=asm>
ldr r0, =1000000
ldr r1, =1234567
@ Use r0 and r1 ...
push { r0, r1 } @ Save values 1000000 and 1234567

@ … Some code that overwrites r0 and r1 …

pop { r0, r2 } @ Restore 1000000 into r0 and 1234567 into r2
@ Continue using r0 and r2...
</source>

It does not matter to which register the data is read back - in the previous example, the value that was held in r1 is restored into r2. In larger applications, many store-restore pairs will be nested:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ Inner Code Block:

ldr r0, =123
@ Use r0 …

push { r0 } @ Save value 123

@ Inner-Inner Code Block that overwrites r0

pop { r0 } @ Restore value 123
@ Continue using r0 ...

pop { r0 } @ Restore value 1000000 into r0

@ Continue using r0 …
</source>

The “inner” push-pop pair works with value 123, and the “outer” push-pop pair works with value 1000000. Assuming that the stack was empty at the beginning, it will contain 1000000 after the first “push”, and both 1000000 and 123 after the second push. After the first “pop” it contains only 1000000 again, and is empty after the second “pop”.

At the beginning of a push-pop pair, the current contents of the stack are irrelevant - it may be empty or contain many elements. After the “pop”, the stack will be restored to its previous state. This makes it possible to (almost) arbitrarily nest push-pop-pairs - after any inner push-pop-pair has completed, the stack is in the same state as before entering the inner pair, so the “pop” part of the outer pair doesn’t even notice the stack was manipulated in between. This is why it is important to make sure that each “push” has a matching “pop”, and vice-versa.

As mentioned, an area of memory has to be reserved for the stack. Access to the stack memory is managed via the stack pointer (SP). The stack pointer resides in the processor register r13, and “sp” is an alias for that. As the name implies, the stack pointer contains a 32bit memory address - specifically, the address of the first byte in the stack that contains any saved data.

When storing a 32bit register value using “push”, the stack pointer is '''first''' decremented by 4 before the value is written at the newly calculated address. To restore a value, the value is read from the memory address currently stored in the stack pointer, after which the stack pointer is incremented by 4. This is called a “full-descending” stack (see the ARM Architecture Reference Manual, chapter B1.5.6). On ARMv7-A (Cortex-A), this behaviour can be changed, but on ARMv7-M, it is dictated by the exception handling logic, which will be explained later.
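
To illustrate the two steps, a single-register “push” and “pop” can be written out with ordinary store and load instructions that modify the stack pointer explicitly. This is only meant as an illustration of the mechanism; in real code, “push” and “pop” are preferable:

<source lang=asm>
ldr r0, =1000000

str r0, [sp, #-4]! @ Like "push { r0 }": decrement SP by 4, then store r0 at the new address

@ ... Some code that overwrites r0 ...

ldr r0, [sp], #4 @ Like "pop { r0 }": load r0 from the address in SP, then increment SP by 4
</source>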

An implication of this is that if the stack is empty, the stack pointer contains the address of the first byte '''after''' the stack memory area. If the stack is completely full, it contains the address of the very first byte '''inside''' the stack memory area. This means that the stack grows '''downward'''. Since the stack is empty at program start, the stack pointer therefore needs to be initialized to the first address after the memory area. Before executing the first instruction, the processor loads the first 4 bytes from the flash into the stack pointer. This is why “_StackEnd” was defined and used to place the address of the first byte after the stack memory region into the first 4 bytes of flash.

The stack pointer must always be a multiple of 4 (see chapter B5.1.3 in the ARM Architecture Reference Manual). It is a common error (which is even present in the example projects by ST!) to initialize the stack pointer to the last address ''inside'' the stack memory area (e.g. 0x200003FF instead of 0x20000400), which is not divisible by four. This can cause the application to crash or “just” slow it down. Actually, the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4127.html ARM ABI requires] the stack pointer to be a multiple of 8 for public software interfaces, which is important for e.g. the “printf” C function. So, when calling any external code, make sure the stack pointer is a multiple of 8.

In the previous examples, the stack memory area was defined with a size of 0x400, i.e. 1KiB. Choosing an appropriate stack size is critical for an application; if it is too small, the application will crash, if it is too large, memory is wasted that could be used otherwise. Traditionally, the stack is configured to reside at the ''end'' of available memory, e.g. 0x20005000 for the STM32F103. As the linker starts allocating memory for data (using “.data” in assembly or global/static variables in C) at the beginning of the memory, the stack is as far away from that regular data as possible, minimizing the chance of a collision. However, if the stack grows continuously, the stack pointer might end up pointing into the regular data area (“.data” or C globals) or heap memory (used by “malloc” in C). In that case, writing to the stack silently overwrites some of the regular data. This can result in all kinds of hard to find errors. Therefore, the example code puts the stack area at the ''beginning'' of RAM, and the regular data after that - if the stack grows too large, the stack pointer will reach values below 0x20000000, and any access will result in an immediate “clean” crash. It is then usually easy to find the code location that allocates too much stack memory, and possibly increase the stack size. Using the Cortex-M3’s memory protection unit (MPU) enables even more sophisticated strategies, but that is out of scope for this tutorial.
=== Function calls ===
Many programming languages feature a “function” concept. Also known as “procedures” or “subprograms”, functions are the most basic building blocks of larger applications, and applying them correctly is key for clean, reusable code. The assembler does not know about functions directly, so you have to build them yourself. A function is a block of code (i.e. a sequence of instructions) that you can jump to, which does some work, and then jumps back to the place from which the first jump originated. This ability to jump back is the main difference from any other block of assembly code. To make this explicit, such a jump to a function is known as a “call” (as in “calling a function”). The location in code that starts the jump to the function is known as the “caller”, and the called function as “callee”. From the perspective of the caller, calling a function resembles a “user-defined” instruction - it performs some operation after which the code of the caller continues as before. To make the jump back possible, the address of the ''next'' instruction after the one that started the function call needs to be saved, so that the function can jump back to that location (without calling the function directly again).

This is done via the Link Register (LR), which is the processor register r14. Function calls are performed with the “bl” instruction. This instruction performs a jump, much like the well-known “b”, but also saves the address of the next instruction in LR. When the function is finished, it returns to the caller by jumping to the address stored in LR. As already mentioned, jumping to a location from a register is called an indirect jump, which is performed by the “bx” instruction. So, to return from a function, use “bx lr”:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

bl EnableClockGPIOA @ Call function to enable GPIOA's peripheral clock

@ Some more code ...
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1]
b . @ Endless loop, so execution does not run into the function below

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

</source>

Here, the code to enable the clock for GPIOA was packaged into a function. To enable this clock, only a single line is now required - “bl EnableClockGPIOA”.

When calling a function, the “bl” instruction automatically makes sure to set the lowest bit in LR such that the subsequent “bx lr” will not crash because of an attempted instruction set switch, which is not possible on Cortex-M. If you need to call a function indirectly, use “blx” with a register, and remember to ensure that the lowest bit is set, typically via “.type YourFunction, %function”. Usually, all the code of an application resides within functions, with the possible exception of the Reset_Handler. The order in which functions are defined in the source files does not matter, as the linker will always automatically fill in the correct addresses. If you want to put functions in separate source files, remember to use “.global FunctionName” to make sure the symbol is visible to other files.
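
A minimal sketch of such an indirect call is shown below. It reuses the “EnableClockGPIOA” function from above; the choice of r2 to hold the address is arbitrary:

<source lang=asm>
RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r2, =EnableClockGPIOA @ Load the function's address; the lowest bit is set because of ".type"
blx r2 @ Indirect call: save the return address in LR and jump to the address in r2
b . @ Endless loop, so execution does not run into the function below

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1]
bx lr @ Return to caller
</source>

If the “.type” line for “EnableClockGPIOA” were missing, the address loaded into r2 would have its lowest bit cleared, and the “blx” would fail as described above.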
==== Using the stack for functions ====
In large applications it is common for functions to call other functions in a deeply nested fashion. However, a function implemented as shown can’t do that - using “bl” would overwrite the LR, and so the return address of the outer function would be lost, and that function couldn’t ever return. The solution is to use the stack: At the beginning of a function that calls other functions, use “push” to save the LR, and at the end use “pop” to restore it. For example, the blink program could be restructured like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r5, =5 @ Number of LED flashes.
bl Blink
b .

.type Blink, %function
Blink:
push { lr }
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

bl Delay

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

bl Delay

subs r5, #1
bne BlinkLoop

pop { lr }
bx lr

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

.type Delay, %function
Delay:
mov r4, r3
DelayLoop:
subs r4, #1
bne DelayLoop @ Iterate delay loop
bx lr

</source>
Example name: “BlinkFunctions”

The Reset_Handler just became much prettier. There now are functions for enabling the GPIOA clock, configuring PA8 as output, and one that delays execution so that the LED blinking is visible. The “Blink” function performs the blinking, but only for 5 flashes, after which it returns (an endless blink-loop wouldn’t be good for demonstrating returns). As you see, LR is saved on the stack to allow “Blink” to call further functions.

The two lines

<source lang=asm>
pop { lr }
bx lr
</source>

are actually longer than necessary. It is actually possible to directly load the return address from the stack into the program counter, PC:

<source lang=asm>
pop { pc }
</source>

This way, the return address that was saved on the stack is directly used for the jump back. Just the same way, you can use “push” and “pop” to save and restore any other registers while your function is running.
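
For example, a function could save r4 together with the return address and restore both with a single “pop”. The functions “Outer” and “Helper” in this sketch are hypothetical and only serve to show the pattern:

<source lang=asm>
.type Outer, %function
Outer:
push { r4, lr } @ Save r4 and the return address (8 bytes in total)
mov r4, r0 @ r4 can now be used freely inside "Outer"
bl Helper @ Overwrites LR, which is fine because it was saved
pop { r4, pc } @ Restore r4 and jump back to the caller in one instruction

.type Helper, %function
Helper:
bx lr @ Does nothing and returns immediately
</source>

Saving an even number of registers has the side effect that the stack pointer stays a multiple of 8, assuming it was one when “Outer” was entered.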
==== Calling Convention ====
Actually building a large program as shown in the last example is a bad idea. The “Delay” function requires the iteration count (1000000) to reside in r3 and overwrites r4. The “Blink” function relies on “Delay” not overwriting r0-r3 and r5, and requires the number of flashes to be given via r5. Such requirements can quickly grow into an intricate web of interdependencies that makes it impossible to write larger functions that call several sub-functions, or to restructure anything. Therefore, it is common to use a calling convention, which defines which registers a function may overwrite, which it should keep, how it should use the stack, and how to pass information back to the caller.

When building an entire application out of your own assembly code, you can invent your own calling convention. However, it is always a good idea to use existing standards: The AAPCS defines a calling convention for ARM. This convention is also followed by C and C++ compilers, so using it makes your code automatically compatible with those. The Cortex-M interrupt mechanism follows it too, which would make it awkward to adapt code that uses some other convention to interrupts. The specification of the calling convention is quite complex, so here is a quick summary of the basics:

* Functions may only modify the registers r0-r3 and r12. If more registers are needed, they have to be saved and restored using the stack. The APSR may be modified too.
* The LR is used as shown for the return address.
* When returning (via “bx lr”) the stack should be exactly in the same state as during the jump to the function (via “bl”).
* The registers r0-r3 may be used to pass additional information to a function, called parameters, and the function may overwrite them.
* The register r0 may be used to pass a result value back to the caller, which is called the return value.

This means that when you call a function, you must assume registers r0-r3 and r12 may be overwritten but the others keep their values. In other words, the registers r0-r3 and r12 are (if at all) saved ''outside'' the function (“caller-save”), and the registers r4-r11 are (if at all) saved ''inside'' the function (“callee-save”).

A function that does not call any other functions is called a “leaf-function” (as it is a leaf in the call tree). If such a function is simple, it might not need to touch the stack at all, as the return address is just kept in a register (LR) and it might only overwrite the registers r0-r3 and r12, which the caller can make sure to contain no important data. This makes small functions efficient, as register accesses are faster than memory accesses, such as to the stack.
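
As a small illustration, the following sketch shows a leaf function that takes three parameters and returns a result without touching the stack. The function “AddThree” is invented for this example and is not part of the blinker:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =10 @ First parameter
ldr r1, =20 @ Second parameter
ldr r2, =12 @ Third parameter
bl AddThree @ After the call, r0 contains the return value 42
b .

@ Parameters: r0, r1, r2 = Numbers to add
@ Return value: r0 = Sum of the three numbers
.type AddThree, %function
AddThree:
add r0, r0, r1
add r0, r0, r2
bx lr
</source>

The function only modifies r0, which the calling convention allows, so the caller does not need to save anything besides values it still needs from r0-r3 and r12.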

If all your functions follow the calling convention, you can call any function from anywhere and be sure about what it overwrites, even if it calls many other functions on its own. Restructuring the LED blinker could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =5
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r7, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
mov r7, r0 @ Number of LED flashes.

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r7, #1
bne BlinkLoop

pop { r4-r7, pc }

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>
Example name: “BlinkFunctionCallingConvention”

The three small functions at the end only use registers r0 and r1, which they are free to overwrite. The “Delay” function expects the number of iterations as a parameter in r0, which it then modifies. Therefore, the “Blink” function fills r0 before every call to “Delay”. Alternatively, “Delay” could use a fixed iteration count, i.e. the “ldr” could be moved into “Delay”. As the “Blink” function must assume that “Delay” overwrites r0-r3 and r12, it keeps its own data in r4-r7, which are guaranteed to be retained according to the calling convention. Since “Blink”, in turn, must preserve these registers for the function that called it, it uses “push” and “pop” to save and restore them. Note the shortened syntax “r4-r7” in the instructions. The number of LED flashes is passed in r0 as a parameter; as this register will be overwritten, this number is moved to r7.

Alternatively, “Blink” could re-load the constants each time they are used in r1/r2, such that only one register (r4) needs to be saved as it is needed to count the number of flashes:

<source lang=asm>
.type Blink, %function
Blink:
push { r4, lr }

mov r4, r0

BlinkLoop:
ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r4, #1
bne BlinkLoop

pop { r4, pc }
</source>
Example name: “BlinkFunctionCallingConvention2”

A third variant would not use any of the callee-save-registers (r4-r11) at all, and instead just save r0 before the function calls and restore it as needed:

<source lang=asm>
.type Blink, %function
Blink:
push { lr }

BlinkLoop:
push { r0 }

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

pop { r0 }
subs r0, #1
bne BlinkLoop

pop { pc }

</source>
Example name: “BlinkFunctionCallingConvention3”

The frequent stack accesses would however make this slower. Be sure to always document the meaning (and units, if applicable) of parameters e.g. via comments.
=== Conditional Execution ===
As mentioned, the conditional variants of the “b” instruction (e.g. “bne”) can be used to execute certain blocks of code only if a certain condition is met. First, more ways to formulate conditions will be shown. Next, the ARM instruction “it” will be introduced, which makes executing small blocks of code conditionally more efficient.
==== Conditions ====
All conditions for conditional execution depend on the outcome of some mathematical operation. When instructions such as “adds”, “subs”, “ands” are used, they update the flags in the APSR register depending on the outcome, which are then read by the conditional variants of “b” to decide whether to actually perform the jump.

Often it is necessary to compare two numbers without actually doing a calculation. This can be done with the “cmp” instruction to which you can pass two registers or a register and a literal:

<source lang=asm>
cmp r0, #42
cmp r0, r1
</source>

The “cmp” instruction is very similar to “subs” - it subtracts the second operand from the first, but doesn’t save the result anywhere, i.e. the registers keep their values. Just the flags in the APSR are updated according to the result, just as with “subs”. For example, if both operands were equal, the result of the subtraction is zero, and the zero flag will be set. So, to test whether two numbers are equal:

<source lang=asm>
cmp r0, #42
beq TheAnswer

@ This is executed if r0 is not 42

TheAnswer:
@ This is executed if r0 is 42
</source>

The “bne” instruction is the opposite of “beq”.

The “tst” instruction works similarly to “cmp”, but instead of subtracting, it performs a bitwise “and” operation - like the “ands” instruction, but without keeping the result. This way, you can test whether a bit in a register is set:

<source lang=asm>
tst r0, #4
beq BitNotSet

@ This is executed if bit 2 in r0 is set

BitNotSet:
@ This is executed if bit 2 in r0 is not set
</source>

Another useful application of “tst” is to pass the same register twice. Applying “and” to a value and itself yields the value unchanged, so “tst” in this case effectively checks the properties of the input (negative/positive, zero):

<source lang=asm>
tst r0, r0
beq ValueZero

@ This is executed if r0 is not zero

ValueZero:
@ This is executed if r0 is zero
</source>

There is also the “teq” instruction which performs an exclusive or operation.
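
Since the exclusive or of two values is zero exactly when both values are identical, “teq” can be used to test for equality in the same way as “cmp”. A short sketch with a made-up label “BothEqual”:

<source lang=asm>
teq r0, r1
beq BothEqual

@ This is executed if r0 and r1 differ in at least one bit

BothEqual:
@ This is executed if r0 and r1 are equal
</source>

Unlike “cmp”, no subtraction takes place, so the overflow flag keeps its previous value.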

As mentioned, the suffixes “eq” and “ne” are called condition codes. ARM has 14 of those which define how the flags in the APSR form the condition. The details about how a subtraction (by “subs” or “cmp”) sets the flags in the APSR and how their interpretation by the different condition codes correlates to the mathematical result are somewhat complicated, involving the way the 2’s complement format works and relying on the fact that subtracting works by adding a negated number. Instead of diving into all the details, a table with a more high-level view and a practical interpretation of the condition should be more helpful:

{| class="wikitable"
! style="text-align:left;" | Code
! style="text-align:left;" | Meaning
! style="text-align:left;" | Unsigned / Signed
! style="text-align:left;" | Flags
! style="text-align:left;" | Condition after “cmp/subs r0, r1”
! style="text-align:left;" | Condition after “tst r0, r0”
|-
| style="text-align:center;" | EQ
| Equal
| style="text-align:center;" | U+S
| Z==1
| r0 = r1
| r0 = 0
|-
| style="text-align:center;" | NE
| Not equal
| style="text-align:center;" | U+S
| Z==0
| r0 ≠ r1
| r0 ≠ 0
|-
| style="text-align:center;" | MI
| Negative
| style="text-align:center;" | S
| N==1
| ---
| r0 < 0
|-
| style="text-align:center;" | PL
| Positive or Zero
| style="text-align:center;" | S
| N==0
| ---
| r0 ≥ 0
|-
| style="text-align:center;" | VS
| Overflow
| style="text-align:center;" | S
| V==1
| r0-r1 out of range¹
| ---
|-
| style="text-align:center;" | VC
| No Overflow
| style="text-align:center;" | S
| V==0
| r0-r1 in range¹
| ---
|-
| style="text-align:center;" | HS
| Unsigned higher or same
| style="text-align:center;" | U
| C==1
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LO
| Unsigned lower
| style="text-align:center;" | U
| C==0
| r0 < r1
| ---
|-
| style="text-align:center;" | HI
| Unsigned higher
| style="text-align:center;" | U
| C==1 and Z==0
| r0 > r1
| ---
|-
| style="text-align:center;" | LS
| Unsigned lower or same
| style="text-align:center;" | U
| C==0 or Z==1
| r0 ≤ r1
| ---
|-
| style="text-align:center;" | GE
| Signed greater or equal
| style="text-align:center;" | S
| N==V
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LT
| Signed less than
| style="text-align:center;" | S
| N!=V
| r0 < r1
| ---
|-
| style="text-align:center;" | GT
| Signed greater than
| style="text-align:center;" | S
| Z==0 and N==V
| r0 > r1
| ---
|-
| style="text-align:center;" | LE
| Signed less or equal
| style="text-align:center;" | S
| Z==1 or N!=V
| r0 ≤ r1
| ---
|}
¹ Range meaning the numbers from -2^31 to 2^31-1, inclusive.

To determine which condition code you need, first think about whether the number is unsigned (range 0 to 2^32-1) or is using two's complement to represent signed numbers (range -2^31 to 2^31-1). Ignore all rows in the table with the wrong format.

If you want to compare two numbers, use the “cmp” instruction, and search for the desired condition within the “cmp”-condition column of the table. If you want to test a single number's properties, use the “tst”-column. Use the condition code from the first column with the conditional “b” instruction (“bne”, “beq”, “bmi”, “bpl”, “bhs”, ... ) right after the appropriate “cmp”/“tst” instruction.

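Note that every condition code has a corresponding inverse code with exactly the negated meaning (e.g. “eq”/“ne”, “hi”/“ls”, “ge”/“lt”). Most also have a swapped partner code: using the partner is equivalent to swapping the operands of “cmp”. For example, “hi” after “cmp r0, r1” tests the same condition as “lo” after “cmp r1, r0”.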
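
As a small sketch of why the choice between signed and unsigned condition codes matters, the same bit pattern can be compared both ways. The label names “SignedLess” and “UnsignedHigher” are made up for this illustration:

<source lang=asm>
ldr r0, =0xffffffff @ 4294967295 if read as unsigned, -1 if read as signed
ldr r1, =1

cmp r0, r1 @ Calculates r0-r1 and sets N=1, Z=0, C=1, V=0
blt SignedLess @ Signed view: N!=V, i.e. -1 < 1, so this branch is taken
@ Not reached in this example

SignedLess:
cmp r0, r1 @ Same comparison, same flags
bhi UnsignedHigher @ Unsigned view: C==1 and Z==0, i.e. 4294967295 > 1, also taken
@ Not reached in this example

UnsignedHigher:
b .
</source>

Both branches are taken although they seem to contradict each other, because “lt” and “hi” interpret the same flags differently.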
==== The IT instruction ====
Jumping is inefficient, so having many conditional jumps may slow down your program. The ARM architecture offers a way to make a few instructions conditional without requiring a jump via the “it” (if-then) instruction. It is used in place of a conditional jump after an instruction that sets the flags (“cmp”, “tst”, “adds”…) and also needs a condition code. The instruction right after the “it” will then only be executed when the condition is met, and skipped otherwise. You have to repeat the condition code by appending it to that instruction; this is just to make the code clearer and avoid confusion.

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

it hi @ Make the next instruction conditional
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

This checks if r0 is higher than r1 (it isn’t), and only sets the pin PA8 to high if this condition is met. Up to 4 instructions can be made conditional like this; for each additional instruction, another “t” has to be appended to the “it” instruction:

<source lang=asm>
cmp r0, r1 @ Perform comparison

ittt hi @ Make the next three instructions conditional
ldrhi r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldrhi r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

You can also add instructions that will be executed if the condition was ''not'' met (like an “else”-case in high-level-languages), by appending “e” instead of “t” to the “it” instruction. Since the “t” in “it” is fixed, the first instruction is always executed if the condition is met; only the next three instructions can be either a “then” case (“t”) or “else” case (“e”). You also have to provide the inverted condition code for the “else”-instructions:

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

ite hi @ Make the next two instructions conditional (if-then-else)
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
strls r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low
</source>

There are several restrictions on which instructions may appear within an “it” block. Most importantly, the “b” instruction is only allowed as the last instruction of an “it” block, and directly jumping to one of the conditional instructions is forbidden too. Instructions that change the flags should be avoided inside the block as well, since the remaining conditional instructions of the block would then be evaluated against the new flags.

In T32 code, only the conditional “b” instruction is capable of encoding a condition code together with some operation, so the “it” instruction is provided to make any instruction conditional. On A32, most instructions include a condition code and can therefore be conditional, and the “it” instruction is actually ignored by the assembler here. You can and should still put “it” into code intended for A32, as this makes it compatible with T32. This is one of the reasons why A32 is more time-efficient, and T32 more space-efficient.

Conditional instructions sometimes make surprisingly compact programs. For example, the euclidean algorithm for calculating the greatest common divisor (gcd) of two numbers can be written in ARM assembly like this:

<source lang=asm>
gcd:
cmp r0, r1
ite gt
subgt r0, r0, r1
suble r1, r1, r0
bne gcd
</source>

While the C equivalent is actually longer:

<source lang=c>
int gcd(int a, int b) {
while (a != b) {
if (a > b)
a = a - b;
else
b = b - a;
}
return a;
}
</source>

The usage of conditional instructions is also [https://developer.arm.com/docs/dui0801/latest/condition-codes/example-showing-the-benefits-of-conditional-instructions-in-a32-and-t32-code faster] than using conditional jumps. Note that the final “bne” instruction is independent of the “if-then” block; it just directly uses the result of “cmp”.
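
Two further short uses of a one-instruction “it” block, given here as an untested sketch: calculating the absolute value of a signed number, and the maximum of two unsigned numbers, both without any jump:

<source lang=asm>
@ r0 = |r0| (signed)
cmp r0, #0
it lt
rsblt r0, r0, #0 @ If r0 < 0: r0 = 0 - r0

@ r0 = max(r0, r1) (unsigned)
cmp r0, r1
it lo
movlo r0, r1 @ If r0 < r1: r0 = r1
</source>

Note that the first snippet leaves the value -2^31 unchanged, as its positive counterpart does not fit into the signed 32bit range.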
=== 8/16 bit arithmetic ===
So far, all numbers were 32 bits wide. However, especially to save space, smaller numbers with 8 or 16 bits are often needed. The Cortex-M3 doesn't provide any instructions for calculating with 8 or 16 bit numbers directly. Instead, after loading such a number from memory into a processor register, it has to be extended to 32 bits to allow the 32bit instructions to work properly. When storing the result back, only the lower 8/16 bits are used. If 8/16bit overflow behavior is required for calculations (i.e. wrapping around at -128/127 for 8bit signed, 0/255 for 8bit unsigned, -32768/32767 for 16bit signed, 0/65535 for 16bit unsigned), the numbers have to be truncated after each calculation. This actually makes it slightly less efficient to deal with smaller numbers.

A 16bit value (“halfword”) can be read from memory with the ldrh instruction:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
</source>

“ldrh” loads 16bit from memory, writes them into the lower 16 bits of the destination register (here: r1), and sets the upper 16bits to zero. If the value is signed, it has to be sign-extended so that it can be used with 32bit-calculations:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
sxth r1, r1
</source>

The “sxth” instruction copies the sign bit (i.e. bit 15) into the upper 16 bits (“sign-extension”); this makes sure that negative 16bit-numbers keep their value when interpreted as 32 bits. The “ldrsh” instruction combines both “ldrh” and “sxth”. “ldrb”, “sxtb”, “ldrsb” are for loading and sign-extending 8bit-values and the combination of both, respectively.

To simulate 8/16bit overflow behaviour after a mathematical operation, use uxtb/uxth for unsigned 8/16 bit numbers, or sxtb/sxth for signed 8/16 bit numbers:

<source lang=asm>
add r0, #1
uxth r0, r0
</source>

The “uxth”/”uxtb” instructions copy the lower 16/8 bits of a register into another one, setting the upper 16/24 bits to zero. This way, if r0 contained 65535 before, the result will be 0 instead of 65536 after using “uxth”.

This is a common trap when coding in C - when using e.g. the "uint16_t" type for local variables such as loop counters, this implicitly requests 16bit overflow behavior, requiring the truncation after each calculation, even though the overflow may actually never happen. This is why e.g. "uint_fast16_t" should be used for local variables, as this is 32 bit on ARM, which is faster.
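
For illustration, a signed 8bit counter in memory could be incremented like this, assuming that “SomeAddress” (as in the examples above) refers to a byte in RAM:

<source lang=asm>
ldr r2, =SomeAddress
ldrsb r0, [r2] @ Load 8bit value and sign-extend it to 32 bits
add r0, #1
sxtb r0, r0 @ Emulate 8bit overflow: 127+1 becomes -128 instead of 128
strb r0, [r2] @ Store the lower 8 bits back
</source>

Since “strb” only stores the lower 8 bits anyway, the “sxtb” is only needed if r0 is used for further calculations or comparisons afterwards.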
=== Alignment ===
There are certain restrictions on the address when accessing data in memory using the “str”/“ldr” variants:
* The “ldrd”/“strd”/“ldm”/“stm” instructions, which can load/store multiple registers at once, always require the address to be a multiple of 4. If it isn’t, the program will crash.
* The “ldr”/“str” instructions require the address to be a multiple of 4, and “strh”/“ldrh” require it to be a multiple of 2. If it isn’t, the behaviour depends on the ARM version:
** On ARMv6-M and before, the program will crash.
** On ARMv7-M:
*** If the CCR.UNALIGN_TRP is set to zero (the default), the access will be slow
*** If the CCR.UNALIGN_TRP bit is set to one, the program will crash, emulating the ARMv6-M behaviour

For “strb”/”ldrb” there are no such requirements.

The number of which the address needs to be a multiple of is called the “alignment” (e.g. 2-byte-alignment, 4-byte-alignment, …). An access with an address that is a multiple of 2/4 as specified above is called an “aligned access”; others are called “unaligned access” (which are slow or cause a crash).

Even though slow accesses may be acceptable, it is still a good idea to make sure all accesses are always correctly aligned in case the code is ported to an ARM version or operating system that requires it. The addresses of periphery registers are already aligned correctly, so there is no need to worry. When placing data in RAM however, you should make sure that the addresses of the individual elements that are accessed via one of the “ldr” variants are aligned properly. For example, if a previous example code was modified like this:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

The address of “var1” will not be a multiple of 4, and an access via “ldr” would be unaligned. This could be improved by adding a space of 3 bytes in between:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.space 3
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

This would require you to keep in mind all the other things in memory that were declared before, which is impractical especially if multiple assembly files are used. Therefore, the assembler offers the “.align” directive:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.align 2
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

When using “.align X”, the assembler makes sure that the next address will be a multiple of 2^X, so in this case, a multiple of 2^2=4. The assembler will therefore insert 0 to 2^X-1 bytes of space. The section containing the directive in the object code file will also be marked to require that alignment, such that the linker will automatically place it at the appropriate location in address space.
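
As a sketch with mixed element sizes (the labels are made up for this example), “.align 1” provides the 2-byte alignment needed for “ldrh”/“strh”, and “.align 2” the 4-byte alignment needed for “ldr”/“str”:

<source lang=asm>
.data
flag:
.space 1 @ 1 byte, accessed via ldrb/strb, no alignment needed
.align 1 @ Next address is a multiple of 2^1=2
counter16:
.space 2 @ 2 bytes, accessed via ldrh/strh
.align 2 @ Next address is a multiple of 2^2=4
counter32:
.space 4 @ 4 bytes, accessed via ldr/str
</source>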
=== Offset addressing ===
The various “ldr”/”str” instructions can optionally perform some calculation on the address before executing the memory access. What is shown for “ldr” here works for “str” and the variants for halfwords and bytes equivalently. There are several variants for this. This first one adds a fixed offset that is encoded within the instruction itself to the address:

<source lang=asm>
ldr r0, [r1, #8]
</source>

This adds 8 to r1 and uses the result as the address to access. The number can also be negative. This variant is useful for accessing members of a heterogeneous container organized like a C struct or the registers in a periphery module. For example, you can load the base address of a periphery module into a register, and then access the various registers using offset-addressing without having to load each address individually:

<source lang=asm>
GPIOA=0x40010800
GPIOx_CRH = 0x04
GPIOx_BSRR = 0x10

GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA

ldr r1, =GPIOA

ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r0, [r1, #GPIOx_BSRR] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “OffsetAddressing”

This way, you can avoid repeated loads of similar addresses. This variant is also capable of writing the newly calculated address back into the address register by appending a “!”:

<source lang=asm>
ldr r0, [r1, #8]!
</source>

This will add 8 to r1, write the result into r1, and also use it as an address from which to load 4 bytes and store them into r0. The variant

<source lang=asm>
ldr r0, [r1], #8
</source>

works the other way round - r1 is first used as the address from which to load the data, and afterwards “r1+8” is written back to r1. The next variant adds two registers to obtain the memory address:

<source lang=asm>
ldr r0, [r1, r2]
</source>

This loads the data from the address calculated by “r1+r2”. The second register (here: r2) can also be optionally shifted left by a fixed number of bits in the range 0-3:

<source lang=asm>
ldr r0, [r1, r2, lsl #2]
</source>

This shifts r2 left by two bits (i.e. multiplies it by 4), adds it to r1, and uses that as the address (r2 itself is not modified).
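
The following sketch runs the variants one after another and notes the accessed address and the resulting register contents in the comments. The start address 0x20000000 is just an assumed RAM address for this illustration:

<source lang=asm>
ldr r1, =0x20000000 @ Assumed base address
ldr r0, [r1, #8] @ Loads from 0x20000008; r1 stays 0x20000000
ldr r0, [r1, #8]! @ Loads from 0x20000008; r1 becomes 0x20000008
ldr r0, [r1], #8 @ Loads from 0x20000008; r1 becomes 0x20000010
ldr r2, =3
ldr r0, [r1, r2, lsl #2] @ Loads from 0x20000010+3*4=0x2000001c; r1 and r2 are unchanged
</source>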
=== Iterating arrays ===
The offset addressing mechanism is perfectly suited to iterating arrays. This could be used to make an array defining a sequence of LED flashes that is iterated by the LED blinker application. Such an array would contain the duration of each on- and off-cycle (as passed to the “Delay” function) and be placed in flash memory:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
.word 2500000, 1000000, 2500000, 1000000, 2500000, 1000000
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
BlinkTableEnd:
</source>
Example name: “BlinkPattern”

The “.word” directive is used to place a sequence of 32bit numbers into flash memory. The label “BlinkTable” will refer to the start address of the array, and “BlinkTableEnd” to the first address ''after'' the array. These two addresses are loaded into registers before the loop. The “.align” directive is used to make sure the 32bit-words are stored at properly aligned addresses. Inside the loop, the “ldr” instruction is used to load a 32bit-word from the array and pass it to the “Delay” function. The r7 register is advanced by 4 bytes to the next 32bit-word. This is done twice, for the on- and off-time. At the end of the loop, the address register is compared with the address of “BlinkTableEnd” - until that address has been reached, the loop will continue.

Another possibility is to keep the base address of the array in a register, and increment another register that contains the offset:

<source lang=asm>
.type Blink, %function
Blink:
push { r4-r9, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =0
ldr r9, =18

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

cmp r8, r9
blo BlinkLoop

pop { r4-r9, pc }
</source>
Example name: “BlinkPattern2”

Here, r8 is incremented in steps of 1 to denote the index in the array. The “lsl” syntax for “ldr” is used to multiply r8 by 4 (since each word is 4 bytes in size) and add it to r7, which contains the array’s base address. At the end of the loop, r8 is compared with 18, which is the number of entries in the array. This variant is actually less efficient, as it needs to keep both the base address and the index in registers and also has to increment the index in each iteration.
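
The post-increment variant also works well with two pointers at once. The following sketch copies a number of 32bit-words from one memory area to another; the function name “CopyWords” and the register assignment are chosen just for this illustration:

<source lang=asm>
@ r0 = destination address, r1 = source address, r2 = number of words (at least 1)
.type CopyWords, %function
CopyWords:
ldr r3, [r1], #4 @ Load word from source and advance source address
str r3, [r0], #4 @ Store word to destination and advance destination address
subs r2, #1 @ Decrement remaining count, updating the flags
bne CopyWords @ Repeat until the count reaches zero
bx lr
</source>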
=== Literal loads ===

Regardless of architecture, any processor obviously needs to work with addresses in its own address space a lot. ARM can do calculations with its 32bit addresses just fine, but there is a bottleneck: The instruction set itself. To work with any address, it needs to be initially loaded into a processor register, but ARM instructions are only 16 or 32bit in size - not enough space for an arbitrary 32bit number plus the instruction encoding. Allowing even larger instructions (e.g. 40 bit) would complicate matters, so ARM instead uses several tricks to deal with this problem, which will be discussed here.

The “ldr r0, =1234” syntax allows you to load any arbitrary 32bit number. It is not actually a machine code instruction itself, but is translated by the assembler into one. In this chapter, the actual instructions for loading immediate numbers are discussed.

==== The “mov”-instruction ====
The most basic way of loading an immediate number into a register is the “mov” instruction:

<source lang=asm>
mov r0, #1234
</source>

This allows you to load any 16bit number (0 to 2^16-1) into a register. “mov” also includes some clever encodings that allow you to load certain commonly-used patterns:

* Any 32bit number that consists of one byte of arbitrary bits (i.e. 8 adjacent arbitrary bits) at any location, and zeros otherwise, e.g. 0x00000045, 0x00045000, 0x7f800000.
* Any 32bit number that consists of the same byte repeated 2 or 4 times in fixed places, as in 0x23002300, 0x00230023, 0x23232323
* The bit-wise negated result of any of these two patterns, e.g. 0xffffffba, 0xfffbafff, 0x807fffff or 0xdcffdcff. The assembler will actually use the “mvn” instruction for this, which works identically to “mov”, but negates the value.

By specifying a number that falls into one of these patterns, the assembler will automatically use the appropriate encoding. The first two ways of encoding numbers are not only available with “mov”, but also several other mathematical instructions that expect some immediate value: “add”, “and”, “bic”, “cmn”, “cmp”, “eor”, “mov”, “mvn”, “orn”, “orr”, “rsb”, “sbc”, “sub”, “teq”, “tst”. In the ARM Architecture Reference Manual, check the description of the instructions and look out for “ThumbExpandImm” to see whether it supports the first two patterns above.

You can also use the “mvn” instruction directly, e.g.:

<source lang=asm>
mov r0, #0xf807ffff
mvn r0, #0x07f80000
</source>

Both lines are equivalent and write the number 0xf807ffff into r0.
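
A few more immediates that match the patterns described above, given as a sketch (the register choice is arbitrary):

<source lang=asm>
mov r0, #0x00ff0000 @ One byte shifted to an arbitrary position
mov r1, #0x00ab00ab @ Same byte repeated in fixed places
mov r2, #0xffffff00 @ Negated pattern; can be assembled as "mvn r2, #0xff"
orr r3, r3, #0x4000 @ The patterns also work for other instructions
and r4, r4, #0xff00ff00
@ mov r5, #0x12345678 @ Matches none of the patterns, so a single "mov" cannot encode it
</source>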
==== The “movt” instruction ====
While supporting many common patterns, this does not allow arbitrary 32 bit numbers. One way to load any 32bit number is to split the number into two 16bit halves, and use both “mov” and “movt” to combine these two half-words into one register:

<source lang=asm>
mov r0, #0xabcd
movt r0, #0x1234
</source>

The “movt” instruction loads the given number into the upper 16 bits of the register, so this example loads 0x1234abcd into r0. The order is important, as “mov” overwrites the upper 16 bits with zeros, but “movt” keeps the lower 16 bits. If a single “mov” can't fit the desired number, the combination of “mov” and “movt” is the fastest way of loading any 32bit number. As two 32bit instructions are needed, this consumes 8 bytes of program memory. If you want to load the address of a symbol into a register, you need to tell the assembler to split it automatically. This can be achieved by prefixing the symbol with “:lower16:” or “:upper16:”, e.g.:

<source lang=asm>
movw r0, #:lower16:GPIOA_BSRR
movt r0, #:upper16:GPIOA_BSRR
</source>

Note that “movw” needs to be specified in this case to explicitly tell the assembler to use the “mov” variant that accepts 16bit numbers (which it otherwise does automatically when a direct value is given).
==== PC-relative loads ====
The other way of loading arbitrary 32bit values into registers is to place the value directly in flash memory, and load it from there using “ldr”:

<source lang=asm>
@ Some code …
mov r0, … address of Literal ...
ldr r1, [r0]
@ More code …
Literal:
.word 0x12345678
</source>

However, there is a chicken-and-egg problem - the address of “Literal” is a 32bit number itself, so how to load it into r0? Luckily, there is a register that contains a number close to the one needed - the program counter (PC, r15), which refers to the instruction currently being executed (when read in Thumb code, it yields the address of the current instruction plus 4). By reading it and adding a small offset that fits into the instruction itself, the address of “Literal” can be obtained, provided that “Literal” is located close enough. Consider this example of the EnableClockGPIOA function:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
add r1, pc, #12
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
.word RCC_APB2ENR
</source>

The 32bit-value “RCC_APB2ENR” is stored in flash memory. The “add” instruction is used to add the offset 12 to the value read from PC (the address of the “add” instruction plus 4, rounded down to a multiple of 4) to obtain the address of said 32bit-value, which is then loaded via “ldr”. The offset 12 is actually not easy to calculate and even depends on the alignment of the “add” instruction itself (hence the “.align” to ensure a consistent example). The assembler is capable of doing the calculation on its own, for which the “adr” instruction is used:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
adr r1, LiteralRCC_APB2ENR
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

The label LiteralRCC_APB2ENR refers to the address of the 32bit-value in memory. “adr” is actually a variant of “add” that instructs the assembler to calculate the offset and place it into the instruction itself, which then lets the processor add it to PC and write the result to r1. This address is then used by “ldr”.

The “adr” instruction is useful when the address of some literal is explicitly needed; for example, in the blinker program, it can be used to obtain the addresses of the array:

<source lang=asm>
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8
</source>

However, for loading a single value, the address is actually not needed. In this case, “adr” and “ldr” can be combined:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, LiteralRCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

This special variant of “ldr” lets the assembler calculate the offset as with “adr”, adds it to “PC” at runtime and loads the data found at that address into r1. This is much easier than the first variant, as all calculations are done automatically. It is still somewhat cumbersome having to write three lines just to obtain a single 32bit value. Therefore, the assembler offers this already introduced syntax:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

This is a special command for the assembler. If possible, the assembler will use the “mov” or “mvn” instruction to load the value. If the value won’t fit, it will be put into flash memory, and a “ldr” instruction as above will be used. In this case, the “ldr rX, =...” syntax is equivalent to the combination of specifying a label for the value, the “.word” directive and “ldr rX, <Label>”. Therefore, this syntax is usually the best way to load immediates.

The assembler places the literals at the end of the file. If the file is long, the offset will be too large for the “ldr” and “adr” instructions and the assembler will emit an error. You can instruct the assembler to place all literals that have been declared so far at a specific point using the “.ltorg” directive. It is recommended to place an “.ltorg” after each function (after the “bx lr”) - just make sure that execution will never reach there. If a single function is so long that an “.ltorg” at the end is too far away from “ldr”/“adr” at the beginning, you can place an “.ltorg” somewhere in the middle and jump over it with “b”.
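
Both placements could look like this sketch; the label “SkipLiterals” is made up for this illustration:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR @ If a literal is needed, it is placed at the next ".ltorg"
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1]
bx lr @ Return to caller
.ltorg @ Literals declared so far are placed here, after the return

@ Somewhere in the middle of a very long function:
ldr r2, =GPIOA_BSRR
b SkipLiterals @ Jump over the literals so they are never executed
.ltorg
SkipLiterals:
@ Function continues here...
</source>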

In summary, the following rules can help make literal loads more efficient:
* Avoid literal loads if possible; try to calculate needed values from other values that have already been loaded, possibly by using offset-addressing in “ldr”/“str”
* When accessing multiple registers of a single periphery module, load its base address once and use offset addressing to access the individual registers
* If you need a pointer to a location in flash memory, try using “adr”
* If speed is important, use “movw”+“movt” to load the value
* Else, use “ldr rX, =...” to have the assembler choose the optimal encoding
* Place “.ltorg” after each function

The “ldr … =” instruction can also be used to load any immediate 32bit value into the PC to cause a jump to that address, simply by specifying “pc” as the target register. If you perform an ordinary branch (via “b” or “bl”) to some function whose address is too far away from the current code location, the linker will insert a “wrapper” function that does exactly that to perform the “far” jump. That function is called a “veneer”.
=== The SysTick timer ===
An important aspect of many embedded systems is to control the timing of technical processes. In the blinker example, the timing of the LED flashes was handled by having the processor execute dummy instructions to pass time. It is however virtually impossible to accurately predict the runtime of any piece of code on a complex processor such as ARM ones, and the runtime may vary among multiple runs and depending on the actual microcontroller and its configuration. For a simple LED blinker this may be acceptable, but not for e.g. a closed loop controller for some mechanical actuator. Therefore, almost all microcontrollers and also application processors feature one or more hardware timers, which allow measuring time independently of the execution speed of the software. Timer features vary widely among different processors, but the basic idea is to increment or decrement some digital counter at each clock cycle and trigger some event when it reaches a certain value.

All ARMv7-M processors feature the so-called “SysTick”-Timer as part of the processor core itself. This is a rather simple 24bit-timer that counts from a configurable value back to zero, then resets to that value and triggers an event. This timer is frequently used as a time base for RTOS or other runtime libraries. The timer uses three periphery registers: “RVR” contains the value from which to count down. “CVR” contains the current value, and “CSR” contains some status and control bits. The timer can be used for the “Delay” function like this:

<source lang=asm>
SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
ldr r1, =SCS
add r0, r0, r0, lsl #1

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

DelayLoop:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #0x10000
beq DelayLoop

ldr r0, =0
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>

The SysTick is part of the “System Control Space”, SCS. The SCS base address is defined as a symbol, and the relative addresses of the registers as well. The count value is stored in “RVR”, after which “CVR” has to be set to zero. The timer is started by writing “5” into the “CSR” register. The loop repeatedly reads the “CSR” register and continues until bit 16 is set. The “tst” instruction is used to perform an “and” operation with the register contents and an immediate value without keeping the result while just updating the flags. At the end, the “CSR” register is set to zero to disable the timer. The “add” instruction at the beginning is used to multiply the count value by 3: r0 is shifted left by one, i.e. multiplied by two, and then added to itself, as in r0*2^1+r0. This is a common trick to quickly multiply by constants. By including this multiplication, the duration is the same as with the previous “Delay” variant, which, on this microcontroller, uses about 3 cycles per loop iteration.
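
The same shift-and-add idea works for other small constants as well. A short sketch, with r0 as the input and arbitrarily chosen destination registers:

<source lang=asm>
add r1, r0, r0, lsl #1 @ r1 = r0 + r0*2 = r0*3
add r2, r0, r0, lsl #2 @ r2 = r0 + r0*4 = r0*5
rsb r3, r0, r0, lsl #3 @ r3 = r0*8 - r0 = r0*7
add r4, r0, r0, lsl #3 @ r4 = r0 + r0*8 = r0*9
lsl r5, r0, #4 @ r5 = r0*16
</source>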

Managing timing this way (or any other kind of “Delay” function) is still not very accurate. The time needed to call the function, start the timer, return, and set the pins is added to the actual duration and may also vary each time. The timing errors accumulate over time - a clock implemented this way will quickly go wrong. The proper way to achieve accurate timing is to start the timer once, let it run continuously, and react to its events. The internal clock source used by the microcontroller is also quite inaccurate (up to 2.5% deviation), which can be improved upon by a quartz crystal (typical accuracy of e.g. 0.005%), which will be covered later. Reacting to events instead of calling a function that executes dummy code requires restructuring the program code, without using any kind of “Delay” function.

To do that, the timer is started once at program startup and kept running. After setting the LED pin, wait for the timer event, and repeat. In the last example, the values 3000000 and 7500000 are used for the timer register (3x1000000 and 3x2500000, respectively). Changing the timer value while it is running continuously is problematic, so one fixed value should be used; to achieve variable blinker duration, multiple timer events need to be counted. The greatest common divisor of the two numbers is 1500000, so to achieve the two different times, 2 and 5 timer events need to be counted, respectively. Since these numbers fit into a single byte, the table entries and the corresponding access instructions are changed to byte size. A function “StartSysTick” is implemented to start the timer once, and a function “WaitSysTick” to wait for a given number of timer events:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =TimerValue
bl StartSysTick
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2, 2
BlinkTableEnd:
.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

bx lr

@ r0 = Number of timer events to wait for
.type WaitSysTick, %function
WaitSysTick:
ldr r1, =SCS

WaitSysTickLoop:
ldr r2, [r1, #SCS_SYST_CSR]
tst r2, #0x10000
beq WaitSysTickLoop

subs r0, #1
bne WaitSysTickLoop

bx lr

</source>
Example name: “BlinkSysTick”

This way, the blinker frequency will be as stable and accurate as possible with the given clock source.
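The delay counts in the table can also be derived by the assembler instead of being typed in by hand. The following is only a sketch: the symbol names “ShortDelay” and “LongDelay” are made up for illustration and do not appear in the example project, and it assumes that both durations are integer multiples of “TimerValue”.

<source lang=asm>
TimerValue = 1500000

@ Hypothetical helper symbols: number of timer events per delay
ShortDelay = 3000000 / TimerValue @ Evaluates to 2
LongDelay = 7500000 / TimerValue @ Evaluates to 5

.align 2
.type BlinkTable, %object
BlinkTable:
.byte ShortDelay, ShortDelay, ShortDelay, ShortDelay, ShortDelay, ShortDelay
.byte LongDelay, ShortDelay, LongDelay, ShortDelay, LongDelay, ShortDelay
.byte ShortDelay, ShortDelay, ShortDelay, ShortDelay, ShortDelay, ShortDelay
BlinkTableEnd:
.align 2
</source>

Because the assembler evaluates the divisions while assembling, the generated table is identical to the hand-written one.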
=== Exceptions & Interrupts ===
Exceptions and interrupts play an important role in low-level development. They provide a facility for hardware to notify the software of events, such as received data blocks or a timer event. On ARM, interrupts are a sub-group of exceptions – there are some “system-level” exceptions mostly for dealing with processor errors and providing operating system support, while interrupts are “special” exceptions for events signaled by periphery modules. When writing “regular” microcontroller software, you will mostly work with interrupts.

Exceptions (and interrupts) interrupt normal program flow, and cause the processor to execute some other piece of code which is called the exception handler or Interrupt Service Routine (ISR) (even for the “system-level” exceptions that are not interrupts). After dealing with the indicated event, the ISR typically returns and normal program flow resumes. As exceptions can interrupt the program at any time, data (and periphery) may be in an inconsistent state, so special care must be taken to avoid corrupting program state in an ISR. ARMv7-M processors (including the Cortex-M3) provide sophisticated support for exceptions, with configurable priorities and nested exception calls. This chapter will only cover the basics of using exceptions.

On ARMv7-M, exception handlers are implemented as regular functions, for example:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
@ Handle event ...
bx lr
</source>

Like any other function, it has a label, returns with “bx lr”, and is also made globally visible to other source files using “.global”. The “.type … %function” is required here for the same reason as for the already-mentioned “Reset_Handler”. Exception handlers can be located anywhere in flash memory, among the other regular functions. To tell the processor where the exception handlers for the various exception types are located, the vector table needs to be adjusted. Until now, the vector table was defined as:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Recall that the first 32bit-word in flash memory contains the initial stack pointer (defined via “.word _StackEnd”) and the second word contains the address of the first instruction of the program (defined via “.word Reset_Handler”). Actually, resetting the controller is an exception too, and the code to be executed after reset (or start-up) is the handler for the reset exception (hence the name “Reset_Handler”). The next 228 bytes of flash memory contain 57 32bit-addresses of the handlers of the other exceptions, including interrupts. The “.space” directive just fills those with zeroes. To tell the processor the address of an exception handler, the appropriate entry in this table needs to be set to that address. In chapter 10.1.2, table 63 of the controller’s reference manual, the format of the vector table, and which exception’s address should go where, is defined. Only the interrupts up until position 42 actually exist on the STM32F103RB/C8, as defined in chapter 2.3.5 of the datasheet; everything from “TIM8_BRK” is only present on larger controllers. According to the table, the SysTick exception handler’s address needs to be put at location 0x3C relative to the beginning of flash memory. Since the first 8 bytes are already occupied, 0x34 bytes of space are needed after those first 8 bytes.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0x34
.word SysTick_Handler
.space 0xac
</source>

With this modification, the SysTick_Handler function is now declared as the handler for the SysTick exception. By default, the SysTick timer does not trigger an exception. To make it do so, you have to set bit 1 (“TICKINT”, i.e. the value 2) in the SCS_SYST_CSR register; the value written by “StartSysTick” therefore changes from 5 to 7. By placing the logic for the blinker into the timer's ISR, you get an interrupt-based blinker:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
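ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ BlinkStep = 0 (byte store, as the variable is 1 byte in size)
ldr r1, BlinkTable @ Load first word of the table; its lowest byte is the first table entry
strb r1, [r0, #(TimerEvents-Variables)] @ TimerEvents = first table entry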

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

ldr r0, =TimerValue
bl StartSysTick
SleepLoop:
wfi
b SleepLoop

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq Return

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs Return

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne Return

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs SkipRestart

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

SkipRestart:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

Return:
bx lr

.align 2
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr

</source>
Example name: “BlinkSysTickInterrupt”

The regular program flow now consists only of initializing the periphery, timer, and the first step of the blinker (setting the pin high). After that, the processor should do nothing but wait for exceptions, which is achieved by a simple endless loop. The “wfi” instruction suspends the processor; when an exception occurs, the processor will wake up, execute the ISR, and return execution after the “wfi”. Therefore, “wfi” is usually put in an endless loop as shown. This technique can reduce the processor’s power consumption significantly, as it is only running when something needs to be done, as indicated via interrupts. The ISR first checks whether the interrupt flag in the timer register is set - this is necessary, since exceptions can sometimes occur “spuriously”, i.e. without an actual event causing it. The decision whether to set or reset the pin state is taken based on the lowest bit of the table index, such that the output alternates between 1 and 0.

The code inside the ISR needs to know which step in the blinking sequence is currently active, and how many timer events are still remaining in the current step. Therefore, two 1-byte variables are stored in RAM. To access them, offset addressing is used, where r0 contains the base address of the variables in memory, and the offsets inside “ldrb” and “strb” are set accordingly. The last number of the blink sequence table is omitted, since it is superfluous: no action is taken after the last delay has elapsed. Because the table size is now odd, an “.align” directive after it is required. Always putting “.align” after outputting data is a good idea anyway.

Since exceptions can occur at any point in regular program flow, the processor registers may contain some data that will be used after the exception handler returns. Therefore, if the exception handler writes anything into the registers, they need to be restored when returning from the exception. Upon exception entry, the Cortex-M3/4 processors automatically store the registers r0-r3, r12, r14 (LR), the return address and the program status register (including the APSR flags) on the stack. The link register is then filled with a special “dummy” value, and when the exception handler returns via “bx lr” using this value, the processor restores said registers to their previous state and resumes the interrupted code. This effectively means that you can implement exception handlers like any other function, i.e. freely overwrite r0-r3, r12 and the flags, and push/pop r4-r11 and the LR if needed.
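As a minimal sketch of that rule, the following handler uses one callee-saved register and calls a subroutine. The name “HandleEvent” is a placeholder for some function defined elsewhere and is not part of the example project.

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
push { r4, lr } @ Save r4 (not stacked automatically) and the special return value in lr

ldr r4, =Variables @ r4 keeps its value across the function call below
bl HandleEvent @ Hypothetical function; "bl" overwrites lr

ldrb r0, [r4] @ r4 is still valid here, while r0-r3 may have been changed by HandleEvent
@ Further processing ...

pop { r4, pc } @ Restore r4; loading the special value into pc performs the exception return
</source>

Pushing two registers keeps the stack pointer at a multiple of 8, which is what the calling convention expects at function calls.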
=== Macros ===
The assembler provides a few mechanisms to make assembly-language development easier. One of them is macros, which allow you to define snippets of assembly code that you can then insert easily wherever you need them. Although a macro invocation looks similar to a function call, the code inside the macro is copied each time the macro is used, which increases program size, so don’t overuse them. Macros are started with “.macro” and end at the next “.endm” directive. For example, the following macro sets the LED pin to 0 or 1:

<source lang=asm>
.macro SETLED value
ldr r0, =GPIOA_BSRR
ldr r1, =(((!\value) << 24) | (\value<<8))
str r1, [r0]
.endm

SETLED 0
SETLED 1
</source>

The macro name is defined as “SETLED”, and a single parameter with name “value” is given. By typing “\value”, the value of the parameter is substituted in the macro body. Some bit-shifting is used to calculate the right bit pattern to write into BSRR to set or reset the pin accordingly.
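To illustrate what the assembler does with the two invocations above, this is the code they expand to, with the constant expressions evaluated by hand (assuming that “!” acts as a logical NOT in the expression, which is how the macro is meant to work):

<source lang=asm>
@ SETLED 0 expands to:
ldr r0, =GPIOA_BSRR
ldr r1, =(((!0) << 24) | (0<<8)) @ = 0x1000000, i.e. GPIOx_BSRR_BR8: set pin low
str r1, [r0]

@ SETLED 1 expands to:
ldr r0, =GPIOA_BSRR
ldr r1, =(((!1) << 24) | (1<<8)) @ = 0x100, i.e. GPIOx_BSRR_BS8: set pin high
str r1, [r0]
</source>

Note that each use of the macro overwrites r0 and r1.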
=== Weak symbols ===
As explained before, labels defined in assembly files get translated into symbols in the object code files, which are resolved by the linker. Sometimes it is desirable to provide a “default” or “fallback” implementation of some function (or data block) which is only used when no other implementation is given. This can be achieved by marking the “fallback” variant with “.weak”:

<source lang=asm>
.type Function1, %function
.global Function1
.weak Function1
Function1:
@ Default implementation …
…
bl Function1 @ Call the function
</source>

With this code alone, “Function1” will be used normally. If you put another function with the same name in a different assembly source file, that second variant will be used.
=== Symbol aliases ===
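It is also possible to define aliases for symbols using “.thumb_set”. It works like “.set”, i.e. it assigns the address of an existing symbol to a new name, and additionally marks the new symbol as a Thumb function. For example: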

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

When trying to call “Function2”, the linker will automatically fill in the address of “Function1”. This can also be combined with “.weak” to define a weak alias:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.weak Function2
.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

If you now define another “Function2” in a different assembly source file, that will be used. If not, “Function1” will be called, which is the target of the alias definition. This is useful if you want to define one default implementation for several different functions, for each of which you need one “.weak” and one “.thumb_set” directive.
=== Improved vector table ===
The techniques from the last three sections can be used to improve the definition of the vector table. The way it was defined before is not very flexible; to insert new entries, you have to calculate the new gap sizes and offsets. First, define a default handler ISR that is called by exceptions for which no other ISR is defined, and a macro that defines an alias for one exception with the default handler as the target, and finally a table of all exceptions by using the macro:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.macro defisr name
.global \name
.weak \name
.thumb_set \name, Default_Handler
.word \name
.endm

.global VectorTable
.section .VectorTable, "a"
.type VectorTable, %object
VectorTable:
.word _StackEnd
defisr Reset_Handler
defisr NMI_Handler
defisr HardFault_Handler
defisr MemManage_Handler
defisr BusFault_Handler
defisr UsageFault_Handler
.word 0
.word 0
.word 0
.word 0
defisr SVC_Handler
defisr DebugMon_Handler
.word 0
defisr PendSV_Handler
defisr SysTick_Handler
defisr WWDG_IRQHandler
defisr PVD_IRQHandler
defisr TAMPER_IRQHandler
defisr RTC_IRQHandler
defisr FLASH_IRQHandler
defisr RCC_IRQHandler
defisr EXTI0_IRQHandler
defisr EXTI1_IRQHandler
defisr EXTI2_IRQHandler
defisr EXTI3_IRQHandler
defisr EXTI4_IRQHandler
defisr DMA1_Channel1_IRQHandler
defisr DMA1_Channel2_IRQHandler
defisr DMA1_Channel3_IRQHandler
defisr DMA1_Channel4_IRQHandler
defisr DMA1_Channel5_IRQHandler
defisr DMA1_Channel6_IRQHandler
defisr DMA1_Channel7_IRQHandler
defisr ADC1_2_IRQHandler
defisr USB_HP_CAN1_TX_IRQHandler
defisr USB_LP_CAN1_RX0_IRQHandler
defisr CAN1_RX1_IRQHandler
defisr CAN1_SCE_IRQHandler
defisr EXTI9_5_IRQHandler
defisr TIM1_BRK_IRQHandler
defisr TIM1_UP_IRQHandler
defisr TIM1_TRG_COM_IRQHandler
defisr TIM1_CC_IRQHandler
defisr TIM2_IRQHandler
defisr TIM3_IRQHandler
defisr TIM4_IRQHandler
defisr I2C1_EV_IRQHandler
defisr I2C1_ER_IRQHandler
defisr I2C2_EV_IRQHandler
defisr I2C2_ER_IRQHandler
defisr SPI1_IRQHandler
defisr SPI2_IRQHandler
defisr USART1_IRQHandler
defisr USART2_IRQHandler
defisr USART3_IRQHandler
defisr EXTI15_10_IRQHandler
defisr RTCAlarm_IRQHandler
defisr USBWakeUp_IRQHandler

.text

.type Default_Handler, %function
.global Default_Handler
Default_Handler:
bkpt
b.n Default_Handler
</source>

There are a few empty entries in the table that are not used by the processor. At the beginning, there is still the definition for the initial stack pointer and the “Reset_Handler”. If you replace your “vectortable.S” by this code, you get a “proper” vector table. The “SysTick_Handler” will continue to work as before, and if you need to define any other ISR, for example for USART1, just define a function with the exact name “USART1_IRQHandler”. The address of this function will automatically be put in the vector table. If an exception without a corresponding ISR occurs, the “Default_Handler” will be called, which uses the “bkpt” instruction to force a breakpoint via the attached debugger. This helps with debugging missed exceptions and avoids the need to define several individual dummy handler functions.
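For example, an ISR for USART1 could be sketched as follows; the body is only a placeholder. Because the symbol defined here is not weak, the linker prefers it over the weak alias to “Default_Handler” created by “defisr”.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type USART1_IRQHandler, %function
.global USART1_IRQHandler
USART1_IRQHandler:
@ Placeholder: handle the USART1 event and clear its interrupt flag here
bx lr
</source>

The function should be put into a different source file than the vector table, as the alias and the function would otherwise most likely clash as two definitions of the same symbol within one file.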
=== .include ===
Having to put the register and bit definitions (“RCC_APB2ENR”, “RCC_APB2ENR_IOPAEN”, ...) in each assembly source file is redundant and error-prone. Instead, you can put them into a separate file (e.g. called “stm32f103.inc”) and use the “.include” directive to reference it:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

@ Normal code ...
</source>

The assembler will read the code from the included file and pretend it was written instead of the “.include” line. This can help improve code structure. While working on the project structure, you can also restructure the definitions for the GPIO registers to facilitate offset addressing:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
</source>

The next example incorporates these changes in addressing the registers.
=== Local Labels ===
Having to invent unique labels for all jump targets inside functions (e.g. for conditional code and loops) can be tedious. In addition, when using a disassembler (see below), each label will appear as its own function. Therefore, the GNU assembler supports local labels. These are labels whose names consist only of a number. Local names need not be unique; several labels called e.g. “1” may exist in one file. To perform a jump to a local label, use the number and append an “f” or “b” to indicate whether to jump forward or backward; the closest label with that number in the given direction is used. Local labels cannot be exported with the “.global” directive. The interrupt-based blinker can be modified like this using local labels:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
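ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ BlinkStep = 0 (byte store, as the variable is 1 byte in size)
ldr r1, BlinkTable @ Load first word of the table; its lowest byte is the first table entry
strb r1, [r0, #(TimerEvents-Variables)] @ TimerEvents = first table entry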

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

1:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>
Example name: “BlinkLocalLabels”
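The reuse of label numbers can be seen in the following sketch of two small, unrelated functions. The function names “Delay” and “Abs” are made up for this illustration and are not part of the example project.

<source lang=asm>
@ r0 = Number of loop iterations (must be at least 1)
.type Delay, %function
Delay:
1:
subs r0, #1
bne 1b @ Jump backward to the closest preceding "1:"
bx lr

@ r0 = Signed number; returns its absolute value in r0
.type Abs, %function
Abs:
cmp r0, #0
bge 1f @ Jump forward to the closest following "1:"
rsbs r0, r0, #0 @ r0 = 0 - r0
1:
bx lr
</source>

Both functions use the label “1” without conflict, because each reference is resolved to the nearest definition in the given direction.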
=== Initializing RAM ===
The blinker program uses two byte variables in memory, which have to be initialized to some value at startup. For large programs with many variables, initializing each of them by hand quickly becomes hard to maintain and also inefficient. Assembler and linker can help by producing an “image” of what the RAM contents should look like after initialization, and placing this image in flash memory alongside the normal program data. At startup, this image can simply be copied 1:1 into RAM in a loop. Most programs contain many variables that are initialized with zero, so placing a (possibly large) block of zeroes in flash memory is wasteful; therefore, an additional loop is used to set all of these variables to zero. Both techniques are also employed by C and C++ compilers, so implementing the initialization code is required there too. First, change the declaration of your variables by using “.byte”, “.hword” and “.word” and include the desired initialization value. Variables that should be initialized with zero are placed after a “.bss” directive, which puts them into the section of the same name. They don’t get an initialization value, but just reserved space by using “.space”:

<source lang=asm>
.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1
</source>

From the assembler’s point of view, the initialization data - in this case, just one byte of value “2” - will directly end up in RAM. However, this is not possible on microcontrollers, as the RAM always contains random data on startup and isn’t automatically initialized. To achieve that, change the linker script as follows:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
. = ALIGN(4);
} >FLASH

.stack (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
} >SRAM

.data : {
_DataStart = .;
*(.data);
. = ALIGN(4);
_DataEnd = .;
} >SRAM AT >FLASH

_DataLoad = LOADADDR(.data);

.bss (NOLOAD) : {
_BssStart = .;
*(.bss);
. = ALIGN(4);
_BssEnd = .;
} >SRAM
}
</source>
Example name: “BlinkInitRAM”

The stack is put in its own section with the “NOLOAD” attribute, since it doesn’t need initializing. The data is now put in the “.data” section. The initial data for that section is put into flash memory via the “>SRAM AT >FLASH” construct. The addresses of symbols inside the “.data” section are still the addresses in RAM, so accesses to the symbols from assembly code still work. The symbol “_DataStart” is assigned the beginning of the initialized data in RAM, and “_DataEnd” the end. The “LOADADDR” function is used to get the beginning of the initialization data in flash, and assign it to “_DataLoad”. The “.bss” section contains all the variables that should be zero-initialized, and the symbols “_BssStart” and “_BssEnd” are set to its beginning and end address, respectively. It is marked with “NOLOAD” as well, because we don't want to store (potentially many) zeroes in the linked program file; instead, we will explicitly initialize it (see below). As the beginning and size of the stack are already multiples of 4, the beginning of “.data” is as well. The size of “.data” might not be a multiple of 4, however, so an “. = ALIGN(4)” command is inserted right before the definition of “_DataEnd”. This adds 0-3 dummy bytes by incrementing the location counter to make sure the address is a multiple of 4. The same thing is done right before “_BssEnd” and also at the end of the “.text” section, to make sure that “_BssEnd” and “_DataLoad” are multiples of 4 as well.

The only thing left is the actual initialization of the RAM. To do that, change the “Reset_Handler” as follows:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b
.ltorg
</source>

The explicit initialization of the variables was removed. Instead, the addresses for “_DataStart”, “_DataEnd” and “_DataLoad” that were defined in the linker script are loaded. Then, a short loop repeatedly loads a word from flash (i.e. starting with “_DataLoad”) and stores it into RAM (starting at “_DataStart”). The address pointers are incremented by the “ldr”/“str” instructions after the access. The pointer for the RAM location is compared with the end of the RAM area (“_DataEnd”) to decide whether to jump back to the beginning of the loop. To start the loop, a jump directly to the comparison is performed; this way, the comparison only has to be written once, and it is still executed before the first iteration, so an empty area is handled correctly. The second loop performs the zero-initialization of the area between “_BssStart” and “_BssEnd”; it works similarly, but does not need to load any data.

Unfortunately, the program as shown can’t be translated - as the two variables now reside in two different sections (“.data” and “.bss”), the offset addressing in the “SysTick_Handler” doesn’t work anymore. Therefore, direct addressing has to be used:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr
</source>

=== Peripheral interrupts ===
Interrupts, i.e. exceptions triggered by periphery modules, need a little extra code compared to the “core” exceptions including the SysTick. The Cortex-M’s interrupt controller (the NVIC) contains several registers for configuring these interrupts. It is possible to configure the priority and manually trigger interrupts, but for most applications, the only necessary thing to do is enabling the desired interrupt. This is done via the registers “NVIC_ISER0” through “NVIC_ISER15”, which are documented in the ARMv7-M Architecture Reference Manual in chapter B3.4.4. Each of those registers contains 32 bits with which 32 of the interrupts can be enabled. The STM32F103RB/C8 has 43 interrupts, so only two of the possible 16 registers are present. The number of interrupts is given in chapter 2.3.5 of the controller’s datasheet. So, to enable some interrupt x, the bit “x mod 32” in register NVIC_ISER''y'' with y=x/32 (rounded down) has to be set. This register’s address is 0xE000E100+y*4. Given an interrupt’s number in r0, the following function does just that:

<source lang=asm>
NVIC_ISER0 = 0xE000E100

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>
Example name: “BlinkTIM1”

The “and” instruction calculates “x mod 32”, and the following left-shift (“lsls”) calculates the value where bit “x mod 32” is one, and all others are zero. To calculate the offset address “y*4”, i.e. “(x/32)*4”, the register is first shifted right by 5 bits and then shifted back left by 2 bits. This is the same as shifting 3 bits right and zeroing out the lower 2 bits; but two shift instructions actually consume less program memory space. Finally, the calculated value is written into the register by using offset addressing.
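As a worked example of this calculation, the same register access can be written with constant expressions when the interrupt number is known at assembly time. This sketch assumes the symbol “USART1_IRQn = 37”, which is not defined in the tutorial’s files; the number corresponds to the position of “USART1_IRQHandler” among the interrupts in the vector table.

<source lang=asm>
NVIC_ISER0 = 0xE000E100
USART1_IRQn = 37 @ Assumed definition, not part of the original include file

ldr r0, =(NVIC_ISER0 + ((USART1_IRQn >> 5) << 2)) @ y = 37/32 = 1, address = 0xE000E104
ldr r1, =(1 << (USART1_IRQn & 0x1F)) @ 37 mod 32 = 5, value = 0x20
str r1, [r0] @ Enable interrupt 37
</source>

For TIM1_UP_IRQn = 25, the same expressions yield bit 25 in NVIC_ISER0 at address 0xE000E100.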

In addition to enabling the interrupt in the processor core’s NVIC, it also has to be enabled in the periphery module. Many periphery modules support several different events, each of which has to be enabled in the periphery’s register individually. Depending on the controller, these can be mapped to one single processor interrupt (and hence, one single ISR) or multiple ones, and need to be configured in the NVIC appropriately.

This example uses the STM32’s periphery timer TIM1 instead of the SysTick timer:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500
TimerPrescaler=1000

.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl EnableClockTIM1
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TIM1_UP_IRQn
bl EnableIRQ
bl StartTIM1
1:
wfi
b 1b
.ltorg

.type TIM1_UP_IRQHandler, %function
.global TIM1_UP_IRQHandler
TIM1_UP_IRQHandler:
ldr r0, =TIM1
ldr r2, =(~(1 << TIMx_SR_UIF))

ldr r1, [r0, #TIMx_SR]
bics r1, r2
beq 2f

str r2, [r0, #TIMx_SR]

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type EnableClockTIM1, %function
EnableClockTIM1:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_TIM1EN)
str r0, [r1, #RCC_APB2ENR] @ Set TIM1EN bit in RCC_APB2ENR to 1 to enable TIM1
bx lr @ Return to caller
.ltorg

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ No parameters; the timer settings are taken from TimerPrescaler and TimerValue
.type StartTIM1, %function
StartTIM1:
ldr r0, =TIM1
ldr r1, =(1 << TIMx_CR1_URS)
str r1, [r0, #TIMx_CR1]

ldr r1, =TimerPrescaler
str r1, [r0, #TIMx_PSC]

ldr r1, =TimerValue
str r1, [r0, #TIMx_ARR]

ldr r1, =(1 << TIMx_DIER_UIE)
str r1, [r0, #TIMx_DIER]

ldr r1, =(1 << TIMx_EGR_UG)
str r1, [r0, #TIMx_EGR]

dsb

ldr r1, =(1 << TIMx_CR1_CEN)
str r1, [r0, #TIMx_CR1]

bx lr
.ltorg

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>

The corresponding stm32f103.inc file with the added definitions for the timer registers is:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

RCC = 0x40021000
RCC_APB2ENR = 0x18
RCC_APB2ENR_IOPAEN = 2
RCC_APB2ENR_TIM1EN = 11

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

TIM1 = 0x40012C00

TIMx_CR1 = 0
TIMx_CR1_ARPE = 7
TIMx_CR1_URS = 2
TIMx_CR1_CEN = 0

TIMx_DIER = 0xC
TIMx_DIER_UIE = 0

TIMx_SR = 0x10
TIMx_SR_UIF = 0

TIMx_EGR = 0x14
TIMx_EGR_UG = 0

TIMx_PSC = 0x28
TIMx_ARR = 0x2C

TIM1_UP_IRQn = 25

NVIC_ISER0 = 0xE000E100
</source>

The source code enables the timer’s clock in the RCC before configuring it. The timer supports both a freely configurable prescaler for dividing the clock and a freely configurable maximum value, both of which are set by the StartTIM1 function. The TIMx_DIER_UIE bit is set to enable the interrupt for the so-called “update event”, which is triggered whenever the timer reaches the maximum value. A delicate sequence of register accesses is required to start the timer with the right configuration but without triggering the interrupt right away: To apply the modified settings immediately, the “TIMx_EGR_UG” bit is set to trigger an “artificial” update event. To prevent this from also triggering the interrupt, the “TIMx_CR1_URS” bit is set beforehand and cleared afterwards. The timer is started by setting the “TIMx_CR1_CEN” bit at the end. Before that, a “dsb” instruction is inserted. This “Data Synchronization Barrier” waits until all preceding write accesses have been completely processed, which is not guaranteed otherwise, since the processor’s pipeline usually works on several instructions at once. Because the timer configuration needs to be truly finished before starting the timer, this instruction is required. There are some other situations where the processor is too fast for the periphery and needs to be temporarily halted by a “dsb”. If some periphery-accessing code works in step-by-step mode while debugging, but not when executing normally, a well-placed “dsb” might help.

The ISR “TIM1_UP_IRQHandler” is used for the timer. It checks the “TIMx_SR_UIF” bit to verify an update event has actually happened. In that case, the register is overwritten with the value 0xFFFFFFFE, i.e. all bits are written with “1” except the UIF bit. Writing ones has no effect on the bits in this register, and writing a zero clears the respective bit. Therefore, this write access clears the UIF bit but keeps the others. These interrupt flags must always be cleared as soon as possible in the ISR, or the periphery might trigger the interrupt again immediately. The rest of the ISR stays the same.
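
The effect of writing 0xFFFFFFFE can be modelled on a PC with a few lines of C. This is only a sketch of the “writing zero clears, writing one has no effect” rule described above, not real hardware access. The function name is hypothetical.

<source lang=c>
#include <stdint.h>

/* Model of a flag register in which writing 0 clears a bit and
   writing 1 leaves it unchanged: the new value is the old value
   ANDed with the written value. */
static uint32_t flag_register_after_write(uint32_t current, uint32_t written)
{
    return current & written;
}
</source>

With the flags 0x5 pending (UIF plus one other flag), writing 0xFFFFFFFE leaves 0x4: only UIF is cleared. Reading the register, clearing the bit and writing the result back would instead also clear any flag that the hardware sets between the read and the write, which is why the constant is written directly.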
=== Analysis tools ===
When working on a low level directly with linker scripts and assembly code, it is frequently necessary to verify the translation output directly, as you can’t rely on a compiler doing it right automatically, and flashing the program each time to see whether it works isn’t the most efficient way. This was, in fact, important in creating the example code for this tutorial. The “binutils” package, of which the assembler and linker are part, offers a few tools that help with analyzing the assembler’s and linker’s output.
==== Disassembler ====
As the name implies, a disassembler is the opposite of an assembler - it turns binary machine code back into a (more or less) readable textual representation. If you feed an ELF file generated by the assembler or linker into the disassembler, it will read the header information to tell apart data (i.e. constants) and code, get the names of symbols (and therefore, labels) and can even tell which instructions were generated from which assembly source file, if the file was assembled with debug information (i.e. the “-g” flag was used). If you disassemble a binary flash image, the disassembler doesn’t have all that information, so it will produce much less readable output and attempt to decode data bytes as instructions.

The disassembler from binutils is called “objdump”. Invoking it on the blinker looks like this:
<pre>
$ arm-none-eabi-objdump -d -s prog1.elf

prog1.elf: file format elf32-littlearm

Contents of section .VectorTable:
8000000 00040020 ed000008 ed010008 ed010008 ... ............
8000010 ed010008 ed010008 ed010008 00000000 ................
8000020 00000000 00000000 00000000 ed010008 ................
8000030 ed010008 00000000 ed010008 49010008 ............I...
8000040 ed010008 ed010008 ed010008 ed010008 ................
8000050 ed010008 ed010008 ed010008 ed010008 ................
8000060 ed010008 ed010008 ed010008 ed010008 ................
8000070 ed010008 ed010008 ed010008 ed010008 ................
8000080 ed010008 ed010008 ed010008 ed010008 ................
8000090 ed010008 ed010008 ed010008 ed010008 ................
80000a0 ed010008 ed010008 ed010008 ed010008 ................
80000b0 ed010008 ed010008 ed010008 ed010008 ................
80000c0 ed010008 ed010008 ed010008 ed010008 ................
80000d0 ed010008 ed010008 ed010008 ed010008 ................
80000e0 ed010008 ed010008 ed010008 ............
Contents of section .text:
80000ec 0f481049 104a03e0 52f8043b 40f8043b .H.I.J..R..;@..;
80000fc 8842f9d3 0d480e49 4ff00002 01e040f8 .B...H.IO.....@.
800010c 042b8842 fbd300f0 47f800f0 4bf84ff4 .+.B....G...K.O.
800011c 80710848 01600848 00f058f8 30bffde7 .q.H.`.H..X.0...
800012c 00040020 04040020 f0010008 04040020 ... ... .......
800013c 08040020 10080140 60e31600 4ff0e020 ... ...@`...O..
800014c 006910f4 803f1dd0 1a480178 b1f1110f .i...?...H.x....
800015c 18d21948 0378013b 1cbf0370 12e001f1 ...H.x.;...p....
800016c 0101b1f1 110f02d2 144a535c 03701148 .........JS\.p.H
800017c 017011f0 01010cbf 4ff48071 4ff08071 .p......O..qO..q
800018c 0f480160 70470202 02020202 05020502 .H.`pG..........
800019c 05020202 02020200 0a490868 40f00400 .........I.h@...
80001ac 08607047 08490868 20f00f00 40f00200 .`pG.I.h ...@...
80001bc 08607047 04040020 00040020 92010008 .`pG... ... ....
80001cc 10080140 18100240 04080140 4ff0e021 ...@...@...@O..!
80001dc 48614ff0 00008861 4ff00700 08617047 HaO....aO....apG
80001ec 00befde7 ....
Contents of section .data:
20000400 02000000 ....
Contents of section .ARM.attributes:
0000 41200000 00616561 62690001 16000000 A ...aeabi......
0010 05436f72 7465782d 4d330006 0a074d09 .Cortex-M3....M.
0020 02 .
Contents of section .debug_line:
0000 98000000 02001e00 00000201 fb0e0d00 ................
0010 01010101 00000001 00000100 70726f67 ............prog
0020 312e5300 00000000 000502ec 00000803 1.S.............
0030 15012121 22212f2f 21222121 30212f21 ..!!"!//!"!!0!/!
0040 222f302f 21232130 21036120 2f2f362f "/0/!#!0!.a //6/
0050 030c2e32 030a2e2f 212f2222 222f2221 ...2.../!/"""/"!
0060 21222121 222f2f22 21212321 222f212f !"!!"//"!!#!"/!/
0070 30212303 0d9e2121 2f212421 212f2f21 0!#...!!/!$!!//!
0080 03422035 030c2e03 0d2e0311 2e36030b .B 5.........6..
0090 2e30212f 222f2202 01000101 3b000000 .0!/"/".....;...
00a0 02002400 00000201 fb0e0d00 01010101 ..$.............
00b0 00000001 00000100 76656374 6f727461 ........vectorta
00c0 626c652e 53000000 00000005 02ec0100 ble.S...........
00d0 0803d000 01210201 000101 .....!.....
Contents of section .debug_info:
0000 22000000 02000000 00000401 00000000 "...............
0010 ec000008 ec010008 00000000 08000000 ................
0020 12000000 01802200 00000200 14000000 ......".........
0030 04019c00 0000ec01 0008f001 00082100 ..............!.
0040 00000800 00001200 00000180 ............
Contents of section .debug_abbrev:
0000 01110010 06110112 01030e1b 0e250e13 .............%..
0010 05000000 01110010 06110112 01030e1b ................
0020 0e250e13 05000000 .%......
Contents of section .debug_aranges:
0000 1c000000 02000000 00000400 00000000 ................
0010 ec000008 00010000 00000000 00000000 ................
0020 1c000000 02002600 00000400 00000000 ......&.........
0030 ec010008 04000000 00000000 00000000 ................
Contents of section .debug_str:
0000 70726f67 312e5300 2f746d70 2f746573 prog1.S./tmp/tes
0010 7400474e 55204153 20322e32 392e3531 t.GNU AS 2.29.51
0020 00766563 746f7274 61626c65 2e5300 .vectortable.S.

Disassembly of section .text:

080000ec <Reset_Handler>:
80000ec: 480f ldr r0, [pc, #60] ; (800012c <Reset_Handler+0x40>)
80000ee: 4910 ldr r1, [pc, #64] ; (8000130 <Reset_Handler+0x44>)
80000f0: 4a10 ldr r2, [pc, #64] ; (8000134 <Reset_Handler+0x48>)
80000f2: e003 b.n 80000fc <Reset_Handler+0x10>
80000f4: f852 3b04 ldr.w r3, [r2], #4
80000f8: f840 3b04 str.w r3, [r0], #4
80000fc: 4288 cmp r0, r1
80000fe: d3f9 bcc.n 80000f4 <Reset_Handler+0x8>
8000100: 480d ldr r0, [pc, #52] ; (8000138 <Reset_Handler+0x4c>)
8000102: 490e ldr r1, [pc, #56] ; (800013c <Reset_Handler+0x50>)
8000104: f04f 0200 mov.w r2, #0
8000108: e001 b.n 800010e <Reset_Handler+0x22>
800010a: f840 2b04 str.w r2, [r0], #4
800010e: 4288 cmp r0, r1
8000110: d3fb bcc.n 800010a <Reset_Handler+0x1e>
8000112: f000 f847 bl 80001a4 <EnableClockGPIOA>
8000116: f000 f84b bl 80001b0 <ConfigurePA8>
800011a: f44f 7180 mov.w r1, #256 ; 0x100
800011e: 4808 ldr r0, [pc, #32] ; (8000140 <Reset_Handler+0x54>)
8000120: 6001 str r1, [r0, #0]
8000122: 4808 ldr r0, [pc, #32] ; (8000144 <Reset_Handler+0x58>)
8000124: f000 f858 bl 80001d8 <StartSysTick>
8000128: bf30 wfi
800012a: e7fd b.n 8000128 <Reset_Handler+0x3c>
800012c: 20000400 .word 0x20000400
8000130: 20000404 .word 0x20000404
8000134: 080001f0 .word 0x080001f0
8000138: 20000404 .word 0x20000404
800013c: 20000408 .word 0x20000408
8000140: 40010810 .word 0x40010810
8000144: 0016e360 .word 0x0016e360

08000148 <SysTick_Handler>:
8000148: f04f 20e0 mov.w r0, #3758153728 ; 0xe000e000
800014c: 6900 ldr r0, [r0, #16]
800014e: f410 3f80 tst.w r0, #65536 ; 0x10000
8000152: d01d beq.n 8000190 <SysTick_Handler+0x48>
8000154: 481a ldr r0, [pc, #104] ; (80001c0 <ConfigurePA8+0x10>)
8000156: 7801 ldrb r1, [r0, #0]
8000158: f1b1 0f11 cmp.w r1, #17
800015c: d218 bcs.n 8000190 <SysTick_Handler+0x48>
800015e: 4819 ldr r0, [pc, #100] ; (80001c4 <ConfigurePA8+0x14>)
8000160: 7803 ldrb r3, [r0, #0]
8000162: 3b01 subs r3, #1
8000164: bf1c itt ne
8000166: 7003 strbne r3, [r0, #0]
8000168: e012 bne.n 8000190 <SysTick_Handler+0x48>
800016a: f101 0101 add.w r1, r1, #1
800016e: f1b1 0f11 cmp.w r1, #17
8000172: d202 bcs.n 800017a <SysTick_Handler+0x32>
8000174: 4a14 ldr r2, [pc, #80] ; (80001c8 <ConfigurePA8+0x18>)
8000176: 5c53 ldrb r3, [r2, r1]
8000178: 7003 strb r3, [r0, #0]
800017a: 4811 ldr r0, [pc, #68] ; (80001c0 <ConfigurePA8+0x10>)
800017c: 7001 strb r1, [r0, #0]
800017e: f011 0101 ands.w r1, r1, #1
8000182: bf0c ite eq
8000184: f44f 7180 moveq.w r1, #256 ; 0x100
8000188: f04f 7180 movne.w r1, #16777216 ; 0x1000000
800018c: 480f ldr r0, [pc, #60] ; (80001cc <ConfigurePA8+0x1c>)
800018e: 6001 str r1, [r0, #0]
8000190: 4770 bx lr

08000192 <BlinkTable>:
8000192: 0202 0202 0202 0205 0205 0205 0202 0202 ................
80001a2: .

080001a3 <BlinkTableEnd>:
...

080001a4 <EnableClockGPIOA>:
80001a4: 490a ldr r1, [pc, #40] ; (80001d0 <ConfigurePA8+0x20>)
80001a6: 6808 ldr r0, [r1, #0]
80001a8: f040 0004 orr.w r0, r0, #4
80001ac: 6008 str r0, [r1, #0]
80001ae: 4770 bx lr

080001b0 <ConfigurePA8>:
80001b0: 4908 ldr r1, [pc, #32] ; (80001d4 <ConfigurePA8+0x24>)
80001b2: 6808 ldr r0, [r1, #0]
80001b4: f020 000f bic.w r0, r0, #15
80001b8: f040 0002 orr.w r0, r0, #2
80001bc: 6008 str r0, [r1, #0]
80001be: 4770 bx lr
80001c0: 20000404 .word 0x20000404
80001c4: 20000400 .word 0x20000400
80001c8: 08000192 .word 0x08000192
80001cc: 40010810 .word 0x40010810
80001d0: 40021018 .word 0x40021018
80001d4: 40010804 .word 0x40010804

080001d8 <StartSysTick>:
80001d8: f04f 21e0 mov.w r1, #3758153728 ; 0xe000e000
80001dc: 6148 str r0, [r1, #20]
80001de: f04f 0000 mov.w r0, #0
80001e2: 6188 str r0, [r1, #24]
80001e4: f04f 0007 mov.w r0, #7
80001e8: 6108 str r0, [r1, #16]
80001ea: 4770 bx lr

080001ec <Default_Handler>:
80001ec: be00 bkpt 0x0000
80001ee: e7fd b.n 80001ec <Default_Handler>
</pre>

This is a lot of information. The “-d” flag tells objdump to disassemble code sections, and the “-s” flag lets it output the raw contents of the sections. First, it prints the contents of “.VectorTable”. Each line is prefixed with the address at which the data is found in memory, followed by the 32-bit data words of the vector table. The disassembler prints the bytes in the order in which they appear in memory, which is reversed because the Cortex-M3 is little endian. For example, the printed “ed000008” actually refers to the address “0x080000ed”, which is the address of the “Reset_Handler” with the lowest bit set to one, as it is a thumb function. Most of the entries in the vector table hold the address of the default handler, 0x080001ec (again with the lowest bit set, hence “ed010008”), except for the zero entries and the SysTick_Handler. The contents of the “.text” section are the hexadecimal representation of the machine code, and hardly readable. The “.data” section contains a single “two” - this is the “02” put into “TimerEvents”. The contents of “.ARM.attributes” and the various “.debug” sections are not very interesting, as they do not end up on the controller and are only read by the various analysis tools to provide nicer output.
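
Decoding such a vector table entry by hand can be sketched in C as follows. The helper names are hypothetical. The code only illustrates the byte order and the thumb bit, using the two entries “ed000008” and “49010008” from the dump above.

<source lang=c>
#include <stdint.h>

/* Assemble a 32-bit word from four bytes in memory order (little endian):
   the first byte in memory is the least significant one. */
static uint32_t read_le32(const uint8_t bytes[4])
{
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}

/* Bit 0 of a vector table entry marks a thumb function;
   the actual code address is the entry with that bit cleared. */
static uint32_t handler_address(uint32_t vector_entry)
{
    return vector_entry & ~(uint32_t)1u;
}

/* Returns 1 if the entry has the thumb bit set, 0 otherwise. */
static int has_thumb_bit(uint32_t vector_entry)
{
    return (int)(vector_entry & 1u);
}
</source>

The bytes ed 00 00 08 yield the entry 0x080000ed and the code address 0x080000ec, which is where the disassembly lists “Reset_Handler”. Likewise, 49 01 00 08 leads to 0x08000148, the “SysTick_Handler”.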

After that comes the actual disassembly. This is a list of all the instructions in the code section. The list is grouped by the symbols found in the input file. For C code, each symbol usually matches one function, so each block in the disassembly represents one C function. In assembly code, if you put non-local labels into a function, that function will be split into multiple blocks by the disassembler, making it harder to read; this is the main reason for using local labels. Each instruction is translated into one line inside the blocks. The first column is the address where that instruction is found. The next column contains the hexadecimal representation of the 2 or 4 bytes that make up the machine code of that instruction, i.e. the actual content of flash memory. After that comes a textual representation of that instruction, as inferred by the disassembler. If the instruction contains a number, the disassembler sometimes outputs a semicolon followed by an interpretation of that number. If the instruction employs PC-relative addressing, that interpretation will be the absolute address. As many instructions have multiple spellings, there can be discrepancies between the original code and the disassembly. The disassembler will also output data, such as the “BlinkTable” and the literal pools, as such. Using the “.type” directive is helpful in that case so that the disassembler does not attempt to interpret the data bytes as code.

objdump can also be used to disassemble raw binary files, such as those obtained by reading back the flash memory of a controller. To do this, use this command line:

<source lang=sh>
$ arm-none-eabi-objdump -b binary -m arm -D prog1.bin -Mforce-thumb --adjust-vma=0x08000000
</source>

The address of the binary in flash memory is specified so that the printed instruction addresses are correct. However, as the disassembler can’t tell data and code apart, the result will be of limited use. If you have to analyze a binary without having an ELF file or the source code, a more sophisticated disassembler such as IDA Pro is helpful. If you have the code and only need the disassembler to identify potential problems with the project (esp. the linker script), objdump is usually sufficient.
==== readelf ====
The “readelf” program is a powerful utility that can read and output various information from ELF files. The most useful option is the “-S” flag, which lets readelf print a summary of the sections in the respective file, e.g.:

<pre>
$ arm-none-eabi-readelf -S prog1.elf
There are 15 section headers, starting at offset 0x11268:

Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .VectorTable PROGBITS 08000000 010000 0000ec 00 A 0 0 1
[ 2] .text PROGBITS 080000ec 0100ec 000104 00 AX 0 0 4
[ 3] .stack NOBITS 20000000 020000 000400 00 WA 0 0 1
[ 4] .data PROGBITS 20000400 010400 000004 00 WA 0 0 1
[ 5] .bss NOBITS 20000404 010404 000004 00 WA 0 0 1
[ 6] .ARM.attributes ARM_ATTRIBUTES 00000000 010404 000021 00 0 0 1
[ 7] .debug_line PROGBITS 00000000 010425 0000db 00 0 0 1
[ 8] .debug_info PROGBITS 00000000 010500 00004c 00 0 0 1
[ 9] .debug_abbrev PROGBITS 00000000 01054c 000028 00 0 0 1
[10] .debug_aranges PROGBITS 00000000 010578 000040 00 0 0 8
[11] .debug_str PROGBITS 00000000 0105b8 00002f 01 MS 0 0 1
[12] .symtab SYMTAB 00000000 0105e8 0006a0 10 13 45 4
[13] .strtab STRTAB 00000000 010c88 000550 00 0 0 1
[14] .shstrtab STRTAB 00000000 0111d8 000090 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
y (purecode), p (processor specific)
</pre>

For each section, one line is output. The sections “.strtab”, “.shstrtab”, “.symtab” and “NULL” are an integral part of ELF and always present. The “.debug” sections are present if the source was assembled with the “-g” flag. The “.ARM.attributes” section defines for which ARM processor the contained code was translated. These sections don’t end up on the microcontroller. The remaining sections were defined in the linker script: “.VectorTable” contains the addresses of the exception handlers, “.text” contains the program code and constant data for flash memory, “.stack” the stack in RAM, “.data” contains variables in RAM and “.bss” contains zero-initialized variables in RAM. For these sections, the column “Type” contains either “PROGBITS” or “NOBITS”, which tells you whether the section in the ELF file actually contains some data - this is only the case for “.VectorTable”, “.text” and “.data”. The sections “.bss” and “.stack” only reserve memory that is written at runtime, but the ELF file doesn’t contain data to be written to these sections. The column “Addr” defines where this section begins in the address space. The most useful column is “Size”: If you sum up the sizes of the sections “.VectorTable”, “.text” and “.data”, you obtain the amount of flash memory used. By summing up “.data”, “.stack” and “.bss”, you get the amount of RAM used. Note that “.data” is counted twice, as the initialization data is stored in flash.
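
As a worked example, the sums for the readelf output above can be written down in C. The struct and function names are hypothetical. The sizes are the values from the “Size” column.

<source lang=c>
#include <stdint.h>

/* Section sizes in bytes, as printed in the "Size" column of readelf -S */
struct section_sizes {
    uint32_t vector_table;
    uint32_t text;
    uint32_t data;
    uint32_t stack;
    uint32_t bss;
};

/* Flash holds the vector table, the code and the initial values of .data */
static uint32_t flash_usage(const struct section_sizes *s)
{
    return s->vector_table + s->text + s->data;
}

/* RAM holds .data (after copying), the stack and .bss */
static uint32_t ram_usage(const struct section_sizes *s)
{
    return s->data + s->stack + s->bss;
}
</source>

For prog1.elf, the sizes 0xec, 0x104 and 0x4 add up to 0x1f4 = 500 bytes of flash, and 0x4, 0x400 and 0x4 add up to 0x408 = 1032 bytes of RAM.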
==== nm ====
The “nm” utility prints the symbols defined in an ELF file, for example:

<pre>
$ arm-none-eabi-nm prog1.elf
080001ec W ADC1_2_IRQHandler
20000404 b BlinkStep
08000192 t BlinkTable
080001a3 t BlinkTableEnd
20000408 B _BssEnd
20000404 B _BssStart
…
</pre>

This can be helpful in analyzing errors in linker scripts where symbols might get assigned wrong addresses.
==== addr2line ====
The “addr2line” utility reads the debug information from an ELF file to determine which line in which source file produced the instruction found at a particular given address. For example:

<source lang=sh>
$ arm-none-eabi-addr2line 0x080000f0 -e prog1.elf
/tmp/test/prog1.S:24
</source>

Here, line 24 of “prog1.S” contains the assembler command that produced the instruction that ends up at address 0x080000f0.

==== objcopy ====
The “objcopy” utility allows you to translate program files between different formats. It is useful to convert the ELF files to both the Intel Hex format and a simple binary representation. For example,

<source lang=sh>
arm-none-eabi-objcopy -O ihex prog1.elf prog1.hex
</source>

produces a “.hex” file that contains an image of the flash contents in hexadecimal form. With

<source lang=sh>
arm-none-eabi-objcopy -O binary prog1.elf prog1.bin
</source>

a binary file is created which contains an exact 1:1 image of the flash contents. Some flashing tools require these formats instead of ELF, and viewing the binary file with a hex editor can be interesting as well.

=== Interfacing C and C++ code ===
Since assembly is rarely used to implement entire complex projects, but mostly for a few time-critical or especially low-level routines that are part of larger code bases written in a high-level language, interfacing C and assembly code is an important topic, which will be covered here. While it is possible to write the main project structure in assembly and integrate some C modules, it is usually done the other way round. Most of the code shown so far is already suitable for inclusion in C programs. Most of this topic works the same way for C++, apart from C++ exceptions (not to be confused with ARM processor exceptions) - but these are rarely used on embedded targets anyway.

If you compile C, C++ and assembly code into individual .o object files, you can link these together using “ld” as before. However, C and C++ code usually requires access to the respective standard library, and “ld” doesn’t link these by default - therefore it is necessary to replace the call to “ld” with a call to “gcc” or “g++” for C or C++, respectively. This will call “ld” internally and pass the required libraries.
==== Environment setup for C and C++ ====
Many C projects use a reset handler and vector table implemented in assembly, although writing them in C is possible too. As required by the C standard, C programs start with the “main()” function, so the (assembly) reset handler should set up the environment such that it is ready for C, and then call “main”. The C code might then later call some assembly functions or inline assembly. When using C++ code, or some GCC extensions for C code, it is required to call some additional functions before calling “main”. This is used by C++ to call the constructors of global objects. The C and C++ compilers emit a table of function pointers to functions that should be called at startup. This table has to be put into flash memory by modifying the linker script as follows:

<source>
.text : {
*(.text)
. = ALIGN(4);

_InitArrayStart = .;
*(SORT(.preinit_array*))
*(SORT(.init_array*))
_InitArrayEnd = .;
} >FLASH
</source>

The table of function pointers is sorted to keep the order needed by the compiler. The symbols “_InitArrayStart” and “_InitArrayEnd” mark beginning and end of that table. A reset handler that performs the memory initialization as before and calls the table of initialization functions could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

ldr r4, =_InitArrayStart
ldr r5, =_InitArrayEnd

b 2f
1: ldr r0, [r4], #4
blx r0
2: cmp r4, r5
blo 1b

bl main
1: bkpt
b 1b
.ltorg
</source>

Note that registers r4 and r5 are used for iterating the table, since the calling convention requires called functions to preserve them. The “blx” instruction is needed to perform the indirect function call. When everything is set up, the main function is called. For embedded programs, the main function should never return (i.e. it should contain an endless loop). If it does return, that’s an error, and to make it easier to find, an endless loop with a forced breakpoint is put right after the call to “main”.
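
For comparison, the third loop of the reset handler corresponds roughly to the following C code. This is a host-runnable sketch: on the target, the start and end pointers would be the linker symbols “_InitArrayStart” and “_InitArrayEnd”. The two functions “init_a” and “init_b” and the log array are hypothetical stand-ins for compiler-generated initialization functions.

<source lang=c>
/* Type of the entries in the table: pointers to functions
   without parameters and without return value */
typedef void (*init_func)(void);

/* Call every function in the table [start, end) in order,
   like the loop around "blx r0" in the reset handler */
static void run_init_array(const init_func *start, const init_func *end)
{
    for (const init_func *entry = start; entry < end; ++entry) {
        (*entry)();
    }
}

/* Hypothetical initialization functions that record the call order */
static int init_log[4];
static int init_count = 0;

static void init_a(void) { init_log[init_count++] = 1; }
static void init_b(void) { init_log[init_count++] = 2; }
</source>

Calling “run_init_array” on a table containing “init_a” and “init_b” runs both functions in table order, which is why the linker script sorts the input sections.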
==== Calling functions ====
To call assembly functions from C code and vice versa, the assembly functions should observe the calling convention, as mentioned before. C functions can be called just like assembly functions from assembly code, by placing the parameters in registers r0-r3 and on the stack, calling the function using “bl” and retrieving the return value from r0. To call an assembly function from C code, you need to declare it in C first, just like a C function. For example, to call a function that takes 2 integer arguments and returns an integer:

<source lang=c>
int AssemblyFunction (int a, int b);
</source>

If you now define a function named “AssemblyFunction” in your assembly code and export it via “.global”, you can call it from C code just like any function.
==== Accessing global variables ====
Global variables defined in C can be accessed from assembly code just like variables defined in assembly code, by using the variable’s name. To access an assembly variable from C code, you need to declare it first by specifying the type. For example, to declare an integer variable:

<source lang=c>
extern int AssemblyVariable;
</source>

If you now define a variable named “AssemblyVariable” in your assembly code and export it via “.global”, you can access it from C code just like any variable. The “extern” is required to make sure the C code only declares the variable and doesn’t attempt to define another variable of the same name.
=== Clock configuration ===
By default, STM32 controllers use an internal RC-oscillator with 8 MHz as a clock source for the core and periphery. This oscillator is too inaccurate for implementing a clock or using serial interfaces such as UART, USB or CAN. To obtain a more accurate clock, an external quartz crystal is usually applied. Many STM32 boards feature an 8 MHz crystal. To use it, some initialization code is required that activates the microcontroller’s built-in crystal-oscillator circuit and switches the clock input to that. The STM32 controllers also include a PLL which can multiply some input clock by a configurable factor before feeding it to the processor core and peripherals. This way, a precise and fast clock can be achieved - the STM32F103 supports up to 72 MHz core frequency. Unfortunately, flash memory is not capable of keeping up with such a high frequency. Therefore, when enabling a fast clock, the flash memory needs to be configured to use wait states depending on the frequency.

The following function configures the flash wait states, enables the crystal oscillator, configures the PLL to multiply the input clock by a factor of 9, and uses the PLL output as the system clock. The prescaler for the internal bus APB1 is set to 2. Assuming an 8 MHz crystal, this achieves the maximum performance possible with this microcontroller - 72 MHz for the core and APB2 domain, 36 MHz for APB1. If a different crystal is used, the PLL factors have to be adjusted.

<source lang=asm>
RCC = 0x40021000

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

.type ConfigureSysClock, %function
.global ConfigureSysClock
ConfigureSysClock:
@ Turn on HSE
ldr r0, =RCC
ldr r1, =((1 << RCC_CR_HSION)|(1 << RCC_CR_HSEON))
str r1, [r0, #RCC_CR]

@ Configure (but not start yet) PLL
@ Mul = 9, Prediv = 1, APB1 Prescaler = 2, APB2 Prescaler = 1, AHB Prescaler = 1
ldr r2, =(((9-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_USBPRE)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
orr r1, #(1 << RCC_CR_PLLON)

@ Wait for HSE ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_HSERDY)
beq 1b

@ Turn on PLL
str r1, [r0, #RCC_CR]

@ Pre-Calculate value for RCC_CFGR
orr r2, #(2 << RCC_CFGR_SW)

@ Wait for PLL ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_PLLRDY)
beq 1b

@ Set flash wait states to 2
ldr r0, =FLASH
ldr r3, =((1<<FLASH_ACR_PRFTBE)|(2<<FLASH_ACR_LATENCY))
str r3, [r0, #FLASH_ACR]
ldr r0, =RCC

@ Switch system clock to PLL
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
bic r1, #(1 << RCC_CR_HSION)

@ Wait for switch to PLL
1: ldr r3, [r0, #RCC_CFGR]
and r3, #(3 << RCC_CFGR_SWS)
cmp r3, #(2 << RCC_CFGR_SWS)
bne 1b

@ Turn off HSI to save power
str r1, [r0, #RCC_CR]

bx lr
.ltorg
</source>
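
To double-check the constants used in “ConfigureSysClock”, the value written to RCC_CFGR and the resulting frequencies can be computed on a PC. The following C sketch reuses the bit positions from the assembly code. The function and struct names are hypothetical, and the frequency calculation simply assumes that the PLL multiplies the crystal frequency and that APB1 divides the result.

<source lang=c>
#include <stdint.h>

/* Bit positions, identical to the assembly definitions */
enum {
    RCC_CFGR_USBPRE = 22,
    RCC_CFGR_PLLMUL = 18,
    RCC_CFGR_PLLSRC = 16,
    RCC_CFGR_PPRE1  = 8,
    RCC_CFGR_SW     = 0
};

/* Value written to RCC_CFGR before the PLL is started (register r2) */
static uint32_t cfgr_before_switch(void)
{
    return ((uint32_t)(9 - 2) << RCC_CFGR_PLLMUL)
         | ((uint32_t)1 << RCC_CFGR_USBPRE)
         | ((uint32_t)1 << RCC_CFGR_PLLSRC)
         | ((uint32_t)4 << RCC_CFGR_PPRE1);
}

/* Same value with the system clock switch set to "PLL" */
static uint32_t cfgr_after_switch(void)
{
    return cfgr_before_switch() | ((uint32_t)2 << RCC_CFGR_SW);
}

struct clock_tree {
    uint32_t sysclk_hz; /* core, AHB and APB2 (prescalers of 1) */
    uint32_t apb1_hz;
};

/* Frequencies for a given crystal, PLL factor and APB1 prescaler */
static struct clock_tree pll_clocks(uint32_t hse_hz, uint32_t pll_mul, uint32_t apb1_div)
{
    struct clock_tree result;
    result.sysclk_hz = hse_hz * pll_mul;
    result.apb1_hz = result.sysclk_hz / apb1_div;
    return result;
}
</source>

The first value is 0x005D0400 and the second 0x005D0402. For an 8 MHz crystal, a factor of 9 and an APB1 prescaler of 2, the function returns 72 MHz and 36 MHz. With a different crystal, the same helper shows quickly whether a chosen factor stays within the 72 MHz limit.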

Many projects perform the clock configuration by the reset handler before calling the main function. If you want to follow that practice, place a “bl ConfigureSysClock” as the first instruction in the “Reset_Handler” - this way, all the setup will run with the higher clock frequency, making start-up faster. This and the completed startup code from the previous chapters is implemented in the “startup.S” file in the example repository. If you use it, put your code in the “main” function, where RAM and system clock will already be initialized. This is shown in the “BlinkStartup” example.

=== Project template & makefile ===
To quickly start your own project, a project template is supplied in the examples repository under the directory [https://github.com/Erlkoenig90/ArmAsmTutorial/tree/master/ProjectTemplate-STM32F103RB ProjectTemplate-STM32F103RB]. Put your own application code in the program.S file. The startup.S and vectortable.S contain the reset handler with RAM initialization and the vector table with default handler, respectively. A linker script is included too.

The project also contains a makefile. This allows you to quickly translate your project without having to type the assembler and linker commands. Simply type
<source lang=sh>
make
</source>
to translate the code and produce the program.elf, program.bin and program.hex files. All “.S” files in the directory will be translated automatically. Writing makefiles is a complex topic of its own, with a lot of information already available on the web, so it is not explained further here.

ARM-ASM-Tutorial

2021-07-08T14:08:21Z

Nicolas:

[[Kategorie:ARM]][[Kategorie:STM32]][[Kategorie:Entwicklungstools]][[Kategorie:Programmiersprachen]]The [[ARM]] processor architecture is widely used in all kinds of industrial applications and also a significant number of hobby and maker projects. This tutorial aims to teach the fundamentals of programming ARM processors in assembly language.

Tutorial by [[Benutzer:Erlkoenig|Niklas Gürtler]]. [https://www.mikrocontroller.net/topic/482409 Thread in Forum] for feedback and questions.

== Introduction ==
=== Why assembly? ===
Today, there is actually little reason to use assembly language for entire projects, because high-quality optimizing compilers for high-level languages (especially C and C++) are readily available as free open source software and because the ARM architecture is specifically optimized for high-level languages. However, knowledge in assembly is still useful for debugging certain problems, writing low-level software such as bootloaders and operating system kernels, and reverse engineering software for which no source code is available. Occasionally it is necessary to manually optimize some performance-critical code section. Sometimes claims are made that ARM processors can’t be programmed in assembly. Therefore, this tutorial will show that this is very well possible by showing how to write entire (small) applications entirely in the ARM assembly language!

As most of the resources and tools for ARM focus on C programming and because of the complexity of the ARM ecosystem, the largest difficulty in getting started with ARM assembly is not the language itself, but rather using the tools correctly and finding relevant documentation. Therefore, this tutorial will focus on the development environment and how the written assembly code is transformed into the final program. With a good understanding of the environment, all the ARM instructions can be learned simply by reading the architecture documentation.

Because of the complex ecosystem around ARM, a general introduction of the ARM processor market is necessary.

=== About ARM ===
Arm Holdings is the company behind the ARM architecture. Arm does not manufacture any processors themselves, but designs the “blueprints” for processor cores, which are then licensed by various semiconductor companies such as ST, TI, NXP and many others, who combine the processor with various support hardware (most notably flash and RAM memories) and peripheral modules to produce a final complete processor IC. Some of these peripheral modules are even licensed from other companies – for example, the USB controller modules by Synopsys are found in many different processors from various manufacturers.

Because of this licensing model, ARM processor cores are found in a very large variety of products for which software can be developed using a single set of tools (especially compiler, assembler and debugger). This makes knowledge about the ARM architecture, particularly the ARM assembly language, useful for a large range of applications.

Since the ARM processor cores always require additional hardware modules to function, both the ARM-made processor core and the manufacturer-specific periphery modules have to be considered when developing software for ARM systems. For example, the instruction set is defined by ARM and software tools (compiler, assembler) need to be configured for the correct instruction set version, while the clock configuration is manufacturer-specific and needs to be addressed by initialization code specifically made for one processor.

=== Architecture and processor variants ===
A processor’s architecture defines the interface between hardware and software. Its most important part is the instruction set, but it also defines e.g. hardware behavior under exceptional circumstances (e.g. memory access errors, division by zero, etc.). Processor architectures evolve, so they have multiple versions and variants. They also define optional functionality that may or may not be present in a processor (e.g. a floating-point unit). For ARM, the architectures are documented exhaustively in the “ARM Architecture Reference Manuals”.

While the architecture is an abstract concept, a processor core is a concrete definition of a processor (e.g. as a silicon layout or HDL) that implements a certain architecture. Code that only uses knowledge of the architecture (e.g. an algorithm that does not access any periphery) will run on any processor implementing this architecture. Arm, as mentioned, designs processor cores for their own architectures, but some companies develop custom processors that conform to an ARM architecture, for example Apple and Qualcomm.

ARM architectures are numbered, starting with ARMv1 up until the most recent ARMv8. ARMv6 is the oldest architecture still in significant use, while ARMv7 is the most widespread one. Suffixes are appended to the version to denote variants of the architecture; e.g. ARMv7-M is for small embedded systems while ARMv7-A for more powerful processors. ARMv7E-M adds digital signal processing capabilities including saturating and SIMD operations.

Older ARM processors are named ARM1, ARM2 …, while after ARM11 the name “Cortex” was introduced. The Cortex-M family, including e.g. Cortex-M3 and Cortex-M4 (implementing ARMv7-M and ARMv7E-M architecture, respectively) is designed for microcontrollers, where power consumption, memory size, chip size and latency are important. The Cortex-A family, including e.g. Cortex-A8 and Cortex-A17 (both implementing ARMv7-A architecture) is intended for powerful processors (called “application processors”) for e.g. multimedia and communication products, particularly smartphones and tablets. These processors have much more processing power, typically feature high-bandwidth interfaces to the external world, and are designed to be used with high-level operating systems, most notably Linux (and Android).

An overview of ARM processors and their implemented architecture version can be found on [https://en.wikipedia.org/wiki/ARM_architecture#Cores Wikipedia].
This tutorial will focus on the Cortex-M microcontrollers, as these are much easier to program without an operating system and because assembly language is less relevant on Cortex-A processors. However, the large range of ARM-based devices necessitates flexibility in the architecture specification and software tools, which sometimes complicates their use.

There is actually not a single instruction set for ARM processors, but three:
* The “A32” instruction set for 32bit ARM architectures, also simply called “ARM” instruction set, favors speed over program memory consumption. All instructions are 4 bytes in size.
* The “A64” instruction set is for the new 64bit ARM processors.
* The “T32” instruction set for 32bit ARM architectures, also known as “Thumb”, favors program memory consumption over speed. Most instructions are 2 bytes in size, and some are 4 bytes.

The 64bit Cortex-A application processors support all three instruction sets, while the 32bit ones support only A32 and T32. The Cortex-M microcontrollers only support T32. Therefore, this tutorial will only talk about “Thumb-2”, the second version of the “T32” instruction set.
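
The mixed instruction sizes of T32 can be made visible with a small sketch. It assumes the GNU assembler in unified syntax (both are introduced further below), where the suffixes “.n” (narrow) and “.w” (wide) request the 16 bit or the 32 bit encoding of an instruction; without a suffix, the assembler chooses an encoding itself:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

add.n r0, r0, r1 @ 2 byte encoding: r0 = r0 + r1
add.w r0, r0, r1 @ 4 byte encoding of the same addition
</source>
Both lines perform the same operation; only the size of the generated machine code differs.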

== Prerequisites ==
First, suitable hardware and software need to be selected for demonstrating the usage of assembly language. For this tutorial, the choice of the specific microcontroller is of no great significance. However, to ensure that the example code is easily transferable to your setup, it is recommended to use the same components.

=== Microcontroller selection ===
For the microcontroller, an [https://www.st.com/en/microcontrollers-microprocessors/stm32f103c8.html STM32F103C8] or [https://www.st.com/en/microcontrollers-microprocessors/stm32f103rb.html STM32F103RB] by STMicroelectronics will be used. Both controllers are identical except for the flash size (64 KiB vs 128 KiB) and number of pins (48 vs 64). These controllers belong to ST’s “mainstream” entry-level family and are quite popular among hobbyist developers with many existing online resources. Several development boards with these controllers are available, for example: [https://www.st.com/en/evaluation-tools/nucleo-f103rb.html Nucleo-F103], “[https://os.mbed.com/users/hudakz/code/STM32F103C8T6_Hello/ Blue Pill]” (search for “stm32f103c8t6” on AliExpress, eBay or Amazon), [https://www.olimex.com/Products/Duino/STM32/OLIMEXINO-STM32/open-source-hardware Olimexino-STM32], [https://www.olimex.com/Products/ARM/ST/STM32-P103/ STM32-P103], [https://www.olimex.com/Products/ARM/ST/STM32-H103/ STM32-H103], [https://www.st.com/en/evaluation-tools/stm3210e-eval.html STM3210E-EVAL].

=== Processor type & documentation ===
First, the microcontroller manufacturer’s documentation is used to find out what kind of ARM processor core and architecture is used for the chosen chip. This information is used to find all the relevant documentation.

* The first source of information is the [https://www.st.com/resource/en/datasheet/stm32f103rb.pdf STM32F103RB/C8 datasheet]. According to the headline, this is a '''medium-density''' device. This term is ST-specific and denotes a product family with certain features. The very first paragraph states that this microcontroller uses a '''Cortex-M3''' processor core with 72 MHz. This document also contains the electrical characteristics and pinouts.
* The next important document is the [https://www.st.com/resource/en/reference_manual/cd00171190.pdf STM32F103 reference manual] that contains detailed descriptions of the periphery. Particularly, detailed information about periphery registers and bits can be found here.
* The [https://developer.arm.com/ip-products/processors/cortex-m/cortex-m3 ARM developer website] provides information about the Cortex-M3 processor core, particularly the [https://static.docs.arm.com/100165/0201/arm_cortexm3_processor_trm_100165_0201_01_en.pdf ARM Cortex-M3 Processor Technical Reference Manual]. According to chapter 1.5.3, this processor implements the '''ARMv7-M architecture'''.
* The architecture is documented in the [https://static.docs.arm.com/ddi0403/ed/DDI0403E_d_armv7m_arm.pdf ARMv7M Architecture Reference Manual]. Particularly, it contains the complete documentation of the instruction set.

For any serious STM32 development, you should be familiar with all these documents.

=== Debug adapter ===
There are many different ways of getting your program to run on an STM32 controller. A debug adapter is not only capable of writing software to the controller’s flash, but can also analyze the program’s behavior while it is running. This allows you to run the program one instruction at a time, analyze program flow and memory contents and find the cause of crashes. While it is not strictly necessary to use such a debugger, it can save a lot of time during development. Since entry-level models are available cheaply, not using one doesn’t even save money. Debuggers connect to a host PC via USB (some via Ethernet) and to the microcontroller (“target”) via JTAG or SWD. While these two interfaces are closely related and perform the same function, SWD uses fewer pins (2 instead of 4, excluding reset and ground). Most STM32 controllers support JTAG, and all support SWD.

Documenting all possible ways of flashing and debugging STM32 controllers is beyond the scope of this tutorial; a lot of information is already available online on that topic. Therefore, this tutorial will assume that the [https://www.st.com/en/development-tools/st-link-v2.html ST-Link] debug adapter by STMicroelectronics is used, which is cheap and popular among hobbyists. Some of the aforementioned boards even include an ST-Link adapter, which can also be used “stand-alone” to flash an externally connected microcontroller. The examples should work with other adapters as well; please consult the appropriate documentation on how to use them.

=== Development Software ===
On the software side, several tools are needed for developing microcontroller firmware. Using a complete Integrated Development Environment (IDE) saves time and simplifies repetitive steps but hides some important steps that are necessary to gain a basic understanding of the process. Therefore, this tutorial will show the usage of the basic command line tools to demonstrate the underlying principles. Of course, for production work, using an IDE is a sensible choice. The tools presented will work on Windows, Linux and Mac OS X (untested).

First, a text editor for writing assembly code is needed. Any good editor such as Notepad++, gedit or Kate is sufficient. When using Windows, the [https://www.st.com/en/development-tools/stsw-link004.html ST-Link Utility] can be useful, but is not strictly required.

Next, an assembler toolchain is needed to translate the written assembly code into machine code. For this, the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm GNU Arm Embedded Toolchain] is used. This is a collection of open source tools for writing software in Assembly, C and C++ for Cortex-M microcontrollers. Even though the package is maintained by ARM, the software is created by a community of open-source developers. For this tutorial, only the contained applications “binutils” (includes assembler & linker) and “GDB” (debugger) are really needed, but if you later decide to work with C or C++ code, the contained compilers will come in handy. Apart from that, this package is also shipped as part of several IDEs such as SW4STM32, Atollic TrueSTUDIO, emIDE, Embedded Studio and even Arduino – so if you (later) wish to work with one of these, your assembly code will be compatible with it.

Another component is required to talk with the debug adapter. For the ST-Link, this is done by [http://openocd.org/ OpenOCD], which communicates with the adapter via USB. Other adapters such as the J-Link ship with their own software.

Lastly, a calculator that supports binary and hexadecimal modes can be very helpful. Both the default Gnome calculator and the Windows calculator (calc.exe) are suitable.

== Setup ==
Follow the instructions in the next chapters to set up your development environment.

=== Hardware ===
The only thing that needs to be done hardware-wise is connecting the debugger with your microcontroller. If you are using a development board with an integrated debugger (such as the Nucleo-F103), this is achieved by setting the jumpers accordingly (see the board’s documentation – for e.g. the Nucleo-F103, both “CN2” jumpers need to be connected). When using an external debugger, connect the “GND”, “JTMS/SWDIO” and “JTCK/SWCLK” pins of debugger and microcontroller. Connect the debugger’s “nRESET” (or “nTRST” if it only has that) pin to the microcontroller’s “NRST” input.

If your board has jumpers or solder bridges for the “BOOT0” pin, make sure that the pin is low. Applying power to the microcontroller board is typically done via USB.

=== Software ===
==== Linux ====
Some Linux distributions ship with packages for the ARM toolchain. Unfortunately, these are often outdated and also configured slightly differently than the aforementioned package maintained by ARM. Therefore, to be consistent with the examples, it is strongly recommended to use the package by ARM.

Download the Linux binary tarball from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and extract it to some directory whose path does not contain any spaces. The extracted directory contains a subdirectory called “bin”. Copy the full path to that directory (e.g. “/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin”).

Add this path to the “PATH” environment variable. On Ubuntu/Debian systems, this can be done via:

<source lang=sh>
echo 'export PATH="${PATH}:/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin"' | sudo tee /etc/profile.d/gnu-arm-embedded.sh
</source>

OpenOCD can be installed via the package manager, e.g. (Ubuntu/Debian):
<source lang=sh>
sudo apt-get install openocd
</source>

After that, log out and back in (or just reboot). In a terminal, type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source lang=sh>
$ arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source lang=sh>
$ openocd -v
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

==== Windows ====
[[File:ArmAsmTutorial_GccOptions.png|300px|thumb|right|Options for installing GCC]]Download the Windows installer from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and run it. Enable the options “Add path to environment variable” and “Add registry information”, and disable “Show Readme” and “Launch gccvar.bat”.

A Windows package for OpenOCD can be obtained from the [https://github.com/gnu-mcu-eclipse/openocd/releases gnu-mcu-eclipse downloads page]. Download the appropriate file, e.g. “gnu-mcu-eclipse-openocd-0.10.0-12-20190422-2015-win64.zip”. The archive contains a path like “GNU MCU Eclipse/OpenOCD/0.10.0-12-20190422-2015”. Extract the contents of the inner directory (i.e. the subdirectories “bin”, “doc”, “scripts”…) into some directory whose path does not contain any spaces, e.g. “C:\OpenOCD”. You should now have a directory “C:\OpenOCD\bin” or similar. Copy its full path.

[[File:ArmAsmTutorial_PcProperties.png|300px|thumb|right|Opening PC properties]][[File:ArmAsmTutorial_SetEnvVar.png|300px|thumb|right|Setting environment variable]]Set the “Path” environment variable to include this path: Right-click on “This PC”, then “Properties” → “Advanced System Settings” → “Environment Variables”. In the lower list (labeled “System variables”), select “Path”. Click “Edit” → “New”, paste the path, and click “OK” multiple times.

Open a ''new'' command window (Windows Key + R, type “cmd” + Return). Type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source>
C:\>arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source>
C:\>openocd -v
GNU MCU Eclipse OpenOCD, 64-bitOpen On-Chip Debugger 0.10.0+dev-00593-g23ad80df4 (2019-04-22-20:25)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

== Writing assembly applications ==
The full source code of the examples in the following chapters can be found on [https://github.com/Erlkoenig90/ArmAsmTutorial GitHub]. The name of the corresponding directory is given after each example code below.
=== First rudimentary program ===
After the software setup, you can begin setting up a first project. Create an empty directory for that, e.g. “prog1”.

Inside the project directory, create your first assembly file “prog1.S” (“.S” being the file name extension for assembly files in GNU context) with the following content:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

nop @ Do Nothing
b . @ Endless loop
</source>
Example name: “EmptyProgram”

When this file is sent to the assembler, it will translate the instructions into binary machine code, with 2 or 4 bytes per instruction. These bytes are concatenated to form a program image, which is later written into the controller’s flash memory. Therefore, assembly code more or less directly describes flash memory contents.

The lines starting with a dot “.” are assembler directives that control the assembler’s operation. Only some of those directives emit bytes that will end up in flash memory. The @ symbol starts a comment.

The first line lets the assembler use the new “unified” instruction syntax (“UAL” - Unified Assembler Language) instead of the old ARM syntax. The second line declares the processor used, a Cortex-M3, which the assembler needs to know in order to recognize the instructions available on that processor. The third line instructs the assembler to use the Thumb (T32) instruction set. We can’t start putting instructions in flash memory right away, as the processor expects a certain data structure to reside at the very beginning of the memory. This is what the “.word” and “.space” directives create. These will be explained later.
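
As a sketch of how these lines map to flash memory contents, the same program is repeated here with the number of bytes emitted by each line. The offsets in the comments are relative to the start of the program image and simply follow from adding up the sizes:
<source lang=asm>
.syntax unified @ Emits no bytes
.cpu cortex-m3 @ Emits no bytes
.thumb @ Emits no bytes

.word 0x20000400 @ Offset 0x00: 4 bytes
.word 0x080000ed @ Offset 0x04: 4 bytes
.space 0xe4 @ Offset 0x08: 0xe4 = 228 bytes, filled with zeroes

nop @ Offset 0xec: 2 bytes
b . @ Offset 0xee: 2 bytes
</source>
Taken together, these lines describe 0xf0 = 240 bytes of flash memory contents.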

The first “real” instruction is “nop”, which will be the first instruction executed after the processor starts. “nop” is short for “No OPeration” - it causes the processor to do nothing and continue with the next instruction. This next instruction is “b .”. “b” is short for “branch” and instructs the processor to jump to a certain “target” location, i.e. execute the instruction at that target next. In assembly language, the dot “.” represents the current location in program memory. Therefore, “b .” instructs the processor to jump to this very instruction, i.e. execute it again and again in an endless loop. Such an endless loop is frequently found at the end of microcontroller programs, as it prevents the processor from executing random data that is located in flash memory after the program.
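
The same endless loop can also be written with a label instead of the dot. The following sketch assumes that the name “loop” is free to choose (it is a hypothetical name, not taken from the example project); a label simply gives a name to the address of whatever follows it:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

nop @ Do Nothing
loop: @ Names the address of the following instruction
b loop @ Same effect as "b ."
</source>
Both variants should translate to the same machine code, since “b loop” and “b .” refer to the same address.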

To translate this assembly code, open a terminal (Linux) / command window (Windows). Enter the project directory by typing <code>cd <Path to Project Directory></code>. Call the assembler like this:

<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
</source>

This instructs the assembler to translate the source file “prog1.S” into an object file “prog1.o”. This is an intermediary file that contains binary machine code, but is not a complete program yet. The “-g” option tells the assembler to include debug information, which does not influence the program itself, but makes debugging easier. To turn this object file into a final program, call the linker like this:

<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000
</source>

This creates a file “prog1.elf” that contains the whole generated program. The “-Ttext” option instructs the linker to assume 0x8000000 as the start address of the flash memory. The linker might output a warning like this:

<pre>arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000</pre>

This is not relevant for executing the program without an operating system and can be ignored.

=== Flashing the program ===
To download the compiled application to the microcontroller that has been attached via ST-Link, use OpenOCD like so:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg -c "program prog1.elf verify reset exit"
</source>
Unfortunately, the application does not do anything that can be observed from the outside, except perhaps increase the current consumption.
=== Starting the debugger ===
To check whether the program is actually running, start a debugging session to closely observe the processor’s behavior. First, run OpenOCD such that it acts as a GDB server:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg
</source>
Then, open a new terminal/command window and start a GDB session:
<source lang=sh>
arm-none-eabi-gdb prog1.elf
</source>
GDB provides its own interactive text-based user interface. First, type this command to let GDB connect to the already running OpenOCD instance:
<source>
target remote :3333
</source>
Then, stop the currently running program:
<source>
monitor reset halt
</source>
If this fails, hold your board’s reset button just before executing the command and repeat until it succeeds. GDB can also download code to flash memory by simply typing:
<source>
load
</source>
This will overwrite the previously flashed program (which, in this case, is identical anyway). After loading the program, reset the controller again:
<source>
monitor reset halt
</source>
Now, examine the contents of the CPU registers:
<source>
info reg
</source>
The output should look something like
<pre>
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
sp 0x0 0x0
lr 0x0 0
pc 0x8000000 0x8000000 <_stack+133693440>
xPSR 0x1000000 16777216
msp 0x20000400 0x20000400
psp 0x27e3fa34 0x27e3fa34
primask 0x0 0
basepri 0x0 0
faultmask 0x0 0
control 0x0 0
</pre>

At this point, the processor is ready to start executing your program. The processor is halted just before the first instruction, which is “nop”. You can let the processor execute one single instruction (i.e. the “nop”) by typing
<source>
stepi
</source>
If you type <code>info reg</code> again, you will see that PC is now “0x80000ee”, i.e. the processor is about to execute the next instruction, “b .”. When you run <code>stepi</code> again (repeatedly), nothing more will happen – the controller is stuck in the mentioned endless loop, exactly as intended. You can instruct the processor to run the program continuously, without stopping after each instruction, by typing
<source>
continue
</source>
You can interrupt the running program by pressing “Ctrl+C”. Run the commands
<source>
kill
quit
</source>
to exit GDB. You can terminate OpenOCD by pressing “Ctrl+C” in its terminal.
=== Using processor registers ===
The example program hasn’t done anything useful, but any “real” program will need to process some data. On ARM, any data processing is done via the processor registers. The 32bit ARM platforms have 16 processor registers, each of which is 32bit in size. The last three of those (r13-r15) have a special meaning and can only be used with certain restrictions. The first thirteen (r0-r12) can be used freely by the application code for data processing.

All calculations (e.g. addition, multiplication, logical and/or) need to be performed on those processor registers. To process data from memory, it first has to be loaded into a register, then processed, and stored back into memory. This is typical for RISC platforms and is known as a “load-store-architecture”.

As the starting point for any calculation, some specific values need to be put into the registers. The easiest way to do that is:
<source lang=asm>
ldr r0, =123456789
</source>
The number 123456789 will be encoded as part of the program, and the instruction lets the processor copy it into the register “r0”. Any number and any register in the range r0-r13 can be used instead.
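
A minimal sketch of this, based on the first example program with the “nop” replaced. The choice of registers and values is arbitrary; the GNU assembler accepts decimal, hexadecimal (“0x”) and binary (“0b”) notation:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =123456789 @ Decimal
ldr r1, =0x75BCD15 @ The same number in hexadecimal
ldr r2, =0b101 @ Binary notation for 5
ldr r12, =42 @ Any other general-purpose register works the same way
b .
</source>
After stepping through the four “ldr” instructions with “stepi”, “info reg” should show the same value in r0 and r1, 5 in r2 and 42 in r12.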

The instruction “mov” can be used to copy the contents from one register to another:
<source lang=asm>
mov r1, r0
</source>
This copies r0 to r1. Unlike some other processor architectures, “mov” can not be used to access memory, but only the processor registers.

In ARM, 32bit numbers are called "words" and are most frequently used. 16bit numbers are known as half-words, and 8bit numbers as bytes, as usual.
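
The three sizes also show up in the assembler directives that put raw data into the program image. As a sketch (only “.word” has been used in the examples so far; “.hword” and “.byte” are the corresponding GNU assembler directives for the smaller sizes, and the values are arbitrary):
<source lang=asm>
.word 0x12345678 @ Word: 32 bit = 4 bytes
.hword 0x1234 @ Half-word: 16 bit = 2 bytes
.byte 0x12 @ Byte: 8 bit = 1 byte
</source>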
=== Accessing periphery ===
To write microcontroller programs that interact with the outside world, access to the controller’s periphery modules is required. Interaction with periphery happens mainly through periphery registers (also known as “special function registers”, SFR). Despite their name, they work quite differently from processor registers. Instead of numbers, they have addresses (in the range of 0x40000000-0x50000000) that are not contiguous (i.e. there are gaps). They cannot be used directly for data processing but need to be explicitly read and written before and after any calculations. Not all of them are 32bit; many have only 16bit, and some of those bits may not exist and can’t be accessed. The microcontroller manufacturer’s documentation uses names for these registers, but the assembler doesn’t know these. Therefore, the assembly code needs to use the numerical addresses.

The easiest way to get the microcontroller to do something that produces some visible result is to send a signal via an output pin to turn on an LED. Using a pin to send/receive arbitrary software-defined signals is called “GPIO” (General Purpose Input/Output). First, choose a pin – for example, PA8 (this one is available on all package variants). Connect an LED to this pin and to GND (“active high”). Use a series resistor to limit the current to max. 15mA (the absolute maximum being 25mA), e.g. 100Ω for a 3.3V supply and a standard LED. For higher loads (e.g. high-power LEDs or a relay) use an appropriate transistor.

As with most microcontrollers, the pins are grouped into so-called “ports”, each of which has up to 16 pins. The ports are named by letters of the alphabet, i.e. “GPIOA”, “GPIOB”, “GPIOC” etc. The number of ports and pins varies among the individual microcontroller types. The 16 pins of one port can be read or written in one single step.

==== Clock Configuration ====
Many ARM controllers feature a certain trap: Most periphery modules are disabled by default to save power. The software has to explicitly enable the needed modules. On STM32 controllers, this is done via the “RCC” (Reset and Clock Control) module. Particularly, this module allows the software to disable/enable the clock signal for each periphery module. Because CMOS circuits (virtually all modern ICs) draw most of their power while switching, i.e. when a clock signal is applied, turning off the clock of unused modules can reduce the power usage considerably.

This is documented in the aforementioned reference manual in chapter 7. The subchapter 7.3.7 describes the periphery register “RCC_APB2ENR” which allows you to configure the clock signal for some peripheral modules. This register has 32 bits, of which 14 are “reserved”, i.e. can’t be used and should only be written with zeroes. Each of the available 18 bits enables one specific periphery module if set to “1” or disables it if set to “0”. According to the manual, the reset value of this register is 0, so all periphery modules are disabled by default. In order to turn on the GPIOA module to which the desired pin PA8 belongs, the bit “IOPAEN” needs to be set to “1”. This is bit number two in the register. Since registers can only be accessed as a whole (individual bits can’t be addressed), a 32bit-value where bit two is “1” and all others are kept as “0” needs to be written. This value is 0x00000004.

To write to the register, its address needs to be given in the code. The addresses of the periphery registers are grouped by the periphery modules they belong to - each periphery module (e.g. RCC, GPIOA, GPIOB, USB, …) has its own base address. The addresses of the individual registers are specified as an offset that needs to be added to this base address to obtain the full absolute address of the register. Chapter 7.3.7 specifies the offset address of RCC_APB2ENR as “0x18”. Chapter 3.3 specifies the base addresses of all periphery modules – RCC is given as “0x40021000”. So, the absolute address of RCC_APB2ENR is “0x40021000 + 0x18 = 0x40021018”.

In short: To enable GPIOA, the value '''0x00000004''' needs to be written to address '''0x40021018'''.

According to the “load-store” principle, ARM processors can’t do this in a single step. Both the value to be written and the address need to reside in processor registers in order to perform the write access. So, what needs to be done is:
* Load the value 0x00000004 into a register
* Load the value 0x40021018 into another register
* Store the value from the first register into the memory location specified by the second register.
This last step is performed by the “STR” instruction as follows:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
The square brackets are required but just serve as a reminder to the programmer that the contents of “r1” is used as an address. After the “str” instruction, the GPIOA periphery is enabled, but doesn’t do anything yet.
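
Instead of calculating the register address by hand, the addition can be left to the assembler. The following variant is a sketch that uses the “.equ” directive of the GNU assembler to define symbols; the symbol names are made up for this example and only mirror the names from the reference manual:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

@ Hypothetical symbol names; the numbers are those derived in the text
.equ RCC_BASE, 0x40021000 @ Base address of the RCC module
.equ RCC_APB2ENR, RCC_BASE + 0x18 @ Base address + register offset
.equ RCC_APB2ENR_IOPAEN, 1 << 2 @ Bit number two, i.e. 0x00000004

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =RCC_APB2ENR_IOPAEN
ldr r1, =RCC_APB2ENR
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
“.equ” only gives a name to a number and does not emit any bytes, so this should behave exactly like the version with literal numbers.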

==== GPIO Configuration ====
By default, all GPIO pins are configured as “input”, even if there is no software to process the input data. Since inputs are “high-impedance”, i.e. only a very small current can flow into/out of the pin, the risk of (accidental) short-circuits and damage to the microcontroller is minimized. However, this current is too small to light up an LED, so you have to configure the pin PA8 as “output”. The STM32 support multiple output modes, of which the right one for the LED is “General Purpose Output Push-Pull, 2 MHz”.

Access and configuration of GPIO pins is achieved via the registers of the GPIO periphery. The STM32 have multiple identical instances of GPIO modules, which are named GPIOA, GPIOB, … Each of those instances has a distinct base address, which are again described in chapter 3.3 of the reference manual (e.g. “0x40010800” for GPIOA, “0x40010C00” for GPIOB etc.). The registers of the GPIO module are described in chapter 9.2, and there is one instance of each register per GPIO module. To access a specific register of a specific GPIO module, the base address of that module needs to be added to the offset address of the register. For example, “GPIOA_IDR” has address “0x40010800 + 0x08 = 0x40010808”, while “GPIOB_ODR” has address “0x40010C00 + 0x0C = 0x40010C0C”.

Configuration of the individual GPIO pins happens through the “GPIOx_CRL” and “GPIOx_CRH” registers (“x” is a placeholder for the concrete GPIO module) – see chapters 9.2.1 and 9.2.2. Both registers are structured identically, where each pin uses 4 bits, so each of the two registers handles 8 pins in 8x4=32 bits. Pins 0-7 are configured by “GPIOx_CRL” and pins 8-15 by “GPIOx_CRH”. Pin 0 is configured by bits 0-3 of “GPIOx_CRL”, pin 1 by bits 4-7 of “GPIOx_CRL”, pin 8 by bits 0-3 of “GPIOx_CRH” and so on.

The 4 bits per pin are split into two 2-bit fields: “MODE” occupies bits 0-1, and “CNF” bits 2-3. “MODE” selects from input and output modes (with different speeds). In output mode, “CNF” determines whether the output value is configured from software (“General Purpose” mode) or driven by some other periphery module (“Alternate function” mode), and whether two transistors (“Push-pull”) or one (“open-drain”) are used to drive the output. In input mode, “CNF” selects from analog mode (for ADC), floating input and input with pull-up/down resistors (depending on the value in the “GPIOx_ODR” register).

Therefore, to configure pin PA8 into “General Purpose Output Push-Pull, 2 MHz” mode, bits 0-3 of “GPIOA_CRH” need to be set to value “2”. The default value of “4” configures the pin as “input”. To keep the other pins at their “input” configuration, the value “0x44444442” needs to be written to register “GPIOA_CRH”, which has address “0x40010804”:
<source lang=asm>
ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
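
The value “0x44444442” can also be derived by the assembler from the pin number. The following is a sketch under the assumption that a pin in the range 8-15 is used, so that its field lies in “GPIOx_CRH”; all symbol names are hypothetical. The field of pin n starts at bit (n-8)*4, so the default value is cleared in that field and the new mode is shifted into place:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.equ PIN, 8 @ Pin number within the port (8-15)
.equ SHIFT, (PIN - 8) * 4 @ Position of the pin's 4-bit field in CRH
.equ MODE_OUT_PP_2MHZ, 0x2 @ CNF = 0b00, MODE = 0b10
.equ CRH_DEFAULT, 0x44444444 @ All eight pins configured as input
.equ CRH_VALUE, (CRH_DEFAULT & ~(0xF << SHIFT)) | (MODE_OUT_PP_2MHZ << SHIFT)

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Enable the GPIOA clock as before

ldr r0, =CRH_VALUE @ 0x44444442 for PIN = 8
ldr r1, =0x40010804
str r0, [r1] @ Write the configuration to GPIOA_CRH
b .
</source>
Changing “PIN” to 9 would yield 0x44444424 instead, i.e. PA9 as output and all other pins as input.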

==== Writing GPIO pins ====
The GPIO pin still outputs the default value, which is 0 for “low”. To turn on the LED, the output has to be set to “1” for “high”. This is achieved via the GPIOA_ODR register, which has 16bits, one for each pin (see chapter 9.2.4). To enable the LED, set bit 8 to one:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x100
ldr r1, =0x4001080C
str r0, [r1] @ Set ODR8 in GPIOA_ODR to 1 to set PA8 high

b .
</source>
Example name: “SetPin”

This program enables the GPIOA periphery clock, configures PA8 as output, and sets it to high. If you run it on your microcontroller, you should see the LED turn on – the first program to have a visible effect!
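
As a variation, a second LED can be driven by the same port. The sketch below assumes an additional LED with series resistor on PA9, which is a hypothetical wiring that is not part of the setup described above. PA9 uses bits 4-7 of “GPIOA_CRH” and bit 9 of “GPIOA_ODR”, so only the two written values change:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r0, =0x44444422
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 and CNF9:MODE9 in GPIOA_CRH to 2

ldr r0, =0x300
ldr r1, =0x4001080C
str r0, [r1] @ Set ODR8 and ODR9 in GPIOA_ODR to 1 (0x100 + 0x200)

b .
</source>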

=== Data processing ===
ARM supports many instructions for mathematical operations. For example, addition can be performed as:
<source lang=asm>
ldr r0, =222
ldr r1, =111
add r2, r0, r1
</source>
This will first load the value 222 into register r0, load 111 into r1, and finally add r0 and r1 and store the result (i.e. 333) in r2. The operand for the result is (almost) always put on the left, while the input operand(s) follow on the right.

You can also overwrite an input register with the result:
<source lang=asm>
add r0, r0, r1
</source>
This will write the result to r0, overwriting the previous value. This is commonly shortened to
<source lang=asm>
add r0, r1
</source>
The output operand can be omitted, and the first input (here: r0) will be overwritten. This applies to most data processing instructions. Other frequently used data processing instructions that are used in a similar fashion are:
* '''sub''' for subtraction
* '''mul''' for multiplication
* '''and''' for bitwise and
* '''orr''' for bitwise or
* '''eor''' for bitwise exclusive or (“xor”)
* '''lsl''' for logical left shift
* '''lsr''' for logical right shift
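
The listed instructions can be tried out with a short sequence like the following. This is only a sketch: the input values 12, 5 and 2 and the choice of registers are arbitrary assumptions made for illustration, and the comments show the results to expect when inspecting the registers with GDB:
<source lang=asm>
ldr r0, =12 @ 12 is 0b1100
ldr r1, =5 @ 5 is 0b0101
ldr r2, =2 @ Shift distance
sub r3, r0, r1 @ r3 = 12 - 5 = 7
mul r4, r0, r1 @ r4 = 12 * 5 = 60
and r5, r0, r1 @ r5 = 0b0100 = 4
orr r6, r0, r1 @ r6 = 0b1101 = 13
eor r7, r0, r1 @ r7 = 0b1001 = 9
lsl r8, r0, r2 @ r8 = 12 shifted left by 2 bits = 48
lsr r9, r0, r2 @ r9 = 12 shifted right by 2 bits = 3
</source>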

Most of these instructions can not only take registers as input, but also immediate arguments. Such an argument is encoded directly into the instruction without needing to put it into a register first. Immediate arguments need to be prefixed by a hash sign #, and can be decimal, hexadecimal or binary. For example,
<source lang=asm>
add r0, r0, #23
</source>
adds 23 to the register r0 and stores the result in r0. This can again be shortened to
<source lang=asm>
add r0, #23
</source>

Such immediate arguments cannot be arbitrarily large, because they need to fit inside the instruction, which is 16 or 32 bits in size and also needs some room for the instruction and register numbers. So, if you want to add a large number, you have to use “ldr” first as shown to load it into a register.
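
As a small sketch of the difference, the following adds one small and one large constant to r0. The value 0x12345678 is just an arbitrary example of a number that does not fit into an immediate argument:
<source lang=asm>
add r0, r0, #23 @ Small constant: encoded directly in the instruction
ldr r1, =0x12345678 @ Too large for an immediate: load it into a register first
add r0, r0, r1 @ Then add the register
</source>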

Try out the above examples and use GDB to examine their behavior. Use GDB’s “info reg” command to display the register contents. Don't forget to execute both the “arm-none-eabi-as” and “arm-none-eabi-ld” commands to translate the program.

=== Reading periphery registers ===
The last example works, but has a flaw: Even though only a few bits per register need to be modified, the code overwrites all the bits in the register at once. The bits that should not be modified are just overwritten with their respective default value. If some of those bits had been changed before – for example to enable some other periphery module – these changes would be lost. Keeping track of the state of the register throughout the program is hardly practical. Since ARM does not permit modifying individual bits, the solution is to read the whole register, modify the bits as needed, and write the result back. This is called a “read-modify-write” cycle.

Reading registers is done via the “ldr” instruction. As with “str”, the address needs to be written into a processor register beforehand, and the instruction stores the read data into a processor register as well. Starting with the “RCC_APB2ENR” register, you can read it via:
<source lang=asm>
ldr r1, =0x40021018
ldr r0, [r1]
</source>
Even though the two “ldr” instructions look similar, they work differently – the first one loads a fixed value into a register (r1), while the second loads data from the periphery register into r0.

The loaded value should then be modified by setting bit two to “1”. This can be done with the “orr” instruction:
<source lang=asm>
orr r0, r0, #4
</source>
After that, we can store r0 as before.

With the GPIOA_CRH register, it’s slightly more complicated: The bits 0, 2 and 3 need to be cleared, while bit 1 needs to be set to 1. The other bits (4-31) need to keep their value. To clear the bits, use the “and” instruction after loading the current periphery register value:

<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>

For the “GPIOx_ODR” registers, such tricks are not needed, as there is a special “GPIOx_BSRR” register which simplifies writing individual bits: This register cannot be read, and writing zeroes to any bit has no effect on the GPIO state. However, if a 1 is written to any of the bits 0-15, the corresponding GPIO pin is set to high (i.e. the corresponding bit in ODR is set to 1). If a 1 is written to any of the bits 16-31, the corresponding pin is set to low. So, the pin can be set to 1 like this:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

So, the modified program is:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “SetPin2”
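
As a side note on the “GPIOx_BSRR” register: since writing zeroes has no effect, several pins can be changed with a single write. The following sketch assumes, purely for illustration, that pin PA9 has been configured as an output as well (the example programs only use PA8). It sets PA8 high and PA9 low at the same time:
<source lang=asm>
ldr r1, =0x40010810 @ Address of GPIOA_BSRR
ldr r0, =0x02000100 @ Bit 8 (BS8) and bit 25 (BR9)
str r0, [r1] @ Set PA8 high and PA9 low; all other pins keep their state
</source>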

=== Jump instructions ===
For a traditional “hello world” experience, the LED should not only light up, but blink, i.e. turn on and off repeatedly. Setting pin PA8 to low level can be achieved by writing a 1 to bit 24 in the “GPIOA_BSRR” register:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1]
</source>

Pasting this after the instructions for turning on the LED turns the LED on and then off again. To get the LED to blink, those two blocks need to be repeated endlessly, i.e. at the end of the code there needs to be an instruction for jumping back to the beginning.

A simple endless loop was already explained: The “b .” instruction, which just executes itself repeatedly. To have it jump somewhere else, the dot needs to be replaced by the desired target address, for example:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b 0x8000104
</source>
Example name: “Blink”

The address specified is an absolute address, which is the address of the “ldr” instruction at the beginning of the block for setting the pin to high. Actually, the branch instruction “b” is not capable of jumping directly to such an absolute address - again, because a 32 bit wide address can't be encoded in a 16/32 bit wide instruction. Instead, the assembler calculates the distance between the jump target and the location of the “b” instruction, and stores it into the instruction. When jumping backwards, this distance is negative.

When executing program code, the processor always stores the address of the currently executed instruction plus four in the r15 register, which is therefore also known as PC, the program counter. When encountering a “b” instruction, the processor adds the contained distance value to the PC value to calculate the absolute address of the jump target before jumping there.

This means that “b” performs a relative jump, and even if the whole machine code section were moved somewhere else in memory, the code would still work. However, the assembly language syntax does not really represent this, as the assembler expects absolute addresses which it then transforms into relative ones.
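
The offset of four in the PC value can be observed directly. As a minimal sketch, assuming the following instruction is placed somewhere in one of the example programs and single-stepped with GDB, r0 will afterwards contain the address of the “mov” instruction itself plus 4:
<source lang=asm>
mov r0, pc @ Copy the program counter: address of this instruction plus 4
</source>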

Specifying the target address directly as shown is very impractical, as it has to be calculated manually, and if the section of code is moved or modified, the address needs to be changed. To rectify this, the assembler supports labels: You can assign a name to a certain code location, and use this name to refer to the code location instead of specifying the address as a number. A label is defined by writing its name followed by a colon:

<source lang=asm>
BlinkLoop:

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b BlinkLoop
</source>
Example name: “Blink2”

This is purely a feature of the assembler – the generated machine code will be identical to the previous example. In “b BlinkLoop”, the assembler substitutes the label for the address it represents to calculate the relative jump distance. The assembler actually provides no way of directly specifying the relative offset that will be encoded in the instruction, but it can be done like this:

<source lang=asm>
b (.+4+42*2)
</source>

The resulting instruction will contain “42” as the jump offset. As suggested by the syntax, the processor multiplies this number by 2 (since instructions can only reside at even memory addresses, it would waste one bit of memory to specify the number directly) and adds to it the address of the “b” instruction plus 4. The assembly syntax is designed to represent the end result of the operation, so the assembler reverses the peculiar pre-calculations of the processor. If you want to do this calculation yourself, you have to again undo the assembler’s own calculation with the expression shown above. There is usually no reason to do that, though.

=== Counting Loops ===
The above example for a blinking LED does not really work yet – the LED blinks so fast the human eye can’t see it. The LED will just appear slightly dim. To achieve a proper blinking frequency, the code needs to be slowed down. The easiest way for that is to have the processor execute a large number of “dummy” instructions between setting the pin high and low. Simply placing many “nop” instructions isn’t possible though, as there is simply not enough program memory to store all of them. The solution is a loop that executes the same instructions a specific number of times (as opposed to the endless loops from the examples above). To do that, the processor has to count the number of loop iterations. It is actually easier to count ''down'' than up, so start by loading the desired number of iterations into a register and begin the loop by subtracting “1”:

<source lang=asm>
ldr r2, =1000000
subs r2, #1
</source>

Now, the processor should make a decision: If the register has reached zero, terminate the loop; else, continue by again subtracting “1”. The ARM math instructions can automatically perform some tests on the result to check whether it is positive/negative or zero and whether an overflow occurred. To enable those checks, append an “s” to the instruction name – hence, “subs” instead of “sub”. The result of these checks is automatically stored in the “Application Program Status Register” (APSR) – the contained bits N, Z, C, V indicate whether the result was negative, zero, set the carry bit or caused an overflow. This register is usually not accessed directly. Instead, use the conditional variant of the “b” instruction, where two letters are appended to indicate the desired condition. The jump is only performed if the condition is met; otherwise, the instruction does nothing. The available condition codes are described in the chapter “Condition Codes” of this tutorial. The conditions are formulated in terms of the mentioned bits of the APSR. For example, the “bne” instruction only performs a jump if the zero (Z) flag is ''not'' set, i.e. when the result of the last math instruction (with an “s” appended) was ''not'' zero. The “beq” instruction is the opposite of that – it only performs a jump if the result ''was'' zero.
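
To illustrate how the flags and a conditional jump interact, here is a small sketch. The label name “WasZero” and the values are arbitrary choices for illustration:
<source lang=asm>
ldr r0, =5
subs r0, #5 @ r0 = 5 - 5 = 0, so the Z flag is set
beq WasZero @ Z is set, so this jump is performed
ldr r1, =1 @ Skipped because of the jump
WasZero:
ldr r1, =2 @ Execution continues here
</source>
If the “subs” subtracted a different number such that the result was not zero, the “beq” would do nothing, and both “ldr” instructions would be executed in order.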

So, to perform the jump back to the beginning of the loop, add a label before the “subs” instruction, and put a “bne” instruction after the “subs” that jumps to this label if the counter has not reached zero yet:
<source lang=asm>
ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop
</source>

The actual loop consists only of the two instructions “subs” and “bne”. By placing two of those loops (with two different labels!) in between the blocks that turn the pins on and off, the blink frequency is lowered sufficiently such that it becomes visible:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

BlinkLoop:
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r2, =1000000
delay2:
subs r2, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay”

You might notice that the registers r0-r2 are loaded with the same values over and over again. To make the code both shorter and faster, take advantage of the available processor registers, and load the values that don’t change ''before'' the loop. Then, just use them inside the loop:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay2”
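
For comparison, this is a sketch of a delay loop that counts ''up'' instead of down. It relies on the “cmp” instruction, which has not been introduced in this tutorial so far: it subtracts its second operand from the first and sets the flags like “subs”, but discards the result. The label name “delayUp” is an arbitrary choice:
<source lang=asm>
ldr r2, =0 @ Start counting at zero
ldr r3, =1000000 @ Number of iterations
delayUp:
add r2, #1 @ Increment the counter
cmp r2, r3 @ Compare counter with limit; sets Z if both are equal
bne delayUp @ Iterate until r2 equals r3
</source>
The loop now needs an additional register and an additional instruction per iteration, which is why counting down is preferred here.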

=== Using RAM ===
Until now, all data in the example codes was stored in periphery or processor registers. In all but the most simple programs, larger amounts of data have to be processed for which the thirteen general-purpose processor registers aren’t enough. For this, the microcontroller features a block of SRAM that stores 20 KiB of data. Accessing data in RAM works similarly to accessing periphery registers – load the address in a processor register and use “ldr” and “str” to read and write the data. After reset, the RAM contains just random ones and zeroes, so before the first read access, some value has to be stored.

As the programmer decides what data to place where, they have to keep track of which address in memory contains what piece of data. You can use the assembler to help keep track by declaring what kind of memory blocks you need and giving them names. To do this, you must first tell the assembler that the next directives refer to data instead of instructions with the “.data” directive. Then, use the “.space” directive for each block of memory you need. To assign names to the blocks, place a label definition (using a colon) right ''before'' that. After the definitions, put a “.text” directive to make sure the instructions after that will properly go to program memory (flash):

<source lang=asm>
.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text
@ Instructions go here...
</source>

Here, a data block of 4 bytes is reserved and named “var1”. Another block of 1 byte is named “var2”. Note that just inserting these lines will not modify the assembler output – these are just instructions to the assembler itself. To access these memory blocks, you can use “var1” and “var2” just like literal addresses. Load them into registers and use these with “ldr” and “str” like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text

ldr r0, =var1 @ Get address of var1
ldr r1, =0x12345678
str r1, [r0] @ Store 0x12345678 into memory block “var1”

ldr r1, [r0] @ Read memory block “var1”
and r1, #0xFF @ Set bits 8..31 to zero
ldr r0, =var2 @ Get address of var2
strb r1, [r0] @ Store a single byte into var2

b .
</source>
Example name: “RAMVariables”

Note the use of “strb” - it works similarly to “str”, but only stores a single byte. Since the processor register r1 is of course 32 bits in size, only the lower 8 bits are stored, and the rest is ignored.
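
The counterpart for reading a single byte is “ldrb”. As a sketch, assuming these two lines are inserted right before the final “b .” of the “RAMVariables” example, r2 will contain 0x78 afterwards, i.e. the byte that was stored into “var2”:
<source lang=asm>
ldr r0, =var2 @ Get address of var2
ldrb r2, [r0] @ Load a single byte; bits 8..31 of r2 are set to zero
</source>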

There is still something missing – nowhere in the code is there any address of the RAM. To tell the linker where the RAM is located, pass the option <code>-Tdata=0x20000000</code> to the <code>arm-none-eabi-ld</code> call to tell the linker that this is the address of the first byte of RAM. This program can't be flashed directly with OpenOCD, as OpenOCD doesn't recognize the RAM as such; GDB has to be used as explained above. When a linker script is used as described in the next chapters (using the NOLOAD attribute), OpenOCD can again be used directly.

If you run this program via GDB, you can use the commands <code>x/1xw &var1</code> and <code>x/1xb &var2</code> to read the data stored in memory. After this quick introduction, a more abstract overview is in order.

== Memory Management ==
If there is one thing that sets higher and lower level programming languages apart, it’s probably memory management. Assembly programmers have to think about memory, addresses, layout of program and data structures all the time. Assembler and linker provide some help which needs to be used effectively. Therefore, this chapter will explain some more fundamentals of the ARM architecture and how the toolchain works.

=== Address space ===
In the examples so far, addresses were used for periphery register accesses and jump instructions without really explaining what they mean, so it’s time to catch up with that. To access periphery registers and memory locations in any memory type (RAM, Flash, EEPROM…), an address is required, which identifies the desired location. On most platforms, addresses are simply unsigned integers. The set of all possible addresses that can be accessed in a uniform way is called an “address space”. Some platforms such as AVR have multiple address spaces (for Flash, EEPROM, and RAM+periphery) where each memory needs to be accessed in a distinct way and the programmer needs to know which address space an address belongs to – e.g. all three memory types have a memory location with address 123.

However, the ARM architecture uses only a single large address space where addresses are 32bit unsigned integers in the range of 0-4294967295. Each address refers to one byte of 8 bits. The address space is divided into several smaller ranges, each of which refers to a specific type of memory. For the STM32F103, this is documented in the datasheet in chapter 4. All addresses in all memory types are accessed in the same way – directly via the “ldr” and “str” instructions, or by executing code from a certain location, which can be achieved by jumping to the respective address with the “b” instruction. This also makes it possible to execute from RAM – simply perform a jump to an address that refers to some code located in RAM. Note that there are large gaps between the individual ranges in address space; attempting to access those usually leads to a crash.
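
The uniform access can be sketched as follows, using the start addresses of flash memory and SRAM. With the example programs above, the first word in flash memory is the first entry of the vector table, so r1 should contain 0x20000400 afterwards, while the content of SRAM is random after reset:
<source lang=asm>
ldr r0, =0x08000000 @ Address of the first word in flash memory
ldr r1, [r0] @ Read from flash memory
ldr r0, =0x20000000 @ Address of the first word in SRAM
ldr r2, [r0] @ Read from SRAM using the very same instruction
</source>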

While the addresses of periphery are fixed and defined by the manufacturer, the layout of program code and data in memory can be set by the programmer rather freely. Up until now, the example programs defined the flash memory contents in a linear fashion by listing the instructions in the order they should appear in flash memory. However, when translating multiple assembly source files into one program, the order in which the contents from those files appear in the final program isn’t defined a priori. Also, even though in the last example the memory blocks for RAM were defined ''before'' the code, the code actually comes first in address space. What makes all this work is the Linker.

=== The Linker ===
Usually the last step in translating source code into a usable program, the linker is an often overlooked, sometimes misunderstood but important and useful tool, if applied correctly. Many introductions into programming forego explaining its workings in detail, but as any trade, embedded development requires mastery of the tools! A good understanding of the linker can save time solving strange errors and allow you to implement some less common use cases, such as using multiple RAM blocks present in some microcontrollers, executing code from RAM or defining complex memory layouts as sometimes required by RTOSes.

[[File:ArmAsmTutorial_Linking.svg|300px|thumb|right|Translation of native applications using assembler, compiler and linker]]You have already used a linker – the command <code>arm-none-eabi-ld</code> calls the GNU linker that is shipped with the GNU toolchain. Until now, only one assembly source file was translated for each program. To translate a larger program that consists of three assembly files “file1.S”, “file2.S” and “file3.S”, the assembler would be called three times to produce three object code files “file1.o”, “file2.o” and “file3.o”. The linker would then be called to combine all three into a single output file.

When translating any of these assembly files, the assembler does not know of the existence of the other files. Therefore, it can’t know whether the contents of any other file will end up in flash memory before the currently processed file, and also can’t know the final location in flash memory of the machine code it is emitting and placing in the object file (ending .o). This means that the object file does not contain any absolute addresses (except for those of periphery registers, as these were specified explicitly). For example, when loading the address of the RAM data blocks (“ldr r0, =var1”) the assembler doesn’t know the address, only the linker does. Therefore, the assembler puts a placeholder in the object file that will be overwritten by the linker. A jump (“b” instruction) to a label defined in another assembly file works similarly; the assembler uses a placeholder for the address. For the jump instructions we used inside the same file (e.g. “b BlinkLoop”), a placeholder is not necessary, as the assembler can calculate the distance between the label and the instruction and generate the relative jump itself. However, if the target resides within a different section (see below), this isn’t possible, and a placeholder becomes necessary. As the contents of object files have no fixed address and can be moved around by the linker, these files are called relocatable.

On Unix systems (including Linux), the Executable and Linkable Format (ELF) is used for both object files and executable program files. This format is also used by ARM, and the GNU ARM toolchain. Because it was originally intended to be used with operating systems, some of its concepts don’t perfectly map to the embedded use case. The object (.o) files created by the assembler, and also the final program created by the linker (usually no ending, but in embedded contexts and also in above example commands, .elf is used) are all in ELF format. The specification of ELF for ARM can be found [https://developer.arm.com/architectures/system-architectures/software-standards/abi here], and the generic specification for ELF on which the ARM ELF variant is based can be found [http://www.sco.com/developers/gabi/2003-12-17/contents.html here].

ELF files are structured into sections. Each section may contain code, data, debug information (used by GDB) and other things. In an object file, the sections have no fixed address. In the final program file, they have one. Sections also have various attributes that indicate whether their contents are executable code or data, whether they are read-only and whether memory should be allocated for them. The linker combines and reorders the sections from the object files (“input sections”) and places them into sections in the final program file (“output sections”) while assigning them absolute addresses.

Symbols are another important aspect. A symbol defines a name for an address. The address of a symbol may be defined as an absolute number (e.g. 0x08000130) or as an offset relative to the beginning of a section (e.g. “start address of section .text plus 0x130”). Labels defined in assembly source code define symbols in the resulting object file. For example, the “var1” label defined in the last example results in a symbol “var1” in the “prog1.o” file whose address is set to be equal to the beginning of “.data”. The symbol “var2” is defined similarly, but with an offset of 4. After the linking process, the “prog1.elf” file contains a “.data” section with absolute address 0x20000000, and so the “var1” and “var2” symbols get absolute addresses as well.

As mentioned, the assembler puts placeholders in the object files when it doesn’t know the address of something. In ELF files, these placeholders are called “relocation entries” and they reference symbols by name. When the linker sees such a relocation entry in one of its input files, it searches for a symbol in the input files with a matching name and fills in its address. If no symbol with that name was found, it emits this dreaded error:

<pre>(.text+0x132): undefined reference to `Foo'</pre>

Google finds almost a million results for that message, but knowing how the linker operates makes it easy to understand and solve – since the symbol was not found in any object file, make sure it is spelled correctly and that the object file that contains it is actually fed to the linker.
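
One more cause is worth knowing when working with the GNU assembler: labels are local to their object file by default, so a symbol that is correctly spelled and linked can still be reported as undefined. The following sketch shows two separate source files in one listing. The file names and the label “Foo” are hypothetical, and the “.global” directive used here has not been introduced in this tutorial so far:
<source lang=asm>
@ --- file1.S ---
.syntax unified
.cpu cortex-m3
.thumb

b Foo @ “Foo” is not defined in this file, so a relocation entry is emitted

@ --- file2.S ---
.syntax unified
.cpu cortex-m3
.thumb

.global Foo @ Make the symbol “Foo” visible to other object files
Foo:
b .
</source>
Without the “.global” line, the linker would not consider the symbol “Foo” from “file2.o” when resolving the relocation entry in “file1.o”.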

=== Linker Scripts ===
A linker script is a text file written in a linker-specific language that controls how the linker maps input sections to output sections. The example project hasn’t explicitly specified one yet, which lets the linker use a built-in default one. This has worked so far, but results in a slightly mixed up program file (unsuitable symbols) and has some other disadvantages. Therefore, it’s time to do things properly and write a linker script. Linker scripts aren’t usually created on a per-project basis, but provided by the microcontroller manufacturer to fit a certain controller’s memory layout. To learn how they work, a quick introduction into writing one will follow. The full documentation can be found [https://sourceware.org/binutils/docs/ld/Scripts.html here].

It’s customary to name the linker script after the controller it is intended for, so create a text file “stm32f103rb.ld” or “stm32f103c8.ld” with the following contents:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
*(.data)
} >SRAM
}
</source>
Example name: “LinkerScriptSimple”

This is the minimum viable linker script for a microcontroller. If you are using a STM32F103C8, replace the 128K by 64K. The lines inside the “MEMORY” block define the available memory regions on your microcontroller by specifying their start address and size within the address space. The names “FLASH” and “SRAM” can be chosen arbitrarily, as they have no special meaning. This memory definition has no meaning outside of the linker script, as it is just an internal helper for writing the script; it can even be left out and replaced by some manual address calculations.

The interesting part happens inside the “SECTIONS” command. Each sub-entry defines an output section that will end up in the final program file. These can be named arbitrarily, but the names “.text” and “.data” for executable code and data storage respectively are usually used. The asterisk expressions “*(.text)” and “*(.data)” tell the linker to put the contents of the input sections “.text” and “.data” at that place in the output section. In this case, the names for the input sections and output sections are identical. The input section names “.data”, “.text” (and some more) are used by the assembler and C and C++ compilers by default, so even though they can be changed, it’s best to keep them. You can however name the output sections arbitrarily, for example:

<source>
SECTIONS {
.FlashText : {
*(.text)
} >FLASH

.RamData (NOLOAD) : {
*(.data)
} >SRAM
}
</source>

The commands “>FLASH” and “>SRAM” tell the linker to calculate the address of the output sections according to the respective memory declaration above: The first output section with a “>FLASH” command will end up at address 0x8000000, the next with “>FLASH” right after that section and so on. The “>SRAM” works the same way with the start address “0x20000000”. The “NOLOAD” attribute does not change the linker’s behavior, but marks the corresponding output section as “not-loadable”, such that OpenOCD and GDB will not attempt to write it into RAM – the program has to take care of initializing any RAM data anyways when running stand-alone.

To specify the filename of the linker script, use the “-T” option:
<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -T stm32f103rb.ld
</source>
The <code>-Tdata</code> and <code>-Ttext</code> options aren’t needed anymore, as the addresses are now defined in the linker script.

Since the linker script defines the sizes of the memory regions, the linker can now warn you when your program consumes too much memory (either flash or RAM):
<pre>arm-none-eabi-ld: prog1.elf section `.text' will not fit in region `FLASH'
arm-none-eabi-ld: region `FLASH' overflowed by 69244 bytes</pre>

==== Reserving memory blocks ====
Using the processor’s stack will be explained later, but you can already use the linker script to assign a memory block for it. It’s best to allocate memory for the stack at the ''beginning'' of SRAM, so put this before the “*(.data)” command:

<source>. = . + 0x400;</source>

Inside a linker script, the dot “.” refers to the current address in the output file; therefore, this command increments the address by 0x400, leaving an “empty” block of that size. The “.data” input section will be located after that, at address 0x20000400.

==== Defining symbols in linker scripts ====

As mentioned before, the controller requires a certain data structure called the “vector table” to reside at the very beginning of flash memory. It is defined in the assembler source file:

<source lang=asm>
.word 0x20000400
.word 0x080000ed
.space 0xe4
</source>

The “.word” directive tells the assembler to output the given 32bit-number. Just like processor instructions, these numbers are put into the current section (.text by default, .data if specified) and therefore end up in flash memory. The first 32bit-number, which occupies the first 4 bytes in flash memory, is the initial value of the stack pointer which will be explained later. This number should be equal to the address of the first byte ''after'' the memory block that was reserved for the stack. The reserved block starts at address 0x20000000 and has size 0x400, so the correct number is 0x20000400. However, if the size of the reserved block was modified in the linker script, the above assembly line needs to be adjusted as well. To avoid any inconsistencies, and to be able to manage everything related to the memory-layout centrally in the linker script, it is desirable to replace the number in the assembly source file with a symbol expression. To do this, define a symbol in the linker script:

<source>
.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
</source>
Example name: “LinkerScriptSymbols”

This will define a symbol “_StackEnd” to have the value of “.”, which is the current address, which at this point is 0x20000400. In the assembly source file, you can now replace the number with the symbol:

<source lang=asm>
.word _StackEnd
</source>

The assembler will put a placeholder in the object file, which the linker will overwrite with the value of 0x20000400. This modification will not change the output file, but avoids putting absolute addresses in source files. The name “_StackEnd” was chosen arbitrarily; since names that start with an underscore and a capital letter may not be used in C and C++ programs, there is no possibility of conflict if any C/C++ source is added later. Typically, all symbols that are part of the runtime environment and should be “invisible” to C/C++ code are named this way. The same rule applies to names starting with two underscores.

The second entry of the vector table is the address of the very first instruction to be executed after reset. Currently the address is hard-coded as the first address after the vector table. If you wanted to insert some other code before this first instruction, this number would have to be changed. This is obviously impractical, and therefore the number should be replaced by a label as well. Since the code executed at reset is commonly known as the “reset handler”, define it like that:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word _StackEnd
.word Reset_Handler
.space 0xe4

.type Reset_Handler, %function
Reset_Handler:

@ Put code here
</source>

The “.type” directive tells the assembler that the label refers to executable code. The exact meaning of this will be covered later. Leave the “.space” directive alone for now.

==== Absolute section placement ====
The vector table needs to be at the beginning of flash memory, and the examples have relied on the assembler putting the first things from the source file into flash memory first. This stops working if you use multiple source files. You can use the linker script to make sure the vector table is always at the beginning of flash memory. To do that, you first have to separate the vector table from the rest of the code so that the linker can handle it specially. This is done by placing the vector table in its own section:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:
</source>
Example name: “LinkerScriptAbsolutePlacement”

The “.section” directive instructs the assembler to put the following data into the custom section “.VectorTable”. The “a” flag marks this section as allocatable, which is required to have the linker allocate memory for it. To place the vector table at the beginning of flash memory, define a new output section in the linker script:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

This puts the .VectorTable input section into the equally-named output section. It is also possible to put it into .text alongside the code:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.VectorTable)
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

Even though both variants produce the same flash image, the first one is slightly nicer to work with in GDB. The modified LED-blinker application now looks like:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>

=== Program Structure ===
Because the vector table is usually the same for all projects, it is handy to move it into a separate file, for example called “vectortable.S”:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Assemble both source files with one assembler command each, and then link the resulting object files:
<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
arm-none-eabi-as -g vectortable.S -o vectortable.o
arm-none-eabi-ld prog1.o vectortable.o -o prog1.elf -T stm32f103rb.ld
</source>

This will result in the dreaded “undefined reference” error, because the label “Reset_Handler” defined in “prog1.S” is not visible from “vectortable.S”. To fix this, use the “.global” directive in the main source file “prog1.S”:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
@ Code here ...
</source>

This will tell the assembler to make the symbol “Reset_Handler” visible globally, such that it can be used from other files. By default, the assembler creates a ''local'' symbol for each label, which can’t be used from other source files (same as ''static'' in C). The symbol is still there in the final program file, though - it can be used for debugging purposes.

== More assembly techniques ==
After having set up the project for using the linker properly, some more aspects of assembly programming will be introduced.

=== Instruction set state ===
As mentioned before, ARM application processors support both the T32 and A32/A64 “ARM” instruction sets, and are capable of dynamically switching between them. This can be used to encode time-critical program parts in the faster A32/64 instruction set, and less critical parts in the T32 “thumb” instruction set to save memory. Actually, reducing program size may improve performance too, because the cache memories may become more effective.

Even though the Cortex-M microcontrollers based on the ARMv7-M architecture do not support the A32/A64 instruction sets, some of the switching-logic is still there, requiring the program code to work accordingly. The switch between the instruction sets happens when jumping with the “bx” “Branch and Exchange” and “blx” “Branch with Link and Exchange” instructions. Since all instructions are of size 2 or 4, and code may only be stored at even addresses, the lowest bit of the address of any instruction is always zero. When performing a jump with “bx” or “blx”, the lowest bit of the target address is used to indicate the instruction set of the jump target: If the bit is “1”, the processor expects the code to be T32, else A32.

Another specialty of the “bx” and “blx” instructions is that they take the jump target address from a register instead of encoding it in the instruction directly. This is called an indirect jump. An example of such a jump is:

<source lang=asm>
ldr r0, =SomeLabel
bx r0
</source>

Such indirect jumps are necessary if the difference of the jump target address and the jump instruction is too large to be encoded in the instruction itself for a relative jump. Also, sometimes you want to jump to an address that has been passed from another part of the program, which e.g. happens in C/C++ code when using function pointers or virtual functions.

In these cases, you need to make sure that the address passed to “bx/blx” via a register has its lowest bit set, to indicate that the target code is T32. Otherwise, the code will crash. This can be achieved by telling the assembler that the target label refers to code (and not data) via the already mentioned “.type” directive:

<source lang=asm>
.type SomeLabel, %function
SomeLabel:
@ Some code...
</source>

That way, when you refer to the label to load its address into a register, the lowest bit will be set. Actually, using “.type” for all code labels is a good idea, even though it does not matter if you only refer to a label via the “b” instruction (including the conditional variant) which does not encode the lowest bit and does not attempt to perform an instruction set switch.
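
The effect of “.type” on stored addresses can be sketched with a small table of code addresses, as used for function pointers. This is only an illustration: the labels “ActionA”, “ActionB” and “ActionTable” are hypothetical and not part of the tutorial's examples.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
ldr r1, =ActionTable @ Address of the table itself (data, lowest bit clear)
ldr r0, [r1, #4] @ Load the second entry: address of ActionB with lowest bit set
bx r0 @ Indirect jump to ActionB, staying in T32 state

.type ActionA, %function
ActionA:
b .

.type ActionB, %function
ActionB:
b .

.balign 4
ActionTable:
.word ActionA @ Stored with the lowest bit set because of ".type"
.word ActionB
</source>

Without the two “.type” lines, the table entries would have the lowest bit cleared, and the “bx” would attempt a switch to A32.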

As was already shown, there is another case where the lowest bit matters: when specifying the address of the reset handler (and later, exception handler functions) in the vector table, the bit must be set, so the “.type” directive is necessary here too:

<source lang=asm>
.type Reset_Handler, %function
</source>

If you were writing code for a Cortex-A processor, you would use “.arm” instead of “.thumb” to have your code (or performance critical parts of it) encoded as A32. The “.type” directive would be used as well, and the assembler would clear the lowest bit in the address to ensure the code is executed as A32. For example:

<source lang=asm>
.cpu cortex-a8
.syntax unified

@ Small but slower code here
.thumb

.type Block1, %function
Block1:
ldr r0, =Block2
bx r0

@ Larger but faster code here
.arm

.type Block2, %function
Block2:
@ ...
</source>

The directive “.code 32” has the same meaning as “.arm”, and “.code 16” the same as “.thumb” (although the name is slightly misleading, as T32 instructions can be 32 bit as well). There is also “.type Label, %object” to declare some label refers to data in flash or RAM; this is optional, but helps in working with analysis tools (see below).
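
As a short sketch of the “%object” variant, the following marks a hypothetical table “LookupTable” as data. The additional “.size” directive is an assumption about what is convenient for analysis tools. It records the length of the object in the symbol table and is not required for the program to work.

<source lang=asm>
.balign 4
.type LookupTable, %object
LookupTable:
.word 10, 20, 30
.size LookupTable, . - LookupTable @ Length in bytes: current address minus start address
</source>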

=== Constants ===
The previous examples contain a lot of numbers (esp. addresses), the meaning of which is not obvious to the reader - so called “magic numbers”. As code is typically read many times more than written/modified, readability is important, even for assembly code. Therefore, it is common practice to define constants that assign names to numbers such as addresses, and use names instead of the number directly.

The assembler actually does not provide any dedicated mechanism for defining constants. Instead, symbols as introduced before are used. You can define a symbol in any of the following ways:
<source lang=asm>
RCC_APB2ENR = 0x40021018
.set GPIOA_CRH, 0x40010804
.equ GPIOA_ODR, 0x4001080C
</source>

and then use it in place of the number:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

Replacing (almost) all numbers in the source code for the LED blinker by constants yields a source code like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_10MHz = 1
GPIOx_CRx_GP_PP_2MHz = 2
GPIOx_CRx_GP_PP_50MHz = 3

GPIOx_CRx_GP_OD_10MHz = 1|4
GPIOx_CRx_GP_OD_2MHz = 2|4
GPIOx_CRx_GP_OD_50MHz = 3|4

GPIOx_CRx_AF_PP_10MHz = 1|8
GPIOx_CRx_AF_PP_2MHz = 2|8
GPIOx_CRx_AF_PP_50MHz = 3|8

GPIOx_CRx_AF_OD_10MHz = 1|4|8
GPIOx_CRx_AF_OD_2MHz = 2|4|8
GPIOx_CRx_AF_OD_50MHz = 3|4|8

GPIOx_CRx_IN_ANLG = 0
GPIOx_CRx_IN_FLOAT = 4
GPIOx_CRx_IN_PULL = 8

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkConstants”

This is much more readable than before. In fact, you could even leave out the comments, as the code becomes more self-documenting. The addresses of peripheral registers are defined individually, but the bits for the GPIO registers are the same for each GPIO module, so the names include an “x” to denote that they apply to all GPIO modules.

The “CRL”/“CRH” registers get a special treatment. Since the individual bits have little direct meaning, it would be pointless to name them. Instead, 15 symbols are defined to denote the 15 possible modes of operation per pin (combinations of input/output, open-drain vs. push-pull, analog vs. digital, floating vs. pull-resistors, and output driver slew rate). Each of the 15 symbols has a 4 bit value that needs to be written into the appropriate 4 bits of the register. To configure e.g. PA10 as General Purpose Open-Drain with 10 MHz slew rate:

<source lang=asm>
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffff0ff
orr r0, #(GPIOx_CRx_GP_OD_10MHz<<8)
str r0, [r1]
</source>

C-like arithmetic operators can be used in constant expressions, like + - * / and bitwise operators like | (or), & (and), << (left shift) and >> (right shift). Note that these calculations are always done by the assembler while translating the source file, so they cost no processor time when the program runs. In the example, or | is used to combine bit values.
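
Constant expressions can also compute the shift amount instead of hard-coding it. The following is a sketch of the PA10 configuration from above under some assumptions: the symbols “PinInCRH” and “FieldShift” are hypothetical helpers, and “bic” (bit clear) is used in place of “and” with an inverted mask.

<source lang=asm>
GPIOA_CRH = 0x40010804
GPIOx_CRx_GP_OD_10MHz = 1|4

PinInCRH = 10 - 8 @ Hypothetical: CRH covers pins 8-15, so PA10 is entry 2
FieldShift = PinInCRH * 4 @ Each pin occupies 4 bits, so the shift is 8

ldr r1, =GPIOA_CRH
ldr r0, [r1]
bic r0, #(0xF << FieldShift) @ Clear the 4 configuration bits of PA10
orr r0, #(GPIOx_CRx_GP_OD_10MHz << FieldShift)
str r0, [r1]
</source>

To configure a different pin of the same register, only “PinInCRH” has to be changed.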

Since these constants are actually symbols, they can collide with assembler labels, so you must not define a symbol with the same name as any label.

A different kind of constants are register aliases. Using the “.req” directive, you can define a name for a processor register:

<source lang=asm>
MyData .req r7
ldr MyData, =123
add MyData, #3
</source>

This can be useful for large assembly blocks where the meaning of register data is not obvious. It also allows you to re-assign registers without having to modify many lines of code.
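
A slightly longer sketch, using the hypothetical alias “Counter” for a loop counter. The “.unreq” directive removes the alias again, so that the name can be assigned to a different register later:

<source lang=asm>
Counter .req r4 @ "Counter" is now another name for r4
ldr Counter, =10
CountLoop:
subs Counter, #1
bne CountLoop @ Repeat until Counter reaches zero
.unreq Counter @ Remove the alias again
</source>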
=== The Stack ===
In computer science, a stack is a dynamic data structure where data can be added and removed flexibly. Like a stack of books, the last element that was put on top must be taken and removed first (LIFO-structure - Last In, First Out). Adding an element is usually called “push”, and reading & removing “pop”.

Many processor architectures including ARM feature circuitry to deal with such a structure efficiently. Like most others, ARM does not provide a dedicated memory area for this - it just facilitates using an area that the programmer reserved for this purpose as a stack. Therefore, a part of the SRAM needs to be reserved for the stack.

On ARM, the program stores processor registers on the stack, i.e. 32bit per element. The stack is commonly used when the contents of some register will be needed again later after it has been overwritten by some complex operation that needs many registers. These accesses always come in pairs:

* Some operation that writes to r0
* '''Push''' (save) r0 to the stack
* Some operation that overwrites r0
* '''Pop''' (restore) r0 from the stack
* Use the value in r0 which is the same as initially assigned

ARM’s instructions for accessing the stack are unsurprisingly called “push” and “pop”. They can save/restore any of the registers r0-r12 and r14, for example:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ … Some code that overwrites r0 …

pop { r0 } @ Restore value 1000000
@ Continue using r0 ...
</source>

It is also possible to save/restore multiple registers in one go:

<source lang=asm>
ldr r0, =1000000
ldr r1, =1234567
@ Use r0 and r1 ...
push { r0, r1 } @ Save values 1000000 and 1234567

@ … Some code that overwrites r0 and r1 …

pop { r0, r2 } @ Restore 1000000 into r0 and 1234567 into r2
@ Continue using r0 and r2...
</source>

It does not matter to which register the data is read back - in the previous example, the value that was held in r1 is restored into r2. In larger applications, many store-restore pairs will be nested:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ Inner Code Block:

ldr r0, =123
@ Use r0 …

push { r0 } @ Save value 123

@ Inner-Inner Code Block that overwrites r0

pop { r0 } @ Restore value 123
@ Continue using r0 ...

pop { r0 } @ Restore value 1000000 into r0

@ Continue using r0 …
</source>

The “inner” push-pop pair works with value 123, and the “outer” push-pop pair works with value 1000000. Assuming that the stack was empty at the beginning, it will contain 1000000 after the first “push”, and both 1000000 and 123 after the second push. After the first “pop” it contains only 1000000 again, and is empty after the second “pop”.

At the beginning of a push-pop pair, the current contents of the stack are irrelevant - it may be empty or contain many elements. After the “pop”, the stack will be restored to its previous state. This makes it possible to (almost) arbitrarily nest push-pop-pairs - after any inner push-pop-pair has completed, the stack is in the same state as before entering the inner pair, so the “pop” part of the outer pair doesn’t even notice the stack was manipulated in between. This is why it is important to make sure that each “push” has a matching “pop”, and vice-versa.

As mentioned, an area of memory has to be reserved for the stack. Access to the stack memory is managed via the stack pointer (SP). The stack pointer resides in the processor register r13, and “sp” is an alias for that. As the name implies, the stack pointer contains a 32bit memory address - specifically, the address of the first byte in the stack that contains any saved data.

When storing a 32bit register value using “push”, the stack pointer is '''first''' decremented by 4 before the value is written at the newly calculated address. To restore a value using “pop”, the value is read from the memory address currently stored in the stack pointer, after which the stack pointer is incremented by 4. This is called a “full-descending” stack (see the ARM Architecture Reference Manual, chapter B1.5.6). On ARMv7-A (Cortex-A), this behaviour can be changed, but on ARMv7-M, it is dictated by the exception handling logic, which will be explained later.
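
To make the order of operations explicit, the following sketch spells out a single-register “push” and “pop” using ordinary store and load instructions with address writeback. It is meant as an illustration of the behaviour described above, not as a replacement for “push” and “pop”:

<source lang=asm>
ldr r0, =1000000
str r0, [sp, #-4]! @ Like "push { r0 }": decrement SP by 4, then store r0 at the new address

ldr r0, =0 @ Overwrite r0

ldr r0, [sp], #4 @ Like "pop { r0 }": load r0 from the address in SP, then increment SP by 4
</source>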

An implication of this is that if the stack is empty, the stack pointer contains the address of the first byte '''after''' the stack memory area. If the stack is completely full, it contains the address of the very first byte '''inside''' the stack memory area. This means that the stack grows '''downward'''. Since the stack is empty at program start, the stack pointer therefore needs to be initialized to the first address after the memory area. Before executing the first instruction, the processor loads the first 4 bytes from the flash into the stack pointer. This is why “_StackEnd” was defined and used to place the address of the first byte after the stack memory region into the first 4 bytes of flash.

The stack pointer must always be a multiple of 4 (see chapter B5.1.3 in the ARM Architecture Reference Manual). It is a common error (which is even present in the example projects by ST!) to initialize the stack pointer to the last address ''inside'' the stack memory area (e.g. 0x200003FF instead of 0x20000400), which is not divisible by four. This can cause the application to crash or “just” slow it down. Actually, the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4127.html ARM ABI requires] the stack pointer to be a multiple of 8 for public software interfaces, which is important for e.g. the “printf” C function. So, when calling any external code, make sure the stack pointer is a multiple of 8.
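
One possible way to keep the stack pointer a multiple of 8 is sketched below. It assumes that the stack pointer was a multiple of 8 before the “push”. The padding with “sub”/“add” is just one option; saving an even number of registers achieves the same.

<source lang=asm>
push { r0 } @ SP is decremented by 4: still a multiple of 4, but no longer of 8
sub sp, #4 @ Reserve 4 unused bytes, so SP is a multiple of 8 again

@ ... Call external code here ...

add sp, #4 @ Release the unused bytes
pop { r0 }
</source>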

In the previous examples, the stack memory area was defined with a size of 0x400, i.e. 1KiB. Choosing an appropriate stack size is critical for an application; if it is too small, the application will crash, if it is too large, memory is wasted that could be used otherwise. Traditionally, the stack is configured to reside at the ''end'' of available memory, e.g. 0x20005000 for the STM32F103. As the linker starts allocating memory for data (using “.data” in assembly or global/static variables in C) at the beginning of the memory, the stack is as far away from that regular data as possible, minimizing the chance of a collision. However, if the stack grows continuously, the stack pointer might end up pointing into the regular data area (“.data” or C globals) or heap memory (used by “malloc” in C). In that case, writing to the stack silently overwrites some of the regular data. This can result in all kinds of hard to find errors. Therefore, the examples in this tutorial put the stack area at the ''beginning'' of RAM, and the regular data after that - if the stack grows too large, the stack pointer will reach values below 0x20000000, and any access will result in an immediate “clean” crash. It is then usually easy to find the code location that allocates too much stack memory, and to increase the stack size if necessary. Using the Cortex-M3’s memory protection unit (MPU) enables even more sophisticated strategies, but that is out of scope for this tutorial.
=== Function calls ===
Many programming languages feature a “function” concept. Also known as “procedures” or “subprograms”, functions are the most basic building blocks of larger applications, and applying them correctly is key for clean, reusable code. The assembler does not know about functions directly, so you have to build them yourself. A function is a block of code (i.e. a sequence of instructions) that you can jump to, that does some work, and that then jumps back to the place from which the first jump originated. This ability to jump back is the main difference from any other block of assembly code. To make this explicit, such a jump to a function is known as a “call” (as in “calling a function”). The location in code that starts the jump to the function is known as the “caller”, and the called function as “callee”. From the perspective of the caller, calling a function resembles a “user-defined” instruction - it performs some operation after which the code of the caller continues as before. To make the jump back possible, the address of the ''next'' instruction after the one that started the function call needs to be saved, so that the function can jump back to that location (without calling the function directly again).

This is done via the Link Register (LR), which is the processor register r14. Function calls are performed with the “bl” instruction. This instruction performs a jump, much like the well-known “b”, but also saves the address of the next instruction in LR. When the function is finished, it returns to the caller by jumping to the address stored in LR. As already mentioned, jumping to a location from a register is called an indirect jump, which is performed by the “bx” instruction. So, to return from a function, use “bx lr”:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

bl EnableClockGPIOA @ Call function to enable GPIOA's peripheral clock

@ Some more code ...
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1]

b . @ Endless loop, so that execution does not fall through into the function below

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

</source>

Here, the code to enable the clock for GPIOA was packaged into a function. To enable this clock, only a single line is now required - “bl EnableClockGPIOA”.

When calling a function, the “bl” instruction automatically makes sure to set the lowest bit in LR such that the subsequent “bx lr” will not crash because of an attempted instruction set switch, which is not possible on Cortex-M. If you need to call a function indirectly, use “blx” with a register, and remember to ensure that the lowest bit is set, typically via “.type YourFunction, %function”. Usually, all the code of an application resides within functions, with the possible exception of the Reset_Handler. The order in which functions are defined in the source files does not matter, as the linker will always automatically fill in the correct addresses. If you want to put functions in separate source files, remember to use “.global FunctionName” to make sure the symbol is visible to other files.
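
Splitting functions across files could look like the following sketch. The file name “gpio.S” is hypothetical, and both files are shown in one block for brevity. The assembler treats the undefined symbol “EnableClockGPIOA” in “prog1.S” as external and leaves it to the linker to fill in the address.

<source lang=asm>
@ ---- File "gpio.S" (hypothetical name) ----
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4

.text
.type EnableClockGPIOA, %function
.global EnableClockGPIOA @ Make the function visible to other files
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1]
bx lr

@ ---- File "prog1.S" ----
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA @ Defined in the other file, resolved by the linker
b .
</source>

Note that the constants are defined in “gpio.S”, because symbols defined with “=” are local to their source file as well.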
==== Using the stack for functions ====
In large applications it is common for functions to call other functions in a deeply nested fashion. However, a function implemented as shown can’t do that - using “bl” would overwrite the LR, and so the return address of the outer function would be lost, and that function couldn’t ever return. The solution is to use the stack: At the beginning of a function that calls other functions, use “push” to save the LR, and at the end use “pop” to restore it. For example, the blink program could be restructured like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r5, =5 @ Number of LED flashes.
bl Blink
b .

.type Blink, %function
Blink:
push { lr }
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

bl Delay

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

bl Delay

subs r5, #1
bne BlinkLoop

pop { lr }
bx lr

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

.type Delay, %function
Delay:
mov r4, r3
DelayLoop:
subs r4, #1
bne DelayLoop @ Iterate delay loop
bx lr

</source>
Example name: “BlinkFunctions”

The Reset_Handler just became much prettier. There now are functions for enabling the GPIOA clock, configuring PA8 as output, and one that delays execution so that the LED blinking is visible. The “Blink” function performs the blinking, but only for 5 flashes, after which it returns (an endless blink-loop wouldn’t be good for demonstrating returns). As you see, LR is saved on the stack to allow “Blink” to call further functions.

The two lines

<source lang=asm>
pop { lr }
bx lr
</source>

are longer than necessary. It is possible to directly load the return address from the stack into the program counter, PC:

<source lang=asm>
pop { pc }
</source>

This way, the return address that was saved on the stack is directly used for the jump back. Just the same way, you can use “push” and “pop” to save and restore any other registers while your function is running.
==== Calling Convention ====
Actually building a large program as shown in the last example is a bad idea. The “Delay” function requires the iteration count 1000000 to reside in r3, and overwrites r4. The “Blink” function relies on “Delay” not overwriting r0-r3 and r5, and requires the number of flashes to be given via r5. Such requirements can quickly grow into an intricate web of interdependencies that makes it impossible to write larger functions that call several sub-functions, or to restructure anything. Therefore, it is common to use a calling convention, which defines which registers a function may overwrite, which it should keep, how it should use the stack, and how to pass information back to the caller.

When building an entire application out of your own assembly code, you can invent your own calling convention. However, it is always a good idea to use existing standards: The AAPCS defines a calling convention for ARM. This convention is also followed by C and C++ compilers, so using it makes your code automatically compatible with those. The Cortex-M interrupt mechanism follows it too, so code that uses some other convention would be awkward to adapt to interrupts. The specification of the calling convention is quite complex, so here is a quick summary of the basics:

* Functions may only modify the registers r0-r3 and r12. If more registers are needed, they have to be saved and restored using the stack. The APSR may be modified too.
* The LR is used as shown for the return address.
* When returning (via “bx lr”) the stack should be exactly in the same state as during the jump to the function (via “bl”).
* The registers r0-r3 may be used to pass additional information to a function, called parameters, and the function may overwrite them.
* The register r0 may be used to pass a result value back to the caller, which is called the return value.

This means that when you call a function, you must assume registers r0-r3 and r12 may be overwritten but the others keep their values. In other words, the registers r0-r3 and r12 are (if at all) saved ''outside'' the function (“caller-save”), and the registers r4-r11 are (if at all) saved ''inside'' the function (“callee-save”).

A function that does not call any other functions is called a “leaf-function” (as it is a leaf in the call tree). If such a function is simple, it might not need to touch the stack at all, as the return address is just saved in a register (LR) and it might only overwrite the registers r0-r3 and r12, which the caller can make sure to contain no important data. This makes small functions efficient, as register accesses are faster than memory accesses, such as to the stack.
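
A minimal sketch of a leaf function with parameters and a return value, following the rules above. The function “AddThree” is a hypothetical example and not part of the LED blinker:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =10 @ First parameter
ldr r1, =20 @ Second parameter
ldr r2, =30 @ Third parameter
bl AddThree @ After the call, r0 contains the return value 60
b .

@ Parameters: r0, r1, r2 = Numbers to add
@ Return value: r0 = Sum of the three numbers
.type AddThree, %function
AddThree:
add r0, r0, r1
add r0, r0, r2
bx lr @ Only r0 was modified, and the stack was not touched
</source>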

If all your functions follow the calling convention, you can call any function from anywhere and be sure about what it overwrites, even if it calls many other functions on its own. Restructuring the LED blinker could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =5
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r7, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
mov r7, r0 @ Number of LED flashes.

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r7, #1
bne BlinkLoop

pop { r4-r7, pc }

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>
Example name: “BlinkFunctionCallingConvention”

The three small functions at the end only use registers r0 and r1, which they are free to overwrite. The “Delay” function expects the number of iterations as a parameter in r0, which it then modifies. Therefore, the “Blink” function fills r0 before every call to “Delay”. Alternatively, “Delay” could use a fixed iteration count, i.e. the “ldr” could be moved into “Delay”. As the “Blink” function must assume that “Delay” overwrites r0-r3 and r12, it keeps its own data in r4-r7, which are guaranteed to be retained according to the calling convention. Since “Blink”, in turn, must preserve these registers for the function that called it, it uses “push” and “pop” to save and restore them. Note the shortened syntax “r4-r7” in the instructions. The number of LED flashes is passed in r0 as a parameter; as this register will be overwritten, this number is moved to r7.
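
The mentioned alternative with a fixed iteration count could be sketched as follows. The function then takes no parameters, and the “ldr r0, =DelayLoopIterations” lines before each call in “Blink” can be dropped:

<source lang=asm>
DelayLoopIterations = 1000000

@ Parameters: none
.type Delay, %function
Delay:
ldr r0, =DelayLoopIterations @ Fixed number of iterations
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>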

Alternatively, “Blink” could re-load the constants into r1/r2 each time they are used, such that only one callee-save register (r4) needs to be saved, which is needed to count the number of flashes:

<source lang=asm>
.type Blink, %function
Blink:
push { r4, lr }

mov r4, r0

BlinkLoop:
ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r4, #1
bne BlinkLoop

pop { r4, pc }
</source>
Example name: “BlinkFunctionCallingConvention2”

A third variant would not use any of the callee-saved registers (r4-r11) at all, and instead just save r0 on the stack before the function calls and restore it as needed:

<source lang=asm>
.type Blink, %function
Blink:
push { lr }

BlinkLoop:
push { r0 }

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

pop { r0 }
subs r0, #1
bne BlinkLoop

pop { pc }

</source>
Example name: “BlinkFunctionCallingConvention3”

The frequent stack accesses would however make this slower. Be sure to always document the meaning (and units, if applicable) of parameters e.g. via comments.
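
As a minimal sketch of such documentation, the “Blink” function could be preceded by a comment block like the following. The exact wording and layout of the comments are only a suggestion; the restriction on r0 follows from the “subs”/“bne” loop, which would wrap around if it started at zero:

<source lang=asm>
@ Blink: Flash the LED on PA8 a given number of times.
@ Parameters: r0 = Number of LED flashes (must be at least 1)
@ Timing: Each on/off phase lasts "DelayLoopIterations" iterations of the delay loop
@ Overwrites: r0-r3, r12 (as permitted by the calling convention)
.type Blink, %function
Blink:
push { r4, lr }
@ ... function body as in the examples above ...
pop { r4, pc }
</source>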
=== Conditional Execution ===
As mentioned, the conditional variants of the “b” instruction (e.g. “bne”) can be used to execute certain blocks of code only if a certain condition is met. First, more ways to formulate conditions will be shown. Next, the ARM instruction “it” will be introduced, which makes executing small blocks of code conditionally more efficient.
==== Conditions ====
All conditions for conditional execution depend on the outcome of some mathematical operation. When instructions such as “adds”, “subs”, “ands” are used, they update the flags in the APSR register depending on the outcome, which are then read by the conditional variants of “b” to decide whether to actually perform the jump.

Often it is necessary to compare two numbers without actually doing a calculation. This can be done with the “cmp” instruction to which you can pass two registers or a register and a literal:

<source lang=asm>
cmp r0, #42
cmp r0, r1
</source>

The “cmp” instruction is very similar to “subs” - it subtracts the second operand from the first, but doesn’t save the result anywhere, i.e. the registers keep their values. Just the flags in the APSR are updated according to the result, just as with “subs”. For example, if both operands were equal, the result of the subtraction is zero, and the zero flag will be set. So, to test whether two numbers are equal:

<source lang=asm>
cmp r0, #42
beq TheAnswer

@ This is executed if r0 is not 42

TheAnswer:
@ This is executed if r0 is 42
</source>

The “bne” instruction is the opposite of “beq”.

The “tst” instruction works similarly to “cmp”, but instead of subtracting, it performs a bitwise “and” operation - like the “ands” instruction, but without keeping the result. This way, you can test whether a bit in a register is set:

<source lang=asm>
tst r0, #4
beq BitNotSet

@ This is executed if bit 2 in r0 is set

BitNotSet:
@ This is executed if bit 2 in r0 is not set
</source>

Another useful application of “tst” is to pass the same register twice. A bitwise “and” of a value with itself yields that same value, so “tst” in this case effectively checks the properties of the input (negative/positive, zero):

<source lang=asm>
tst r0, r0
beq ValueZero

@ This is executed if r0 is not zero

ValueZero:
@ This is executed if r0 is zero
</source>

There is also the “teq” instruction which performs an exclusive or operation.
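
As a small sketch of how “teq” can be used (the labels are hypothetical and only serve as illustration): the exclusive or of two values is zero exactly if all bits match, so “teq” followed by “beq” tests for equality. Bit 31 of the result is set exactly if the sign bits of the operands differ, which can be checked with “bmi”:

<source lang=asm>
teq r0, r1 @ Calculate r0 xor r1, update the flags, discard the result
beq BothEqual @ Taken if r0 and r1 are equal
bmi SignsDiffer @ Taken if bit 31 of r0 and r1 differ

@ This is executed if r0 and r1 are different but have the same sign bit

BothEqual:
@ This is executed if r0 and r1 are equal

SignsDiffer:
@ This is executed if exactly one of r0 and r1 is negative
</source>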

As mentioned, the suffixes “eq” and “ne” are called condition codes. ARM has 14 of those which define how the flags in the APSR form the condition. The details about how a subtraction (by “subs” or “cmp”) sets the flags in the APSR and how their interpretation by the different condition codes correlates to the mathematical result are somewhat complicated, involving the way the 2’s complement format works and relying on the fact that subtracting works by adding a negated number. Instead of diving into all the details, a table with a more high-level view and a practical interpretation of the condition should be more helpful:

{| class="wikitable"
! style="text-align:left;" | Code
! style="text-align:left;" | Meaning
! style="text-align:left;" | Unsigned / Signed
! style="text-align:left;" | Flags
! style="text-align:left;" | Condition after “cmp/subs r0, r1”
! style="text-align:left;" | Condition after “tst r0, r0”
|-
| style="text-align:center;" | EQ
| Equal
| style="text-align:center;" | U+S
| Z==1
| r0 = r1
| r0 = 0
|-
| style="text-align:center;" | NE
| Not equal
| style="text-align:center;" | U+S
| Z==0
| r0 ≠ r1
| r0 ≠ 0
|-
| style="text-align:center;" | MI
| Negative
| style="text-align:center;" | S
| N==1
| ---
| r0 < 0
|-
| style="text-align:center;" | PL
| Positive or Zero
| style="text-align:center;" | S
| N==0
| ---
| r0 ≥ 0
|-
| style="text-align:center;" | VS
| Overflow
| style="text-align:center;" | S
| V==1
| r0-r1 out of range¹
| ---
|-
| style="text-align:center;" | VC
| No Overflow
| style="text-align:center;" | S
| V==0
| r0-r1 in range¹
| ---
|-
| style="text-align:center;" | HS
| Unsigned higher or same
| style="text-align:center;" | U
| C==1
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LO
| Unsigned lower
| style="text-align:center;" | U
| C==0
| r0 < r1
| ---
|-
| style="text-align:center;" | HI
| Unsigned higher
| style="text-align:center;" | U
| C==1 and Z==0
| r0 > r1
| ---
|-
| style="text-align:center;" | LS
| Unsigned lower or same
| style="text-align:center;" | U
| C==0 or Z==1
| r0 ≤ r1
| ---
|-
| style="text-align:center;" | GE
| Signed greater or equal
| style="text-align:center;" | S
| N==V
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LT
| Signed less than
| style="text-align:center;" | S
| N!=V
| r0 < r1
| ---
|-
| style="text-align:center;" | GT
| Signed greater than
| style="text-align:center;" | S
| Z==0 and N==V
| r0 > r1
| ---
|-
| style="text-align:center;" | LE
| Signed less or equal
| style="text-align:center;" | S
| Z==1 or N!=V
| r0 ≤ r1
| ---
|}
¹ Range meaning the numbers from -2^31 to 2^31-1, inclusive

To determine which condition code you need, first think about whether the number is unsigned (range 0 to 2^32-1) or is using two's complement to represent signed numbers (range -2^31 to 2^31-1). Ignore all rows in the table with the wrong format.

If you want to compare two numbers, use the “cmp” instruction, and search for the desired condition within the “cmp”-condition column of the table. If you want to test a single number's properties, use the “tst”-column. Use the condition code from the first column with the conditional “b” instruction (“bne”, “beq”, “bmi”, “bpl”, “bhs”, ... ) right after the appropriate “cmp”/“tst” instruction.

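Note that all the condition codes have a corresponding inverse code that has exactly the negated meaning (e.g. “hi” and “ls”). Most also have a swapped partner code: using it is equivalent to swapping the operands of “cmp” (e.g. “hi” after “cmp r0, r1” tests the same condition as “lo” after “cmp r1, r0”).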
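
The difference between the signed and unsigned condition codes can be illustrated with a small sketch. The values and label names are chosen arbitrarily for this illustration. The bit pattern 0xffffffff is the largest unsigned number, but represents -1 in two's complement, so the same “cmp” leads to different decisions depending on the condition code:

<source lang=asm>
ldr r0, =0xffffffff @ 4294967295 if read as unsigned, -1 if read as signed
ldr r1, =1
cmp r0, r1
bgt SignedGreater @ Not taken: -1 > 1 is false
bhi UnsignedHigher @ Taken: 4294967295 > 1 is true (“bgt” did not change the flags)

SignedGreater:
@ Not reached in this example
b .

UnsignedHigher:
@ Execution continues here
b .
</source>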
==== The IT instruction ====
Jumping is inefficient, so having many conditional jumps may slow down your program. The ARM architecture offers a way to make a few instructions conditional without requiring a jump via the “it” (if-then) instruction. It is used in place of a conditional jump after an instruction that sets the flags (“cmp”, “tst”, “adds”…) and also needs a condition code. The next instruction right after the “it” will then only be executed when the condition is met, and skipped otherwise. You have to repeat the condition code and add it to that instruction; this is just to make the code clearer and avoid confusion.

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

it hi @ Make the next instruction conditional
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

This checks if r0 is higher than r1 (it isn’t), and only sets the pin PA8 to high if this condition is met. Up to 4 instructions can be made conditional like this; for each additional instruction, another “t” has to be appended to the “it” instruction:

<source lang=asm>
cmp r0, r1 @ Perform comparison

ittt hi @ Make the next three instructions conditional
ldrhi r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldrhi r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

You can also add instructions that will be executed if the condition was ''not'' met (like an “else”-case in high-level-languages), by appending “e” instead of “t” to the “it” instruction. Since the “t” in “it” is fixed, the first instruction is always executed if the condition is met; only the next three instructions can be either a “then” case (“t”) or “else” case (“e”). You also have to provide the inverted condition code for the “else”-instructions:

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

ite hi @ Make the next two instructions conditional (if-then-else)
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
strls r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low
</source>

There are several restrictions on which instructions may appear within an it-block. Most importantly, instructions that set the flags are forbidden here, as is the “b” instruction except for the last instruction in an “it” block. Directly jumping to one of the conditional instructions is forbidden too.
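
A typical small use of an if-then-else block is selecting one of two values without any jump. The following sketch (the register assignment is chosen arbitrarily for illustration) writes the larger of the two signed numbers in r0 and r1 into r2:

<source lang=asm>
cmp r0, r1 @ Compare r0 and r1
ite ge @ Make the next two instructions conditional (if-then-else)
movge r2, r0 @ r0 >= r1: the result is r0
movlt r2, r1 @ r0 < r1: the result is r1
</source>

For unsigned numbers, the condition codes “hs”/“lo” would be used instead of “ge”/“lt”.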

In T32 code, only the conditional “b” instruction is capable of encoding a condition code together with some operation, so the “it” instruction is provided to make any instruction conditional. On A32, most instructions include a condition code and can therefore be conditional, and the “it” instruction is actually ignored by the assembler here. You can and should still put “it” into code intended for A32, as this makes it compatible with T32. This is one of the reasons why A32 is more time-efficient, and T32 more space-efficient.

Conditional instructions sometimes make surprisingly compact programs. For example, the Euclidean algorithm for calculating the greatest common divisor (gcd) of two numbers can be written in ARM assembly like this:

<source lang=asm>
gcd:
cmp r0, r1
ite gt
subgt r0, r0, r1
suble r1, r1, r0
bne gcd
</source>

While the C equivalent is actually longer:

<source lang=c>
int gcd(int a, int b) {
while (a != b) {
if (a > b)
a = a - b;
else
b = b - a;
}
return a;
}
</source>

The usage of conditional instructions is also [https://developer.arm.com/docs/dui0801/latest/condition-codes/example-showing-the-benefits-of-conditional-instructions-in-a32-and-t32-code faster] than using conditional jumps. Note that the final “bne” instruction is independent of the “if-then” block; it just directly uses the result of “cmp”.
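
To try out the loop, it can be turned into a callable function by appending a return instruction. The following is only a sketch: the input values are chosen arbitrarily, both inputs are assumed to be positive (otherwise the loop would not terminate), and r1 is overwritten during the calculation:

<source lang=asm>
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =48 @ First number
ldr r1, =18 @ Second number
bl gcd @ Afterwards, r0 contains 6
b .

@ Parameters: r0, r1 = Positive numbers; Result: r0 = Greatest common divisor
.type gcd, %function
gcd:
cmp r0, r1
ite gt
subgt r0, r0, r1
suble r1, r1, r0
bne gcd
bx lr @ Return to caller with the result in r0
</source>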
=== 8/16 bit arithmetic ===
So far, all numbers were 32 bits wide. However, especially to save space, smaller numbers with 8 or 16 bits are often needed. Cortex-M3 doesn't provide any instructions for calculating with 8 or 16 bit numbers directly. Instead, after loading such a number from memory into a processor register, it has to be extended to 32bit to allow the 32bit instructions to work properly. When storing the result back, only the lower 8/16 bits are used. If 8/16bit overflow behavior is required for calculations (i.e. wrapping around at -128/127 for 8bit signed, 0/255 for 8bit unsigned, -32768/32767 for 16bit signed, 0/65535 for 16bit unsigned), the numbers have to be truncated after each calculation. This actually makes it slightly less efficient to deal with smaller numbers.

A 16bit value (“halfword”) can be read from memory with the ldrh instruction:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
</source>

“ldrh” loads 16bit from memory, writes them into the lower 16 bits of the destination register (here: r1), and sets the upper 16bits to zero. If the value is signed, it has to be sign-extended so that it can be used with 32bit-calculations:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
sxth r1, r1
</source>

The “sxth” instruction copies the sign bit (i.e. bit 15) into the upper 16 bits (“sign-extension”); this makes sure that negative 16bit-numbers keep their value when interpreted as 32 bits. The “ldrsh” instruction combines both “ldrh” and “sxth”. “ldrb”, “sxtb”, “ldrsb” are for loading and sign-extending 8bit-values and the combination of both, respectively.
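
The combined and 8bit variants can be sketched as follows. As an assumption for this illustration, the halfword stored at “SomeAddress” is 0xff80 and the memory is little-endian, so the byte at that address is 0x80:

<source lang=asm>
ldr r0, =SomeAddress
ldrsh r1, [r0] @ Load 16 bits and sign-extend them: r1 = 0xffffff80 = -128
ldrb r2, [r0] @ Load 8 bits, upper 24 bits are set to zero: r2 = 0x00000080 = 128
ldrsb r3, [r0] @ Load 8 bits and sign-extend them: r3 = 0xffffff80 = -128
</source>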

To simulate 8/16bit overflow behaviour after a mathematical operation, use uxtb/uxth for unsigned 8/16 bit numbers, or sxtb/sxth for signed 8/16 bit numbers:

<source lang=asm>
add r0, #1
uxth r0, r0
</source>

The “uxth”/“uxtb” instructions copy the lower 16/8 bits of a register into another one, setting the upper 16/24 bits to zero. This way, if r0 contained 65535 before, the result will be 0 instead of 65536 after using “uxth”.
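
The signed case works the same way with “sxtb”/“sxth”. The following sketch uses arbitrarily chosen values and assumes that “SomeAddress” refers to a location in RAM. It also shows that no truncation is needed before storing, as “strb” only writes the lower 8 bits anyway:

<source lang=asm>
ldr r0, =127
add r0, #1 @ r0 = 128, which is outside the signed 8bit range
sxtb r0, r0 @ r0 = 0xffffff80 = -128, as with a signed 8bit overflow

ldr r1, =SomeAddress
strb r0, [r1] @ Store only the lower 8 bits (0x80) into memory
</source>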

This is a common trap when coding in C: using e.g. the "uint16_t" type for local variables such as loop counters implicitly requests 16bit overflow behavior, which requires truncating after each calculation, even though the overflow may never actually happen. This is why e.g. "uint_fast16_t" should be used for local variables, as this type is 32 bits wide on ARM, which is faster.
=== Alignment ===
There are certain restrictions on the address when accessing data in memory using the “str”/“ldr” variants:
* The “ldrd”/“strd”/“ldm”/“stm” instructions, which can load/store multiple registers at once, always require the address to be a multiple of 4. If it isn’t, the program will crash.
* The “ldr”/“str” instructions require the address to be a multiple of 4, and “strh”/“ldrh” require it to be a multiple of 2. If it isn’t, the behaviour depends on the ARM version:
** On ARMv6-M and before, the program will crash.
** On ARMv7-M:
*** If the CCR.UNALIGN_TRP bit is set to zero (the default), the access will be slow
*** If the CCR.UNALIGN_TRP bit is set to one, the program will crash, emulating the ARMv6-M behaviour

For “strb”/“ldrb” there are no such requirements.

The number of which the address needs to be a multiple of is called the “alignment” (e.g. 2-byte-alignment, 4-byte-alignment, …). An access with an address that is a multiple of 2/4 as specified above is called an “aligned access”; others are called “unaligned access” (which are slow or cause a crash).

Even though slow accesses may be acceptable, it is still a good idea to make sure all accesses are always correctly aligned in case the code is ported to an ARM version or operating system that requires it. The addresses of peripheral registers are already aligned correctly, so there is no need to worry. When placing data in RAM however, you should make sure that the addresses of the individual elements that are accessed via one of the “ldr” variants are aligned properly. For example, if a previous example code was modified like this:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

The address of “var1” will not be a multiple of 4, and an access via “ldr” would be unaligned. This could be improved by adding a space of 3 bytes in between:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.space 3
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

This would require you to keep in mind all the other things in memory that were declared before, which is impractical especially if multiple assembly files are used. Therefore, the assembler offers the “.align” directive:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.align 2
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

When using “.align X”, the assembler makes sure that the next address will be a multiple of 2^X, so in this case, a multiple of 2^2=4. The assembler will therefore insert 0 to 2^X-1 bytes of space. The section containing the directive in the object code file will also be marked to require that alignment, such that the linker will automatically place it at the appropriate location in address space.
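
The following sketch shows the alignment for a halfword and a word variable. It assumes that the GNU assembler is used, which additionally offers the “.balign” directive that takes the alignment as a byte count instead of an exponent. The variable “var3” is made up for this illustration:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.align 1 @ Next address is a multiple of 2^1=2
var3:
.space 2 @ Reserve 2 bytes for memory block “var3”, suitable for “ldrh”/“strh”
.balign 4 @ Next address is a multiple of 4, same effect as “.align 2”
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”, suitable for “ldr”/“str”
</source>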
=== Offset addressing ===
The various “ldr”/“str” instructions can optionally perform some calculation on the address before executing the memory access. What is shown for “ldr” here works equivalently for “str” and the variants for halfwords and bytes. There are several variants for this. The first one adds a fixed offset that is encoded within the instruction itself to the address:

<source lang=asm>
ldr r0, [r1, #8]
</source>

This adds 8 to r1 and uses the result as the address to access. The number can also be negative. This variant is useful for accessing members of a heterogeneous container organized like a C struct or the registers in a peripheral module. For example, you can load the base address of a peripheral module into a register, and then access the various registers using offset-addressing without having to load each address individually:

<source lang=asm>
GPIOA=0x40010800
GPIOx_CRH = 0x04
GPIOx_BSRR = 0x10

GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA

ldr r1, =GPIOA

ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r0, [r1, #GPIOx_BSRR] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “OffsetAddressing”

This way, you can avoid repeated loads of similar addresses. This variant is also capable of writing the newly calculated address back into the address register by appending a “!”:

<source lang=asm>
ldr r0, [r1, #8]!
</source>

This will add 8 to r1, write the result into r1, and also use it as an address from which to load 4 bytes and store them into r0. The variant

<source lang=asm>
ldr r0, [r1], #8
</source>

works just the opposite - r1 is used as an address from which to load the data, and “r1+8” is written back to r1. The next variant adds two registers to obtain the memory address:

<source lang=asm>
ldr r0, [r1, r2]
</source>

This loads the data from the address calculated by “r1+r2”. The second register (here: r2) can also be optionally shifted left by a fixed number of bits in the range 0-3:

<source lang=asm>
ldr r0, [r1, r2, lsl #2]
</source>

This shifts r2 left by two bits (i.e. multiplies it by 4), adds it to r1, and uses that as the address (r2 itself is not modified).
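
The effect of the different variants can be compared with a small sketch that reads from a table of four words. The names “AddressingDemo” and “DemoTable” as well as the table contents are made up for this illustration:

<source lang=asm>
.text
.type AddressingDemo, %function
AddressingDemo:
ldr r1, =DemoTable @ r1 = Address of the first entry
mov r2, #3 @ r2 = Index 3
ldr r0, [r1, #4] @ r0 = 20 (entry 1); r1 is not modified
ldr r0, [r1, r2, lsl #2] @ r0 = 40 (entry 3 at address r1+3*4); r1 and r2 are not modified
ldr r0, [r1, #8]! @ r1 = r1+8, then r0 = 30 (entry 2)
ldr r0, [r1], #4 @ r0 = 30 (entry 2 again), then r1 = r1+4 (address of entry 3)
bx lr

.align 2
DemoTable:
.word 10, 20, 30, 40
</source>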
=== Iterating arrays ===
The offset addressing mechanism is perfectly suited to iterating arrays. This could be used to make an array defining a sequence of LED flashes that is iterated by the LED blinker application. Such an array would contain the duration of each on- and off-cycle (as passed to the “Delay” function) and be placed in flash memory:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
.word 2500000, 1000000, 2500000, 1000000, 2500000, 1000000
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
BlinkTableEnd:
</source>
Example name: “BlinkPattern”

The “.word” directive is used to place a sequence of 32bit numbers into flash memory. The label “BlinkTable” will refer to the start address of the array, and “BlinkTableEnd” to the first address ''after'' the array. These two addresses are loaded into registers before the loop. The “.align” directive is used to make sure the 32bit-words are stored at properly aligned addresses. Inside the loop, the “ldr” instruction is used to load a 32bit-word from the array and pass it to the “Delay” function. The r7 register is advanced by 4 bytes to the next 32bit-word. This is done twice, for the on- and off-time. At the end of the loop, the address register is compared with the address of “BlinkTableEnd” - until that address has been reached, the loop will continue.

Another possibility is to keep the base address of the array in a register, and increment another register that contains the offset:

<source lang=asm>
.type Blink, %function
Blink:
push { r4-r9, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =0
ldr r9, =18

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

cmp r8, r9
blo BlinkLoop

pop { r4-r9, pc }
</source>
Example name: “BlinkPattern2”

Here, r8 is incremented in steps of 1 to denote the index in the array. The “lsl” syntax for “ldr” is used to multiply r8 by 4 (since each word is 4 bytes in size) and add it to r7, which contains the array’s base address. At the end of the loop, r8 is compared with 18, which is the number of entries in the array. This variant is actually less efficient, as it needs to keep both the base address and the index in registers and also has to increment the index in each iteration.
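
The same technique works for arrays of smaller elements if the step size is adapted. As a sketch, the following function sums up an array of unsigned 16bit values using “ldrh” with a step of 2 bytes. The function name, the labels and the array contents are made up for this illustration:

<source lang=asm>
@ Result: r0 = Sum of all 16bit values in “Samples”
.type SumSamples, %function
SumSamples:
ldr r1, =Samples @ Address of the first element
ldr r2, =SamplesEnd @ First address after the array
mov r0, #0 @ Start with a sum of zero

SumLoop:
ldrh r3, [r1], #2 @ Load a halfword and advance the address by 2 bytes
add r0, r3 @ Add it to the sum
cmp r1, r2
blo SumLoop @ Continue until the end address has been reached

bx lr @ Return to caller, r0 = 600 for the values below

.align 1
.type Samples, %object
Samples:
.hword 100, 200, 300
SamplesEnd:
</source>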
=== Literal loads ===

Regardless of architecture, any processor obviously needs to work with addresses in its own address space a lot. ARM can do calculations with its 32bit addresses just fine, but there is a bottleneck: The instruction set itself. To work with any address, it needs to be initially loaded into a processor register, but ARM instructions are only 16 or 32bit in size - not enough space for an arbitrary 32bit number plus the instruction encoding. Allowing even larger instructions (e.g. 40 bit) would complicate matters, so ARM instead uses several tricks to deal with this problem, which will be discussed here.

The “ldr r0, =1234” syntax allows you to load any arbitrary 32bit number. However, it is not actually a machine code instruction; the assembler translates it into one. In this chapter, the actual instructions for loading immediate numbers are discussed.

==== The “mov”-instruction ====
The most basic way of loading an immediate number into a register is the “mov” instruction:

<source lang=asm>
mov r0, #1234
</source>

This allows you to load any 16bit number (0 to 2^16-1) into a register. “mov” also includes some clever encodings that allow you to load certain commonly-used patterns:

* Any 32bit number that consists of one byte of arbitrary bits (i.e. 8 adjacent arbitrary bits) at any location, and zeros otherwise, e.g. 0x00000045, 0x00045000, 0x7f800000.
* Any 32bit number that consists of the same byte repeated 2 or 4 times in fixed places, as in 0x23002300, 0x00230023, 0x23232323
* The bit-wise negated result of any of these two patterns, e.g. 0xffffffba, 0xfffbafff, 0x807fffff or 0xdcffdcff. The assembler will actually use the “mvn” instruction for this, which works identically to “mov”, but negates the value.

By specifying a number that falls into one of these patterns, the assembler will automatically use the appropriate encoding. The first two ways of encoding numbers are not only available with “mov”, but also several other mathematical instructions that expect some immediate value: “add”, “and”, “bic”, “cmn”, “cmp”, “eor”, “mov”, “mvn”, “orn”, “orr”, “rsb”, “sbc”, “sub”, “teq”, “tst”. In the ARM Architecture Reference Manual, check the description of the instructions and look out for “ThumbExpandImm” to see whether it supports the first two patterns above.
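
The patterns can be tried out with a few lines like the following sketch; the values are chosen arbitrarily. The last two lines are commented out because, per the rules above, no encoding exists for them and the assembler is expected to reject them:

<source lang=asm>
mov r0, #0x00ab0000 @ One byte shifted to an arbitrary position
mov r0, #0x00ab00ab @ Same byte repeated twice
mov r0, #0xabababab @ Same byte repeated four times
mov r0, #0xff54ffff @ Negation of 0x00ab0000, assembled as “mvn”
mov r0, #0xabcd @ Arbitrary 16bit number
and r1, r1, #0x00ff00ff @ The repeated-byte pattern used with “and”
cmp r1, #0x3fc00 @ The shifted-byte pattern (0xff shifted left by 10 bits) used with “cmp”
@ mov r0, #0x12345 @ Error: Matches none of the patterns
@ cmp r1, #0xabcd @ Error: Arbitrary 16bit numbers only work with “mov”
</source>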

You can also use the “mvn” instruction directly, e.g.:

<source lang=asm>
mov r0, #0xf807ffff
mvn r0, #0x07f80000
</source>

Both lines are equivalent and write the number 0xf807ffff into r0.
==== The “movt” instruction ====
While supporting many common patterns, this does not allow arbitrary 32 bit numbers. One way to load any 32bit number is to split the number into two 16bit halves, and use both “mov” and “movt” to combine these two half-words into one register:

<source lang=asm>
mov r0, #0xabcd
movt r0, #0x1234
</source>

The “movt” instruction loads the given number into the upper 16 bits of the register, so this example loads 0x1234abcd into r0. The order is important, as “mov” overwrites the upper 16 bits with zeros, but “movt” keeps the lower 16 bits. If a single “mov” can't fit the desired number, the combination of “mov” and “movt” is the fastest way of loading any 32bit number. As two 32bit instructions are needed, this consumes 8 bytes of program memory. If you want to load the address of a symbol into a register, you need to tell the assembler to split it automatically. This can be achieved by prefixing the symbol with “:lower16:” or “:upper16:”, e.g.:

<source lang=asm>
movw r0, #:lower16:GPIOA_BSRR
movt r0, #:upper16:GPIOA_BSRR
</source>

Note that “movw” needs to be specified in this case to explicitly tell the assembler to use the “mov” variant that accepts 16bit numbers (which it otherwise does automatically when a direct value is given).
==== PC-relative loads ====
The other way of loading arbitrary 32bit values into registers is to place the value directly in flash memory, and load it from there using “ldr”:

<source lang=asm>
@ Some code …
mov r0, … address of Literal ...
ldr r1, [r0]
@ More code …
Literal:
.word 0x12345678
</source>

However, there is a chicken-and-egg problem - the address of “Literal” is a 32bit number itself, so how to load it into r0? Luckily, there is a register that contains a number close to the one needed - the program counter (PC, r15) indicates the address of the instruction currently being executed. By reading it and adding a small offset that fits into the instruction itself, the address of “Literal” can be obtained, provided that “Literal” is located close enough. Consider this example of the EnableClockGPIOA function:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
add r1, pc, #12
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
.word RCC_APB2ENR
</source>

The 32bit-value “RCC_APB2ENR” is stored in flash memory. The “add” instruction is used to add the offset 12 to the address of the instruction itself to obtain the address of said 32bit-value, which is then loaded via “ldr”. The offset 12 is actually not easy to calculate and even depends on the alignment of the “add” instruction itself (hence the “.align” to ensure a consistent example). The assembler is capable of doing the calculation on its own, for which the “adr” instruction is used:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
adr r1, LiteralRCC_APB2ENR
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

The label LiteralRCC_APB2ENR refers to the address of the 32bit-value in memory. “adr” is actually a variant of “add” that instructs the assembler to calculate the offset and place it into the instruction itself, which then lets the processor add it to PC and write the result to r1. This address is then used by “ldr”.

The “adr” instruction is useful when the address of some literal is explicitly needed; for example, in the blinker program, it can be used to obtain the addresses of the array:

<source lang=asm>
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8
</source>

However, for loading a single value, the address is actually not needed. In this case, “adr” and “ldr” can be combined:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, LiteralRCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

This special variant of “ldr” lets the assembler calculate the offset as with “adr”, adds it to “PC” at runtime and loads the data found at the address into r1. This is much easier than the first variant, as all calculations are done automatically. It is still somewhat cumbersome having to write three lines just to obtain a single 32bit value. Therefore, the assembler offers this already introduced syntax:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

This is a special command for the assembler. If possible, the assembler will use the “mov” or “mvn” instruction to load the value. If the value won’t fit, it will be put into flash memory, and a “ldr” instruction as above will be used. In this case, the “ldr rX, =...” syntax is equivalent to the combination of specifying a label for the value, the “.word” directive and “ldr rX, <Label>”. Therefore, this syntax is usually the best way to load immediates.

The assembler places the literals at the end of the file. If the file is long, the offset will be too long for the “ldr” and “adr” instructions and the assembler will emit an error. You can instruct the assembler to place all literals that have been declared so far at a specific point using the “.ltorg” directive. It is recommended to place an “.ltorg” after each function (after the “bx lr”) - just make sure that execution will never reach there. If a single function is so long that an “.ltorg” at the end is too far away from “ldr”/“adr” at the beginning, you can place an “.ltorg” somewhere in the middle and jump over it with “b”.
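
Both placements of “.ltorg” can be sketched as follows. “LongFunction”, its label “LongFunctionContinue” and the value 0x12345678 are made up for this illustration:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller; execution never continues past this point
.ltorg @ The literal for “ldr r1, =RCC_APB2ENR” is placed here

.type LongFunction, %function
LongFunction:
ldr r0, =0x12345678
@ ... first part of a long function ...
b LongFunctionContinue @ Jump over the literals
.ltorg @ The literal 0x12345678 is placed here
LongFunctionContinue:
@ ... second part of a long function ...
bx lr
</source>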

In summary, the following rules can help make literal loads more efficient:
* Avoid literal loads if possible; try to calculate needed values from other values that have already been loaded, possibly by using offset-addressing in “ldr”/“str”
* When accessing multiple registers of a single peripheral module, load its base address once and use offset addressing to access the individual registers
* If you need a pointer to a location in flash memory, try using “adr”
* If speed is important, use “movw”+“movt” to load the value
* Else, use “ldr rX, =...” to have the assembler choose the optimal encoding
* Place “.ltorg” after each function

The “ldr … =” instruction can also be used to load any immediate 32bit value into the PC to cause a jump to that address, simply by specifying “pc” as the target register. If you perform an ordinary branch (via “b” or “bl”) to some function whose address is too far away from the current code location, the linker will insert a “wrapper” function that does exactly that to perform the “far” jump. That function is called a “veneer”.
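
Such a jump could look like the following sketch. “FarFunction” is a hypothetical function that is assumed to be declared via “.type FarFunction, %function”, so that the loaded address is suitable for a jump. As with “b”, no return address is saved:

<source lang=asm>
ldr pc, =FarFunction @ Jump to “FarFunction”, regardless of its distance
</source>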
=== The SysTick timer ===
An important aspect of many embedded systems is controlling the timing of technical processes. In the blinker example, the timing of the LED flashes was handled by having the processor execute dummy instructions to pass time. It is however virtually impossible to accurately predict the runtime of any piece of code on a complex processor such as the ARM ones, and the runtime may vary among multiple runs and depending on the actual microcontroller and its configuration. For a simple LED blinker this may be acceptable, but not for e.g. a closed-loop controller for some mechanical actuator. Therefore, almost all microcontrollers and also application processors feature one or more hardware timers, which allow measuring time independently of the execution speed of the software. Timer features vary widely among different processors, but the basic idea is to increment or decrement some digital counter at each clock cycle and trigger some event when it reaches a certain value.

All ARMv7-M processors feature the so-called “SysTick”-Timer as part of the processor core itself. This is a rather simple 24bit-timer that counts from a configurable value back to zero, then resets to that value and triggers an event. This timer is frequently used as a time base for RTOS or other runtime libraries. The timer uses three periphery registers: “RVR” contains the value from which to count down. “CVR” contains the current value, and “CSR” contains some status and control bits. The timer can be used for the “Delay” function like this:

<source lang=asm>
SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
ldr r1, =SCS
add r0, r0, r0, lsl #1

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

DelayLoop:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #0x10000
beq DelayLoop

ldr r0, =0
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>

The SysTick is part of the “System Control Space”, SCS. The SCS base address is defined as a symbol, and the relative addresses of the registers as well. The count value is stored in “RVR”, after which “CVR” has to be set to zero. The timer is started by writing “5” into the “CSR” register. The loop repeatedly reads the “CSR” register and continues until bit 16 is set. The “tst” instruction is used to perform an “and” operation with the register contents and an immediate value without keeping the result while just updating the flags. At the end, the “CSR” register is set to zero to disable the timer. The “add” instruction at the beginning is used to multiply the count value by 3: r0 is shifted left by one, i.e. multiplied by two, and then added to itself, as in r0*2^1+r0. This is a common trick to quickly multiply by constants. By including this multiplication, the duration is the same as with the previous “Delay” variant, which, on this microcontroller, uses about 3 cycles per loop iteration.
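
The magic numbers “5” and “0x10000” can be made more readable by naming the individual bits of “CSR”. The following is only a sketch: the bit positions follow the ARMv7-M Architecture Reference Manual, but the symbol names and the label “WaitForCountFlag” are chosen here for illustration and are not used by the examples in this tutorial. The fragment assumes that “RVR” and “CVR” have already been set as in the “Delay” function:

<source lang=asm>
SCS = 0xe000e000
SCS_SYST_CSR = 0x10

SCS_SYST_CSR_ENABLE = (1 << 0) @ Counter is running
SCS_SYST_CSR_TICKINT = (1 << 1) @ Trigger the SysTick exception when reaching zero
SCS_SYST_CSR_CLKSOURCE = (1 << 2) @ Use the processor clock
SCS_SYST_CSR_COUNTFLAG = (1 << 16) @ Timer has reached zero since the last read

@ Start the timer (same value as "5")
ldr r1, =SCS
ldr r0, =(SCS_SYST_CSR_ENABLE | SCS_SYST_CSR_CLKSOURCE)
str r0, [r1, #SCS_SYST_CSR]

@ Wait until the timer has reached zero (same test as "#0x10000")
WaitForCountFlag:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #SCS_SYST_CSR_COUNTFLAG
beq WaitForCountFlag
</source>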

Managing timing this way (or with any other kind of “Delay” function) is still not very accurate. The time needed to call the function, start the timer, return, and set the pins is added to the actual duration and may also vary each time. The timing errors accumulate over time, so a clock implemented this way will quickly drift. The proper way to achieve accurate timing is to start the timer once, let it run continuously, and react to its events. The internal clock source used by the microcontroller is also quite inaccurate (up to 2.5% deviation), which can be improved upon by a quartz crystal (typical accuracy of e.g. 0.005%), which will be covered later. Reacting to events instead of calling a function that executes dummy code requires restructuring the program code, without using any kind of “Delay” function.

To do that, the timer is started once at program startup and kept running. After setting the LED pin, wait for the timer event, and repeat. In the last example, the values 3000000 and 7500000 are used for the timer register (3x1000000 and 3x2500000, respectively). Changing the timer value while it is running continuously is problematic, so one fixed value should be used; to achieve variable blinker duration, multiple timer events need to be counted. The greatest common divisor of the two numbers is 1500000, so to achieve the two different times, 2 and 5 timer events need to be registered, respectively. Since these numbers fit into a single byte, the table entries and corresponding access instructions are changed to byte. A function “StartSysTick” is implemented to start the timer once, and a function “WaitSysTick” to wait for a given number of timer events:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =TimerValue
bl StartSysTick
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2, 2
BlinkTableEnd:
.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

bx lr

@ r0 = Number of timer events to wait for
.type WaitSysTick, %function
WaitSysTick:
ldr r1, =SCS

WaitSysTickLoop:
ldr r2, [r1, #SCS_SYST_CSR]
tst r2, #0x10000
beq WaitSysTickLoop

subs r0, #1
bne WaitSysTickLoop

bx lr

</source>
Example name: “BlinkSysTick”

This way, the blinker frequency will be as stable and accurate as possible with the given clock source.
=== Exceptions & Interrupts ===
Exceptions and interrupts play an important role in low-level development. They provide a facility for hardware to notify the software of events, such as received data blocks or a timer event. On ARM, interrupts are a sub-group of exceptions – there are some “system-level” exceptions mostly for dealing with processor errors and providing operating system support, while interrupts are “special” exceptions for events signaled by periphery modules. When writing “regular” microcontroller software, you will mostly work with interrupts.

Exceptions (and interrupts) interrupt normal program flow and cause the processor to execute some other piece of code, which is called the exception handler or Interrupt Service Routine (ISR) (even for the “system-level” exceptions that are not interrupts). After dealing with the indicated event, the ISR typically returns and normal program flow resumes. As exceptions can interrupt the program at any time, data (and periphery) may be in any kind of inconsistent state, so special care must be taken to avoid corrupting program state in an ISR. The ARMv7-M processors (including the Cortex-M3) provide sophisticated support for exceptions, with configurable priorities and nested exception calls. This chapter will only cover the basics of using exceptions.

On ARMv7-M, exception handlers are implemented as regular functions, for example:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
@ Handle event ...
bx lr
</source>

Like any other function, it has a label, returns with “bx lr”, and is also made globally visible to other source files using “.global”. The “.type … %function” is required here for the same reason as for the already-mentioned “Reset_Handler”. Exception handlers can be located anywhere in flash memory, among the other regular functions. To tell the processor where the exception handlers for the various exception types are located, the vector table needs to be adjusted. Until now, the vector table was defined as:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Recall that the first 32bit-word in flash memory contains the initial stack pointer (defined via “.word _StackEnd”) and the second word contains the address of the first instruction of the program (defined via “.word Reset_Handler”). Actually, resetting the controller is an exception too, and the code to be executed after reset (or start-up) is the handler for the reset exception (hence the name “Reset_Handler”). The next 228 bytes of flash memory contain 57 32bit-addresses of the handlers of the other exceptions, including interrupts. The “.space” directive just fills those with zeroes. To tell the processor the address of an exception handler, the appropriate entry in this table needs to be set to that address. In chapter 10.1.2, table 63 of the controller’s reference manual, the format of the vector table, and which exception’s address should go where, is defined. Only the interrupts up until position 42 actually exist on the STM32F103RB/C8, as defined in chapter 2.3.5 of the datasheet; everything from “TIM8_BRK” is only present on larger controllers. According to the table, the SysTick exception handler’s address needs to be put at location 0x3C relative to the beginning of flash memory. Since the first 8 bytes are already occupied, 0x34 bytes of space are needed after those first 8 bytes.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0x34
.word SysTick_Handler
.space 0xac
</source>
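
As an alternative sketch that avoids calculating the gap sizes by hand, the “.org” directive of the GNU assembler can advance the location counter to a given offset within the current section and fill the gap with zeroes. This assumes that the “.VectorTable” section is only defined in this one file, since the offsets are relative to the start of the section contents of that file:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.org 0x3c @ Advance to the SysTick entry, filling the gap with zeroes
.word SysTick_Handler
.org 0xec @ Pad to the full table size (8 + 0xe4 bytes)
</source>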

With this modification, the SysTick_Handler function is now declared as the handler for the SysTick exception. By default, the SysTick timer does not trigger an exception. To enable that, you have to set bit 1 (“TICKINT”) in the SCS_SYST_CSR register, so the value written to it becomes 7 instead of 5. By placing the logic for the blinker into the timer's ISR, you get an interrupt-based blinker:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
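ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ Start at step 0; byte access, as the variable is 1 byte
ldr r1, =BlinkTable
ldrb r1, [r1] @ Load the first table entry
strb r1, [r0, #(TimerEvents-Variables)] @ Number of timer events for the first step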

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

ldr r0, =TimerValue
bl StartSysTick
SleepLoop:
wfi
b SleepLoop

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq Return

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs Return

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne Return

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs SkipRestart

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

SkipRestart:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

Return:
bx lr

.align 2
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr

</source>
Example name: “BlinkSysTickInterrupt”

The regular program flow now consists only of initializing the periphery, timer, and the first step of the blinker (setting the pin high). After that, the processor should do nothing but wait for exceptions, which is achieved by a simple endless loop. The “wfi” instruction suspends the processor; when an exception occurs, the processor will wake up, execute the ISR, and resume execution after the “wfi”. Therefore, “wfi” is usually put in an endless loop as shown. This technique can reduce the processor’s power consumption significantly, as it is only running when something needs to be done, as indicated via interrupts. The ISR first checks whether the interrupt flag in the timer register is set - this is necessary, since exceptions can sometimes occur “spuriously”, i.e. without an actual event causing them. The decision whether to set or reset the pin state is taken based on the lowest bit of the table index, such that the output alternates between 1 and 0.

The code inside the ISR needs to know which step in the blinking sequence is currently active, and how many timer events have already occurred inside the current step. Therefore, two 1-byte-variables are stored in RAM. To access them, offset addressing is used, where r0 contains the base address of the variables in memory, and the offsets inside “ldrb” and “strb” are set accordingly. The last number of the blink sequence table is omitted, since it is actually superfluous, because no action is taken after the last delay has elapsed. Because the table size is now odd, an “.align” directive after it is required. Always putting “.align” after outputting data is a good idea anyway.

Since exceptions can occur at any point in regular program flow, the processor registers may contain some data that will be used after the exception handler returns. Therefore, if the exception handler writes anything into the registers, they need to be restored when returning from the exception. Upon exception entry, the Cortex-M3/4 processors automatically store the registers r0-r3, r12, r14 (LR), the return address and the program status register (including the APSR flags) on the stack. The link register is filled with a special “dummy” value (called “EXC_RETURN”), and when the exception handler returns via “bx lr” using this value, the processor restores said registers to their previous state. This effectively means that you can implement exception handlers like any other function, i.e. freely overwrite r0-r3, r12 and the flags and push/pop r4-r11 and the LR if needed.
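
A minimal sketch of a handler that needs more than the automatically saved registers could look as follows. The names “Counters” and “IncrementWord” are hypothetical and only serve the illustration. Since “bl” overwrites the LR, the special value is pushed together with r4 and later popped directly into the PC, which triggers the exception return just like “bx lr” would:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.data
.align 2
Counters:
.space 8 @ Two 32bit counters

.text
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
push { r4, lr } @ r4 is not saved by the hardware; lr holds the special return value
ldr r4, =Counters @ r4 keeps its value across the calls below
mov r0, r4
bl IncrementWord @ Overwrites lr and may overwrite r0-r3 and r12
add r0, r4, #4 @ Address of the second counter
bl IncrementWord
pop { r4, pc } @ Restore r4 and return from the exception
.ltorg

@ r0 = Address of the 32bit value to increment
.type IncrementWord, %function
IncrementWord:
ldr r1, [r0]
add r1, #1
str r1, [r0]
bx lr
</source>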
=== Macros ===
The assembler provides a few mechanisms to make assembly-language development easier. One of those is macros, which allow you to define snippets of assembly code that you can then insert easily wherever you need them. While using a macro looks similar to a function invocation, the code inside the macro is actually copied each time the macro is used, so don’t overuse them. A macro definition starts with “.macro” and ends at the next “.endm” directive. For example, the following macro sets the LED pin to 0 or 1:

<source lang=asm>
.macro SETLED value
ldr r0, =GPIOA_BSRR
ldr r1, =(((!\value) << 24) | (\value<<8))
str r1, [r0]
.endm

SETLED 0
SETLED 1
</source>

The macro name is defined as “SETLED”, and a single parameter with name “value” is given. By typing “\value”, the value of the parameter is substituted in the macro body. Some bit-shifting is used to calculate the right bit pattern to write into BSRR to set or reset the pin accordingly.
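
Macros can also take several parameters. The following sketch uses a hypothetical macro “STOREREG” (not part of the example projects) that writes a constant to a register address. Note that it overwrites r0 and r1 each time it is used:

<source lang=asm>
GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

.macro STOREREG address, value
ldr r0, =\address
ldr r1, =\value
str r1, [r0]
.endm

@ Set PA8 high
STOREREG GPIOA_BSRR, GPIOx_BSRR_BS8
@ Set PA8 low
STOREREG GPIOA_BSRR, GPIOx_BSRR_BR8
</source>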
=== Weak symbols ===
As explained before, labels defined in assembly files get translated into symbols in the object code files, which are resolved by the linker. Sometimes it is desirable to provide a “default” or “fallback” implementation of some function (or data block) which is only used when no other implementation is given. This can be achieved by marking the “fallback” variant with “.weak”:

<source lang=asm>
.type Function1, %function
.global Function1
.weak Function1
Function1:
@ Default implementation …
…
bl Function1 @ Call the function
</source>

With this code alone, “Function1” will be used normally. If you put another function with the same name in a different assembly source file, that second variant will be used.
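
As a sketch, such an overriding definition in a second source file (the file name “override.S” is made up here) is an ordinary global function without “.weak”:

<source lang=asm>
@ override.S (hypothetical second source file)
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Function1, %function
.global Function1
Function1:
@ This implementation replaces the weak default when both files are linked
bx lr
</source>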
=== Symbol aliases ===
It is also possible to define aliases for symbols using “.thumb_set”, which sets the address accordingly. For example:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

When trying to call “Function2”, the linker will automatically fill in the address of “Function1”. This can also be combined with “.weak” to define a weak alias:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.weak Function2
.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

If you now define another “Function2” in a different assembly source file, that will be used. If not, “Function1” will be called, which is the target of the alias definition. This is useful if you want to define one default implementation for several different functions, for each of which you need one “.weak” and one “.thumb_set” directive.
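
For instance, one default implementation shared by two functions might be sketched like this; the names “DefaultImplementation”, “FunctionA” and “FunctionB” are placeholders:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type DefaultImplementation, %function
.global DefaultImplementation
DefaultImplementation:
bx lr

@ One ".weak" and one ".thumb_set" per alias
.weak FunctionA
.thumb_set FunctionA, DefaultImplementation
.weak FunctionB
.thumb_set FunctionB, DefaultImplementation
</source>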
=== Improved vector table ===
The techniques from the last three sections can be used to improve the definition of the vector table. The way it was defined before is not very flexible; to insert new entries, you have to calculate the new gap sizes and offsets. First, define a default handler ISR that is called by exceptions for which no other ISR is defined, and a macro that defines an alias for one exception with the default handler as the target, and finally a table of all exceptions by using the macro:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.macro defisr name
.global \name
.weak \name
.thumb_set \name, Default_Handler
.word \name
.endm

.global VectorTable
.section .VectorTable, "a"
.type VectorTable, %object
VectorTable:
.word _StackEnd
defisr Reset_Handler
defisr NMI_Handler
defisr HardFault_Handler
defisr MemManage_Handler
defisr BusFault_Handler
defisr UsageFault_Handler
.word 0
.word 0
.word 0
.word 0
defisr SVC_Handler
defisr DebugMon_Handler
.word 0
defisr PendSV_Handler
defisr SysTick_Handler
defisr WWDG_IRQHandler
defisr PVD_IRQHandler
defisr TAMPER_IRQHandler
defisr RTC_IRQHandler
defisr FLASH_IRQHandler
defisr RCC_IRQHandler
defisr EXTI0_IRQHandler
defisr EXTI1_IRQHandler
defisr EXTI2_IRQHandler
defisr EXTI3_IRQHandler
defisr EXTI4_IRQHandler
defisr DMA1_Channel1_IRQHandler
defisr DMA1_Channel2_IRQHandler
defisr DMA1_Channel3_IRQHandler
defisr DMA1_Channel4_IRQHandler
defisr DMA1_Channel5_IRQHandler
defisr DMA1_Channel6_IRQHandler
defisr DMA1_Channel7_IRQHandler
defisr ADC1_2_IRQHandler
defisr USB_HP_CAN1_TX_IRQHandler
defisr USB_LP_CAN1_RX0_IRQHandler
defisr CAN1_RX1_IRQHandler
defisr CAN1_SCE_IRQHandler
defisr EXTI9_5_IRQHandler
defisr TIM1_BRK_IRQHandler
defisr TIM1_UP_IRQHandler
defisr TIM1_TRG_COM_IRQHandler
defisr TIM1_CC_IRQHandler
defisr TIM2_IRQHandler
defisr TIM3_IRQHandler
defisr TIM4_IRQHandler
defisr I2C1_EV_IRQHandler
defisr I2C1_ER_IRQHandler
defisr I2C2_EV_IRQHandler
defisr I2C2_ER_IRQHandler
defisr SPI1_IRQHandler
defisr SPI2_IRQHandler
defisr USART1_IRQHandler
defisr USART2_IRQHandler
defisr USART3_IRQHandler
defisr EXTI15_10_IRQHandler
defisr RTCAlarm_IRQHandler
defisr USBWakeUp_IRQHandler

.text

.type Default_Handler, %function
.global Default_Handler
Default_Handler:
bkpt
b.n Default_Handler
</source>

There are a few empty entries in the table that are not used by the processor. At the beginning, there is still the definition for the initial stack pointer and the “Reset_Handler”. If you replace your “vectortable.S” by this code, you get a “proper” vector table. The “SysTick_Handler” will continue to work as before, and if you need to define any other ISR, for example for USART1, just define a function by the exact name “USART1_IRQHandler”. The address of this function will automatically be put in the vector table. If an exception without a corresponding ISR occurs, the “Default_Handler” will be called, which uses the “bkpt” instruction to force a breakpoint via the attached debugger. This helps with debugging missed exceptions while avoiding the need to define several individual dummy handler functions.
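
To find out which exception ended up in the default handler, the number of the active exception can be read from the IPSR special register before the breakpoint is hit. The following variant is only a sketch and not part of the example projects:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Default_Handler, %function
.global Default_Handler
Default_Handler:
mrs r0, ipsr @ r0 = Number of the active exception, e.g. 15 for SysTick, 16+x for interrupt x
bkpt
b.n Default_Handler
</source>

When the debugger stops at the “bkpt”, r0 can be inspected; for interrupts, subtracting 16 gives the interrupt number.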
=== .include ===
Having to put the register and bit definitions (“RCC_APB2ENR”, “RCC_APB2ENR_IOPAEN”, ...) in each assembly source file is redundant and error-prone. Instead, you can put them into a separate file (e.g. called “stm32f103.inc”) and use the “.include” directive to reference it:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

@ Normal code ...
</source>

The assembler will read the code from the included file and pretend it was written instead of the “.include” line. This can help improve code structure. While working on the project structure, you can also restructure the definitions for the GPIO registers to facilitate offset addressing:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
</source>

The next example incorporates these changes in addressing the registers.
=== Local Labels ===
Having to invent unique labels for all jump targets inside functions (e.g. for conditional code and loops) can be tedious. When using a disassembler (see below), each label will appear as its own function. Therefore, the GNU assembler supports local labels. These are labels whose names consist only of a number. Local labels need not be unique; several labels called e.g. “1” may exist in one file. To perform a jump to a local label, use the number and append an “f” or “b” to indicate whether to jump forward or backward. Local labels cannot be exported with the “.global” directive. The interrupt-based blinker can be modified like this using local labels:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
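ldr r0, =Variables
ldr r1, =0
strb r1, [r0, #(BlinkStep-Variables)] @ Start at step 0; byte access, as the variable is 1 byte
ldr r1, =BlinkTable
ldrb r1, [r1] @ Load the first table entry
strb r1, [r0, #(TimerEvents-Variables)] @ Number of timer events for the first step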

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

1:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>
Example name: “BlinkLocalLabels”
=== Initializing RAM ===
The blinker program uses 2 byte-variables in memory, which have to be initialized to some value at startup. For large programs with many variables, this quickly becomes hard to maintain and also inefficient. Assembler and linker can help produce an “image” of what the RAM contents should look like after initialization, and place this image in flash memory alongside the normal program data. At startup, this image can simply be copied 1:1 into RAM in a loop. Most programs contain many variables that will be initialized with zero, so placing a (possibly large) block of zeroes in flash memory is wasteful; therefore, an additional loop is used to initialize all zero-variables to zero. Both techniques are also employed by C and C++ compilers, so implementing the initialization code is required there too. First, change the declaration of your variables by using “.byte”, “.hword” and “.word” and include the desired initialization value. Variables that should be initialized with zero are placed after a “.bss” directive to put them into the equally-named section. They don’t get an initialization value but just reserved space by using “.space”:

<source lang=asm>
.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1
</source>

From the assembler’s point of view, the initialization data - in this case, just one byte of value “2” - will directly end up in RAM. However, this is not possible on microcontrollers, as the RAM contains undefined data on startup and isn’t automatically initialized. To place the initialization data in flash memory instead, change the linker script as follows:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
. = ALIGN(4);
} >FLASH

.stack (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
} >SRAM

.data : {
_DataStart = .;
*(.data);
. = ALIGN(4);
_DataEnd = .;
} >SRAM AT >FLASH

_DataLoad = LOADADDR(.data);

.bss (NOLOAD) : {
_BssStart = .;
*(.bss);
. = ALIGN(4);
_BssEnd = .;
} >SRAM
}
</source>
Example name: “BlinkInitRAM”

The stack is now placed in its own section with the “NOLOAD” attribute, since it doesn’t need initializing. The data is now put in the “.data” section. The initial data for that section is put into flash memory via the “>SRAM AT >FLASH” construct. The addresses of symbols inside the “.data” section are still the addresses in RAM, so accesses to the symbols from assembly code still work. The symbol “_DataStart” is assigned the beginning of the initialized data in RAM, and “_DataEnd” the end. The “LOADADDR” function is used to get the beginning of the initialization data in flash, and assign it to “_DataLoad”. The “.bss” section contains all the variables that should be zero-initialized, and the symbols “_BssStart” and “_BssEnd” are set to its beginning and end address, respectively. It is marked with “NOLOAD” as well, since we don't want to store (potentially many) zeroes in the linked program file and will initialize it explicitly (see below). As the beginning and size of the stack are already a multiple of 4, the beginning of “.data” is as well. The size of “.data” might not be a multiple of 4 however, so an “.=ALIGN(4)” command is inserted right before the definition of “_DataEnd”. This adds 0-3 dummy bytes by incrementing the location counter to make sure the address is a multiple of 4. The same thing is done right before “_BssEnd” and also at the end of the “.text” section, to make sure that “_BssEnd” and “_DataLoad” are multiples of 4 as well.

The only thing left is the actual initialization of the RAM. To do that, change the “Reset_Handler” as follows:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b
.ltorg
</source>

The explicit initialization of the variables was removed. Instead, the addresses for “_DataStart”, “_DataEnd” and “_DataLoad” that were defined in the linker script are loaded. Then, a short loop repeatedly loads a word from flash (i.e. starting with “_DataLoad”) and stores it into RAM (starting at “_DataStart”). The address pointers are incremented by the “ldr”/”str” instructions after the access. The pointer for the RAM location is compared with the end of the RAM area (“_DataEnd”) to decide whether to jump back to the beginning of the loop. To start the loop, a jump directly to the comparison is performed; this avoids the need to do the comparison at the beginning and inside of the loop. The second loop performs the zero-initialization of the area between “_BssStart” and “_BssEnd”; it works similarly, but does not need to load any data.

Unfortunately, the program as shown can’t be assembled: as the two variables now reside in two different sections (“.data” and “.bss”), the offset addressing in the “SysTick_Handler” doesn’t work anymore. Therefore, direct addressing has to be used:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr
</source>

=== Peripheral interrupts ===
Interrupts, i.e. exceptions triggered by periphery modules, need a little extra code compared to the “core” exceptions including the SysTick. The Cortex-M’s interrupt controller (the NVIC) contains several registers for configuring these interrupts. It is possible to configure the priority and manually trigger interrupts, but for most applications, the only necessary thing to do is enabling the desired interrupt. This is done via the registers “NVIC_ISER0” through “NVIC_ISER15”, which are documented in the ARMv7M Architecture Reference Manual in chapter B3.4.4. Each of those registers contains 32 bits with which 32 of the interrupts can be enabled. The STM32F103RB/C8 has 43 interrupts, so only two of the possible 16 registers are present. The number of interrupts is given in chapter 2.3.5 of the controller’s datasheet. So, to enable some interrupt x, the bit “x mod 32” in register NVIC_ISER''y'' with y=x/32 (rounded down) has to be set. This register’s address is 0xE000E100+y*4. Given an interrupt’s number in r0, the following function does just that:

<source lang=asm>
NVIC_ISER0 = 0xE000E100

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>
Example name: “BlinkTIM1”

The “and” instruction calculates “x mod 32”, and the following left-shift (“lsls”) calculates the value where bit “x mod 32” is one, and all others are zero. To calculate the offset address “y*4”, i.e. “(x/32)*4”, the interrupt number is first shifted right by 5 bits and then shifted back left by 2 bits. This is the same as shifting 3 bits right and zeroing out the lower 2 bits, but two shift instructions actually consume less program memory space. Finally, the calculated value is written into the NVIC register by using offset addressing. A plain write is sufficient here, because writing zeros to the NVIC_ISER registers has no effect on the other interrupts.
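
The same address and bit calculation can be written down in C to check it on a PC. This is only a sketch; the function names “nvic_iser_address” and “nvic_iser_mask” are invented for this illustration.

<source lang=c>
#include <stdint.h>

#define NVIC_ISER0 0xE000E100u

/* Address of the NVIC_ISERy register that is responsible for interrupt "irq":
   y = irq / 32, and each register occupies 4 bytes. */
uint32_t nvic_iser_address(uint32_t irq) {
    return NVIC_ISER0 + ((irq >> 5) << 2);  /* lsrs #5, then lsls #2 */
}

/* Value to write into that register: only bit (irq mod 32) is set. */
uint32_t nvic_iser_mask(uint32_t irq) {
    return 1u << (irq & 0x1Fu);             /* and #0x1F, then lsls */
}
</source>

For TIM1_UP_IRQn = 25, this yields the address 0xE000E100 (NVIC_ISER0) and the value 0x02000000. An interrupt number of 37 would end up in the second register at 0xE000E104 with the value 0x20.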

In addition to enabling the interrupt in the processor core’s NVIC, it also has to be enabled in the peripheral module. Many peripheral modules support several different events, each of which has to be enabled individually in the peripheral’s registers. Depending on the controller, these can be mapped to one single processor interrupt (and hence, one single ISR) or multiple ones, and need to be configured in the NVIC appropriately.

This example uses the STM32’s peripheral timer TIM1 instead of the SysTick timer:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500
TimerPrescaler=1000

.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl EnableClockTIM1
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TIM1_UP_IRQn
bl EnableIRQ
bl StartTIM1
1:
wfi
b 1b
.ltorg

.type TIM1_UP_IRQHandler, %function
.global TIM1_UP_IRQHandler
TIM1_UP_IRQHandler:
ldr r0, =TIM1
ldr r2, =(~(1 << TIMx_SR_UIF))

ldr r1, [r0, #TIMx_SR]
bics r1, r2
beq 2f

str r2, [r0, #TIMx_SR]

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type EnableClockTIM1, %function
EnableClockTIM1:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_TIM1EN)
str r0, [r1, #RCC_APB2ENR] @ Set TIM1EN bit in RCC_APB2ENR to 1 to enable TIM1
bx lr @ Return to caller
.ltorg

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ Takes no arguments; uses the constants TimerPrescaler and TimerValue
.type StartTIM1, %function
StartTIM1:
ldr r0, =TIM1
ldr r1, =(1 << TIMx_CR1_URS)
str r1, [r0, #TIMx_CR1]

ldr r1, =TimerPrescaler
str r1, [r0, #TIMx_PSC]

ldr r1, =TimerValue
str r1, [r0, #TIMx_ARR]

ldr r1, =(1 << TIMx_DIER_UIE)
str r1, [r0, #TIMx_DIER]

ldr r1, =(1 << TIMx_EGR_UG)
str r1, [r0, #TIMx_EGR]

dsb

ldr r1, =(1 << TIMx_CR1_CEN)
str r1, [r0, #TIMx_CR1]

bx lr
.ltorg

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>

The corresponding stm32f103.inc file with the added definitions for the timer registers is:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

RCC = 0x40021000
RCC_APB2ENR = 0x18
RCC_APB2ENR_IOPAEN = 2
RCC_APB2ENR_TIM1EN = 11

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

TIM1 = 0x40012C00

TIMx_CR1 = 0
TIMx_CR1_ARPE = 7
TIMx_CR1_URS = 2
TIMx_CR1_CEN = 0

TIMx_DIER = 0xC
TIMx_DIER_UIE = 0

TIMx_SR = 0x10
TIMx_SR_UIF = 0

TIMx_EGR = 0x14
TIMx_EGR_UG = 0

TIMx_PSC = 0x28
TIMx_ARR = 0x2C

TIM1_UP_IRQn = 25

NVIC_ISER0 = 0xE000E100
</source>

The source code enables the timer’s clock in the RCC before configuring it. The timer supports both a freely configurable prescaler for dividing the clock and a freely configurable maximum value, both of which are set by the StartTIM1 function. The TIMx_DIER_UIE bit is set to enable the interrupt for the so-called “update event”, which is triggered whenever the timer reaches the maximum value. A delicate sequence of register accesses is required to start the timer with the right configuration but without triggering the interrupt right away: To apply the modified settings immediately, the “TIMx_EGR_UG” bit is set to trigger an “artificial” update event. To prevent this from also triggering the interrupt, the “TIMx_CR1_URS” bit is set beforehand and cleared again afterwards. The timer is started by setting the “TIMx_CR1_CEN” bit at the end; the same write clears “TIMx_CR1_URS”. Before that, a “dsb” instruction is inserted. This “Data Synchronization Barrier” waits until all preceding write accesses have been completely processed, which matters because the processor’s pipeline usually works on several instructions at once. Because the timer configuration needs to be truly finished before starting the timer, this instruction is required. There are some other situations where the processor is too fast for the peripherals and needs to be temporarily halted by a “dsb”. If some peripheral-accessing code works in step-by-step mode while debugging, but not when executing normally, a well-placed “dsb” might help.
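
To get a feeling for the resulting timing, the interval between two update events can be estimated on a PC. The following sketch assumes that the prescaler divides the timer clock by “PSC+1” and that the counter needs “ARR+1” ticks per update event, as described in the controller’s reference manual. The function name “timer_update_period_us” is made up for this illustration, and the 8 MHz timer clock used below is an assumption that only holds with the default clock configuration.

<source lang=c>
#include <stdint.h>

/* Hypothetical helper: time between two update events in microseconds.
   clk_hz is the timer's input clock; psc and arr are the values written
   to TIMx_PSC and TIMx_ARR. */
uint32_t timer_update_period_us(uint32_t clk_hz, uint32_t psc, uint32_t arr) {
    /* Number of input clock cycles per update event */
    uint64_t cycles = (uint64_t)(psc + 1u) * (uint64_t)(arr + 1u);
    /* Convert to microseconds, rounding down */
    return (uint32_t)(cycles * 1000000u / clk_hz);
}
</source>

With TimerPrescaler=1000, TimerValue=1500 and an assumed timer clock of 8 MHz, this gives 187812 µs, i.e. roughly 0.19 seconds per update event.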

The ISR “TIM1_UP_IRQHandler” is used for the timer. It checks the “TIMx_SR_UIF” bit to verify an update event has actually happened. In that case, the register is overwritten with the value 0xFFFFFFFE, i.e. all bits are written with “1” except the UIF bit. Writing ones has no effect on the bits in this register, and writing a zero clears the respective bit. Therefore, this write access clears the UIF bit but keeps the others. These interrupt flags must always be cleared as soon as possible in the ISR, or the peripheral might trigger the interrupt again immediately. The rest of the ISR stays the same.
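
The effect of this write can be modelled with a few lines of C. This is only a simulation of the register behaviour described above (writing “1” keeps a flag, writing “0” clears it); the function names are invented for this sketch and no real hardware register is accessed.

<source lang=c>
#include <stdint.h>

#define TIMx_SR_UIF 0

/* Model of the status register: a flag stays set only if a "1" is written. */
uint32_t sr_after_write(uint32_t sr, uint32_t written) {
    return sr & written;
}

/* The value the ISR writes: all ones except the UIF bit. */
uint32_t clear_uif_value(void) {
    return ~(1u << TIMx_SR_UIF);
}
</source>

If, for example, the flags in bits 0 and 1 are both set (register value 0x3), writing 0xFFFFFFFE leaves 0x2: UIF is cleared and the other flag survives. Reading the register, clearing the bit and writing the result back would instead risk clearing a flag that was set by the hardware between the read and the write.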
=== Analysis tools ===
When working on a low level directly with linker scripts and assembly code, it is frequently necessary to directly verify the translation output, as you can’t rely on a compiler doing it right automatically, and flashing the program each time to see whether it works isn’t the most efficient way. This was, in fact, important in creating the example code for this tutorial. The “binutils” package, of which the assembler and linker are part, offers a few tools that help with analyzing the assembler’s and linker’s output.
==== Disassembler ====
As the name implies, a disassembler is the opposite of an assembler - it turns binary machine code back into a (more or less) readable textual representation. If you feed an ELF file generated by the assembler or linker into the disassembler, it will read the header information to tell apart data (i.e. constants) and code, get names of symbols (and therefore, labels) and can even tell which instructions were generated from which assembly source file, if it was assembled with debug information (i.e. the “-g” flag was used). If you disassemble a binary flash image, the disassembler doesn’t have all that information and will produce a much less readable output and attempt to decode data bytes as instructions.

The disassembler from binutils is called “objdump”. Invoking it on the blinker looks like this:
<pre>
$ arm-none-eabi-objdump -d -s prog1.elf

prog1.elf: file format elf32-littlearm

Contents of section .VectorTable:
8000000 00040020 ed000008 ed010008 ed010008 ... ............
8000010 ed010008 ed010008 ed010008 00000000 ................
8000020 00000000 00000000 00000000 ed010008 ................
8000030 ed010008 00000000 ed010008 49010008 ............I...
8000040 ed010008 ed010008 ed010008 ed010008 ................
8000050 ed010008 ed010008 ed010008 ed010008 ................
8000060 ed010008 ed010008 ed010008 ed010008 ................
8000070 ed010008 ed010008 ed010008 ed010008 ................
8000080 ed010008 ed010008 ed010008 ed010008 ................
8000090 ed010008 ed010008 ed010008 ed010008 ................
80000a0 ed010008 ed010008 ed010008 ed010008 ................
80000b0 ed010008 ed010008 ed010008 ed010008 ................
80000c0 ed010008 ed010008 ed010008 ed010008 ................
80000d0 ed010008 ed010008 ed010008 ed010008 ................
80000e0 ed010008 ed010008 ed010008 ............
Contents of section .text:
80000ec 0f481049 104a03e0 52f8043b 40f8043b .H.I.J..R..;@..;
80000fc 8842f9d3 0d480e49 4ff00002 01e040f8 .B...H.IO.....@.
800010c 042b8842 fbd300f0 47f800f0 4bf84ff4 .+.B....G...K.O.
800011c 80710848 01600848 00f058f8 30bffde7 .q.H.`.H..X.0...
800012c 00040020 04040020 f0010008 04040020 ... ... .......
800013c 08040020 10080140 60e31600 4ff0e020 ... ...@`...O..
800014c 006910f4 803f1dd0 1a480178 b1f1110f .i...?...H.x....
800015c 18d21948 0378013b 1cbf0370 12e001f1 ...H.x.;...p....
800016c 0101b1f1 110f02d2 144a535c 03701148 .........JS\.p.H
800017c 017011f0 01010cbf 4ff48071 4ff08071 .p......O..qO..q
800018c 0f480160 70470202 02020202 05020502 .H.`pG..........
800019c 05020202 02020200 0a490868 40f00400 .........I.h@...
80001ac 08607047 08490868 20f00f00 40f00200 .`pG.I.h ...@...
80001bc 08607047 04040020 00040020 92010008 .`pG... ... ....
80001cc 10080140 18100240 04080140 4ff0e021 ...@...@...@O..!
80001dc 48614ff0 00008861 4ff00700 08617047 HaO....aO....apG
80001ec 00befde7 ....
Contents of section .data:
20000400 02000000 ....
Contents of section .ARM.attributes:
0000 41200000 00616561 62690001 16000000 A ...aeabi......
0010 05436f72 7465782d 4d330006 0a074d09 .Cortex-M3....M.
0020 02 .
Contents of section .debug_line:
0000 98000000 02001e00 00000201 fb0e0d00 ................
0010 01010101 00000001 00000100 70726f67 ............prog
0020 312e5300 00000000 000502ec 00000803 1.S.............
0030 15012121 22212f2f 21222121 30212f21 ..!!"!//!"!!0!/!
0040 222f302f 21232130 21036120 2f2f362f "/0/!#!0!.a //6/
0050 030c2e32 030a2e2f 212f2222 222f2221 ...2.../!/"""/"!
0060 21222121 222f2f22 21212321 222f212f !"!!"//"!!#!"/!/
0070 30212303 0d9e2121 2f212421 212f2f21 0!#...!!/!$!!//!
0080 03422035 030c2e03 0d2e0311 2e36030b .B 5.........6..
0090 2e30212f 222f2202 01000101 3b000000 .0!/"/".....;...
00a0 02002400 00000201 fb0e0d00 01010101 ..$.............
00b0 00000001 00000100 76656374 6f727461 ........vectorta
00c0 626c652e 53000000 00000005 02ec0100 ble.S...........
00d0 0803d000 01210201 000101 .....!.....
Contents of section .debug_info:
0000 22000000 02000000 00000401 00000000 "...............
0010 ec000008 ec010008 00000000 08000000 ................
0020 12000000 01802200 00000200 14000000 ......".........
0030 04019c00 0000ec01 0008f001 00082100 ..............!.
0040 00000800 00001200 00000180 ............
Contents of section .debug_abbrev:
0000 01110010 06110112 01030e1b 0e250e13 .............%..
0010 05000000 01110010 06110112 01030e1b ................
0020 0e250e13 05000000 .%......
Contents of section .debug_aranges:
0000 1c000000 02000000 00000400 00000000 ................
0010 ec000008 00010000 00000000 00000000 ................
0020 1c000000 02002600 00000400 00000000 ......&.........
0030 ec010008 04000000 00000000 00000000 ................
Contents of section .debug_str:
0000 70726f67 312e5300 2f746d70 2f746573 prog1.S./tmp/tes
0010 7400474e 55204153 20322e32 392e3531 t.GNU AS 2.29.51
0020 00766563 746f7274 61626c65 2e5300 .vectortable.S.

Disassembly of section .text:

080000ec <Reset_Handler>:
80000ec: 480f ldr r0, [pc, #60] ; (800012c <Reset_Handler+0x40>)
80000ee: 4910 ldr r1, [pc, #64] ; (8000130 <Reset_Handler+0x44>)
80000f0: 4a10 ldr r2, [pc, #64] ; (8000134 <Reset_Handler+0x48>)
80000f2: e003 b.n 80000fc <Reset_Handler+0x10>
80000f4: f852 3b04 ldr.w r3, [r2], #4
80000f8: f840 3b04 str.w r3, [r0], #4
80000fc: 4288 cmp r0, r1
80000fe: d3f9 bcc.n 80000f4 <Reset_Handler+0x8>
8000100: 480d ldr r0, [pc, #52] ; (8000138 <Reset_Handler+0x4c>)
8000102: 490e ldr r1, [pc, #56] ; (800013c <Reset_Handler+0x50>)
8000104: f04f 0200 mov.w r2, #0
8000108: e001 b.n 800010e <Reset_Handler+0x22>
800010a: f840 2b04 str.w r2, [r0], #4
800010e: 4288 cmp r0, r1
8000110: d3fb bcc.n 800010a <Reset_Handler+0x1e>
8000112: f000 f847 bl 80001a4 <EnableClockGPIOA>
8000116: f000 f84b bl 80001b0 <ConfigurePA8>
800011a: f44f 7180 mov.w r1, #256 ; 0x100
800011e: 4808 ldr r0, [pc, #32] ; (8000140 <Reset_Handler+0x54>)
8000120: 6001 str r1, [r0, #0]
8000122: 4808 ldr r0, [pc, #32] ; (8000144 <Reset_Handler+0x58>)
8000124: f000 f858 bl 80001d8 <StartSysTick>
8000128: bf30 wfi
800012a: e7fd b.n 8000128 <Reset_Handler+0x3c>
800012c: 20000400 .word 0x20000400
8000130: 20000404 .word 0x20000404
8000134: 080001f0 .word 0x080001f0
8000138: 20000404 .word 0x20000404
800013c: 20000408 .word 0x20000408
8000140: 40010810 .word 0x40010810
8000144: 0016e360 .word 0x0016e360

08000148 <SysTick_Handler>:
8000148: f04f 20e0 mov.w r0, #3758153728 ; 0xe000e000
800014c: 6900 ldr r0, [r0, #16]
800014e: f410 3f80 tst.w r0, #65536 ; 0x10000
8000152: d01d beq.n 8000190 <SysTick_Handler+0x48>
8000154: 481a ldr r0, [pc, #104] ; (80001c0 <ConfigurePA8+0x10>)
8000156: 7801 ldrb r1, [r0, #0]
8000158: f1b1 0f11 cmp.w r1, #17
800015c: d218 bcs.n 8000190 <SysTick_Handler+0x48>
800015e: 4819 ldr r0, [pc, #100] ; (80001c4 <ConfigurePA8+0x14>)
8000160: 7803 ldrb r3, [r0, #0]
8000162: 3b01 subs r3, #1
8000164: bf1c itt ne
8000166: 7003 strbne r3, [r0, #0]
8000168: e012 bne.n 8000190 <SysTick_Handler+0x48>
800016a: f101 0101 add.w r1, r1, #1
800016e: f1b1 0f11 cmp.w r1, #17
8000172: d202 bcs.n 800017a <SysTick_Handler+0x32>
8000174: 4a14 ldr r2, [pc, #80] ; (80001c8 <ConfigurePA8+0x18>)
8000176: 5c53 ldrb r3, [r2, r1]
8000178: 7003 strb r3, [r0, #0]
800017a: 4811 ldr r0, [pc, #68] ; (80001c0 <ConfigurePA8+0x10>)
800017c: 7001 strb r1, [r0, #0]
800017e: f011 0101 ands.w r1, r1, #1
8000182: bf0c ite eq
8000184: f44f 7180 moveq.w r1, #256 ; 0x100
8000188: f04f 7180 movne.w r1, #16777216 ; 0x1000000
800018c: 480f ldr r0, [pc, #60] ; (80001cc <ConfigurePA8+0x1c>)
800018e: 6001 str r1, [r0, #0]
8000190: 4770 bx lr

08000192 <BlinkTable>:
8000192: 0202 0202 0202 0205 0205 0205 0202 0202 ................
80001a2: .

080001a3 <BlinkTableEnd>:
...

080001a4 <EnableClockGPIOA>:
80001a4: 490a ldr r1, [pc, #40] ; (80001d0 <ConfigurePA8+0x20>)
80001a6: 6808 ldr r0, [r1, #0]
80001a8: f040 0004 orr.w r0, r0, #4
80001ac: 6008 str r0, [r1, #0]
80001ae: 4770 bx lr

080001b0 <ConfigurePA8>:
80001b0: 4908 ldr r1, [pc, #32] ; (80001d4 <ConfigurePA8+0x24>)
80001b2: 6808 ldr r0, [r1, #0]
80001b4: f020 000f bic.w r0, r0, #15
80001b8: f040 0002 orr.w r0, r0, #2
80001bc: 6008 str r0, [r1, #0]
80001be: 4770 bx lr
80001c0: 20000404 .word 0x20000404
80001c4: 20000400 .word 0x20000400
80001c8: 08000192 .word 0x08000192
80001cc: 40010810 .word 0x40010810
80001d0: 40021018 .word 0x40021018
80001d4: 40010804 .word 0x40010804

080001d8 <StartSysTick>:
80001d8: f04f 21e0 mov.w r1, #3758153728 ; 0xe000e000
80001dc: 6148 str r0, [r1, #20]
80001de: f04f 0000 mov.w r0, #0
80001e2: 6188 str r0, [r1, #24]
80001e4: f04f 0007 mov.w r0, #7
80001e8: 6108 str r0, [r1, #16]
80001ea: 4770 bx lr

080001ec <Default_Handler>:
80001ec: be00 bkpt 0x0000
80001ee: e7fd b.n 80001ec <Default_Handler>
</pre>

This is a lot of information. The “-d” flag tells objdump to disassemble code sections, and the “-s” flag lets it output the raw contents of the sections. At first, it prints the contents of “.VectorTable”. Each line is prefixed with the address of where this data is found in memory. Then, the 32-bit data blocks from the vector table are output. The disassembler prints the bytes in the order they appear in memory, which, since the Cortex-M3 uses little endian, is reversed - for example, the printed “ed000008” actually refers to the address “0x080000ed”, which is the address of the “Reset_Handler” (0x080000ec) with the lowest bit set to one, as it is a Thumb function. Most of the entries in the vector table contain the address of the default handler, 0x080001ec (again with the lowest bit set), except for the initial stack pointer in the first entry, the Reset_Handler, the zero-entries and the SysTick_Handler. The contents of the “.text” section are the hexadecimal representation of the machine code, and hardly readable. The “.data” section contains a single “two” - this is the “02” put into “TimerEvents”. The contents of “.ARM.attributes” and the various “.debug” sections are not very interesting, as they do not end up on the controller, and are only read by the various analysis tools to provide nicer output.
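
Decoding such a vector table entry by hand is error-prone, so here is a small C sketch of the conversion. The helper names “read_le32” and “thumb_target” are made up for this example.

<source lang=c>
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes in memory order (little endian):
   the first byte in memory is the least significant one. */
uint32_t read_le32(const uint8_t b[4]) {
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* A vector table entry has the lowest bit set to mark a Thumb function;
   clearing it yields the address of the first instruction. */
uint32_t thumb_target(uint32_t vector_entry) {
    return vector_entry & ~1u;
}
</source>

Feeding the bytes ed 00 00 08 into “read_le32” gives 0x080000ed, and “thumb_target” turns that into 0x080000ec, which is the address shown for “Reset_Handler” in the disassembly.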

After that comes the actual disassembly. This is a list of all the instructions in the code section. The list is grouped by the symbols found in the input file. For C code, each symbol usually matches one function, so each block in the disassembly represents one C function. In assembly code, if you put non-local labels into a function, that function will be split into multiple blocks by the disassembler, making it harder to read - the main reason for using local labels. Each instruction is translated into one line inside the blocks. The first column is the address where that instruction is found. The next column contains the hexadecimal representation of the 2 or 4 bytes that make up the machine code of that instruction, i.e. the actual content of flash memory. After that comes a textual representation of that instruction, as inferred by the disassembler. If the instruction contains some number, the disassembler sometimes outputs a semicolon followed by some interpretation of that number. If the instruction employs PC-relative addressing, that interpretation will be the absolute address. As many instructions have multiple spellings, there can be discrepancies between the original code and the disassembly. The disassembler will also output data, such as the “BlinkTable” and the literal pools, as such. Using the “.type” directive is helpful in that case so that the disassembler does not attempt to interpret the data bytes as code.
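
The absolute address printed after the semicolon for PC-relative loads can be reproduced with the following C sketch. It relies on two properties of Thumb code: the PC reads as the address of the current instruction plus 4, and for literal loads this value is rounded down to a multiple of 4 before the offset is added. The function name “literal_address” is invented for this example.

<source lang=c>
#include <stdint.h>

/* Absolute address accessed by a Thumb "ldr rX, [pc, #offset]" instruction
   located at instr_addr. */
uint32_t literal_address(uint32_t instr_addr, uint32_t offset) {
    uint32_t pc = instr_addr + 4u;  /* PC is 4 bytes ahead in Thumb state */
    return (pc & ~3u) + offset;     /* base is rounded down to a word boundary */
}
</source>

For the first instruction of “Reset_Handler” at 0x080000ec with an offset of 60, this gives 0x0800012c. For the second one at 0x080000ee with an offset of 64, the rounding matters and the result is 0x08000130, matching the disassembler output.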

objdump can also be used to disassemble raw binary files that can be obtained by reading back the flash memory of some controller. To do this, use this command line:

<source lang=sh>
$ arm-none-eabi-objdump -b binary -m arm -D prog1.bin -Mforce-thumb --adjust-vma=0x08000000
</source>

The address of the binary in flash memory is specified so that the printed instruction addresses are correct. However, as the disassembler can’t tell data and code apart, the result will be of limited use. If you have to analyze a binary without having an ELF file or the source code, a more sophisticated disassembler such as IDA Pro is helpful. If you have the code and only need the disassembler to identify potential problems with the project (esp. the linker script), objdump is usually sufficient.
==== readelf ====
The “readelf” program is a powerful utility that can read and output various information from ELF files. The most useful option is the “-S” flag, which lets readelf print a summary of the sections in the respective file, e.g.:

<pre>
$ arm-none-eabi-readelf -S prog1.elf
There are 15 section headers, starting at offset 0x11268:

Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .VectorTable PROGBITS 08000000 010000 0000ec 00 A 0 0 1
[ 2] .text PROGBITS 080000ec 0100ec 000104 00 AX 0 0 4
[ 3] .stack NOBITS 20000000 020000 000400 00 WA 0 0 1
[ 4] .data PROGBITS 20000400 010400 000004 00 WA 0 0 1
[ 5] .bss NOBITS 20000404 010404 000004 00 WA 0 0 1
[ 6] .ARM.attributes ARM_ATTRIBUTES 00000000 010404 000021 00 0 0 1
[ 7] .debug_line PROGBITS 00000000 010425 0000db 00 0 0 1
[ 8] .debug_info PROGBITS 00000000 010500 00004c 00 0 0 1
[ 9] .debug_abbrev PROGBITS 00000000 01054c 000028 00 0 0 1
[10] .debug_aranges PROGBITS 00000000 010578 000040 00 0 0 8
[11] .debug_str PROGBITS 00000000 0105b8 00002f 01 MS 0 0 1
[12] .symtab SYMTAB 00000000 0105e8 0006a0 10 13 45 4
[13] .strtab STRTAB 00000000 010c88 000550 00 0 0 1
[14] .shstrtab STRTAB 00000000 0111d8 000090 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
y (purecode), p (processor specific)
</pre>

For each section, one line is output. The sections “.strtab”, “.shstrtab”, “.symtab” and “NULL” are an integral part of ELF and always present. The “.debug” sections are present if the source was assembled with the “-g” flag. The “.ARM.attributes” section defines for which ARM processor the contained code was translated. These sections don’t end up on the microcontroller. The remaining sections were defined in the linker script: “.VectorTable” contains the addresses of the exception handlers, “.text” contains the program code and constant data for flash memory, “.stack” the stack in RAM, “.data” contains variables in RAM and “.bss” contains zero-initialized variables in RAM. For these sections, the column “Type” contains either “PROGBITS” or “NOBITS” that tells you whether the section in the ELF file actually contains some data - this is only the case for “.VectorTable”, “.text” and “.data”. The sections “.bss” and “.stack” only reserve memory that is written at runtime, but the ELF file doesn’t contain data to be written in these sections. The column “Addr” defines where this section begins in the address space. The most useful column is “Size”: If you sum up the sizes of the sections “.VectorTable”, “.text” and “.data”, you obtain the used flash memory. By summing up “.data”, “.stack” and “.bss”, you get the used amount of RAM. Note that “.data” is counted twice, as the initialization data is stored in flash.
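
As a worked example, the sums for the output shown above can be calculated with a few lines of C. The structure and function names are made up for this sketch; the sizes are taken from the “Size” column of the readelf output.

<source lang=c>
#include <stdint.h>

/* Section sizes in bytes, as printed in the "Size" column by readelf */
typedef struct {
    uint32_t vector_table;
    uint32_t text;
    uint32_t stack;
    uint32_t data;
    uint32_t bss;
} SectionSizes;

/* Flash holds the vector table, the code and the initial values of ".data" */
uint32_t flash_usage(const SectionSizes *s) {
    return s->vector_table + s->text + s->data;
}

/* RAM holds ".data", the stack and ".bss" */
uint32_t ram_usage(const SectionSizes *s) {
    return s->data + s->stack + s->bss;
}
</source>

With the values 0xec, 0x104, 0x400, 0x4 and 0x4 from the example, the program occupies 500 bytes of flash and 1032 bytes of RAM.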
==== nm ====
The “nm” utility prints the symbols defined in an ELF file, for example:

<pre>
$ arm-none-eabi-nm prog1.elf
080001ec W ADC1_2_IRQHandler
20000404 b BlinkStep
08000192 t BlinkTable
080001a3 t BlinkTableEnd
20000408 B _BssEnd
20000404 B _BssStart
…
</pre>

This can be helpful in analyzing errors in linker scripts where symbols might get assigned wrong addresses.
==== addr2line ====
The “addr2line” utility reads the debug information from an ELF file to determine which line in which source file produced the instruction found at a particular given address. For example:

<source lang=sh>
$ arm-none-eabi-addr2line 0x080000f0 -e prog1.elf
/tmp/test/prog1.S:24
</source>

Here, line 24 of “prog1.S” contains the assembler command that produced the instruction that ends up at address 0x080000f0.

==== objcopy ====
The “objcopy” utility allows you to translate program files between different formats. It is useful to convert the ELF files to both the Intel Hex format and a simple binary representation. For example,

<source lang=sh>
arm-none-eabi-objcopy -O ihex prog1.elf prog1.hex
</source>

produces a “.hex” file that contains an image of the flash contents in hexadecimal form. With

<source lang=sh>
arm-none-eabi-objcopy -O binary prog1.elf prog1.bin
</source>

a binary file is created which contains an exact 1:1 image of the flash contents. Some flashing tools require these formats instead of ELF, and viewing the binary file with a hex editor can be interesting as well.

=== Interfacing C and C++ code ===
Since assembly is rarely used to implement entire complex projects, but mostly for a few time-critical or especially low-level routines that are part of larger code bases written in a high-level language, interfacing C and assembly code is an important topic, which will be covered here. While it is possible to write the main project structure in assembly and integrate some C modules, it is usually done the other way round. Most of the code shown is already ready to be included in C programs. Most of this topic works the same way for C++, apart from C++ exceptions (not to be confused with ARM processor exceptions) - but these are rarely used on embedded targets anyway.

If you compile C, C++ and assembly code into individual .o object files, you can link these together using “ld” as before. However, C and C++ code usually requires access to the respective standard library, and “ld” doesn’t link these by default - therefore it is necessary to replace the call to “ld” with a call to “gcc” or “g++” for C or C++, respectively. This will call “ld” internally and pass the required libraries.
==== Environment setup for C and C++ ====
Many C projects use a reset handler and vector table implemented in assembly, although writing them in C is possible too. As required by the C standard, C programs start with the “main()” function, so the (assembly) reset handler should set up the environment such that it is ready for C, and then call “main”. The C code might then later call some assembly functions or inline assembly. When using C++ code, or some GCC extension for C code, it is required to call some additional functions before calling “main”. This is used by C++ to call the constructors of global objects. The C and C++ compilers emit a table of function pointers to functions that should be called at startup. This table has to be put into flash memory by modifying the linker script as follows:

<source>
.text : {
*(.text)
. = ALIGN(4);

_InitArrayStart = .;
*(SORT(.preinit_array*))
*(SORT(.init_array*))
_InitArrayEnd = .;
} >FLASH
</source>

The table of function pointers is sorted to keep the order needed by the compiler. The symbols “_InitArrayStart” and “_InitArrayEnd” mark beginning and end of that table. A reset handler that performs the memory initialization as before and calls the table of initialization functions could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

ldr r4, =_InitArrayStart
ldr r5, =_InitArrayEnd

b 2f
1: ldr r0, [r4], #4
blx r0
2: cmp r4, r5
blo 1b

bl main
1: bkpt
b 1b
.ltorg
</source>

Note that for iterating the table, registers r4 and r5 are used, since the calling convention requires the called functions to preserve those. The “blx” instruction is needed to perform the indirect function call. When everything is set up, the main function is called. For embedded programs, the main function should never return (i.e. contain an endless loop). If it does, that’s an error, and to make it easier to find, an endless loop with a forced breakpoint is put right after the call to “main”.
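
The loop over the table of initialization functions can also be expressed in C, which may help to understand what the assembly code does. The following is a sketch that runs on a PC; all names in it are invented for illustration, and the two “demo” functions merely stand in for the entries the compiler would generate.

<source lang=c>
#include <stddef.h>

typedef void (*InitFunc)(void);

/* Call every function pointer in [start, end), in table order.
   On the target, start and end would be _InitArrayStart and _InitArrayEnd. */
void call_init_array(const InitFunc *start, const InitFunc *end) {
    for (const InitFunc *p = start; p < end; ++p) {
        (*p)();   /* ldr r0, [r4], #4 ; blx r0 */
    }
}

/* Demo only: two stand-in "constructors" that record the order of calls */
static int demo_order;
static void demo_ctor_a(void) { demo_order = demo_order * 10 + 1; }
static void demo_ctor_b(void) { demo_order = demo_order * 10 + 2; }

/* Returns 12 if demo_ctor_a ran first and demo_ctor_b second */
int run_demo(void) {
    static const InitFunc table[] = { demo_ctor_a, demo_ctor_b };
    demo_order = 0;
    call_init_array(table, table + 2);
    return demo_order;
}
</source>

The functions are called strictly in the order in which they appear in the table, which is why the linker script sorts the entries.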
==== Calling functions ====
To call assembly functions from C code and vice-versa, the assembly functions should observe the calling convention, as mentioned before. C functions can be called just like assembly functions from assembly code, by placing the parameters in registers r0-r3 and on the stack, calling the function using “bl” and retrieving the return value from r0. To call an assembly function from C code, you need to declare it in C first just like a C function. For example, to call a function that takes 2 integer arguments and returns an integer:

<source lang=c>
int AssemblyFunction (int a, int b);
</source>

If you now define a function named “AssemblyFunction” in your assembly code and export it via “.global”, you can call it from C code just like any function.
==== Accessing global variables ====
Global variables defined in C can be accessed from assembly code just like variables defined in assembly code, by using the variable’s name. To access an assembly variable from C code, you need to declare it first by specifying the type. For example, to declare an integer variable:

<source lang=c>
extern int AssemblyVariable;
</source>

If you now define a variable named “AssemblyVariable” in your assembly code and export it via “.global”, you can access it from C code just like any variable. The “extern” is required to make sure the C code doesn’t attempt to define another variable of the same name.
=== Clock configuration ===
By default, STM32 controllers use an internal RC-oscillator with 8 MHz as a clock source for the core and peripherals. This oscillator is too inaccurate for implementing a clock or using serial interfaces such as UART, USB or CAN. To obtain a more accurate clock, an external quartz crystal is usually used. Many STM32 boards feature an 8 MHz crystal. To use it, some initialization code is required that activates the microcontroller’s built-in crystal-oscillator circuit and switches the clock input to that. The STM32 controllers also include a PLL which can multiply some input clock by a configurable factor before feeding it to the processor core and peripherals. This way, a precise and fast clock can be achieved - the STM32F103 supports up to 72 MHz core frequency. Unfortunately, flash memory is not capable of keeping up with such a high frequency. Therefore, when enabling a fast clock, the flash memory needs to be configured to use wait states depending on the frequency.

The following function configures the flash wait states, enables the crystal oscillator, configures the PLL to multiply the input clock by a factor of 9, and uses the result as the system clock. The prescaler for the internal bus APB1 is set to 2. Assuming an 8 MHz crystal, this achieves the maximum performance possible with this microcontroller - 72 MHz for the core and APB2 domain, 36 MHz for APB1. If a different crystal is used, the PLL factors have to be adjusted.

<source lang=asm>
RCC = 0x40021000

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

.type ConfigureSysClock, %function
.global ConfigureSysClock
ConfigureSysClock:
@ Turn on HSE
ldr r0, =RCC
ldr r1, =((1 << RCC_CR_HSION)|(1 << RCC_CR_HSEON))
str r1, [r0, #RCC_CR]

@ Configure (but not start yet) PLL
@ Mul = 9, Prediv = 1, APB1 Prescaler = 2, APB2 Prescaler = 1, AHB Prescaler = 1
ldr r2, =(((9-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_USBPRE)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
orr r1, #(1 << RCC_CR_PLLON)

@ Wait for HSE ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_HSERDY)
beq 1b

@ Turn on PLL
str r1, [r0, #RCC_CR]

@ Pre-Calculate value for RCC_CFGR
orr r2, #(2 << RCC_CFGR_SW)

@ Wait for PLL ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_PLLRDY)
beq 1b

@ Set flash wait states to 2
ldr r0, =FLASH
ldr r3, =((1<<FLASH_ACR_PRFTBE)|(2<<FLASH_ACR_LATENCY))
str r3, [r0, #FLASH_ACR]
ldr r0, =RCC

@ Switch system clock to PLL
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
bic r1, #(1 << RCC_CR_HSION)

@ Wait for switch to PLL
1: ldr r3, [r0, #RCC_CFGR]
and r3, #(3 << RCC_CFGR_SWS)
cmp r3, #(2 << RCC_CFGR_SWS)
bne 1b

@ Turn off HSI to save power
str r1, [r0, #RCC_CR]

bx lr
.ltorg
</source>

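As a hedged illustration of adjusting the PLL factor for a different crystal: the PLLMUL field holds the multiplication factor minus 2, as the expression “(9-2)” in the function above shows. Assuming a hypothetical board with a 12 MHz crystal, a factor of 6 would again yield 72 MHz. The symbol names HSE_FREQ, PLL_FACTOR and SYSCLK_FREQ are made up for this sketch and are not part of the example repository; the “ldr” line would take the place of the corresponding “ldr r2” in ConfigureSysClock and relies on the RCC_CFGR_* symbols defined there.

<source lang=asm>
HSE_FREQ = 12000000 @ Assumed crystal frequency in Hz
PLL_FACTOR = 6 @ Chosen such that HSE_FREQ * PLL_FACTOR = 72 MHz
SYSCLK_FREQ = HSE_FREQ * PLL_FACTOR @ Evaluated by the assembler, not by the processor

@ Let the assembler reject factors that would overclock the controller
.if SYSCLK_FREQ > 72000000
.error "System clock exceeds the 72 MHz limit of the STM32F103"
.endif

@ PLL source and APB1 prescaler as in ConfigureSysClock; only the factor differs.
@ The USB prescaler bit is left at 0 in this sketch.
ldr r2, =(((PLL_FACTOR-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
</source>
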
Many projects perform the clock configuration in the reset handler before calling the main function. If you want to follow that practice, place a “bl ConfigureSysClock” as the first instruction in the “Reset_Handler” - this way, all the remaining setup will run with the higher clock frequency, making start-up faster. This and the completed startup code from the previous chapters are implemented in the “startup.S” file in the example repository. If you use it, put your code in the “main” function, where RAM and system clock will already be initialized. This is shown in the “BlinkStartup” example.
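
A minimal sketch of that arrangement is shown below. It is not the actual “startup.S” from the repository: the RAM initialization is only indicated by a comment, and it is assumed that “ConfigureSysClock” and “main” are defined elsewhere in the project.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl ConfigureSysClock @ Switch to the fast clock first, so the rest of the startup runs faster
@ ... RAM initialization as described in the previous chapters goes here ...
bl main @ Call the application code
b . @ Endless loop in case main ever returns
</source>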

=== Project template & makefile ===
To quickly start your own project, a project template is supplied in the examples repository under the directory [https://github.com/Erlkoenig90/ArmAsmTutorial/tree/master/ProjectTemplate-STM32F103RB ProjectTemplate-STM32F103RB]. Put your own application code in the program.S file. The startup.S and vectortable.S contain the reset handler with RAM initialization and the vector table with default handler, respectively. A linker script is included too.

The project also contains a makefile. This allows you to quickly translate your project without having to type the assembler and linker commands. Simply type
<source lang=sh>
make
</source>
to translate the code and produce the program.elf, program.bin and program.hex files. All “.S” files in the directory will be translated automatically. Writing makefiles is a complex topic of its own with a lot of information already available on the web, so it is not explained further here.

ARM-ASM-Tutorial

2021-07-08T13:47:18Z

Nicolas:

[[Kategorie:ARM]][[Kategorie:STM32]][[Kategorie:Entwicklungstools]][[Kategorie:Programmiersprachen]]The [[ARM]] processor architecture is widely used in all kinds of industrial applications and also a significant number of hobby and maker projects. This tutorial aims to teach the fundamentals of programming ARM processors in assembly language.

Tutorial by [[Benutzer:Erlkoenig|Niklas Gürtler]]. [https://www.mikrocontroller.net/topic/482409 Thread in Forum] for feedback and questions.

== Introduction ==
=== Why assembly? ===
Today, there is actually little reason to use assembly language for entire projects, because high-quality optimizing compilers for high-level languages (especially C and C++) are readily available as free open source software and because the ARM architecture is specifically optimized for high-level languages. However, knowledge of assembly is still useful for debugging certain problems, writing low-level software such as bootloaders and operating system kernels, and reverse engineering software for which no source code is available. Occasionally it is necessary to manually optimize some performance-critical code section. Sometimes claims are made that ARM processors can’t be programmed in assembly. Therefore, this tutorial will demonstrate that this is very well possible by showing how to write complete (small) applications entirely in ARM assembly language!

As most of the resources and tools for ARM focus on C programming and because of the complexity of the ARM ecosystem, the largest difficulty in getting started with ARM assembly is not the language itself, but rather using the tools correctly and finding relevant documentation. Therefore, this tutorial will focus on the development environment and how the written assembly code is transformed into the final program. With a good understanding of the environment, all the ARM instructions can be learned simply by reading the architecture documentation.

Because of the complex ecosystem around ARM, a general introduction to the ARM processor market is necessary.

=== About ARM ===
Arm Holdings is the company behind the ARM architecture. Arm does not manufacture any processors themselves, but designs the “blueprints” for processor cores, which are then licensed by various semiconductor companies such as ST, TI, NXP and many others, who combine the processor with various support hardware (most notably flash and RAM memories) and peripheral modules to produce a final complete processor IC. Some of these peripheral modules are even licensed from other companies – for example, the USB controller modules by Synopsys are found in many different processors from various manufacturers.

Because of this licensing model, ARM processor cores are found in a very large variety of products for which software can be developed using a single set of tools (especially compiler, assembler and debugger). This makes knowledge about the ARM architecture, particularly the ARM assembly language, useful for a large range of applications.

Since the ARM processor cores always require additional hardware modules to function, both the ARM-made processor core and the manufacturer-specific periphery modules have to be considered when developing software for ARM systems. For example, the instruction set is defined by ARM and software tools (compiler, assembler) need to be configured for the correct instruction set version, while the clock configuration is manufacturer-specific and needs to be addressed by initialization code specifically made for one processor.

=== Architecture and processor variants ===
A processor’s architecture defines the interface between hardware and software. Its most important part is the instruction set, but it also defines e.g. hardware behavior under exceptional circumstances (e.g. memory access errors, division by zero, etc.). Processor architectures evolve, so they have multiple versions and variants. They also define optional functionality that may or may not be present in a processor (e.g. a floating-point unit). For ARM, the architectures are documented exhaustively in the “ARM Architecture Reference Manuals”.

While the architecture is an abstract concept, a processor core is a concrete definition of a processor (e.g. as a silicon layout or HDL) that implements a certain architecture. Code that only uses knowledge of the architecture (e.g. an algorithm that does not access any periphery) will run on any processor implementing this architecture. Arm, as mentioned, designs processor cores for their own architectures, but some companies develop custom processors that conform to an ARM architecture, for example Apple and Qualcomm.

ARM architectures are numbered, starting with ARMv1 up to the most recent ARMv8. ARMv6 is the oldest architecture still in significant use, while ARMv7 is the most widespread one. Suffixes are appended to the version to denote variants of the architecture; e.g. ARMv7-M is for small embedded systems, while ARMv7-A is for more powerful processors. ARMv7E-M adds digital signal processing capabilities including saturating and SIMD operations.

Older ARM processors are named ARM1, ARM2 …, while after ARM11 the name “Cortex” was introduced. The Cortex-M family, including e.g. Cortex-M3 and Cortex-M4 (implementing ARMv7-M and ARMv7E-M architecture, respectively) is designed for microcontrollers, where power consumption, memory size, chip size and latency are important. The Cortex-A family, including e.g. Cortex-A8 and Cortex-A17 (both implementing ARMv7-A architecture) is intended for powerful processors (called “application processors”) for e.g. multimedia and communication products, particularly smartphones and tablets. These processors have much more processing power, typically feature high-bandwidth interfaces to the external world, and are designed to be used with high-level operating systems, most notably Linux (and Android).

An overview of ARM processors and their implemented architecture version can be found on [https://en.wikipedia.org/wiki/ARM_architecture#Cores Wikipedia].
This tutorial will focus on the Cortex-M microcontrollers, as these are much easier to program without an operating system and because assembly language is less relevant on Cortex-A processors. However, the large range of ARM-based devices necessitates flexibility in the architecture specification and software tools, which sometimes complicates their use.

ARM processors actually have not just one, but three instruction sets:
* The “A32” instruction set for 32bit ARM architectures, also simply called the “ARM” instruction set, favors speed over program memory consumption. All instructions are 4 bytes in size.
* The “A64” instruction set is for the new 64bit ARM processors.
* The “T32” instruction set for 32bit ARM architectures, also known as “Thumb”, favors program memory consumption over speed. Most instructions are 2 bytes in size, and some are 4 bytes.

The 64bit Cortex-A application processors support all three instruction sets, while the 32bit ones support only A32 and T32. The Cortex-M microcontrollers only support T32. Therefore, this tutorial will only talk about “thumb2”, the second version of the “T32” instruction set.

== Prerequisites ==
First, suitable hardware and software need to be selected for demonstrating the usage of assembly language. For this tutorial, the choice of the specific microcontroller is of no great significance. However, to ensure that the example code is easily transferable to your setup, it is recommended to use the same components.

=== Microcontroller selection ===
For the microcontroller, an [https://www.st.com/en/microcontrollers-microprocessors/stm32f103c8.html STM32F103C8] or [https://www.st.com/en/microcontrollers-microprocessors/stm32f103rb.html STM32F103RB] by STMicroelectronics will be used. Both controllers are identical except for the flash size (64 KiB vs 128 KiB) and number of pins (48 vs 64). These controllers belong to ST’s “mainstream” entry-level family and are quite popular among hobbyist developers with many existing online resources. Several development boards with these controllers are available, for example: [https://www.st.com/en/evaluation-tools/nucleo-f103rb.html Nucleo-F103], “[https://os.mbed.com/users/hudakz/code/STM32F103C8T6_Hello/ Blue Pill]” (search for “stm32f103c8t6” on AliExpress, Ebay or Amazon), [https://www.olimex.com/Products/Duino/STM32/OLIMEXINO-STM32/open-source-hardware Olimexino-STM32], [https://www.olimex.com/Products/ARM/ST/STM32-P103/ STM32-P103], [https://www.olimex.com/Products/ARM/ST/STM32-H103/ STM32-H103], [https://www.st.com/en/evaluation-tools/stm3210e-eval.html STM3210E-EVAL].

=== Processor type & documentation ===
First, the microcontroller manufacturer’s documentation is used to find out what kind of ARM processor core and architecture is used for the chosen chip. This information is used to find all the relevant documentation.

* The first source of information is the [https://www.st.com/resource/en/datasheet/stm32f103rb.pdf STM32F103RB/C8 datasheet]. According to the headline, this is a '''medium-density''' device. This term is ST-specific and denotes a product family with certain features. The very first paragraph states that this microcontroller uses a '''Cortex-M3''' processor core with 72 MHz. This document also contains the electrical characteristics and pinouts.
* The next important document is the [https://www.st.com/resource/en/reference_manual/cd00171190.pdf STM32F103 reference manual] that contains detailed descriptions of the periphery. Particularly, detailed information about periphery registers and bits can be found here.
* The [https://developer.arm.com/ip-products/processors/cortex-m/cortex-m3 ARM developer website] provides information about the Cortex-M3 processor core, particularly the [https://static.docs.arm.com/100165/0201/arm_cortexm3_processor_trm_100165_0201_01_en.pdf ARM Cortex-M3 Processor Technical Reference Manual]. According to chapter 1.5.3, this processor implements the '''ARMv7-M architecture'''.
* The architecture is documented in the [https://static.docs.arm.com/ddi0403/ed/DDI0403E_d_armv7m_arm.pdf ARMv7-M Architecture Reference Manual]. Particularly, it contains the complete documentation of the instruction set.

For any serious STM32 development, you should be familiar with all these documents.

=== Debug adapter ===
There are many different ways of getting your program to run on an STM32 controller. A debug adapter is not only capable of writing software to the controller’s flash, but can also analyze the program’s behavior while it is running. This allows you to run the program one instruction at a time, analyze program flow and memory contents and find the cause of crashes. While it is not strictly necessary to use such a debugger, it can save a lot of time during development. Since entry-level models are available cheaply, not using one doesn’t even save money. Debuggers connect to a host PC via USB (some via Ethernet) and to the microcontroller (“target”) via JTAG or SWD. While these two interfaces are closely related and perform the same function, SWD uses fewer pins (2 instead of 4, excluding reset and ground). Most STM32 controllers support JTAG, and all support SWD.

Documenting all possible ways of flashing and debugging STM32 controllers is beyond the scope of this tutorial; a lot of information is already available online on that topic. Therefore, this tutorial will assume that the [https://www.st.com/en/development-tools/st-link-v2.html ST-Link] debug adapter by STMicroelectronics is used, which is cheap and popular among hobbyists. Some of the aforementioned boards even include an ST-Link adapter, which can also be used “stand-alone” to flash an externally connected microcontroller. The examples should work with other adapters as well; please consult the appropriate documentation on how to use them.

=== Development Software ===
On the software side, several tools are needed for developing microcontroller firmware. Using a complete Integrated Development Environment (IDE) saves time and simplifies repetitive steps but hides some important steps that are necessary to gain a basic understanding of the process. Therefore, this tutorial will show the usage of the basic command line tools to demonstrate the underlying principles. Of course, for productive development, using an IDE is a sensible choice. The tools presented will work on Windows, Linux and Mac OS X (untested).

First, a text editor for writing assembly code is needed. Any good editor such as Notepad++, gedit or Kate is sufficient. When using Windows, the [https://www.st.com/en/development-tools/stsw-link004.html ST-Link Utility] can be useful, but is not strictly required.

Next, an assembler toolchain is needed to translate the written assembly code into machine code. For this, the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm GNU Arm Embedded Toolchain] is used. This is a collection of open source tools for writing software in Assembly, C and C++ for Cortex-M microcontrollers. Even though the package is maintained by ARM, the software is created by a community of open-source developers. For this tutorial, only the contained applications “binutils” (includes assembler & linker) and “GDB” (debugger) are really needed, but if you later decide to work with C or C++ code, the contained compilers will come in handy. Apart from that, this package is also shipped as part of several IDEs such as SW4STM32, Atollic TrueSTUDIO, emIDE, Embedded Studio and even Arduino – so if you (later) wish to work with one of these, your assembly code will be compatible with it.

Another component is required to talk with the debug adapter. For the ST-Link, this is done by [http://openocd.org/ OpenOCD], which communicates with the adapter via USB. Other adapters such as the J-Link ship with their own software.

Lastly, a calculator that supports binary and hexadecimal modes can be very helpful. Both the default Gnome calculator and the Windows calculator (calc.exe) are suitable.

== Setup ==
Follow the instructions in the next chapters to set up your development environment.

=== Hardware ===
The only thing that needs to be done hardware-wise is connecting the debugger with your microcontroller. If you are using a development board with an integrated debugger (such as the Nucleo-F103), this is achieved by setting the jumpers accordingly (see the board’s documentation – for e.g. the Nucleo-F103, both “CN2” jumpers need to be connected). When using an external debugger, connect the “GND”, “JTMS/SWDIO” and “JTCK/SWCLK” pins of debugger and microcontroller. Connect the debugger’s “nRESET” (or “nTRST” if it only has that) pin to the microcontroller’s “NRST” input.

If your board has jumpers or solder bridges for the “BOOT0” pin, make sure that the pin is low. Applying power to the microcontroller board is typically done via USB.

=== Software ===
==== Linux ====
Some Linux distributions ship with packages for the ARM toolchain. Unfortunately, these are often outdated and also configured slightly differently from the aforementioned package maintained by ARM. Therefore, to be consistent with the examples, it is strongly recommended to use the package by ARM.

Download the Linux binary tarball from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and extract it to some directory whose path does not contain any spaces. The extracted directory contains a subdirectory called “bin”. Copy the full path to that directory (e.g. “/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin”).

Add this path to the “PATH” environment variable. On Ubuntu/Debian systems, this can be done via:

<source lang=sh>
echo 'export PATH="${PATH}:/home/user/gcc-arm-none-eabi-8-2019-q3-update/bin"' | sudo tee /etc/profile.d/gnu-arm-embedded.sh
</source>

OpenOCD can be installed via the package manager, e.g. (Ubuntu/Debian):
<source lang=sh>
sudo apt-get install openocd
</source>

After that, log out and back in (or just reboot). In a terminal, type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source lang=sh>
$ arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source lang=sh>
$ openocd -v
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

==== Windows ====
[[File:ArmAsmTutorial_GccOptions.png|300px|thumb|right|Options for installing GCC]]Download the Windows installer from the [https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads downloads page] and run it. Enable the options “Add path to environment variable” and “Add registry information”, and disable “Show Readme” and “Launch gccvar.bat”.

A Windows package for OpenOCD can be obtained from the [https://github.com/gnu-mcu-eclipse/openocd/releases gnu-mcu-eclipse downloads page]. Download the appropriate file, e.g. “gnu-mcu-eclipse-openocd-0.10.0-12-20190422-2015-win64.zip”. The archive contains a path like “GNU MCU Eclipse/OpenOCD/0.10.0-12-20190422-2015”. Extract the contents of the inner directory (i.e. the subdirectories “bin”, “doc”, “scripts”…) into some directory whose path does not contain any spaces, e.g. “C:\OpenOCD”. You should now have a directory “C:\OpenOCD\bin” or similar. Copy its full path.

[[File:ArmAsmTutorial_PcProperties.png|300px|thumb|right|Opening PC properties]][[File:ArmAsmTutorial_SetEnvVar.png|300px|thumb|right|Setting environment variable]]Set the “Path” environment variable to include this path: Right-Click on “This PC”, then “Properties” → “Advanced System Settings”→ “Environment Variables”. In the lower list (labeled “System variables”), select “Path”. Click “Edit” → “New”, paste the path, and click “OK” multiple times.

Open a ''new'' command window (Windows Key + R, type “cmd” + Return). Type <code>arm-none-eabi-as -version</code>. The output should look similar to this:
<source>
C:\>arm-none-eabi-as -version
GNU assembler (GNU Tools for Arm Embedded Processors 8-2019-q3-update) 2.32.0.20190703
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-none-eabi'.
</source>
Similarly, for <code>openocd -v</code>:
<source>
C:\>openocd -v
GNU MCU Eclipse OpenOCD, 64-bitOpen On-Chip Debugger 0.10.0+dev-00593-g23ad80df4 (2019-04-22-20:25)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
</source>

If an error message appears, the installation isn’t correct.

== Writing assembly applications ==
The full source code of the examples in the following chapters can be found on [https://github.com/Erlkoenig90/ArmAsmTutorial GitHub]. The name of the corresponding directory is given after each example code below.
=== First rudimentary program ===
After the software setup, you can begin setting up a first project. Create an empty directory for that, e.g. “prog1”.

Inside the project directory, create your first assembly file “prog1.S” (“.S” being the file name extension for assembly files in GNU context) with the following content:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

nop @ Do Nothing
b . @ Endless loop
</source>
Example name: “EmptyProgram”

When this file is sent to the assembler, it will translate the instructions into binary machine code, with 2 or 4 bytes per instruction. These bytes are concatenated to form a program image, which is later written into the controller’s flash memory. Therefore, assembly code more or less directly describes flash memory contents.

The lines starting with a dot “.” are assembler directives that control the assembler’s operation. Only some of those directives emit bytes that will end up in flash memory. The @ symbol starts a comment.

The first line lets the assembler use the new “unified” instruction syntax (“UAL” - Unified Assembler Language) instead of the old ARM syntax. The second line declares the used processor Cortex-M3, which the assembler needs to know in order to recognize the instructions available on that processor. The third line instructs the assembler to use the Thumb (T32) instruction set. We can’t start putting instructions in flash memory right away, as the processor expects a certain data structure to reside at the very beginning of the memory. This is what the “.word” and “.space” directives create. These will be explained later.
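
Without anticipating the later explanation, the layout these three directives produce can be worked out by counting bytes: two words of 4 bytes each plus 0xe4 bytes of padding put the first instruction at offset 0xec, i.e. at address 0x080000ec once the program is placed at the flash start address 0x08000000 by the linker call later in this chapter. The second word is that address plus one; on Cortex-M processors, bit 0 of such a code address marks Thumb code. The comments are annotations added for this sketch:

<source lang=asm>
.word 0x20000400 @ Offset 0x00: initial stack pointer
.word 0x080000ed @ Offset 0x04: address of the first instruction (0x080000ec) with bit 0 set
.space 0xe4 @ Offset 0x08 to 0xeb: 0xe4 bytes filled with zeros

nop @ Offset 0xec: first instruction, 2 bytes in size
b . @ Offset 0xee
</source>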

The first “real” instruction is “nop”, which will be the first instruction executed after the processor starts. “nop” is short for “No OPeration” - it causes the processor to do nothing and continue with the next instruction. This next instruction is “b .”. “b” is short for “branch” and instructs the processor to jump to a certain “target” location, i.e. execute the instruction at that target next. In assembly language, the dot “.” represents the current location in program memory. Therefore, “b .” instructs the processor to jump to this very instruction, i.e. execute it again and again in an endless loop. Such an endless loop is frequently found at the end of microcontroller programs, as it prevents the processor from executing random data that is located in flash memory after the program.
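
The same endless loop can also be written with a named label instead of the dot. This is a small sketch; the label name “Halt” is chosen arbitrarily for illustration:

<source lang=asm>
nop @ Do Nothing
Halt:
b Halt @ Jump to the label "Halt", i.e. to this very instruction
</source>

Both forms describe the same branch target, so which one to use is a matter of readability.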

To translate this assembly code, open a terminal (Linux) / command window (Windows). Enter the project directory by typing <code>cd <Path to Project Directory></code>. Call the assembler like this:

<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
</source>

This instructs the assembler to translate the source file “prog1.S” into an object file “prog1.o”. This is an intermediary file that contains binary machine code, but is not a complete program yet. The “-g” option tells the assembler to include debug information, which does not influence the program itself, but makes debugging easier. To turn this object file into a final program, call the linker like this:

<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -Ttext=0x8000000
</source>

This creates a file “prog1.elf” that contains the whole generated program. The “-Ttext” option instructs the linker to assume 0x8000000 as the start address of the flash memory. The linker might output a warning like this:

<pre>arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000</pre>

This is not relevant for executing the program without an operating system and can be ignored.

=== Flashing the program ===
To download the compiled application to the microcontroller that has been attached via ST-Link, use OpenOCD like so:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg -c "program prog1.elf verify reset exit"
</source>
Unfortunately, the application does not do anything that can be observed from the outside, except perhaps increase the current consumption.
=== Starting the debugger ===
To check whether the program is actually running, start a debugging session to closely observe the processor’s behavior. First, run OpenOCD such that it acts as a GDB server:
<source lang=sh>
openocd -f interface/stlink-v2.cfg -f target/stm32f1x.cfg
</source>
Then, open a new terminal/command window and start a GDB session:
<source lang=sh>
arm-none-eabi-gdb prog1.elf
</source>
GDB provides its own interactive text-based user interface. First, type this command to let GDB connect to the already running OpenOCD instance:
<source>
target remote :3333
</source>
Then, stop the currently running program:
<source>
monitor reset halt
</source>
If this fails, hold your board’s reset button just before executing the command and repeat until it succeeds. GDB can also download code to flash memory by simply typing:
<source>
load
</source>
This will overwrite the previously flashed program (which, in this case, is identical anyway). After loading the program, reset the controller again:
<source>
monitor reset halt
</source>
Now, examine the contents of the CPU registers:
<source>
info reg
</source>
The output should look something like
<pre>
r0 0x0 0
r1 0x0 0
r2 0x0 0
r3 0x0 0
r4 0x0 0
r5 0x0 0
r6 0x0 0
r7 0x0 0
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
sp 0x0 0x0
lr 0x0 0
pc 0x8000000 0x8000000 <_stack+133693440>
xPSR 0x1000000 16777216
msp 0x20000400 0x20000400
psp 0x27e3fa34 0x27e3fa34
primask 0x0 0
basepri 0x0 0
faultmask 0x0 0
control 0x0 0
</pre>

At this point, the processor is ready to start executing your program. The processor is halted just before the first instruction, which is “nop”. You can let the processor execute one single instruction (i.e. the “nop”) by typing
<source>
stepi
</source>
If you type <code>info reg</code> again, you will see that PC is now “0x80000ee”, i.e. the processor is about to execute the next instruction, “b .”. When you run <code>stepi</code> again (repeatedly), nothing more will happen – the controller is stuck in the mentioned endless loop, exactly as intended. You can instruct the processor to run the program continuously, without stopping after each instruction, by typing
<source>
continue
</source>
You can interrupt the running program by pressing “Ctrl+C”. Run the commands
<source>
kill
quit
</source>
to exit GDB. You can terminate OpenOCD by pressing “Ctrl+C” in its terminal.
=== Using processor registers ===
The example program hasn’t done anything useful, but any “real” program will need to process some data. On ARM, any data processing is done via the processor registers. The 32bit ARM platforms have 16 processor registers, each of which is 32bit in size. The last three of those (r13-r15) have a special meaning and can only be used with certain restrictions. The first thirteen (r0-r12) can be used freely by the application code for data processing.

All calculations (e.g. addition, multiplication, logical and/or) need to be performed on those processor registers. To process data from memory, it first has to be loaded into a register, then processed, and stored back into memory. This is typical for RISC platforms and is known as a “load-store-architecture”.
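
As a sketch of this principle, the following fragment increments a 32bit value stored in memory. The address 0x20000000 (the start of the STM32’s RAM) is chosen merely for illustration, and the “add” instruction as well as the memory form of “ldr” are used here without further introduction; the “ldr r0, =…” form is explained in the next paragraphs.

<source lang=asm>
ldr r0, =0x20000000 @ Put the address of the value into r0
ldr r1, [r0] @ Load: copy the value from memory into r1
add r1, r1, #1 @ Process: the calculation happens in a register
str r1, [r0] @ Store: write the result back to memory
</source>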

As the starting point for any calculation, some specific values need to be put into the registers. The easiest way to do that is:
<source lang=asm>
ldr r0, =123456789
</source>
The number 123456789 will be encoded as part of the program, and the instruction lets the processor copy it into the register “r0”. Any other 32bit number and any register in the range r0-r13 can be used instead.
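
A few variations as a sketch, using number notations accepted by the GNU assembler; the values themselves are arbitrary examples:

<source lang=asm>
ldr r1, =0xCAFE @ Hexadecimal notation
ldr r2, =0b10100101 @ Binary notation
ldr r3, =-1 @ Negative number, stored as 0xFFFFFFFF
ldr r12, =4294967295 @ Largest unsigned 32bit number, also 0xFFFFFFFF
</source>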

The instruction “mov” can be used to copy the contents from one register to another:
<source lang=asm>
mov r1, r0
</source>
This copies r0 to r1. Unlike on some other processor architectures, “mov” cannot be used to access memory, but only the processor registers.

In ARM, 32bit numbers are called "words" and are most frequently used. 16bit numbers are known as half-words, and 8bit numbers as bytes, as usual.
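
These sizes map directly to GNU assembler directives that place data in the program image. A short sketch with arbitrary values (“.word” was already used at the beginning of the example program; “.hword” and “.byte” are its 16bit and 8bit counterparts):

<source lang=asm>
.word 0x12345678 @ One word: 4 bytes
.hword 0x1234 @ One half-word: 2 bytes
.byte 0x12 @ One byte
.byte 0x34 @ Another byte, so that 8 bytes are emitted in total
</source>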
=== Accessing periphery ===
To write microcontroller programs that interact with the outside world, access to the controller’s periphery modules is required. Interaction with periphery happens mainly through periphery registers (also known as “special function registers”, SFR). Despite their name, they work quite differently from processor registers. Instead of numbers, they have addresses (in the range of 0x40000000-0x50000000) that are not contiguous (i.e. there are gaps). They cannot be directly used for data processing but need to be explicitly read and written before and after any calculations. Not all of them are 32bit; many have only 16bit, and some of those bits may not exist and can’t be accessed. The microcontroller manufacturer’s documentation uses names for these registers, but the assembler doesn’t know these. Therefore, the assembly code needs to use the numerical addresses.

The easiest way to get the microcontroller to do something that produces some visible result is to send a signal via an output pin to turn on an LED. Using a pin to send/receive arbitrary software-defined signals is called “GPIO” (General Purpose Input/Output). First, choose a pin – for example, PA8 (this one is available on all package variants). Connect an LED to this pin and to GND (“active high”). Use a series resistor to limit the current to max. 15 mA (the absolute maximum being 25 mA), e.g. 100 Ω for a 3.3 V supply and a standard LED. For higher loads (e.g. high-power LEDs or a relay) use an appropriate transistor.

As with most microcontrollers, the pins are grouped into so-called “ports”, each of which has up to 16 pins. The ports are named by letters of the alphabet, i.e. “GPIOA”, “GPIOB”, “GPIOC” etc. The number of ports and pins varies among the individual microcontroller types. The 16 pins of one port can be read or written in one single step.

==== Clock Configuration ====
Many ARM controllers feature a certain trap: Most periphery modules are disabled by default to save power. The software has to explicitly enable the needed modules. On STM32 controllers, this is done via the “RCC” (Reset and Clock Control) module. Particularly, this module allows the software to disable/enable the clock signal for each periphery module. Because CMOS circuits (virtually all modern ICs) draw most of their power while a clock signal keeps them switching, turning off the clock of unused modules can reduce the power usage considerably.

This is documented in the aforementioned reference manual in chapter 7. The subchapter 7.3.7 describes the periphery register “RCC_APB2ENR” which allows you to configure the clock signal for some peripheral modules. This register has 32 bits, of which 14 are “reserved”, i.e. can’t be used and should only be written with zeroes. Each of the available 18 bits enables one specific periphery module if set to “1” or disables it if set to “0”. According to the manual, the reset value of this register is 0, so all periphery modules are disabled by default. In order to turn on the GPIOA module to which the desired pin PA8 belongs, the bit “IOPAEN” needs to be set to “1”. This is bit number two in the register. Since registers can only be accessed as a whole (individual bits can’t be addressed), a 32bit value where bit two is “1” and all others are kept as “0” needs to be written. This value is 0x00000004.

To write to the register, its address needs to be given in the code. The addresses of the periphery registers are grouped by the periphery modules they belong to - each periphery module (e.g. RCC, GPIOA, GPIOB, USB, …) has its own base address. The addresses of the individual registers are specified as an offset that needs to be added to this base address to obtain the full absolute address of the register. Chapter 7.3.7 specifies the offset address of RCC_APB2ENR as “0x18”. Chapter 3.3 specifies the base addresses of all periphery modules – RCC is given as “0x40021000”. So, the absolute address of RCC_APB2ENR is “0x40021000 + 0x18 = 0x40021018”.

In short: To enable GPIOA, the value '''0x00000004''' needs to be written to address '''0x40021018'''.

According to the “load-store” principle, ARM processors can’t do this in a single step. Both the value to be written and the address need to reside in processor registers in order to perform the write access. So, what needs to be done is:
* Load the value 0x00000004 into a register
* Load the value 0x40021018 into another register
* Store the value from the first register into the memory location specified by the second register.
This last step is performed by the “STR” instruction as follows:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
b .
</source>
The square brackets are required but just serve as a reminder to the programmer that the contents of “r1” is used as an address. After the “str” instruction, the GPIOA periphery is enabled, but doesn’t do anything yet.
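
To avoid scattering raw numbers through the code, the address calculation described above can be left to the assembler. The following is a minimal sketch using the “.equ” directive of the GNU assembler. The symbol names (RCC_BASE, RCC_APB2ENR, RCC_APB2ENR_IOPAEN) are chosen here purely for illustration and are not predefined anywhere. The lines are meant to replace the two “ldr” lines and the “str” line of the program above:
<source lang=asm>
.equ RCC_BASE, 0x40021000               @ Base address of RCC (reference manual chapter 3.3)
.equ RCC_APB2ENR, RCC_BASE + 0x18       @ Base address plus register offset = 0x40021018
.equ RCC_APB2ENR_IOPAEN, 1 << 2         @ Bit two set = 0x00000004

ldr r0, =RCC_APB2ENR_IOPAEN
ldr r1, =RCC_APB2ENR
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
</source>
Since the assembler evaluates these expressions while translating, the generated instructions should be the same as with the literal numbers.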

==== GPIO Configuration ====
By default, all GPIO pins are configured as “input”, even if there is no software to process the input data. Since inputs are “high-impedance”, i.e. only a very small current can flow into/out of the pin, the risk of (accidental) short-circuits and damage to the microcontroller is minimized. However, this current is too small to light up an LED, so you have to configure the pin PA8 as “output”. The STM32 support multiple output modes, of which the right one for the LED is “General Purpose Output Push-Pull, 2 MHz”.

Access and configuration of GPIO pins is achieved via the registers of the GPIO periphery. The STM32 have multiple identical instances of GPIO modules, which are named GPIOA, GPIOB, … Each of those instances has a distinct base address, which are again described in chapter 3.3 of the reference manual (e.g. “0x40010800” for GPIOA, “0x40010C00” for GPIOB etc.). The registers of the GPIO module are described in chapter 9.2, and there is one instance of each register per GPIO module. To access a specific register of a specific GPIO module, the base address of that module needs to be added to the offset address of the register. For example, “GPIOA_IDR” has address “0x40010800 + 0x08 = 0x40010808”, while “GPIOB_ODR” has address “0x40010C00 + 0x0C = 0x40010C0C”.

Configuration of the individual GPIO pins happens through the “GPIOx_CRL” and “GPIOx_CRH” registers (“x” is a placeholder for the concrete GPIO module) – see chapters 9.2.1 and 9.2.2. Both registers are structured identically, where each pin uses 4 bits, so each of the two registers handles 8 pins in 8x4=32 bits. Pins 0-7 are configured by “GPIOx_CRL” and pins 8-15 by “GPIOx_CRH”. Pin 0 is configured by bits 0-3 of “GPIOx_CRL”, pin 1 by bits 4-7 of “GPIOx_CRL”, pin 8 by bits 0-3 of “GPIOx_CRH” and so on.

The 4 bits per pin are split into two 2-bit fields: “MODE” occupies bits 0-1, and “CNF” bits 2-3. “MODE” selects from input and output modes (with different speeds). In output mode, “CNF” determines whether the output value is configured from software (“General Purpose” mode) or driven by some other periphery module (“Alternate function” mode), and whether two transistors (“Push-pull”) or one (“open-drain”) are used to drive the output. In input mode, “CNF” selects from analog mode (for ADC), floating input and input with pull-up/down resistors (depending on the value in the “GPIOx_ODR” register).

Therefore, to configure pin PA8 into “General Purpose Output Push-Pull, 2 MHz” mode, bits 0-3 of “GPIOA_CRH” need to be set to value “2”. The default value of “4” configures the pin as “input”. To keep the other pins at their “input” configuration, the value “0x44444442” needs to be written to register “GPIOA_CRH”, which has address “0x40010804”:
<source lang=asm>
ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
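
The value 0x44444442 can also be assembled from its fields to make the bit layout visible. This is only a sketch. The symbol names are made up for this example, and the field values follow the description above (MODE = 2 for 2 MHz output, CNF = 0 for general purpose push-pull):
<source lang=asm>
.equ MODE_OUT_2MHZ, 2                   @ MODE field (bits 0-1 of the 4-bit group)
.equ CNF_GP_PUSHPULL, 0                 @ CNF field (bits 2-3 of the 4-bit group)
.equ PIN_CFG, (CNF_GP_PUSHPULL << 2) | MODE_OUT_2MHZ
.equ CRH_VALUE, 0x44444440 | PIN_CFG    @ Pin 8 uses bits 0-3, pins 9-15 stay at 4 (input)

ldr r0, =CRH_VALUE                      @ Evaluates to 0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>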

==== Writing GPIO pins ====
The GPIO pin still outputs the default value, which is 0 for “low”. To turn on the LED, the output has to be set to “1” for “high”. This is achieved via the GPIOA_ODR register, which has 16 usable bits, one for each pin (see chapter 9.2.4). To enable the LED, set bit 8 to one:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =0x00000004
ldr r1, =0x40021018
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r0, =0x44444442
ldr r1, =0x40010804
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x100
ldr r1, =0x4001080C
str r0, [r1] @ Set ODR8 in GPIOA_ODR to 1 to set PA8 high

b .
</source>
Example name: “SetPin”

This program enables the GPIOA periphery clock, configures PA8 as output, and sets it to high. If you run it on your microcontroller, you should see the LED turn on – the first program to have a visible effect!

=== Data processing ===
ARM supports many instructions for mathematical operations. For example, addition can be performed as:
<source lang=asm>
ldr r0, =222
ldr r1, =111
add r2, r0, r1
</source>
This will first load the value 222 into register r0, load 111 into r1, and finally add r0 and r1 and store the result (i.e. 333) in r2. The operand for the result is (almost) always put on the left, while the input operand(s) follow on the right.

You can also overwrite an input register with the result:
<source lang=asm>
add r0, r0, r1
</source>
This will write the result to r0, overwriting the previous value. This is commonly shortened to
<source lang=asm>
add r0, r1
</source>
The output operand can be omitted, and the first input (here: r0) will be overwritten. This applies to most data processing instructions. Other frequently used data processing instructions that are used in a similar fashion are:
* '''sub''' for subtraction
* '''mul''' for multiplication
* '''and''' for bitwise and
* '''orr''' for bitwise or
* '''eor''' for bitwise exclusive or (“xor”)
* '''lsl''' for logical left shift
* '''lsr''' for logical right shift
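
As a quick illustration, the listed instructions can be tried with two small input values. The register assignment in this fragment is arbitrary, and the values in the comments follow from 12 = 0b1100 and 10 = 0b1010:
<source lang=asm>
ldr r0, =12
ldr r1, =10
ldr r2, =2
sub r3, r0, r1 @ r3 = 12 - 10 = 2
mul r4, r0, r1 @ r4 = 12 * 10 = 120
and r5, r0, r1 @ r5 = 0b1100 & 0b1010 = 0b1000 = 8
orr r6, r0, r1 @ r6 = 0b1100 | 0b1010 = 0b1110 = 14
eor r7, r0, r1 @ r7 = 0b1100 ^ 0b1010 = 0b0110 = 6
lsl r8, r0, r2 @ r8 = 12 << 2 = 48
lsr r9, r0, r2 @ r9 = 12 >> 2 = 3
</source>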

Most of these instructions can not only take registers as input, but also immediate arguments. Such an argument is encoded directly into the instruction without needing to put it into a register first. Immediate arguments need to be prefixed by a hash sign #, and can be decimal, hexadecimal or binary. For example,
<source lang=asm>
add r0, r0, #23
</source>
adds 23 to the register r0 and stores the result in r0. This can again be shortened to
<source lang=asm>
add r0, #23
</source>

Such immediate arguments can not be arbitrarily large, because they need to fit inside the instruction, which is 16 or 32 bit in size and also needs some room for the instruction and register numbers as well. So, if you want to add a large number, you have to use “ldr” first as shown to load it into a register.
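
The following fragment sketches the different notations for immediate arguments and the workaround for a value that does not fit. The concrete numbers are arbitrary examples:
<source lang=asm>
ldr r0, =0
add r0, r0, #23      @ Decimal immediate, r0 = 23
add r0, r0, #0x10    @ Hexadecimal immediate, r0 = 23 + 16 = 39
add r0, r0, #0b101   @ Binary immediate, r0 = 39 + 5 = 44

ldr r1, =0x12345678  @ Too large for an immediate, so load it into a register first
add r0, r0, r1       @ r0 = 44 + 0x12345678 = 0x123456A4
</source>
Trying to write the last addition as “add r0, r0, #0x12345678” should make the assembler refuse the instruction, because the constant cannot be encoded.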

Try out the above examples and use GDB to examine their behavior. Use GDB’s “info reg” command to display the register contents. Don't forget to execute both the “arm-none-eabi-as” and “arm-none-eabi-ld” commands to translate the program.

=== Reading periphery registers ===
The last example works, but has a flaw: Even though only a few bits per register need to be modified, the code overwrites all the bits in the register at once. The bits that should not be modified are just overwritten with their respective default value. If some of those bits had been changed before – for example to enable some other periphery module – these changes would be lost. Keeping track of the state of the register throughout the program is hardly practical. Since ARM does not permit modifying individual bits, the solution is to read the whole register, modify the bits as needed, and write the result back. This is called a “read-modify-write” cycle.

Reading registers is done via the “ldr” instruction. As with “str”, the address needs to be written into a processor register beforehand, and the instruction stores the read data into a processor register as well. Starting with the “RCC_APB2ENR” register, you can read it via:
<source lang=asm>
ldr r1, =0x40021018
ldr r0, [r1]
</source>
Even though the two “ldr” instructions look similar, they work differently – the first one loads a fixed value into a register (r1), while the second loads data from the periphery register, whose address is in r1, into r0.

The loaded value should then be modified by setting bit two to “1”. This can be done with the “orr” instruction:
<source lang=asm>
orr r0, r0, #4
</source>
After that, we can store r0 as before.

With the GPIOA_CRH register, it’s slightly more complicated: The bits 0, 2 and 3 need to be cleared, while bit 1 needs to be set to 1. The other bits (4-31) need to keep their value. To clear the bits, use the “and” instruction after loading the current periphery register value:

<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
</source>
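
The same pattern applies to any other pin. Only the position of the 4-bit group changes. As a sketch, assume a second LED were connected to PA9 (a hypothetical addition that is not part of the example circuit). Pin 9 is configured by bits 4-7 of “GPIOA_CRH”, so both the mask and the new value are shifted left by 4 bits:
<source lang=asm>
ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xffffff0f  @ Clear bits 4-7, keep all others
orr r0, #0x20        @ Write value 2 into bits 4-7
str r0, [r1] @ Set CNF9:MODE9 in GPIOA_CRH to 2
</source>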

For the “GPIOx_ODR” registers, such tricks are not needed, as there is a special “GPIOx_BSRR” register which simplifies writing individual bits: This register can not be read, and writing zeroes to any bit has no effect on the GPIO state. However, if a 1 is written to any of the bits 0-15, the corresponding GPIO pin is set to high (i.e. the corresponding bit in ODR set to 1). If any of the bits 16-31 is written to 1, the corresponding pin is set to low. So, the pin can be set to 1 like this:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>
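
Because zeroes have no effect, several pins can be changed by one write to this register. The following sketch sets PA8 high and, at the same time, sets PA9 low. It assumes that PA9 has also been configured as output, which the example programs in this tutorial do not do:
<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x02000100  @ Bit 8 (BS8) and bit 25 (BR9) are 1
str r0, [r1] @ Set PA8 high and PA9 low with a single write to GPIOA_BSRR
</source>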

So, the modified program is:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “SetPin2”

=== Jump instructions ===
For a traditional “hello world” experience, the LED should not only light up, but blink, i.e. turn on and off repeatedly. Setting pin PA8 to low level can be achieved by writing a 1 to bit 24 in the “GPIOA_BSRR” register:

<source lang=asm>
ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1]
</source>

By pasting this after the instructions for turning on the LED, it will be turned on and off again. To get the LED to blink, those two blocks need to be repeated endlessly, i.e. at the end of the code there needs to be an instruction for jumping back to the beginning.

A simple endless loop was already explained: The “b .” instruction, which just executes itself repeatedly. To have it jump somewhere else, the dot needs to be replaced by the desired target address, for example:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b 0x8000104
</source>
Example name: “Blink”

The address specified is an absolute address, which is the address of the “ldr” instruction at the beginning of the block for setting the pin to high. Actually, the branch instruction “b” is not capable of jumping directly to such an absolute address - again, because a 32 bit wide address can't be encoded in a 16/32 bit wide instruction. Instead, the assembler calculates the distance between the jump target and the location of the “b” instruction, and stores it into the instruction. When jumping backwards, this distance is negative.

When executing program code, the processor always stores the address of the currently executed instruction plus four in the r15 register, which is therefore also known as PC, the program counter. When encountering a “b” instruction, the processor adds the contained distance value to the PC value to calculate the absolute address of the jump target before jumping there.

This means that “b” performs a relative jump, and even if the whole machine code section were moved somewhere else in memory, the code would still work. However, the assembly language syntax does not really represent this, as the assembler expects absolute addresses which it then transforms into relative ones.

Specifying the target address directly as shown is very impractical, as it has to be calculated manually, and if the section of code is moved or modified, the address needs to be changed. To rectify this, the assembler supports labels: You can assign a name to a certain code location, and use this name to refer to the code location instead of specifying the address as a number. A label is defined by writing its name followed by a colon:

<source lang=asm>
BlinkLoop:

ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

b BlinkLoop
</source>
Example name: “Blink2”

This is purely a feature of the assembler – the generated machine code will be identical to the previous example. In “b BlinkLoop”, the assembler substitutes the label for the address it represents to calculate the relative jump distance. The assembler actually provides no direct way of directly specifying the relative offset that will be encoded in the instruction, but it can be done like this:

<source lang=asm>
b (.+4+42*2)
</source>

The resulting instruction will contain “42” as the jump offset. As suggested by the syntax, the processor multiplies this number by 2 (since instructions can only reside at even memory addresses, it would waste one bit of memory to specify the number directly) and adds to it the address of the “b” instruction plus 4. The assembly syntax is designed to represent the end result of the operation, so the assembler reverses the peculiar pre-calculations of the processor. If you want to do this calculation yourself, you have to again undo the assembler’s own calculation with the expression shown above. There is usually no reason to do that, though.
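
Labels work for forward jumps in the same way. In that case the encoded distance is positive. A small fragment to try in GDB (the label name “Skip” is arbitrary):
<source lang=asm>
ldr r0, =1
b Skip       @ Forward jump over the next instruction
ldr r0, =2   @ Skipped, never executed
Skip:
b .          @ r0 still contains 1 here
</source>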

=== Counting Loops ===
The above example for a blinking LED does not really work yet – the LED blinks so fast the human eye can’t see it. The LED will just appear slightly dim. To achieve a proper blinking frequency, the code needs to be slowed down. The easiest way to do that is to have the processor execute a large number of “dummy” instructions between setting the pin high and low. Simply placing many “nop” instructions isn’t possible though, as there is simply not enough program memory to store all of them. The solution is a loop that executes the same instructions a specific number of times (as opposed to the endless loops from the examples above). To do that, the processor has to count the number of loop iterations. It is actually easier to count ''down'' than up, so start by loading the desired number of iterations into a register and begin the loop by subtracting “1”:

<source lang=asm>
ldr r2, =1000000
subs r2, #1
</source>

Now, the processor should make a decision: If the register has reached zero, terminate the loop; else, continue by again subtracting “1”. The ARM math instructions can automatically perform some tests on the result to check whether it is positive/negative or zero and whether an overflow occurred. To enable those checks, append an “s” to the instruction name – hence, “subs” instead of “sub”. The result of these checks is automatically stored in the “Application Program Status Register” (APSR) – the contained bits N, Z, C, V indicate whether the result was negative, zero, set the carry bit or caused an overflow. This register is usually not accessed directly. Instead, use the conditional variant of the “b” instruction, where two letters are appended to indicate the desired condition. The jump is only performed if the condition is met; otherwise, the instruction does nothing. The available condition codes are described in the chapter “Condition Codes” of this tutorial. The conditions are formulated in terms of the mentioned bits of the APSR. For example, the “bne” instruction only performs a jump if the zero (Z) flag is ''not'' set, i.e. when the result of the last math instruction (with an “s” appended) was ''not'' zero. The “beq” instruction is the opposite of that – it only performs a jump if the result ''was'' zero.

So, to perform the jump back to the beginning of the loop, add a label before the “subs” instruction, and put a “bne” instruction after the “subs” that jumps to this label if the counter has not reached zero yet:
<source lang=asm>
ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop
</source>

The actual loop consists only of the two instructions “subs” and “bne”. By placing two of those loops (with two different labels!) in between the blocks that turn the pins on and off, the blink frequency is lowered sufficiently such that it becomes visible:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

BlinkLoop:
ldr r1, =0x40010810
ldr r0, =0x100
str r0, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r2, =1000000
delay1:
subs r2, #1
bne delay1 @ Iterate delay loop

ldr r1, =0x40010810
ldr r0, =0x1000000
str r0, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r2, =1000000
delay2:
subs r2, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay”

You might notice that the registers r0-r2 are loaded with the same values over and over again. To make the code both shorter and faster, take advantage of the available processor registers, and load the values that don’t change ''before'' the loop. Then, just use them inside the loop:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkDelay2”
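
The “beq” instruction mentioned above can be tried out in isolation with a fragment like the following. This is just a sketch for stepping through in GDB, and the label name “WasZero” is arbitrary:
<source lang=asm>
ldr r0, =5
subs r0, #5   @ Result is zero, so the Z flag is set
beq WasZero   @ Jump is performed because Z is set
ldr r1, =0    @ Skipped
b .
WasZero:
ldr r1, =1    @ Executed, r1 = 1
b .
</source>
Changing the first line to load any value other than 5 makes the result non-zero, so the jump is not performed and r1 ends up as 0.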

=== Using RAM ===
Until now, all data in the example codes was stored in periphery or processor registers. In all but the most simple programs, larger amounts of data have to be processed for which the thirteen general-purpose processor registers aren’t enough. For this, the microcontroller features a block of SRAM that stores 20 KiB of data. Accessing data in RAM works similarly to accessing periphery registers – load the address in a processor register and use “ldr” and “str” to read and write the data. After reset, the RAM contains just random ones and zeroes, so before the first read access, some value has to be stored.

As the programmer decides what data to place where, they have to keep track of which address in memory contains what piece of data. You can use the assembler to help keep track by declaring what kind of memory blocks you need and giving them names. To do this, you must first tell the assembler that the next directives refer to data instead of instructions with the “.data” directive. Then, use the “.space” directive for each block of memory you need. To assign names to the blocks, place a label definition (using a colon) right ''before'' that. After the definitions, put a “.text” directive to make sure the instructions after that will properly go to program memory (flash):

<source lang=asm>
.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text
@ Instructions go here...
</source>

Here, a data block of 4 bytes is reserved and named “var1”. Another block of 1 byte is named “var2”. Note that just inserting these lines will not modify the assembler output – these are just instructions to the assembler itself. To access these memory blocks, you can use “var1” and “var2” just like literal addresses. Load them into registers and use these with “ldr” and “str” like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

.data
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”
var2:
.space 1 @ Reserve 1 byte for memory block “var2”

.text

ldr r0, =var1 @ Get address of var1
ldr r1, =0x12345678
str r1, [r0] @ Store 0x12345678 into memory block “var1”

ldr r1, [r0] @ Read memory block “var1”
and r1, #0xFF @ Set bits 8..31 to zero
ldr r0, =var2 @ Get address of var2
strb r1, [r0] @ Store a single byte into var2

b .
</source>
Example name: “RAMVariables”

Note the use of “strb” - it works similarly to “str”, but only stores a single byte. Since the processor register r1 is of course 32 bits in size, only the lower 8 bits are stored, and the rest is ignored.
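
Variables in RAM can be modified with the same read-modify-write pattern as periphery registers. As a sketch, the following lines could be placed right before the final “b .” of the program above. The “ldrb” instruction used at the end is the counterpart of “strb” and loads a single byte, filling the upper 24 bits of the register with zeroes:
<source lang=asm>
ldr r0, =var1 @ Get address of var1
ldr r1, [r0]  @ Read current value (0x12345678)
add r1, #1    @ Increment
str r1, [r0]  @ Write back, var1 now contains 0x12345679

ldr r0, =var2 @ Get address of var2
ldrb r2, [r0] @ Read a single byte, r2 = 0x78
</source>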

There is still something missing – nowhere in the code is there any address of the RAM. To tell the linker where the RAM is located, pass the option <code>-Tdata=0x20000000</code> to the <code>arm-none-eabi-ld</code> call to tell the linker that this is the address of the first byte of RAM. This program can't be flashed directly with OpenOCD, as OpenOCD doesn't recognize the RAM as such; GDB has to be used as explained above. When a linker script is used as described in the next chapters (using the NOLOAD attribute), OpenOCD can again be used directly.

If you run this program via GDB, you can use the commands <code>x/1xw &var1</code> and <code>x/1xb &var2</code> to read the data stored in memory. After this quick introduction, a more systematic overview is in order.

== Memory Management ==
If there is one thing that sets higher and lower level programming languages apart, it’s probably memory management. Assembly programmers have to think about memory, addresses, layout of program and data structures all the time. Assembler and linker provide some help which needs to be used effectively. Therefore, this chapter will explain some more fundamentals of the ARM architecture and how the toolchain works.

=== Address space ===
In the examples so far, addresses were used for periphery register accesses and jump instructions without really explaining what they mean, so it’s time to catch up with that. To access periphery registers and memory locations in any memory type (RAM, Flash, EEPROM…), an address is required, which identifies the desired location. On most platforms, addresses are simply unsigned integers. The set of all possible addresses that can be accessed in a uniform way is called an “address space”. Some platforms such as AVR have multiple address spaces (for Flash, EEPROM, and RAM+periphery) where each memory needs to be accessed in a distinct way and the programmer needs to know which address space an address belongs to – e.g. all three memory types have a memory location with address 123.

However, the ARM architecture uses only a single large address space where addresses are 32bit unsigned integers in the range of 0-4294967295. Each address refers to one byte of 8 bits. The address space is divided into several smaller ranges, each of which refers to a specific type of memory. For the STM32F103, this is documented in the datasheet in chapter 4. All addresses in all memory types are accessed in the same way – directly via the “ldr” and “str” instructions, or by executing code from a certain location, which can be achieved by jumping to the respective address with the “b” instruction. This also makes it possible to execute from RAM – simply perform a jump to an address that refers to some code located in RAM. Note that there are large gaps between the individual ranges in address space; attempting to access those usually leads to a crash.
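
To illustrate the uniform access, the following sketch reads a word from flash memory and copies it to RAM using exactly the same instructions as for periphery registers. It assumes the program layout of the previous examples, where the first word in flash (address 0x08000000) was defined by “.word 0x20000400”:
<source lang=asm>
ldr r0, =0x08000000 @ Address of the first word in flash
ldr r1, [r0]        @ r1 = 0x20000400, the first ".word" of the program

ldr r0, =0x20000000 @ Address of the first word in RAM
str r1, [r0]        @ Store the value into RAM
ldr r2, [r0]        @ Read it back, r2 = 0x20000400
</source>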

While the addresses of periphery are fixed and defined by the manufacturer, the layout of program code and data in memory can be set by the programmer rather freely. Up until now, the example programs defined the flash memory contents in a linear fashion by listing the instructions in the order they should appear in flash memory. However, when translating multiple assembly source files into one program, the order in which the contents of those files appear in the final program isn’t defined a priori. Also, even though in the last example the memory blocks for RAM were defined ''before'' the code, the code actually comes first in address space. What makes all this work is the linker.

=== The Linker ===
Usually the last step in translating source code into a usable program, the linker is an often overlooked, sometimes misunderstood but important and useful tool, if applied correctly. Many introductions into programming forego explaining its workings in detail, but as any trade, embedded development requires mastery of the tools! A good understanding of the linker can save time solving strange errors and allow you to implement some less common use cases, such as using multiple RAM blocks present in some microcontrollers, executing code from RAM or defining complex memory layouts as sometimes required by RTOSes.

[[File:ArmAsmTutorial_Linking.svg|300px|thumb|right|Translation of native applications using assembler, compiler and linker]]You have already used a linker – the command <code>arm-none-eabi-ld</code> calls the GNU linker that is shipped with the GNU toolchain. Until now, only one assembly source file was translated for each program. To translate a larger program that consists of three assembly files “file1.S”, “file2.S” and “file3.S”, the assembler would be called three times to produce three object code files “file1.o”, “file2.o” and “file3.o”. The linker would then be called to combine all three into a single output file.

When translating any of these assembly files, the assembler does not know of the existence of the other files. Therefore, it can’t know whether the contents of any other file will end up in flash memory before the currently processed file, and also can’t know the final location in flash memory of the machine code it is emitting and placing in the object file (ending .o). This means that the object file does not contain any absolute addresses (except for those of periphery registers, as these were specified explicitly). For example, when loading the address of the RAM data blocks (“ldr r0, =var1”) the assembler doesn’t know the address, only the linker does. Therefore, the assembler puts a placeholder in the object file that will be overwritten by the linker. A jump (“b” instruction) to a label defined in another assembly file works similarly; the assembler uses a placeholder for the address. For the jump instructions we used inside the same file (e.g. “b BlinkLoop”), a placeholder is not necessary, as the assembler can calculate the distance between the label and the instruction and generate the relative jump itself. However, if the target resides within a different section (see below), this isn’t possible, and a placeholder becomes necessary. As the contents of object files have no fixed address and can be moved around by the linker, these files are called relocatable.

On Unix systems (including Linux), the Executable and Linkable Format (ELF) is used for both object files and executable program files. This format is also used by ARM, and the GNU ARM toolchain. Because it was originally intended to be used with operating systems, some of its concepts don’t perfectly map to the embedded use case. The object (.o) files created by the assembler and the final program created by the linker (usually no file extension, but in embedded contexts and also in the above example commands, .elf is used) are all in ELF format. The specification of ELF for ARM can be found [https://developer.arm.com/architectures/system-architectures/software-standards/abi here], and the generic specification for ELF on which the ARM ELF variant is based can be found [http://www.sco.com/developers/gabi/2003-12-17/contents.html here].

ELF files are structured into sections. Each section may contain code, data, debug information (used by GDB) and other things. In an object file, the sections have no fixed address. In the final program file, they have one. Sections also have various attributes that indicate whether its contents is executable code or data, is read-only and whether memory should be allocated for it. The linker combines and reorders the sections from the object files (“input sections”) and places them into sections in the final program file (“output sections”) while assigning them absolute addresses.

Symbols are another important aspect. A symbol defines a name for an address. The address of a symbol may be defined as an absolute number (e.g. 0x08000130) or as an offset relative to the beginning of a section (e.g. “start address of section .text plus 0x130”). Labels defined in assembly source code define symbols in the resulting object file. For example, the “var1” label defined in the last example results in a symbol “var1” in the “prog1.o” file whose address is set to be equal to the beginning of “.data”. The symbol “var2” is defined similarly, but with an offset of 4. After the linking process, the “prog1.elf” file contains a “.data” section with absolute address 0x20000000, and so the “var1” and “var2” symbols get absolute addresses as well.

As mentioned, the assembler puts placeholders in the object files when it doesn’t know the address of something. In ELF files, these placeholders are called “relocation entries” and they reference symbols by name. When the linker sees such a relocation entry in one of its input files, it searches for a symbol in the input files with a matching name and fills in its address. If no symbol with that name was found, it emits this dreaded error:

<pre>(.text+0x132): undefined reference to `Foo'</pre>

Google finds almost a million results for that message, but knowing how the linker operates makes it easy to understand and solve – since the symbol was not found in any object file, make sure it is spelled correctly and that the object file that contains it is actually fed to the linker.
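
As a sketch of how such a reference across files can look, consider two hypothetical source files. The file names “main.S” and “vars.S” and the symbol name “Counter” are made up for this example. The first file uses a symbol it does not define, so the assembler emits a relocation entry for it:
<source lang=asm>
@ main.S
.syntax unified
.cpu cortex-m3
.thumb

.word 0x20000400
.word 0x080000ed
.space 0xe4

ldr r0, =Counter @ "Counter" is not defined in this file: placeholder for the linker
ldr r1, [r0]
b .
</source>
The second file defines the symbol. The “.global” directive of the GNU assembler makes the symbol visible to other object files:
<source lang=asm>
@ vars.S
.data
.global Counter
Counter:
.space 4 @ Reserve 4 bytes for "Counter"
</source>
Both files are assembled separately and both object files are passed to the linker, with the object file of “main.S” first so that its “.word” lines end up at the beginning of flash. If the second object file is left out, or the “.global” line is missing, the linker should report an undefined reference to “Counter”.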

=== Linker Scripts ===
A linker script is a text file written in a linker-specific language that controls how the linker maps input sections to output sections. The example project hasn’t explicitly specified one yet, which lets the linker use a built-in default one. This has worked so far, but results in a slightly mixed up program file (unsuitable symbols) and has some other disadvantages. Therefore, it’s time to do things properly and write a linker script. Linker scripts usually aren’t created on a per-project basis, but are provided by the microcontroller manufacturer to fit a certain controller’s memory layout. To learn how they work, a quick introduction into writing one will follow. The full documentation can be found [https://sourceware.org/binutils/docs/ld/Scripts.html here].

It’s customary to name the linker script after the controller it is intended for, so create a text file “stm32f103rb.ld” or “stm32f103c8.ld” with the following contents:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
*(.data)
} >SRAM
}
</source>
Example name: “LinkerScriptSimple”

This is the minimum viable linker script for a microcontroller. If you are using an STM32F103C8, replace the 128K by 64K. The lines inside the “MEMORY” block define the available memory regions on your microcontroller by specifying their start address and size within the address space. The names “FLASH” and “SRAM” can be chosen arbitrarily, as they have no special meaning. This memory definition has no meaning outside of the linker script, as it is just an internal helper for writing the script; it can even be left out and replaced by some manual address calculations.

The interesting part happens inside the “SECTIONS” command. Each sub-entry defines an output section that will end up in the final program file. These can be named arbitrarily, but the names “.text” and “.data” for executable code and data storage respectively are usually used. The asterisk expressions “*(.text)” and “*(.data)” tell the linker to put the contents of the input sections “.text” and “.data” at that place in the output section. In this case, the names for the input sections and output sections are identical. The input section names “.data”, “.text” (and some more) are used by the assembler and C and C++ compilers by default, so even though they can be changed, it’s best to keep them. You can however name the output sections arbitrarily, for example:

<source>
SECTIONS {
.FlashText : {
*(.text)
} >FLASH

.RamData (NOLOAD) : {
*(.data)
} >SRAM
}
</source>

The commands “>FLASH” and “>SRAM” tell the linker to calculate the address of the output sections according to the respective memory declaration above: The first output section with a “>FLASH” command will end up at address 0x8000000, the next with “>FLASH” right after that section and so on. The “>SRAM” works the same way with the start address “0x20000000”. The “NOLOAD” attribute does not change the linker’s behavior, but marks the corresponding output section as “not-loadable”, such that OpenOCD and GDB will not attempt to write it into RAM – the program has to take care of initializing any RAM data anyways when running stand-alone.

To specify the filename of the linker script, use the “-T” option:
<source lang=sh>
arm-none-eabi-ld prog1.o -o prog1.elf -T stm32f103rb.ld
</source>
The <code>-Tdata</code> and <code>-Ttext</code> aren’t needed anymore, as the addresses are now defined in the linker script.

Since the linker script defines the sizes of the memory regions, the linker can now warn you when your program consumes too much memory (either flash or RAM):
<pre>arm-none-eabi-ld: prog1.elf section `.text' will not fit in region `FLASH'
arm-none-eabi-ld: region `FLASH' overflowed by 69244 bytes</pre>

==== Reserving memory blocks ====
Using the processor’s stack will be explained later, but you can already use the linker script to assign a memory block for it. It’s best to allocate memory for the stack at the ''beginning'' of SRAM, so put this before the “*(.data)” command:

<source>. = . + 0x400;</source>

Inside a linker script, the dot “.” refers to the current address in the output file; therefore, this command increments the address by 0x400, leaving an “empty” block of that size. The “.data” input section will be located after that, at address 0x20000400.

==== Defining symbols in linker scripts ====

As mentioned before, the controller requires a certain data structure called the “vector table” to reside at the very beginning of flash memory. It is defined in the assembler source file:

<source lang=asm>
.word 0x20000400
.word 0x080000ed
.space 0xe4
</source>

The “.word” directive tells the assembler to output the given 32bit-number. Just like processor instructions, these numbers are put into the current section (.text by default, .data if specified) and therefore end up in flash memory. The first 32bit-number, which occupies the first 4 bytes in flash memory, is the initial value of the stack pointer which will be explained later. This number should be equal to the address of the first byte ''after'' the memory block that was reserved for the stack. The reserved block starts at address 0x20000000 and has size 0x400, so the correct number is 0x20000400. However, if the size of the reserved block was modified in the linker script, the above assembly line needs to be adjusted as well. To avoid any inconsistencies, and to be able to manage everything related to the memory-layout centrally in the linker script, it is desirable to replace the number in the assembly source file with a symbol expression. To do this, define a symbol in the linker script:

<source>
.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
</source>
Example name: “LinkerScriptSymbols”

This will define a symbol “_StackEnd” to have the value of “.”, which is the current address, which at this point is 0x20000400. In the assembly source file, you can now replace the number with the symbol:

<source lang=asm>
.word _StackEnd
</source>

The assembler will put a placeholder in the object file, which the linker will overwrite with the value of 0x20000400. This modification will not change the output file, but avoids putting absolute addresses in source files. The name “_StackEnd” was chosen arbitrarily; since names that start with an underscore and a capital letter may not be used in C and C++ programs, there is no possibility of conflict if any C/C++ source is added later. Typically, all symbols that are part of the runtime environment and should be “invisible” to C/C++ code are named this way. The same rule applies to names starting with two underscores.
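
Symbols defined in the linker script are not limited to “.word” directives; they can be used wherever a label could be used. As a small sketch (for illustration only, since the processor already loads the stack pointer from the vector table by itself), the following code loads the value of “_StackEnd” into a register and copies it into the stack pointer:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
ldr r0, =_StackEnd @ The linker fills in 0x20000400 with the linker script shown above
mov sp, r0 @ Explicitly set the stack pointer (normally not needed after reset)
</source>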

The second entry of the vector table is the address of the very first instruction to be executed after reset. Currently the address is hard-coded as the first address after the vector table. If you wanted to insert some other code before this first instruction, this number would have to be changed. This is obviously impractical, and therefore the number should be replaced by a label as well. Since the code executed at reset is commonly known as the “reset handler”, define it like that:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.word _StackEnd
.word Reset_Handler
.space 0xe4

.type Reset_Handler, %function
Reset_Handler:

@ Put code here
</source>

The “.type” directive tells the assembler that the label refers to executable code. The exact meaning of this will be covered later. Leave the “.space” directive alone for now.

==== Absolute section placement ====
The vector table needs to be at the beginning of flash memory, and the examples have relied on the assembler putting the first things from the source file into flash memory first. This stops working if you use multiple source files. You can use the linker script to make sure the vector table is always at the beginning of flash memory. To do that, you first have to separate the vector table from the rest of the code so that the linker can handle it specially. This is done by placing the vector table in its own section:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:
</source>
Example name: “LinkerScriptAbsolutePlacement”

The “.section” directive instructs the assembler to put the following data into the custom section “.VectorTable”. The “a” flag marks this section as allocable, which is required to have the linker allocate memory for it. To place the vector table at the beginning of flash memory, define a new output section in the linker script:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

This puts the .VectorTable input section into the equally-named output section. It is also possible to put it into .text alongside the code:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.text : {
*(.VectorTable)
*(.text)
} >FLASH

.data (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
*(.data)
} >SRAM
}
</source>

Even though both variants produce the same flash image, the first one is slightly nicer to work with in GDB. The modified LED-blinker application now looks like:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4

.text
.type Reset_Handler, %function
Reset_Handler:

ldr r1, =0x40021018
ldr r0, [r1]
orr r0, r0, #4
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =0x40010804
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #2
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =0x40010810 @ Load address of GPIOA_BSRR
ldr r1, =0x100 @ Register value to set pin to high
ldr r2, =0x1000000 @ Register value to set pin to low
ldr r3, =1000000 @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>

=== Program Structure ===
Because the vector table is usually the same for all projects, it is handy to move it into a separate file, for example called “vectortable.S”:
<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Assemble the two source files with two assembler commands and link the resulting object files:
<source lang=sh>
arm-none-eabi-as -g prog1.S -o prog1.o
arm-none-eabi-as -g vectortable.S -o vectortable.o
arm-none-eabi-ld prog1.o vectortable.o -o prog1.elf -T stm32f103rb.ld
</source>

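This will result in the dreaded “undefined reference” error, because “vectortable.o” refers to the symbol “Reset_Handler”, which is defined in “prog1.o” but is not yet visible to other files. To fix this, use the “.global” directive in the main source file “prog1.S”: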

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
@ Code here ...
</source>

This will tell the assembler to make the symbol “Reset_Handler” visible globally, such that it can be used from other files. By default, the assembler creates a ''local'' symbol for each label, which can’t be used from other source files (same as ''static'' in C). The symbol is still there in the final program file, though - it can be used for debugging purposes.

== More assembly techniques ==
After having set up the project for using the linker properly, some more aspects of assembly programming will be introduced.

=== Instruction set state ===
As mentioned before, ARM application processors support both the T32 and A32/A64 “ARM” instruction sets, and are capable of dynamically switching between them. This can be used to encode time-critical program parts in the faster A32/64 instruction set, and less critical parts in the T32 “thumb” instruction set to save memory. Actually, reducing program size may improve performance too, because the cache memories may become more effective.

Even though the Cortex-M microcontrollers based on the ARMv7-M architecture do not support the A32/A64 instruction sets, some of the switching-logic is still there, requiring the program code to work accordingly. The switch between the instruction sets happens when jumping with the “bx” “Branch and Exchange” and “blx” “Branch with Link and Exchange” instructions. Since all instructions are of size 2 or 4, and code may only be stored at even addresses, the lowest bit of the address of any instruction is always zero. When performing a jump with “bx” or “blx”, the lowest bit of the target address is used to indicate the instruction set of the jump target: If the bit is “1”, the processor expects the code to be T32, else A32.

Another specialty of the “bx” and “blx” instructions is that they take the jump target address from a register instead of encoding it in the instruction directly. This is called an indirect jump. An example of such a jump is:

<source lang=asm>
ldr r0, =SomeLabel
bx r0
</source>

Such indirect jumps are necessary if the difference of the jump target address and the jump instruction is too large to be encoded in the instruction itself for a relative jump. Also, sometimes you want to jump to an address that has been passed from another part of the program, which e.g. happens in C/C++ code when using function pointers or virtual functions.

In these cases, you need to make sure that the address passed to “bx/blx” via a register has its lowest bit set, to indicate that the target code is T32. Otherwise, the code will crash. This can be achieved by telling the assembler that the target label refers to code (and not data) via the already mentioned “.type” directive:

<source lang=asm>
.type SomeLabel, %function
SomeLabel:
@ Some code...
</source>

That way, when you refer to the label to load its address into a register, the lowest bit will be set. Actually, using “.type” for all code labels is a good idea, even though it does not matter if you only refer to a label via the “b” instruction (including the conditional variant) which does not encode the lowest bit and does not attempt to perform an instruction set switch.
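
The difference can be sketched as follows. The label names are hypothetical, and the comments describe the behaviour that is expected from the GNU assembler and linker: only the label declared with “.type” should get its lowest bit set when its address is loaded, so the address of the other label has to be fixed up manually before it can be used with “bx”.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
JumpDemo:
ldr r0, =TypedLabel @ Loaded value is expected to have bit 0 set
ldr r1, =UntypedLabel @ Loaded value is expected to have bit 0 clear
orr r1, r1, #1 @ Manual fix-up, only needed because ".type" is missing
bx r1 @ Works now; without the "orr" this jump would crash

.type TypedLabel, %function
TypedLabel:
b . @ Endless loop as placeholder for real code

UntypedLabel: @ No ".type" directive for this label
b . @ Endless loop as placeholder for real code
</source>

Declaring every code label with “.type” avoids the need for such manual corrections.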

As was already shown, there is another case where the lowest bit matters: when specifying the address of the reset handler (and later, exception handler functions) in the vector table, the bit must be set, so the “.type” directive is necessary here too:

<source lang=asm>
.type Reset_Handler, %function
</source>

If you were writing code for a Cortex-A processor, you would use “.arm” instead of “.thumb” to have your code (or performance critical parts of it) encoded as A32. The “.type” directive would be used as well, and the assembler would clear the lowest bit in the address to ensure the code is executed as A32. For example:

<source lang=asm>
.cpu cortex-a8
.syntax unified

@ Small but slower code here
.thumb

.type Block1, %function
Block1:
ldr r0, =Block2
bx r0

@ Larger but faster code here
.arm

.type Block2, %function
Block2:
@ ...
</source>

The directive “.code 32” has the same meaning as “.arm”, and “.code 16” the same as “.thumb” (although the name is slightly misleading, as T32 instructions can be 32 bit as well). There is also “.type Label, %object” to declare some label refers to data in flash or RAM; this is optional, but helps in working with analysis tools (see below).
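
A minimal sketch of such a data declaration could look like this. The name “Counter” is hypothetical, and the “.size” directive is an addition that is not used elsewhere in this tutorial; it records the size of the object in the symbol table, which some analysis tools display.

<source lang=asm>
.data
.type Counter, %object @ "Counter" refers to data, not code
Counter:
.space 4 @ Reserve 4 bytes in RAM
.size Counter, . - Counter @ Size = current address minus start of "Counter" = 4
</source>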

=== Constants ===
The previous examples contain a lot of numbers (esp. addresses), the meaning of which is not obvious to the reader - so-called “magic numbers”. As code is typically read many more times than it is written or modified, readability is important, even for assembly code. Therefore, it is common practice to define constants that assign names to numbers such as addresses, and use the names instead of the numbers directly.

The assembler actually does not provide any dedicated mechanism for defining constants. Instead, symbols as introduced before are used. You can define a symbol in any of the following ways:
<source lang=asm>
RCC_APB2ENR = 0x40021018
.set GPIOA_CRH, 0x40010804
.equ GPIOA_ODR, 0x4001080C
</source>

and then use it in place of the number:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

Replacing (almost) all numbers in the source code for the LED blinker by constants yields a source code like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_10MHz = 1
GPIOx_CRx_GP_PP_2MHz = 2
GPIOx_CRx_GP_PP_50MHz = 3

GPIOx_CRx_GP_OD_10MHz = 1|4
GPIOx_CRx_GP_OD_2MHz = 2|4
GPIOx_CRx_GP_OD_50MHz = 3|4

GPIOx_CRx_AF_PP_10MHz = 1|8
GPIOx_CRx_AF_PP_2MHz = 2|8
GPIOx_CRx_AF_PP_50MHz = 3|8

GPIOx_CRx_AF_OD_10MHz = 1|4|8
GPIOx_CRx_AF_OD_2MHz = 2|4|8
GPIOx_CRx_AF_OD_50MHz = 3|4|8

GPIOx_CRx_IN_ANLG = 0
GPIOx_CRx_IN_FLOAT = 4
GPIOx_CRx_IN_PULL = 8

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA

ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

mov r4, r3
delay1:
subs r4, #1
bne delay1 @ Iterate delay loop

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

mov r4, r3
delay2:
subs r4, #1
bne delay2 @ Iterate delay loop

b BlinkLoop
</source>
Example name: “BlinkConstants”

This is much more readable than before. In fact, you could even leave out the comments, as the code becomes more self-documenting. The addresses of periphery registers are defined individually, but the bits for the GPIO registers are the same for each GPIO module, so the names include an “x” to denote that they apply to all GPIO modules.

The “CRL”/“CRH” registers get a special treatment. Since the individual bits have little direct meaning, it would be pointless to name them. Instead, 15 symbols are defined to denote the 15 possible modes of operation per pin (combinations of input/output, open-drain vs. push-pull, analog vs. digital, floating vs. pull-resistors, and output driver slew rate). Each of the 15 symbols has a 4 bit value that needs to be written into the appropriate 4 bits of the register. To configure e.g. PA10 as General Purpose Open-Drain with 10 MHz slew rate:

<source lang=asm>
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffff0ff
orr r0, #(GPIOx_CRx_GP_OD_10MHz<<8)
str r0, [r1]
</source>

C-like arithmetic operators can be used in constant expressions, like + - * / and bitwise operators like | (or), & (and), << (left shift) and >> (right shift). Note that these calculations are always done by the assembler while translating the source file, so they do not cost any processor time when the program runs. In the example, or | is used to combine bit values.
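
As a sketch of how such expressions can remove further magic numbers, the shift amount for a pin can be computed from its pin number. The symbols “PinNumber” and “PinShift” are hypothetical helpers introduced only for this illustration, and “bic” (bit clear) is used here in place of “and” with an inverted mask:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

GPIOA_CRH = 0x40010804
GPIOx_CRx_GP_OD_10MHz = 1|4

PinNumber = 10
PinShift = (PinNumber - 8) * 4 @ 4 bits per pin, CRH starts at pin 8, so this evaluates to 8

.text
ldr r1, =GPIOA_CRH
ldr r0, [r1]
bic r0, r0, #(0xf << PinShift) @ Clear the 4 configuration bits of PA10
orr r0, r0, #(GPIOx_CRx_GP_OD_10MHz << PinShift) @ Insert the new mode
str r0, [r1]
</source>

Changing “PinNumber” to another pin between 8 and 15 adjusts both the mask and the mode value.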

Since these constants are actually symbols, they can collide with assembler labels, so you must not define a symbol with the same name as any label.

A different kind of constants are register aliases. Using the “.req” directive, you can define a name for a processor register:

<source lang=asm>
MyData .req r7
ldr MyData, =123
add MyData, #3
</source>

This can be useful for large assembly blocks where the meaning of register data is not obvious. It also allows you to re-assign registers without having to modify many lines of code.
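
The GNU assembler also provides the “.unreq” directive to remove an alias again, which is not covered further in this tutorial. The following sketch (with the hypothetical alias name “Count”) re-assigns a name to a different register:

<source lang=asm>
Count .req r7 @ "Count" is an alias for r7
ldr Count, =123
add Count, #3
.unreq Count @ Remove the alias, the name "Count" is free again
Count .req r6 @ Assign the name to another register
ldr Count, =456 @ This now loads into r6
</source>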
=== The Stack ===
In computer science, a stack is a dynamic data structure where data can be added and removed flexibly. Like a stack of books, the last element that was put on top must be taken and removed first (LIFO-structure - Last In, First Out). Adding an element is usually called “push”, and reading & removing “pop”.

Many processor architectures including ARM feature circuitry to deal with such a structure efficiently. Like most others, ARM does not provide a dedicated memory area for this - it just facilitates using an area that the programmer reserved for this purpose as a stack. Therefore, a part of the SRAM needs to be reserved for the stack.

On ARM, the program stores processor registers on the stack, i.e. 32bit per element. The stack is commonly used when the contents of some register will be needed again later after it has been overwritten by some complex operation that needs many registers. These accesses always come in pairs:

* Some operation that writes to r0
* '''Push''' (save) r0 to the stack
* Some operation that overwrites r0
* '''Pop''' (restore) r0 from the stack
* Use the value in r0 which is the same as initially assigned

ARM’s instructions for accessing the stack are unsurprisingly called “push” and “pop”. They can save/restore any of the registers r0-r12 and r14, for example:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ … Some code that overwrites r0 …

pop { r0 } @ Restore value 1000000
@ Continue using r0 ...
</source>

It is also possible to save/restore multiple registers in one go:

<source lang=asm>
ldr r0, =1000000
ldr r1, =1234567
@ Use r0 and r1 ...
push { r0, r1 } @ Save values 1000000 and 1234567

@ … Some code that overwrites r0 and r1 …

pop { r0, r2 } @ Restore 1000000 into r0 and 1234567 into r2
@ Continue using r0 and r2...
</source>

It does not matter to which register the data is read back - in the previous example, the value that was held in r1 is restored into r2. In larger applications, many store-restore pairs will be nested:

<source lang=asm>
ldr r0, =1000000
@ Use r0 ...
push { r0 } @ Save value 1000000

@ Inner Code Block:

ldr r0, =123
@ Use r0 …

push { r0 } @ Save value 123

@ Inner-Inner Code Block that overwrites r0

pop { r0 } @ Restore value 123
@ Continue using r0 ...

pop { r0 } @ Restore value 1000000 into r0

@ Continue using r0 …
</source>

The “inner” push-pop pair works with value 123, and the “outer” push-pop pair works with value 1000000. Assuming that the stack was empty at the beginning, it will contain 1000000 after the first “push”, and both 1000000 and 123 after the second push. After the first “pop” it contains only 1000000 again, and is empty after the second “pop”.

At the beginning of a push-pop pair, the current contents of the stack are irrelevant - it may be empty or contain many elements. After the “pop”, the stack will be restored to its previous state. This makes it possible to (almost) arbitrarily nest push-pop-pairs - after any inner push-pop-pair has completed, the stack is in the same state as before entering the inner pair, so the “pop” part of the outer pair doesn’t even notice the stack was manipulated in between. This is why it is important to make sure that each “push” has a matching “pop”, and vice-versa.

As mentioned, an area of memory has to be reserved for the stack. Access to the stack memory is managed via the stack pointer (SP). The stack pointer resides in the processor register r13, and “sp” is an alias for that. As the name implies, the stack pointer contains a 32bit memory address - specifically, the address of the first byte in the stack that contains any saved data.

When storing a 32bit register value using “push”, the stack pointer is '''first''' decremented by 4 before the value is written at the newly calculated address. To restore a value, the value at the address currently stored in the stack pointer is read from memory, after which the stack pointer is incremented by 4. This is called a “full-descending” stack (see the ARM Architecture Reference Manual, chapter B1.5.6). On ARMv7-A (Cortex-A), this behaviour can be changed, but on ARMv7-M, it is dictated by the exception handling logic, which will be explained later.

An implication of this is that if the stack is empty, the stack pointer contains the address of the first byte '''after''' the stack memory area. If the stack is completely full, it contains the address of the very first byte '''inside''' the stack memory area. This means that the stack grows '''downward'''. Since the stack is empty at program start, the stack pointer therefore needs to be initialized to the first address after the memory area. Before executing the first instruction, the processor loads the first 4 bytes from the flash into the stack pointer. This is why “_StackEnd” was defined and used to place the address of the first byte after the stack memory region into the first 4 bytes of flash.
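
To make the address calculation concrete, the effect of a single-register “push” and “pop” can be written out with ordinary store and load instructions. This is only a sketch for illustration, assuming the stack is empty and the stack pointer holds 0x20000400 at the start; in real code, “push” and “pop” are the better choice.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
ldr r0, =1000000
str r0, [sp, #-4]! @ Like "push { r0 }": sp becomes 0x200003FC first, then r0 is stored there
ldr r1, [sp], #4 @ Like "pop { r1 }": the value is read from 0x200003FC, then sp becomes 0x20000400 again
</source>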

The stack pointer must always be a multiple of 4 (see chapter B5.1.3 in the ARM Architecture Reference Manual). It is a common error (which is even present in the example projects by ST!) to initialize the stack pointer to the last address ''inside'' the stack memory area (e.g. 0x200003FF instead of 0x20000400), which is not divisible by four. This can cause the application to crash or “just” slow it down. Actually, the [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4127.html ARM ABI requires] the stack pointer to be a multiple of 8 for public software interfaces, which is important for e.g. the “printf” C function. So, when calling any external code, make sure the stack pointer is a multiple of 8.

In the previous examples, the stack memory area was defined with a size of 0x400, i.e. 1KiB. Choosing an appropriate stack size is critical for an application; if it is too small, the application will crash, if it is too large, memory is wasted that could be used otherwise. Traditionally, the stack is configured to reside at the ''end'' of available memory, e.g. 0x20005000 for the STM32F103. As the linker starts allocating memory for data (using “.data” in assembly or global/static variables in C) at the beginning of the memory, the stack is as far away from that regular data as possible, minimizing the chance of a collision. However, if the stack grows continuously, the stack pointer might end up pointing into the regular data area (“.data” or C globals) or heap memory (used by “malloc” in C). In that case, writing to the stack silently overwrites some of the regular data. This can result in all kinds of hard-to-find errors. Therefore, the examples in this tutorial put the stack area at the ''beginning'' of RAM, and the regular data after that - if the stack grows too large, the stack pointer will reach values below 0x20000000, and any access will result in an immediate “clean” crash. It is then usually easy to find the code location that allocates too much stack memory, and to increase the stack size if necessary. Using the Cortex-M3’s memory protection unit (MPU) enables even more sophisticated strategies, but that is out of scope for this tutorial.
=== Function calls ===
Many programming languages feature a “function” concept. Also known as “procedures” or “subprograms”, functions are the most basic building blocks of larger applications, and applying them correctly is key for clean, reusable code. The assembler does not know about functions directly, so you have to build them yourself. A function is a block of code (i.e. a sequence of instructions) that you can jump to, that does some work, and that then jumps back to the place from which the first jump originated. This ability to jump back is the main difference from any other block of assembly code. To make this explicit, such a jump to a function is known as a “call” (as in “calling a function”). The location in code that starts the jump to the function is known as the “caller”, and the called function as “callee”. From the perspective of the caller, calling a function resembles a “user-defined” instruction - it performs some operation after which the code of the caller continues as before. To make the jump back possible, the address of the ''next'' instruction after the one that started the function call needs to be saved, so that the function can jump back to that location (without calling the function directly again).

This is done via the Link Register (LR), which is the processor register r14. Function calls are performed with the “bl” instruction. This instruction performs a jump, much like the well-known “b”, but also saves the address of the next instruction in LR. When the function is finished, it returns to the caller by jumping to the address stored in LR. As already mentioned, jumping to a location from a register is called an indirect jump, which is performed by the “bx” instruction. So, to return from a function, use “bx lr”:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:

bl EnableClockGPIOA @ Call function to enable GPIOA's peripheral clock

@ Some more code ...
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1]
b . @ Endless loop, so execution does not fall through into the function below

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

</source>

Here, the code to enable the clock for GPIOA was packaged into a function. To enable this clock, only a single line is now required - “bl EnableClockGPIOA”.

When calling a function, the “bl” instruction automatically makes sure to set the lowest bit in LR such that the subsequent “bx lr” will not crash because of an attempted instruction set switch, which is not possible on Cortex-M. If you need to call a function indirectly, use “blx” with a register, and remember to ensure that the lowest bit is set, typically via “.type YourFunction, %function”. Usually, all the code of an application resides within functions, with the possible exception of the Reset_Handler. The order in which functions are defined in the source files does not matter, as the linker will always automatically fill in the correct addresses. If you want to put functions in separate source files, remember to use “.global FunctionName” to make sure the symbol is visible to other files.
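
An indirect call of the function from the previous example could be sketched as follows. The choice of r2 for the function address is arbitrary; it is used here only because the called function overwrites r0 and r1.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r2, =EnableClockGPIOA @ Load the function address; bit 0 is set because of ".type"
blx r2 @ Indirect call: save the return address in LR and jump to the address in r2
b . @ Endless loop, so execution does not run into the function below

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller
</source>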
==== Using the stack for functions ====
In large applications it is common for functions to call other functions in a deeply nested fashion. However, a function implemented as shown can’t do that - using “bl” would overwrite the LR, and so the return address of the outer function would be lost, and that function couldn’t ever return. The solution is to use the stack: At the beginning of a function that calls other functions, use “push” to save the LR, and at the end use “pop” to restore it. For example, the blink program could be restructured like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r5, =5 @ Number of LED flashes.
bl Blink
b .

.type Blink, %function
Blink:
push { lr }
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r1, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r3, =DelayLoopIterations @ Iterations for delay loop

BlinkLoop:
str r1, [r0] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

bl Delay

str r2, [r0] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

bl Delay

subs r5, #1
bne BlinkLoop

pop { lr }
bx lr

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

.type Delay, %function
Delay:
mov r4, r3
DelayLoop:
subs r4, #1
bne DelayLoop @ Iterate delay loop
bx lr

</source>
Example name: “BlinkFunctions”

The Reset_Handler just became much prettier. There now are functions for enabling the GPIOA clock, configuring PA8 as output, and one that delays execution so that the LED blinking is visible. The “Blink” function performs the blinking, but only for 5 flashes, after which it returns (an endless blink-loop wouldn’t be good for demonstrating returns). As you see, LR is saved on the stack to allow “Blink” to call further functions.

The two lines

<source lang=asm>
pop { lr }
bx lr
</source>

are longer than necessary. It is possible to load the return address from the stack directly into the program counter, PC:

<source lang=asm>
pop { pc }
</source>

This way, the return address that was saved on the stack is directly used for the jump back. Just the same way, you can use “push” and “pop” to save and restore any other registers while your function is running.
==== Calling Convention ====
Actually, building a large program as shown in the last example is a bad idea. The “Delay” function requires the iteration count (1000000) to reside in r3 and overwrites r4. The “Blink” function relies on “Delay” not overwriting r0-r3 and r5, and requires the number of flashes to be given via r5. Such requirements can quickly grow into an intricate web of interdependencies that makes it impossible to write larger functions that call several sub-functions, or to restructure anything. Therefore, it is common to use a calling convention, which defines which registers a function may overwrite, which it should keep, how it should use the stack, and how to pass information back to the caller.

When building an entire application out of your own assembly code, you can invent your own calling convention. However, it is always a good idea to use existing standards: The AAPCS (Procedure Call Standard for the Arm Architecture) defines a calling convention for ARM. This convention is also followed by C and C++ compilers, so using it makes your code automatically compatible with those. The Cortex-M interrupt mechanism follows it too, which would make it awkward to adapt code that uses some other convention to interrupts. The specification of the calling convention is quite complex, so here is a quick summary of the basics:

* Functions may only modify the registers r0-r3 and r12. If more registers are needed, they have to be saved and restored using the stack. The APSR may be modified too.
* The LR is used as shown for the return address.
* When returning (via “bx lr”) the stack should be exactly in the same state as during the jump to the function (via “bl”).
* The registers r0-r3 may be used to pass additional information to a function, called parameters, and the function may overwrite them.
* The register r0 may be used to pass a result value back to the caller, which is called the return value.

This means that when you call a function, you must assume registers r0-r3 and r12 may be overwritten but the others keep their values. In other words, the registers r0-r3 and r12 are (if at all) saved ''outside'' the function (“caller-save”), and the registers r4-r11 are (if at all) saved ''inside'' the function (“callee-save”).

A function that does not call any other functions is called a “leaf-function” (as it is a leaf in the call tree). If such a function is simple, it might not need to touch the stack at all, as the return address is just kept in a register (LR) and the function might only overwrite the registers r0-r3 and r12, which the caller can make sure contain no important data. This makes small functions efficient, as register accesses are faster than memory accesses, such as to the stack.
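
As a minimal sketch of these rules, consider a hypothetical leaf function “Sum3” (not part of the example projects; the name and the numbers are assumptions for illustration). It receives three parameters in r0-r2, returns their sum in r0, and touches neither the stack nor any callee-save register:

<source lang=asm>
@ Parameters: r0, r1, r2 = Numbers to add
@ Return value: r0 = Sum of the three numbers
.type Sum3, %function
Sum3:
add r0, r0, r1
add r0, r0, r2
bx lr @ Return to caller, the result is in r0

@ Example call, placed inside some other function that has saved its LR:
mov r0, #10
mov r1, #20
mov r2, #12
bl Sum3 @ Afterwards, r0 contains 42
</source>

After the call, the caller has to assume that r1-r3 and r12 no longer hold their previous values, even though this particular function happens not to change them.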

If all your functions follow the calling convention, you can call any function from anywhere and be sure about what it overwrites, even if it calls many other functions on its own. Restructuring the LED blinker could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

DelayLoopIterations = 1000000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =5
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r7, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
mov r7, r0 @ Number of LED flashes.

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r7, #1
bne BlinkLoop

pop { r4-r7, pc }

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
DelayLoop:
subs r0, #1
bne DelayLoop @ Iterate delay loop
bx lr
</source>
Example name: “BlinkFunctionCallingConvention”

The three small functions at the end only use registers r0 and r1, which they are free to overwrite. The “Delay” function expects the number of iterations as a parameter in r0, which it then modifies. Therefore, the “Blink” function fills r0 before every call to “Delay”. Alternatively, “Delay” could use a fixed iteration count, i.e. the “ldr” could be moved into “Delay”. As the “Blink” function must assume that “Delay” overwrites r0-r3 and r12, it keeps its own data in r4-r7, which are guaranteed to be retained according to the calling convention. Since “Blink”, in turn, must preserve these registers for the function that called it, it uses “push” and “pop” to save and restore them. Note the shortened syntax “r4-r7” in the instructions. The number of LED flashes is passed in r0 as a parameter; as this register will be overwritten, this number is moved to r7.

Alternatively, “Blink” could re-load the constants into r1/r2 each time they are used, such that only one register (r4) needs to be saved, which is needed to count the number of flashes:

<source lang=asm>
.type Blink, %function
Blink:
push { r4, lr }

mov r4, r0

BlinkLoop:
ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

subs r4, #1
bne BlinkLoop

pop { r4, pc }
</source>
Example name: “BlinkFunctionCallingConvention2”

A third variant would not use any of the callee-save-registers (r4-r11) at all, and instead just save r0 before the function calls and restore it as needed:

<source lang=asm>
.type Blink, %function
Blink:
push { lr }

BlinkLoop:
push { r0 }

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r2, [r1] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

ldr r1, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r2, =GPIOx_BSRR_BR8 @ Register value to set pin to low
str r2, [r1] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, =DelayLoopIterations @ Iterations for delay loop
bl Delay

pop { r0 }
subs r0, #1
bne BlinkLoop

pop { pc }

</source>
Example name: “BlinkFunctionCallingConvention3”

The frequent stack accesses would however make this slower. Be sure to always document the meaning (and units, if applicable) of parameters e.g. via comments.
=== Conditional Execution ===
As mentioned, the conditional variants of the “b” instruction (e.g. “bne”) can be used to execute certain blocks of code only if a certain condition is met. First, more ways to formulate conditions will be shown. Next, the ARM instruction “it” will be introduced, which makes executing small blocks of code conditionally more efficient.
==== Conditions ====
All conditions for conditional execution depend on the outcome of some mathematical operation. When instructions such as “adds”, “subs”, “ands” are used, they update the flags in the APSR register depending on the outcome, which are then read by the conditional variants of “b” to decide whether to actually perform the jump.

Often it is necessary to compare two numbers without actually doing a calculation. This can be done with the “cmp” instruction to which you can pass two registers or a register and a literal:

<source lang=asm>
cmp r0, #42
cmp r0, r1
</source>

The “cmp” instruction is very similar to “subs” - it subtracts the second operand from the first, but doesn’t save the result anywhere, i.e. the registers keep their values. Just the flags in the APSR are updated according to the result, just as with “subs”. For example, if both operands were equal, the result of the subtraction is zero, and the zero flag will be set. So, to test whether two numbers are equal:

<source lang=asm>
cmp r0, #42
beq TheAnswer

@ This is executed if r0 is not 42

TheAnswer:
@ This is executed if r0 is 42
</source>

The “bne” instruction is the opposite of “beq”.

The “tst” instruction works similarly to “cmp”, but instead of subtracting, it performs a bitwise “and” operation - like the “ands” instruction, but without keeping the result. This way, you can test whether a bit in a register is set:

<source lang=asm>
tst r0, #4
beq BitNotSet

@ This is executed if bit 2 in r0 is set

BitNotSet:
@ This is executed if bit 2 in r0 is not set
</source>

Another useful use case for “tst” is to pass the same register twice. A bitwise “and” of a value with itself yields that same value, so “tst” in this case effectively checks the properties of the input (negative/positive, zero):

<source lang=asm>
tst r0, r0
beq ValueZero

@ This is executed if r0 is not zero

ValueZero:
@ This is executed if r0 is zero
</source>

There is also the “teq” instruction which performs an exclusive or operation.
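
As a brief sketch, “teq” can be used to test two registers for equality: the exclusive or of two equal values is zero, so the zero flag is set exactly if both operands are equal. The label name is only an assumption for illustration:

<source lang=asm>
teq r0, r1
beq ValuesEqual

@ This is executed if r0 and r1 differ

ValuesEqual:
@ This is executed if r0 and r1 are equal
</source>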

As mentioned, the suffixes “eq” and “ne” are called condition codes. ARM has 14 of those which define how the flags in the APSR form the condition. The details about how a subtraction (by “subs” or “cmp”) sets the flags in the APSR and how their interpretation by the different condition codes correlates to the mathematical result are somewhat complicated, involving the way the 2’s complement format works and relying on the fact that subtracting works by adding a negated number. Instead of diving into all the details, a table with a more high-level view and a practical interpretation of the condition should be more helpful:

{| class="wikitable"
! style="text-align:left;" | Code
! style="text-align:left;" | Meaning
! style="text-align:left;" | Unsigned / Signed
! style="text-align:left;" | Flags
! style="text-align:left;" | Condition after “cmp/subs r0, r1”
! style="text-align:left;" | Condition after “tst r0, r0”
|-
| style="text-align:center;" | EQ
| Equal
| style="text-align:center;" | U+S
| Z==1
| r0 = r1
| r0 = 0
|-
| style="text-align:center;" | NE
| Not equal
| style="text-align:center;" | U+S
| Z==0
| r0 ≠ r1
| r0 ≠ 0
|-
| style="text-align:center;" | MI
| Negative
| style="text-align:center;" | S
| N==1
| ---
| r0 < 0
|-
| style="text-align:center;" | PL
| Positive or Zero
| style="text-align:center;" | S
| N==0
| ---
| r0 ≥ 0
|-
| style="text-align:center;" | VS
| Overflow
| style="text-align:center;" | S
| V==1
| r0-r1 out of range¹
| ---
|-
| style="text-align:center;" | VC
| No Overflow
| style="text-align:center;" | S
| V==0
| r0-r1 in range¹
| ---
|-
| style="text-align:center;" | HS
| Unsigned higher or same
| style="text-align:center;" | U
| C==1
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LO
| Unsigned lower
| style="text-align:center;" | U
| C==0
| r0 < r1
| ---
|-
| style="text-align:center;" | HI
| Unsigned higher
| style="text-align:center;" | U
| C==1 and Z==0
| r0 > r1
| ---
|-
| style="text-align:center;" | LS
| Unsigned lower or same
| style="text-align:center;" | U
| C==0 or Z==1
| r0 ≤ r1
| ---
|-
| style="text-align:center;" | GE
| Signed greater or equal
| style="text-align:center;" | S
| N==V
| r0 ≥ r1
| ---
|-
| style="text-align:center;" | LT
| Signed less than
| style="text-align:center;" | S
| N!=V
| r0 < r1
| ---
|-
| style="text-align:center;" | GT
| Signed greater than
| style="text-align:center;" | S
| Z==0 and N==V
| r0 > r1
| ---
|-
| style="text-align:center;" | LE
| Signed less or equal
| style="text-align:center;" | S
| Z==1 or N!=V
| r0 ≤ r1
| ---
|}
¹ Range meaning the numbers from -(2^31) to (2^31-1), inclusive

To determine which condition code you need, first think about whether the number is unsigned (range 0 to 2^32-1) or is using two's complement to represent signed numbers (range -2^31 to 2^31-1). Ignore all rows in the table with the wrong format.

If you want to compare two numbers, use the “cmp” instruction, and search for the desired condition within the “cmp”-condition column of the table. If you want to test a single number's properties, use the “tst”-column. Use the condition code from the first column with the conditional “b” instruction (“bne”, “beq”, “bmi”, “bpl”, “bhs”, ...) right after the appropriate “cmp”/“tst” instruction.

Note that all the condition codes have a corresponding inverse code that has exactly the negated meaning. Most also have a swapped partner code, using which is equivalent to swapping the operands for cmp.
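
The following sketch illustrates why the choice between unsigned and signed condition codes matters, and how swapped partner codes behave. The values and label names are assumptions chosen for illustration; the same bit pattern 0xffffffff is compared against 1 several times:

<source lang=asm>
ldr r0, =0xffffffff @ Unsigned: 4294967295, signed: -1
ldr r1, =1

cmp r0, r1
bhi UnsignedHigher @ Taken: 4294967295 > 1 when interpreted as unsigned

@ Not reached in this example

UnsignedHigher:
cmp r0, r1
blt SignedLess @ Taken: -1 < 1 when interpreted as signed

@ Not reached in this example

SignedLess:
cmp r1, r0 @ Operands swapped compared to the previous “cmp”
bgt SignedLessToo @ Taken: “gt” is the swapped partner of “lt”

@ Not reached in this example

SignedLessToo:
</source>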
==== The IT instruction ====
Jumping is inefficient, so having many conditional jumps may slow down your program. The ARM architecture offers a way to make a few instructions conditional without requiring a jump via the “it” (if-then) instruction. It is used in place of a conditional jump after an instruction that sets the flags (“cmp”, “tst”, “adds”…) and also needs a condition code. The instruction right after the “it” will then only be executed when the condition is met, and skipped otherwise. You have to repeat the condition code as a suffix of that instruction; this is just to make the code clearer and avoid confusion.

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

it hi @ Make the next instruction conditional
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

This checks if r0 is higher than r1 (it isn’t), and only sets the pin PA8 to high if this condition is met. Up to 4 instructions can be made conditional like this; for each instruction after the first, an additional “t” has to be appended to the “it” instruction:

<source lang=asm>
cmp r0, r1 @ Perform comparison

ittt hi @ Make the next three instructions conditional
ldrhi r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldrhi r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
</source>

You can also add instructions that will be executed if the condition was ''not'' met (like an “else”-case in high-level-languages), by appending “e” instead of “t” to the “it” instruction. Since the “t” in “it” is fixed, the first instruction is always executed if the condition is met; only the next three instructions can be either a “then” case (“t”) or “else” case (“e”). You also have to provide the inverted condition code for the “else”-instructions:

<source lang=asm>
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low

ldr r0, =1 @ Load some data to compare
ldr r1, =2

cmp r0, r1 @ Perform comparison

ite hi @ Make the next two instructions conditional (if-then-else)
strhi r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high
strls r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low
</source>

There are several restrictions on which instructions may appear within an it-block. Most importantly, instructions that set the flags are forbidden here, as is the “b” instruction except for the last instruction in an “it” block. Directly jumping to one of the conditional instructions is forbidden too.

In T32 code, only the conditional “b” instruction is capable of encoding a condition code together with some operation, so the “it” instruction is provided to make any instruction conditional. On A32, most instructions include a condition code and can therefore be conditional, and the “it” instruction is actually ignored by the assembler here. You can and should still put “it” into code intended for A32, as this makes it compatible with T32. This is one of the reasons why A32 is more time-efficient, and T32 more space-efficient.

Conditional instructions sometimes make surprisingly compact programs. For example, the Euclidean algorithm for calculating the greatest common divisor (gcd) of two numbers can be written in ARM assembly like this:

<source lang=asm>
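gcd:
cmp r0, r1 @ Compare both numbers
ite gt
subgt r0, r0, r1 @ If r0 > r1: r0 = r0 - r1
suble r1, r1, r0 @ Otherwise: r1 = r1 - r0
bne gcd @ Repeat unless “cmp” found r0 = r1; the result is in r0
</source>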

While the C equivalent is actually longer:

<source lang=c>
int gcd(int a, int b) {
while (a != b) {
if (a > b)
a = a - b;
else
b = b - a;
}
return a;
}
</source>

The usage of conditional instructions is also [https://developer.arm.com/docs/dui0801/latest/condition-codes/example-showing-the-benefits-of-conditional-instructions-in-a32-and-t32-code faster] than using conditional jumps. Note that the final “bne” instruction is independent of the “if-then” block; it just directly uses the result of “cmp”.
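
Another small case where a single conditional instruction replaces a jump is selecting the larger of two signed numbers. This is only a sketch; the function name “Max” is an assumption and the function follows the calling convention described earlier:

<source lang=asm>
@ Parameters: r0, r1 = Signed numbers to compare
@ Return value: r0 = The larger of both numbers
.type Max, %function
Max:
cmp r0, r1
it lt @ Make the next instruction conditional
movlt r0, r1 @ If r0 < r1, replace r0 by r1
bx lr
</source>

For unsigned numbers, the condition code “lo” would be used instead of “lt”.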
=== 8/16 bit arithmetic ===
So far, all numbers were 32 bits wide. However, especially to save space, smaller numbers with 8 or 16 bits are often needed. Cortex-M3 doesn't provide any instructions for calculating with 8 or 16 bit numbers directly. Instead, after loading such a number from memory into a processor register, it has to be extended to 32 bits to allow the 32bit instructions to work properly. When storing the result back, only the lower 8/16 bits are used. If 8/16bit overflow behavior is required for calculations (i.e. wrapping around at -128/127 for 8bit signed, 0/255 for 8bit unsigned, -32768/32767 for 16bit signed, 0/65535 for 16bit unsigned), the numbers have to be truncated after each calculation. This actually makes it slightly less efficient to deal with smaller numbers.

A 16bit value (“halfword”) can be read from memory with the ldrh instruction:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
</source>

“ldrh” loads 16bit from memory, writes them into the lower 16 bits of the destination register (here: r1), and sets the upper 16bits to zero. If the value is signed, it has to be sign-extended so that it can be used with 32bit-calculations:

<source lang=asm>
ldr r0, =SomeAddress
ldrh r1, [r0]
sxth r1, r1
</source>

The “sxth” instruction copies the sign bit (i.e. bit 15) into the upper 16 bits (“sign-extension”); this makes sure that negative 16bit-numbers keep their value when interpreted as 32 bits. The “ldrsh” instruction combines both “ldrh” and “sxth”. “ldrb”, “sxtb”, “ldrsb” are for loading and sign-extending 8bit-values and the combination of both, respectively.
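
A minimal sketch of the 8bit case, assuming that “SomeAddress” refers to a signed 8bit variable in RAM: the value is loaded and sign-extended in one step, modified using a normal 32bit instruction, and only the lower 8 bits are written back:

<source lang=asm>
ldr r0, =SomeAddress
ldrsb r1, [r0] @ Load 8 bits and sign-extend them to 32 bits
add r1, #1 @ Calculate with the full 32bit register
strb r1, [r0] @ Store only the lower 8 bits back
</source>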

To simulate 8/16bit overflow behaviour after a mathematical operation, use uxtb/uxth for unsigned 8/16 bit numbers, or sxtb/sxth for signed 8/16 bit numbers:

<source lang=asm>
add r0, #1
uxth r0, r0
</source>

The “uxth”/“uxtb” instructions copy the lower 16/8 bits of a register into another one, setting the upper 16/24 bits to zero. This way, if r0 contained 65535 before, the result will be 0 instead of 65536 after using “uxth”.
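
The signed case works the same way with “sxtb”/“sxth”. As a small sketch with values chosen for illustration, incrementing the largest signed 8bit number wraps around to the smallest one:

<source lang=asm>
mov r0, #127 @ Largest signed 8bit value
add r0, #1 @ r0 = 128, which does not fit into a signed 8bit number
sxtb r0, r0 @ r0 = -128 (0xffffff80), as with a real 8bit overflow
</source>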

This truncation requirement is a common trap when coding in C - when using e.g. the "uint16_t" type for local variables such as loop counters, this implicitly requests 16bit overflow behavior, requiring the truncation after each calculation, even though the overflow may actually never happen. This is why e.g. "uint_fast16_t" should be used for local variables, as this is 32 bit on ARM, which is faster.
=== Alignment ===
There are certain restrictions on the address when accessing data in memory using the “str”/“ldr” variants:
* The “ldrd”/“strd”/“ldm”/“stm” instructions, which can load/store multiple registers at once, always require the address to be a multiple of 4. If it isn’t, the program will crash.
* The “ldr”/“str” instructions require the address to be a multiple of 4, and “strh”/“ldrh” require it to be a multiple of 2. If it isn’t, the behaviour depends on the ARM version:
** On ARMv6-M and before, the program will crash.
** On ARMv7-M:
*** If the CCR.UNALIGN_TRP bit is set to zero (the default), the access will be slow
*** If the CCR.UNALIGN_TRP bit is set to one, the program will crash, emulating the ARMv6-M behaviour

For “strb”/“ldrb” there are no such requirements.

The number of which the address needs to be a multiple of is called the “alignment” (e.g. 2-byte-alignment, 4-byte-alignment, …). An access with an address that is a multiple of 2/4 as specified above is called an “aligned access”; others are called “unaligned access” (which are slow or cause a crash).
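
Whether an address held in a register is 4-byte-aligned can be checked at runtime using the “tst” instruction shown earlier: a multiple of 4 has its two lowest bits cleared. The following is only a sketch, and the label name is an assumption:

<source lang=asm>
@ r0 = Address to check
tst r0, #3 @ Test the two lowest address bits
bne AddressUnaligned

@ This is executed if r0 is a multiple of 4

AddressUnaligned:
@ This is executed if r0 is not a multiple of 4
</source>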

Even though slow accesses may be acceptable, it is still a good idea to make sure all accesses are always correctly aligned in case the code is ported to an ARM version or operating system that requires it. The addresses of periphery registers are already aligned correctly, so there is no need to worry. When placing data in RAM however, you should make sure that the addresses of the individual elements that are accessed via one of the “ldr” variants are aligned properly. For example, if a previous example code was modified like this:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

The address of “var1” will not be a multiple of 4, and an access via “ldr” would be unaligned. This could be improved by adding a space of 3 bytes in between:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.space 3
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

This would require you to keep in mind all the other things in memory that were declared before, which is impractical especially if multiple assembly files are used. Therefore, the assembler offers the “.align” directive:

<source lang=asm>
.data
var2:
.space 1 @ Reserve 1 byte for memory block “var2”
.align 2
var1:
.space 4 @ Reserve 4 bytes for memory block “var1”

.text
@ Instructions go here...
</source>

When using “.align X”, the assembler makes sure that the next address will be a multiple of 2^X, so in this case, a multiple of 2^2=4. The assembler will therefore insert 0 to 2^X-1 bytes of space. The section containing the directive in the object code file will also be marked to require that alignment, such that the linker will automatically place it at the appropriate location in address space.
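
As a further sketch with hypothetical variable names, different alignments can be requested for halfwords and words within the same section:

<source lang=asm>
.data
flag:
.space 1 @ 1 byte, accessed via “ldrb”/“strb”, no alignment needed
.align 1 @ Next address is a multiple of 2^1 = 2
counter16:
.space 2 @ 2 bytes, accessed via “ldrh”/“strh”
.align 2 @ Next address is a multiple of 2^2 = 4
counter32:
.space 4 @ 4 bytes, accessed via “ldr”/“str”
</source>

The GNU assembler also provides the “.balign” directive, which takes the alignment directly in bytes, so “.balign 4” requests the same alignment as “.align 2” does here.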
=== Offset addressing ===
The various “ldr”/“str” instructions can optionally perform some calculation on the address before executing the memory access. What is shown for “ldr” here works equivalently for “str” and the variants for halfwords and bytes. There are several variants of this. The first one adds a fixed offset that is encoded within the instruction itself to the address:

<source lang=asm>
ldr r0, [r1, #8]
</source>

This adds 8 to r1 and uses the result as the address to access. The number can also be negative. This variant is useful for accessing members of a heterogeneous container organized like a C struct or the registers in a periphery module. For example, you can load the base address of a periphery module into a register, and then access the various registers using offset-addressing without having to load each address individually:

<source lang=asm>
GPIOA=0x40010800
GPIOx_CRH = 0x04
GPIOx_BSRR = 0x10

GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA

ldr r1, =GPIOA

ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2

ldr r0, =GPIOx_BSRR_BS8 @ Register value to set pin to high
str r0, [r1, #GPIOx_BSRR] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

b .
</source>
Example name: “OffsetAddressing”

This way, you can avoid repeated loads of similar addresses. This variant is also capable of writing the newly calculated address back into the address register by appending a “!”:

<source lang=asm>
ldr r0, [r1, #8]!
</source>

This will add 8 to r1, write the result into r1, and also use it as an address from which to load 4 bytes and store them into r0. The variant

<source lang=asm>
ldr r0, [r1], #8
</source>

works just the opposite - r1 is used as an address from which to load the data, and “r1+8” is written back to r1. The next variant adds two registers to obtain the memory address:

<source lang=asm>
ldr r0, [r1, r2]
</source>

This loads the data from the address calculated by “r1+r2”. The second register (here: r2) can also be optionally shifted left by a fixed number of bits in the range 0-3:

<source lang=asm>
ldr r0, [r1, r2, lsl #2]
</source>

This shifts r2 left by two bits (i.e. multiplies it by 4), adds it to r1, and uses that as the address (r2 itself is not modified).
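
To illustrate the variant that writes the incremented address back after the access, here is a sketch of a hypothetical function “CopyWords” (name and parameter layout are assumptions) that copies a block of 32bit words. It only uses r0-r3 and is a leaf function, so it does not need the stack:

<source lang=asm>
@ Parameters: r0 = Destination address, r1 = Source address,
@ r2 = Number of 32bit words to copy (must be at least 1)
.type CopyWords, %function
CopyWords:
CopyLoop:
ldr r3, [r1], #4 @ Load word from source, then add 4 to r1
str r3, [r0], #4 @ Store word to destination, then add 4 to r0
subs r2, #1
bne CopyLoop
bx lr
</source>

Both addresses are assumed to be 4-byte-aligned.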
=== Iterating arrays ===
The offset addressing mechanism is perfectly suited to iterating arrays. This could be used to make an array defining a sequence of LED flashes that is iterated by the LED blinker application. Such an array would contain the duration of each on- and off-cycle (as passed to the “Delay” function) and be placed in flash memory:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7], #4 @ Load delay iterations from table and increment address
bl Delay

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
.word 2500000, 1000000, 2500000, 1000000, 2500000, 1000000
.word 1000000, 1000000, 1000000, 1000000, 1000000, 1000000
BlinkTableEnd:
</source>
Example name: “BlinkPattern”

The “.word” directive is used to place a sequence of 32bit-numbers into flash memory. The label “BlinkTable” will refer to the start address of the array, and “BlinkTableEnd” to the first address ''after'' the array. These two addresses are loaded into registers before the loop. The “.align” directive is used to make sure the 32bit-words are stored at properly aligned addresses. Inside the loop, the “ldr” instruction is used to load a 32bit-word from the array and pass it to the “Delay” function. The r7 register is advanced by 4 bytes to the next 32bit-word. This is done twice, for the on- and off-time. At the end of the loop, the address register is compared with the address of “BlinkTableEnd” - until that address has been reached, the loop will continue.

Another possibility is to keep the base address of the array in a register, and increment another register that contains the offset:

<source lang=asm>
.type Blink, %function
Blink:
push { r4-r9, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
ldr r7, =BlinkTable @ Move address of "BlinkTable" into r7
ldr r8, =0
ldr r9, =18

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldr r0, [r7, r8, lsl #2] @ Load delay iterations from table
add r8, #1
bl Delay

cmp r8, r9
blo BlinkLoop

pop { r4-r9, pc }
</source>
Example name: “BlinkPattern2”

Here, r8 is incremented in steps of 1 to denote the index in the array. The “lsl” syntax for “ldr” is used to multiply r8 by 4 (since each word is 4 bytes in size) and add it to r7, which contains the array’s base address. At the end of the loop, r8 is compared with 18, which is the number of entries in the array. This variant is actually less efficient, as it needs to keep both the base address and the index in registers and also has to increment the index in each iteration.
=== Literal loads ===

Regardless of architecture, any processor obviously needs to work with addresses in its own address space a lot. ARM can do calculations with its 32bit addresses just fine, but there is a bottleneck: The instruction set itself. To work with any address, it needs to be initially loaded into a processor register, but ARM instructions are only 16 or 32bit in size - not enough space for an arbitrary 32bit number plus the instruction encoding. Allowing even larger instructions (e.g. 40 bit) would complicate matters, so ARM instead uses several tricks to deal with this problem, which will be discussed here.

The “ldr r0, =1234” syntax allows you to load arbitrary 32bit numbers, but it is not itself a machine code instruction; the assembler translates it into one. In this chapter, the actual instructions for loading immediate numbers are discussed.

==== The “mov”-instruction ====
The most basic way of loading an immediate number into a register is the “mov” instruction:

<source lang=asm>
mov r0, #1234
</source>

This allows you to load any 16bit number (0 to 2^16-1) into a register. “mov” also includes some clever encodings that allow you to load certain commonly-used patterns:

* Any 32bit number that consists of one byte of arbitrary bits (i.e. 8 adjacent arbitrary bits) at any location, and zeros otherwise, e.g. 0x00000045, 0x00045000, 0x7f800000.
* Any 32bit number that consists of the same byte repeated 2 or 4 times in fixed places, as in 0x23002300, 0x00230023, 0x23232323
* The bit-wise negated result of any of these two patterns, e.g. 0xffffffba, 0xfffbafff, 0x807fffff or 0xdcffdcff. The assembler will actually use the “mvn” instruction for this, which works identically to “mov”, but negates the value.

By specifying a number that falls into one of these patterns, the assembler will automatically use the appropriate encoding. The first two ways of encoding numbers are not only available with “mov”, but also several other mathematical instructions that expect some immediate value: “add”, “and”, “bic”, “cmn”, “cmp”, “eor”, “mov”, “mvn”, “orn”, “orr”, “rsb”, “sbc”, “sub”, “teq”, “tst”. In the ARM Architecture Reference Manual, check the description of the instructions and look out for “ThumbExpandImm” to see whether it supports the first two patterns above.

You can also use the “mvn” instruction directly, e.g.:

<source lang=asm>
mov r0, #0xf807ffff
mvn r0, #0x07f80000
</source>

Both lines are equivalent and write the number 0xf807ffff into r0.
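
A few more immediates that fall into the patterns described above, as a sketch with arbitrarily chosen values and registers:

<source lang=asm>
mov r0, #0x00ff0000 @ One byte (0xff) shifted to some position
mov r1, #0x00230023 @ The byte 0x23 repeated in fixed places
mov r2, #0xffffffba @ Negated pattern, assembled as “mvn r2, #0x45”
orr r3, r3, #0x23232323 @ The patterns work with “orr” as well
</source>

A value such as 0x12345678 matches none of the patterns and does not fit into 16 bits either, so the assembler would reject it as an operand of “mov”.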
==== The “movt” instruction ====
While supporting many common patterns, this does not allow arbitrary 32 bit numbers. One way to load any 32bit number is to split the number into two 16bit halves, and use both “mov” and “movt” to combine these two half-words into one register:

<source lang=asm>
mov r0, #0xabcd
movt r0, #0x1234
</source>

The “movt” instruction loads the given number into the upper 16 bits of the register, so this example loads 0x1234abcd into r0. The order is important, as “mov” overwrites the upper 16 bits with zeros, but “movt” keeps the lower 16 bits. If a single “mov” can't fit the desired number, the combination of “mov” and “movt” is the fastest way of loading any 32bit number. As two 32bit instructions are needed, this consumes 8 bytes of program memory. If you want to load the address of a symbol into a register, you need to tell the assembler to split it automatically. This can be achieved by prefixing the symbol with “:lower16:” or “:upper16:”, e.g.:

<source lang=asm>
movw r0, #:lower16:GPIOA_BSRR
movt r0, #:upper16:GPIOA_BSRR
</source>

Note that “movw” needs to be specified in this case to explicitly tell the assembler to use the “mov” variant that accepts 16bit numbers (which it otherwise does automatically when a direct value is given).
==== PC-relative loads ====
The other way of loading arbitrary 32bit values into registers is to place the value directly in flash memory, and load it from there using “ldr”:

<source lang=asm>
@ Some code …
mov r0, … address of Literal ...
ldr r1, [r0]
@ More code …
Literal:
.word 0x12345678
</source>

However, there is a Chicken-And-Egg problem - the address of “Literal” is a 32bit number itself, so how to load it into r0? Luckily, there is a register that contains a number close to the one needed - the program counter (PC, r15) indicates the address of the instruction currently being executed. By reading it and adding a small offset that fits into the instruction itself, the address of “Literal” can be obtained, provided that “Literal” is located close enough. Consider this example of the EnableClockGPIOA function:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
add r1, pc, #12
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
.word RCC_APB2ENR
</source>

The 32bit-value “RCC_APB2ENR” is stored in flash memory. The “add” instruction is used to add the offset 12 to the address of the instruction itself to obtain the address of said 32bit-value, which is then loaded via “ldr”. The offset 12 is actually not easy to calculate and even depends on the alignment of the “add” instruction itself (hence the “.align” to ensure a consistent example). The assembler is capable of doing the calculation on its own, for which the “adr” instruction is used:

<source lang=asm>
.align 2
.type EnableClockGPIOA, %function
EnableClockGPIOA:
adr r1, LiteralRCC_APB2ENR
ldr r1, [r1]
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

The label LiteralRCC_APB2ENR refers to the address of the 32bit-value in memory. “adr” is actually a variant of “add” that instructs the assembler to calculate the offset and place it into the instruction itself, which then lets the processor add it to PC and write the result to r1. This address is then used by “ldr”.

The “adr” instruction is useful when the address of some literal is explicitly needed; for example, in the blinker program, it can be used to obtain the addresses of the array:

<source lang=asm>
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8
</source>

However, for loading a single value, the address is actually not needed. In this case, “adr” and “ldr” can be combined:

<source lang=asm>
.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, LiteralRCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.align 2
LiteralRCC_APB2ENR:
.word RCC_APB2ENR
</source>

With this special variant of “ldr”, the assembler calculates the offset as with “adr”; at runtime, the processor adds it to “PC” and loads the data found at the resulting address into r1. This is much easier than the first variant, as all calculations are done automatically. It is still somewhat cumbersome having to write three lines just to obtain a single 32bit value. Therefore, the assembler offers this already introduced syntax:

<source lang=asm>
ldr r1, =RCC_APB2ENR
</source>

This is a special command for the assembler. If possible, the assembler will use the “mov” or “mvn” instruction to load the value. If the value won’t fit, it will be put into flash memory, and a “ldr” instruction as above will be used. In this case, the “ldr rX, =...” syntax is equivalent to the combination of specifying a label for the value, the “.word” directive and “ldr rX, <Label>”. Therefore, this syntax is usually the best way to load immediates.
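The following sketch shows what the assembler typically does with three different constants; the values are arbitrary, and the exact choice may depend on the assembler version:

<source lang=asm>
ldr r0, =0x00230023 @ Fits a "mov" pattern, typically becomes: mov r0, #0x00230023
ldr r1, =0xffffff00 @ Negated pattern, typically becomes: mvn r1, #0xff
ldr r2, =0x12345678 @ Fits neither; the value goes into flash and is loaded PC-relative
</source>

Checking the disassembly of the resulting program is the reliable way to see which variant was actually used.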

The assembler places the literals at the end of the file. If the file is long, the offset will be too long for the “ldr” and “adr” instructions and the assembler will emit an error. You can instruct the assembler to place all literals that have been declared so far at a specific point using the “.ltorg” directive. It is recommended to place an “.ltorg” after each function (after the “bx lr”) - just make sure that execution will never reach there. If a single function is so long that an “.ltorg” at the end is too far away from “ldr”/“adr” at the beginning, you can place an “.ltorg” somewhere in the middle and jump over it with “b”.
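A minimal sketch of both placements follows. The function names “StoreMagic” and “LongFunction”, the label “LongFunctionContinue” and the constants are hypothetical and only serve as illustration:

<source lang=asm>
@ r0 = Address to write to
.type StoreMagic, %function
StoreMagic:
ldr r1, =0x12345678 @ Does not fit "mov"/"mvn", so it goes into the next literal pool
str r1, [r0]
bx lr @ Return to caller
.ltorg @ Literal pool is emitted here; execution never reaches it

.type LongFunction, %function
LongFunction:
ldr r1, =0x89abcdef @ Literal is placed in the pool a few lines below
b LongFunctionContinue @ Jump over the literal pool
.ltorg
LongFunctionContinue:
@ ... many more instructions ...
bx lr
.ltorg
</source>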

In summary, the following rules can help make literal loads more efficient:
* Avoid literal loads if possible; try to calculate needed values from other values that have already been loaded, possibly by using offset-addressing in “ldr”/“str”
* When accessing multiple registers of a single periphery module, load its base address once and use offset addressing to access the individual registers
* If you need a pointer to a location in flash memory, try using “adr”
* If speed is important, use “movw”+“movt” to load the value
* Else, use “ldr rX, =...” to have the assembler choose the optimal encoding
* Place “.ltorg” after each function
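The second rule can be sketched as follows. The symbols “GPIOA”, “GPIOx_CRH” and “GPIOx_BSRR” are assumptions made for this illustration; their values are derived from the absolute register addresses 0x40010804 and 0x40010810 used in the examples:

<source lang=asm>
GPIOA = 0x40010800 @ Base address of the GPIOA registers
GPIOx_CRH = 0x04 @ Offset of CRH relative to the base address
GPIOx_BSRR = 0x10 @ Offset of BSRR relative to the base address

ldr r1, =GPIOA @ One load for the whole periphery module
ldr r0, [r1, #GPIOx_CRH] @ Read GPIOA_CRH (0x40010804)
mov r2, #0x100
str r2, [r1, #GPIOx_BSRR] @ Write GPIOA_BSRR (0x40010810) to set PA8 high
</source>

Only one base address has to be loaded, and the offsets are encoded directly in the “ldr” and “str” instructions.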

The “ldr … =” instruction can also be used to load any immediate 32bit value into the PC to cause a jump to that address, simply by specifying “pc” as the target register. If you perform an ordinary branch (via “b” or “bl”) to some function whose address is too far away from the current code location, the linker will insert a “wrapper” function that does exactly that to perform the “far” jump. That function is called a “veneer”.
=== The SysTick timer ===
An important aspect of many embedded systems is to control timing of technical processes. In the blinker example, the timing of the LED flashes was handled by having the processor execute dummy instructions to pass time. It is however virtually impossible to accurately predict the runtime of any piece of code on a complex processor such as ARM ones, and the runtime may vary among multiple runs and depending on the actual micro controller and its configuration. For a simple LED blinker this may be acceptable, but not for e.g. a closed loop controller for some mechanical actuator. Therefore, almost all micro controllers and also application processors feature one or more hardware timers, which allow measuring time independently of the execution speed of the software. Timer features vary widely among different processors, but the basic idea is to increment or decrement some digital counter at each clock cycle and trigger some event when it reaches a certain value.

All ARMv7-M processors feature the so-called “SysTick”-Timer as part of the processor core itself. This is a rather simple 24bit-timer that counts from a configurable value back to zero, then resets to that value and triggers an event. This timer is frequently used as a time base for RTOS or other runtime libraries. The timer uses three periphery registers: “RVR” contains the value from which to count down. “CVR” contains the current value, and “CSR” contains some status and control bits. The timer can be used for the “Delay” function like this:

<source lang=asm>
SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

@ Parameters: r0 = Number of iterations
.type Delay, %function
Delay:
ldr r1, =SCS
add r0, r0, r0, lsl #1

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

DelayLoop:
ldr r0, [r1, #SCS_SYST_CSR]
tst r0, #0x10000
beq DelayLoop

ldr r0, =0
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>

The SysTick is part of the “System Control Space”, SCS. The SCS base address is defined as a symbol, and the relative addresses of the registers as well. The count value is stored in “RVR”, after which “CVR” has to be set to zero. The timer is started by writing “5” into the “CSR” register. The loop repeatedly reads the “CSR” register and continues until bit 16 is set. The “tst” instruction is used to perform an “and” operation with the register contents and an immediate value without keeping the result while just updating the flags. At the end, the “CSR” register is set to zero to disable the timer. The “add” instruction at the beginning is used to multiply the count value by 3: r0 is shifted left by one, i.e. multiplied by two, and then added to itself, as in r0*2^1+r0. This is a common trick to quickly multiply by constants. By including this multiplication, the duration is the same as with the previous “Delay” variant, which, on this microcontroller, uses about 3 cycles per loop iteration.
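The same shift-and-add idea works for other small constants. The following lines are an illustrative sketch with arbitrarily chosen registers, not part of the “Delay” function:

<source lang=asm>
add r1, r0, r0, lsl #1 @ r1 = r0 + r0*2 = r0 * 3
add r2, r0, r0, lsl #2 @ r2 = r0 + r0*4 = r0 * 5
rsb r3, r0, r0, lsl #3 @ r3 = r0*8 - r0 = r0 * 7
lsl r0, r0, #4 @ r0 = r0 * 16
</source>

Note that “rsb” subtracts its first operand from the shifted second operand, which is why it yields r0*8-r0.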

Managing timing this way (or any other kind of “Delay” function) is still not very accurate. The time needed to call the function, start the timer, return, and set the pins is added to the actual duration and may also vary each time. The timing errors accumulate over time - a clock implemented this way will quickly go wrong. The proper way to achieve accurate timing is to start the timer once, let it run continuously, and react to its events. The internal clock source used by the microcontroller is also quite inaccurate (up to 2.5% deviation), which can be improved upon by a quartz crystal (typical accuracy of e.g. 0.005%), which will be covered later. Reacting to events instead of calling a function that executes dummy code requires restructuring the program code, without using any kind of “Delay” function.

To do that, the timer is started once at program startup and kept running. After setting the LED pin, wait for the timer event, and repeat. In the last example, the values 3000000 and 7500000 are used for the timer register (3x1000000 and 3x2500000, respectively). Changing the timer value while it is running continuously is problematic, so one fixed value should be used; to achieve variable blinker duration, multiple timer events need to be counted. The greatest common divisor of the two numbers is 1500000, so to achieve the two different times, 2 and 5 timer events need to be registered, respectively. Since these numbers fit into a single byte, the table entries and corresponding access instructions are changed to byte. A function “StartSysTick” is implemented to start the timer once, and a function “WaitSysTick” to wait for a given number of timer events:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
bl EnableClockGPIOA
bl ConfigurePA8
ldr r0, =TimerValue
bl StartSysTick
bl Blink
b .

.type Blink, %function
Blink:
push { r4-r8, lr }
ldr r4, =GPIOA_BSRR @ Load address of GPIOA_BSRR
ldr r5, =GPIOx_BSRR_BS8 @ Register value to set pin to high
ldr r6, =GPIOx_BSRR_BR8 @ Register value to set pin to low
adr r7, BlinkTable @ Move address of "BlinkTable" into r7
adr r8, BlinkTableEnd @ Move address of "BlinkTableEnd" into r8

BlinkLoop:
str r5, [r4] @ Set BS8 in GPIOA_BSRR to 1 to set PA8 high

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

str r6, [r4] @ Set BR8 in GPIOA_BSRR to 1 to set PA8 low

ldrb r0, [r7], #1 @ Load delay iterations from table and increment address
bl WaitSysTick

cmp r7, r8
blo BlinkLoop

pop { r4-r8, pc }

.align 2
.type BlinkTable, %object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2, 2
BlinkTableEnd:
.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =5
str r0, [r1, #SCS_SYST_CSR]

bx lr

@ r0 = Number of timer events to wait for
.type WaitSysTick, %function
WaitSysTick:
ldr r1, =SCS

WaitSysTickLoop:
ldr r2, [r1, #SCS_SYST_CSR]
tst r2, #0x10000
beq WaitSysTickLoop

subs r0, #1
bne WaitSysTickLoop

bx lr

</source>
Example name: “BlinkSysTick”

This way, the blinker frequency will be as stable and accurate as possible with the given clock source.
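One small refinement, which is not part of the example above: according to the ARMv7-M architecture documentation, a reload value of N-1 has to be used to obtain a period of exactly N clock cycles, because the counter also passes through zero. A sketch of the corresponding change in “Reset_Handler” could look like this:

<source lang=asm>
ldr r0, =(TimerValue-1) @ Reload value N-1 gives a period of exactly TimerValue cycles
bl StartSysTick
</source>

With a value of 1500000, the difference is below one part per million and therefore much smaller than the inaccuracy of the internal clock source.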
=== Exceptions & Interrupts ===
Exceptions and interrupts play an important role in low-level development. They provide a facility for hardware to notify the software of events, such as received data blocks or a timer event. On ARM, interrupts are a sub-group of exceptions – there are some “system-level” exceptions mostly for dealing with processor errors and providing operating system support, while interrupts are “special” exceptions for events signaled by periphery modules. When writing “regular” microcontroller software, you will mostly work with interrupts.

Exceptions (and interrupts) interrupt normal program flow, and cause the processor to execute some other piece of code which is called the exception handler or Interrupt Service Routine (ISR) (even for the “system-level” exceptions that are not interrupts). After dealing with the indicated event, the ISR typically returns and normal program flow resumes. As exceptions can interrupt the program anytime, data (and periphery) may be in any kind of inconsistent state, so special care must be taken to avoid corrupting program state in an ISR. The ARMv7-M processors (including the Cortex-M3) provide sophisticated support for exceptions, with configurable priorities and nested exception calls. This chapter will only cover the basics for using exceptions.

On ARMv7-M, exception handlers are implemented as regular functions, for example:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
@ Handle event ...
bx lr
</source>

Like any other function, it has a label, returns with “bx lr”, and is also made globally visible to other source files using “.global”. The “.type … %function” is required here for the same reason as for the already-mentioned “Reset_Handler”. Exception handlers can be located anywhere in flash memory, among the other regular functions. To tell the processor where the exception handlers for the various exception types are located, the vector table needs to be adjusted. Until now, the vector table was defined as:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0xe4
</source>

Recall that the first 32bit-word in flash memory contains the initial stack pointer (defined via “.word _StackEnd”) and the second word contains the address of the first instruction of the program (defined via “.word Reset_Handler”). Actually, resetting the controller is an exception too, and the code to be executed after reset (or start-up) is the handler for the reset exception (hence the name “Reset_Handler”). The next 228 bytes of flash memory contain 57 32bit-addresses of the handlers of the other exceptions, including interrupts. The “.space” directive just fills those with zeroes. To tell the processor the address of an exception handler, the appropriate entry in this table needs to be set to that address. In chapter 10.1.2, table 63 of the controller’s reference manual, the format of the vector table, and which exception’s address should go where, is defined. Only the interrupts up until position 42 actually exist on the STM32F103RB/C8, as defined in chapter 2.3.5 of the datasheet; everything from “TIM8_BRK” is only present on larger controllers. According to the table, the SysTick exception handler’s address needs to be put at location 0x3C relative to the beginning of flash memory. Since the first 8 bytes are already occupied, 0x34 bytes of space are needed after those first 8 bytes.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.section .VectorTable, "a"
.word _StackEnd
.word Reset_Handler
.space 0x34
.word SysTick_Handler
.space 0xac
</source>

With this modification, the SysTick_Handler function is now declared as the handler for the SysTick exception. By default, the SysTick timer does not trigger an exception. To do that, you have to set bit 1 (“TICKINT”, value 2) in the SCS_SYST_CSR register, which is why the value written to “CSR” changes from 5 to 7. By placing the logic for the blinker into the timer's ISR, you get an interrupt-based blinker:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

RCC_APB2ENR = 0x40021018
RCC_APB2ENR_IOPAEN = 4
GPIOA_CRH = 0x40010804

GPIOA_BSRR = 0x40010810
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =Variables
ldr r1, =0
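strb r1, [r0, #(BlinkStep-Variables)] @ BlinkStep = 0 (1-byte variable)
ldr r1, BlinkTable @ Load first 4 table bytes; lowest byte is the first entry
strb r1, [r0, #(TimerEvents-Variables)] @ TimerEvents = first table entry (1-byte variable)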

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

ldr r0, =TimerValue
bl StartSysTick
SleepLoop:
wfi
b SleepLoop

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq Return

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs Return

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne Return

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs SkipRestart

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

SkipRestart:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA_BSRR @ Load address of GPIOA_BSRR
str r1, [r0]

Return:
bx lr

.align 2
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC_APB2ENR
ldr r0, [r1]
orr r0, r0, #RCC_APB2ENR_IOPAEN
str r0, [r1] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA_CRH
ldr r0, [r1]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr

</source>
Example name: “BlinkSysTickInterrupt”

The regular program flow now consists only of initializing the periphery, timer, and the first step of the blinker (setting the pin high). After that, the processor should do nothing but wait for exceptions, which is achieved by a simple endless loop. The “wfi” instruction suspends the processor; when an exception occurs, the processor will wake up, execute the ISR, and return execution after the “wfi”. Therefore, “wfi” is usually put in an endless loop as shown. This technique can reduce the processor’s power consumption significantly, as it is only running when something needs to be done, as indicated via interrupts. The ISR first checks whether the interrupt flag in the timer register is set - this is necessary, since exceptions can sometimes occur “spuriously”, i.e. without an actual event causing it. The decision whether to set or reset the pin state is taken based on the lowest bit of the table index, such that the output alternates between 1 and 0.

The code inside the ISR needs to know which step in the blinking sequence is currently active, and how many timer events have already occurred inside the current step. Therefore, two 1-byte-variables are stored in RAM. To access them, offset addressing is used, where r0 contains the base address of the variables in memory, and the offsets inside “ldrb” and “strb” are set accordingly. The last number of the blink sequence table is omitted, since it is actually superfluous, because no action is taken after the last delay has elapsed. Because the table size is now odd, an “.align” directive after it is required. Always putting “.align” after outputting data is a good idea anyways.

Since exceptions can occur at any point in regular program flow, the processor registers may contain some data that will be used after the exception handler returns. Therefore, if the exception handler writes anything into the registers, they need to be restored when returning from the exception. Upon exception entry, the Cortex-M3/4 processors automatically store the registers r0-r3, r12, r14 (LR) and APSR (including the flags) on the stack. The link register is filled with a special “dummy” value, and when the exception handler returns via “bx lr” using this value, the processor restores said registers to their previous state. This effectively means that you can implement exception handlers like any other function, i.e. freely overwrite r0-r3, r12 and the flags and push/pop r4-r11 and the LR if needed.
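As a sketch of the last point, an exception handler that needs r4 and calls another function could look as follows; “SomeFunction” is a hypothetical name used only for illustration:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
push { r4, lr } @ Save r4 (not saved by hardware) and the special return value in lr
mov r4, #0 @ r4 may now be used freely
bl SomeFunction @ The call overwrites lr, which is why it was saved
pop { r4, pc } @ Restore r4 and return from the exception
</source>

Loading the special value into the PC via “pop” triggers the exception return just like “bx lr” does.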
=== Macros ===
The assembler provides a few mechanisms to make assembly-language development easier. One of those are macros, which allow you to define snippets of assembly code that you can then insert easily whenever you need them. While looking similar to function invocations, the code inside the macro is actually copied each time the macro is used, so don’t overuse them. Macros are started with “.macro” and end at the next “.endm” directive. For example, the following macro sets the LED pin to 0 or 1:

<source lang=asm>
.macro SETLED value
ldr r0, =GPIOA_BSRR
ldr r1, =(((!\value) << 24) | (\value<<8))
str r1, [r0]
.endm

SETLED 0
SETLED 1
</source>

The macro name is defined as “SETLED”, and a single parameter with name “value” is given. By typing “\value”, the value of the parameter is substituted in the macro body. Some bit-shifting is used to calculate the right bit pattern to write into BSRR to set or reset the pin accordingly.
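Macros can also take several parameters. As a further sketch, the hypothetical macro “LOADCONST” (not part of the tutorial's examples) wraps the “movw”/“movt” combination for a constant that is known at assembly time:

<source lang=asm>
.macro LOADCONST reg, value
movw \reg, #((\value) & 0xffff) @ Lower 16 bits
movt \reg, #(((\value) >> 16) & 0xffff) @ Upper 16 bits
.endm

LOADCONST r0, 0x1234abcd @ Expands to movw r0, #0xabcd and movt r0, #0x1234
</source>

This only works for plain numbers; for symbols that are resolved by the linker, the “:lower16:” and “:upper16:” prefixes are needed instead.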
=== Weak symbols ===
As explained before, labels defined in assembly files get translated into symbols in the object code files, which are resolved by the linker. Sometimes it is desirable to provide a “default” or “fallback” implementation of some function (or data block) which is only used when no other implementation is given. This can be achieved by marking the “fallback” variant with “.weak”:

<source lang=asm>
.type Function1, %function
.global Function1
.weak Function1
Function1:
@ Default implementation …
…
bl Function1 @ Call the function
</source>

With this code alone, “Function1” will be used normally. If you put another function with the same name in a different assembly source file, that second variant will be used.
=== Symbol aliases ===
It is also possible to define aliases for symbols using “.thumb_set”, which sets the address accordingly. For example:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

When trying to call “Function2”, the linker will automatically fill in the address of “Function1”. This can also be combined with “.weak” to define a weak alias:

<source lang=asm>
.type Function1, %function
.global Function1
Function1:
@ Some Code

.weak Function2
.thumb_set Function2, Function1
…
bl Function2 @ Call the function
</source>

If you now define another “Function2” in a different assembly source file, that will be used. If not, “Function1” will be called, which is the target of the alias definition. This is useful if you want to define one default implementation for several different functions, for each of which you need one “.weak” and one “.thumb_set” directive.
=== Improved vector table ===
The techniques from the last three sections can be used to improve the definition of the vector table. The way it was defined before is not very flexible; to insert new entries, you have to calculate the new gap sizes and offsets. First, define a default handler ISR that is called by exceptions for which no other ISR is defined, and a macro that defines an alias for one exception with the default handler as the target, and finally a table of all exceptions by using the macro:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.macro defisr name
.global \name
.weak \name
.thumb_set \name, Default_Handler
.word \name
.endm

.global VectorTable
.section .VectorTable, "a"
.type VectorTable, %object
VectorTable:
.word _StackEnd
defisr Reset_Handler
defisr NMI_Handler
defisr HardFault_Handler
defisr MemManage_Handler
defisr BusFault_Handler
defisr UsageFault_Handler
.word 0
.word 0
.word 0
.word 0
defisr SVC_Handler
defisr DebugMon_Handler
.word 0
defisr PendSV_Handler
defisr SysTick_Handler
defisr WWDG_IRQHandler
defisr PVD_IRQHandler
defisr TAMPER_IRQHandler
defisr RTC_IRQHandler
defisr FLASH_IRQHandler
defisr RCC_IRQHandler
defisr EXTI0_IRQHandler
defisr EXTI1_IRQHandler
defisr EXTI2_IRQHandler
defisr EXTI3_IRQHandler
defisr EXTI4_IRQHandler
defisr DMA1_Channel1_IRQHandler
defisr DMA1_Channel2_IRQHandler
defisr DMA1_Channel3_IRQHandler
defisr DMA1_Channel4_IRQHandler
defisr DMA1_Channel5_IRQHandler
defisr DMA1_Channel6_IRQHandler
defisr DMA1_Channel7_IRQHandler
defisr ADC1_2_IRQHandler
defisr USB_HP_CAN1_TX_IRQHandler
defisr USB_LP_CAN1_RX0_IRQHandler
defisr CAN1_RX1_IRQHandler
defisr CAN1_SCE_IRQHandler
defisr EXTI9_5_IRQHandler
defisr TIM1_BRK_IRQHandler
defisr TIM1_UP_IRQHandler
defisr TIM1_TRG_COM_IRQHandler
defisr TIM1_CC_IRQHandler
defisr TIM2_IRQHandler
defisr TIM3_IRQHandler
defisr TIM4_IRQHandler
defisr I2C1_EV_IRQHandler
defisr I2C1_ER_IRQHandler
defisr I2C2_EV_IRQHandler
defisr I2C2_ER_IRQHandler
defisr SPI1_IRQHandler
defisr SPI2_IRQHandler
defisr USART1_IRQHandler
defisr USART2_IRQHandler
defisr USART3_IRQHandler
defisr EXTI15_10_IRQHandler
defisr RTCAlarm_IRQHandler
defisr USBWakeUp_IRQHandler

.text

.type Default_Handler, %function
.global Default_Handler
Default_Handler:
bkpt
b.n Default_Handler
</source>

There are a few empty entries in the table that are not used by the processor. At the beginning, there is still the definition for the initial stack pointer and the “Reset_Handler”. If you replace your “vectortable.S” by this code, you get a “proper” vector table. The “SysTick_Handler” will continue to work as before, and if you need to define any other ISR, for example for USART1, just define a function by the exact name “USART1_IRQHandler”. The address of this function will automatically be put in the vector table. If an exception without a corresponding ISR occurs, the “Default_Handler” will be called, which uses the “bkpt” instruction to force a breakpoint via the attached debugger. This helps with debugging missed exceptions and avoids having to define several individual dummy handler functions.
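For example, a handler for USART1 could be sketched like this in any other source file of the project; the handler body is a placeholder:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type USART1_IRQHandler, %function
.global USART1_IRQHandler @ No ".weak" here, so this definition replaces the weak alias
USART1_IRQHandler:
@ Handle USART1 event ...
bx lr
</source>

Because this definition is not weak, the linker prefers it over the weak alias to “Default_Handler” created by the “defisr” macro.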
=== .include ===
Having to put the register and bit definitions (“RCC_APB2ENR”, “RCC_APB2ENR_IOPAEN”, ...) in each assembly source file is redundant and error-prone. Instead, you can put them into a separate file (e.g. called “stm32f103.inc”) and use the “.include” directive to reference it:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

@ Normal code ...
</source>

The assembler will read the code from the included file and pretend it was written instead of the “.include” line. This can help improve code structure. While working on the project structure, you can also restructure the definitions for the GPIO registers to facilitate offset addressing:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
</source>

The next example incorporates these changes in addressing the registers.
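A complete “stm32f103.inc” matching these changes might look as follows. This is a sketch inferred from the addresses used in the earlier examples and from how the symbols are used in the next example; in particular, “RCC_APB2ENR” is assumed to be an offset relative to “RCC” and “RCC_APB2ENR_IOPAEN” a bit number:

<source lang=asm>
RCC = 0x40021000
RCC_APB2ENR = 0x18 @ Offset, i.e. absolute address 0x40021018
RCC_APB2ENR_IOPAEN = 2 @ Bit number; (1 << 2) equals the previous value 4

GPIOA = 0x40010800
GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000
GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18
</source>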
=== Local Labels ===
Having to invent unique labels for all jump targets inside functions (e.g. for conditional code and loops) can be tedious. When using a disassembler (see below), each label will appear as its own function. Therefore, the GNU assembler supports local labels. These are labels whose names consist only of a number. Local names need not be unique; several labels called e.g. “1” may exist in one file. To perform a jump to a local label, use the number and append an “f” or “b” to indicate whether to jump forward or backward. Local labels cannot be exported with the “.global” directive. The interrupt-based blinker can be modified like this using local labels:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500000

.data
Variables:
BlinkStep:
.space 1
TimerEvents:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =Variables
ldr r1, =0
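strb r1, [r0, #(BlinkStep-Variables)] @ BlinkStep = 0 (1-byte variable)
ldr r1, BlinkTable @ Load first 4 table bytes; lowest byte is the first entry
strb r1, [r0, #(TimerEvents-Variables)] @ TimerEvents = first table entry (1-byte variable)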

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b

.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =Variables

ldrb r1, [r0, #(BlinkStep-Variables)]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldrb r3, [r0, #(TimerEvents-Variables)]
subs r3, #1

itt ne
strbne r3, [r0, #(TimerEvents-Variables)]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0, #(TimerEvents-Variables)]

1:
strb r1, [r0, #(BlinkStep-Variables)]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ r0 = Count-Down value for timer
.type StartSysTick, %function
StartSysTick:
ldr r1, =SCS

str r0, [r1, #SCS_SYST_RVR]
ldr r0, =0
str r0, [r1, #SCS_SYST_CVR]

ldr r0, =7
str r0, [r1, #SCS_SYST_CSR]

bx lr
</source>
Example name: “BlinkLocalLabels”
=== Initializing RAM ===
The blinker program uses 2 byte-variables in memory, which have to be initialized to some value at startup. For large programs with many variables, this quickly becomes hard to maintain and also inefficient. Assembler and linker can help produce an “image” of what the RAM contents should look like after initialization, and place this image in flash memory alongside the normal program data. At startup, this image can simply be copied 1:1 into RAM in a loop. Most programs contain many variables that will be initialized with zero, so placing a (possibly large) block of zeroes in flash memory is wasteful; therefore, an additional loop is used to initialize all zero-variables to zero. Both techniques are also employed by C and C++ compilers, so implementing the initialization code is required there too. First, change the declaration of your variables by using “.byte”, “.hword” and “.word” and include the desired initialization value. Variables that should be initialized with zero are placed after a “.bss” directive to put them into the equally-named section. They don’t get an initialization value but just reserved space by using “.space”:

<source lang=asm>
.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1
</source>

From the assembler’s point of view, the initialization data - in this case, just one byte of value “2” - will directly end up in RAM. However, this is not possible on microcontrollers, as the RAM contains undefined data after power-up and isn’t initialized automatically. To have the initialization data stored in flash memory instead, change the linker script as follows:

<source>
MEMORY {
FLASH : ORIGIN = 0x8000000, LENGTH = 128K
SRAM : ORIGIN = 0x20000000, LENGTH = 20K
}

SECTIONS {
.VectorTable : {
*(.VectorTable)
} >FLASH

.text : {
*(.text)
. = ALIGN(4);
} >FLASH

.stack (NOLOAD) : {
. = . + 0x400;
_StackEnd = .;
} >SRAM

.data : {
_DataStart = .;
*(.data);
. = ALIGN(4);
_DataEnd = .;
} >SRAM AT >FLASH

_DataLoad = LOADADDR(.data);

.bss (NOLOAD) : {
_BssStart = .;
*(.bss);
. = ALIGN(4);
_BssEnd = .;
} >SRAM
}
</source>
Example name: “BlinkInitRAM”

The stack got put in its own section with the “NOLOAD” attribute, since it doesn’t need initializing. The data is now put in the “.data” section. The initial data for that section is put into flash memory via the “>SRAM AT >FLASH” construct. The addresses of symbols inside the “.data” section are still the addresses in RAM, so accesses to the symbols from assembly code still work. The symbol “_DataStart” is assigned the beginning of the initialized data in RAM, and “_DataEnd” the end. The “LOADADDR” function is used to get the beginning of the initialization data in flash, and assign it to “_DataLoad”. The “.bss” section contains all the variables that should be zero-initialized, and the symbols “_BssStart” and “_BssEnd” are set to its beginning and end address, respectively. It is marked with “NOLOAD” too, because we don’t want to store (potentially many) zeroes in the linked program file; instead, it is initialized explicitly (see below).

As the beginning and size of the stack are already multiples of 4, the beginning of “.data” is as well. The size of “.data” might not be a multiple of 4, however, so a “. = ALIGN(4)” command is inserted right before the definition of “_DataEnd”. This adds 0-3 dummy bytes by incrementing the location counter to make sure the address is a multiple of 4. The same thing is done right before “_BssEnd” and also at the end of the “.text” section, to make sure that “_BssEnd” and “_DataLoad” are multiples of 4 as well.
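
The effect of “. = ALIGN(4)” on the location counter can be modelled with a small C helper. This is only an illustrative sketch for experimenting on a PC; the name “align_up” is made up here and is neither part of the linker nor of the example project:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of alignment,
   which must be a power of two. This mirrors what ". = ALIGN(4)" does
   to the linker's location counter. */
uint32_t align_up(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

With “.data” starting at 0x20000400 and holding a single byte, the location counter is 0x20000401 before the alignment; “align_up(0x20000401, 4)” yields 0x20000404, i.e. three padding bytes are added. An address that is already a multiple of 4 is left unchanged.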

The only thing left is the actual initialization of the RAM. To do that, change the “Reset_Handler” as follows:

<source lang=asm>
.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TimerValue
bl StartSysTick
1:
wfi
b 1b
.ltorg
</source>

The explicit initialization of the variables was removed. Instead, the addresses for “_DataStart”, “_DataEnd” and “_DataLoad” that were defined in the linker script are loaded. Then, a short loop repeatedly loads a word from flash (starting at “_DataLoad”) and stores it into RAM (starting at “_DataStart”). The address pointers are incremented by the “ldr”/“str” instructions after the access. The pointer for the RAM location is compared with the end of the RAM area (“_DataEnd”) to decide whether to jump back to the beginning of the loop. To start the loop, a jump directly to the comparison is performed; this way, the single comparison at the end of the loop also handles the case of an empty area, and no separate check before the loop is needed. The second loop performs the zero-initialization of the area between “_BssStart” and “_BssEnd”; it works similarly, but does not need to load any data.
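
For readers more familiar with C, the two loops can be sketched as follows. This is a rough equivalent meant for experimenting on a PC, not the code used in the example: the function names are made up, and the parameters stand in for the linker symbols, which a real startup routine would reference directly.

```c
#include <assert.h>
#include <stdint.h>

/* Copy the initialization image word by word, like the first loop of the
   Reset_Handler. The parameters stand in for _DataStart, _DataEnd and
   _DataLoad (hypothetical interface, chosen for this sketch). */
void copy_data(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    assert(dst <= dst_end);    /* the area may be empty, but not reversed */
    while (dst < dst_end)      /* "cmp r0, r1" / "blo 1b" */
        *dst++ = *src++;       /* "ldr r3, [r2], #4" / "str r3, [r0], #4" */
}

/* Zero an area word by word, like the second loop
   (_BssStart up to, but not including, _BssEnd). */
void zero_bss(uint32_t *dst, const uint32_t *dst_end)
{
    assert(dst <= dst_end);
    while (dst < dst_end)
        *dst++ = 0;
}
```

Because the condition is tested before the first iteration, an empty area is skipped entirely, just like in the assembly version that jumps to the comparison first.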

Unfortunately, the program as shown cannot be assembled: as the two variables now reside in two different sections (“.data” and “.bss”), the offset addressing in the “SysTick_Handler” doesn’t work anymore. Therefore, direct addressing has to be used:

<source lang=asm>
.type SysTick_Handler, %function
.global SysTick_Handler
SysTick_Handler:
ldr r0, =SCS
ldr r0, [r0, #SCS_SYST_CSR]
tst r0, #0x10000
beq 2f

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA @ Load base address of GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr
</source>

=== Peripheral interrupts ===
Interrupts, i.e. exceptions triggered by peripheral modules, need a little extra code compared to the “core” exceptions such as the SysTick. The Cortex-M’s interrupt controller (the NVIC) contains several registers for configuring these interrupts. It is possible to configure the priority and to trigger interrupts manually, but for most applications, the only necessary step is enabling the desired interrupt. This is done via the registers “NVIC_ISER0” through “NVIC_ISER15”, which are documented in the ARMv7-M Architecture Reference Manual in chapter B3.4.4. Each of those registers contains 32 bits, each of which enables one interrupt. The STM32F103RB/C8 has 43 interrupts, so only two of the possible 16 registers are present. The number of interrupts is given in chapter 2.3.5 of the controller’s datasheet. So, to enable some interrupt x, the bit “x mod 32” in register NVIC_ISER''y'' with y=x/32 (rounded down) has to be set. This register’s address is 0xE000E100+y*4. Given an interrupt’s number in r0, the following function does just that:

<source lang=asm>
NVIC_ISER0 = 0xE000E100

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>
Example name: “BlinkTIM1”

The “and” instruction calculates “x mod 32”, and the following left-shift (“lsls”) calculates the value where bit “x mod 32” is one, and all others are zero. To calculate the offset address “y*4”, i.e. “(x/32)*4”, the register is first shifted right by 5 bits and then shifted back left by 2 bits. This is the same as shifting 3 bits right and zeroing out the lower 2 bits; but two shift instructions actually consume less program memory space. Finally, the calculated value is written into the register by using offset addressing.
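
The same calculation can be written down in C to check it on a PC. This is a sketch only: the function names are made up, and the functions merely compute the address and the value without touching any hardware register.

```c
#include <assert.h>
#include <stdint.h>

#define NVIC_ISER0 0xE000E100u  /* same value as in the assembly code */

/* Address of the NVIC_ISERy register responsible for interrupt number irq. */
uint32_t iser_address(uint32_t irq)
{
    assert(irq < 16 * 32);                  /* at most 16 registers of 32 bits */
    return NVIC_ISER0 + ((irq >> 5) << 2);  /* y*4 with y = irq/32 */
}

/* Value to write: only bit (irq mod 32) is set. */
uint32_t iser_mask(uint32_t irq)
{
    return UINT32_C(1) << (irq & 0x1F);
}
```

For interrupt number 25, this gives the address 0xE000E100 (NVIC_ISER0) and the value 0x02000000. For an interrupt number of 37, the address is 0xE000E104 (NVIC_ISER1) and the value is 0x20.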

In addition to enabling the interrupt in the processor core’s NVIC, it also has to be enabled in the peripheral module. Many peripheral modules support several different events, each of which has to be enabled individually in the peripheral’s registers. Depending on the controller, these can be mapped to one single processor interrupt (and hence, one single ISR) or to multiple ones, and need to be configured in the NVIC accordingly.

This example uses the STM32’s peripheral timer TIM1 instead of the SysTick timer:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.include "stm32f103.inc"

TimerValue=1500
TimerPrescaler=1000

.data
TimerEvents:
.byte 2

.bss
BlinkStep:
.space 1

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

bl EnableClockGPIOA
bl EnableClockTIM1
bl ConfigurePA8

ldr r1, =GPIOx_BSRR_BS8
ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

ldr r0, =TIM1_UP_IRQn
bl EnableIRQ
bl StartTIM1
1:
wfi
b 1b
.ltorg

.type TIM1_UP_IRQHandler, %function
.global TIM1_UP_IRQHandler
TIM1_UP_IRQHandler:
ldr r0, =TIM1
ldr r2, =(~(1 << TIMx_SR_UIF))

ldr r1, [r0, #TIMx_SR]
bics r1, r2
beq 2f

str r2, [r0, #TIMx_SR]

ldr r0, =BlinkStep

ldrb r1, [r0]

cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 2f

ldr r0, =TimerEvents
ldrb r3, [r0]
subs r3, #1

itt ne
strbne r3, [r0]
bne 2f

add r1, #1
cmp r1, #(BlinkTableEnd-BlinkTable)
bhs 1f

ldr r2, =BlinkTable
ldrb r3, [r2, r1]
strb r3, [r0]

1:
ldr r0, =BlinkStep
strb r1, [r0]

ands r1, #1
ite eq
ldreq r1, =GPIOx_BSRR_BS8
ldrne r1, =GPIOx_BSRR_BR8

ldr r0, =GPIOA
str r1, [r0, #GPIOx_BSRR]

2:
bx lr

.align 2
.type BlinkTable,%object
BlinkTable:
.byte 2, 2, 2, 2, 2, 2
.byte 5, 2, 5, 2, 5, 2
.byte 2, 2, 2, 2, 2
BlinkTableEnd:

.align 2

.type EnableClockGPIOA, %function
EnableClockGPIOA:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_IOPAEN)
str r0, [r1, #RCC_APB2ENR] @ Set IOPAEN bit in RCC_APB2ENR to 1 to enable GPIOA
bx lr @ Return to caller

.type EnableClockTIM1, %function
EnableClockTIM1:
ldr r1, =RCC
ldr r0, [r1, #RCC_APB2ENR]
orr r0, r0, #(1 << RCC_APB2ENR_TIM1EN)
str r0, [r1, #RCC_APB2ENR] @ Set TIM1EN bit in RCC_APB2ENR to 1 to enable TIM1
bx lr @ Return to caller
.ltorg

.type ConfigurePA8, %function
ConfigurePA8:
ldr r1, =GPIOA
ldr r0, [r1, #GPIOx_CRH]
and r0, #0xfffffff0
orr r0, #GPIOx_CRx_GP_PP_2MHz
str r0, [r1, #GPIOx_CRH] @ Set CNF8:MODE8 in GPIOA_CRH to 2
bx lr
.ltorg

@ Takes no parameters; uses the constants TimerPrescaler and TimerValue
.type StartTIM1, %function
StartTIM1:
ldr r0, =TIM1
ldr r1, =(1 << TIMx_CR1_URS)
str r1, [r0, #TIMx_CR1]

ldr r1, =TimerPrescaler
str r1, [r0, #TIMx_PSC]

ldr r1, =TimerValue
str r1, [r0, #TIMx_ARR]

ldr r1, =(1 << TIMx_DIER_UIE)
str r1, [r0, #TIMx_DIER]

ldr r1, =(1 << TIMx_EGR_UG)
str r1, [r0, #TIMx_EGR]

dsb

ldr r1, =(1 << TIMx_CR1_CEN)
str r1, [r0, #TIMx_CR1]

bx lr
.ltorg

@ r0 = IRQ Number
.type EnableIRQ, %function
EnableIRQ:
ldr r1, =NVIC_ISER0

movs r2, #1
and r3, r0, #0x1F
lsls r2, r2, r3

lsrs r3, r0, #5
lsls r3, r3, #2

str r2, [r1, r3]

bx lr
.ltorg
</source>

The corresponding stm32f103.inc file with the added definitions for the timer registers is:

<source lang=asm>
GPIOA = 0x40010800

GPIOx_CRH = 0x4
GPIOx_BSRR = 0x10
GPIOx_BSRR_BS8 = 0x100
GPIOx_BSRR_BR8 = 0x1000000

GPIOx_CRx_GP_PP_2MHz = 2

SCS = 0xe000e000
SCS_SYST_CSR = 0x10
SCS_SYST_RVR = 0x14
SCS_SYST_CVR = 0x18

RCC = 0x40021000
RCC_APB2ENR = 0x18
RCC_APB2ENR_IOPAEN = 2
RCC_APB2ENR_TIM1EN = 11

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

TIM1 = 0x40012C00

TIMx_CR1 = 0
TIMx_CR1_ARPE = 7
TIMx_CR1_URS = 2
TIMx_CR1_CEN = 0

TIMx_DIER = 0xC
TIMx_DIER_UIE = 0

TIMx_SR = 0x10
TIMx_SR_UIF = 0

TIMx_EGR = 0x14
TIMx_EGR_UG = 0

TIMx_PSC = 0x28
TIMx_ARR = 0x2C

TIM1_UP_IRQn = 25

NVIC_ISER0 = 0xE000E100
</source>

The source code enables the timer’s clock in the RCC before configuring it. The timer supports both a freely configurable prescaler for dividing the clock and a freely configurable maximum value, both of which are set by the StartTIM1 function. The TIMx_DIER_UIE bit is set to enable the interrupt for the so-called “update event”, which is triggered whenever the timer reaches the maximum value. A delicate sequence of register accesses is required to start the timer with the right configuration but without triggering the interrupt right away: To apply the modified settings immediately, the “TIMx_EGR_UG” bit is set to trigger an “artificial” update event. To prevent this from also triggering the interrupt, the “TIMx_CR1_URS” bit is set beforehand and cleared again afterwards. The timer is started by setting the “TIMx_CR1_CEN” bit at the end; since this write stores only the CEN bit, it also clears URS.

Before that final write, a “dsb” instruction is inserted. This “Data Synchronization Barrier” waits until all preceding write accesses have been completely processed - usually, the processor’s pipeline is working on several instructions at once. Because the timer configuration needs to be truly finished before starting the timer, this instruction is required. There are some other situations where the processor is too fast for a peripheral and needs to be temporarily halted by a “dsb”. If some peripheral-accessing code works in step-by-step mode while debugging, but not when executing normally, a well-placed “dsb” might help.

The ISR “TIM1_UP_IRQHandler” is used for the timer. It checks the “TIMx_SR_UIF” bit to verify that an update event has actually happened. In that case, the register is overwritten with the value 0xFFFFFFFE, i.e. all bits are written with “1” except the UIF bit. Writing ones has no effect on the bits in this register, and writing a zero clears the respective bit. Therefore, this write access clears the UIF bit but keeps the others. These interrupt flags must always be cleared as soon as possible in the ISR, or the peripheral might trigger the interrupt again immediately. The rest of the ISR stays the same.
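
The write behaviour of the status register can be illustrated with a simplified model in C. It only mimics the rule described above (a written zero clears a flag, a written one leaves it unchanged) and does not access any hardware; the function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define TIMx_SR_UIF 0  /* bit position, as in stm32f103.inc */

/* Simplified model of a write to TIMx_SR: flags written with 0 are
   cleared, flags written with 1 keep their current state. */
uint32_t sr_after_write(uint32_t sr, uint32_t written)
{
    return sr & written;
}

/* Model of what the ISR does to acknowledge the update event. */
uint32_t acknowledge_update(uint32_t sr)
{
    assert(sr & (UINT32_C(1) << TIMx_SR_UIF));  /* the ISR only gets here if UIF is set */
    return sr_after_write(sr, ~(UINT32_C(1) << TIMx_SR_UIF));  /* writes 0xFFFFFFFE */
}
```

In this model, a status value of 0x1F becomes 0x1E after the acknowledgement: only UIF is cleared. No read-modify-write sequence is needed, so other flags that the hardware sets in the meantime are not accidentally cleared.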
=== Analysis tools ===
When working at a low level directly with linker scripts and assembly code, it is frequently necessary to verify the translation output directly, as you can’t rely on a compiler doing it right automatically, and flashing the program each time to see whether it works isn’t the most efficient way. This was, in fact, important in creating the example code for this tutorial. The “binutils” package, which also contains the assembler and linker, offers a few tools that help with analyzing the assembler’s and linker’s output.
==== Disassembler ====
As the name implies, a disassembler is the opposite of an assembler - it turns binary machine code back into a (more or less) readable textual representation. If you feed an ELF file generated by the assembler or linker into the disassembler, it will read the header information to tell apart data (i.e. constants) and code, get names of symbols (and therefore, labels) and can even tell which instructions were generated from which assembly source file, if it was assembled with debug information (i.e. the “-g” flag was used). If you disassemble a binary flash image, the disassembler doesn’t have all that information; it will produce much less readable output and attempt to decode data bytes as instructions.

The disassembler from binutils is called “objdump”. Invoking it on the blinker looks like this:
<pre>
$ arm-none-eabi-objdump -d -s prog1.elf

prog1.elf: file format elf32-littlearm

Contents of section .VectorTable:
8000000 00040020 ed000008 ed010008 ed010008 ... ............
8000010 ed010008 ed010008 ed010008 00000000 ................
8000020 00000000 00000000 00000000 ed010008 ................
8000030 ed010008 00000000 ed010008 49010008 ............I...
8000040 ed010008 ed010008 ed010008 ed010008 ................
8000050 ed010008 ed010008 ed010008 ed010008 ................
8000060 ed010008 ed010008 ed010008 ed010008 ................
8000070 ed010008 ed010008 ed010008 ed010008 ................
8000080 ed010008 ed010008 ed010008 ed010008 ................
8000090 ed010008 ed010008 ed010008 ed010008 ................
80000a0 ed010008 ed010008 ed010008 ed010008 ................
80000b0 ed010008 ed010008 ed010008 ed010008 ................
80000c0 ed010008 ed010008 ed010008 ed010008 ................
80000d0 ed010008 ed010008 ed010008 ed010008 ................
80000e0 ed010008 ed010008 ed010008 ............
Contents of section .text:
80000ec 0f481049 104a03e0 52f8043b 40f8043b .H.I.J..R..;@..;
80000fc 8842f9d3 0d480e49 4ff00002 01e040f8 .B...H.IO.....@.
800010c 042b8842 fbd300f0 47f800f0 4bf84ff4 .+.B....G...K.O.
800011c 80710848 01600848 00f058f8 30bffde7 .q.H.`.H..X.0...
800012c 00040020 04040020 f0010008 04040020 ... ... .......
800013c 08040020 10080140 60e31600 4ff0e020 ... ...@`...O..
800014c 006910f4 803f1dd0 1a480178 b1f1110f .i...?...H.x....
800015c 18d21948 0378013b 1cbf0370 12e001f1 ...H.x.;...p....
800016c 0101b1f1 110f02d2 144a535c 03701148 .........JS\.p.H
800017c 017011f0 01010cbf 4ff48071 4ff08071 .p......O..qO..q
800018c 0f480160 70470202 02020202 05020502 .H.`pG..........
800019c 05020202 02020200 0a490868 40f00400 .........I.h@...
80001ac 08607047 08490868 20f00f00 40f00200 .`pG.I.h ...@...
80001bc 08607047 04040020 00040020 92010008 .`pG... ... ....
80001cc 10080140 18100240 04080140 4ff0e021 ...@...@...@O..!
80001dc 48614ff0 00008861 4ff00700 08617047 HaO....aO....apG
80001ec 00befde7 ....
Contents of section .data:
20000400 02000000 ....
Contents of section .ARM.attributes:
0000 41200000 00616561 62690001 16000000 A ...aeabi......
0010 05436f72 7465782d 4d330006 0a074d09 .Cortex-M3....M.
0020 02 .
Contents of section .debug_line:
0000 98000000 02001e00 00000201 fb0e0d00 ................
0010 01010101 00000001 00000100 70726f67 ............prog
0020 312e5300 00000000 000502ec 00000803 1.S.............
0030 15012121 22212f2f 21222121 30212f21 ..!!"!//!"!!0!/!
0040 222f302f 21232130 21036120 2f2f362f "/0/!#!0!.a //6/
0050 030c2e32 030a2e2f 212f2222 222f2221 ...2.../!/"""/"!
0060 21222121 222f2f22 21212321 222f212f !"!!"//"!!#!"/!/
0070 30212303 0d9e2121 2f212421 212f2f21 0!#...!!/!$!!//!
0080 03422035 030c2e03 0d2e0311 2e36030b .B 5.........6..
0090 2e30212f 222f2202 01000101 3b000000 .0!/"/".....;...
00a0 02002400 00000201 fb0e0d00 01010101 ..$.............
00b0 00000001 00000100 76656374 6f727461 ........vectorta
00c0 626c652e 53000000 00000005 02ec0100 ble.S...........
00d0 0803d000 01210201 000101 .....!.....
Contents of section .debug_info:
0000 22000000 02000000 00000401 00000000 "...............
0010 ec000008 ec010008 00000000 08000000 ................
0020 12000000 01802200 00000200 14000000 ......".........
0030 04019c00 0000ec01 0008f001 00082100 ..............!.
0040 00000800 00001200 00000180 ............
Contents of section .debug_abbrev:
0000 01110010 06110112 01030e1b 0e250e13 .............%..
0010 05000000 01110010 06110112 01030e1b ................
0020 0e250e13 05000000 .%......
Contents of section .debug_aranges:
0000 1c000000 02000000 00000400 00000000 ................
0010 ec000008 00010000 00000000 00000000 ................
0020 1c000000 02002600 00000400 00000000 ......&.........
0030 ec010008 04000000 00000000 00000000 ................
Contents of section .debug_str:
0000 70726f67 312e5300 2f746d70 2f746573 prog1.S./tmp/tes
0010 7400474e 55204153 20322e32 392e3531 t.GNU AS 2.29.51
0020 00766563 746f7274 61626c65 2e5300 .vectortable.S.

Disassembly of section .text:

080000ec <Reset_Handler>:
80000ec: 480f ldr r0, [pc, #60] ; (800012c <Reset_Handler+0x40>)
80000ee: 4910 ldr r1, [pc, #64] ; (8000130 <Reset_Handler+0x44>)
80000f0: 4a10 ldr r2, [pc, #64] ; (8000134 <Reset_Handler+0x48>)
80000f2: e003 b.n 80000fc <Reset_Handler+0x10>
80000f4: f852 3b04 ldr.w r3, [r2], #4
80000f8: f840 3b04 str.w r3, [r0], #4
80000fc: 4288 cmp r0, r1
80000fe: d3f9 bcc.n 80000f4 <Reset_Handler+0x8>
8000100: 480d ldr r0, [pc, #52] ; (8000138 <Reset_Handler+0x4c>)
8000102: 490e ldr r1, [pc, #56] ; (800013c <Reset_Handler+0x50>)
8000104: f04f 0200 mov.w r2, #0
8000108: e001 b.n 800010e <Reset_Handler+0x22>
800010a: f840 2b04 str.w r2, [r0], #4
800010e: 4288 cmp r0, r1
8000110: d3fb bcc.n 800010a <Reset_Handler+0x1e>
8000112: f000 f847 bl 80001a4 <EnableClockGPIOA>
8000116: f000 f84b bl 80001b0 <ConfigurePA8>
800011a: f44f 7180 mov.w r1, #256 ; 0x100
800011e: 4808 ldr r0, [pc, #32] ; (8000140 <Reset_Handler+0x54>)
8000120: 6001 str r1, [r0, #0]
8000122: 4808 ldr r0, [pc, #32] ; (8000144 <Reset_Handler+0x58>)
8000124: f000 f858 bl 80001d8 <StartSysTick>
8000128: bf30 wfi
800012a: e7fd b.n 8000128 <Reset_Handler+0x3c>
800012c: 20000400 .word 0x20000400
8000130: 20000404 .word 0x20000404
8000134: 080001f0 .word 0x080001f0
8000138: 20000404 .word 0x20000404
800013c: 20000408 .word 0x20000408
8000140: 40010810 .word 0x40010810
8000144: 0016e360 .word 0x0016e360

08000148 <SysTick_Handler>:
8000148: f04f 20e0 mov.w r0, #3758153728 ; 0xe000e000
800014c: 6900 ldr r0, [r0, #16]
800014e: f410 3f80 tst.w r0, #65536 ; 0x10000
8000152: d01d beq.n 8000190 <SysTick_Handler+0x48>
8000154: 481a ldr r0, [pc, #104] ; (80001c0 <ConfigurePA8+0x10>)
8000156: 7801 ldrb r1, [r0, #0]
8000158: f1b1 0f11 cmp.w r1, #17
800015c: d218 bcs.n 8000190 <SysTick_Handler+0x48>
800015e: 4819 ldr r0, [pc, #100] ; (80001c4 <ConfigurePA8+0x14>)
8000160: 7803 ldrb r3, [r0, #0]
8000162: 3b01 subs r3, #1
8000164: bf1c itt ne
8000166: 7003 strbne r3, [r0, #0]
8000168: e012 bne.n 8000190 <SysTick_Handler+0x48>
800016a: f101 0101 add.w r1, r1, #1
800016e: f1b1 0f11 cmp.w r1, #17
8000172: d202 bcs.n 800017a <SysTick_Handler+0x32>
8000174: 4a14 ldr r2, [pc, #80] ; (80001c8 <ConfigurePA8+0x18>)
8000176: 5c53 ldrb r3, [r2, r1]
8000178: 7003 strb r3, [r0, #0]
800017a: 4811 ldr r0, [pc, #68] ; (80001c0 <ConfigurePA8+0x10>)
800017c: 7001 strb r1, [r0, #0]
800017e: f011 0101 ands.w r1, r1, #1
8000182: bf0c ite eq
8000184: f44f 7180 moveq.w r1, #256 ; 0x100
8000188: f04f 7180 movne.w r1, #16777216 ; 0x1000000
800018c: 480f ldr r0, [pc, #60] ; (80001cc <ConfigurePA8+0x1c>)
800018e: 6001 str r1, [r0, #0]
8000190: 4770 bx lr

08000192 <BlinkTable>:
8000192: 0202 0202 0202 0205 0205 0205 0202 0202 ................
80001a2: .

080001a3 <BlinkTableEnd>:
...

080001a4 <EnableClockGPIOA>:
80001a4: 490a ldr r1, [pc, #40] ; (80001d0 <ConfigurePA8+0x20>)
80001a6: 6808 ldr r0, [r1, #0]
80001a8: f040 0004 orr.w r0, r0, #4
80001ac: 6008 str r0, [r1, #0]
80001ae: 4770 bx lr

080001b0 <ConfigurePA8>:
80001b0: 4908 ldr r1, [pc, #32] ; (80001d4 <ConfigurePA8+0x24>)
80001b2: 6808 ldr r0, [r1, #0]
80001b4: f020 000f bic.w r0, r0, #15
80001b8: f040 0002 orr.w r0, r0, #2
80001bc: 6008 str r0, [r1, #0]
80001be: 4770 bx lr
80001c0: 20000404 .word 0x20000404
80001c4: 20000400 .word 0x20000400
80001c8: 08000192 .word 0x08000192
80001cc: 40010810 .word 0x40010810
80001d0: 40021018 .word 0x40021018
80001d4: 40010804 .word 0x40010804

080001d8 <StartSysTick>:
80001d8: f04f 21e0 mov.w r1, #3758153728 ; 0xe000e000
80001dc: 6148 str r0, [r1, #20]
80001de: f04f 0000 mov.w r0, #0
80001e2: 6188 str r0, [r1, #24]
80001e4: f04f 0007 mov.w r0, #7
80001e8: 6108 str r0, [r1, #16]
80001ea: 4770 bx lr

080001ec <Default_Handler>:
80001ec: be00 bkpt 0x0000
80001ee: e7fd b.n 80001ec <Default_Handler>
</pre>

This is a lot of information. The “-d” flag tells objdump to disassemble code sections, and the “-s” flag lets it output the raw contents of all sections. At first, it prints the contents of “.VectorTable”. Each line is prefixed with the address of where this data is found in memory. Then, the 32-bit data blocks from the vector table are output. The disassembler prints the bytes in the order they appear in memory, which, since the Cortex-M3 uses little endian, is reversed - for example, the printed “ed000008” actually refers to the address “0x080000ed”, which is the address of the “Reset_Handler” (0x080000ec) with the lowest bit set to one, as it is a Thumb function. The very first entry, “00040020”, is the initial stack pointer 0x20000400. Most of the remaining entries hold the address of the default handler, 0x080001ec (stored as 0x080001ed, again with the lowest bit set), except for the zero entries and the SysTick_Handler. The contents of the “.text” section are the hexadecimal representation of the machine code and hardly readable. The “.data” section contains a single “two” - this is the “02” put into “TimerEvents”. The contents of “.ARM.attributes” and the various “.debug” sections are not very interesting, as they do not end up on the controller and are only read by the various analysis tools to provide nicer output.
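
Decoding such a vector table entry by hand can be sketched in C as follows. The helper names are made up for this illustration, and the code is meant to be run on a PC, e.g. on bytes taken from a hex dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit word from four bytes in memory order (little endian). */
uint32_t read_le32(const uint8_t *bytes)
{
    assert(bytes != NULL);
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}

/* A vector table entry has bit 0 set to mark Thumb code; clearing it
   gives the address of the handler's first instruction. */
uint32_t handler_address(uint32_t vector_entry)
{
    return vector_entry & ~UINT32_C(1);
}
```

Applied to the bytes “ed 00 00 08”, “read_le32” returns 0x080000ed and “handler_address” turns that into 0x080000ec. The bytes “49 01 00 08” decode to 0x08000149, i.e. the handler at 0x08000148.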

After the section contents comes the actual disassembly. This is a list of all the instructions in the code section. The list is grouped by the symbols found in the input file. For C code, each symbol usually matches one function, so each block in the disassembly represents one C function. In assembly code, if you put non-local labels into a function, that function will be split into multiple blocks by the disassembler, making it harder to read - the main reason for using local labels. Each instruction is translated into one line inside the blocks. The first column is the address where that instruction is found. The next column contains the hexadecimal representation of the 2 or 4 bytes that make up the machine code of that instruction, i.e. the actual content of flash memory. After that comes a textual representation of that instruction, as inferred by the disassembler. If the instruction contains some number, the disassembler sometimes outputs a semicolon followed by some interpretation of that number. If the instruction employs PC-relative addressing, that interpretation will be the absolute address. As many instructions have multiple spellings, there can be discrepancies between the original code and the disassembly. The disassembler will also output data, such as the “BlinkTable” and the literal pools, as such. Using the “.type” directive is helpful in that case so that the disassembler does not attempt to interpret the data bytes as code.
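
How the absolute address behind a PC-relative “ldr” comes about can be reproduced with a few lines of C. This is a sketch with a made-up function name; it relies on the rule that, for this kind of load in Thumb code, the PC reads as the address of the instruction plus 4, rounded down to a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a Thumb "ldr rX, [pc, #offset]" located at insn_address. */
uint32_t literal_address(uint32_t insn_address, uint32_t offset)
{
    assert((insn_address & 1) == 0);   /* Thumb instructions are halfword-aligned */
    uint32_t pc = (insn_address + 4) & ~UINT32_C(3);
    return pc + offset;
}
```

For the first instruction of the “Reset_Handler” at 0x080000ec with offset 60, the result is 0x0800012c; for the second one at 0x080000ee with offset 64, it is 0x08000130. Both match the addresses printed after the semicolon in the disassembly.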

objdump can also be used to disassemble raw binary files, such as those obtained by reading back the flash memory of some controller. To do this, use this command line:

<source lang=sh>
$ arm-none-eabi-objdump -b binary -m arm -D prog1.bin -Mforce-thumb --adjust-vma=0x08000000
</source>

The address of the binary in flash memory is specified so that the printed instruction addresses are correct. However, as the disassembler can’t tell data and code apart, the result will be of limited use. If you have to analyze a binary without having an ELF file or the source code, a more sophisticated disassembler such as IDA Pro is helpful. If you have the code and only need the disassembler to identify potential problems with the project (esp. the linker script), objdump is usually sufficient.
==== readelf ====
The “readelf” program is a powerful utility that can read and output various information from ELF files. The most useful option is the “-S” flag, which lets readelf print a summary of the sections in the respective file, e.g.:

<pre>
$ arm-none-eabi-readelf -S prog1.elf
There are 15 section headers, starting at offset 0x11268:

Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .VectorTable PROGBITS 08000000 010000 0000ec 00 A 0 0 1
[ 2] .text PROGBITS 080000ec 0100ec 000104 00 AX 0 0 4
[ 3] .stack NOBITS 20000000 020000 000400 00 WA 0 0 1
[ 4] .data PROGBITS 20000400 010400 000004 00 WA 0 0 1
[ 5] .bss NOBITS 20000404 010404 000004 00 WA 0 0 1
[ 6] .ARM.attributes ARM_ATTRIBUTES 00000000 010404 000021 00 0 0 1
[ 7] .debug_line PROGBITS 00000000 010425 0000db 00 0 0 1
[ 8] .debug_info PROGBITS 00000000 010500 00004c 00 0 0 1
[ 9] .debug_abbrev PROGBITS 00000000 01054c 000028 00 0 0 1
[10] .debug_aranges PROGBITS 00000000 010578 000040 00 0 0 8
[11] .debug_str PROGBITS 00000000 0105b8 00002f 01 MS 0 0 1
[12] .symtab SYMTAB 00000000 0105e8 0006a0 10 13 45 4
[13] .strtab STRTAB 00000000 010c88 000550 00 0 0 1
[14] .shstrtab STRTAB 00000000 0111d8 000090 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
y (purecode), p (processor specific)
</pre>

For each section, one line is output. The sections “.strtab”, “.shstrtab”, “.symtab” and “NULL” are an integral part of ELF and always present. The “.debug” sections are present if the source was assembled with the “-g” flag. The “.ARM.attributes” section defines for which ARM processor the contained code was translated. These sections don’t end up on the microcontroller. The remaining sections were defined in the linker script: “.VectorTable” contains the addresses of the exception handlers, “.text” contains the program code and constant data for flash memory, “.stack” the stack in RAM, “.data” contains variables in RAM and “.bss” contains zero-initialized variables in RAM. For these sections, the column “Type” contains either “PROGBITS” or “NOBITS”, which tells you whether the section in the ELF file actually contains some data - this is only the case for “.VectorTable”, “.text” and “.data”. The sections “.bss” and “.stack” only reserve memory that is written at runtime, but the ELF file doesn’t contain data to be written into these sections. The column “Addr” defines where this section begins in the address space. The most useful column is “Size”: If you sum up the sizes of the sections “.VectorTable”, “.text” and “.data”, you obtain the amount of flash memory used. By summing up “.data”, “.stack” and “.bss”, you get the amount of RAM used. Note that “.data” is counted twice, as the initialization data is stored in flash.
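
As a worked example, the sums for the output above can be computed with a short C snippet. It is just a sketch for a PC; the identifiers are made up, and the sizes were copied by hand from the “Size” column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum up section sizes as read from the "Size" column of "readelf -S". */
uint32_t total_size(const uint32_t *sizes, size_t count)
{
    assert(sizes != NULL || count == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += sizes[i];
    return sum;
}

/* Values copied from the readelf output shown above. */
const uint32_t flash_sections[3] = { 0xec, 0x104, 0x4 };  /* .VectorTable, .text, .data */
const uint32_t ram_sections[3]   = { 0x400, 0x4, 0x4 };   /* .stack, .data, .bss */
```

Here, “total_size(flash_sections, 3)” gives 500 bytes of flash and “total_size(ram_sections, 3)” gives 1032 bytes of RAM, most of which is the 1024-byte stack.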
==== nm ====
The “nm” utility prints the symbols defined in an ELF file, for example:

<pre>
$ arm-none-eabi-nm prog1.elf
080001ec W ADC1_2_IRQHandler
20000404 b BlinkStep
08000192 t BlinkTable
080001a3 t BlinkTableEnd
20000408 B _BssEnd
20000404 B _BssStart
…
</pre>

This can be helpful in analyzing errors in linker scripts where symbols might get assigned wrong addresses.
==== addr2line ====
The “addr2line” utility reads the debug information from an ELF file to determine which line in which source file produced the instruction found at a particular given address. For example:

<source lang=sh>
$ arm-none-eabi-addr2line 0x080000f0 -e prog1.elf
/tmp/test/prog1.S:24
</source>

Here, line 24 of “prog1.S” contains the assembler command that produced the instruction that ends up at address 0x080000f0.

==== objcopy ====
The “objcopy” utility allows you to translate program files between different formats. It is useful to convert the ELF files to both the Intel Hex format and a simple binary representation. For example,

<source lang=sh>
arm-none-eabi-objcopy -O ihex prog1.elf prog1.hex
</source>

produces a “.hex” file that contains an image of the flash contents in hexadecimal form. With

<source lang=sh>
arm-none-eabi-objcopy -O binary prog1.elf prog1.bin
</source>

a binary file is created which contains an exact 1:1 image of the flash contents. Some flashing tools require these formats instead of ELF, and viewing the binary file with a hex editor can be interesting as well.

=== Interfacing C and C++ code ===
Since assembly is rarely used to implement entire complex projects, but mostly for a few time-critical or especially low-level routines that are part of larger code bases written in a high-level language, interfacing C and assembly code is an important topic, which will be covered here. While it is possible to write the main project structure in assembly and integrate some C modules, it is usually done the other way round. Most of the code shown so far is already suitable for inclusion in C programs. Most of this topic works the same way for C++, apart from C++ exceptions (not to be confused with ARM processor exceptions) - but these are rarely used on embedded targets anyway.

If you compile C, C++ and assembly code into individual .o object files, you can link these together using “ld” as before. However, C and C++ code usually requires access to the respective standard library, and “ld” doesn’t link these by default - therefore it is necessary to replace the call to “ld” with a call to “gcc” or “g++” for C or C++, respectively. This will call “ld” internally and pass the required libraries.
==== Environment setup for C and C++ ====
Many C projects use a reset handler and vector table implemented in assembly, although writing them in C is possible too. As required by the C standard, C programs start with the “main()” function, so the (assembly) reset handler should set up the environment such that it is ready for C, and then call “main”. The C code might then later call some assembly functions or inline assembly. When using C++ code, or some GCC extensions for C code, some additional functions have to be called before “main”. C++ uses this mechanism to call the constructors of global objects. The C and C++ compilers emit a table of function pointers to functions that should be called at startup. This table has to be put into flash memory by modifying the linker script as follows:

<source>
.text : {
*(.text)
. = ALIGN(4);

_InitArrayStart = .;
*(SORT(.preinit_array*))
*(SORT(.init_array*))
_InitArrayEnd = .;
} >FLASH
</source>

The table of function pointers is sorted to keep the order needed by the compiler. The symbols “_InitArrayStart” and “_InitArrayEnd” mark beginning and end of that table. A reset handler that performs the memory initialization as before and calls the table of initialization functions could look like this:

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text
.type Reset_Handler, %function
.global Reset_Handler
Reset_Handler:
ldr r0, =_DataStart
ldr r1, =_DataEnd
ldr r2, =_DataLoad

b 2f
1: ldr r3, [r2], #4
str r3, [r0], #4
2: cmp r0, r1
blo 1b

ldr r0, =_BssStart
ldr r1, =_BssEnd
ldr r2, =0

b 2f
1: str r2, [r0], #4
2: cmp r0, r1
blo 1b

ldr r4, =_InitArrayStart
ldr r5, =_InitArrayEnd

b 2f
1: ldr r0, [r4], #4
blx r0
2: cmp r4, r5
blo 1b

bl main
1: bkpt
b 1b
.ltorg
</source>

Note that registers r4 and r5 are used for iterating the table, since the calling convention requires called functions to preserve them. The “blx” instruction is needed to perform the indirect function call. When everything is set up, the main function is called. For embedded programs, the main function should never return (i.e. it should contain an endless loop). If it does return, that’s an error, and to make it easier to find, an endless loop with a forced breakpoint is put right after the call to “main”.
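
As a minimal sketch of how an entry ends up in that table, the following fragment registers a start-up function by hand, in the same way the compiler does for constructors of global objects. The names “EarlyInit” and “InitDone” are made up for this illustration and are not part of the example repository; the fragment assumes the linker script and reset handler shown above.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

@ Hypothetical start-up function. It is called by the reset handler via "blx",
@ so it has to follow the calling convention like any other function.
.text
.type EarlyInit, %function
EarlyInit:
ldr r0, =InitDone
movs r1, #1
str r1, [r0]          @ InitDone = 1
bx lr
.ltorg

@ Place a pointer to the function into the table between
@ _InitArrayStart and _InitArrayEnd.
.section .init_array, "aw", %init_array
.balign 4
.word EarlyInit

@ Hypothetical flag variable. It is zeroed by the reset handler
@ before the table is processed.
.bss
.balign 4
InitDone:
.space 4
</source>

Since the reset handler clears the “.bss” section before it walks the table, the flag should read 1 by the time “main” is entered.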
==== Calling functions ====
To call assembly functions from C code and vice versa, the assembly functions should observe the calling convention, as mentioned before. C functions can be called from assembly code just like assembly functions, by placing the parameters in registers r0-r3 and on the stack, calling the function using “bl” and retrieving the return value from r0. To call an assembly function from C code, you need to declare it in C first, just like a C function. For example, to call a function that takes 2 integer arguments and returns an integer:

<source lang=c>
int AssemblyFunction (int a, int b);
</source>

If you now define a function named “AssemblyFunction” in your assembly code and export it via “.global”, you can call it from C code just like any function.
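
A matching definition on the assembly side could look like the following sketch. The function body (adding the two arguments) and the names “CallCFunction” and “CFunction” are assumptions chosen for illustration; “CFunction” stands for any C function declared as “int CFunction (int a, int b);”.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

.text

@ int AssemblyFunction (int a, int b);
@ a arrives in r0, b in r1, the result is returned in r0.
.type AssemblyFunction, %function
.global AssemblyFunction
AssemblyFunction:
add r0, r0, r1        @ r0 = a + b
bx lr

@ Opposite direction: call the (hypothetical) C function CFunction (2, 3).
.type CallCFunction, %function
.global CallCFunction
CallCFunction:
push {r4, lr}         @ Save lr; r4 is pushed too to keep the stack 8-byte aligned
movs r0, #2           @ First argument
movs r1, #3           @ Second argument
bl CFunction          @ Return value is in r0 afterwards
pop {r4, pc}          @ Return to the caller, passing r0 on as the result
</source>

From C, “AssemblyFunction (1, 2)” would then be expected to return 3.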
==== Accessing global variables ====
Global variables defined in C can be accessed from assembly code just like variables defined in assembly code, by using the variable’s name. To access an assembly variable from C code, you need to declare it first by specifying the type. For example, to declare an integer variable:

<source lang=c>
extern int AssemblyVariable;
</source>

If you now define a variable named “AssemblyVariable” in your assembly code and export it via “.global”, you can access it from C code just like any variable. The “extern” is required to make this a declaration only; without it, the C code would define another variable of the same name.
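
The assembly side of both directions could be sketched as follows. The initial value 42 and the names “CVariable” and “IncrementCVariable” are assumptions for this illustration; “CVariable” stands for a global variable defined in C as “int CVariable;”.

<source lang=asm>
.syntax unified
.cpu cortex-m3
.thumb

@ Definition of the variable declared in C as "extern int AssemblyVariable;"
.data
.balign 4
.type AssemblyVariable, %object
.global AssemblyVariable
AssemblyVariable:
.word 42              @ Initial value, copied to RAM by the reset handler
.size AssemblyVariable, . - AssemblyVariable

@ Access a (hypothetical) global variable defined in C as "int CVariable;"
.text
.type IncrementCVariable, %function
.global IncrementCVariable
IncrementCVariable:
ldr r0, =CVariable    @ Load the address of the C variable
ldr r1, [r0]          @ Read the current value
adds r1, #1
str r1, [r0]          @ Write back the incremented value
bx lr
.ltorg
</source>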
=== Clock configuration ===
By default, STM32 controllers use an internal 8 MHz RC oscillator as the clock source for the core and peripherals. This oscillator is too inaccurate for implementing a clock or for using serial interfaces such as UART, USB or CAN. To obtain a more accurate clock, an external quartz crystal is usually used. Many STM32 boards feature an 8 MHz crystal. To use it, some initialization code is required that activates the microcontroller’s built-in crystal oscillator circuit and switches the clock input to it. The STM32 controllers also include a PLL which can multiply an input clock by a configurable factor before feeding it to the processor core and peripherals. This way, a precise and fast clock can be achieved: the STM32F103 supports a core frequency of up to 72 MHz. Unfortunately, the flash memory cannot keep up with such a high frequency. Therefore, when enabling a fast clock, the flash memory needs to be configured to use wait states depending on the frequency.

The following function configures the flash wait states, enables the crystal oscillator, configures the PLL to multiply the input clock by a factor of 9, and selects the PLL output as the system clock. The prescaler for the internal bus APB1 is set to 2. Assuming an 8 MHz crystal, this achieves the maximum performance possible with this microcontroller: 8 MHz × 9 = 72 MHz for the core and the APB2 domain, and 72 MHz / 2 = 36 MHz for APB1. If a different crystal is used, the PLL factors have to be adjusted.

<source lang=asm>
RCC = 0x40021000

RCC_CR = 0x0
RCC_CR_PLLRDY = 25
RCC_CR_PLLON = 24
RCC_CR_HSERDY = 17
RCC_CR_HSEON = 16
RCC_CR_HSION = 0

RCC_CFGR = 0x04
RCC_CFGR_PLLMUL = 18
RCC_CFGR_USBPRE = 22
RCC_CFGR_PLLXTPRE = 17
RCC_CFGR_PLLSRC = 16
RCC_CFGR_PPRE2 = 11
RCC_CFGR_PPRE1 = 8
RCC_CFGR_HPRE = 4
RCC_CFGR_SWS = 2
RCC_CFGR_SW = 0

FLASH=0x40022000
FLASH_ACR=0
FLASH_ACR_PRFTBE = 4
FLASH_ACR_HLFCYA = 3
FLASH_ACR_LATENCY = 0

.type ConfigureSysClock, %function
.global ConfigureSysClock
ConfigureSysClock:
@ Turn on HSE
ldr r0, =RCC
ldr r1, =((1 << RCC_CR_HSION)|(1 << RCC_CR_HSEON))
str r1, [r0, #RCC_CR]

@ Configure (but not start yet) PLL
@ Mul = 9, Prediv = 1, APB1 Prescaler = 2, APB2 Prescaler = 1, AHB Prescaler = 1
ldr r2, =(((9-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_USBPRE)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
orr r1, #(1 << RCC_CR_PLLON)

@ Wait for HSE ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_HSERDY)
beq 1b

@ Turn on PLL
str r1, [r0, #RCC_CR]

@ Pre-Calculate value for RCC_CFGR
orr r2, #(2 << RCC_CFGR_SW)

@ Wait for PLL ready
1: ldr r3, [r0, #RCC_CR]
ands r3, #(1 << RCC_CR_PLLRDY)
beq 1b

@ Set flash wait states to 2
ldr r0, =FLASH
ldr r3, =((1<<FLASH_ACR_PRFTBE)|(2<<FLASH_ACR_LATENCY))
str r3, [r0, #FLASH_ACR]
ldr r0, =RCC

@ Switch system clock to PLL
str r2, [r0, #RCC_CFGR]

@ Pre-Calculate value for RCC_CR
bic r1, #(1 << RCC_CR_HSION)

@ Wait for switch to PLL
1: ldr r3, [r0, #RCC_CFGR]
and r3, #(3 << RCC_CFGR_SWS)
cmp r3, #(2 << RCC_CFGR_SWS)
bne 1b

@ Turn off HSI to save power
str r1, [r0, #RCC_CR]

bx lr
.ltorg
</source>
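
As a hedged illustration of adjusting the PLL factor for a different crystal, the following fragment derives the “RCC_CFGR” value from two assembler constants and lets the assembler reject invalid settings. It assumes a 12 MHz crystal (12 MHz × 6 = 72 MHz) and relies on the “RCC_CFGR_…” bit positions defined in the listing above. The names “HSE_FREQ”, “PLL_FACTOR”, “SYSCLK_FREQ” and “RCC_CFGR_VALUE” are introduced here only for this sketch.

<source lang=asm>
HSE_FREQ = 12000000                     @ Assumed crystal frequency in Hz
PLL_FACTOR = 6                          @ Assumed PLL multiplier (2 to 16)
SYSCLK_FREQ = HSE_FREQ * PLL_FACTOR     @ Resulting system clock: 72 MHz

@ Let the assembler reject settings that the STM32F103 does not support
.if PLL_FACTOR < 2
.error "PLL factor must be at least 2"
.endif
.if PLL_FACTOR > 16
.error "PLL factor must be at most 16"
.endif
.if SYSCLK_FREQ > 72000000
.error "System clock exceeds 72 MHz"
.endif

@ Same bits as in ConfigureSysClock, but with a configurable multiplier.
@ The PLLMUL field holds the factor minus 2.
RCC_CFGR_VALUE = (((PLL_FACTOR-2)<<RCC_CFGR_PLLMUL)|(1 << RCC_CFGR_USBPRE)|(1 << RCC_CFGR_PLLSRC)|(4 << RCC_CFGR_PPRE1))
</source>

In “ConfigureSysClock”, the value would then be loaded with “ldr r2, =RCC_CFGR_VALUE”. Because the resulting system clock is again 72 MHz, the two flash wait states and the APB1 prescaler of 2 can stay as they are; for a lower system clock they would have to be reconsidered.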

Many projects perform the clock configuration in the reset handler before calling the main function. If you want to follow that practice, place a “bl ConfigureSysClock” as the first instruction in the “Reset_Handler”. This way, all of the setup runs at the higher clock frequency, making start-up faster. This and the completed startup code from the previous chapters are implemented in the “startup.S” file in the example repository. If you use it, put your code in the “main” function, where RAM and system clock will already be initialized. This is shown in the “BlinkStartup” example.

=== Project template & makefile ===
To quickly start your own project, a project template is supplied in the examples repository under the directory [https://github.com/Erlkoenig90/ArmAsmTutorial/tree/master/ProjectTemplate-STM32F103RB ProjectTemplate-STM32F103RB]. Put your own application code in the program.S file. The startup.S and vectortable.S contain the reset handler with RAM initialization and the vector table with default handler, respectively. A linker script is included too.

The project also contains a makefile. This allows you to quickly translate your project without having to type the assembler and linker commands. Simply type
<source lang=sh>
make
</source>
to translate the code and produce the program.elf, program.bin and program.hex files. All “.S” files in the directory are translated automatically. Writing makefiles is a complex topic of its own with a lot of information already available on the web, so it is not explained further here.

PHILIPS VP5500 VoIP Telefon

2013-07-24T13:13:37Z

Nicolas: Reverted revision 77693 by 46.105.114.75 (talk).

= Related articles =
* Configuration: http://www.mikrocontroller.net/topic/170483#new
* Kernel/GUI development: http://www.mikrocontroller.net/topic/172616#new
* http://das-labor.org/wiki/VP5500
* (Hack) http://spritesmods.com/?art=vpx500
* http://vp6500.bd8.nl/

= General =

The devices described here are video-capable VoIP phones made by Philips. They were sold by the former Dutch provider KPN and have become available cheaply since KPN left this market.
Their low price, the SoC architecture and the OS running on it make them attractive to anyone who wants to realize their own ideas and projects on such a device, and give the often brand-new units a kind of second chance.

== Features ==

=== VP5500 ===
* Camera resolution: 640x480 pixels
* 30 frames per second
* Camera rotatable by 240°
* 5.6 cm (2.2") TFT display, 176x220 pixels, 65,000 colors
* Audio/video output: 4-pole 2.5 mm jack to RCA
* Integrated Li-ion battery, 3.7 V / 1100 mAh
* English and Dutch menu languages
* Software: Qtopia version 2.1.0
* Dimensions (LxWxH): 134x49x24 mm

[[Datei:Philips_VP5500.jpg]]

=== VP6500 ===

* Camera resolution: 640x480 pixels
* 30 frames per second
* Camera rotatable by 240°
* 5.6 cm (2.2") TFT display, 176x220 pixels, 65,000 colors
* Audio/video output: 4-pole 3.5 mm jack to RCA
* 2x Philips Multilife AA/R6NM 1800 mAh NiMH rechargeable batteries, 1.2 V each, recommended charging: 15 h at 180 mA
* English, German and Dutch menu languages
* Software: Qtopia version 2.1.0
* Dimensions (LxWxH): 134x49x24 mm
* Handset weight: 170 g (including 2 AA NiMH batteries)
[[Bild:VP6500_kl.jpg]]

= Rooting the VPx500 =

There are several ways to gain full access to the system on the device.

== Via telnet ==
As soon as the phone is registered on the network (whether via DHCP or with a static address), it can be accessed from any terminal on the network via telnet using the phone's IP address.
User: root / password: toor

== per serieller Schnittstelle ==
This requires connecting the [[#UART]]. During boot, keep pressing random keys until the prompt of the [http://www.lartmaker.nl/lartware/blob/ blob bootloader] appears.
Then enter
boot root=/dev/mtdblock2 init=/bin/sh
to get a minimal shell.

In this shell, set the password for ''root'' with
passwd
Afterwards you can log in as ''root'' with the password you set.

== Via DNS hack ==
To enable root access, the phone is tricked into performing a software update. It tries to contact KPN's update server (which no longer exists). Fortunately, it does so via a hostname rather than a fixed IP address, so a DNS server of your own can redirect the domain name to a server of your own. This requires redirecting DNS in your own network (or a DHCP server that announces the address of the DNS server; keywords: 'Static DNS', ..).
The device then fetches an update package that enables telnet access.

The device's command line can then be opened from a terminal (Linux. Windows: ....):

telnet 123.456.789.012

The password is "toor".
The root password is changed with the passwd command.

The network and VoIP settings can be found under Applications>Registration.

=== Instructions for Fritz-Box users ===
So far, the Fritzboxes unfortunately offer no way to change the DNS server in use directly in the web interface, but there is a workaround:
* Save the settings via the Fritzbox web interface
* Open the export file in a text editor (preferably not Notepad, since it does not understand Unix line endings; Wordpad works in a pinch)
* Search for overwrite_dns1 (it occurs twice) and enter the DNS server 84.38.68.30 (provided by blooza http://www.mikrocontroller.net/topic/170483#1651124) or 188.40.123.50 (provided by sprites http://spritesmods.com/?art=vpx500)
* At the beginning of the export file, BEFORE "**** CFGFILE:ar7.cfg", insert a line 'NoChecks = yes' so that the Fritzbox ignores the checksum, which no longer matches.
* Load the settings back into the Fritzbox.

Once the phones are unlocked, remove the DNS server again. (It may be a good idea to enter two independent DNS servers here; this also bypasses DNS blocks imposed by the provider.)

==== Alternative ====
To avoid the not entirely risk-free modification of the Fritzbox configuration, switch the Fritzbox to expert mode. On many models, the internal DHCP server can then be disabled under System - Network, in the item "IP adresse der Fritzbox ändern" (change the IP address of the Fritzbox). Then, on Windows for example, start tftpd32 (http://tftpd32.jounin.net/), enter the Fritzbox as the gateway and the appropriate "modding DNS" in its DHCP server, and restart the phone.

If the phone has been registered before, it tries to force the same IP address from the new DHCP server, which tftpd32 does not handle well. Therefore, let the lease IP range in tftpd32 start at 2 and enter a range of 250, so that the IP address requested by the phone is part of the DHCP server's address pool.

=== Instructions for fli4l ===
With [http://www.fli4l.de fli4l], the whole thing is very easy.

In the file config/dns_dhcp.txt, configure the DNS redirect for the two DNS names.

<file>
DNS_REDIRECT_N='2' # number of redirected domains
DNS_REDIRECT_1='ntp.xs4all.nl' # 1st redirected domain
DNS_REDIRECT_1_IP='188.40.123.50' # IP of redirected domain

DNS_REDIRECT_2='vpcm-001.cust.kpn.net'
DNS_REDIRECT_2_IP='188.40.123.50'
</file>

= Hardware =

== Components used ==

* MCU: [http://www.mikrocontroller.net/attachment/72258/datasheet.pdf Freescale MC9328MX21] (ARM9) @ 266 MHz ([http://en.wikipedia.org/wiki/I.MX21 English Wikipedia])
** belongs to the ARM9E family: ARMv5TEJ instruction set (ARM926EJ-S)
** [http://www.freescale.com/files/32bit/doc/ref_manual/MC9328MX21RM.pdf MC9328MX21 Applications Processor Reference Manual]
* PC-to-TV converter chip: [http://www.mikrocontroller.net/attachment/73579/Data_Sheets.zip FS455LF]
* WLAN: Marvell 88w8385, as module [http://www.mikrocontroller.net/attachment/72802/WM-G-MR-01-v27__01192006.pdf WM-G-MR-01] (VP5500) / Philips [http://www.mikrocontroller.net/attachment/73812/BGW211_Preliminary_Datasheet_v1.1.pdf BGW211], on-board (VP6500)
* SDRAM: 2 x [http://www.mikrocontroller.net/attachment/72461/K4S56163LF.pdf K4S56163LF] - 4M x 16 bit x 4 banks in the VP5500
* SDRAM: 1 x [http://www.mikrocontroller.net/attachment/73520/Samsung_64MB_K4M51323PC_1_8V.pdf K4M51323PC_1_8V] - 4M x 32 bit x 4 banks in the VP6500
* Flash: 2 x ws128j0pbfw00 [http://www.mikrocontroller.net/attachment/72462/S29WS064J.pdf S29WS128J/064J] 128/64 Megabit (8/4 M x 16-Bit) CMOS 1.8 Volt-only Simultaneous Read/Write, Burst Mode Flash Memory in the VP5500
* Flash: 1 x [http://www.mikrocontroller.net/attachment/73521/S29GL512N.pdf S29GL512N] - 512 Megabit, 3.0 Volt-only Page Mode Flash Memory in the VP6500
* Camera: dc-4626.a5 by chicony
* Display: Samsung LTS220QC (HD66772 controller)

== Measurements ==

=== Current consumption ===

==== VP5500 ====
FIXME

==== VP6500 ====

Setup: VP6500 with the serial console enabled, powered from a bench power supply at 3.67 V. (At lower voltages it apparently fails to boot properly because of current spikes, reports 'battery low' on the serial console and then switches itself off. This voltage should still be just about safe, assuming that a step-up converter is at work inside and that a small drop across the Schottky diode has to be taken into account.)

===== Phone =====
* Booting: up to 420 mA
* Registering on the network: approx. 400 mA
* Display off, but registered and running (idle): approx. 80 mA
* Display on, registered and idle: approx. 190 mA
* At 100% CPU load (run from the serial console: "while true; do true; done"): approx. 60 mA more than idle (there also seem to be larger jumps for other kinds of CPU load, though)
* In 'deep sleep' (reached a few minutes after power-on): <10 mA (!), with short wake phases drawing considerable current (>100 mA). Note: the serial console is also inactive in this mode, but the phone has not crashed (a key press reactivates the console).
* Phone call: approx. 500 mA
* Phone call with webcam: approx. 550 mA

===== Charger =====
* Phone not in the charging cradle: < 0.2 W
* Phone charging: approx. 4 W

== System clocks ==
# cat /proc/systclk
System clocks state:
Ref clock : 32768Hz (int, premult by 512)
MPLL clock: 263999905Hz (computed: 264000000Hz)
SPLL clock: 163840000Hz (computed: 163840000Hz)
CPU clock : 263999905Hz (PERSC = 0)
Bus clock : 132000000Hz (BCLKDIV = 1, computed: 131999952Hz)
CSI clock : unknown (cannot read register CSICR1)
USB clock : 20480000Hz (USB_DIV = 7)
Wait State: CS0U[WSC] = 10, CS0U[WSC] = 48
loops_per_jiffy = 665058

== Internals ==

=== Disassembling the VP5500 ===
<gallery>
Datei:01_offen_von_Hinten.JPG|1. Opened battery compartment
Datei:02_Blende_entfernt.JPG|2. Rear black cover removed
Datei:03_Rückseite_entfernt.JPG|3. Rear shell removed
Datei:04_aufhebeln_Vorderteil.JPG|4. Pressure outwards
Datei:05_vorderteil_entfernt.JPG|5. Front part removed
Datei:06_Elektronik_entfernt.JPG|6. Electronics removed
</gallery>

# Open the battery compartment and remove the screws
# Behind the black cover on the back there are two screws that have to be removed
#* The best way is to slide a small, narrow screwdriver between the plastic parts from the end face, pry carefully and move the screwdriver further under the cover.
# The rear shell can now simply be unscrewed
# Removing the front shell is a bit tricky:
#* From the top and the bottom, a gap can be widened towards the side quite easily
#* Slightly below the middle, however, it catches on both sides. There are small plastic clips there that break rather easily.
#* Insert a very narrow screwdriver into one of the gaps (start on the side without buttons) and push the screwdriver inwards so that the side is pressed outwards.
#* With some skill it comes off without breaking anything, but it is not critical if the clip cracks slightly.
# Removing the circuit board is not quite as difficult.
#* Start with the side without buttons:
#* Carefully lift this side slightly. Watch the contacts at the lower end
#* Then try to pull the board sideways and upwards out of the housing
#* Watch the speaker cable and the camera connector while doing so
# Reassembly works the same way in reverse ;)
#* Do not forget the speaker and camera connectors

=== Pictures of the VP5500 internals ===
<gallery widths="240" >

Datei:Oberseite.jpg | View of the top side
Datei:Oberseite_beschriftet.jpg | Top side with components labeled
Datei:Drumherum.jpg | Overview of the components
</gallery>

<gallery widths="240" >
Datei:Kontakte_Oberseite.jpg | UART interface, top side
Datei:Kontakte_Unterseite.jpg | JTAG interface, bottom side
</gallery>

=== Test pins on the VP5500 ===
Original list by [http://www.mikrocontroller.net/topic/170483#1646514 Tino]: [[Media:Philips vp5500 Liste Testpins.pdf|download]].

[[Bild:Philips vp5500 testpins 1.jpg|thumb| Test pins on the back]]
[[Bild:Philips vp5500 testpins 2.jpg|thumb| Front side: the ICs were desoldered for measuring.]]

{| class="wikitable"
! Test pin no. !! Connected to !! Function
|-
| 1 || LED || unknown
|-
| 2 || Camera pin 8 on CN1 || unknown
|-
| 3 || Pin 4 on IC1 || unknown
|-
| 4 || e.g. IC5 pin V 18 || VDDA
|-
| 5 || GND || GND
|-
| 6 || NC || NC
|-
| 7 || Pin E 16 on IC5 || SAP_CLK
|-
| 8 || Pin 1 on CN 3 || Speaker
|-
| 9 || Pin 2 on CN 3 || Speaker
|-
| 10 || Pin on CN 4 || ?
|-
| 11 || Pin on CN 4 || ?
|-
| 12 || Pin on CN 4 || ?
|-
| 13 || Pin L 16 on IC5 || UART2_TXD
|-
| 14 || Pin C 12 on IC5 || USBH1_RXDM
|-
| 15 || Pin H 12 on IC5 || USBH1_TXDP
|-
| 16 || Pin B 19 on IC5 || CSPI2_SS2
|-
| 17 || Pin B16 on IC5 || SSI2_FS
|-
| 18 || Pin F 18 on IC5 || KP_ROW0
|-
| 19 || Pin K 18 on IC5 || KP_COL4
|-
| 20 || Pin L 19 on IC5 || UART3_TXD
|-
| 21 || Pin T 14 on IC5 || RESET_IN
|-
| 22 || goes to + of the capacitor next to IC6 || unknown
|-
| 23 || Pin D 13 on IC 9 || CLKOUT
|-
| 24 || Pin E 13 on IC 9 || CLKIN_P
|-
| 25 || Pin A 9 on IC 9 || DAC_D
|-
| 26 || Pin A 8 on IC 9 || DAC_A
|-
| 27 || Pin A 7 on IC 9 || DAC_B
|-
| 28 || Pin A 6 on IC 9 || DAC_C
|-
| 29 || Pin L 13 on IC 5 || UART1_TXD
|-
| 30 || Pin T 16 on IC 5 || BOOT1
|-
| 31 || Pin K 10 on IC 5 || UART1_RXD
|-
| 32 || Pin U 17 on IC 5 || BOOT2
|-
| 33 || Pin 9 on IC 23 || unknown
|-
| 34 || Pin D 19 on IC5 || CSPI2_SCLK
|-
| 35 || Pin C 14 on IC5 || TIN
|-
| 36 || Pin C 19 on IC5 || CSPI2_SS1
|-
| 37 || Pin D 18 on IC5 || CSPI2_SS0
|-
| 38 || Pin E 19 on IC5 || CSPI2_MOSI
|-
| 39 || Pin H 19 on IC5 || PWMO
|-
| 40 || Pin J 9 on IC 7 and IC 8 || VDD
|-
| 41 || Pin J 19 on IC 5 || KP_COL2
|-
| 42 || Pin K 16 on IC 5 || KP_COL3
|-
| 43 || Pin J 11 on IC 5 || KP_ROW2
|-
| 44 || Pin J 17 on IC 5 || KP_COL1
|-
| 45 || Pin G 19 on IC 5 || KP_ROW4
|-
| 46 || Pin G 17 on IC 5 || KP_ROW3
|-
| 47 || Pin D 5 on IC 10 and IC 11 || ACC
|-
| 48 || Pin G 16 on IC 5 || KP_ROW1
|-
| 49 || Pin J 18 on IC 5 || KP_COL0
|-
| 50 || Pin V 18 on IC 5 || VDDA
|-
| 51 || Pin 2 on CN KB connector || ?
|-
| 52 || Pin 2 on IC 15 || ?
|-
| 53 || Pin 4 on the LCD connector || ?
|-
| 54 || Pin E 17 on IC 5 || CSPI2_MISO
|-
| 55 || Pins 1,2,3,10,13 on the LCD connector || ?
|-
| 56 || Pin 3 on IC 16 || ?
|-
| 57 || Pin U 10 on IC 5 || PC_PWRON
|-
| 58 || Pin 1 on IC 18 || ?
|-
| 59 || to diode above IC 16 || ?
|-
| 60 || Pin 3 on IC 25 || ?
|-
| 61 || Charging contact positive || Charger +
|-
| 62 || GND || GND
|-
| 63 || Battery center contact || Battery level? Temperature?
|-
| 64 || Transistor below IC 16 || ?
|-
| 65 || Pin 2 on IC 22 || ?
|-
| 66 || Pins 1,12,30 on IC 24 || VSS
|-
| 67 || Capacitor + below IC 20 || ?
|-
| 68 || Pin 5 on IC 21 || ?
|-
| 69 || Pin 2 on IC 4 || ?
|-
| 70 || Pin 25 on IC24 || MCLK
|-
| 71 || Processor pin W 14 || QVDD
|}

=== Test pins on the VP6500 ===
Download the original lists by [http://www.mikrocontroller.net/topic/170483#1658720 Tino]: [[Media:V6500_Back.xls|back side]], [[Media:V6500_Front.xls|front side]].

[[Bild:V6500_Back.jpg|thumb| Test pins on the back]]

{| class="wikitable"
! Test pin no. !! Connected to !! Function
|-
| 1 || Pin 21 on BGW211EG || POR_N
|-
| 2 || Pin A 14 on i.MX21 || TOUT
|-
| 3 || Pin 46 on BGW211EG || JTAG_TDI
|-
| 4 || Pin 44 on BGW211EG || JTAG_TCLK
|-
| 5 || Pin 42 on BGW211EG || JTAG_TDO
|-
| 6 || Pin 47 on BGW211EG || JTAG_TRST_N
|-
| 7 || Pin 45 on BGW211EG || JTAG_TMS
|-
| 8 || Pin 43 on BGW211EG || JTAG_RTCLK
|-
| 9 || Pin C 14 on i.MX21 || TIN
|-
| 10 || VDD || VDD
|-
| 11 || Pin K 10 on i.MX21 || UART1_RXD
|-
| 12 || Pin L 13 on i.MX21 || UART1_TXD
|-
| 13 || GND || GND
|-
| 14 || Pin U 17 on i.MX21 || BOOT2
|-
| 15 || Pin V 16 on i.MX21 || BOOT0
|}

[[Bild:V6500_Front.jpg|thumb| Front side: the ICs were desoldered for measuring.]]

{| class="wikitable"
! Test pin no. !! Connected to !! Function
|-
| 1 || Pin 29 on BGW211EG & F 16 on i.MX21 || RESET_N / CSPI1_SS1
|-
| 2 || Pin 28 on BGW211EG & J 12 on i.MX21 || SPI_DAT_MOSI / CSPI1_MOSI
|-
| 3 || Pin 27 on BGW211EG & F 19 on i.MX21 || SPI_SS_N / CSPI1_SS0
|-
| 4 || Pin 26 on BGW211EG & F 17 on i.MX21 || SPI_DAT_MISO / CSPI1_MISO
|-
| 5 || Pin 25 on BGW211EG & H 10 on i.MX21 || SPI_CLK / CSPI1_SCLK
|-
| 6 || Pin 24 on BGW211EG & H 11 on i.MX21 || SPI_EXT_INT / CSPI1_RDY
|-
| 7 || Pin 41 on BGW211EG || UART_TX
|-
| 8 || Pin 40 on BGW211EG || UART_RX
|-
| 9 || Transistor below the test point || ?
|-
| 10 || GND || GND
|-
| 11 || LED || ?
|-
| 12 || Pin 8 on the camera connector || ?
|-
| 13 || Pin 18 on the camera connector || ?
|-
| 14 || Pin E 13 & H12 on FS455 || CLKIN / PREF
|-
| 15 || Pin D 13 on FS455 || CLKOUT
|-
| 16 || Pin A 11 on FS455 || XTAL_OUT
|-
| 17 || VDD of FS455 || VDD
|-
| 18 || Pins 15 & 16 on the display connector || ?
|-
| 19 || goes to 2 diodes to its left || ?
|-
| 20 || Pin M 19 on i.MX21 || UART1_CTS
|-
| 21 || Pin M 18 on i.MX21 || UART1_RTS & GND ??
|-
| 22 || VDD on i.MX21 || VDD
|-
| 23 || Pin G 10 on i.MX21 || USB_BYP
|-
| 24 || Pin A 19 on i.MX21 || SSI3_FS
|-
| 25 || Pin D 17 on i.MX21 || SSI2_CLK
|-
| 26 || VDD on i.MX21 || VDD
|-
| 27 || Pin T 17 on i.MX21 || SD1_D3
|-
| 28 || Pin A 15 on i.MX21 || SAP_TXDAT
|-
| 29 || GND || GND
|-
| 30 || Pin T 14 on i.MX21 || RESET_IN
|-
| 31 || Pin R 19 on i.MX21 || TRST
|-
| 32 || Pin P 19 on i.MX21 || TMS
|-
| 33 || Pin N 17 on i.MX21 || TCK
|-
| 34 || Pin K 11 on i.MX21 || TDO
|-
| 35 || Pin P 18 on i.MX21 || TDI
|-
| 36 || Pin 13 on TLV320 || OUTP2
|-
| 37 || Pin 14 on TLV320 || OUTMV
|-
| 38 || Pin 15 on BGW211EG || VDD
|-
| 39 || Pin on jack socket || ?
|-
| 40 || Pin on jack socket || ?
|-
| 41 || ? || ?
|-
| 42 || Keypad LED || ?
|-
| 43 || Keypad LED || ?
|-
| 44 || Pins 1,8 on 20XN2512 & Key ON || PowerON
|-
| 45 || Pin 7 on BDR72K || ?
|-
| 46 || Pin 2 on BDR72K || ?
|-
| 47 || Pin 6 on MRRBGB3 || ?
|-
| 48 || Keypad LED || ?
|-
| 49 || Pin 10 on MRRBG3 || ?
|-
| 50 || Keypad LED || ?
|-
| 51 || goes to resistor on the back ? || ?
|-
| 52 || Pin L 13 on i.MX21 || UART1_TXD
|-
| 53 || goes to diode and capacitor on the back || ?
|-
| 54 || Pin C 14 on i.MX21 || TIN
|-
| 55 || goes to resistor on the back || ?
|-
| 56 || Pin 7 on BDR72K || ?
|-
| 57 || Pin K 10 on i.MX21 || UART1_RXD
|-
| 58 || Pin U 17 on i.MX21 || BOOT2
|-
| 59 || Pin T 16 on i.MX21 || BOOT1
|-
| 60 || goes to resistor on the back || ?
|-
| 61 || Pin V 16 on i.MX21 || BOOT0
|-
| 62 || Keypad LED || ?
|-
| 63 || Keypad LED || ?
|-
| 64 || Pin 25 on TLV320 || MCLK
|-
| 65 || Pin 4 on 69W2440D || ?
|-
| 66 || VDD of TLV320 || VDD
|-
| 67 || goes to resistor on the back || ?
|-
| 68 || goes to resistor on the back || ?
|-
| 69 || Pin 65 on MRRBG3 || ?
|-
| 70 || GND || GND
|-
| 71 || Keypad LED || ?
|-
| 72 || Keypad LED || ?
|-
| 73 || Battery positive || Battery +
|-
| 74 || Charging contact positive || Charging contact positive
|-
| 75 || Battery positive || Battery +
|}

== UART ==
TIN must be pulled low to activate the interface.
However, the phone then does not boot completely (for a fix see [[#Betrieb mit aktivierter serieller Schnittstelle]]; it is sufficient for rooting, though, see [[#per serieller Schnittstelle]]).

* Voltage level: 3.3 V
* Baud rate: 115200 bps
* Stop bits: 1
* Flow control: none
=== VP5500 ===
[[Bild:VP5500_seriell_highlight.svg|100px|UART pins VP5500 (front side of the board)]]

The serial interface is accessible from the bottom of the battery compartment.
The picture above marks the pins needed for serial communication.

=== VP6500 ===
[[Bild:VP5600-Serialport-Preliminary.jpg|100px|UART pins VP6500]]

The serial interface is accessible from the bottom of the battery compartment.
The picture above marks the pins needed for serial communication.
VCC apparently supplies the unregulated battery voltage; the boot pins have not been verified.

== JTAG ==
[[Bild:jtag.jpg|Pinout of the JTAG pins (back side of the board)]]

== Hardware modifications ==

Now that we already know how to [[#Zerlegen_des_Telefons|disassemble the phone]], nothing stands in the way of retrofitting components and interfaces.

=== Installing a UART socket in the VP5500 ===
<gallery>
Datei:10_Pins.JPG|1. Unmodified pin row
Datei:11_Pins_bearbeitet.JPG|2. Prepared pin row
Datei:09_Pads.JPG|3. Prepared pads
Datei:12_Pins_ausrichten.JPG|4. Aligning the pin row
Datei:13_Pins_angelötet_1.JPG|5. Pin row soldered on
Datei:14_Pins_angelötet_2.JPG|6. Pin rows soldered on
Datei:08_Mittelteil.JPG|7. Middle part with filed-out hole
Datei:15_Mittelteil_zusammengebaut.JPG|8. Put back together
Datei:16_in_Aktion.JPG|9. Serial interface connected
</gallery>

First, the phone has to be disassembled ([[#Zerlegen des Telefons|disassembling the phone]]).
To avoid affecting the housing, I decided to leave everything as it is and install only small sockets.

# For this I used single-row machined IC sockets
# Clipped off their legs and applied some solder (while still attached to the long row, the short, narrow parts are very easy to handle)
# Likewise tinned the pads a little
# Align and solder in place
# One row
# The second row
# At the housing opening above the pins, I widened the edge slightly with a fine needle file. I did not change anything on the rubber mat.
# Done
# And in use

=== Making the UART connection of the VP6500 accessible ===
Advantage of the VP6500: it does not have to be disassembled for this, because the copper pads are hidden under the batteries in the battery compartment. This allows a solder-free serial connection. The pinout is described under [[#VP6500_2|UART]].
<gallery>
Datei:01_Kuli-Molex.jpg|1. Required parts: piece of plastic + Molex connector
Datei:02_gefeiltes_Plastik_18_mm.jpg|2. Length of the recess in the battery compartment: 18 mm
Datei:03_gefeiltes_Plastik_4_mm.jpg|3. Width of the recess in the battery compartment: 4 mm
Datei:04_gefeiltes_Plastik_passt.jpg|4. File until it fits
Datei:05_gefeiltes_Plastik_Kerben.jpg|5. Pad spacing marked
Datei:06_gekerbtes_Plastik_Molex.jpg|6. How to get the springs out of the connectors
Datei:07_Federelement_roh.jpg|7. Fresh out of the connector
Datei:08_Federelement_offen.jpg|8. Bent open
Datei:10_Federelement_unter_Plastik.jpg|9. In the plastic guide
Datei:09_Federelement_umgebogen.jpg|10. Bent over
Datei:11_halbfertig_passt.jpg|11. Notches adjusted?
Datei:12_ganz_fertig1.jpg|12. Secure with hot glue
</gallery>
*1. Since we want to connect 4 contacts, we need 4 spring contacts from "Molex" connectors as known from CPU fans. (Either you have a 4-pin connector from a modern PWM-controlled fan at hand to cannibalize, or you take apart two old 3-pin ones.)
*2./3. You also need a piece of plastic measuring 4 mm by 18 mm to serve as a carrier for the spring contacts. The height is not critical. For example, the clip of a ballpoint pen can be cut and filed to size.
*4. File the plastic carrier until it fits into the recess (tip: tie a piece of string around it so that it can be removed again)
*5. Mark the spacing of the pads that have to be contacted (TIN, GND, RX, TX: see [[#UART|UART]]) and saw guide notches for the springs into the underside of the carrier. Small cutouts are likewise needed to route the metal tongues from the board into the battery compartment.
*6./7. The springs come out of the Molex connectors very easily if you carefully poke a small flat screwdriver into the openings and pull on the cable at the back.
*8. Now bend the individual metal tongues open to 90°.
*9. Slide them into the guide notch.
*10. Bend the metal tongue over (pliers).
*11. If all notches and cutouts are measured correctly and filed well, everything fits into the gap in the battery compartment without the different metal parts touching each other.
*12. Some heat-shrink tubing and hot glue prevent the metal tongues from slipping and reduce the risk of short circuits. The hot glue also transmits force, so that the batteries can press the slightly springy metal tongues firmly onto the contacts on the board.
*13. Insert the batteries. Without them, the whole assembly will eventually fall out again and the connection will be unreliable from the start.

Troubleshooting:
* No contact: (you can, for example, check ground for continuity.)
** Carrier not pushed deep enough into the gap: push it down with a narrow screwdriver. This is mostly needed on the side where the metal tongues come up.
** Carrier is pushed outwards (the metal tongues are springy, after all): use more hot glue to get more pressure from the batteries.
** Carrier is deep inside, but there is still no connection or an unreliable one: the metal tongues on the underside have different "thicknesses", or the notches differ in depth. A drop of solder on the metal tongues that sit too low fixes this. (Note: the author had to apply a little solder everywhere to achieve the necessary thickness and a reliable connection.)

= Use for Voice over IP (VoIP, SIP) =

== Further links ==

*http://de.wikipedia.org/wiki/IP-Telefonie
*http://de.wikipedia.org/wiki/Session_Initiation_Protocol
*http://de.wikipedia.org/wiki/H.323
*http://de.wikipedia.org/wiki/Softphone

The user is 103.

Using [http://ekiga.org Ekiga], a voice connection to the phone was established in initial tests by calling 103@xxx.xxx.xxx.xxx.

== SIP settings ==

Alternatively, the SIP settings can be made directly in the file <tt>/user_data/data/hpr0userparam.cfg</tt>.

=== Settings for 1und1 ===

* SIP1:
** Display Name: ...
** Username: 49#AREACODEWITHOUT0#NUMBER#
** Telephone Number: 0#AREACODEWITHOUT0#NUMBER#
* Auth:
** Auth Username: 49#AREACODEWITHOUT0#NUMBER#
** Password: *********
* Server:
** sip.1und1.de:5060
* Proxy:
** sip.1und1.de:5060
* RTP:
** 30000 and 30019
* STUN:
** stun.1und1.de
* STUN Server Port:
** 3478
* SIP2:
** UDP: 5060
** TCP: 5060

=== Settings for Vodafone NGN (Arcor NGN) ===

* SIP1:
** Display Name: ...
** Username: AREACODEANDNUMBER
** Telephone Number: (leave empty)
* Auth:
** Auth Username: AREACODEANDNUMBER
** Password: *********
* Server:
** arcor.de:5060
* Proxy:
** AREACODE.sip.arcor.de:5060
* RTP:
** 10000 and 10001
* STUN:
** (X) use rport
* SIP2:
** UDP: 5060
** TCP: 5060

(tested by Micha with an EasyBox 802; switch off n-WLAN on the EasyBox)

=== Settings for Sipgate ===

*SIP1
**Display Name: Sipgate user name
**User Name: your SIPgate-ID
**Telephone Number: Sipgate phone number
*Auth
**Authentication UserName: your SIPgate-ID
**Password: Sipgate password
*Server
**SIP register address:port: sipgate.de:5060
*Proxy
**SIP proxy1 address:port: sipgate.de:5060
*RTP Audio Channel
**RTP: 22800
**RTCP: 22801
*RTP Video Channel
**RTP: 22804
**RTCP: 22805
*STUN: [X] use rport
*SIP2
**SIP Port Listen
**for UDP: 5062
**for TCP: 5062
**for TCP TLS: 5053

SIP Outbound must be left empty.

=== Alternative settings for Sipgate ===

Behind my Freetz box the settings above did not work: incoming calls were not signalled. The following settings do work:

*SIP1
**Display Name: Sipgate user name
**User Name: your SIPgate-ID
**Telephone Number: Sipgate phone number
*Auth
**Authentication UserName: your SIPgate-ID
**Password: Sipgate password
*Server
**SIP register address:port: sipgate.de:5060
*Proxy
** empty
*SIP2
**Symmetric Mode [X]
**SIP Port Listen
**for UDP: 5062
**for TCP: 5062
**for TCP TLS: 5053
*STUN: [x] use rport

SIP Outbound must be left empty.

A video call from Sipgate to Sipgate between two VP6500 phones was tested successfully. If necessary, the phone's own video transmission has to be started by pressing the '''Video''' key.

For me it only worked over UDP, and with long delays (>1 s).

=== Settings for Ekiga.net ===

*SIP1
**Display Name: anything
**User Name: username
**Telephone Number: empty
*Auth
**Authentication User Name: username
**Password: password
*Server
**SIP register: ekiga.net:5060
**Protocol: ( ) TCP (*) UDP
**Expire Timer: 3600
**Keep Alive: 0
*Proxy
**all empty
*STUN
**( ) use rport
**STUN Server IP address: stun.ekiga.net
**STUN Server port: 3478
*SIP2
**(*) Symmetric Mode
**UDP: 5060
**TCP: 5060
**TCP TLS: 5061
*OBproxy
**all empty

After the (free) registration at ekiga.net and after registering the phone, you can run an audio and video(!) test by calling 500.
Further features: https://www.ekiga.net/index.php?page=services
Ekiga does not assign landline numbers, so calls from or to landline phones are not possible.
Devices registered at Ekiga.net can, however, call each other without problems, even with video. From the main screen of the VP5500/6500 you can only dial numeric contacts (conventional phone numbers) directly, but Ekiga.net addresses have the form [username]@ekiga.net. Therefore simply create a new contact (phone book) via the menu of the VP5500/6500 and enter [username]@ekiga.net as the video number, where [username] is the name of the Ekiga account you want to reach. The [#] key of the VP5500/6500 switches between entering letters, digits and special characters.
If video does not work during a call via Ekiga, manually activating the video function with the Video key on the VP6500 keypad may help.

=== Settings for Fritzbox 7170/7270 and others with a SIP registrar ===
Start the registration of a phone on the Fritzbox (System/Ansicht/"Expertenansicht aktivieren", then Telefonie/Telefoniegeräte/"Neues Gerät einrichten", "Telefon", "Bitte auswählen"/"LAN/WLAN (IP-Telefon)") and let it assign a number. Then switch to Registration on the VPx500 and make the settings listed below. The phone will then want to activate these settings; start the registration on the FB before that happens.

In this example the FB wants to assign the number 621:

'''SIP1 tab'''
* Display Name: irrelevant (leave empty)
* User Name: 621
* Telephone Number: 621

'''Auth tab'''
* Authentication UserName: 621
* Password: [enter the same one as on the FB]

'''Server tab'''
* SIP register address:port: 192.168.2.1:5060 (adapt the IP to your network) or alternatively fritz.box:5060
* Protocol: UDP
* ExpireTime: 3600
* Keep Alive: 300

'''SIP2 tab'''
* [ ] Symmetric Mode
* SIP Port Listen for UDP: 5060

'''Notes:'''
* Protocol: UDP. With TCP the VP6500 kept losing the connection to the FB after minutes or hours.
* Keep Alive: 300. I set the keep-alive time because I wanted to work around the TCP problems. I do not expect this to cause battery problems, and since it works perfectly this way, I left it at that. UDP also gets by with less network traffic.
* Registration: The Fritz!Box did not always confirm the registration successfully. Just click Weiter (Next) and make the remaining settings. After a reboot of the VPx500 at the latest, everything works flawlessly.

If your Fritz!Box offers no way to register an IP phone, I recommend reading up on http://wiki.ip-phone-forum.de/skript:speedport2fritz or searching the IP-Phone-Forum for SIP registrar.

The SIP listen port 5060 cost me a lot of time: it was still set to 5062 from using sipgate directly, and the VP5500 then did register with the Fritzbox but did not receive any incoming calls.

=== Fritz!Box note ===
Behind my Fritz!Box I could not receive incoming calls on port 5060, because the box listens on this port for its own system.
Switching to port 5061 on the SIP2 tab helped.

== Own VoIP server with Asterisk ==

*http://www.das-asterisk-buch.de

===Asterisk on a Fritz!Box===

*http://www.asterisk-kompakt.de/artikel/45-asterisk-auf-fritzbox-phone.html

== Video calls to other clients / softphones ==
=== Ekiga Softphone ===

For video to work, Ekiga and the VP6500 must have at least one video codec in common. The VP6500 apparently offers only H.263, in several variants.

On Ubuntu 9.10, however, Ekiga initially ships only with the H.261 and Theora codecs. The H.263 codec therefore has to be added to Ekiga.

Unfortunately the H.263 codec is not entirely freely available, so a third-party source is needed to install it with the package manager. Bojo's Ekiga plugin PPA therefore has to be added as a package source as described at

https://launchpad.net/~bojo42/+archive/ekiga

Do not forget the corresponding key!
Once the new source is known, the following packages can be installed in the package manager:

; Packages
: libopal3.6.1-plugins-h263-1998
: libopal3.6.1-plugins-ilbc
: libavcodec-dev

Make sure that all dependencies are cleanly satisfied.
The H.263 plugin can only be installed if libstdc++6 >= 4.4.0 is present, which as far as I know is the case only from Ubuntu 9.10 onwards.
After installing these components, Ekiga can be restarted. It should now additionally offer the H.263 codec under Bearbeiten->Einstellungen->Video->Codecs (Edit->Preferences->Video->Codecs). Moving it up gives it priority in the handshake.
After integrating the H.263 codec into Ekiga, I was able to make video calls between two Sipgate accounts without any disturbance, even over the same DSL line.

== DEMO MODE ==

To set up DEMO MODE between a VP55 and a VP65, the settings files have to be copied from the VP55 to the VP65. There are four files, located in /usr/local/data/demo/. After that, the SSID and the KEY are configured identically on both devices, and the two devices connect to each other.

= Software =

== Beware of traps! ==
It is not hard to lock yourself out of the phone if you are not careful.

== Operation with the serial interface enabled ==

The 5500 and the 6500 do not seem to boot completely when TIN is held low and the serial interface is in use. The device then cannot be operated normally. This can be changed by editing the file <tt>/usr/local/startup/daemon.sh</tt> and changing TINDETECT="TRUE" to TINDETECT="FALSE" near the top.

After that the phone boots completely even with an active serial connection, and the device can be operated normally.
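
The change to TINDETECT can also be scripted instead of being made by hand in vi. The following is only a minimal sketch: it assumes that the line in daemon.sh reads exactly TINDETECT="TRUE" and that a working sed is available (this has not been verified for the phone's BusyBox; the script can also be run on a copy of the file on a PC). The function name and the .orig / .new suffixes are made up for illustration.

<syntaxhighlight lang="bash">
#!/bin/sh
# Hypothetical helper, not part of the phone firmware.
# Usage: disable_tindetect /usr/local/startup/daemon.sh
disable_tindetect() {
    file="$1"
    # Keep an untouched copy so the change can be reverted.
    cp "$file" "$file.orig" || return 1
    # Rewrite into a temporary file; "sed -i" may be missing on old BusyBox builds.
    sed 's/TINDETECT="TRUE"/TINDETECT="FALSE"/' "$file.orig" > "$file.new" || return 1
    # Copy the contents back with cat so the permissions of the original file are kept.
    cat "$file.new" > "$file" && rm -f "$file.new"
}
</syntaxhighlight>

If something goes wrong, the unmodified script is still available as daemon.sh.orig.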

== Basics ==
Certain actions are needed again and again.
They are described briefly here.

Basic knowledge of the command line is nevertheless assumed.

==== Editing files with vi ====
The minimalist editor vi is installed on the phone; it can be used to edit files over Telnet.
For detailed instructions, please search the web.
The most important commands are explained briefly here.

Open a file with 'vi filepath'.
vi has two modes: command mode and insert mode.

'''i''' - switches to insert mode, in which text can be typed 
'''a''' - append (enters insert mode after the current character) 
'''[Esc]''' switches back to command mode.
In command mode you can navigate with the arrow and page keys.
:x - deletes the character at the cursor position
:d''n''d - deletes ''n'' line(s) into the line buffer (without ''n'': one line)
:y''n''y - copies ''n'' line(s) into the line buffer (without ''n'': one line)
:p - inserts the contents of the line buffer '''below''' the current line
:<nowiki>:</nowiki>q! - quits without saving
:<nowiki>:</nowiki>w - saves
:<nowiki>:</nowiki>wq - saves and quits

==== Uploading files to the phone ====
To fetch data from an HTTP server, use
wget url
The file is downloaded into the current directory, so change to the target directory first.

To fetch data from an FTP server, use ftp.
Here, too, you have to change to the target directory first.
ftp hostname
Then enter the login data if required, use 'cd' and 'ls' to change to the appropriate directory on the FTP server,
and then download the file with
get filename

==== Downloading files from the phone ====
FTP is a good choice here as well.
Connect with 'ftp hostname', enter the login data, change to the appropriate directory on the FTP server and upload a local file with
put localFile

==== Alternative: Dropbear ====

If the Dropbear SSH server has been installed (see [[#Dropbear (SSH-Server) installieren|Installing Dropbear (SSH server)]]), files can be exchanged very conveniently with any program that supports SCP (e.g. [http://winscp.net/eng/docs/lang:de WinSCP] for Windows).

WinSCP can also edit files directly.
The client downloads the file, opens an editor and uploads the file again once it has been changed.

== Basic configuration that makes life easier ==
=== .bashrc ===
The installed shell is bash. It reads its settings from the (hidden) file ".bashrc". Useful additions can be made here so that the phone behaves a little more like a familiar Linux system:

<syntaxhighlight lang="bash">
alias vim=vi
PS1="[\u@\h] \w $ "
</syntaxhighlight>

== Backup ==
=== Backing up the flash ===
To make a backup of your root partition, proceed as follows:

<syntaxhighlight lang="bash">
echo '#!/bin/sh' > /tmp/backup.sh
echo 'cat /dev/mtdb2 2>/dev/null' >> /tmp/backup.sh
chmod 700 /tmp/backup.sh
micro_inetd 31337 /tmp/backup.sh
</syntaxhighlight>

This makes your device listen for incoming connections on port 31337.
On your host system you can then simply run
nc ip.of.your.phone 31337 > fon_rootfs
and you have your rootfs packed into a file.
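
Because nc gives no indication of whether the transfer was complete, it can be worth checking the size of the dump on the host. The following is a minimal sketch under one assumption: the root partition has the size shown in the kernel command line of the logread output further down (<tt>50432k@0x00120000(fs#1)</tt> on that VP6500); other devices may differ. The function names are made up for illustration.

<syntaxhighlight lang="bash">
#!/bin/sh
# Size of the root partition in KiB, taken from the mtdparts entry
# "50432k@0x00120000(fs#1)" of one VP6500 (assumption: same layout on your device).
ROOTFS_KIB=50432

# Print the size in bytes that a complete dump of a partition of $1 KiB must have.
expected_dump_bytes() {
    echo $(( $1 * 1024 ))
}

# Succeed only if file $1 has exactly the size of a partition of $2 KiB.
# Usage: dump_is_complete fon_rootfs "$ROOTFS_KIB"
dump_is_complete() {
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" -eq "$(expected_dump_bytes "$2")" ]
}
</syntaxhighlight>

A dump that is shorter than expected usually means that the connection was cut off and the backup should be repeated.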

Note that the backed-up file is not an ext2 image but a jffs2-formatted
file system. It cannot be mounted with a simple "mount -o loop",
so here is how to mount it (on the host, as root):

<syntaxhighlight lang="bash">
modprobe jffs2
# emulate a 64 MiB MTD device in RAM with 128 KiB erase blocks
modprobe mtdram total_size=65536 erase_size=128
modprobe mtdblock
mkdir /tmp/phone-root
# block device node for the first MTD device (major 31, minor 0)
mknod /tmp/phone-mtdb2 b 31 0
# copy the dump into the emulated flash and mount it
dd if=/your/backup/file of=/tmp/phone-mtdb2
mount -t jffs2 /tmp/phone-mtdb2 /tmp/phone-root
</syntaxhighlight>

[[#Dateisystem herunterladen|Downloading the file system]] describes another way to dump the file system for closer examination.
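
Before copying a dump into the emulated flash, it can be checked whether it fits the mtdram parameters used above. The sketch below assumes that a dump of a complete partition is a whole number of erase blocks; <tt>total_size</tt> and <tt>erase_size</tt> of mtdram are given in KiB. The function name is hypothetical.

<syntaxhighlight lang="bash">
#!/bin/sh
# Succeed if an image of $1 bytes fits into an mtdram device of $2 KiB
# and consists of whole erase blocks of $3 KiB.
# Usage: fits_mtdram 51642368 65536 128
fits_mtdram() {
    bytes=$1
    total=$(( $2 * 1024 ))
    erase=$(( $3 * 1024 ))
    [ "$bytes" -le "$total" ] && [ $(( bytes % erase )) -eq 0 ]
}
</syntaxhighlight>

For a 50432 KiB root partition (51642368 bytes) the values total_size=65536 and erase_size=128 pass this check.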

=== Complete backup ===
Download the file http://www.mikrocontroller.net/attachment/73323/S91backup_pipe to the device and mark it as executable.

cd /etc/rc.d/init.d
wget http://www.mikrocontroller.net/attachment/73323/S91backup_pipe
chmod +x /etc/rc.d/init.d/S91backup_pipe

This starts a kind of backup server at boot.
If you do not want that, you can of course put the script anywhere else and start it by hand.

The individual areas can now be saved from a PC with nc (netcat):

nc 192.168.1.3 31337 > fon_rootfs
nc 192.168.1.3 31338 > fon_udata
nc 192.168.1.3 31339 > fon_usettings
nc 192.168.1.3 31340 > fon_bootld
nc 192.168.1.3 31341 > fon_kernel

Replace 192.168.1.3 with the IP address of the device, of course; the fon_*
file names can be chosen freely as well.
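
The five nc calls can be wrapped in a small loop on the host. This is only a sketch: the port/file pairs are the ones listed above, while the function name and the DRY_RUN switch are made up for illustration. With DRY_RUN=1 the commands are only printed, which is handy for checking the IP address first.

<syntaxhighlight lang="bash">
#!/bin/sh
# Fetch all five areas offered by S91backup_pipe from the phone at IP $1.
# Usage: backup_all 192.168.1.3
#        DRY_RUN=1 backup_all 192.168.1.3   (only print the commands)
backup_all() {
    ip="$1"
    for pair in 31337:fon_rootfs 31338:fon_udata 31339:fon_usettings \
                31340:fon_bootld 31341:fon_kernel; do
        port=${pair%%:*}   # part before the colon
        file=${pair#*:}    # part after the colon
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "nc $ip $port > $file"
        else
            nc "$ip" "$port" > "$file" || return 1
        fi
    done
}
</syntaxhighlight>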

fon_udata is the /user_data partition, and fon_usettings accordingly
the /user_settings partition.

The bootld area also contains the parameters. To restore it later with
blob (if that is ever necessary), the file would have to be split into
two parts: the first contains the pure bootloader area, the second the
parameters. If anyone really needs it, I can provide instructions for
splitting. It is not really useful, though: once the bootloader has been
overwritten with something else, it can no longer be used for restoring
anyway.

=== Restoring the backups ===

1) You need one (or more) backup image(s).

2) A serial connection to the phone.

3) Disconnect the battery and reconnect it.

4) Switch the phone on and hammer on the keys in the terminal program so
that you end up in the bootloader. You have to be quite quick. A prompt
should then appear:

<syntaxhighlight lang="text">
blob>
</syntaxhighlight>

5) Now enter
xdownload param
where 'param' is the area you want to restore:
* blob - bootloader (fon_bootld)
* param - parameter area (bootloader or kernel?) (fon_bootld)
* kernel - the kernel (fon_kernel)
* ramdisk - the root file system / (fon_rootfs)
* ramdisk2 - the /user_data file system (fon_udata)
* ramdisk3 - the /user_settings file system (fon_usettings)

In the backup, blob + param are in one file, so that file may have to be
split first.

6) The bootloader then waits for the upload. Now start the upload of the
backup image in the terminal program, using the XMODEM protocol.

7) Have a coffee, take a break, strike up a conversation with someone.

8) At some point the upload is finished. It simply takes a long time. You
end up at the "blob>" prompt again. Now enter
flash param

9) The bootloader now writes what you uploaded into the flash.

10) Enter "boot". The phone now boots normally.

Caution: If the backup does not contain the change that lets the phone
start with the serial interface connected, you will not get any further
than the hourglass. In that case simply unplug the serial connection
(i.e. release the TIN pin) and restart the phone (briefly disconnect and
reconnect the battery).
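
To avoid mixing up the area names during a restore, the assignment of backup files to blob areas from step 5 can be written down as a small lookup on the host. This is an illustrative sketch only: the function names are made up, and fon_bootld is deliberately rejected because that file contains blob + param and would have to be split first.

<syntaxhighlight lang="bash">
#!/bin/sh
# Print the blob area that belongs to a backup file name (see the list in step 5).
blob_area_for() {
    case "$1" in
        fon_kernel)    echo kernel ;;
        fon_rootfs)    echo ramdisk ;;
        fon_udata)     echo ramdisk2 ;;
        fon_usettings) echo ramdisk3 ;;
        fon_bootld)    echo "fon_bootld contains blob + param, split it first" >&2
                       return 1 ;;
        *)             echo "unknown backup file: $1" >&2
                       return 1 ;;
    esac
}

# Print the two commands to type at the blob> prompt for a backup file.
# Usage: restore_commands fon_kernel
restore_commands() {
    area=$(blob_area_for "$1") || return 1
    echo "xdownload $area"
    echo "flash $area"
}
</syntaxhighlight>

The XMODEM upload itself still has to be started in the terminal program between the two commands.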

=== <span id="Dateisystem_herunterladen"></span>Downloading the file system ===
To browse the file system conveniently, it can be useful to copy it off the device.

As with the flash backup, the following commands create a small server whose output can be fetched on the other side with nc:

<syntaxhighlight lang="bash">
echo '#! /bin/sh' > /tmp/backup.sh
echo 'cd /' >> /tmp/backup.sh
echo 'tar cf - bin boot dev etc home lib mnt opt root sbin tmp trace upgrade user_data user_settings usr var 2>/dev/null' >> /tmp/backup.sh
chmod 700 /tmp/backup.sh
micro_inetd 31340 /tmp/backup.sh
</syntaxhighlight>

The long list of subdirectories is necessary to skip /proc, which would cause problems with tar.

On another Linux system (or Cygwin) the archive can be fetched with
<syntaxhighlight lang="bash">
nc phone-ip 31340 > file.tar
</syntaxhighlight>

Note that this takes quite a few minutes.

The result is a tar archive containing all files of the device, including the temporary files on the RAM disks.
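
Instead of typing the directory list by hand, it could be generated from the directories that actually exist. The following is a minimal sketch with a hypothetical function name; it has not been tested on the phone's shell and simply skips a top-level entry called proc.

<syntaxhighlight lang="bash">
#!/bin/sh
# Print the names of all directories directly below $1 (default: /),
# separated by spaces, leaving out "proc".
toplevel_dirs() {
    root="${1:-/}"
    list=""
    for d in "$root"/*; do
        [ -d "$d" ] || continue          # ignore plain files
        name=${d##*/}                    # strip the leading path
        [ "$name" = proc ] && continue   # /proc causes problems with tar
        list="${list:+$list }$name"
    done
    echo "$list"
}
</syntaxhighlight>

The output can then be used in place of the fixed list in the tar command.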

== Exploring the system with on-board tools ==

=== Output of <tt>dmesg</tt> on a VP6500 ===
<pre>
6>NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
NetWinder Floating Point Emulator V0.95 (c) 1998-1999 Rebel.com
VFS: Mounted root (jffs2 filesystem).
Freeing init memory: 68K
PCB version: ind3 v2
Driver SYSTCLK: SYSTCLK-1.12 (REFERENCED)
Driver GPIO-1.59 (REFERENCED)
****p_gpio_init_low_bat****
GPIO: p_gpio_it_init at 762
Driver FRAMEBUF-1.12 (REFERENCED)
Driver SPI-1.20 (REFERENCED) Debug level 3

u32_spi1_MinLenghtForDMAInTX set to 300

u32_spi1_MinLenghtForDMAInRX set to 300
Driver LCD-1.20 (REFERENCED)
Driver TVLINK-1.45 (REFERENCED)
Reset from Software Reset.
Motorola PostProcessor Linux driver ver 0.64 - Copyright (C) 2003 Motorola Inc
pp: hw ver = 2
prp_dbg=0
Motorola PreProcessor Linux driver ver 0.0 - Copyright (C) 2003 Motorola Inc
hmp4d: base_port=0x10026800 irq=50
hmp4d: module inserted
hmp4e: base_port=0x10026c00 irq=49
hmp4e: Compatble HW found with ID: 0x004c1882
hmp4e: module inserted. Major = 249
SPI2:: drv_Init :PID of driver: 134
SPI2:: drv_Init :ScanList not provided. Will use the default scan list.
SPI2:: drv_Init :ScanChannelList :1 6 11 14 2 7 12 3 8 13 4 9 5 10
SPI2:: drv_Init :setting PhyType to: Rf-to-Rf
SPI2:: drv_Init :Ref.Clock parameter not provided
SPI2:: drv_Init :Configure target for a reference clock of 'default=40' Mhz.
SPI2:: drvRegEtherDev :Interface Name is: eth%d
SPI2:: drv_Init :HEOCSIWPOWON: Powering on...
SPI2:: drvPhase2Init :Protocol Firmware will be loaded by driver ...
SPI2:: drvPhase2Init :Initializing HHAL (PhgHhalInitialize)...
Divider : 8

OCR2 : e4015308 (12582912)
Reset : 3 / 27 (c497cc00 / e401531c)

Reset : 3 / 27 (c497cc00 / e401531c)
GPIO: p_gpio_init_gpio_status at 1262
GPIO: POWER_FAIL signal NOT detected at GPIO driver init carry on !!!
GPIO: CHARGE_IN at init
GPIO: LOW_BAT_OUT at init
GPIO: No accessory plugged at init. - Set Video on jack
GPIO: camera to front at init
**ChargeStatusPmb=========gpio_Read_ChargeStatus_Ready=1
SPI2:: PhgOsal_linux_init_thread :assigning thread name and deamonize() ..
SPI2:: drvPhase2Init :Success
SPI2:: drvPhase2Init : registering callbacks with HHAL..
SPI2:: drvHhalEventIndicationHandler :
SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVNT_INIT_COMPLETE; setting CARRIER_ON
SPI2:: drvPhase2Init :calling PhgHhalQueueMgmtReq()!
PhgHhalQueueMgmtReq:1172:HHAL got Init message
PhgHhalQueueMgmtReq:1217:HHAL done Init message
SPI2:: drvHhalEventIndicationHandler :PS wake (0) in Drvmain
SPI2:: drvHhalEventIndicationHandler :
SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVENT_DISCONNECT; setting CARRIER_OFF
SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00001002, IFF_UP=0
SPI2:: drvHhalEventIndicationHandler :device was already closed
SPI2:: drvMgmtCfmHndler :Using MAC Address: 00:08:c6:86:8b:99
SPI2:: drvPhase2Init :init etherdev; stopping queue, setting CARRIER_OFF
SPI2:: drv_Init :Philips WLAN Drv - loaded - in state: 1
SPI2:: drvInit :Philips WLAN Drv - loaded
SPI2:: drvOpen :opening net device
SPI2:: drvOpen :Device is not associated!
SPI2:: drvOpen :Carrier flag is already set to CARRIER_OFF
SPI2:: drvOpen :Disabling again netqueue
requested reg.domain code setting = 3
SPI2:: drvIoctl :set u8LinkAdaptation : 1 Result=[0]
SPI2:: drvIoctl :changed HEOCSIWLNADPALLOWRATES: 8 allowed rate codes
SPI2:: drvInitConnect :Req to connect to new WLAN network
SPI2:: drvInitConnect :Disabling TX queue and setting CARRIER_OFF
SPI2:: drvInitConnect :Connecting To AP...
SPI2:: drvInitConnect :step2
SPI2:: drvHhalEventIndicationHandler :
SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVENT_DISCONNECT; setting CARRIER_OFF
SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00001003, IFF_UP=1
SPI2:: drvHhalEventIndicationHandler :device was already opened; stopping queue
SPI2:: drvInitConnect :step3 : u8Status 255
SPI2:: drvInitConnect :step5
SPI2:: drvInitConnect :TIMEDOUT
SPI2:: drvInitConnect :step6
SPI2:: drvInitParamsAndPowerOnAndConnect :Connect failed!
Motorola CSI Linux driver ver 0.1
- Copyright (C) 2004 Motorola Inc

Driver SENSOR-1.29 (REFERENCED)
i2c-client version : 1.9
Initialize i2c-client-aic14 module
Module i2c-client-aic14 initialized
Insert module aic14 (AIC14-1.0)
Module AIC14 assumes CODEC MCLK already configured for 20480000Hz
Driver KPP-1.36 (REFERENCED)
Driver DOZE-1.27 (REFERENCED)
SPI2:: drvDoScan :Buero (bittorf)
SPI2:: drvProcessScanCfm :Scan Confirm: Success 1 APs
SPI2:: drvInitConnect :Req to connect to new WLAN network
SPI2:: drvInitConnect :Disabling TX queue and setting CARRIER_OFF
SPI2:: drvInitConnect :Connecting To AP...
SPI2:: drvInitConnect :step2
SPI2:: drvHhalEventIndicationHandler :
SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVNT_INIT_CONNECT; setting CARRIER_ON
SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00001003, IFF_UP=1
SPI2:: drvHhalEventIndicationHandler :device was already opened; enabling queue
SPI2:: drvInitConnect :step3 : u8Status 8
SPI2:: drvInitConnect :step4
SPI2:: drvInitConnect :Successful
SPI2:: drvInitConnect :step6
SPI2:: drvIoctl :dot11LongRetryLimitAC0 = 8
SPI2:: drvIoctl :dot11LongRetryLimitAC1 = 8
SPI2:: drvIoctl :dot11LongRetryLimitAC2 = 8
SPI2:: drvIoctl :dot11LongRetryLimitAC3 = 8
SPI2:: drvIoctl :dot11ShortRetryLimitAC0 = 8
SPI2:: drvIoctl :dot11ShortRetryLimitAC1 = 8
SPI2:: drvIoctl :dot11ShortRetryLimitAC2 = 8
SPI2:: drvIoctl :dot11ShortRetryLimitAC3 = 8
SPI2:: drvIoctl :PA Request
SPI2:: drvIoctl :No state change!
SPI2:: drvIoctl :Fast PS Request
PhgHhalDoM2SDMA:1661:-->P1
SPI2:: drvHhalEventIndicationHandler :PS Ind (1) in Drvmain
SPI2:: drvStop :Driver Stop: disable TX queue! (usage: 2)
SPI2:: drvIoctl :Deauth BSSID: 00:1d:7e:18:e3:89
SPI2:: drvHhalEventIndicationHandler :
SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVENT_DISCONNECT; setting CARRIER_OFF
SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00000002, IFF_UP=0
SPI2:: drvHhalEventIndicationHandler :device was already closed
SPI2:: drvOpen :opening net device
SPI2:: drvOpen :ERROR: Associated, but Carrier flag is set to CARRIER_OFF
SPI2:: drvDoScan :Buero (bittorf)
SPI2:: drvProcessScanCfm :Scan Confirm: Success 1 APs
SPI2:: drvInitConnect :Req to connect to new WLAN network
SPI2:: drvInitConnect :Disabling TX queue and setting CARRIER_OFF
SPI2:: drvInitConnect :Connecting To AP...
SPI2:: drvInitConnect :step2
SPI2:: drvHhalEventIndicationHandler :
SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVNT_INIT_CONNECT; setting CARRIER_ON
SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00000003, IFF_UP=1
SPI2:: drvHhalEventIndicationHandler :device was already opened; enabling queue
SPI2:: drvInitConnect :step3 : u8Status 8
SPI2:: drvInitConnect :step4
SPI2:: drvInitConnect :Successful
SPI2:: drvInitConnect :step6
SPI2:: drvHhalEventIndicationHandler :PS Ind (1) in Drvmain
SPI2:: drvIoctl :PA Request
PhgHhalDoM2SDMA:1661:-->P0
SPI2:: drvHhalEventIndicationHandler :PS wake (0) in Drvmain
drvSetWOWFilter: Enable UNICAST: Disable ARP: Enable
Ip addr = 10.63.17.5. LMP=2
SPI2:: drvIoctl :PA Request
SPI2:: drvIoctl :No state change!
SPI2:: drvIoctl :Fast PS Request
PhgHhalDoM2SDMA:1661:-->P1
SPI2:: drvHhalEventIndicationHandler :PS Ind (1) in Drvmain
Sensor driver: initialize device OV7660
Warning: Remapping obsolete /dev/fb* minor 32 to 1
SPI2:: drvIoctl :dot11LongRetryLimitAC0 = 4
SPI2:: drvIoctl :dot11LongRetryLimitAC1 = 4
SPI2:: drvIoctl :dot11LongRetryLimitAC2 = 4
SPI2:: drvIoctl :dot11LongRetryLimitAC3 = 4
SPI2:: drvIoctl :dot11ShortRetryLimitAC0 = 4
SPI2:: drvIoctl :dot11ShortRetryLimitAC1 = 4
SPI2:: drvIoctl :dot11ShortRetryLimitAC2 = 4
SPI2:: drvIoctl :dot11ShortRetryLimitAC3 = 4
SPI2:: drvIoctl :PA Request
PhgHhalDoM2SDMA:1661:-->P0
SPI2:: drvHhalEventIndicationHandler :PS wake (0) in Drvmain
</pre>

=== Output of <tt>logread</tt> on a VP6500 ===
<pre>
Jan 1 00:00:00 imx21 syslog.info syslogd started: BusyBox v0.60.0 (2007.02.28-13:39+0000)
Jan 1 00:00:02 imx21 daemon.emerg klogd: klogd started: BusyBox v0.60.0 (2007.02.28-13:39+0000)
Jan 1 00:00:02 imx21 daemon.warn klogd: Linux version 2.4.20-celf3 (root@wbul04) (gcc version 3.3.2) #1 Wed Feb 28 13:30:26 UTC 2007
Jan 1 00:00:02 imx21 daemon.warn klogd: CPU: ARM926EJ-Sid(wb) [41069264] revision 4 (ARMv?(8))
Jan 1 00:00:02 imx21 daemon.warn klogd: CPU: D undefined 14 cache
Jan 1 00:00:02 imx21 daemon.warn klogd: CPU: I cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets
Jan 1 00:00:02 imx21 daemon.warn klogd: CPU: D cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets
Jan 1 00:00:02 imx21 daemon.warn klogd: Machine: Freescale i.MX2 ADS
Jan 1 00:00:02 imx21 daemon.warn klogd: FCLK=266000 kHz HCLK=133000 kHz IPGCLK=66500 kHz
Jan 1 00:00:02 imx21 daemon.warn klogd: PERCLKs: 1=44333 KHz 2=33250 kHz 3=44333 kHz 4=88666 kHz
Jan 1 00:00:02 imx21 daemon.warn klogd: On node 0 totalpages: 16384
Jan 1 00:00:02 imx21 daemon.warn klogd: zone(0): 16384 pages.
Jan 1 00:00:02 imx21 daemon.warn klogd: zone(1): 0 pages.
Jan 1 00:00:02 imx21 daemon.warn klogd: zone(2): 0 pages.
Jan 1 00:00:02 imx21 daemon.warn klogd: Kernel command line: root=/dev/mtdblock2 noinitrd ip=none mtdparts=s29gl512n:256k@0x00000000
(bootloader)ro,896k@0x00040000(kernel)ro,50432k@0x00120000(fs#1),12800k@0x03260000(fs#2),1152k@0x03EE0000(fs#3)
Jan 1 00:00:02 imx21 daemon.debug klogd: Relocating machine vectors to 0xffff0000
Jan 1 00:00:02 imx21 daemon.warn klogd: Console: colour dummy device 80x30
Jan 1 00:00:02 imx21 daemon.warn klogd: Calibrating delay loop (skipped)... 132.71 BogoMIPS
Jan 1 00:00:02 imx21 daemon.info klogd: Memory: 64MB = 64MB total
Jan 1 00:00:02 imx21 daemon.notice klogd: Memory: 63052KB available (1366K code, 299K data, 68K init)
Jan 1 00:00:02 imx21 daemon.info klogd: Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
Jan 1 00:00:02 imx21 daemon.info klogd: Inode cache hash table entries: 4096 (order: 3, 32768 bytes)
Jan 1 00:00:02 imx21 daemon.warn klogd: Mount-cache hash table entries: 1024 (order: 1, 8192 bytes)
Jan 1 00:00:02 imx21 daemon.warn klogd: Buffer-cache hash table entries: 4096 (order: 2, 16384 bytes)
Jan 1 00:00:02 imx21 daemon.warn klogd: Page-cache hash table entries: 16384 (order: 4, 65536 bytes)
Jan 1 00:00:02 imx21 daemon.warn klogd: POSIX conformance testing by UNIFIX
Jan 1 00:00:02 imx21 daemon.info klogd: Linux NET4.0 for Linux 2.4
Jan 1 00:00:02 imx21 daemon.info klogd: Based upon Swansea University Computer Society NET3.039
Jan 1 00:00:02 imx21 daemon.warn klogd: Initializing RT netlink socket
Jan 1 00:00:02 imx21 daemon.info klogd: apm: Simulating APM BIOS version 1.2 (Driver version 1.0)
Jan 1 00:00:02 imx21 daemon.warn klogd: i.MX21 Dynamic Power Management
Jan 1 00:00:02 imx21 daemon.warn klogd: Starting kswapd
Jan 1 00:00:02 imx21 daemon.warn klogd: Disabling the Out Of Memory Killer
Jan 1 00:00:02 imx21 daemon.notice klogd: JFFS2 version 2.1. (C) 2001, 2002 Red Hat, Inc., designed by Axis Communications AB.
Jan 1 00:00:02 imx21 daemon.info klogd: i2c-core.o: i2c core module version 2.6.2 (20011118)
Jan 1 00:00:02 imx21 daemon.warn klogd: pty: 256 Unix98 ptys configured
Jan 1 00:00:02 imx21 daemon.info klogd: Serial driver version 5.05c (2001-07-08) with no serial options enabled
Jan 1 00:00:02 imx21 daemon.warn klogd: UART driver version 0.3.6
Jan 1 00:00:02 imx21 daemon.warn klogd: RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Jan 1 00:00:02 imx21 daemon.warn klogd: I2C driver Feb 28 2007 / 13:31:04
Jan 1 00:00:02 imx21 daemon.info klogd: Initialize i2c-client-dbmx-codec module
Jan 1 00:00:02 imx21 daemon.notice klogd: s29gl512n: probing 16-bit flash bus
Jan 1 00:00:02 imx21 daemon.notice klogd: Amd/Fujitsu Extended Query Table v1.3 at 0x0040
Jan 1 00:00:02 imx21 daemon.notice klogd: number of CFI chips: 1
Jan 1 00:00:02 imx21 daemon.notice klogd: cfi_cmdset_0002: Using Write Buffer method.
Jan 1 00:00:02 imx21 daemon.notice klogd: cfi_cmdset_0002: buffer_Write_Time = 128
Jan 1 00:00:02 imx21 daemon.notice klogd: cfi_cmdset_0002: Disabling fast programming due to code brokenness.
Jan 1 00:00:02 imx21 daemon.notice klogd: Using static partition definition
Jan 1 00:00:02 imx21 daemon.notice klogd: Creating 5 MTD partitions on "s29gl512n":
Jan 1 00:00:02 imx21 daemon.notice klogd: 0x00000000-0x00040000 : "bootloader"
Jan 1 00:00:02 imx21 daemon.notice klogd: 0x00040000-0x00120000 : "kernel"
Jan 1 00:00:02 imx21 daemon.notice klogd: 0x00120000-0x03260000 : "fs #1"
Jan 1 00:00:02 imx21 daemon.notice klogd: 0x03260000-0x03ee0000 : "fs #2"
Jan 1 00:00:02 imx21 daemon.notice klogd: 0x03ee0000-0x04000000 : "fs #3"
Jan 1 00:00:02 imx21 daemon.info klogd: NET4: Linux TCP/IP 1.0 for NET4.0
Jan 1 00:00:02 imx21 daemon.info klogd: IP Protocols: ICMP, UDP, TCP
Jan 1 00:00:02 imx21 daemon.info klogd: IP: routing cache hash table of 512 buckets, 4Kbytes
Jan 1 00:00:02 imx21 daemon.info klogd: TCP: Hash tables configured (established 4096 bind 8192)
Jan 1 00:00:02 imx21 daemon.info klogd: NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
Jan 1 00:00:02 imx21 daemon.warn klogd: NetWinder Floating Point Emulator V0.95 (c) 1998-1999 Rebel.com
Jan 1 00:00:02 imx21 daemon.warn klogd: VFS: Mounted root (jffs2 filesystem).
Jan 1 00:00:02 imx21 daemon.info klogd: Freeing init memory: 68K
Jan 1 00:00:02 imx21 daemon.warn klogd: PCB version: ind3 v2
Jan 1 00:00:02 imx21 daemon.info klogd: Driver SYSTCLK: SYSTCLK-1.12 (REFERENCED)
Jan 1 00:00:02 imx21 daemon.info klogd: Driver GPIO-1.59 (REFERENCED)
Jan 1 00:00:02 imx21 daemon.warn klogd: ****p_gpio_init_low_bat****
Jan 1 00:00:02 imx21 daemon.warn klogd: GPIO: p_gpio_it_init at 762
Jan 1 00:00:02 imx21 daemon.info klogd: Driver FRAMEBUF-1.12 (REFERENCED)
Jan 1 00:00:02 imx21 daemon.info klogd: Driver SPI-1.20 (REFERENCED) Debug level 3
Jan 1 00:00:02 imx21 daemon.warn klogd:
Jan 1 00:00:02 imx21 daemon.warn klogd: u32_spi1_MinLenghtForDMAInTX set to 300
Jan 1 00:00:02 imx21 daemon.warn klogd:
Jan 1 00:00:02 imx21 daemon.warn klogd: u32_spi1_MinLenghtForDMAInRX set to 300
Jan 1 00:00:02 imx21 daemon.info klogd: Driver LCD-1.20 (REFERENCED)
Jan 1 00:00:02 imx21 daemon.info klogd: Driver TVLINK-1.45 (REFERENCED)
Jan 1 00:00:02 imx21 daemon.warn klogd: Reset from Software Reset.
Jan 1 00:00:02 imx21 daemon.warn klogd: Motorola PostProcessor Linux driver ver 0.64 - Copyright (C) 2003 Motorola Inc
Jan 1 00:00:02 imx21 daemon.warn klogd: pp: hw ver = 2
Jan 1 00:00:03 imx21 daemon.warn klogd: prp_dbg=0
Jan 1 00:00:03 imx21 daemon.warn klogd: Motorola PreProcessor Linux driver ver 0.0 - Copyright (C) 2003 Motorola Inc
Jan 1 00:00:03 imx21 daemon.info klogd: hmp4d: base_port=0x10026800 irq=50
Jan 1 00:00:03 imx21 daemon.info klogd: hmp4d: module inserted
Jan 1 00:00:03 imx21 daemon.info klogd: hmp4e: base_port=0x10026c00 irq=49
Jan 1 00:00:03 imx21 daemon.info klogd: hmp4e: Compatble HW found with ID: 0x004c1882
Jan 1 00:00:03 imx21 daemon.info klogd: hmp4e: module inserted. Major = 249
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drv_Init :PID of driver: 134
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drv_Init :ScanList not provided. Will use the default scan list.
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drv_Init :ScanChannelList :1 6 11 14 2 7 12 3 8 13 4 9 5 10
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drv_Init :setting PhyType to: Rf-to-Rf
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drv_Init :Ref.Clock parameter not provided
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drv_Init :Configure target for a reference clock of 'default=40' Mhz.
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drvRegEtherDev :Interface Name is: eth%d
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drv_Init :HEOCSIWPOWON: Powering on...
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drvPhase2Init :Protocol Firmware will be loaded by driver ...
Jan 1 00:00:03 imx21 daemon.err klogd: SPI2:: drvPhase2Init :Initializing HHAL (PhgHhalInitialize)...
Jan 1 00:00:03 imx21 daemon.warn klogd: Divider : 8
Jan 1 00:00:03 imx21 daemon.warn klogd:
Jan 1 00:00:03 imx21 daemon.warn klogd: OCR2 : e4015308 (12582912)
Jan 1 00:00:03 imx21 daemon.warn klogd: Reset : 3 / 27 (c497cc00 / e401531c)
Jan 1 00:00:03 imx21 daemon.warn klogd:
Jan 1 00:00:03 imx21 daemon.warn klogd: Reset : 3 / 27 (c497cc00 / e401531c)
Jan 1 00:00:03 imx21 daemon.warn klogd: GPIO: p_gpio_init_gpio_status at 1262
Jan 1 00:00:03 imx21 daemon.warn klogd: GPIO: POWER_FAIL signal NOT detected at GPIO driver init carry on !!!
Jan 1 00:00:03 imx21 daemon.warn klogd: GPIO: CHARGE_IN at init
Jan 1 00:00:03 imx21 daemon.warn klogd: GPIO: LOW_BAT_OUT at init
Jan 1 00:00:03 imx21 daemon.warn klogd: GPIO: No accessory plugged at init. - Set Video on jack
Jan 1 00:00:03 imx21 daemon.warn klogd: GPIO: camera to front at init
Jan 1 00:00:04 imx21 daemon.warn klogd: **ChargeStatusPmb=========gpio_Read_ChargeStatus_Ready=1
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: PhgOsal_linux_init_thread :assigning thread name and deamonize() ..
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvPhase2Init :Success
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvPhase2Init : registering callbacks with HHAL..
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVNT_INIT_COMPLETE; setting CARRIER_ON
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvPhase2Init :calling PhgHhalQueueMgmtReq()!
Jan 1 00:00:06 imx21 daemon.info klogd: PhgHhalQueueMgmtReq:1172:HHAL got Init message
Jan 1 00:00:06 imx21 daemon.info klogd: PhgHhalQueueMgmtReq:1217:HHAL done Init message
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PS wake (0) in Drvmain
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVENT_DISCONNECT; setting CARRIER_OFF
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00001002, IFF_UP=0
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :device was already closed
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvMgmtCfmHndler :Using MAC Address: 00:08:c6:86:8b:99
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvPhase2Init :init etherdev; stopping queue, setting CARRIER_OFF
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drv_Init :Philips WLAN Drv - loaded - in state: 1
Jan 1 00:00:06 imx21 daemon.err klogd: SPI2:: drvInit :Philips WLAN Drv - loaded
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvOpen :opening net device
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvOpen :Device is not associated!
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvOpen :Carrier flag is already set to CARRIER_OFF
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvOpen :Disabling again netqueue
Jan 1 00:00:07 imx21 daemon.warn klogd: requested reg.domain code setting = 3
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvIoctl :set u8LinkAdaptation : 1 Result=[0]
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvIoctl :changed HEOCSIWLNADPALLOWRATES: 8 allowed rate codes
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvInitConnect :Req to connect to new WLAN network
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvInitConnect :Disabling TX queue and setting CARRIER_OFF
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvInitConnect :Connecting To AP...
Jan 1 00:00:07 imx21 daemon.err klogd: SPI2:: drvInitConnect :step2
Jan 1 00:00:10 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :
Jan 1 00:00:10 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVENT_DISCONNECT; setting CARRIER_OFF
Jan 1 00:00:10 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00001003, IFF_UP=1
Jan 1 00:00:10 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :device was already opened; stopping queue
Jan 1 00:00:11 imx21 daemon.err klogd: SPI2:: drvInitConnect :step3 : u8Status 255
Jan 1 00:00:11 imx21 daemon.err klogd: SPI2:: drvInitConnect :step5
Jan 1 00:00:11 imx21 daemon.err klogd: SPI2:: drvInitConnect :TIMEDOUT
Jan 1 00:00:11 imx21 daemon.err klogd: SPI2:: drvInitConnect :step6
Jan 1 00:00:11 imx21 daemon.err klogd: SPI2:: drvInitParamsAndPowerOnAndConnect :Connect failed!
Jan 1 00:00:12 imx21 daemon.warn klogd: Motorola CSI Linux driver ver 0.1
Jan 1 00:00:12 imx21 daemon.warn klogd: - Copyright (C) 2004 Motorola Inc
Jan 1 00:00:12 imx21 daemon.warn klogd:
Jan 1 00:00:12 imx21 daemon.info klogd: Driver SENSOR-1.29 (REFERENCED)
Jan 1 00:00:12 imx21 daemon.info klogd: i2c-client version : 1.9
Jan 1 00:00:12 imx21 daemon.info klogd: Initialize i2c-client-aic14 module
Jan 1 00:00:12 imx21 daemon.info klogd: Module i2c-client-aic14 initialized
Jan 1 00:00:12 imx21 daemon.alert klogd: Insert module aic14 (AIC14-1.0)
Jan 1 00:00:12 imx21 daemon.warn klogd: Module AIC14 assumes CODEC MCLK already configured for 20480000Hz
Jan 1 00:00:12 imx21 daemon.info klogd: Driver KPP-1.36 (REFERENCED)
Jan 1 00:00:13 imx21 daemon.info klogd: Driver DOZE-1.27 (REFERENCED)
Jan 1 00:00:14 imx21 daemon.err klogd: SPI2:: drvDoScan :Buero (bittorf)
Jan 1 00:00:15 imx21 daemon.err klogd: SPI2:: drvProcessScanCfm :Scan Confirm: Success 1 APs
Jan 1 00:00:15 imx21 daemon.info netsyncd[315]: creating FIFO_NETSYNC_HMON_NAME...
Jan 1 00:00:15 imx21 daemon.info netsyncd[315]: creating FIFO_HMON_NETSYNC_NAME...
Jan 1 00:00:18 imx21 daemon.err klogd: SPI2:: drvInitConnect :Req to connect to new WLAN network
Jan 1 00:00:18 imx21 daemon.err klogd: SPI2:: drvInitConnect :Disabling TX queue and setting CARRIER_OFF
Jan 1 00:00:18 imx21 daemon.err klogd: SPI2:: drvInitConnect :Connecting To AP...
Jan 1 00:00:18 imx21 daemon.err klogd: SPI2:: drvInitConnect :step2
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVNT_INIT_CONNECT; setting CARRIER_ON
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00001003, IFF_UP=1
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :device was already opened; enabling queue
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvInitConnect :step3 : u8Status 8
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvInitConnect :step4
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvInitConnect :Successful
Jan 1 00:00:19 imx21 daemon.err klogd: SPI2:: drvInitConnect :step6
Jan 1 00:00:19 imx21 daemon.err modprobe: modprobe: Can't locate module sound-slot-0
Jan 1 00:00:19 imx21 daemon.err modprobe: modprobe: Can't locate module sound-service-0-0
Jan 1 00:00:20 imx21 local0.debug dhcpcd[337]: broadcasting DHCP_DISCOVER
Jan 1 00:00:23 imx21 local0.debug dhcpcd[337]: DHCP_OFFER received from (10.63.17.1)
Jan 1 00:00:23 imx21 local0.debug dhcpcd[337]: DHCP_ACK received from (10.63.17.1)
Jan 1 00:00:23 imx21 daemon.err netsyncd[314]: father received(10) eth0 up!
Jan 1 00:00:23 imx21 daemon.info netsyncd[314]: Dhcp_start 337 return : 0
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC0 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC1 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC2 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC3 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC0 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC1 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC2 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC3 = 8
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :PA Request
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :No state change!
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvIoctl :Fast PS Request
Jan 1 00:00:26 imx21 daemon.info klogd: PhgHhalDoM2SDMA:1661:-->P1
Jan 1 00:00:26 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PS Ind (1) in Drvmain
Jan 1 00:00:28 imx21 daemon.info netsyncd[314]: Dhcp_stop 359 return : 0
Jan 1 00:00:28 imx21 local0.debug dhcpcd[347]: sending DHCP_RELEASE for 10.63.17.5 to 10.63.17.1
Jan 1 00:00:29 imx21 local0.err dhcpcd[347]: terminating on signal 1
Jan 1 00:00:29 imx21 daemon.err netsyncd[314]: father received(12) eth0 down!
Jan 1 00:00:29 imx21 daemon.err klogd: SPI2:: drvStop :Driver Stop: disable TX queue! (usage: 2)
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvIoctl :Deauth BSSID: 00:1d:7e:18:e3:89
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVENT_DISCONNECT; setting CARRIER_OFF
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00000002, IFF_UP=0
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :device was already closed
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvOpen :opening net device
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvOpen :ERROR: Associated, but Carrier flag is set to CARRIER_OFF
Jan 1 00:00:41 imx21 daemon.err klogd: SPI2:: drvDoScan :Buero (bittorf)
Jan 1 00:00:42 imx21 daemon.err klogd: SPI2:: drvProcessScanCfm :Scan Confirm: Success 1 APs
Jan 1 00:00:45 imx21 daemon.err klogd: SPI2:: drvInitConnect :Req to connect to new WLAN network
Jan 1 00:00:45 imx21 daemon.err klogd: SPI2:: drvInitConnect :Disabling TX queue and setting CARRIER_OFF
Jan 1 00:00:45 imx21 daemon.err klogd: SPI2:: drvInitConnect :Connecting To AP...
Jan 1 00:00:45 imx21 daemon.err klogd: SPI2:: drvInitConnect :step2
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PHGHHAL_EVNT_INIT_CONNECT; setting CARRIER_ON
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :netdev->flags=0x00000003, IFF_UP=1
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :device was already opened; enabling queue
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvInitConnect :step3 : u8Status 8
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvInitConnect :step4
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvInitConnect :Successful
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvInitConnect :step6
Jan 1 00:00:46 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PS Ind (1) in Drvmain
Jan 1 00:00:54 imx21 local0.debug dhcpcd[386]: broadcasting DHCP_DISCOVER
Jan 1 00:00:57 imx21 local0.debug dhcpcd[386]: DHCP_OFFER received from (10.63.17.1)
Jan 1 00:00:57 imx21 local0.debug dhcpcd[386]: DHCP_ACK received from (10.63.17.1)
Jan 1 00:00:57 imx21 daemon.info netsyncd[314]: Dhcp_start 386 return : 0
Jan 1 00:00:57 imx21 daemon.info netsyncd[314]: NTP server request on : ntp.xs4all.nl
Jan 1 00:00:57 imx21 daemon.err klogd: SPI2:: drvIoctl :PA Request
Jan 1 00:00:57 imx21 daemon.info klogd: PhgHhalDoM2SDMA:1661:-->P0
Jan 1 00:00:57 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PS wake (0) in Drvmain
Jan 1 00:00:57 imx21 daemon.warn klogd: drvSetWOWFilter: Enable UNICAST: Disable ARP: Enable
Jan 1 00:00:57 imx21 daemon.warn klogd: Ip addr = 10.63.17.5. LMP=2
Jan 1 00:00:57 imx21 daemon.err klogd: SPI2:: drvIoctl :PA Request
Jan 1 00:00:57 imx21 daemon.err klogd: SPI2:: drvIoctl :No state change!
Jan 1 00:00:57 imx21 daemon.err klogd: SPI2:: drvIoctl :Fast PS Request
Jan 1 00:00:57 imx21 daemon.info klogd: PhgHhalDoM2SDMA:1661:-->P1
Jan 1 00:00:57 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PS Ind (1) in Drvmain
Jan 1 00:00:57 imx21 daemon.warn klogd: Sensor driver: initialize device OV7660
Jan 1 00:00:57 imx21 daemon.warn klogd: Warning: Remapping obsolete /dev/fb* minor 32 to 1
Apr 7 09:58:16 imx21 daemon.info netsyncd[314]: NTP process return code : 0
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC0 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC1 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC2 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC3 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC0 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC1 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC2 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC3 = 4
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvIoctl :PA Request
Apr 7 09:58:19 imx21 daemon.info klogd: PhgHhalDoM2SDMA:1661:-->P0
Apr 7 09:58:19 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PS wake (0) in Drvmain
Apr 7 09:58:19 imx21 daemon.info upgraded[312]: K_SW0_DWNLD_ACK
Apr 7 09:58:27 imx21 auth.info login[393]: root login on `ttyp0' from `bittorf-AP.olsr'
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC0 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC1 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC2 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11LongRetryLimitAC3 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC0 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC1 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC2 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :dot11ShortRetryLimitAC3 = 8
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :PA Request
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :No state change!
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvIoctl :Fast PS Request
Apr 7 09:59:50 imx21 daemon.info klogd: PhgHhalDoM2SDMA:1661:-->P1
Apr 7 09:59:50 imx21 daemon.err klogd: SPI2:: drvHhalEventIndicationHandler :PS Ind (1) in Drvmain
Apr 7 10:14:53 imx21 auth.info login[408]: root login on `ttyp1' from `bittorf-AP.olsr'
Apr 7 10:17:18 imx21 syslog.info -- MARK --
</pre>

=== Output of <tt>/proc/cpuinfo</tt> on a VP6500 ===
<pre>
# cat /proc/cpuinfo
Processor : ARM926EJ-Sid(wb) rev 4 (v5EJl)
BogoMIPS : 133.01
Features : swp half thumb fastmult
CPU implementer : 0x41
CPU architecture: ?(8)
CPU variant : 0x0
CPU part : 0x926
CPU revision : 4
Cache type : undefined 14
Cache clean : undefined 14
Cache lockdown : undefined 14
Cache unified : Harvard
I size : 16384
I assoc : 4
I line length : 32
I sets : 128
D size : 16384
D assoc : 4
D line length : 32
D sets : 128

Hardware : Freescale i.MX2 ADS
Revision : 0000
Serial : 0000000000000000

</pre>

== Software modifications ==

=== Startup scripts ===
The scripts in /etc/rc.d/ must run in the background. If a script does not, the boot process stops at that point.

In that case the next step is to heat up the soldering iron and make the serial interface ([[#UART]]) accessible.
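
A minimal sketch of what "running in the background" can look like for such a script. The function name my_worker and the log file are hypothetical and not part of the firmware; the only point is the trailing & that lets the script return immediately.

```shell
#!/bin/sh
# Hypothetical startup script; my_worker stands in for the real work.

my_worker() {
    sleep 1
    echo "worker done" > "${TMPDIR:-/tmp}/my_worker.log"
}

# The trailing "&" puts the job into the background, so this script
# returns immediately and the boot process can continue.
my_worker &
WORKER_PID=$!

echo "worker started in background"
```

Without the &, the script would only return once my_worker has finished, which for a daemon-like task means never.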

=== Enabling WPA2 support ===
By default the phone only supports WPA, although it can also do WPA with AES encryption.
There are two different tricks for enabling WPA2 as well.
However, some users have observed a drop in connection speed (which apparently only occurs when the battery charge is low).

==== Trick1 ====
Start the text editor with
vi /etc/marvell/wpa_supplicant.conf
Use PageDown to go to the end of the file.
Comment out the lines
proto=WPA
pairwise=TKIP
group=TKIP
by putting a # in front of each of them:
* move the cursor to the beginning of a line
* press i to insert
* type #
* [Esc]
Once all lines are commented out, save and leave the editor with
:wq[enter]

Then reboot.
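
The same edit can be sketched non-interactively with sed. This is only an illustration under assumptions: it works on a copy of the file and prints the result to standard output, because it is not known here whether the sed on the phone supports in-place editing. The sample network block is made up.

```shell
# Prints the given file with the three WPA-only lines commented out.
# $1: path to a copy of wpa_supplicant.conf
comment_out_wpa_lines() {
    sed -e 's/^[[:space:]]*proto=WPA[[:space:]]*$/#&/' \
        -e 's/^[[:space:]]*pairwise=TKIP[[:space:]]*$/#&/' \
        -e 's/^[[:space:]]*group=TKIP[[:space:]]*$/#&/' \
        "$1"
}

# Example on a throw-away file (contents are an assumption):
sample="${TMPDIR:-/tmp}/wpa_supplicant.sample"
printf 'network={\n\tssid="example"\n\tproto=WPA\n\tpairwise=TKIP\n\tgroup=TKIP\n}\n' > "$sample"
comment_out_wpa_lines "$sample"
```

Compare the output with the original before copying anything back, since a broken wpa_supplicant.conf locks you out of the phone.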

Note: with this change, when you try to register the phone in a pure WPA2 network, only WEP can be selected as encryption, so the connection attempt naturally fails! (Hardware: Fritz!Box 7270, PHILIPS VP5500)

So set the access point to WPA2 + WPA, register the phone, then configure the access point for WPA2.

'''Warning!'''

'''Playing around with wpa_supplicant.conf very quickly ends with locking yourself out.'''

'''To avoid having to revive the phone via the serial console, it is best to only ever append entries to the end of wpa_supplicant.conf and never insert them at the beginning!'''

==== Trick2 ====
This trick works with access points that can also encrypt WPA connections with AES. Its advantage is that wpa_supplicant.conf does not have to be edited by hand. However, not every access point supports WPA with AES (dd-wrt does).

First configure the access point for WPA PSK and select AES as the encryption algorithm. Now register the phone; it correctly detects a WPA connection and connects to the access point using WPA and AES. Then reconfigure the access point from WPA PSK AES to WPA2 PSK AES and switch the VP5500/6500 off and on again. Once it has finished booting, it connects automatically using WPA2 PSK with AES encryption. Done.

==== Recovering from an accidental lockout after a WPA2 configuration attempt ====

'''1. DON'T PANIC!'''

If you have cut off your wireless access, there is another way to get back onto the phone besides building a serial adapter and soldering:

'''demo mode!'''

This mode was intended for trying out the functionality of the VPs without a SIP server. Two devices start in WLAN ad-hoc mode with different IPs (192.168.10.1 and 192.168.10.2, each /24 = 255.255.255.0) and set up an IBSS network with WEP encryption (key: VP6500 = 5648751265, VP5500 = 7295569793).

A single phone can also be put into demo mode (via the new menu entry available after rooting, or with the key combination "*#3"). Give the computer matching WLAN and IP settings and you can connect to the phone again and correct the faulty settings. Unfortunately a random IBSS cell ID is used, but newer operating systems can join the cell anyway. Use ''demo_mode_obiwan'' as the ESSID.

===== Step-by-step instructions for Linux =====

Settings

'''on the phone'''
* Switch the phone on and put it into demo mode via the menu entry or the key combination "*#3" (e.g. with the setting Handset 1)
-> the IP of the phone then becomes 192.168.10.1

'''on the computer'''
* Open a console
* Use ifconfig to check which device is the WLAN device of the computer (called "wlan0" from here on)
ifconfig wlan0 down
iwconfig wlan0 mode ad-hoc                 # enable ad-hoc mode
iwconfig wlan0 essid 'demo_mode_obiwan'    # set the matching ESSID
iwconfig wlan0 key 5648751265              # key for the VP6500
ifconfig wlan0 up
ifconfig wlan0 192.168.10.2                # set the IP

You can now use

ping 192.168.10.1

to test whether everything worked and you get a reply. If so, you are '''done!'''

You can now connect to the IP 192.168.10.1 via telnet or ssh with the usual user name and password and fix the problems.
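
Since the WEP key differs between the two models, the command sequence above can be wrapped in a small helper that picks the key. The following is a sketch only: the function name demo_mode_commands is made up, and it merely prints the commands (a dry run) instead of executing them, so it can be tried without root rights.

```shell
# Prints the commands for joining the demo-mode network.
# $1: phone model (VP6500 or VP5500), $2: WLAN interface, e.g. wlan0
demo_mode_commands() {
    case "$1" in
        VP6500) key=5648751265 ;;
        VP5500) key=7295569793 ;;
        *) echo "unknown model: $1" >&2; return 1 ;;
    esac
    cat <<EOF
ifconfig $2 down
iwconfig $2 mode ad-hoc
iwconfig $2 essid 'demo_mode_obiwan'
iwconfig $2 key $key
ifconfig $2 up
ifconfig $2 192.168.10.2
EOF
}

demo_mode_commands VP5500 wlan0
```

To actually apply the settings, the output can be reviewed and then piped into a root shell.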

=== Menus ===
==== Main menu ====

The file
/usr/local/etc/defaultbuttons.conf
contains, among other things, the definition of the main menu.

The Menu section is of particular interest here.
[Menu]
1 = Applications/camera.desktop
2 = Applications/callhistory.desktop
3 = Applications
4 = Settings
5 = Applications/addressbook.desktop
6 = Settings/RingProfiles.desktop
Columns = 3
Default = 5
Map = 123456789*0#
Rows = 2

'Rows' and 'Columns' specify how many rows and columns the main menu has.
The assignments 1 to (Columns * Rows) map menu items to the menu positions.
The menu items are defined in the directories below
/usr/local/apps
If only a directory is given, a submenu appears whose icon and name are defined in the .directory file of that folder.
Direct menu items have file names ending in .desktop
'Default' determines the preselected entry.

Here is another example of a customized menu:
[Menu]
1 = Applications/addressbook.desktop
2 = Applications/callhistory.desktop
3 = Applications/sysinfo.desktop
4 = Applications
5 = Settings
6 = Games
7 = Applications/camera.desktop
8 = Applications/photoedit.desktop
9 = Applications/manualsub.desktop
Columns = 3
Default = 5
Map = 123456789*0#
Rows = 3

The Games folder is (currently ;) empty.
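
The rule that positions 1 to Columns * Rows have to be assigned can be checked before rebooting. The following is a sketch under assumptions: the function name check_menu is made up, and it is meant to be run on a copy of defaultbuttons.conf on a PC with a standard awk, not necessarily on the phone.

```shell
# Checks that the [Menu] section assigns every position 1..Rows*Columns.
# $1: path to a copy of defaultbuttons.conf
check_menu() {
    awk -F'[ \t]*=[ \t]*' '
        /^\[/ { in_menu = ($0 == "[Menu]"); next }   # track current section
        in_menu && $1 == "Rows"    { rows = $2 }
        in_menu && $1 == "Columns" { cols = $2 }
        in_menu && $1 ~ /^[0-9]+$/ { seen[$1] = 1 }  # numbered menu position
        END {
            n = rows * cols
            for (i = 1; i <= n; i++)
                if (!(i in seen)) { print "missing entry " i; bad = 1 }
            if (!bad) print "ok: " n " entries"
            exit bad + 0
        }' "$1"
}
```

For the 3 x 3 example above, check_menu reports "ok: 9 entries"; if Rows is raised without adding positions, it lists the missing ones.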

==== General structure of menu entries ====
The menu entries are stored in the file system:

* Settings: /usr/local/apps/Settings
* Applications: /usr/local/apps/Applications
* Ringtones: /usr/local/etc/SystemRingTones/

The files have the extension ".desktop" and are plain text files containing the relevant information.

An example from the Settings folder:
[Translation]
File=QtopiaSettings
Context=Sound
[Desktop Entry]
Type=Application
Exec=sound
Icon=Sound
Name[]=Sound
CanFastload=0

The 'Translation' section specifies which file contains the localization data.

The 'Desktop Entry' section:
* Type: type of the entry
** Application for applications
** audio/x-wav for ringtones
* Exec: the application to execute
* Icon: the icon used in the menu.
** The path is usually: /usr/local/pics/[Exec]/[Icon].png
** but sometimes also: /usr/local/pics/icons/[14x14|16x16|22x22]/[Icon].png
* Name[]: name in the menu, resolved via the file and context given in Translation. This lookup is skipped if the brackets are omitted, which makes it possible to add custom entries

Here is a text file with all desktop files for reference: [[File:alleDesktopFiles.txt]]
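
As a sketch of the last point, the following helper prints a minimal entry that uses Name without brackets, so the text is shown literally. The function name make_desktop_entry and the program name myscript are hypothetical; whether the phone accepts an entry without a [Translation] section in every case has not been verified here.

```shell
# Prints a .desktop entry whose Name is used literally (no brackets,
# hence no translation lookup).
# $1: menu name, $2: program to execute, $3: icon name
make_desktop_entry() {
    printf '[Desktop Entry]\nType=Application\nExec=%s\nIcon=%s\nName=%s\n' "$2" "$3" "$1"
}

# "myscript" is a made-up program name; Camera is an icon name that
# also appears in other entries in this article.
make_desktop_entry "My Script" myscript Camera
```

The output would be saved as a .desktop file in one of the folders listed above.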

==== Hidden menu entries ====
The folders mentioned above contain a few files with the extension '.desktopMASK'.
Renaming them enables the entries in the menu after a restart.

Enter the following on the command line:
cd /usr/local/apps/Settings
mv datetime.desktopMASK datetime.desktop
mv callforward.desktopMASK callforward.desktop
mv calloptions.desktopMASK calloptions.desktop
mv resetparam.desktopMASK resetparam.desktop
mv subkpncode.desktopMASK subkpncode.desktop

This enables the following options:
* Date/time setting
* Call forwarding
* Call options
* Reset parameters
* Number Switch

The first entries are probably the most useful.
It is unclear what the "Number Switch" is supposed to do. It asks for a code when started.
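
Instead of renaming each file by hand, all masked entries of a folder can be renamed in a loop. This is a sketch; the function name unmask_entries is made up, and it enables every .desktopMASK file in the given directory, including ones not listed above.

```shell
# Renames every *.desktopMASK file in a directory to *.desktop.
# $1: directory, e.g. /usr/local/apps/Settings
unmask_entries() {
    for f in "$1"/*.desktopMASK; do
        [ -e "$f" ] || continue      # no match: the pattern stays literal
        mv "$f" "${f%MASK}"          # strip the trailing "MASK"
    done
}
```

On the phone this would be called as unmask_entries /usr/local/apps/Settings, followed by a restart.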

In addition, a [[File:demomode.desktop]] can be created in the directory /usr/local/apps/Applications.

This requires the following procedure:
* in the terminal
cd /usr/local/apps/Applications
vi demomode.desktop

* press i
* insert the following text fragment
[Translation]
File=QtopiaApplications
Context=DemoMode
[Desktop Entry]
Exec=demomode
Icon=Camera
Type=Application
Name[]=DemoMode
* [Esc]
* :wq [Enter]

This enables a demo mode.

===== French =====
The devices were apparently also sold in France by France Télécom.
At least there are corresponding monitor and upgrade reference files
that say so. This probably also explains the French
language files, which are on the device but in a
subdirectory, so that they cannot be selected.

To activate these files:

cd /usr/local/i18n
mv NOTUSED/fr .

French is now also available as a language.

==== Custom menu entries ====

It is possible to create menu entries that run shell scripts. The procedure is described below, using switching SSH on and off as an example.

===== Enabling & disabling SSH =====

*'''Warning! The following procedure can brick the device if Telnet is disabled and something goes wrong with SSH!'''
*The following two files represent the menu entries and call ''enablessh'' and ''disablessh'' respectively.
:*''/usr/local/apps/Applications/enablessh.desktop'':
[Translation]
File=QtopiaApplications
Context=enablessh
[Desktop Entry]
Exec=enablessh
Icon=Camera
Type=Application
Name[]=Enable SSH
:*''/usr/local/apps/Applications/disablessh.desktop'':
[Translation]
File=QtopiaApplications
Context=disablessh
[Desktop Entry]
Exec=disablessh
Icon=Camera
Type=Application
Name[]=Disable SSH
*''enablessh'' and ''disablessh'' are shell scripts located in ''/usr/local/bin/'' with the following contents:
:*''/usr/local/bin/enablessh'':
#!/bin/sh
touch /var/log/lastlog
/etc/rc.d/init.d/S99dropbear restart
exit 0
:*''/usr/local/bin/disablessh'':
#!/bin/sh
/etc/rc.d/init.d/S99dropbear stop
exit 0
*After restarting the phone, the two new menu items should appear in the ''Applications'' menu (''Anwendungen'' in the German UI).

=== Graphical customizations ===
Almost all graphics are located in the directory /usr/local/pics and can be replaced at will (same file type, same size).

Some particularly interesting ones are listed here:

===== Custom startup/shutdown animation =====

The animations shown at startup and shutdown are ordinary (animated) GIFs. They can be found in

/usr/local/pics/qpe

The links "splash.gif" and "goodbye.gif" point to the files actually used ("splash-chuck.gif", "goodby-chuck.gif").
You can upload your own animated GIF in 176x220 pixel format and relink
splash.gif accordingly. To do so, first delete the old link
with:

rm /usr/local/pics/qpe/splash.gif

Then create the link again so that it points to your
own file:

ln -s /path/to/yourfile.gif /usr/local/pics/qpe/splash.gif

After that, your own animation is shown on every restart.
The same applies to goodbye.gif.

So, put the image on a web server. Then log in to the device via telnet
and:

<syntaxhighlight lang="bash">
cd /usr/local/pics/qpe/
wget http://dein.server/woauchimmer/Matrix5.gif
rm /usr/local/pics/qpe/splash.gif
ln -s Matrix5.gif splash.gif
</syntaxhighlight>

The GIF animation can be up to 176x220 pixels in size.
Smaller (and possibly larger) images are centered.
<gallery>
Bild:walking_baby_tux.gif
Bild:custwakeup2.gif
Bild:custgoodbye2.gif
Bild:94vw4.gif
Bild:1_Matrix--16984.gif
Bild:1_Matrix--16985.gif
Bild:3.gif
Bild:ClanSpider2.gif
Bild:TS12.gif
Bild:Matrix5.gif
Bild:qtopia3.gif
Bild:qtopia0.gif
</gallery>

===== Custom boot/update screen =====

The boot screen and the update screen are stored as raw data.
/user_data/data/welcome.rgb565
/usr/local/startup_V4.20/update.bin

Further example images:
/user_data/prod/data/lcd_test_card1.bin
/user_data/prod/data/lcd_test_card2.bin

These are written directly to the framebuffer.
They must therefore adhere exactly to a specific format:
* Size 240 x 220 (the display is 176 x 220, so the rest is not visible)
* 16 bits per pixel, RGB565

The following steps are needed to create such an image:
# Create a 176x220 image in Gimp
# Flip the image vertically
# Image -> Canvas Size: extend to 240x220 (image data all the way to the left)
# File -> Save a Copy...
# As Windows bitmap -> advanced options -> 16 bit R5G6B5
# Remove the BMP header (this can also be done on the phone)
tail -c 105600 input.bmp > output.raw

The image can also be written to the framebuffer manually from the command line.
cat /user_data/prod/data/lcd_test_card1.bin > /dev/fb0
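
The number 105600 used with tail above is simply the size of one raw frame: 240 x 220 pixels at 2 bytes each. The following sketch derives it and wraps the header removal in a helper with a size check. The function name strip_bmp_header is made up, and the sketch assumes the pixel data sits at the very end of the BMP file.

```shell
# Size in bytes of one raw frame: width * height * bytes per pixel.
WIDTH=240
HEIGHT=220
BYTES_PER_PIXEL=2            # RGB565 = 16 bits
FRAME_BYTES=$((WIDTH * HEIGHT * BYTES_PER_PIXEL))
echo "$FRAME_BYTES"          # 105600

# Keeps only the last FRAME_BYTES bytes of a BMP, i.e. drops the header.
# $1: input BMP, $2: output raw file
strip_bmp_header() {
    size=$(wc -c < "$1")
    if [ "$size" -lt "$FRAME_BYTES" ]; then
        echo "$1 is too small for a ${WIDTH}x${HEIGHT} RGB565 image" >&2
        return 1
    fi
    tail -c "$FRAME_BYTES" "$1" > "$2"
}
```

The size check catches the common mistake of exporting the 176x220 image without first extending the canvas to 240x220.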

=== Important directories ===

==== Address data ====

# cat /user_data/home/Applications/addressbook/addressbook.xml

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Addressbook ><AddressBook>
<Groups>
</Groups>
<Contacts>
<Contact
Uid="-1269720893"
Categories="-1269431263;-1266109093;-1266109094"
FirstName="ich"
FileAs="ich"
JobTitle="cc"
Department="dep"
Company="aa"
BusinessPhone="55"
BusinessFax="77"
BusinessMobile="66"
HomePhone="55"
HomeMobile="11"
HomePc="12"
HomeData="13"
HomeFax="14"
BusinessStreet="street"
BusinessCity="city"
BusinessState="state"
BusinessZip="zip"
BusinessCountry="country"
BusinessPager="88"
Office="office"
Profession="prof"
Assistant="ass"
Manager="man"
HomeStreet="ptjml"
HomeCity="cit"
HomeState="stat"
HomeZip="zi"
HomeCountry="coun"
Spouse="spouse"
Gender="1"
Birthday="20100322"
Anniversary="20100429"
Children="child"
Notes="gakm"
CompanyPronunciation="bb"
BUSINESS_CONTACT=""
photofile="ci-1269721575-0.jpg"
qdl-private-data=""
tone="/usr/local/etc/SystemRingTones/16-Tetris.desktop"
/>
<Contact Uid="-1269554029"
FirstName="VoIP"
LastName="Phone1"
FileAs="VoIP Phone1"
HomeMobile="**621"
tone="/usr/local/etc/SystemRingTones/15-Techno2.desktop" />
<Contact Uid="-1269554032"
Categories="-1269431263"
FirstName="VoIP"
LastName="Phone3"
FileAs="VoIP Phone3"
HomeMobile="**623"
BUSINESS_CONTACT=""
qdl-private-data=""
tone="/usr/local/etc/SystemRingTones/08-Celtrelax.desktop" />
</Contacts>
</AddressBook>

The Uid apparently does not matter as long as no value occurs twice.

The limit of 500 address book entries can also be lifted via
"maxEntries = 500" in "Contacts.conf"

The categories are stored in /user_settings/Categories.xml:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE CategoryList>
<Categories>
<Category id="-1269431263" name="_Personal" />
<Category id="-1269431262" name="_Business" />
</Categories>

When a contact is assigned a category, the ID of that category is entered in the "Categories" attribute of the contact (see above). If the BUSINESS category is chosen, the "BUSINESS_CONTACT" attribute contains a "1".

Note that indenting the attributes with TABs in these files is not allowed!

The software of the phone re-reads the XML file on every access. It is therefore possible to change the file during operation (e.g. with a script).

To convert vCards into the XML format, the following small C program is available in the forum: [http://www.mikrocontroller.net/attachment/74583/VCardsToXML.c] [http://www.mikrocontroller.net/attachment/highlight/74583]
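
Changing the address book by script can be sketched as follows. The helper add_contact is hypothetical: it prints a copy of the file with one new contact inserted before the closing Contacts tag, refuses a Uid that is already present, and writes the attributes without any indentation. Only a few of the attributes shown above are filled in.

```shell
# Prints the address book with one additional contact.
# $1: XML file, $2: Uid, $3: first name, $4: last name, $5: number
add_contact() {
    if grep -q "Uid=\"$2\"" "$1"; then
        echo "Uid $2 already present" >&2
        return 1
    fi
    awk -v uid="$2" -v first="$3" -v last="$4" -v num="$5" '
        /<\/Contacts>/ {
            # new entry goes right before the closing tag, no TABs
            printf "<Contact Uid=\"%s\"\nFirstName=\"%s\"\nLastName=\"%s\"\nFileAs=\"%s %s\"\nHomeMobile=\"%s\" />\n", uid, first, last, first, last, num
        }
        { print }' "$1"
}
```

A call such as add_contact addressbook.xml -1269554033 VoIP Phone4 '**624' > addressbook.new leaves the original untouched; the Uid and number in this call are made-up values.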

==== Photos & video snapshots ====

All pictures taken with the camera, as well as snapshots taken during a video call, are stored in the following directory:
/user_data/home/Documents

The pictures are named according to the following scheme:

DD-MM-YYYY-hh:mm:ss.jpg

Day (DD) and month (MM) can also be a single digit.
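
Because day and month come first and may be one or two digits, these names do not sort chronologically. As a sketch, a name can be rewritten into a sortable YYYY-MM-DD form like this; the function name sortable_name is made up and the example file name is invented.

```shell
# Converts D-M-YYYY-hh:mm:ss.jpg into YYYY-MM-DD-hh:mm:ss.jpg.
# $1: file name following the scheme above
sortable_name() {
    old_ifs=$IFS
    IFS=-
    set -- $1            # split on "-" into day, month, year, rest
    IFS=$old_ifs
    # ${n#0} drops one leading zero so "08" is not read as octal
    printf '%04d-%02d-%02d-%s\n' "$3" "${2#0}" "${1#0}" "$4"
}

sortable_name "7-4-2010-09:58:16.jpg"    # prints 2010-04-07-09:58:16.jpg
```

The result can be used as the target name of a mv or cp when copying the pictures off the phone.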

=== Ringtones ===

As described above, the ringtones are located in:

/usr/local/etc/SystemRingTones/

They are .wav files (16 kHz, 16 bit, mono), which can be created quite comfortably with e.g. MhWaveEdit (Linux/GTK) or Audacity.

.wav files with 22050 Hz are also played without problems (slightly better quality than 16000 Hz) and can be used as ringtones. Files with 44100 Hz unfortunately only play with stuttering (32000 Hz not tested).
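
Before copying a file to the phone, its format can be checked on the PC. The following Python sketch uses only the standard `wave` module; the function names are made up, and it assumes that the 22050 Hz files mentioned above were also mono and 16 bit, which the text does not state explicitly.

```python
import io
import math
import struct
import wave


def write_test_tone(target, rate=16000, seconds=0.5, frequency=440.0):
    """Write a mono 16-bit sine tone to a path or a binary file object."""
    samples = (
        int(12000 * math.sin(2 * math.pi * frequency * i / rate))
        for i in range(int(rate * seconds))
    )
    frames = b"".join(struct.pack("<h", sample) for sample in samples)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)     # mono
        wav.setsampwidth(2)     # 2 bytes = 16 bit
        wav.setframerate(rate)
        wav.writeframes(frames)


def ringtone_format(source):
    """Return (channels, bits per sample, sample rate) of a WAV file."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth() * 8, wav.getframerate()


def is_known_good(source):
    """True for mono, 16 bit files with 16000 or 22050 Hz."""
    channels, bits, rate = ringtone_format(source)
    return channels == 1 and bits == 16 and rate in (16000, 22050)


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_test_tone(buffer)
    buffer.seek(0)
    print(ringtone_format(buffer))
```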

In addition to the audio files, matching *.desktop files must be created in the same directory so that everything is integrated into the menu correctly.
A file "16-Tetris.desktop" created for a fictitious 16th ringtone could have the following content:

[Desktop Entry]
Categories =
File =16-Tetris.wav
Name[] =16-Tetris
Type = audio/x-wav
[Translation]
File=QtopiaRingTones
Context=16-Tetris

Once the files have been created, the ringtone should show up in the menu and be selectable.
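
For several ringtones the .desktop text can be produced from a template. This Python sketch merely reproduces the example above with the name filled in; the function name is hypothetical, and the spacing around "=" is copied from the example without knowing whether the phone depends on it.

```python
def ringtone_desktop_entry(name):
    """Return the .desktop text for a ringtone stored as <name>.wav."""
    return (
        "[Desktop Entry]\n"
        "Categories =\n"
        f"File ={name}.wav\n"
        f"Name[] ={name}\n"
        "Type = audio/x-wav\n"
        "[Translation]\n"
        "File=QtopiaRingTones\n"
        f"Context={name}\n"
    )


if __name__ == "__main__":
    print(ringtone_desktop_entry("16-Tetris"), end="")
```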

=== System sounds ===

*All system sounds are located in ''/usr/local/sounds/'' and can be replaced by other files without problems.
*e.g.:
**''charge.wav'' - sound played when the device is placed in the base station

=== Timeserver ===

http://www.mikrocontroller.net/topic/170483#1645101
In /usr/local/data, the files monitor.cfg and monitor_ref_KPN.cfg contain
the addresses of the time servers "ntp.xs4all.nl" and "130.142.110.71". These
could be changed to e.g. "ptbtime1.ptb.de" and "ptbtime2.ptb.de".

(http://www.mikrocontroller.net/topic/170483#1649594)
The phone does *not* use NTP but the Time Protocol on port 37;
time.fu-berlin.de (or chronos.zedat.fu-berlin.de) can be used as a server.
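
To check from a PC whether a candidate server answers on port 37, a few lines of Python suffice. In the Time Protocol (RFC 868) the server replies with a 32-bit big-endian count of seconds since 1900-01-01 00:00 UTC. This is only a sketch: the function names are made up, it uses TCP, and whether the phone itself queries via TCP or UDP is not known here.

```python
import socket
import struct
from datetime import datetime, timedelta, timezone

# The Time Protocol counts seconds since 1900-01-01 00:00:00 UTC.
TIME_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def decode_time_reply(payload):
    """Convert the 4-byte big-endian reply of a time server to a datetime."""
    if len(payload) != 4:
        raise ValueError("a time reply is exactly 4 bytes long")
    (seconds,) = struct.unpack("!I", payload)
    return TIME_EPOCH + timedelta(seconds=seconds)


def query_time_server(host, port=37, timeout=5.0):
    """Ask a time server over TCP and return its clock as a UTC datetime."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        payload = b""
        while len(payload) < 4:
            chunk = conn.recv(4 - len(payload))
            if not chunk:
                break  # server closed the connection early
            payload += chunk
    return decode_time_reply(payload)
```

For example, `query_time_server("time.fu-berlin.de")` should return the current UTC time if the server is reachable.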

vi tip:
<syntaxhighlight lang="text">
vi /usr/local/data/monitor.cfg
:%s/ntp.xs4all.nl/time.fu-berlin.de/g
:%s/130.142.110.71/130.133.1.10/g
</syntaxhighlight>

It has been observed that phones with a statically configured IP address do not
fetch the time automatically. If the phone is set to DHCP, the clock
is set to the correct time even without modifying the time server.

Those who prefer an NTP client can find a package with ntpd, or ntpdate for a quick time sync from the console, under the link in [[#Weitere_Konsolenkommandos]]:
ntpdate pool.ntp.org

=== Additional software ===
==== Installing Dropbear (SSH server) ====

# Log in to the device via telnet
# Copy the following into the command line
cd /
wget http://www.mikrocontroller.net/attachment/74656/dropbear.tgz
tar -xzf dropbear.tgz
rm dropbear.tgz
cd /etc/rc.d/init.d
mv dropbear S90dropbear
./S90dropbear start
The command in line 6 (mv ...) is needed so that dropbear is started automatically on every reboot.

Check whether dropbear has been started and is running:
# ps
PID Uid Stat Command
136 root S /usr/sbin/dropbear

On every login dropbear looks for /var/log/lastlog, so finally one should run

touch /var/log/lastlog

to create this file. This also makes the corresponding error messages in logread disappear.

==== Installing Nano 2.2.3 ====

* Log in to the device via telnet
* Copy the following into the command line
cd /usr/bin
wget http://www.mikrocontroller.net/attachment/74023/nano
* Make the editor executable with the following command:
chmod +x /usr/bin/nano
* Now vi is history ;-)

The error "Error opening terminal: xterm-color" may occur when nano is started; in that case the following helps:

* Open /root/.bashrc (with vi :-)
* Write "export TERM=xterm" into the file
* Log out and log in again

==== Installing OpenVPN 2.0.9 ====

* Log in to the device via telnet
* Copy the following into the command line
cd /
wget http://www.mikrocontroller.net/attachment/74380/openvpn.tar.gz
tar -xzf openvpn.tar.gz
depmod
mknod /dev/net/tun c 10 200

==== Installing OpenVPN 2.1.1 ====

* Log in to the device via telnet
* Copy the following into the command line
cd /
wget http://www.mikrocontroller.net/attachment/74395/openvpn-2.1.1.tar.gz
tar -xzf openvpn-2.1.1.tar.gz
depmod
mknod /dev/net/tun c 10 200

==== Installing Tinc 1.0.12 ====

* Log in to the device via telnet
* Copy the following into the command line
cd /
wget http://www.mikrocontroller.net/attachment/74396/tinc-1.0.12.tar.gz
tar -xzf tinc-1.0.12.tar.gz
depmod
mknod /dev/net/tun c 10 200

==== Weitere Konsolenkommandos ====

Some useful console tools can be found at http://thinksilicon.de/57/Hacking-the-VP6500.html. Among them are bc, lsof, file, curl, mc (and mcedit), hexedit, mktemp, rsync, tcpdump, crond (with crontab) and ntpd.
* Note on tcpdump: it must be run as follows:
tcpdump -U root

=== simpler WLAN-Switcher ===

Without deeper GUI programming experience with Qtopia, one can make do with the following WLAN switcher:

The script /user_data/prod/config_ats.sh contains many commands showing how settings can be preset by means of a ''config code''. I made use of this with the following scripts of my own:

'''/root/switch_wlan.sh'''
#!/bin/bash
#================================================
# WLAN-Switcher
#================================================

CONFIG_CODE=`cat /user_data/config_code.txt`

cp -f /usr/local/data/wpa_supplicant_ref_${CONFIG_CODE}.conf /user_data/wifi/wpa_supplicant.conf

* copies the wpa_supplicant.conf selected by the ''config code''
* a wpa_supplicant_ref_<''config code''>.conf must exist for every ''config code''; the easiest way is to copy the current /user_data/wifi/wpa_supplicant.conf there

'''/root/set_config_code.sh'''
#!/bin/bash
echo "$1" > /user_data/config_code.txt

* writes the first parameter passed to it into the file /user_data/config_code.txt

'''/usr/local/bin/set2XXX'''
#!/bin/sh
/root/set_config_code.sh XXX
/root/switch_wlan.sh
reboot

* replace XXX with the ''config code''
* script that is started as an application
* currently, entering the ''config code'' "on the fly" is unfortunately not possible, so a separate script is needed for each WLAN

'''/usr/local/apps/Applications/set2YYY.desktop'''
[Translation]
File=QtopiaApplications
Context=set2XXX
[Desktop Entry]
Exec=set2XXX
Icon=Camera
Type=Application
Name[]=Enable YYY

* replace XXX with the ''config code'' (same script name as above)
* replace YYY with a unique name
* this file appears in the Applications menu under its Name[]

In theory, this method can also be used to switch between several SIP configurations and many other phone settings without problems (see the files in /usr/local/data); the DemoMode works on the same principle.
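
Since one launcher script and one .desktop file are needed per WLAN, both can be generated from the templates above. The following Python sketch (to be run on a PC; the function name and the sample values are made up) only fills in XXX and YYY and returns the texts keyed by their target paths. Copying them to the phone and making the script executable is left to the user.

```python
def wlan_switcher_files(config_code, label):
    """Return {target path: file content} for one WLAN profile."""
    script_name = f"set2{config_code}"
    script = (
        "#!/bin/sh\n"
        f"/root/set_config_code.sh {config_code}\n"
        "/root/switch_wlan.sh\n"
        "reboot\n"
    )
    desktop = (
        "[Translation]\n"
        "File=QtopiaApplications\n"
        f"Context={script_name}\n"
        "[Desktop Entry]\n"
        f"Exec={script_name}\n"
        "Icon=Camera\n"
        "Type=Application\n"
        f"Name[]=Enable {label}\n"
    )
    return {
        f"/usr/local/bin/{script_name}": script,
        f"/usr/local/apps/Applications/set2{label}.desktop": desktop,
    }


if __name__ == "__main__":
    # "123" and "Home" are placeholder values for XXX and YYY.
    for path, content in wlan_switcher_files("123", "Home").items():
        print(f"--- {path}\n{content}")
```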

== Setting up a build environment ==

So far, building GUI applications (Qtopia) is not possible in a simple way. The biggest problem is that most of the source code of the video phone application is not available. Building command-line applications, however, already works without problems.

=== Windows ===
The ARM development environment used here is based on gcc. It can be built as a Win32 application using cygwin, but this is highly unusual.

The best approach is to install a current version of Debian or Ubuntu in "vmware player" or "virtual box". If the virtual machine is used only for compiling, a compact command-line version (e.g. Ubuntu Server 9.10) is sufficient. Ready-made virtual machines, which vmware marketing likes to call "virtual appliances", are widely available (e.g. http://www.vmware.com/appliances/directory/70918).

=== Freetz-Linux ===
Freetz-Linux, which can be found in the IP-Phone-Forum (-> http://www.ip-phone-forum.de/showpost.php?p=1400234&postcount=1), is particularly suitable for Fritzbox owners.
It can be used both for building Freetz images and as a build environment, which saves a second VM.

The VM itself only needs to be started; everything else is best done from one's usual environment.
The home directory can be accessed over the Windows network via Samba, and a shell can be opened via SSH (set the encoding to UTF so that special characters are displayed correctly).
Make sure that the VM has access to the local network and to the Internet (best checked with 'ping google.com'). Oddly enough, in my case this only worked after I set the virtual network adapter to NAT in the VM settings.

The user name and all passwords are 'freetz'

==== Installing and testing the VP5500 toolchain ====
Installing the build environment:
sudo mkdir -p /opt/VP5500/toolchain
cd /opt/VP5500/toolchain
sudo wget http://www.handhelds.org/download/projects/toolchain/arm-linux-gcc-3.3.2.tar.bz2
sudo tar xjf arm-linux-gcc-3.3.2.tar.bz2
sudo rm arm-linux-gcc-3.3.2.tar.bz2
sudo ln -s /opt/VP5500/toolchain/usr/local/arm /usr/local/arm

After a sudo command the password may have to be entered, so the commands should be entered one at a time (or open a sudo shell at the start, in which case the sudo prefix can be omitted).

Testing the build environment:
cd ~
wget http://www.mikrocontroller.net/attachment/73161/helloworld.tgz
tar -xzf helloworld.tgz
rm helloworld.tgz
cd helloworld
make
Afterwards there should be a new executable in the ~/helloworld directory, which can be fetched from the host computer via the network share (\\freetz-linux\helloworld) and copied to the phone with WinSCP.

==== The actual purpose of Freetz-Linux ====
Those who also want to use it for building Freetz images to extend their Fritzbox must first download Freetz.
The following steps do this:
cd ~
svn checkout http://svn.freetz.org/trunk freetz-trunk
Afterwards the current Freetz is located in the 'freetz-trunk' directory inside the home directory.
Configure with 'make menuconfig' and build the image with 'make'.
If all goes well, the image can then be fetched from the host computer via the Windows share '\\freetz-linux\freetz-trunk\images' and flashed onto the box.

For more detailed information please refer directly to Freetz:
http://trac.freetz.org/

=== Linux ===
The software versions known so far to be shipped with the VP5500 and VP6500 are based on a somewhat older "gcc 3.3.2 ARM toolchain". A matching toolchain for an x86-based development system can be found at http://www.handhelds.org/download/projects/toolchain/arm-linux-gcc-3.3.2.tar.bz2.

On Debian-based Linux distributions this archive can be unpacked to e.g. /opt/VP5500/toolchain (some of the Makefiles in the forum assume this location). Note, however, that this version of the toolchain should also be reachable via the path /usr/local. To achieve this, a symlink to the actual location can be created in /usr/local with "ln -s /opt/VP5500/toolchain/usr/local/arm arm".

==== Hello World ====
Christian Klippel (ChrisK) has provided a "Hello World" at http://www.mikrocontroller.net/attachment/73161/helloworld.tgz, which can be used to test the toolchain and whose "Makefile" and "Makefile.local" can serve as a basis for one's own experiments.

The archive is unpacked into a local directory (e.g. ~/helloworld) and compiled there by running "make".

For testing, the binary must of course be transferred to the target system. If an http server or an ssh server is running on the development system, the binary can simply be copied into a directory reachable through it and then fetched from the /tmp directory in a telnet session using wget or scp. The file must be made executable with "chmod +x helloworld" before it can be run with "./helloworld".

With an ftpd (e.g. http://www.mikrocontroller.net/attachment/73780/troll-ftpd_1.28-cg2_arm.tgz) or sshd (e.g. Dropbear from http://vp6500.bd8.nl/) on the target system, the copying can also be initiated from the development system.

==== Adapting small console tools for cross-compilation ====
The Makefiles of smaller projects are often not as cleanly structured as the helloworld example, so it is easier to replace them than to modify them. The example of micro_httpd (http://www.acme.com/software/micro_httpd/ , http://www.mikrocontroller.net/attachment/73175/micro_httpd.tar.gz) shows how the modular approach of the "Hello World" example can be adopted (Makefile -> Makefile + Makefile.local).

= Miscellaneous =

== Hardware + software versions ==

Please list your hardware and software versions from the '''Applications''' => '''System Info''' menu if they are not listed here yet!

=== VP5500 ===

{| class="wikitable"
|-
! Hardware Version !! Date !! Software Version
|-
| ind5 ||0645 || 4.20
|-
| ind5 ||0647 || 4.20
|-
| ind5 ||0648 || 4.20
|-
| ind5 ||0649 || 4.20
|-
| ind5 ||0702 || 4.20
|-
| ind5 ||0703 || 4.20
|}

=== VP6500 ===

{| class="wikitable"
|-
! Hardware Version !! Date !! Software Version
|-
| ind3-v2 || 0711 || 3.22
|-
| ind3-v2 || 0713 || 3.22
|-
| ind3-v2 || 0716 || 3.22
|}

= Wishes and usage ideas =

Anyone who wants to voice something can type it in here, in order to pool interests and efforts. Nobody needs to reinvent the wheel three times, and everyone can see what the others are doing.
A status and contact details for things that are in progress would be great.

== Software ==

* Audio stream player
* MP3
** madplay is available [http://www.mikrocontroller.net/topic/172616#1704777 here], but the automatic power-saving mode still gets in the way
* Browser, even if only for Wikipedia and Google
* Mail client
* T9 support for text input
* Skype usage
* Use as a wireless webcam
* WLAN repeater
** better: [http://freifunk.net Freifunk] or [http://www.olsr.org OLSR daemon]
* Alarm clock
* Voice crypto
* Support for multiple WLAN profiles, so that the device can be used with several APs without re-entering the SSID / key every time. (Should be possible with multiple entries in wpa_supplicant.conf. However, the file is completely overwritten when changes are made through the menu. Alternative: [[#simpler WLAN-Switcher]])
* Support for multiple SIP profiles, e.g. to switch from one SIP provider to another. Ideally, two SIP profiles could be used at the same time
* YouTube client, comparable to "MiniTube Linux"
* An up-to-date firmware? E.g. with a 2.6 kernel and Android?
* Repair support for WPA2-encrypted WLANs with 802.1x, or find a working configuration (unencrypted already works)

== Hardware extensions ==

* Memory expansion
* USB port
* Bluetooth
* Additional connector for attaching another video camera via RCA (cinch)

== Usage ideas ==

* VoIP phone and webcam (obvious)
* WLAN finder
* Repeater
* Mobile info device with wiki access and mail push inside the wireless bubble
* Door opener, remote controller
* Robot brain
* WLAN radio
* Video (streaming) client in combination with VDR
* Barcode reader
* Remote control for a PC (Winamp/Mediaplayer/VLC...)
* Stream client and remote control for a DBOX2 running Linux

[[Kategorie:Projekte]]
[[Kategorie:Datenübertragung]]

Kategorie:Forum

2013-07-23T06:10:52Z

Nicolas: Spam entfernt

IC-Gehäuseformen

2009-01-29T12:03:21Z

Nicolas: /* DIL/SIL (DIP, PDIP) */

[[Image:IC-Gehaeuseformen.jpg|thumb|Package types in size comparison]]

== DIL/SIL (DIP, PDIP) ==

SIL: '''S'''ingle '''I'''n '''L'''ine Package (single row of pins)

DIL: '''D'''ual '''I'''n '''L'''ine Package (two rows of pins)

DIP: '''D'''ual '''I'''n line '''P'''ackage

PDIP: '''P'''lastic '''D'''ual '''I'''n line '''P'''ackage

The most widespread package type in electronics with through-hole pins ("legs"). The pins are inserted through holes in the PCB or into a socket and soldered from below. Since most DIL and SIL packages share a uniform pitch and therefore fit on (universal) 2.54 mm grid boards, they are much easier to handle for new designs and experimental setups than [[SMD]] components, whose pin pitches are sometimes quite exotic.

== SOP/SSOP/TSOP/TSSOP ==
[[Bild:Top_at49bv322a.jpg|thumb|right|256px|Atmel AT49BV322A flash device in a TSOP package]]
('''T'''hin) ('''S'''hrink) '''S'''mall '''O'''utline '''P'''ackage

An [[SMD]] package type. In contrast to QFP, the rectangular package has pins on only two sides. Usually these are the longer sides of the rectangular plastic body. TSOP, however, has its pins on the narrow sides of the package. TSOP packages are used particularly often for memory devices, because their shape simplifies the routing of bus systems on the PCB. The naming also differs from manufacturer to manufacturer, and new names are sometimes introduced for slightly different packages (different width of the plastic body etc.).

== BGA ==

('''F'''ine-pitch) '''B'''all '''G'''rid '''A'''rray

An [[SMD]] package type. The contacts take the form of pads, usually in several rows, on the underside of the package. In a so-called "balling process" these pads are provided with solder, which forms solder balls on them. If a BGA is removed and later placed again, this process has to be repeated (reballing process).

Because of the equipment these processes require, [[IC]]s in BGA packages are not well suited for hobby purposes. For those, parts in DIP or TQFP packages are the better choice.

For those who want to try anyway, an interesting project on soldering BGAs at home can be found at:
http://wwwbode.cs.tum.edu/~acher/bga/

== QFP ==

[[Bild:Qfp.jpg|right]]

'''Q'''uad '''F'''lat '''P'''ackage

A flat, rectangular [[SMD]] package type. The pins are brought out on all four edges as relatively small contacts. Hand soldering QFP parts is difficult and requires some practice and a steady hand, but it is possible. See the [[SMD Löten|notes on soldering SMD components]].

Variants:
* TQFP = '''T'''hin '''Q'''uad '''F'''lat '''P'''ackage
* LQFP = '''L'''ow Profile '''Q'''uad '''F'''lat '''P'''ackage
* PQFP = '''P'''lastic '''Q'''uad '''F'''lat '''P'''ackage
* CQFP = '''C'''eramic '''Q'''uad '''F'''lat '''P'''ackage
* BQFP = '''B'''umpered '''Q'''uad '''F'''lat '''P'''ackage
* SQFP = '''S'''mall '''Q'''uad '''F'''lat '''P'''ackage

== Adapter boards for SMD ICs ==

* http://www.ibhn.de/
* http://www.segor.de/
* http://www.futurlec.com/SMD_Adapters.shtml
* http://www.epboard.com/eproducts/ezadapter.htm
* http://www.ak-modul-bus.de/stat/platine_fuer_ics_im_tssop_gehaeuse.html (20 PIN TSSOP)
* http://www.saelig.com/miva/merchant.mvc?Screen=CTGY&Category_Code=SMD
* http://www.elk-tronic.de/Products/Adapter/SolderAdapter/SolderAdapter.htm
* http://www.omega-research.co.uk/
* http://www.sparkfun.com (-->"Breakout Boards")
* http://www.schmartboard.com
* http://www.c51.de/c51.de/ChipAdapter.php
* http://www.tme.pl/ip/369,universal-leiterplatten-fur-integrierte-schaltkreise-in-plcc-und-qfp-gehausen,de.html
* http://www.chipenable.co.uk/shop.htm (SO8-->DIL8)

== Weblinks ==

* http://www.ic-168.com/English/js-pk.htm - Overview of a great many different packages, with photos

[[Category:Bauteile]]

Drucker und Material für Platinenlayouts

2009-01-29T08:06:09Z

Nicolas: /* HL-1430 */

== Introduction ==
When deciding on a printer that is also meant to produce layouts for making one's own PCBs, the fundamental question is which devices are suitable at all. Even those who already own a printer are often surprised when testing new materials such as film, paper, toner or ink.

This article is intended as a platform for those who have already gathered positive or negative experience and want to pass it on. This can be recommendations for printers, combinations of printer and material, or experience with the procedure.
The search for the perfect driver setting alone is often tedious and can be made easier for others by writing down empirical values here.

Everyone can and should contribute: if you have had good or bad experiences with particular devices, settings, materials or combinations of them, do not hesitate to write them down here.
Photos of results are of course welcome as well.

The optimal layout and structure will surely emerge over time.

== Inkjet printers by manufacturer ==
=== Canon ===
==== Pixma IP5200 ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltinte
| Zweckform 2503 Inkjet Folie
|
*Folie
*Druckqualität = Hoch
*Farbe = Automatisch
| sehr gut
| sehr gut
| Getestet anhand eines TQFP100 Adapters. Die SMD-Pads sehen beim Schwarzweißdruck etwas unscharf aus. Ich empfehle den Druck mit Farbtinte. Diese lieferte definitiv das bessere Ergebnis.
|}
==== Pixma IP4200 ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltinte
| Canon Inkjet Folie
|
*Folie
*Druckqualität = Hoch
*Farbe = manuell (alles auf Maximum, Foto dunkel), Graustufen
| sehr gut
| sehr gut
| Einfach bedruckt schon fast perfekt blickdicht. Perfekte Ergebnisse mit Bungardmaterial und Lötstop. Für Teinting Resist doppelt bedrucken. Einzug arbeitet perfekt, kein Versatz erkennbar (auch nicht mit Lupe).
|-
| Originaltinte
| Conrad OH1
|
*Professional Fotopapier
*Druckqualität = Hoch
*Farbe = manuell (alles auf Maximum, Foto dunkel)
| sehr gut
| sehr gut
| Einfach bedruckt schon fast perfekt blickdicht.
|}

=== Epson ===
==== Epson Stylus C62 ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| MVCD www.mvcd.com
| HP Premium Transparentfolien inkjet 0,13mm
|
*Premium Glossy Photo Paper
*Optimales Foto
*Nur schwarze Tinte
*Gamma 2,2
*Helligkeit -25
*Sättigung +25
| sehr gut
| sehr gut
| Getestet bis 0,2 mm Breite/Abstand. Kleiner sicherlich möglich. Foto liegt leider nicht vor.
|}
==== Epson Stylus C70 ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Original und Billignachbau
| Zweckform Inkjetfolien 2503
|
*Premium Glossy Photo Paper
*Optimales Foto
*Nur schwarze Tinte
*Gamma 2,2
*Helligkeit -25
*Sättigung +25
| befriedigend
| sehr gut
| Für Bungard Platinenmaterial gut allerdings für Lötstop oder Teinting-Resist einfach zu durchlässig.
|}
==== Epson Stylus D88 ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Billignachbau (Die-Patrone)
| Sattelford OHP Folie
|
*Epson Matte
*Optimales Foto
*Nur schwarze Tinte
*Schnell: Aus
*Gamma 2,2
*Helligkeit -25
*Kontrast +25
| gut-sehr gut
| sehr gut
| Sehr gute Ergebnisse
|}

==== (Epson Stylus Photo 750) ====

* http://www.mikrocontroller.net/topic/93261#801235

=== HP ===
==== (HP Deskjet 970Cxi) ====

* http://www.mikrocontroller.net/topic/93261#800835
** sehr erfolgreich
** Zweckform Inkjet-Folien 1440 Dpi
** 1 Druckgang, 1200x1200 Dpi, schwarz, aus Eagle 4.16.
** Bisher feinste Struktur war SOT553-5 Package
** Bei 600x600 Dpi ist die Deckung der Farbe nicht mehr gesichert. Man sieht auch feine Streifen.

==== HP 980Cxi ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Original 45er Tinte schwarz
| Normales Transparent(paus)papier aus dem Schreibwarenhandel
|
* HP Premium Photo Paper (oder so ähnlich)
* Druck: Optimal
* schwarz/weiß Druck (keine Farbe!!!)
| sehr gut (10 mil Bahnen bei 8 mil Abstand ohne Probleme)
| sehr gut
|
* Mit Bungard Grundmaterial probiert
* Sollte mit jedem HP Drucker funktionieren, der die 45er Tinte verwendet
* Evtl. mit der Belichtung etwas übertreiben (bei mir waren es 30 statt 20 Minuten), damit die Zwischenräume sauber belichtet werden
* Unbedingt auf den "seitenverkehrten" Druck achten, so daß die bedruckte Seite direkt auf der Platine sitzt
* Das Papier vorher gut trocknen lassen (mind. 30 Minuten!)
|}

==== (HP Deskjet F380) ====

* http://www.mikrocontroller.net/topic/93261#801047
* 0,18mm Strukturen mit 0,2mm Clearance, siehe [[STK500USB-Adapter]]

=== Brother MFC260C ===

*No usable results so far.
*Tested: Zweckform inkjet film, nearly all settings. Never anywhere near opaque; larger areas are almost transparent in places.

== Laser printers by manufacturer ==

===Kyocera===
==== Kyocera Mita FS1000+ ====
{| border=1
| '''Toner'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltoner
| Tonermethode:Reicheltpapier oder billiges Papier von MM (Gelbe Verpackung "Copy Extra")
| 600dpi, kein Sparmodus
| sehr gut
| sehr gut
| Keine Probleme bis 0,1er Leiterbahnen,Cups Standardtreiber, Keine Erfahrung mit Windows.
|}

===Brother===

====HL-1430====
{| border=1
| '''Toner'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltoner
| Overheadfolie Zweckform
| Folie, nicht Graustufen verbessern
| sehr gut*
| sehr gut
|Belichten: 4 UV-Röhren 300s, Leiterbahnabstand 0,2mm kein Problem. *Tonerverdichtung mit Etikettenlöser "Solvent 50"
|-
| Originaltoner
| Reichelt-Katalogpapier
| Normalpapier, nicht Graustufen verbessern
| -
| mittel
| The toner spreads considerably when ironed on
|}

====HL-5030====
{| border=1
| '''Toner'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltoner
| Overheadfolie Zweckform 3491
| 600dpi, Schwärzungsgrad hoch
| gut
| sehr gut
| Belichten: 4 UV-Röhren 45s, Leiterbahnen mit 0,4mm kein Problem, mit ein wenig Erfahrung sind auch 0,3mm möglich.
|}
====HL-2040====
{| border=1
| '''Toner'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltoner
| Transparentpapier 85g/m^2
| 600dpi, Schwärzungsgrad: Maximum, Einstellung: Dickes Papier
| gut
| sehr gut
| Exposure: 4 UV tubes, 110 s; 8 mil traces are no problem, and with a little experience 6 mil is possible too. Also well suited for the toner transfer method (catalog paper glued onto plain paper).
|}

===Epson===
==== C1100 (Farblaser) ====
{| border=1
| '''Toner'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltoner
| Normalpapier, mit Öl transparent gemacht
| Druckqualität = Hoch
| gut
| sehr gut
| Bei 4x15W UV Belichtungszeit über 4 Minuten. Abstände von 0,635mm/25mil sind kein Problem, 16mil geht gerade noch. Daher vor allem für bedrahtete Sachen geeignet, wo sich so sehr niedrige Druckkosten ergeben.
[http://img47.imageshack.us/img47/8416/uwegwminbotplatinefertig0az.jpg Foto meiner ersten Platine]

--[[Benutzer:Uwegw|Uwegw]] 11:35, 19. Mai 2006 (CEST)
|}
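
The tables mix millimetres and mil (thousandths of an inch), for example "0,635mm/25mil" above. As a small aid for comparing the entries, a Python sketch of the conversions follows; the function names are made up, and the dot pitch is only the nominal value derived from the dpi setting, not a measured printer resolution.

```python
MM_PER_INCH = 25.4


def mil_to_mm(mil):
    """Convert thousandths of an inch (mil) to millimetres."""
    return mil * MM_PER_INCH / 1000


def mm_to_mil(mm):
    """Convert millimetres to thousandths of an inch (mil)."""
    return mm * 1000 / MM_PER_INCH


def dot_pitch_mm(dpi):
    """Nominal distance between two printer dots at a given resolution."""
    return MM_PER_INCH / dpi


if __name__ == "__main__":
    print(f"25 mil = {mil_to_mm(25):.3f} mm")
    print(f"0.2 mm = {mm_to_mil(0.2):.1f} mil")
    print(f"600 dpi = {dot_pitch_mm(600):.4f} mm per dot")
```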

===Samsung===
==== ML1610 ====
{| border=1
| '''Toner'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltoner
| billiges (=ziemlich dünnes) Fotopapier, mit Öl transparent gemacht
| 600dpi, kein Sparmodus, CUPS mit Treiber "Samsung ML-1510 Foomatic/gdi"
| sehr gut
| sehr gut
| TQFP64 is no problem, but make the pads somewhat narrower than normal (about 10 mil wide), since the gaps otherwise become too small. Lines down to 3 mil are well resolved. The Linux driver from the CD gives worse results than the ML1510 driver shipped with CUPS. No experience with Windows.

Very inexpensive printer (new <90 EUR), meanwhile replaced by its successor ML2010.

--[[Benutzer:R2D2|R2D2]] 17:50, 22. Sep 2007 (CEST)
|}

===IBM===
====PagePrinter 3116====
{| border=1
| '''Toner'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
|Originaltoner
|Tonermethode auf Reicheltpapier
|1200dpi
|sehr gut, große Masseflächen einwandfrei
|sehr gut
|4mil Leiterbahn mit 8mil Abstand sind gut machbar, gedruckt mit cups und ps2 Treiber. Belichten durch Normalpapier+Öl ging auch ganz gut (TSSOP ohne Probleme).
|}

[[Category:Platinen]]

Drucker und Material für Platinenlayouts

2009-01-29T08:00:53Z

Nicolas: /* HL-1430 */

== Introduction ==
When deciding on a printer that is also meant to produce layouts for making one's own PCBs, the fundamental question is which devices are suitable at all. Even those who already own a printer are often surprised when testing new materials such as film, paper, toner or ink.

This article is intended as a platform for those who have already gathered positive or negative experience and want to pass it on. This can be recommendations for printers, combinations of printer and material, or experience with the procedure.
The search for the perfect driver setting alone is often tedious and can be made easier for others by writing down empirical values here.

Everyone can and should contribute: if you have had good or bad experiences with particular devices, settings, materials or combinations of them, do not hesitate to write them down here.
Photos of results are of course welcome as well.

The optimal layout and structure will surely emerge over time.

== Inkjet printers by manufacturer ==
=== Canon ===
==== Pixma IP5200 ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltinte
| Zweckform 2503 Inkjet Folie
|
*Folie
*Druckqualität = Hoch
*Farbe = Automatisch
| sehr gut
| sehr gut
| Getestet anhand eines TQFP100 Adapters. Die SMD-Pads sehen beim Schwarzweißdruck etwas unscharf aus. Ich empfehle den Druck mit Farbtinte. Diese lieferte definitiv das bessere Ergebnis.
|}
==== Pixma IP4200 ====
{| border=1
| '''Tinte'''
| '''Druckmedium'''
| '''Treibereinstellung'''
| '''Lichtdichtheit'''
| '''Schärfe'''
| '''Kommentar'''
|-
| Originaltinte
| Canon Inkjet Folie
|
*Folie
*Druckqualität = Hoch
*Farbe = manuell (alles auf Maximum, Foto dunkel), Graustufen
| sehr gut
| sehr gut
| Einfach bedruckt schon fast perfekt blickdicht. Perfekte Ergebnisse mit Bungardmaterial und Lötstop. Für Teinting Resist doppelt bedrucken. Einzug arbeitet perfekt, kein Versatz erkennbar (auch nicht mit Lupe).
|-
| Originaltinte
| Conrad OH1
|
*Professional Fotopapier
*Druckqualität = Hoch
*Farbe = manuell (alles auf Maximum, Foto dunkel)
| sehr gut
| sehr gut
| Einfach bedruckt schon fast perfekt blickdicht.
|}

=== Epson ===
==== Epson Stylus C62 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| MVCD www.mvcd.com
| HP Premium inkjet transparency film, 0.13 mm
|
*Premium Glossy Photo Paper
*Best Photo
*Black ink only
*Gamma 2.2
*Brightness -25
*Saturation +25
| very good
| very good
| Tested down to 0.2 mm width/spacing. Smaller is certainly possible. Unfortunately no photo is available.
|}
==== Epson Stylus C70 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original and cheap compatible
| Zweckform inkjet film 2503
|
*Premium Glossy Photo Paper
*Best Photo
*Black ink only
*Gamma 2.2
*Brightness -25
*Saturation +25
| satisfactory
| very good
| Good enough for Bungard board material, but simply too translucent for solder mask or tenting resist.
|}
==== Epson Stylus D88 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Cheap compatible (Die-Patrone)
| Sattelford OHP film
|
*Epson Matte
*Best Photo
*Black ink only
*High speed: Off
*Gamma 2.2
*Brightness -25
*Contrast +25
| good to very good
| very good
| Very good results
|}

==== (Epson Stylus Photo 750) ====

* http://www.mikrocontroller.net/topic/93261#801235

=== HP ===
==== (HP Deskjet 970Cxi) ====

* http://www.mikrocontroller.net/topic/93261#800835
** very successful
** Zweckform inkjet film, 1440 dpi
** 1 pass, 1200x1200 dpi, black, printed from Eagle 4.16.
** Finest structure so far was a SOT553-5 package
** At 600x600 dpi the ink coverage is no longer reliable. Fine stripes also become visible.

==== HP 980Cxi ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original No. 45 black ink
| Ordinary tracing paper from a stationery shop
|
* HP Premium Photo Paper (or similar)
* Print quality: Best
* Black-and-white printing (no color!)
| very good (10 mil traces at 8 mil spacing without problems)
| very good
|
* Tried with Bungard base material
* Should work with any HP printer that uses the No. 45 ink
* It may help to overexpose slightly (30 instead of 20 minutes in my case) so that the gaps are exposed cleanly
* Be sure to print mirrored, so that the printed side lies directly on the board
* Let the paper dry thoroughly beforehand (at least 30 minutes!)
|}

==== (HP Deskjet F380) ====

* http://www.mikrocontroller.net/topic/93261#801047
* 0.18 mm structures with 0.2 mm clearance, see [[STK500USB-Adapter]]

=== Brother MFC260C ===

*No usable results so far.
*Tested: Zweckform inkjet film with almost every setting. Never even close to opaque; larger areas are almost transparent in places.

== Laser printers by manufacturer ==

===Kyocera===
==== Kyocera Mita FS1000+ ====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Toner transfer method: Reichelt paper or cheap paper from MM (yellow packaging, "Copy Extra")
| 600 dpi, no economy mode
| very good
| very good
| No problems with traces down to 0.1 (presumably mm), CUPS standard driver, no experience with Windows.
|}

===Brother===

====HL-1430====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Zweckform overhead film
| Transparency, "improve grayscale" off
| very good*
| very good
| Exposure: 4 UV tubes, 300 s; 0.2 mm trace spacing is no problem. *Toner densified with the label remover "Solvent 50"
|}

====HL-5030====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Zweckform overhead film 3491
| 600 dpi, density: high
| good
| very good
| Exposure: 4 UV tubes, 45 s; 0.4 mm traces are no problem, and with a little experience 0.3 mm is possible too.
|}
====HL-2040====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Tracing paper, 85 g/m^2
| 600 dpi, density: maximum, media setting: thick paper
| good
| very good
| Exposure: 4 UV tubes, 110 s; 8 mil traces are no problem, and with a little experience 6 mil is possible too. Also well suited to the toner transfer method (catalog paper glued onto plain paper).
|}

===Epson===
==== C1100 (color laser) ====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Plain paper, made transparent with oil
| Print quality = High
| good
| very good
| With 4x15 W UV the exposure time is over 4 minutes. Spacings of 0.635 mm/25 mil are no problem, 16 mil just about works. It is therefore mainly suited to through-hole designs, where it gives very low printing costs.
[http://img47.imageshack.us/img47/8416/uwegwminbotplatinefertig0az.jpg Photo of my first board]

--[[Benutzer:Uwegw|Uwegw]] 11:35, 19. Mai 2006 (CEST)
|}

===Samsung===
==== ML1610 ====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Cheap (= fairly thin) photo paper, made transparent with oil
| 600 dpi, no economy mode, CUPS with the driver "Samsung ML-1510 Foomatic/gdi"
| very good
| very good
| TQFP64 is no problem, but make the pads somewhat narrower than normal (about 10 mil wide), otherwise the gaps become too small. Lines down to 3 mil are resolved well. The Linux driver from the CD gives worse results than the ML1510 driver shipped with CUPS. No experience with Windows.

Very inexpensive printer (new <90 EUR), meanwhile replaced by its successor, the ML2010.

--[[Benutzer:R2D2|R2D2]] 17:50, 22. Sep 2007 (CEST)
|}

===IBM===
====PagePrinter 3116====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
|Original toner
|Toner transfer method on Reichelt paper
|1200 dpi
|very good, large ground planes flawless
|very good
|4 mil traces with 8 mil spacing are easily achievable, printed with CUPS and the ps2 driver. Exposing through plain paper + oil also worked quite well (TSSOP without problems).
|}

[[Category:Platinen]]

Drucker und Material für Platinenlayouts

2009-01-29T07:59:59Z

Nicolas: /* Brother */

== Introduction ==
Anyone about to buy a printer that should also produce layouts for home-made PCBs faces the basic question of which devices are suitable at all. Even those who already own a printer are often surprised when they start testing new materials such as film, paper, toner or ink.

This article is meant as a platform for those who have already gathered positive or negative experience and want to pass it on. This can include recommendations for printers, combinations of printer and material, or experience with the procedure.
Just finding the right driver settings is often tedious; experience recorded here can make this easier for others.

Everyone is invited to contribute: if you have had good or bad experiences with particular devices, settings, materials or combinations of them, do not hesitate to write them down here.
Photos of results are of course welcome too.

The best layout and structure for this page will certainly emerge over time.

== Inkjet printers by manufacturer ==
=== Canon ===
==== Pixma IP5200 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original ink
| Zweckform 2503 inkjet film
|
*Transparency
*Print quality = High
*Color = Automatic
| very good
| very good
| Tested with a TQFP100 adapter layout. The SMD pads look slightly blurred when printed in black and white. I recommend printing with color ink, which clearly gave the better result.
|}
==== Pixma IP4200 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original ink
| Canon inkjet film
|
*Transparency
*Print quality = High
*Color = manual (everything at maximum, photo dark), grayscale
| very good
| very good
| A single pass is already almost perfectly opaque. Perfect results with Bungard material and solder mask. Print twice for tenting resist. The paper feed works perfectly, no offset visible (not even with a magnifier).
|-
| Original ink
| Conrad OH1
|
*Professional photo paper
*Print quality = High
*Color = manual (everything at maximum, photo dark)
| very good
| very good
| A single pass is already almost perfectly opaque.
|}

=== Epson ===
==== Epson Stylus C62 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| MVCD www.mvcd.com
| HP Premium inkjet transparency film, 0.13 mm
|
*Premium Glossy Photo Paper
*Best Photo
*Black ink only
*Gamma 2.2
*Brightness -25
*Saturation +25
| very good
| very good
| Tested down to 0.2 mm width/spacing. Smaller is certainly possible. Unfortunately no photo is available.
|}
==== Epson Stylus C70 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original and cheap compatible
| Zweckform inkjet film 2503
|
*Premium Glossy Photo Paper
*Best Photo
*Black ink only
*Gamma 2.2
*Brightness -25
*Saturation +25
| satisfactory
| very good
| Good enough for Bungard board material, but simply too translucent for solder mask or tenting resist.
|}
==== Epson Stylus D88 ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Cheap compatible (Die-Patrone)
| Sattelford OHP film
|
*Epson Matte
*Best Photo
*Black ink only
*High speed: Off
*Gamma 2.2
*Brightness -25
*Contrast +25
| good to very good
| very good
| Very good results
|}

==== (Epson Stylus Photo 750) ====

* http://www.mikrocontroller.net/topic/93261#801235

=== HP ===
==== (HP Deskjet 970Cxi) ====

* http://www.mikrocontroller.net/topic/93261#800835
** very successful
** Zweckform inkjet film, 1440 dpi
** 1 pass, 1200x1200 dpi, black, printed from Eagle 4.16.
** Finest structure so far was a SOT553-5 package
** At 600x600 dpi the ink coverage is no longer reliable. Fine stripes also become visible.

==== HP 980Cxi ====
{| border=1
| '''Ink'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original No. 45 black ink
| Ordinary tracing paper from a stationery shop
|
* HP Premium Photo Paper (or similar)
* Print quality: Best
* Black-and-white printing (no color!)
| very good (10 mil traces at 8 mil spacing without problems)
| very good
|
* Tried with Bungard base material
* Should work with any HP printer that uses the No. 45 ink
* It may help to overexpose slightly (30 instead of 20 minutes in my case) so that the gaps are exposed cleanly
* Be sure to print mirrored, so that the printed side lies directly on the board
* Let the paper dry thoroughly beforehand (at least 30 minutes!)
|}

==== (HP Deskjet F380) ====

* http://www.mikrocontroller.net/topic/93261#801047
* 0.18 mm structures with 0.2 mm clearance, see [[STK500USB-Adapter]]

=== Brother MFC260C ===

*No usable results so far.
*Tested: Zweckform inkjet film with almost every setting. Never even close to opaque; larger areas are almost transparent in places.

== Laser printers by manufacturer ==

===Kyocera===
==== Kyocera Mita FS1000+ ====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Toner transfer method: Reichelt paper or cheap paper from MM (yellow packaging, "Copy Extra")
| 600 dpi, no economy mode
| very good
| very good
| No problems with traces down to 0.1 (presumably mm), CUPS standard driver, no experience with Windows.
|}

===Brother===

====HL-1430====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Zweckform overhead film
| Transparency, "improve grayscale" off
| very good*
| very good
| Exposure: 4 UV tubes, 300 s; 0.2 mm trace spacing is no problem. *Toner densified with the label remover "Solvent 50"
|}
====HL-5030====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Zweckform overhead film 3491
| 600 dpi, density: high
| good
| very good
| Exposure: 4 UV tubes, 45 s; 0.4 mm traces are no problem, and with a little experience 0.3 mm is possible too.
|}
====HL-2040====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Tracing paper, 85 g/m^2
| 600 dpi, density: maximum, media setting: thick paper
| good
| very good
| Exposure: 4 UV tubes, 110 s; 8 mil traces are no problem, and with a little experience 6 mil is possible too. Also well suited to the toner transfer method (catalog paper glued onto plain paper).
|}

===Epson===
==== C1100 (color laser) ====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Plain paper, made transparent with oil
| Print quality = High
| good
| very good
| With 4x15 W UV the exposure time is over 4 minutes. Spacings of 0.635 mm/25 mil are no problem, 16 mil just about works. It is therefore mainly suited to through-hole designs, where it gives very low printing costs.
[http://img47.imageshack.us/img47/8416/uwegwminbotplatinefertig0az.jpg Photo of my first board]

--[[Benutzer:Uwegw|Uwegw]] 11:35, 19. Mai 2006 (CEST)
|}
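
The tables on this page mix millimeters and mil (thousandths of an inch). As a small aid for comparing entries, the conversion can be sketched as follows; the function names are only illustrative and not part of any tool mentioned here.

```python
# 1 mil = 1/1000 inch and 1 inch = 25.4 mm, so 1 mil = 0.0254 mm.
MM_PER_MIL = 0.0254


def mil_to_mm(mil: float) -> float:
    """Convert a width or spacing from mil to millimeters."""
    return mil * MM_PER_MIL


def mm_to_mil(mm: float) -> float:
    """Convert a width or spacing from millimeters to mil."""
    return mm / MM_PER_MIL


if __name__ == "__main__":
    # Values taken from the table entries on this page.
    for mil in (25, 16, 8, 6):
        print(f"{mil} mil = {mil_to_mm(mil):.3f} mm")
    for mm in (0.2, 0.3, 0.4):
        print(f"{mm} mm = {mm_to_mil(mm):.1f} mil")
```

This confirms, for example, that 25 mil corresponds to the 0.635 mm quoted in the C1100 entry.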

===Samsung===
==== ML1610 ====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
| Original toner
| Cheap (= fairly thin) photo paper, made transparent with oil
| 600 dpi, no economy mode, CUPS with the driver "Samsung ML-1510 Foomatic/gdi"
| very good
| very good
| TQFP64 is no problem, but make the pads somewhat narrower than normal (about 10 mil wide), otherwise the gaps become too small. Lines down to 3 mil are resolved well. The Linux driver from the CD gives worse results than the ML1510 driver shipped with CUPS. No experience with Windows.

Very inexpensive printer (new <90 EUR), meanwhile replaced by its successor, the ML2010.

--[[Benutzer:R2D2|R2D2]] 17:50, 22. Sep 2007 (CEST)
|}

===IBM===
====PagePrinter 3116====
{| border=1
| '''Toner'''
| '''Print medium'''
| '''Driver settings'''
| '''Opacity'''
| '''Sharpness'''
| '''Comment'''
|-
|Original toner
|Toner transfer method on Reichelt paper
|1200 dpi
|very good, large ground planes flawless
|very good
|4 mil traces with 8 mil spacing are easily achievable, printed with CUPS and the ps2 driver. Exposing through plain paper + oil also worked quite well (TSSOP without problems).
|}

[[Category:Platinen]]

Platinenherstellung mit der Tonertransfermethode

2008-12-01T14:15:44Z

Nicolas: /* Druckerdatenbank */

PCBs can be made faster and more cheaply, without the intermediate steps of exposing and developing, by using the toner of a laser printer or photocopier as the etch mask. The toner softens when heated and thereby sticks to the copper of the board.

The achievable quality depends on many factors, as it does with the conventional process; it is therefore not necessarily reproducible and usually not as high as with the photographic method. Trace widths of 0.2 mm are possible (KM FS-1020D), but because so little toner is applied, tiny holes can appear, which are already harmful at this scale. Broken traces are best repaired with an acid-resistant Edding marker.

== Materials ==
* Laser printer or photocopier
* Transfer medium (magazine paper, Reichelt catalog page)
* Copper-clad board (without photoresist or any other coating!)
* Steel wool or a Glitzi sponge for mechanically cleaning the board
* Acetone for degreasing the board
* Ordinary clothes iron
* Heat-resistant, smooth surface
* Etchant

== Procedure ==

=== Printing ===
The layout is printed with a laser printer (here: Samsung ML4600) onto a suitable transfer medium (here: a Reichelt catalog page). Coated glossy paper from catalogs or magazines works best for the transfer.

For SMD projects (traces on the component side) the layout must be printed mirrored. If the traces are on the back side, the layout would have to be mirrored twice, which cancels out. Only any text intended for the back side then has to be mirrored.
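
The mirroring rule can be written down as a parity check. The following is only a sketch under the assumption that the layout is drawn as seen from the component side; the function and constant names are hypothetical.

```python
def needs_mirrored_print(mirror_count: int) -> bool:
    """Return True if the printout must be mirrored.

    Every mirroring step flips the image once, and two flips cancel,
    so only whether the count is odd or even matters.
    """
    return mirror_count % 2 == 1


# Assumption: the layout is drawn as seen from the component side.
# Ironing the paper face-down onto the copper flips the image once,
# so one compensating mirror step is always needed.
TRANSFER_FLIP = 1
# Traces on the back side are seen "through" the board: one more flip.
BACK_SIDE_FLIP = 1

smd_component_side = needs_mirrored_print(TRANSFER_FLIP)
back_side = needs_mirrored_print(TRANSFER_FLIP + BACK_SIDE_FLIP)

print("component side, print mirrored:", smd_component_side)  # True
print("back side, print mirrored:", back_side)                # False
```

Text for the back side is the exception noted above: if it was entered unmirrored in the layout, it still needs one mirror step of its own.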

=== Cleaning the board ===
The board must be cleaned thoroughly. Scouring cream, a steel-wool pot scrubber and finally acetone for degreasing work well (methylated spirit also works in principle, but always leaves a (very) thin greasy film). The steel-wool scrubber is also very good for getting toner residue off the board again after a failed attempt. A Glitzi sponge (a kitchen sponge with an abrasive layer on one side) is less effective but also works.

=== Transferring ===
A simple clothes iron provides enough heat. The required ironing time is about 5 minutes, but depending on the toner, good results can be achieved in less time (iron at maximum!). Ironing for too long makes the toner spread more and more.

The key point is that pressure is applied to every area. Because the sole of the iron, the board and the surface underneath are never perfectly flat, it is not enough to just place the iron on the stack and press down. In our experience, fabric placed underneath does not distribute the pressure entirely evenly either.

The toner bonds sheet and board well enough after a short time (if necessary, print a frame around the whole project), so that the stack can soon be moved carefully.

A proven approach is to slide the stack of one layer of aluminum foil (as a sliding aid), the board, the paper and the iron several times over the edge of a kitchen towel lying on a smooth surface (here: a Ceran glass-ceramic cooktop). With too much pressure, however, the traces can spread, and of course the stack must not shift while doing this!

[[Bild:Tonertransfer0.jpg|thumb|center|Transferring]]

Laminators may also be suitable for applying the toner to the board.

=== Washing off the paper ===
Catalog paper can be removed by soaking it in warm soapy water and gently rubbing with the fingers, so that only the toner remains on the board. With even pressure the toner usually bonds very well to the copper layer, so one can be surprisingly vigorous.

[[Bild:Tonertransfer1.jpg|thumb|center|After ironing on]]

=== Etching ===

After etching with iron(III) chloride (FeCl3) the board looks like this; the remaining paper fibers do not interfere. After some etching time the copper layer disappears quite quickly everywhere.

[[Bild:Tonertransfer2.jpg|thumb|center|After etching]]

The toner is then removed with acetone; removing it mechanically, without solvents, also works. To do so, rub the board with the rough side of a Glitzi sponge or with a steel-wool pot scrubber (more effective!) until no toner residue is visible. A "sanding pad" from a DIY or plumbing store, normally used for cleaning copper pipes before soldering, is also an excellent alternative to the sponge.

[[Bild:Tonertransfer3.jpg|thumb|center|Cleaned board]]

Acetone is available at DIY stores (be sure to follow the safety instructions!).

=== Component legend ===

Incidentally, the toner transfer method is also excellent for producing a component legend on the other side of the board. Here too, the print is ironed onto the board mirrored. Positioning works best on a board that has already been drilled: hold it together with the paper against the light and align the paper with the drill holes.

The toner does not adhere as strongly on the other side, but it should last at least until the board has been fully populated, and with careful handling considerably longer.

Tip: letting the board cool down after the first ironing and ironing over it a second time makes the print even more durable.

It may also be possible to skip soaking in water by printing in grayscale: because of the laser printer's halftoning, the transfer medium is bonded to the board only at toner dots, which tend to stay on the board rather than on the transfer medium when it is peeled off.

== Tested films and papers ==

* Pages from a Reichelt catalog work great!
* Avery Zweckform Laser A4
* Inkjet Glossy Paper (Lidl) (glossy photo paper for inkjet printers) works very well!
* Paper from an ELV catalog/Focus/Spiegel (use black-and-white pages where possible)
* Laminating film - http://www.mikrocontroller.net/forum/read-6-261025.html#277670
* Thermal paper (fax paper), on the smooth side

== Notes on printers ==

* Kyocera-Mita FS-1020D: used with Reichelt catalog paper.
* Kyocera-Mita FS-820: does not feed Reichelt paper, it is too thin.
* Brother HL-1230: does not feed it either, but this can be worked around by placing a sheet of plain paper under the catalog page and using the printer's straight-through paper path.

== Printer database ==
'''''HP:'''''
* '''LaserJet 4:''' Reichelt paper on its own causes a paper jam.
* '''LaserJet 4100:''' Reichelt paper glued onto plain paper works without much trouble (feeding Reichelt paper directly has not been tried yet).
* '''LaserJet 5MP:''' Reichelt paper from the manual feed works without problems.
* '''LaserJet 6P:''' Replacement toner from Reichelt, printed on (almost) A4 pages from a Pollin catalog: trouble-free if the rear output flap of the printer (the grille) is opened and the paper is taken out there!
* '''ColorJet 2550N:''' Toner adheres very poorly to Reichelt paper. Same feed problems as the LJ4, but these can be avoided with an A4 sheet as a "carrier": simply fix the smaller Reichelt page to the A4 copy paper with a Pritt glue stick at the top corners. '''Do not use adhesive tape (Tesafilm), otherwise the printer will be damaged!'''
* '''Color LaserJet 2605dn:''' Toner adheres well to Reichelt paper. No feed problems if you proceed as with the 2550N.
* '''LaserJet 2200D:''' Crumples Reichelt paper when feeding, but it can still be printed by gluing it onto an A4 sheet. Print results are good.
* '''LaserJet 1018:''' Works with a Reichelt page glued onto an A4 sheet.

Some laser printers (HP LaserJet 5MP) apparently save a lot of toner on large areas, even outside economy mode. I could not reliably transfer large black areas, nor traces near them.

'''''Epson:'''''
* '''AcuLaser C2000:''' Reichelt paper on A4 (attached with masking tape) or cut to A5 is no problem. Very good coverage; large ground planes are no problem. Even detailed color drawings transfer very well (aluminum plates). All with cheap toner.
* '''EPL-3000:''' Reichelt paper feeds directly (driver HP-LJ4L); Hörzu paper (glued onto A4) is no problem. Very good coverage. However, the printer only has 300 dpi, so fine structures are somewhat problematic.
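
Why 300 dpi limits fine structures can be estimated with a little arithmetic. The sketch below uses only the nominal resolution and the 0.2 mm trace width mentioned in the introduction; real toner dots are larger and less regular, so treat the numbers as a rough guide. The function names are illustrative.

```python
MM_PER_INCH = 25.4


def dot_pitch_mm(dpi: int) -> float:
    """Nominal distance between two neighboring printer dots in mm."""
    return MM_PER_INCH / dpi


def dots_across(width_mm: float, dpi: int) -> float:
    """Nominal number of printer dots across a trace of the given width."""
    return width_mm * dpi / MM_PER_INCH


if __name__ == "__main__":
    for dpi in (300, 600, 1200):
        print(f"{dpi} dpi: dot pitch {dot_pitch_mm(dpi):.3f} mm, "
              f"0.2 mm trace = {dots_across(0.2, dpi):.1f} dots")
```

At 300 dpi a 0.2 mm trace is only about two dots wide, so a single missing dot already removes a large part of its width.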

'''''Minolta:'''''
* '''PagePro 1200W:''' Reichelt paper with a plain A5 sheet as carrier (glued on), fed via the manual feed, works without problems with refill toner. On baking paper, however, the toner hardly adheres.

'''''Lexmark:'''''
* '''Optra S 1650:''' Reichelt paper on an A4 carrier: prints very densely (set density to maximum) and transfers well to the board. Great results.

* '''E120N:''' Very good results with Reichelt paper on an A4 carrier: prints very densely (set density to maximum) and transfers well to the board. Great results. (Original toner)

'''''Kyocera:'''''
* '''FS-400 / FS-400A:''' Reichelt paper from the manual feed works and transfers well to the board. Great results. (Original toner)
* '''FS-1000+:''' Glue the Reichelt paper onto an A4 sheet first, ideally attaching it only at the feed edge, otherwise it buckles. Polygons are printed very thinly in the middle, so it is best not to rub the paper off there. Afterwards, place the board with its unprinted side on the iron for 5 minutes; the polygons then become dense as well.

'''''Brother:'''''
* '''HL-5250DN:''' Reichelt paper without a carrier in the manual feed causes a paper jam. With an A4 carrier, however, the layout prints without problems using original toner. Results after ironing are surprisingly good. Traces 6 mil wide could be produced without breaks.
* '''HL-1030:''' Reichelt or Pollin paper gives very good results; after just 2-3 passes with an iron at setting 2.5 the toner is cleanly on the board, and the paper comes off very easily under ordinary water. Thin catalog paper can only be used via the manual feed, with plain paper underneath; the catalog paper and the plain paper do not need to be glued together. The toner washes off well with universal thinner.
* '''HL-1430:''' Very good results with Reichelt catalog paper glued at the feed edge onto plain printer paper with a glue stick. The toner is very easy to remove with "Solvent 50".

'''''Samsung:'''''
* '''ML-1710:''' Reichelt paper in the manual feed works wonderfully. Good results with original toner.

'''''IBM:'''''
* '''PagePrinter 3116:''' Reichelt paper on its own jams, but gluing it onto plain paper works flawlessly. 4 mil traces with 8 mil spacing are reproducible, and even huge ground planes come out solid black.

'''''Tektronix:'''''
* '''Phaser 740P:''' Reichelt paper glued (with a glue stick) onto 80 g/m^2 paper works wonderfully with original toner, but sometimes creases depending on how it was glued on.

== Links ==

* Detailed discussion of different variants in [http://www.mikrocontroller.net/forum/read-6-40012.html this thread] in the Mikrocontroller.net forum.
* http://myweb.cableone.net/wheedal/pcb.htm (shows that QFP and a pitch of 0.5 mm are feasible!)
* http://www.fullnet.com/u/tomg/gooteepc.htm
* http://www.roboternetz.de/wissen/index.php/Platinenherstellung
* http://thomaspfeifer.net/platinen_aetzen.htm (the method explained step by step with pictures and video) + controller for a laminator
* [http://comwebnet.weimars.net/index.php?option=com_content&task=view&id=18&Itemid=50 PCB production with adhesive films for better results]
* http://www.die-wuestens.de/dindex.htm?/platine.htm (special print transfer film)
* http://wiesi.dyndns.org/tt/tonertransfermethode-0_4.pdf

=== In the forum ===
*http://www.mikrocontroller.net/forum/read-6-40012.html
*http://www.mikrocontroller.net/forum/read-6-261025.html - laminator conversion
*[http://www.mikrocontroller.net/forum/read-6-284891.html Forum - toner transfer with the 2006 Reichelt catalog]
[[Category:Platinen]]