# DCF77 Decoder: Original ASM vs. C-compiled ASM Comparison

## Overview

This document compares the hand-written ATtiny4313 assembly (`DCF77_Decoder_020418.asm`) with the GCC-generated assembly (`DCF77_Decoder.s`) produced from the C translation (`DCF77_Decoder.c`) using:

```
avr-gcc -Os -mmcu=attiny4313 DCF77_Decoder.c -S
```

---

## 1. Structure & Mapping Decisions

### 1.1 SRAM Variables

| Original ASM | C Translation |
|---|---|
| `.dseg` / `.org 0x0060` with `.byte` directives | `static volatile uint8_t` globals |
| Fixed addresses starting at 0x0060 | Compiler-assigned `.bss` addresses (`.comm` directives) |

The original places variables at explicit SRAM addresses. GCC uses `.comm` (common block) symbols in `.bss`, so the linker assigns addresses. Functionally equivalent, but the exact memory layout may differ.

### 1.2 Register Allocation

| Original ASM | C / GCC output |
|---|---|
| Named registers: `dcfsec=r9`, `dcfcnt=r10`, `flags=r11`, `inttmp=r12`, `flags1=r13`, `intreg=r15`, `temp=r16`, `temp1=r17`, `temp2=r18` | All "register variables" become SRAM globals; GCC uses the call-clobbered registers (r18–r27, r30–r31) as scratch |
| Dedicated low registers (r9–r15) for ISR-persistent state | SRAM loads/stores (`lds`/`sts`) on every access |

**Impact:** The original keeps frequently used state in registers permanently, so it never touches memory for it. Because the C globals are `volatile`, GCC must reload and store them on every access. Each `lds`/`sts` is a two-word, two-cycle instruction, so this costs code space and a few cycles per access, but the behavior is functionally identical.

### 1.3 Interrupt Vector Table

| Original ASM | GCC output |
|---|---|
| Explicit `.org` for each vector (0x0000–0x0014), `rjmp` to handlers, unused vectors → `reti` | Only `__vector_4` (TIMER1_COMPA) defined; linker fills the vector table from `gcrt1.S` startup code |

The original manually defines all 21 interrupt vectors (including reset). In the GCC build, avr-libc's startup code and the linker script handle this automatically: unused vectors point to `__bad_interrupt`, which by default jumps to address 0x0000 (reset).

### 1.4 Startup / Reset Sequence

| Original ASM | GCC output |
|---|---|
| Manual stack pointer init, SRAM zero-fill loop, then `rcall regclr`, `portinit`, etc. | C runtime (`__do_clear_bss`, `__do_copy_data`) handles BSS zeroing and `.data` init; `main()` is called after |

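avr-libc's startup code (`gcrt1.S`) clears `r1` and `SREG` and initializes the stack pointer; the libgcc routines `__do_copy_data` and `__do_clear_bss` then copy `.data` from flash and zero `.bss` before `main()` is called. The original does this manually, zero-filling all SRAM from 0x0060 to RAMEND, whereas GCC clears only the `.bss` section.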

### 1.5 Function Call Convention

| Original ASM | GCC output |
|---|---|
| Arguments in `temp` (r16), `temp1` (r17), `temp2` (r18); return values in the same registers | avr-gcc ABI: 8-bit arguments in r24, r22, r20, … (a 16-bit first argument in r24:r25); 8-bit return value in r24 |

This is why the GCC output loads `r24` (via `mov r24,…` or `ldi r24,…`) before most `rcall`s: it places the argument where the ABI expects it.

---

## 2. Key Functional Differences

### 2.1 BCD-to-Binary Conversion

| Original | GCC |
|---|---|
| Loop: add 10 to `temp` while `r1 > 0` (decrementing `r1`), then `add temp, r0` | `rcall __mulqi3` (libgcc 8×8 software multiply helper), then `add r24, r29` |

GCC compiles `tens * 10 + ones` into a call to the libgcc helper `__mulqi3`, because the ATtiny4313 has no hardware `MUL` instruction. The original uses a repeated-addition loop instead. Both produce the same result; the GCC call site is shorter, although `__mulqi3` itself must also be linked in.
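A host-side sketch of the two strategies, assuming a packed-BCD input byte with the tens digit in the high nibble and the ones digit in the low nibble. Both function names are hypothetical; they are not taken from `DCF77_Decoder.c`.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the original ASM: add 10 once per tens digit. */
static uint8_t bcd_to_bin_loop(uint8_t bcd)
{
    uint8_t tens = bcd >> 4;        /* high nibble = tens digit */
    uint8_t result = 0;
    while (tens > 0) {              /* repeated addition, no multiply needed */
        result += 10;
        tens--;
    }
    return (uint8_t)(result + (bcd & 0x0F));  /* add the ones digit */
}

/* Hypothetical helper in the form GCC sees; on a core without MUL,
   avr-gcc may lower the "* 10" to a call to __mulqi3. */
static uint8_t bcd_to_bin_mul(uint8_t bcd)
{
    return (uint8_t)((bcd >> 4) * 10 + (bcd & 0x0F));
}
```

For every valid BCD byte (0x00–0x99) the two functions return the same value, e.g. 0x59 → 59.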

### 2.2 `nums()` — RS232-only Number Output

| Original | GCC |
|---|---|
| Separate `nums` subroutine with its own loop | Inlined into `mtsync`: two direct `senden` calls with literal '2' and '0' for the value 0x14 (=20 decimal) |

GCC recognized that `nums(0x14)` always outputs "20" and partially constant-folded it into two `senden` calls with the ASCII characters '2' (0x32) and '0' (0x30). This is a valid optimization.
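The folding is easier to see with a minimal sketch of what a two-digit `nums`-style routine computes, assuming it prints a binary value 0–99 as two decimal ASCII digits. The function name and output buffer are hypothetical; the real routine sends each character through `senden` instead.

```c
#include <stdint.h>

/* Hypothetical stand-in for nums(): splits a binary value 0..99 into two
   ASCII decimal digits instead of sending them over RS232. */
static void nums_to_ascii(uint8_t value, char out[2])
{
    uint8_t tens = 0;
    while (value >= 10) {           /* repeated subtraction finds the tens digit */
        value -= 10;
        tens++;
    }
    out[0] = (char)('0' + tens);    /* 0x14 (20) -> '2' (0x32) */
    out[1] = (char)('0' + value);   /* remainder -> '0' (0x30) */
}
```

With a constant argument such as 0x14, every step is known at compile time, so a compiler may replace the whole computation with the two literal characters.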

### 2.3 DCF77 Bit Shift Chain (ISR)

| Original | GCC |
|---|---|
| Loop: load X with `dcftab+6`, then 6 iterations of `ld -x` / `ror` / `st x` | Loop with Z-pointer: `movw r30,r18` + `ld`/`lsr`/`ror`/`or`/`st`, counter in r18 |

Both implement the same carry-chain shift through 6 bytes. The original uses X-register pre-decrement addressing (`ld inttmp, -x`); GCC uses computed Z-pointer addressing. The logic is equivalent; only the addressing mode and register allocation differ.

**Notable:** GCC's ISR shift loop has a subtle difference — it uses `lsr` + `ror` + `clr` + `ror` to inject the new bit, whereas the original uses `rol inttmp` / `com inttmp` / `ror inttmp` to invert and re-extract the carry. The net effect (shifting a new bit into byte 5's MSB and cascading through all bytes) is the same.
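In C, the carry flag has to be modelled explicitly. A minimal sketch, assuming a 6-byte table where the new bit enters the MSB of byte 5 and each byte's LSB falls into the MSB of the byte below, as described above. The function name is hypothetical, and any inversion of the received bit (the original's `com`) is assumed to happen before the call.

```c
#include <stdint.h>

#define DCFTAB_LEN 6

/* Shift new_bit into the MSB of tab[5]; each byte's LSB becomes the
   MSB of the next lower byte, like a ld -x / ror / st x chain. */
static void dcf_shift_in(uint8_t tab[DCFTAB_LEN], uint8_t new_bit)
{
    uint8_t carry = new_bit & 1u;                 /* incoming carry */
    for (int i = DCFTAB_LEN - 1; i >= 0; i--) {
        uint8_t out = tab[i] & 1u;                /* bit that "ror" pushes into carry */
        tab[i] = (uint8_t)((tab[i] >> 1) | (carry << 7));
        carry = out;                              /* feeds the next lower byte */
    }
}
```

A single 1 shifted in lands at bit 7 of byte 5 and then moves down one bit position per call, reaching bit 0 of byte 0 after 47 further shifts.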

### 2.4 Parity Checking

| Original | GCC |
|---|---|
| Uses `rol`/`ror` + `adc r22, r21` (where r21=0) to count set bits | Uses `sbrc rN, bit` + `subi r24, -(1)` to conditionally increment |

GCC replaces the carry-based bit counting with explicit bit-test-and-increment. Functionally identical, slightly different instruction sequences.
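Both counting styles can be checked against each other exhaustively. A sketch, assuming the bits covered by one parity bit have already been packed into a byte (for example the 7 minute bits checked by DCF77's P1). DCF77 uses even parity: the data bits plus the parity bit must contain an even number of 1s. The function names are hypothetical.

```c
#include <stdint.h>

/* Carry style: shift each bit out and add it (like ror + adc r22, r21 with r21 = 0). */
static uint8_t popcount_shift_add(uint8_t v)
{
    uint8_t count = 0;
    for (uint8_t i = 0; i < 8; i++) {
        count += v & 1u;        /* add the shifted-out bit */
        v >>= 1;
    }
    return count;
}

/* Bit-test style: test each bit and increment (like sbrc + subi r24, -1). */
static uint8_t popcount_bit_test(uint8_t v)
{
    uint8_t count = 0;
    for (uint8_t bit = 0; bit < 8; bit++) {
        if (v & (uint8_t)(1u << bit))
            count++;
    }
    return count;
}

/* Even parity: data bits plus parity bit must hold an even number of 1s. */
static int dcf_even_parity_ok(uint8_t data_bits, uint8_t parity_bit)
{
    return ((popcount_shift_add(data_bits) + (parity_bit & 1u)) & 1u) == 0;
}
```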

### 2.5 Month-Length Calendar (ISR)

| Original | GCC |
|---|---|
| Long chain of `cpi yl, N` / `brne` / `cpi r25, days` / `brcc t400` for each month | Optimized switch: groups months by days (30 vs 31), uses fewer comparisons |

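GCC does not know anything about calendars; it merges `switch` cases that share the same body. Because the C source returns the same value for months 4, 6, 9, 11 (30 days) and for months 1, 3, 5, 7, 8, 10, 12 (31 days), those cases collapse into a few shared branch targets with fewer comparisons. The original tests each month individually.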
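A sketch of a `switch` whose case labels share bodies, assuming the C source is written this way. The function name and the two-digit `year2` parameter are hypothetical; the `year2 % 4 == 0` leap rule is an assumption that holds for 2000–2099.

```c
#include <stdint.h>

/* Hypothetical helper: days in a month for a two-digit DCF77 year (00..99).
   The "% 4" leap rule is only correct for the years 2000..2099. */
static uint8_t days_in_month(uint8_t month, uint8_t year2)
{
    switch (month) {
    case 4: case 6: case 9: case 11:    /* shared body: one branch target */
        return 30;
    case 2:
        return (year2 % 4 == 0) ? 29 : 28;
    default:                            /* 1, 3, 5, 7, 8, 10, 12 (and invalid input) */
        return 31;
    }
}
```

The day-rollover test then reduces to `if (day > days_in_month(month, year2))`, instead of one compare-and-branch pair per month.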

### 2.6 Weekday Display (`wochentag_display`)

| Original | GCC |
|---|---|
| Chain of `cpi`/`breq` for each day, then load Z-pointer to flash string, `lpm` loop | Jump table via `ijmp` (computed goto), then load SRAM string pointer |

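GCC compiles the `switch` into an indirect jump table (`.L28`, a table of `rjmp` entries reached via `ijmp`), which replaces a chain of up to seven compares with a single computed jump. The strings are in `.rodata`, which the avr-gcc linker script places in SRAM rather than in program memory (flash), so they are read with `ld` instead of `lpm`.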

### 2.7 String Storage

| Original | GCC |
|---|---|
| Strings in `.cseg` (flash), accessed via `lpm` | Strings in `.rodata` → copied to SRAM by `__do_copy_data`, accessed via `ld` |

This is a significant difference. The original stores weekday/timezone strings in program memory and reads them with `lpm`. GCC places them in `.rodata`, which gets copied to RAM at startup. On a device with only 256 bytes of SRAM, this wastes ~32 bytes. To match the original, declare the strings with `PROGMEM` and read them with `pgm_read_byte()` (both from `<avr/pgmspace.h>`).
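A sketch of the `PROGMEM` approach, assuming two-letter German weekday names (hypothetical; the original string contents are not shown here) and DCF77's weekday encoding of Monday = 1 … Sunday = 7. Indexing a flash table by weekday also removes the need for either a compare chain or a jump table. The `#else` branch is a host-only shim so the sketch can be compiled and tested off-target; on AVR the real `<avr/pgmspace.h>` macros are used.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/pgmspace.h>               /* PROGMEM, pgm_read_byte() */
#else
/* Host-only shim: plain const data, ordinary loads. */
#define PROGMEM
#define pgm_read_byte(addr) (*(const uint8_t *)(addr))
#endif

/* Hypothetical names, stored in flash on AVR (3 bytes each incl. NUL). */
static const char weekday_names[7][3] PROGMEM = {
    "Mo", "Di", "Mi", "Do", "Fr", "Sa", "So"
};

/* Copy the name for DCF77 weekday wd (1..7) into out; return 0 if out of range. */
static int weekday_name(uint8_t wd, char out[3])
{
    if (wd < 1 || wd > 7)
        return 0;
    for (uint8_t i = 0; i < 3; i++)     /* byte-wise flash reads, like an lpm loop */
        out[i] = (char)pgm_read_byte(&weekday_names[wd - 1][i]);
    return 1;
}
```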

### 2.8 LCD Backlight Handling

| Original | GCC |
|---|---|
| Uses `sbi`/`cbi` on the port pin (bit-level I/O) | Same: `sbi 0x18,6` / `cbi 0x18,6` |

Identical — GCC correctly optimizes single-bit port operations to `sbi`/`cbi` instructions.

### 2.9 Delay Loops

| Original | GCC |
|---|---|
| `sbiw xl:xh, 1` / `brne` (tight 16-bit decrement loop) | `volatile` stack variable: `ldd`/`std` + `subi`/`sbc` + `or` + `brne` |

The original's delay loop costs 4 cycles per iteration (`sbiw` takes 2 cycles, a taken `brne` takes 2). GCC's `volatile` counter forces a load–modify–store through the stack frame on every iteration, and each `ldd`/`std` costs 2 cycles, so an iteration takes several times as long. The delays therefore come out roughly 3× longer than intended, or more, depending on the exact sequence GCC emits. To match timing exactly, use inline assembly delay loops or `_delay_us()`/`_delay_ms()` from `<util/delay.h>`.
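The timing difference is plain cycle arithmetic. A small helper (hypothetical) converts a loop count into a delay; the 8 MHz clock and the 14-cycle GCC iteration used below are illustrative assumptions, not values measured from `DCF77_Decoder.s`.

```c
#include <stdint.h>

/* Delay in microseconds produced by a busy loop of `count` iterations,
   each costing `cycles_per_iter` CPU cycles at a clock of f_cpu_hz. */
static uint32_t loop_delay_us(uint32_t count, uint32_t cycles_per_iter, uint32_t f_cpu_hz)
{
    uint64_t cycles = (uint64_t)count * cycles_per_iter;  /* 64-bit avoids overflow */
    return (uint32_t)(cycles * 1000000u / f_cpu_hz);
}
```

At 8 MHz, a count of 2000 calibrated for the 4-cycle `sbiw`/`brne` loop gives 1000 µs; the same count at 14 cycles per iteration gives 3500 µs, a 3.5× stretch.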

---

## 3. Code Size Comparison

| Metric | Original | GCC -Os |
|---|---|---|
| Code segment | 2064 bytes (50.4% of 4096) | Not measured here; likely larger due to ISR prologue/epilogue overhead and ABI compliance |
| Data segment | 27 bytes | 27 bytes globals + ~32 bytes string copies in RAM |
| ISR register saves | 2 registers (intreg, inttmp) | 15 registers pushed/popped (full ABI-compliant save) |

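The original ISR saves only `SREG` and otherwise works in dedicated registers (r9–r15) that no other code uses, so they never need saving. A GCC ISR must preserve every register it might clobber, because the interrupted code (`main()`) may be using it. The prologue always saves `r0`, `r1`, and `SREG`; a full save of all call-clobbered registers (r18–r27, r30–r31) on top of that typically means the handler calls a non-inlined function whose register use GCC cannot see.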

---

## 4. Correctness Notes

### 4.1 Potential Issues in the C Version

1. **Delay timing**: As noted above, the `volatile` loop approach produces different timing than the original `sbiw` loop. For LCD timing this is acceptable (delays are longer, not shorter), but it's not cycle-accurate.

2. **String storage**: Using SRAM instead of PROGMEM wastes RAM. On ATtiny4313 with 256 bytes total, this could be critical.

3. **Atomic access**: The C version uses `cli()`/`sei()` pairs around multi-byte volatile reads, matching the original's approach.

4. **Port I/O addresses**: GCC correctly uses `in`/`out` for I/O registers in the 0x00–0x3F range and `sbi`/`cbi` for the bit-addressable 0x00–0x1F range, matching the original.

### 4.2 Things GCC Got Right

- The parity logic produces identical results (verified by bit-counting equivalence)
- The calendar rollover logic handles all months correctly including leap years
- The DCF77 bit-shift chain correctly propagates carry through all 6 bytes
- The `mtsync` output sequence matches the original's LCD + RS232 interleaving
- Interrupt enable/disable placement matches the original

---

## 5. Summary

| Aspect | Verdict |
|---|---|
| Functional equivalence | ✅ Logically equivalent behavior |
| Register usage | Different — GCC uses SRAM for all state, original uses dedicated registers |
| Code density | Original is more compact (hand-optimized for the target) |
| ISR overhead | GCC adds significant prologue/epilogue (15 push/pop vs. 2) |
| Timing accuracy | Delay loops differ by ~3× (GCC loops are slower) |
| Memory efficiency | Original is better (flash strings, no ABI overhead) |
| Maintainability | C version is far more readable and maintainable |

The C translation faithfully reproduces the original's logic. The differences are all consequences of compiler ABI compliance, optimization strategy, and the inherent overhead of translating register-resident state to memory-resident variables.
