Forum: Projekte & Code Fast JPEG decoder on AT(x)mega

von Horst M. (horst)

19.12.2015 09:43

Angehängte Dateien:

JPEG_Decoder.zip (591 KB)

Lesenswert?

•

This is my implementation of a fast JPEG decoder on 8-bit AVR.
It has been written entirely in assembly language (AVR Studio 4 project) 
and needs approx. 5 Kbytes Flash and 2.5 Kbytes RAM. The core IDCT is 
based on the implementation used in Tiny JPEG decompressor, (C) Chan 
2012.
The decoder supports only sequential 8 bit color JPEG/JFIF images with 
4:2:0 horizontal&vertical subsampling, 4:2:2 horizontal or vertical 
subsampling and 4:4:4 no subsampling. While testing I've used GIMP for 
resizing/encoding the images, offering precise control over the encoding 
parameters.
There's only a very basic error handling implemented in the decoder, 
it's up to the user to provide valid input data the decoder can handle.
During troubleshooting I've got very big support from JPEGsnoop, a quite 
helpful tool for examining the structure of an JPEG file and to follow 
the decoding process.
Development was done for ATxmega32A4U but the decoder itself doesn't 
depend on special Xmega features and can be adapted to run on ATmega.
Please note: It's not intended to be a copy&paste solution, you have to 
adapt the source to the particular resources of your environment, i.e. 
controller, display etc.

Find a video to illustrate the speed of the decoder with various clock 
frequencies and display dimensions at 
https://www.youtube.com/watch?v=0XEoSdZ-Ek0
Compare with https://www.youtube.com/watch?v=Y0Zuu6Df5mQ , 
https://www.youtube.com/watch?v=oMnfp5ZVUkE or 
https://www.youtube.com/watch?v=qzLS68eUM0U

The source code provides three different ways to feed the decoder with 
input data.

1. Image data in controller's flash memory
Images are included in the code, capable to use flash sizes >65536 
bytes.

2. Image data stored on external RAM, connected via SPI
Tested with 23LC1024 serial RAM (128 KBytes). There's a simple loader 
implemented to receive a prepared image library via serial interface and 
to store it into the serial RAM.
Code could be extended to connect an SPI Flash.

3. Image data received over serial interface
Send JPEG images from PC over serial interface (8n1, RTS handshaking) 
with up to 921 Kbit/s.


While the speed of the decoder is near the maximum of what the AVR 8 bit 
architecture is capable of, I'm not completely satisfied with the 
Huffman decoder part.
If anybody has an idea how to build a decoding tree using RAM more 
efficiently but with similar performance, let me know. There might be a 
chance to get a slimmed-down version of this code (but still for color 
images) working on controllers with only 2 KBytes RAM.

Of course, there are various ways to further optimize the memory usage 
with regards to the limited resources of an 8 bit microcontroller if the 
constraints are more tightened.
If you only want to handle grayscale JPEG files on your B&W - or 
monochrome, to be more general - display, the RGB conversion subroutines 
(consuming a whole lot processing power) incl. the coefficient tables 
can be thrown away, as well as the conversion part (Huffman decoding & 
IDCT) for the Cb/Cr blocks including the corresponding arrays in RAM. 
This will increase the decoding speed significantly, almost doubled as a 
rough guess.
And if the format of the grayscale JPEG images is fixed to 4:4:4 no 
subsampling, the RAM requirements will be satisfied with 2 Kbytes.

An approach to reduce the size of the JPEG image data itself - 
especially for small pictures, e.g. 160x128 pixels - could be to omit 
optimization of the image during encoding. It might sound paradox, but 
let me explain.
"No optimization" usually means the Huffman trees are encoded based on 
the JPEG spec (ITU T.81). Hence the DHT segment contains exactly the 
same data in every JPEG image encoded this way. So it's no problem to 
remove this part from the file and to use precalculated Huffman trees in 
the decoder (the decoder could even be optimized to handle more than one 
bit at a time when walking through the tree, increasing the speed once 
more). Of course, it does only make sense if multiple images have to be 
handled.

Beitrag melden Bearbeiten Thread verschieben Thread sperren Anmeldepflicht aktivieren Anpinnen Thread löschen Thread mit anderem zusammenführen Markierten Text zitieren Antwort Antwort mit Zitat

Re: Fast JPEG decoder on AT(x)mega

von c-hater (Gast)

21.12.2015 16:27

Lesenswert?

•

▲
▼

Horst M. schrieb:

> If anybody has an idea how to build a decoding tree using RAM more
> efficiently but with similar performance, let me know.

Hmm...

Ich muss zugeben, dass ich bisher weder deinen Code intensiv gelesen 
habe noch sehr "familar" mit den Innereien des JPEG-Standards bin.

Aber, nach dem, was ich über den JPEG-Standard weiss, steht doch bei 
"Custom"-Decodingtables die Tabellendefinition vollständig und optimal 
bereits im File selber, oder?

Wenn das so ist, braucht der Kram in zwei der von dir genannten drei 
Möglichkeiten für die Datenquelle überhaupt nicht in das SRAM 
transferiert werden, man liest einfach direkt aus der Quelle. Im Falle 
des internen Flash dürfte der Overhead (verglichen mit dem Rest der 
Vorstellung) vernachlässigbar klein sein und auch bei seriellen Flash 
noch nicht allzusehr in's Gewicht fallen. Man muss ja auch bedenken, 
dass dafür der Aufwand entfällt, denn Kram überhaupt erstmal in's RAM zu 
transferieren.

Und was andere Datenquellen betrifft: Warum nicht auch die 
Decodingtabelle statt in's RAM in den EEPROM oder den Flash packen? 
Wobei ich hier wegen der deutlich höheren Lese-Effizienz Flash 
bevorzugen würde. Da bliebe dann nur das Problem der relativ geringen 
Zahl von Erase/Write-Zyklen des Flash.

Und das ist der eigentliche Grund, warum ich überhaupt etwas zu deinem 
Posting schreibe. Ich arbeite nämlich gerade an einem Stück Code, 
welches unbenutzten Flash als eine Art Datei (mit wear-leveling) 
bereitstellt. Ein Mittelding zwischen der Gerätedatei eines Blockdevice 
und einer echten Datei. Um das Teil für deine Anwendung fit zu machen, 
wäre es eigentlich nur noch nötig, das wear-leveling so abzuändern, das 
es möglich wird, "on-demand" einen Block bestimmter Größe temporär aus 
dem wear-leveling herauszunehmen und dafür zu sorgen, dass die 
physischen Adressen der Pages des Blocks direkt aufeinander folgen.
Wie es der Zufall will, war genau das sowieso geplant (wenn auch noch 
nichtmal ansatzweise realisiert), um nachladbaren Code effizient zu 
implementieren. Es würde aber natürlich genausogut für ladbare konstante 
Datenstrukturen funktionieren.

Wäre das aus deiner Sicht ein sinnvoller Weg zur Umgehung der 
RAM-Misere?

Beitrag melden Bearbeiten Löschen Markierten Text zitieren Antwort Antwort mit Zitat

Re: Fast JPEG decoder on AT(x)mega

von Horst M. (horst)

22.12.2015 12:38

Lesenswert?

•

▲
▼

c-hater schrieb:
> Aber, nach dem, was ich über den JPEG-Standard weiss, steht doch bei
> "Custom"-Decodingtables die Tabellendefinition vollständig und optimal
> bereits im File selber, oder?

Zur Dekodierung eines Huffman-kodierten Bitstroms braucht es einen 
Huffman-Baum. Ein DHT-Segment in einer JPEG-Datei enthält nur die 
Beschreibung, wie der entsprechende Huffman-Baum konstruiert werden muß, 
aber nicht den Baum selbst, d.h. es gibt für jede Codewortlänge (1, 2, 3 
usw. bis 16 Bits) eine Liste mit den Zielwerten, die bei der Dekodierung 
letztlich herauskommen sollen.
Die Art und Weise, wie die Knoten des Huffman-Baums miteinander 
verknüpft werden, ist natürlich nicht vorgeschrieben, wohl aber die 
Zuweisung zu den Codeworten (sonst ergibt die Dekodierung irgendwas, 
aber nicht die ursprünglichen Daten).
Die DHT-Segmente (auch wenn es nur die in ITU T.81 vorgeschlagenen sind) 
sind in jeder JPEG-/JFIF-Datei enthalten, es gibt m.W. keinen Default, 
der ansonsten zu verwenden ist. Deshalb müssen die Huffman-Bäume immer 
konstruiert werden, wenn man nicht zusätzliche Einschränkungen definiert 
(wie ich am Ende des Eröffnungspostings beschrieben habe).

Meine Implementation des Huffman-Baums verwendet pro Knoten zwei Byte 
(egal ob Verzweigungs- oder Endknoten), eins als Pointer zum nächsten 
Knoten und eins für den dekodierten Wert, falls es ein Endknoten ist. 
Das ist schön homogen für die Adreßarithmetik, aber im Falle von 
Verzweigungsknoten ist das Wertebyte natürlich verschwendet. Ich hatte 
bislang keine gute Idee, wie das ähnlich schnell (v.a. bei der 
Konstruktion des Baums) und kompakt (bzgl. des Programmcodes), aber mit 
weniger RAM-Bedarf realisierbar wäre.

Beitrag melden Bearbeiten Löschen Markierten Text zitieren Antwort Antwort mit Zitat

Thread beobachten |

Seitenaufteilung abschalten

Bitte melde dich an um einen Beitrag zu schreiben. Anmeldung ist kostenlos und dauert nur eine Minute.

Bestehender Account

Schon ein Account bei Google/GoogleMail? Keine Anmeldung erforderlich!
Mit Google-Account einloggen

Noch kein Account? Hier anmelden.

Kontakt/Impressum – Datenschutzerklärung – Nutzungsbedingungen – Werbung auf Mikrocontroller.net