DSA Hype Thread 2: The Rehypening

zahlman · August 30, 2020, 5:14pm

It’s that time again! I’m starting the thread over because old thread is old - and while the OP provides useful information, it no longer makes a good OP.

What `dsa` is and how it works

dsa - a data structure assembler - is designed as a fully general-purpose tool/library for presenting data from a binary source in a readable form and then re-creating binary data that can be written back into a binary. For FEGBA hacking, that means an “everything assembler” that potentially replaces pretty much every command-line tool you’ve ever used (but especially EA and Nightmare).

Overview of what it looks like and how it works

The disassembler produces a file that describes data as a series of chunks, and which can follow pointers to determine the start of new chunks. The chunk data’s format can be described in various ways, using so-called interpreters. Most commonly, a chunk will be described as a series of data structures, described by a structgroup (a kind of interpreter) that allows for specifying various data types.

Example of disassembly results

With the built-in data descriptions, you can get results that look like:

!@main 0x0 hex
HEXD 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
HEXD 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
!# 0x20

Here you can see each struct is identified by its name, HEXD (this is analogous to an event code in EA). Normally, hexadecimal values would need a 0x prefix; but a custom data type is used by the HEXD struct in the hex structgroup to avoid this.
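To make the listing format concrete, here is a minimal Python sketch that produces the same kind of output from raw bytes. It is not dsa's implementation: the function name `hex_chunk` and the fixed 16-byte row width are assumptions made for illustration, and the meaning of the trailing `!#` comment (the address just past the chunk) is inferred from the examples in this thread.

```python
def hex_chunk(data: bytes, start: int = 0, name: str = "main", width: int = 16) -> str:
    """Render bytes as a chunk of HEXD lines (illustrative only, not dsa's code)."""
    lines = [f"!@{name} {start:#x} hex"]
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        # Two uppercase hex digits per byte, with no 0x prefix, as in the listing.
        lines.append("HEXD " + " ".join(f"{b:02X}" for b in row))
    # Assumption: the closing comment holds the address just past the chunk's end.
    lines.append(f"!# {start + len(data):#x}")
    return "\n".join(lines)


print(hex_chunk(bytes(range(32))))
```

Run on the 32 bytes `00` through `1F`, this prints the same four lines as the example listing.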

There is also a system that allows the disassembler to label each struct in the output and, once I get it properly integrated/tested/fixed, a way to use those labels either to create pointers to structs within a chunk or to get their index.

Interpreters can also be written as Python plugins, and the behaviour can be further customized using filters that transform the data extracted from the binary before the interpreter translates it into a text description.

Interpreter and filter plugins

The built-in string interpreter extracts a null-terminated string (it actually handles mixed binary and text data), producing fancy results like:

!@main 0x0 [string, utf-8, basic]
'This is a [Open]test[Close].[NL]'
'日本語、[0xc0][0xc1]かわいい！！'
!# 0x33

As a side note, if we were writing the disassembly by hand (to be assembled into the target), we could equivalently have written:

!@main 0x0 string:utf-8:basic # Everything after a `#` on a line is a comment.
# Notice the alternate syntax for the "multi-part token".
'This is a [Open]test[Close].[NL]'
'日本語、[0xc0][0xc1]かわいい！！'
# The `0x33` in the previous example was a comment inserted by the disassembler
# and is not necessary. The line starting with `!` ends the block.
!
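For a rough idea of what the string interpreter has to do with mixed binary and text data, here is a small Python sketch. It is not the plugin's actual code: the helper names `split_c_strings` and `render` are hypothetical, and it only handles the `[0x..]` byte escapes, not named codes such as `[Open]` or `[NL]` (those depend on the codec, which is not reproduced here).

```python
def split_c_strings(data: bytes) -> list[bytes]:
    """Split a chunk into its null-terminated pieces (hypothetical helper)."""
    parts = data.split(b"\x00")
    if parts and parts[-1] == b"":
        parts.pop()  # drop the empty piece that follows the final terminator
    return parts


def render(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode text, showing undecodable bytes as [0x..] tokens."""
    # surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
    text = raw.decode(encoding, errors="surrogateescape")
    return "".join(
        f"[{ord(ch) - 0xDC00:#x}]" if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


sample = "日本語、".encode("utf-8") + b"\xc0\xc1" + "かわいい！！".encode("utf-8") + b"\x00"
for piece in split_c_strings(sample):
    print(repr(render(piece)))
```

The bytes `0xc0` and `0xc1` can never appear in valid UTF-8, which is why they come out as escape tokens rather than as characters.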

Because those plugins are written in Python, there is no need to shell out for them. They are dynamically loaded as part of the language set up when you start the assembler or disassembler.

When filters are used, the resulting disassembly chunks get tagged like so:

!size 32
!@main 0x0 hex
HEXD 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
HEXD 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
!# 0x20

size is the only built-in filter; it basically ensures the size of the underlying binary data. When disassembling, the value is determined by how much data was disassembled; when assembling, it will truncate the data if it’s too long, and zero-pad it if it’s too short. However, the filter mechanism allows dsa-extras to handle compression methods like lz77, rearrange graphics from 8x8 tiles into what they should actually look like, etc.
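The assembly-side behaviour of size can be sketched in a few lines of Python. This is an illustration, not the filter's real code: `fit_to_size` and its `strict` flag are made-up names. The flag exists because the paragraph above describes truncating over-long data, while the 2019 status tracker entry describes erroring out instead.

```python
def fit_to_size(data: bytes, size: int, strict: bool = False) -> bytes:
    """Force data to exactly `size` bytes (hypothetical stand-in for the size filter)."""
    if len(data) > size:
        if strict:
            # Behaviour described in the status tracker: refuse over-long data.
            raise ValueError(f"chunk is {len(data)} bytes, expected at most {size}")
        # Behaviour described in this post: silently truncate.
        return data[:size]
    # Too short (or exact): pad with zero bytes up to the requested size.
    return data.ljust(size, b"\x00")


print(fit_to_size(b"\x01\x02", 4).hex(" "))
print(fit_to_size(b"\x01\x02\x03", 2).hex(" "))
```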

Types and structgroups are described using plain-text files.

Type and structgroup examples

Structgroup files look like:

align:1 # there is always a single header line,
# followed by one or more sections describing possible structs.

EXAMPLE
    my_type x
    my_type y

The types are in separate files, and look like:

# 32-bit value that directly gives the location of another chunk in the file.
pointer example_pointer 32
    size # this filter will be applied to the pointed-at chunk.

# A simple integer type.
type Byte
    8 value

# A type with multiple fields, restricted value ranges and custom value names.
type my_type
    8 first values:my_enum
    8 second values:my_enum

enum my_enum
    0 fee
    1 fie
    2 foe
    3 fum
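In Python terms, the `my_type` and `my_enum` definitions above describe roughly the following mapping between bytes and names. This is only a sketch of the idea: it assumes the two 8-bit fields are stored in declaration order, one byte each, and the function names are invented for the example.

```python
# Mirror of the my_enum definition: value -> name, plus the reverse lookup.
MY_ENUM = {0: "fee", 1: "fie", 2: "foe", 3: "fum"}
MY_ENUM_VALUES = {name: value for value, name in MY_ENUM.items()}


def unpack_my_type(raw: bytes) -> tuple[str, str]:
    """Turn two bytes into the (first, second) names of a my_type value."""
    if len(raw) != 2:
        raise ValueError("my_type is two 8-bit fields, so exactly 2 bytes are needed")
    first, second = raw
    # A value outside 0..3 raises KeyError, standing in for the restricted range.
    return MY_ENUM[first], MY_ENUM[second]


def pack_my_type(first: str, second: str) -> bytes:
    """Turn the two names back into the bytes they stand for."""
    return bytes([MY_ENUM_VALUES[first], MY_ENUM_VALUES[second]])


print(unpack_my_type(b"\x00\x03"))
print(pack_my_type("fie", "foe"))
```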

Lastly, interpreters (TODO: and filters) can be configured using codecs that read data from a similarly-formatted text file and interpret that data with more Python code.

“But can I `make hack` with it?”

The built-in stuff is, by design, pretty basic - because I want to keep the GBAFE-specific stuff separate for a variety of reasons. I will be providing all of that in a separate package called dsa-extras.

Using `dsa` and `dsa-extras`

Four command-line programs are provided by dsa:

- dsa - assembly mode.
- dsd - disassembly mode. There is an option to test the results by immediately re-assembling them into the original binary (without writing to disk) to see if any corruption results.
- dsa-use - adds a specified “library” folder path to the places DSA will look for structgroups, types, filters, interpreters and codecs.
- dsa-drop - removes a path from the above list.
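The re-assembly check that dsd offers boils down to a round trip: disassemble, assemble the result in memory, and compare with the original bytes. Here is a generic Python sketch of that idea. It does not use dsa's API (which is not shown in this thread); `roundtrip_ok` and the toy hex functions are stand-ins invented for the example.

```python
from typing import Callable


def roundtrip_ok(
    original: bytes,
    disassemble: Callable[[bytes], str],
    assemble: Callable[[str], bytes],
) -> bool:
    """Return True if assembling the disassembly reproduces the original bytes."""
    return assemble(disassemble(original)) == original


# Toy stand-ins for a real disassembler/assembler pair: a plain hex dump.
def toy_disassemble(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


def toy_assemble(text: str) -> bytes:
    return bytes.fromhex(text)  # fromhex ignores the spaces between byte pairs


data = bytes(range(16))
print(roundtrip_ok(data, toy_disassemble, toy_assemble))
# A lossy "disassembler" that drops the last byte fails the check.
print(roundtrip_ok(data, lambda d: toy_disassemble(d[:-1]), toy_assemble))
```

Nothing is written to disk here either; the comparison happens entirely in memory.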

Each of these has its own command-line help, powered by another library I wrote called epmanager. However, the code is also designed to be imported as a package and used from other Python programs. One day I hope to make a hex editor that leverages dsa to describe the data as you scroll over it.

The dsa-extras package includes:

- a ton of structgroups and types to describe event codes, a bunch of NMMs and maybe a few other things, for FE6/7/8
- some helper scripts that I used to produce the above (don’t expect to be able to use them out of box; talk to me if you think they might be useful for you)
- codecs, filters and interpreters to deal with Huffman-compressed text and GBAFE text codes, LZ77-compressed images, and possibly more
- a post-install script (basically it calls dsa-use for you so that you don’t have to figure out the path to where dsa-extras was installed)

I might also make a master install script so that you don’t have to understand Python packaging stuff, although I’ll be explaining all of that in a separate thread in the near future.

To migrate your EA files and other such content, the recommended approach is:

1. Build your ROM as before, with the old tools.
2. Use DSA to disassemble the built ROM.
3. Use the resulting disassembled files going forward.

If you need help with this, let me know and I’ll see what I can do for you.

Project status etc.

Please see the next post for live updates.

I’ll be doing some kind of promotional video for dsa and dsa-extras myself for FEE3 (it’ll probably be pretty basic). I intend to spend a good chunk of the next two weeks polishing things up because it’s definitely not ready yet.

Boring stuff about licensing

DSA now has a proper, open-source license: the Open Software License 3.0. It’s similar to LGPL, except it’s a lot shorter, it’s based more on the principles of contract law than just on copyright, and it includes a “network use is distribution” clause (with the GPL system you would need the Affero variant to get this, which is hideously complex and also otherwise more restrictive than LGPL). Unlike many other licenses, it also includes one tiny bit of “warranty” (basically, it represents that I am not blatantly plagiarizing the code).

The documentation for DSA (as little as there is at the moment) is separately licensed as Creative Commons BY-NC-SA 4.0.

dsa-extras content is made as free as possible, via the Unlicense, because it’s not the part of this project that I’m really interested in promoting myself with.

Also, I completely re-thought the versioning and realized that the project isn’t stable enough to be on a 1.x version number, let alone a 2.x one. In particular, I started taking semantic versioning principles more seriously, along with migrating to a more modern build/packaging setup (with automated tests and everything!).

How to help

If you make or maintain a graphical tool for FE hacking, it would be amazing of you to design it to output data for DSA to assemble (or at least add the option to), rather than editing the ROM directly. Also, you can contribute content for dsa-extras (as long as you’re ok with the licensing).

It would also be amazing if someone could provide syntax-colouring support for this thing, or even point me in the right direction for implementing that.

(I’ve already talked to @Lexou a bit about tool integration, and I’m hoping to collaborate with @Pikmin1211 to make sure that Japanese text handling works smoothly.)

zahlman · September 2, 2020, 1:49pm

Project status tracker

Current pushed and tagged dsa version: 0.19.3+301
dsa-extras commit count: 194

2019

Library:
- New formats for type and structgroup files and disassembly listings
  - Standard library now includes a hexdump type and corresponding hex structgroup that produce hex-dump output. The standard library will be kept quite minimal since, again, it needs to have general-purpose application; dsa-extras has all the fun stuff.
- Support for “filter” plugins
  - size filter implemented that tracks the size of chunks on disassembly; on assembly, it zero-pads if the data is too short, and errors out if it’s too long
  - Filters are applied to each chunk in a “chain” determined by the chunk pointer, and (buggy) the size of the data is tracked at each step
- Support for “interpreter” plugins that implement parsing in Python instead of with a structgroup
  - file interpreter implemented which dumps a chunk into a separate binary file and leaves the file name in the disassembly
  - string interpreter implemented which translates between a sequence of null-terminated strings and (encoding, string) pairs (n.b. this behaviour was changed!)
UI:
- Paths to extra library files to use are now specified with a separate path config file
- System for tracking extra locations to look for system library content (via command-line tools dsa-use and dsa-drop), so that dsa-extras content can be treated as part of the system library while keeping separate folders
- whereisdsa command-line tool gives the root folder of the installed copy of DSA, for interop purposes
- More detailed output logging system
- Miscellaneous improvements to error reporting
General:
- New core code for disassembly algorithm
- Support for quoted-string tokens (so filenames, extracted strings etc. don’t get mangled)
- Support for byte-array struct members, with ability to specify string encoding
- Disassembly syntax now supports chunk-internal labels, which can be automatically generated (based on a type enumeration) during disassembly
dsa-extras:
- Generated a ton of type and structgroup files based off lightly edited .nmms and EA raws. Please note that while the tools used are included and designed to be used from the command line, I don’t expect them to be particularly useful for anyone else. Some cleanup was done on these, and more is necessary.
  - nmm2dsa tool converts files from the .nmm format into type and structgroup files, with primitive support for pointers (based on NMM column names) and enumerations.
  - ea2dsa tool converts EA raws into structgroup files. A bunch of one-off hacks are used to clean up inconsistencies in the raws format, and heuristics are used to map various wonky event-code contents into nicer, pre-defined types. For example, there is a hard-coded AI-behaviour type that takes recent research into account.
- gbalz77, png and tileimg filters implemented to translate from ROM bitmap data into .png images and back. A bunch of pointer types are provided specifying the needed filter chains. The portrait filter additionally allows for rearranging portrait data into the standard spritesheet format.
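The tile rearrangement those image filters deal with can be illustrated in plain Python. This is a sketch of the general GBA 4bpp layout (32 bytes per 8x8 tile, two pixels per byte, low nibble on the left), not the tileimg filter's actual code; the function name and the `tiles_wide` parameter are assumptions for the example.

```python
TILE_SIZE = 8        # tiles are 8x8 pixels
BYTES_PER_TILE = 32  # 4 bits per pixel: 8 rows of 4 bytes


def tiles_to_pixels(data: bytes, tiles_wide: int) -> list[list[int]]:
    """Lay 4bpp tiles out left-to-right, top-to-bottom as rows of palette indices."""
    if len(data) % BYTES_PER_TILE:
        raise ValueError("data must be a whole number of 32-byte tiles")
    n_tiles = len(data) // BYTES_PER_TILE
    tiles_high = -(-n_tiles // tiles_wide)  # ceiling division
    image = [[0] * (tiles_wide * TILE_SIZE) for _ in range(tiles_high * TILE_SIZE)]
    for t in range(n_tiles):
        tile = data[t * BYTES_PER_TILE:(t + 1) * BYTES_PER_TILE]
        left, top = (t % tiles_wide) * TILE_SIZE, (t // tiles_wide) * TILE_SIZE
        for row in range(TILE_SIZE):
            for pair in range(4):
                byte = tile[row * 4 + pair]
                image[top + row][left + pair * 2] = byte & 0xF   # left pixel
                image[top + row][left + pair * 2 + 1] = byte >> 4  # right pixel
    return image


# Two tiles side by side: the first alternates 1,2 and the second 3,4.
pixels = tiles_to_pixels(bytes([0x21] * 32 + [0x43] * 32), tiles_wide=2)
print(pixels[0])
```

The result is a grid of palette indices; turning that into a .png still needs a palette and an image library, which this sketch leaves out.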

2020

[TODO: stuff from before the last push in aug-sep]
[+] Continue cleaning up structgroups generated by the tools and fixing up types
[+] Text (Huffman) support
[+] Installer spit and polish for release
[+] Other stuff that I’ll explain later

Planned for future releases:

[ ] Make it possible to refer to chunk-internal labels when writing new assembly, and possibly infer the use of such labels in disassembly. Make it possible to translate chunk-internal labels back into indices in new assembly (there’s no point in disassembly, since the labels came from an enumeration anyway).
[ ] Ability to omit trailing type fields if they match “default” values
[ ] Ability to treat empty tokens ([]) as “default” values for struct members (only if every field has a default)
[ ] Support for chunks where the length is implied by the content of a header struct
[ ] Audio definitions
[ ] ASM support
[ ] Repeatable single-member structs (like how WORD etc. work in EA)
[ ] Inline variable-length content (the thing @Zeta is talking about above)
[ ] More stuff [TODO]

Development Diary

Sep 7: Ugh, got sidetracked for a few days due to RL stuff, and writing documentation turns out to be slower than I expected. I also implemented a (much) better system for managing which config files are used for a given (dis)assembly run. Anyway, I’m going to say I have enough tests for now, and tomorrow I’ll be finishing off the documentation for this release, for real.

Sep 8: Wrote a huge chunk of documentation; realized there’s still a huge chunk undocumented that’s simply going to have to wait for now. It really does take an appalling amount of time to figure out how to put these concepts into words properly, and I’m not very satisfied with what I have so far - but I have a deadline. Oh, also made a couple of minor tweaks to things. Tomorrow will be focused on making sure dsa-extras content is properly organized, and building up to an acceptance test for FE8 content as well as FE7.

Sep 9: I should supply some formal tests, especially for the assembly side; but the Huffman support looks to be done for now. I also started organizing the dsa-extras “library” content properly according to the new scheme, and fixed some bugs in the event-related FE8 config. Next up is to process some FE8 NMMs and get some config set up for that, and ensure that it works as smoothly as the FE7 stuff does (most of which hasn’t been touched for nearly a year, and doesn’t really need to be).

Sep 11: Super busy with cleaning up FE8 stuff. So far, the chapter data table works, as do all events, and some related data types. The rest of it should be easier.

Sep 12: Grind grind grind. Got FE8 more or less sorted out; now to make sure old FE7 stuff is up to snuff, extract things that are common to both, and round off any rough edges.

Sep 13: Ready for release! Got everything cleaned up as best I can figure out, and I also have a demo script that just runs all the available disassembly options. Going to make a proper installer package tonight, and work on a video tomorrow.

Sep 14: Submitted a video with almost three hours to spare if you assume PST. Just made a quick promotional trailer since video editing (especially to my standards) is actually really time consuming, so there was no hope of preparing even a minimal demo hack.

zahlman · September 12, 2020, 3:05am

Teaser content

I’ve been working on ensuring FE8 is supported as well as FE7 was (plus, you know, the improvements to the core DSA engine that have happened since last time).

Portraits

Yep. Remember Greyliwood? I give you Ghreyb:

(Image: Gheb Main Portrait)

I’m probably not getting a chance this go-around to make the enhancements I wanted here (stitching the pieces together; making good use of the new codec system for the image filters; using Numpy and imageio to process image data).

Other NMM Stuff

Chapters in the chapter data table get automatically labelled. They look like:

@Prologue
DATA @[Prologue Debug Name] [Bitmap 1] None [Palette 1] [Tilemap 1] Prologue
+   [Animations 1] None Prologue 0x0 False 0x0 0x1 Normal Normal 0x10
+   [Distant Roads, Shadow of the Enemy, Shadow of the Enemy]
+   [Distant Roads, Shadow of the Enemy, Shadow of the Enemy] 0x9 0xffff 0xb
+   0xffff 0xffff 0x32 [5, 5, 5, 5] [6, 6, 6, 6] [7, 7, 7, 7] [8, 8, 8, 8]
+   [100, 100, 100, 100] [80, 80, 80, 80] [60, 60, 60, 60] [40, 40, 40, 40]
+   [1060, 1060, 1060, 1060] Text<0x160> Text<0x160> Prologue Prologue None
+   0x1 Black Text<0x1a2> Text<0x19d> [Defeat Boss] 0x0 None [x<255>, y<0>]
+   0x1d

Did you know that FE8(U) chapter data has actual FE7 ranking requirements data most of the time? Except it’s either a clone of the above (pretty sure it comes from the FE7 prologue) or zeroed out.

Characters also get automatically labelled. They look like:

@Eirika
DATA 0x212 0x26e Eirika [Lord (Eirika)] Eirika Default Light Eirika 0x1
+   [0, 0, 0, 0, 0, 0, Lck<5>] 0x0 [E, -, -, -, -, -, -, -]
+   [70, 40, 60, 60, 30, 30, Lck<60>] Female @[Eirika Support Data] 0x7

The formatting makes it pretty clear where the growths, bases and weapon rank data are. Oh, right, I forgot to make it tag the Text IDs, like I do for the items:

@[Iron Sword]
DATA Text<0x354> Text<0x404> Text<0x0> [Iron Sword] Swords Weapon NULL NULL
+   0x2e 0x5 0x5a 0x5 0x0 0x11 0xa E [Iron Sword] None Nothing 0x1 0x0

Yes, that E is the weapon rank. The second [Iron Sword] identifies the icon. Yes, the values are context-sensitive, through the magic of DSA’s user-defined types.

Text

Text blocks in the disassembly look like:

!huffman default_huffman
!@[Text 623] 0xE9504 [string, ascii, gbatext]
'The princess of the kingdom of[NL]'
"Renais. She's elegant and kind.[.]"
!# 0xE9526

I decided that instead of trying to make new Huffman tables it would be way easier to just support the AH patch, by adding a thing to “disassemble” the text pointer table and make it recognize the flagged pointers. So now you get a block with entries like COMP @[Text 623] (COMP for compressed). When you assemble new text, use RAW instead, and skip the filter line on the block:

!@[My Text] 0xF00B00 [string, ascii, gbatext]
'An [Red]example[Red] string.'
!

# and then in the table, use
RAW @[My Text]

The pointer will be flagged, and the text will be stored without compression (but text codes will be processed, of course).
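For the curious, here is roughly what flagging a pointer could look like at the byte level, as a Python sketch. The post does not say which bit is used, so `RAW_TEXT_FLAG` below is an ASSUMPTION (the top bit of the 32-bit pointer); the function names are invented too. The only part taken as given is that GBA ROM is mapped at 0x08000000 and that pointers are stored little-endian.

```python
ROM_BASE = 0x08000000       # GBA cartridge ROM is mapped at this address
RAW_TEXT_FLAG = 0x80000000  # ASSUMPTION: top bit marks uncompressed text


def text_pointer(offset: int, compressed: bool) -> bytes:
    """Build the 4-byte table entry for text stored at a ROM file offset."""
    pointer = ROM_BASE + offset
    if not compressed:
        pointer |= RAW_TEXT_FLAG  # RAW entry: flag the pointer
    return pointer.to_bytes(4, "little")


def describe_pointer(raw: bytes) -> tuple[str, int]:
    """Classify a table entry as COMP or RAW and recover the file offset."""
    value = int.from_bytes(raw, "little")
    kind = "RAW" if value & RAW_TEXT_FLAG else "COMP"
    return kind, (value & ~RAW_TEXT_FLAG) - ROM_BASE


entry = text_pointer(0xF00B00, compressed=False)
print(entry.hex(" "), describe_pointer(entry))
```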

Events

Oh, yes, the fun part. The DSA system is so powerful that it took almost no effort to make it follow pointers from FE8 CALL opcodes automatically - so I did. Between the fancy block formatting and the extra events EA doesn’t pick up, the resulting disassembly for the prologue is over 1000 lines long. Also, the memory-slot opcodes get some extra love:

!size 32
!@[Event 24] 0x9EE310 Event
CALL @[Event 27]
SADD [s<0x2>, s<0x3>, zero]
TEXTSHOW 0xffff
TEXTEND
CALL @[Event 28]
ENDA
!# 0x9EE330

The name zero is used by the disassembler for slot 0 (which always contains 0). You should also be able to call it 0 (the raw value) or s<0> (using the tagged range option), or use hex, binary or octal for the number (DSA supports this in general, except where overridden). Oh, and there are similarly some friendly names for event flags:

!size 88
!@Ending 0x9EF164 Event
MUSC 0x31
SVAL s<0x2> 0x1d
CALL @[Event 27]
TEXTSHOW 0x918
TEXTEND
FADI 0x10
REMA
ENUT Guide<0x2c>
ENUT Guide<0x2d>
ENUT Guide<0x3>
ENUT Guide<0x0>
ENUT Guide<0x1>
ENUT Guide<0x28>
ENUT Guide<0x5>
ENUT Guide<0xe>
ENUT Guide<0xf>
ENUT Guide<0x33>
ENUT Guide<0x15>
MNC2 0x1
ENDA
!# 0x9EF1BC

zahlman · September 14, 2020, 5:17am

GET YOU SOME DSA RIGHT THE HECC NOW

So yeah, I made an installer package. Tomorrow I’ll throw together a video I guess. Happy FEE3!

(Edit: Video is in. “Paulette” and I can’t wait for the show!)