
Data torture, redundancy failure points and normalization

Technical discussion which is not directly related to VGM files. Talk about Hardware and Software.



Post by vampirefrog »

Let me begin by defining the concepts in the title.

Data torture: Testing a large amount of data for consistency, uniformity and correctness. I got the name from a DEFCON talk (DEFCON is a yearly hacker and security convention); I have been watching videos of the talks lately.

Redundancy failure points: Pairs of places in the data that should hold exactly the same value but do not. An example is a song whose name in the txt file's song list differs from the English name in the GD3 tag of the referenced vgm file.

Normalization is when the structure of a database is optimized so that the data has no redundancy. That means that for a song in a pack, its title is stored in one place and one place only (one field in a database table). It also means that when you edit that song's name (maybe you made a mistake and want to correct it), it propagates to everywhere (the web interface,


I've been torturing the vgmrips data recently. The data sources are as follows:

1. The phpBB database, where I grab every topic from the "Official Releases" forum. There, I parse the [table] code for an initial data set, which remains largely unused. All I use is the zip file link and the image URLs. I could also import the data in the table, but I didn't see it as necessary. It is still a valid torture point.
2. The text file in the zip. I read all of the info in the text file: Game name, System, Music hardware, Music author, Game developer and so on, the song list, with length and loop length, Notes, Package history and even size reductions.
3. The VGM files in the zip - the header and the GD3 tags.
4. The m3u files in the zip.

In this data, several points contain redundant data. For example, the English name in the GD3 tag should be exactly the same as the song title listed in the txt file.
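To compare the GD3 English name against the txt file, the tag has to be read out of the vgm first. Here is a rough sketch of how that could look, based on my reading of the VGM/GD3 layout (GD3 offset stored at 0x14 relative to that position, then "Gd3 ", a version, a length, and eleven null-terminated UTF-16LE strings). The names load_vgm, parse_gd3 and GD3_FIELDS are made up for this example and are not taken from any existing tool.

```python
import gzip
import struct

# Order of the strings inside a GD3 tag (field names are my own labels).
GD3_FIELDS = [
    "track_en", "track_jp", "game_en", "game_jp", "system_en", "system_jp",
    "author_en", "author_jp", "date", "ripper", "notes",
]

def load_vgm(path):
    """Read a .vgm or .vgz file and return the uncompressed bytes."""
    with open(path, "rb") as fh:
        data = fh.read()
    if data[:2] == b"\x1f\x8b":  # gzip magic, i.e. a .vgz file
        data = gzip.decompress(data)
    return data

def parse_gd3(vgm):
    """Return the GD3 fields of an uncompressed VGM as a dict ({} if no tag)."""
    if vgm[:4] != b"Vgm ":
        raise ValueError("not a VGM file")
    (rel,) = struct.unpack_from("<I", vgm, 0x14)  # offset relative to 0x14
    if rel == 0:
        return {}
    pos = 0x14 + rel
    if vgm[pos:pos + 4] != b"Gd3 ":
        raise ValueError("GD3 magic not found")
    _version, length = struct.unpack_from("<II", vgm, pos + 4)
    text = vgm[pos + 12:pos + 12 + length].decode("utf-16-le")
    # Strings are null-terminated, so the split leaves one empty trailing item
    # that zip() silently drops.
    return dict(zip(GD3_FIELDS, text.split("\x00")))
```

With that, parse_gd3(load_vgm(path))["track_en"] is the value that should match the song title in the txt file.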

Next up, I'll list where the inconsistencies take place.

Post by vampirefrog »

So let's try to list the many places where inconsistencies can occur with the current system.

* The text file format itself. Although it's standardized, the standard isn't 100% thorough.
* Common enumerators such as system name, sound hardware, music authors, game companies
* The GD3 tag can contain different values from the txt file.
* The song list in the txt file can contain different song titles than the GD3 tag English name.
* The filenames of the vgm/vgz files can be different from the GD3 tag English title and from the txt file list as well. That means we have 3 places where the song title is stored, and we can therefore have 3 different versions. This can only be fixed by manual intervention.

Furthermore, filenames have some restrictions, and some characters cannot be included, and there is a length limit.
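The three-way title check can be sketched roughly like this. It assumes filenames look like "01 Title.vgz" (a track number, then the title), which is only a guess at the naming convention, and title_from_filename and title_mismatches are hypothetical helper names. Because filenames cannot hold every character and have a length limit, an exact comparison like this will flag some packs that are actually fine, so it is only a first filter before a human looks at them.

```python
import re
from pathlib import PurePath

def title_from_filename(name):
    """Strip the extension and an assumed leading track number from a filename."""
    stem = PurePath(name).stem
    # Assumption: an optional "NN", "NN - " or "NN. " prefix precedes the title.
    return re.sub(r"^\d+\s*[-.]?\s*", "", stem)

def title_mismatches(filename, gd3_title, txt_title):
    """Return the three titles if they are not all equal, else an empty dict."""
    titles = {
        "filename": title_from_filename(filename),
        "gd3": gd3_title.strip(),
        "txt": txt_title.strip(),
    }
    if len(set(titles.values())) == 1:
        return {}
    return titles
```

A title that itself starts with digits and has no track number in front would be mangled by the prefix stripping, so that case needs a manual look too.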

As you can see, there are many places where data which should be exactly the same has a chance to be different. That's OK, because even if we have to go through each pack, there are only about 1.1K of them, and a handful of people can go through them in a week, given the right administration UI (which we will have at some point). Besides, we can use database queries to list all the packs with potential problems.
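As an illustration of the kind of database query I mean, here is a small sketch using SQLite. The songs table and its columns (pack_id, filename, gd3_title, txt_title) are invented for this example and are not the actual schema.

```python
import sqlite3

# Hypothetical table: one row per song, holding both copies of the title.
SCHEMA = """
CREATE TABLE songs (
    pack_id   INTEGER NOT NULL,
    filename  TEXT NOT NULL,
    gd3_title TEXT NOT NULL,
    txt_title TEXT NOT NULL
)
"""

def packs_with_title_mismatch(conn):
    """Return (pack_id, mismatch_count) for packs whose two titles disagree."""
    return conn.execute(
        """
        SELECT pack_id, COUNT(*)
        FROM songs
        WHERE gd3_title <> txt_title
        GROUP BY pack_id
        ORDER BY pack_id
        """
    ).fetchall()
```

The result is a short worklist: only the packs that need someone to look at them, with a count of how many songs disagree in each.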

I've marked many packs with inconsistencies in http://vgm.mdscene.net/bugs.txt

Post by vampirefrog »

You may view the list of vgm/vgz files in this spreadsheet: http://evo.grigoriada.net/vgmfiles.csv.gz (617KB, UTF-8, tab-separated)
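If you would rather read the file from a script than open it in a spreadsheet program, something like this should work once you have downloaded it. It only relies on what is stated above (gzip, UTF-8, tab-separated); treating quote characters as plain text is my assumption, and read_vgm_listing is a made-up name. Which columns the rows contain is not covered here.

```python
import csv
import gzip

def read_vgm_listing(path):
    """Yield each row of a gzip-compressed, tab-separated UTF-8 file as a list."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        # Assumption: fields are not quoted, so quote characters are kept as-is.
        yield from csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
```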