Sep 6, 2023

Multiple Levels of Challenges


The first Foundry v11 release was on May 24th 2023, and we have since been asked many times “Why isn’t v11 recommended yet?”. Today, we’ll explain our requirements for recommending a new major Foundry release as well as take a deep dive into the various technical challenges we’ve had to solve in order to get Foundry v11 ready for use by our users.

Recommending Foundry v11

The Forge prides itself on providing a service to its users that is convenient, stable and with the least hassle possible. For The Forge to recommend a Foundry version to its users, we need to have a certain level of assurance that doing so will not cause issues for them. For that reason, we never recommend a Foundry version shortly after its release, as we need to see real-world data giving us peace of mind with regard to its stability and a painless migration path for existing users.

In the case of Foundry v11, there were multiple extenuating circumstances. The largest and most visible issue to users was that for a while, the majority of the game systems did not yet support v11. The Pathfinder 2e system, one of the most popular systems on Foundry, didn’t have a v11-compatible version until a month after v11’s release. The same was true of the Pathfinder 1e system, which didn’t release a v11-compatible version until July 17th, nearly two months after v11’s release.
We could not possibly recommend users switch to v11 if their first experience with it would be that they wouldn’t be able to launch their worlds anymore because the system is incompatible!

The second requirement is of course stability. After every major Foundry release, it is unavoidable that bugs get introduced alongside all the new exciting features that the Foundry team adds. While they do an excellent job testing and eliminating the vast majority of them, it is simply inconceivable that they’d be able to catch all of the bugs before release. Many of these are special use cases that cannot be predicted or encountered until the larger community of users starts putting the release through its paces. In the case of Foundry v11, we felt that the majority of these bugs had been fixed and Foundry is able to better handle unexpected scenarios (such as database corruption errors) in its most recent version (11.308) which was only released on August 24th 2023.

Last, but not least, we need to make sure that The Forge’s own features work well with the new Foundry version. Considering the complexity of this release, as well as the sheer amount of unique features that we provide, that task in itself was extremely time consuming. We have also had quite a few new and unique challenges that we weren’t expecting, and some that stumped us for a while.

Let’s Dive In

If you’ve been keeping an eye on Foundry VTT’s development and the official release notes, you’ve likely already noticed that Foundry v11 has made a significant architectural change in the form of transitioning from NeDB to LevelDB as the backend database storage for compendiums and world data.

This comes with a lot of benefits, such as improved read/write performance and support for embedded documents in sublevels of the database. This means that specific embedded documents can be updated without updating the entire parent entity, which also improves performance and reduces size fluctuations between compaction steps.
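To make the sublevel idea concrete, here is a toy Python sketch of a sorted key-value store. It is not LevelDB and not Foundry's code: the `SortedKV` class is a stand-in written for this example, and the key layout (`!actors!…` for parents, `!actors.items!…` for embedded documents) is an assumption used for illustration.

```python
from bisect import bisect_left, insort


class SortedKV:
    """A toy in-memory stand-in for a sorted key-value store such as LevelDB."""

    def __init__(self) -> None:
        self._keys: list[str] = []
        self._values: dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        if key not in self._values:
            insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, key: str) -> dict:
        return self._values[key]

    def scan(self, prefix: str) -> list[str]:
        """Return all keys starting with `prefix`, using the sorted order."""
        start = bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            out.append(key)
        return out


db = SortedKV()
# Hypothetical key layout: parent and embedded documents live under separate prefixes.
db.put("!actors!a1", {"name": "Valeros"})
db.put("!actors.items!a1.i1", {"name": "Longsword", "qty": 1})
db.put("!actors.items!a1.i2", {"name": "Rope", "qty": 1})

# Updating one embedded item writes a single small record; the actor is untouched.
db.put("!actors.items!a1.i2", {"name": "Rope", "qty": 2})
```

Because the keys are sorted, `db.scan("!actors.items!a1.")` finds every item embedded in that actor without touching unrelated records.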

The Forge’s integration with Foundry goes beyond simple hosting. With the Bazaar Marketplace, Assets Library, Game Manager, User Manager and other unique features, we aim to simplify and enhance every user’s hosting experience. With such a core architectural change, we’re presented with an interesting array of technical hurdles, most of which demand elbow grease and innovative solutions.

In this post, I’ll delve into and explore the challenges that we’ve faced on the road to recommending v11 specifically on The Forge as they relate to LevelDB databases.

Differences in Data Access: LevelDB vs. NeDB

LevelDB is a high-performance key-value store which is designed to store sorted data and provide quick read and write operations. It was originally developed by Google and later went open-source and has had a lot of optimizations and contributions over the years.

Like NeDB, LevelDB is a NoSQL database, which makes it flexible in terms of data structure. The two differ in how they store that data, however. NeDB files are plain text and can be opened directly using a text editor, which makes it possible (but not recommended!) for an advanced Foundry user to view and manipulate the data in their worlds and packages directly, much like they would any other JSON file.

But LevelDB works differently. It stores data on the disk using binary files, in a collection of immutable, log-structured tables called SSTables (Sorted String Tables) and uses other supporting files. When running a LevelDB database, multiple files are compiled into an intricate structure which is managed by its specialised API. Making changes directly to any of the LevelDB files outside of that managed process will, without a doubt, lead to database corruption. For that reason, the database is locked to ensure that only one process, the intended process, can write to it at any given time.

LevelDB is optimised for performance rather than human-readability, but if you’re a developer who automates parts of their workflow or otherwise benefited from the human-readability that NeDB had, Foundry has provided a CLI tool which can be used while the databases aren’t already locked to unpack LevelDB into JSON files and pack JSON files into a LevelDB database to help make life easier.

If you’ve looked into that at all, you might already be familiar with the excellent work that the LevelJS community has been doing to make it easy to work with LevelDB databases in a NodeJS or JS environment. The libraries we are particularly interested in here are classic-level and abstract-level, which classic-level is built on. This is how the Foundry CLI is able to read data from and write data to LevelDB.

Small Potatoes

Other than the various changes that were required for The Forge to support new Foundry features (like adding the option to select your theme, or prompting the user for the new telemetry option, etc.), there were also relatively small changes that were required as part of the new LevelDB database format.

Our first challenge was that it had become impossible for us to do the database migration of the user’s worlds or modules in the browser as part of our Import Wizard. For those unaware, when importing content into The Forge, a client-side library will migrate all asset paths in a world or compendium to point to the user’s asset library instead of using a relative local path. Unfortunately, after a lot of research, we were unable to find any library that can run in the browser and access LevelDB files. All the LevelDB JS libraries that can run in your browser use the browser’s own IndexedDB APIs as a backend and only provide a ‘LevelDB-like’ API. We even looked into writing our own pure JavaScript implementation and made some progress on understanding the binary file format of the database, but we dropped the project because the complexity kept growing and a bug in that implementation could be devastating. This forced us to redesign a good portion of our importing process, creating a new microservice and APIs dedicated to handling the required package migrations server-side instead of on the client side.
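As a rough illustration of what such a path migration does to each document, here is a minimal Python sketch. It is not The Forge's implementation: the function names, the `ASSET_EXTENSIONS` list and the base URL in the usage note are all assumptions made up for this example.

```python
from typing import Any

# Hypothetical list of media extensions, for illustration only.
ASSET_EXTENSIONS = (".png", ".jpg", ".webp", ".webm", ".mp3", ".ogg")


def is_relative_asset(value: str) -> bool:
    """Return True for strings that look like relative paths to media files."""
    if "://" in value or value.startswith(("/", "data:")):
        return False  # already absolute, or an inline data URI
    return value.lower().endswith(ASSET_EXTENSIONS)


def migrate_asset_paths(doc: Any, base_url: str) -> Any:
    """Return a copy of `doc` with relative asset paths prefixed by `base_url`."""
    if isinstance(doc, dict):
        return {key: migrate_asset_paths(val, base_url) for key, val in doc.items()}
    if isinstance(doc, list):
        return [migrate_asset_paths(item, base_url) for item in doc]
    if isinstance(doc, str) and is_relative_asset(doc):
        return base_url.rstrip("/") + "/" + doc
    return doc
```

Calling `migrate_asset_paths({"img": "worlds/demo/maps/cave.webp"}, "https://assets.example.invalid/user123/")` would rewrite the `img` value to a URL under that (made-up) base, while leaving names, numbers and already-absolute URLs alone. With NeDB this kind of rewrite could run on the text files directly; with LevelDB the documents first have to be read out through a LevelDB library, which is why the work moved server-side.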

Another challenge that was brought on by the use of LevelDB was that the old NeDB files were left over by default, causing users’ data storage requirements to nearly double in size! While we had the option of telling Foundry to delete the old database files after the migration, we could not feel safe in doing that without making sure our users had a backup in case things went wrong. When a user has the Game Manager enabled, we were already automatically downloading a backup of their worlds as they would be upgrading their individual worlds, but for the rest of our users, we couldn’t. It took a bit of finessing but in the end, we managed to detect when Foundry would migrate a world to a new version and automatically start the download of the world for the user before the migration happened. Once we had that in place, we felt safer in telling Foundry to delete the old database files after it’s done with its migration.

These are just a few small ~~potatoes~~ examples, but the real challenge the team had to tackle was figuring out concurrent access to compendiums from Foundry for users with multiple licences.

Accessing Compendiums in Foundry

Users may recall a period of instability when launching games with certain modules, or through the Game Manager while another game was already running, led to reproducible crashes. Initially, we thought it was an isolated issue, but as we delved deeper, it became clear that we were contending with a series of interconnected challenges.

The most striking problem was that certain databases would fail to open entirely, and this would result in the Foundry process crashing without being able to open the world. This was isolated to specific modules with compendiums that contained LevelDB databases that could not be opened, either due to not being packaged properly or due to corruption. Unfortunately, it was difficult to track since the error that was emitted did not appear in the Foundry logs, and finding out which module was responsible often required some trial and error. As more packages moved to v11, more developers became aware of what to look out for and this became more rare. A later release (v11.304) prevented Foundry from crashing in those situations, and more recently, a fix in the latest stable update of Foundry (v11.308 at time of writing) also means that Foundry has the ability to recover from these rare cases of database corruption, but this was not yet available at the time.

There are other reasons why a database would not be accessible that have nothing to do with corruption, however, namely the aforementioned database lock. Now, why are database locks important? In simple terms, these locks ensure that multiple processes don't interfere with each other, thereby maintaining the integrity of the data.

However, these locks also meant that running concurrent games led to restrictions in accessing vital game resources like compendium packs. World compendiums are accessed only by each individual world, but systems and modules are shared between worlds and the first process to access them will lock their compendium packs and will have sole access to them. On The Forge, a key feature of the Game Manager is to allow power users with multiple Foundry licence keys the ability to run multiple worlds at the same time. This allows players to access their characters in one world while the GM is prepping or running a session in another. At the time, this meant that making use of your second Foundry licence on The Forge was likely to cause the second process to crash, which is far from ideal.

Thankfully, the aforementioned fix that v11.304 implemented to smoothly handle errors that arise from a compendium being unable to open would also apply in the case of the compendium being locked by another process. The process would no longer crash, but unfortunately, the first world to obtain the lock would also be the only world that is able to access the compendium’s data. For Game Manager users with multiple Foundry licences, this is a frustrating case and a core reason behind the delay in recommending v11 of Foundry on The Forge. It hit especially hard on systems with extensive compendium data such as the incredible Pathfinder Second Edition system for Foundry.

This was a conundrum we couldn't overlook, but the full depth of the problem was not fully evident beforehand, and each subsequent fix would only reveal a deeper issue. We spent a considerable amount of time investigating possible solutions…

Potential Solutions and Their Drawbacks

However, each potential fix came with its own problems. Would each individual instance have its own user data, leading to additional space used on the user’s storage quotas, or would we allow world specific overrides where multiple versions of an individual package could exist, to be used only in the case of a specific world?

How would merging between differences in these databases be handled for cases like shared compendiums if we were to temporarily duplicate the packages to allow individual access, and then merge them together once their respective locks are released?

One idea we explored was to create a microservice to hold the information from all module compendiums and replace the classic-level library on the Foundry server to communicate with our microservice instead when trying to access compendiums from systems and modules. Unfortunately, this would not have solved the issue for custom modules, such as shared compendiums, and the complexity of it was too high.

Another far-fetched solution that was considered was emulating a filesystem and faking the database files to send all read/write operations to a central server that would orchestrate the actual I/O on the LevelDB files while handling concurrency seamlessly. Suffice to say that it was not a solution we were keen to implement.

These strategies introduce a lot of complexity, not only to the prospective user interface but to the underlying infrastructure, and the burden of that would eventually be shifted to the user, if implemented. We needed a more elegant solution, one that would allow us to maintain the functionality of the Game Manager and allow access to compendiums, while also respecting the lock of the LevelDB database, without re-inventing our architecture or adding complexity to the user interface. After months of research, testing and prototyping, we've found the perfect solution: rave-level.

Follow the Leader: Rave Level

Rave-level is a module created by the Level org to solve this specific use case. It is intended as a drop-in replacement for classic-level and indeed uses it as a dependency. It is built on the same underlying abstract-level API, intended to allow multi-process access to LevelDB databases while still respecting the database lock and requirement for concurrent database operations from a single point of entry. It achieves this by electing a “leader” process which connects to the database using the classic-level library, and allows other processes to connect to the “leader” via a file-based Unix socket. In the case that the “leader” is closed, another leader is seamlessly elected and the process is uninterrupted.
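The leader/follower idea described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not rave-level's code (rave-level is a NodeJS library): the function `open_shared`, the `LEADER.lock` file name and the use of `flock` as a stand-in for the database lock are all assumptions made for this example.

```python
import fcntl
import os
import socket


def open_shared(db_dir: str, socket_path: str):
    """Try to become the leader for `db_dir`; otherwise connect as a follower.

    Returns ("leader", lock_fd, server_socket) or ("follower", None, client_socket).
    """
    # Hypothetical lock file standing in for the real database lock.
    lock_fd = os.open(os.path.join(db_dir, "LEADER.lock"), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock: only one opener can hold it at a time.
        fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else is the leader: talk to it over the Unix socket instead.
        os.close(lock_fd)
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(socket_path)
        return "follower", None, client

    # We hold the lock, so we are the leader: serve followers over the socket.
    if os.path.exists(socket_path):
        os.unlink(socket_path)  # stale socket left by a previous leader
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(socket_path)
    server.listen()
    return "leader", lock_fd, server
```

The first caller becomes the leader and would be the only one to open the database; later callers get a socket to the leader. When the leader closes its lock, the next call to `open_shared` wins the lock and takes over. A real implementation also has to retry when a follower connects before the new leader has bound its socket, which this sketch leaves out.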

This was a breakthrough. Multiple processes could now functionally connect to the same database in an organised way, provided that each process has access to the socket, which is located in the user data. This maintains data integrity while also providing the required functionality and restoring functionality to the Game Manager. Instead of effectively locking the compendium to other processes, the “leader” could now share that information with the party.

We worked closely with Foundry during this time to specifically replace the classic-level dependency with rave-level, but it is important to note that this is a Forge-specific change. We have done a lot of work and testing to ensure that functionality is maintained. If you believe that you have found a problem relating specifically to this implementation, please contact us to let us know.

Additionally, it is important to emphasise that this change is to allow multiple processes to read from the compendiums of systems and modules that are integral to the operation of your world. We strongly recommend against making changes to the compendium data while multiple worlds are open at the same time and have access to that compendium, as it is not a supported feature of Foundry and can lead to unexpected behavior.

One change we had to implement was to ensure that all of a user’s worlds that get launched would end up on the same machine, because Unix sockets do not work across nodes, even when the file system where the Unix socket resides is a distributed file storage solution. This was a simple enough fix.

We soon learned however that Unix sockets are notoriously limited in their maximum path length, which is a problem for user data paths on The Forge which can contain long package names, but this was a dragon we could take on. Instead of creating the Unix socket in the same path as the database, as rave-level does by design, we could supply an override to rave-level that would allow us to create the socket at a location in the user data outside of the database location but based on a hash of that path, so that it would be consistent. With that fixed, we were off to a running start.
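The hashed socket location can be sketched as follows. This is a minimal Python illustration, not the actual override we supply to rave-level: the function name, the choice of SHA-256 and the 32-character truncation are assumptions. The length limit itself is real: on Linux, the `sun_path` field of a Unix socket address is 108 bytes including the terminating NUL.

```python
import hashlib
import os

# sockaddr_un.sun_path is 108 bytes on Linux, including the trailing NUL.
MAX_SOCKET_PATH = 107


def socket_path_for(db_path: str, socket_dir: str) -> str:
    """Map a database path of any length to a short, stable socket path."""
    digest = hashlib.sha256(os.path.abspath(db_path).encode("utf-8")).hexdigest()
    # Truncating the digest keeps the name short; 32 hex chars is an arbitrary choice.
    path = os.path.join(socket_dir, digest[:32] + ".sock")
    if len(path.encode("utf-8")) > MAX_SOCKET_PATH:
        raise ValueError(f"socket directory too long: {socket_dir}")
    return path
```

Because the name depends only on the database path, every process that opens the same database computes the same socket location, no matter how long the package name inside the path is.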

We hit a final challenge with rave-level when we realised that if the database was somehow corrupted, the library would report a successful open to Foundry, which would later crash when it tried to read from the database file. This was one use case that we had failed to account for and it seems to be the one difference in behaviour between using classic-level and rave-level. While it was a rare use case and only affected users with corrupted databases, we had pulled our change from our production servers until we could correct the issue. Fixing it was unfortunately a little more involved than our previous socket-path fix, as it required a small design change in how the database gets opened by rave-level. We have however managed it without trouble and redeployed our rave-level build of Foundry on the Forge servers and everything has been running smoothly ever since!

Other Database Lock Challenges

To offer the quality of life improvements that The Forge brings with tools to allow users to manage their packages, the Assets Library, the User Manager, as well as the Bazaar, it’s sometimes necessary for us to be able to access LevelDB databases.

The User Manager lets GMs and users access games using their associated Forge account for easy access, rather than requiring the host to set up and manage individual Foundry users and passwords per world. If a world is offline, these changes need to be made in the database so that they are available the next time the world is started. In order for the User Manager to automatically log users into their Foundry world, we need to be able to read the world’s users database and act upon its data. That can only be done when Foundry is closed, which is never the case when we need to log you into the world. Thankfully, working with the Foundry team, they were able to provide us with a websocket API we could use to retrieve all the information we require, when Foundry is running and the database is locked.

For Table Tools and Game Tools from the Forge Game Configuration page, package management directly on the filesystem can lead to problems when locks are not respected. That means that there are restrictions on what we can do with package management while Foundry is running. The lock is crucial for maintaining data integrity. In simpler terms, the pack’s files being accessed while the database is locked by Foundry could lead to database corruption. A good example of this is cloning your world while you are busy editing it from Foundry. Each file in the original world will be cloned to the new world, so theoretically the contents would be the same. However, considering the fact that LevelDB operates with multiple files, the open world may also be modifying some of them at the same time that others are being read. To prevent this, we close worlds before allowing them to be cloned. This is also the reason why we recommend that users close their local Foundry before importing their worlds to The Forge.

Package updates from the Bazaar are also affected. For any package that does not contain compendium packs, a Foundry world can operate with that package after it has been started even if the package itself changes on disk, and those changes will be reflected the next time Foundry starts. But a compendium’s database is writing to and reading from the drive all the time. This means that any package which is currently open in Foundry cannot be updated (or deleted) by the Bazaar, and requires that the world be stopped before this kind of package management can continue. Being conscious of and respecting the lock in these situations is a requirement for data integrity of the LevelDB databases. We learnt a few lessons in the arena of how locks are handled differently in different operating system environments. More on that soon.

Lessons learnt along the way

Sometimes, despite your best efforts, something slips through. With rave-level, we did not return to the test case of corrupted databases, and, as mentioned above, soon after our initial rollout discovered that there is a situation in which it would not behave identically to classic-level. When rave-level opens a corrupted database, it would not immediately throw an error. It would report a successful open, then it would emit an error event instead of directly throwing it during the 'open' call. This meant that the error handling that Foundry had for cases where the database is not able to open would be circumvented, causing the process to fail inelegantly and preventing the world from opening even if the database in question was not a required database. This meant that any package that had a compendium pack that could not be opened would, during the 24 hours in which this bug was live, prevent Forge users from accessing their worlds. Unfortunately, with corrupted compendiums being ignored under normal circumstances, some of these packages were easy to miss. Once we became aware of the issue, we immediately fixed this bug, ensuring that the sanctity of the open call and the surrounding error handling would be respected.

In the section of the article that discussed the challenges a lock imposes, I mentioned that different operating systems treat locks differently. When opening a LevelDB database with classic-level, that library detects whether the database is accessible, and throws an error if it is not. But other processes may have a harder time recognizing that the database is locked. One strategy is to simply attempt to read the file, which will behave as expected and run into an error when the file is locked… but only if you’re running Windows. Filesystem operations on Linux, which The Forge runs on, use what is called an advisory locking mechanism, rather than the stricter mandatory lock mechanism that one might see on Windows. With a mandatory lock, a locked file would not be accessible to any write operations (or to read operations either, if the lock is exclusive). The advisory locking mechanism operates similarly to a “gentlemen’s agreement”, where each process needs to request the lock before accessing the resource (such as a file or a database), and then release the lock when it is done; nothing stops a process that never asks. A common Unix implementation of the advisory lock is the flock system call. There is also the F_SETLK command of the fcntl system call (incidentally, this is what classic-level uses), which works similarly to flock, but on Linux the two kinds of lock are independent: a lock taken through fcntl is not visible to a process that checks with flock, and vice versa.
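The behaviour described above can be observed with a short Python experiment. It is a sketch that assumes a local Linux filesystem (on some other systems, and on NFS, flock and fcntl locks can interact); the function name is made up for this example. Python's `fcntl.lockf` is a wrapper around fcntl-style locks, while `fcntl.flock` uses the flock system call.

```python
import fcntl
import os
import tempfile


def demo_advisory_lock() -> dict:
    """Hold an fcntl-style lock on a file and see what other accessors observe."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"data")
    # fcntl.lockf() takes an fcntl-style (F_SETLK) lock on the whole file.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)

    # Open a second descriptor, as an unrelated reader would.
    other = os.open(path, os.O_RDWR)

    # 1. flock() is a separate lock family, so it does not see the fcntl lock.
    try:
        fcntl.flock(other, fcntl.LOCK_EX | fcntl.LOCK_NB)
        flock_acquired = True
    except BlockingIOError:
        flock_acquired = False

    # 2. A plain read is never checked against advisory locks at all.
    plain_read = os.pread(other, 4, 0)

    # Note: closing ANY descriptor for the file drops this process's fcntl locks,
    # which is why nothing is closed until the experiment is over.
    os.close(other)
    os.close(fd)
    os.unlink(path)
    return {"plain_read": plain_read, "flock_acquired": flock_acquired}
```

On Linux this is expected to report that the read succeeded and that the flock was granted even though an fcntl lock was held, which is exactly why a process that wants to respect LevelDB's lock has to ask through fcntl as well.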

Being able to detect classic-level’s locks outside of attempting to open the database itself (which is an expensive operation) means using the fcntl system call from The Forge’s NodeJS process. It’s not supported by default, so in order to get access to that we had to extend Node’s own filesystem functions with the very useful node-fs-ext library. Thus armed (and once we figured out why that library fails to compile and fixed it), we could now properly check for locked databases from processes that don’t specifically require database access but still have reason to respect the lock (such as updating a module from the Bazaar), and ensure database integrity.
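A lock check of this kind can be sketched in Python, whose standard library exposes fcntl directly. This is a stand-in for illustration, not our NodeJS code: the function name is hypothetical, and it probes by briefly trying to take the lock rather than querying it. The one detail taken from LevelDB itself is that the lock is held on a file named `LOCK` inside the database directory.

```python
import errno
import fcntl
import os


def is_locked_by_another_process(lock_path: str) -> bool:
    """Return True if another process holds an fcntl lock on `lock_path`."""
    fd = os.open(lock_path, os.O_RDWR)
    try:
        # Non-blocking attempt: fails immediately if someone else holds the lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return True
        raise  # some other problem, e.g. an I/O error
    else:
        fcntl.lockf(fd, fcntl.LOCK_UN)  # we only wanted to probe, so let go
        return False
    finally:
        os.close(fd)
```

A package updater could call `is_locked_by_another_process(os.path.join(pack_dir, "LOCK"))` and refuse to touch the pack when it returns True. Two caveats apply: the probe holds the lock for a brief moment, and it must only be used from a process that does not itself have the database open, because closing the descriptor would release that process's own fcntl lock.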

Looking Ahead

Like a boss fight with multiple phases, fully supporting LevelDB proved to be an evolving challenge with many levels of complexity. At The Forge, we resolve to deliver the best possible tools with which to supercharge our users' hosting and gaming experience. LevelDB was a big step for Foundry and also for us, but it is an investment that we're certain will pay dividends in the short and long term, making it easier for users and developers to get the most out of the software, and to craft the best possible stories.

Now that all of these challenges have been tackled, we're proud to be able to recommend Foundry v11 on The Forge!

Thank you for reading.

Acknowledgements

We worked closely with Foundry VTT during this time to be able to bring this solution to our users. A very large thank you to them for their work on Foundry and their invaluable support.