Technology
Historica is working to employ AI to process and manage vast amounts of data from various scientific fields. We are sharing our technological diary about our experience with using AI to create a digital map of human history.
July 2024 - Roman Chepenko - From Text to Ontology: Developing an ETL Pipeline and Graph Database for Historical Data with LLMs
In the realm of modern technology and artificial intelligence (AI), significant progress has been made in applying various models to data, particularly historical data. This article discusses our recent developments and how we use different LLMs to study historical data.
In a previous publication, we explored the potential of various LLMs for feature engineering of historical data. We selected a specific set of texts and transformed historical text into JSON objects with defined fields. The paper "The Semantics of History" by the University of Barcelona formulates an approach to annotating and storing data for analyzing textual data from different fields of study. More details can be found here: https://www.historica.org/technology#may-2024-roman-chepenko-using-large-language-models-for-feature-engineering-and-annotating-historical-data
At Historica Technology Lab, we developed an innovative ETL (Extract, Transform, Load) pipeline for loading data into an "Ontology." This data pipeline transforms historical texts into structured data suitable for database storage. As part of our research, we aimed to create a prototype of a "historical ontology," a knowledge repository of human history. This article describes our approach to creating this pipeline, the results achieved, and potential application areas.
Methodology
ETL Pipeline
Our ETL pipeline is designed to extract data from historical texts, transform it, and load it into a database. The ontology was built using Amazon Neptune to ensure efficient storage and access to these data. The pipeline consists of several stages:
- Data Extraction (Extract): This stage involves analyzing historical texts using a specialized large language model (LLM). The primary goal is to identify key entities and relationships such as geographical locations, dates, archaeological evidence, participants in specific events, etc.
- Data Transformation (Transform): The extracted data is formatted into a JSON structure that includes three main categories:
  - UT (Unit of Topography): Describes events tied to specific places and times.
  - US (Unit of Stratigraphy): Describes material evidence of past events.
  - AC (Actors): Describes individuals or organizations associated with the events.
- Data Loading (Load): The transformed data is loaded into the Amazon Neptune database. A custom script creates vertices and edges in the graph database, linking entities and their attributes. The database architecture was designed based on the approach formulated by the authors from the University of Barcelona.
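As a rough illustration of the Transform and Load stages, the sketch below turns one annotated JSON object into vertex and edge records of the kind a loading script might write to a graph database. The edge names `evidenced_by` and `involves`, the ID scheme, and the choice to link every US and AC item to every UT from the same text are assumptions made for this example, not the schema of the Neptune database described here. The sample values are illustrative and no database connection is made.

```python
import json
from itertools import count


def to_graph(annotation: dict) -> tuple[list[dict], list[tuple[str, str, str]]]:
    """Convert a {"UT": [...], "US": [...], "AC": [...]} object into vertices and edges."""
    ids = count(1)
    vertices: list[dict] = []
    by_label: dict[str, list[str]] = {"UT": [], "US": [], "AC": []}

    for label in by_label:
        for item in annotation.get(label, []):
            vertex_id = f"{label}-{next(ids)}"
            # Keep the extracted fields as vertex properties.
            vertices.append({**item, "id": vertex_id, "label": label})
            by_label[label].append(vertex_id)

    edges: list[tuple[str, str, str]] = []
    for ut in by_label["UT"]:
        # Hypothetical edge names; a real schema would come from the ontology design.
        edges += [(ut, "evidenced_by", us) for us in by_label["US"]]
        edges += [(ut, "involves", ac) for ac in by_label["AC"]]
    return vertices, edges


sample = json.loads("""
{
  "UT": [{"location": "Tallinn", "date": "1219", "attributes": ["battle"]}],
  "US": [{"evidence": "fortress remains", "attributes": []}],
  "AC": [{"name": "Valdemar II", "attributes": ["king"]}]
}
""")
vertices, edges = to_graph(sample)
print(vertices)
print(edges)
```

Keeping the conversion separate from the database client makes it possible to check the resulting records before anything is written to the graph.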
Justification for Choosing a Graph Database
The decision to use a graph database (in our case, Amazon Neptune) was driven by several key factors:
- Flexibility in Data Modeling: Graph databases like Amazon Neptune allow for modeling complex relationships between entities. This is crucial for historical data, where events, individuals, and artifacts can be connected in many different ways.
- Scalability: Amazon Neptune provides high scalability, allowing for efficient processing and storage of large data volumes. This is critical for historical research, where data volumes can be substantial.
- Query Performance: Graph databases offer high performance for complex queries, making them ideal for analytical tasks and uncovering hidden relationships in the data.
Ontology: Working with the Database
Our pipeline allowed us to populate our database with historical data according to the planned structure. After thorough testing and data quality checks, we developed our prototype "historical ontology."
For working with the database, we use an application based on a large language model (LLM) that enables text queries to the Neptune database. The application integrates with the OpenAI model to generate responses based on the data stored in the database. For implementing the LLM application, we used the langchain library and the RAG approach to LLM utilization.
Key Components of the Application
- Database Connection: The application establishes a connection to Amazon Neptune using a specialized driver for graph queries.
- Query Formation: Queries to the database are formed in natural language and passed through the LLM, which translates them into corresponding database queries. This allows users to interact with the database without knowing specialized query languages like Gremlin or Cypher.
- Result Processing: The query results are processed and presented to the user in an easy-to-read format, allowing them to obtain the necessary information without needing to interpret raw data.
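The three components can be pictured as one small function that takes a question, asks a model to translate it into a graph query, runs the query, and formats the rows. This is a minimal sketch with stand-in callables: `translate` and `run_query` are hypothetical placeholders for the LLM call and the Neptune driver, and the prompt wording and the openCypher text in the usage example are invented for illustration. It is not the langchain-based application itself.

```python
from typing import Callable

PROMPT = (
    "Translate the question into a single openCypher query.\n"
    "Graph schema: {schema}\n"
    "Question: {question}\n"
    "Query:"
)


def answer(
    question: str,
    schema: str,
    translate: Callable[[str], str],
    run_query: Callable[[str], list[dict]],
) -> str:
    """Question -> query (via LLM) -> rows (via database) -> readable text."""
    query = translate(PROMPT.format(schema=schema, question=question)).strip()
    rows = run_query(query)
    if not rows:
        return "No matching records found."
    # Present each row as "key: value" pairs instead of raw driver output.
    return "\n".join(
        ", ".join(f"{key}: {value}" for key, value in row.items()) for row in rows
    )


# Stand-ins so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return "MATCH (a:AC)-[:involved_in]->(u:UT) RETURN a.name AS name, u.date AS date"


def fake_db(query: str) -> list[dict]:
    return [{"name": "Valdemar II", "date": "1219"}]


print(answer("Who was active in 1219?", "(AC)-[:involved_in]->(UT)", fake_llm, fake_db))
```

Passing the model and the database in as functions keeps the flow testable without credentials; in a real deployment they would wrap the actual LLM client and graph driver.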
Results
Our approach demonstrated high efficiency in working with historical data. For example, during testing with text queries, we were able to quickly retrieve answers to questions whose answers were contained in the database.
Discussion
Using large language models automates the data extraction process, significantly increasing its accuracy and speed. The Amazon Neptune database, based on a graph model, offers flexibility in data representation and relationships, as well as ease of scaling with increased data volumes. Integration with the OpenAI model allows for complex natural language queries, simplifying data access for end users. This is particularly useful for researchers and historians who do not need to learn specialized query languages. Graph databases provide high performance for complex queries, making them ideal for analytical tasks and uncovering hidden relationships in the data.
However, challenges exist. The accuracy of extracted data heavily depends on the quality and structure of the source texts. Poorly structured or low-quality texts can lead to errors in data extraction. Processing large data volumes may require optimizing the ETL pipeline's performance and database queries. Security aspects are also crucial when working with historical data, especially if it includes personal or confidential information.
Conclusion
The development of the ETL pipeline and its integration with the "Historical Ontology" opens new possibilities for analyzing and interpreting historical data. Our solution not only automates the extraction and structuring of information but also provides convenient tools for working with data through text queries. This makes historical research more accessible and efficient, contributing to new discoveries and a better understanding of the past.
In the future, we plan to expand our ontology to other regions and historical periods and improve our LLM-based solutions to enhance data extraction accuracy. We are confident that these technologies will find broad applications in digital humanities and other related fields.
15 July
May 2024 - Roman Chepenko - Using Large Language Models for Feature Engineering and Annotating Historical Data.
Introduction
The development of large language models (LLMs) offers significant opportunities for automating the analysis of vast datasets for various purposes. At Historica, our research laboratory, we have delved into the field of feature engineering using LLMs to explore their potential in historical data annotation.
Modern AI technologies are opening new horizons for analyzing historical data. In the paper "The Semantics of History" by the University of Barcelona, the authors formulated an approach to annotating and storing data for analyzing textual data from different fields of study. The primary goal of our experiment is to test the hypothesis that LLMs can be effectively utilized for feature engineering of historical data based on the suggested approach.
Review of the University of Barcelona's Study
- Unit of Topography (UT): Evidence of an action or situation located in space and time, regardless of the specificity of the information source and its attributes. Each UT has a specific location and date, which can be expressed as a UTM coordinate or an administrative delimitation that might have changed over time.
- Unit of Stratigraphy (US): Material evidence of a past action, representing an archaeological aspect of the cycle of time. Essential attributes of these units include graphic and cartographic representations.
- Actor: An individual or corporate entity involved in an action identified as a UT, with attributes like name, gender, religion, citizenship, date of birth and death, etc. Multiple individuals gathered for a specific period with a particular purpose can act as corporate actors.
Automatic Annotation of Historical Data
Automatic annotation is a crucial step in the analysis of any textual data, especially for historical data, where the accuracy of annotations is critical. Traditional methods include manual annotation, which is time-consuming and labor-intensive. Alternatives to manual annotation of historical textual data include NLP libraries such as spaCy, NLTK, TextBlob, and StanfordNLP. With advancements in the field of ML/NLP transformers, modern technologies like LLMs enable automation of this process, significantly speeding up and simplifying the work with large datasets.
Capabilities of LLMs for Automatic Annotation
LLMs offer powerful tools for processing and analyzing textual data. They can generate annotations, classify texts, and identify hidden connections in the data. LLMs are also suitable for feature engineering. One of the critical questions at this stage is choosing between open-source models and proprietary models (APIs from OpenAI, Google, etc.).
When selecting tools for data analysis, it is essential to consider various aspects such as accessibility, cost, and flexibility. Open-source solutions provide a high level of flexibility and customization for specific tasks. Proprietary solutions, on the other hand, may offer more stable and ready-to-use tools but are often limited in customization and require significant resources.
Experiment with LLMs
For our experiment, we aimed to compare several models from OpenAI and selected a few open-source models for a comprehensive evaluation. The criteria for selection included text generation quality, speed, and customization capability. We relied on the LLM Arena on Hugging Face to guide our choice of models.
Data Collection for the Experiment
Data for the experiment were gathered from open historical sources. We randomly selected texts related to Estonia from Wikipedia. The data were standardized, resulting in five texts of varying lengths. Three texts contained no more than 400 tokens each (a token is the smallest unit of text an LLM processes), while the remaining two consisted of 2,955 and 12,488 tokens.
Overview of the Experiment
The experiment involved annotating historical texts using the selected models. The primary task was to verify the hypothesis derived from the University of Barcelona's article. An example task included analyzing historical text using the following prompt:
“As a specialized NLP model, your task is to process historical text data for ontology-based storage. Your task is to analyze the historical text below and extract important information relevant to the ontology database. Specifically, identify entities and relationships based on the following categories:
- UT (Unit of Topography): The evidence of an action or situation that can be located in space and time. It should include both a location and a date.
- US (Unit of Stratigraphy): The material evidence of a past action, typically archaeological in nature. Graphic and cartographic representations are key attributes.
- AC (Actor): Individuals or organizations involved in an action linked to a UT. Attributes include name, gender, citizenship, and other personal details.
Provide the extracted data in JSON format with the following structure:
{
  "UT": [
    {
      "location": "<location>",
      "date": "<date>",
      "attributes": ["<additional attributes>"]
    },
    ...
  ],
  "US": [
    {
      "evidence": "<description>",
      "attributes": ["<additional attributes>"]
    },
    ...
  ],
  "AC": [
    {
      "name": "<name>",
      "attributes": ["<additional attributes>"]
    },
    ...
  ]
}
Text: {text}”
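A response to a prompt like this still has to be checked before it is stored. The following sketch assumes the model returns the JSON object possibly surrounded by extra text; it extracts the outermost object and verifies that the three top-level keys are lists. The function name `parse_annotation` and the sample response are hypothetical.

```python
import json

REQUIRED_KEYS = ("UT", "US", "AC")


def parse_annotation(raw: str) -> dict:
    """Extract and validate the JSON object from a model response."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch one type.
    data = json.loads(raw[start : end + 1])
    for key in REQUIRED_KEYS:
        if not isinstance(data.get(key), list):
            raise ValueError(f"missing or non-list field: {key}")
    return data


response = 'Here is the result: {"UT": [], "US": [], "AC": [{"name": "Lembitu", "attributes": []}]}'
print(parse_annotation(response)["AC"][0]["name"])
```

Rejecting malformed responses at this point is one way to count how often a model "fails to meet the requirements" of the output format.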
Comparative Analysis of Models
We have embarked on an exploration of feature engineering using LLMs, particularly focusing on the automatic annotation of historical data. This report provides a comprehensive evaluation of several LLMs, assessing their performance based on multiple criteria:
- Accuracy and completeness,
- Consistency and relevance,
- Latency and cost-effectiveness,
- Absence of hallucinations,
- Subjective evaluation of model output.
Our goal is to determine the most effective models for historical data annotation and to provide insights into their practical applications.
Model Performance Summary
GPT-3.5-turbo-instruct
GPT-3.5-turbo-instruct demonstrated high efficiency and consistent results across all analysis categories. With high accuracy, consistency, completeness, and relevance, the model only failed to meet the requirements once, indicating its reliability and capability to produce high-quality annotations. Its moderate latency and high cost-effectiveness further highlight its suitability for extensive historical data annotation tasks. The model also exhibited excellent performance in avoiding hallucinations, ensuring the integrity of the generated annotations.
Llama-codellama-7b-instruct
This variant proved to be the weakest in our set, consistently showing below-average results compared to other models. With low accuracy, consistency, completeness, and relevance, it struggled to meet the necessary standards. Together with its high latency and only moderate cost-effectiveness, its persistently low performance makes it less suitable for automatic historical data annotation tasks. Additionally, the model often produced hallucinations, further reducing its reliability.
Llama-3-8b-instruct
The model exhibited variable performance depending on the task, demonstrating moderate accuracy, consistency, completeness, and relevance. In some cases, it significantly outperformed other models, showing potential for specific types of tasks. However, its moderate latency and high cost-effectiveness are offset by its inconsistent performance, making it less reliable for general application. The model performed moderately in avoiding hallucinations, indicating some potential for improvement.
Mistral-7b-instruct
While this model occasionally delivered good results, it was unstable and did not always meet evaluation criteria. It showed moderate accuracy, completeness, and relevance but low consistency. Its high latency and moderate cost-effectiveness reduce its overall effectiveness for automatic annotation, particularly where consistency is critical. The model's moderate performance in avoiding hallucinations indicates a need for further refinement.
Mixtral-8x7B-Instruct-v0.1
This model operates relatively quickly and maintains the data structure in its outputs, demonstrating moderate accuracy, consistency, and relevance. However, its responses are sometimes superficial, resulting in low completeness. The model's high latency and low cost-effectiveness further impact its overall performance, making it less suitable for comprehensive historical data annotation. Its moderate ability to avoid hallucinations is insufficient to compensate for its other weaknesses.
Mixtral-8x7b-instruct-gptq
The model demonstrates good adherence to annotation structure and effective memory utilization for attribute filling, showing high accuracy, consistency, and relevance. However, its responses are not always comprehensive, which limits its completeness. With moderate latency and high cost-effectiveness, this model excels in maintaining high accuracy and consistency while effectively avoiding hallucinations.
Phi-3-medium-128k-instruct
This model consistently delivers good results, with high accuracy, consistency, completeness, and relevance. Although it is not the fastest, its low latency and large context window allow it to handle complex tasks effectively. Its moderate cost-effectiveness is balanced by its ability to maintain high performance in avoiding hallucinations, making it a good option for similar research.
Groq-llama3-70b
The model provides good and fast responses, showing high accuracy, consistency, completeness, and relevance. Its low latency and high cost-effectiveness make it one of the better options for similar tasks. Additionally, its high performance in avoiding hallucinations ensures the reliability of its outputs.
GPT-4o
This model offers very fast and high-quality responses, with a large context window, demonstrating high accuracy, consistency, completeness, and relevance. Its very low latency and high performance in avoiding hallucinations make it an excellent choice for tasks involving automatic annotation of historical data, especially where speed and accuracy are paramount. While its cost-effectiveness is moderate, the overall benefits significantly outweigh this factor.
Summary
In summary, the analysis highlights GPT-4o and GPT-3.5-turbo-instruct as the most effective models for historical data annotation, offering an optimal balance of speed, accuracy, and reliability. Llama-3 and Phi-3 also demonstrate strong performance and are reliable choices. The remaining models, although showing potential in specific areas, require further refinement to meet the high standards necessary for comprehensive and dependable historical data annotation.
Our experiment also revealed several pitfalls in comparing models, one of the most significant being the limitations of context windows. Only two models were able to fully process the text containing 12,488 tokens. For the other models, it was necessary to segment the text into smaller parts to fit within their context window limits.
This issue raises a critical debate: Is it more effective to generate multiple responses from a model for different parts of a single text and then integrate these responses into a cohesive final answer (a process that demands additional computational and integration resources)? Or is it preferable to obtain a single, comprehensive response from a model capable of analyzing the entire text in one go, despite the risk of potentially omitting some information?
This trade-off between resource allocation and the risk of data loss requires further investigation to identify the optimal strategy for large-scale historical data annotation tasks. Such exploration will be crucial in determining whether the consolidation of multiple outputs or the use of models with larger context windows better serves the goals of accuracy and efficiency in historical data processing.
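The first option, annotating a long text in parts and merging the results, can be sketched as follows. The sketch approximates tokens by whitespace-separated words and removes duplicates by comparing the JSON serialization of each item; both are simplifying assumptions, and `annotate` stands for any function that returns a UT/US/AC object for a piece of text.

```python
import json
from typing import Callable


def split_text(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most max_tokens words (a rough token proxy)."""
    words = text.split()
    return [" ".join(words[i : i + max_tokens]) for i in range(0, len(words), max_tokens)]


def annotate_long_text(text: str, max_tokens: int, annotate: Callable[[str], dict]) -> dict:
    """Annotate each chunk separately and merge the results without duplicates."""
    merged: dict[str, list] = {"UT": [], "US": [], "AC": []}
    seen: set[str] = set()
    for chunk in split_text(text, max_tokens):
        result = annotate(chunk)
        for key in merged:
            for item in result.get(key, []):
                fingerprint = key + json.dumps(item, sort_keys=True)
                if fingerprint not in seen:  # skip an entity already found in another chunk
                    seen.add(fingerprint)
                    merged[key].append(item)
    return merged


# Toy annotator for demonstration: treats every capitalized word as an actor.
def toy_annotate(chunk: str) -> dict:
    return {"UT": [], "US": [], "AC": [{"name": w} for w in chunk.split() if w.istitle()]}


print(annotate_long_text("Lembitu fought near Viljandi and Lembitu fell", 3, toy_annotate))
```

Exact-match deduplication only catches identical items; entities described differently in two chunks would need a fuzzier merge step, which is part of the integration cost mentioned above.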
Conclusions
The use of LLMs for historical data annotation is a promising direction, capable of significantly accelerating and improving the quality of analysis. Our research has confirmed the hypothesis that LLMs can be effectively utilized for feature engineering of historical data, based on the annotation data format formulated in the results of the University of Barcelona's article. Modern models provide researchers with powerful tools for automating the annotation and processing of textual data, ensuring high accuracy, consistency, and completeness of results. The choice of a specific model depends on priorities such as accuracy, speed, cost-effectiveness, and reliability.
The integration of advanced AI technologies into historical analysis opens new pathways for interdisciplinary research, fostering more accurate and efficient methods of data processing and interpretation.
In the future, our laboratory plans to experiment with the use of LLM agents for the automatic population of historical ontology, further expanding the capabilities and applications of these advanced models in the field of digital humanities.
For further inquiries or detailed information on our findings, please feel free to contact us.
April 2023 - Anadea - How we use the Perceptual similarity metric (LPIPS).
In the ongoing series of experiments surrounding the generation of historical maps, this article introduces a crucial tool for evaluating the fidelity of generated images: the Perceptual Similarity Metric, or LPIPS. Rather than relying on mere pixel-by-pixel comparisons, LPIPS leverages the power of neural networks to provide a more nuanced understanding of image similarity.
This document describes how similarity between generated and original images can be evaluated with a perceptual similarity metric (LPIPS), and how we can use it to compare “quality” of generated images during training.
Reminder about data
Metric description
LPIPS is a metric that compares similarity between two images.
Instead of comparing two images by pixels, it uses features that can be extracted from a pre-trained neural network - meaning, we feed a network an image, and get some information from hidden layers of the network as an output.
Perceptual image similarity metric has two properties:
- It is large when a human observes a large difference between images
- It is small when observers consider the images similar
We used this library to compare LPIPS between our images. To evaluate generation results, we resized generated and original images to 1080x1080 (original images were cropped, generated images were downscaled from 3072x3072). Additionally, we compared LPIPS values for the same images downsized to 512x512.
How can LPIPS be used?
- Validation metric during training (to check whether generation has improved)
- Comparing different models with one another
- Selecting best frame from multiple generated samples
- Selecting best model version after training
- Given multiple maps of the same style from different periods, LPIPS can indicate which map is “the closest in time” to a third map, following the general rule that the more years separate two maps, the more border changes there are between them. This way, LPIPS can be used to cluster maps by period.
- Identify specific years or periods where generated images are of lower quality - and work on parts of the dataset related to it.
Let’s say we have an original image of a map of Europe in the year 1400.
We want to compare it with generated images:
LPIPS(original_1400, generated_1400) = 0.13760228
LPIPS(original_1400, generated_1600) = 0.4240757
LPIPS(original_1400, generated_1500) = 0.3257943
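A comparison like this can be scripted. The sketch below assumes the open-source `lpips` Python package together with PyTorch, which may or may not be the library used for the numbers above. `make_lpips_distance` wraps the metric and `pick_best` selects the candidate with the smallest distance to a reference; both function names are hypothetical. The imports of `lpips` and `torch` are deferred, so the selection helper can be tried with any distance function.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def make_lpips_distance(net: str = "alex") -> Callable:
    """Return a function that computes LPIPS between two image tensors."""
    import lpips  # assumed dependency: pip install lpips
    import torch

    model = lpips.LPIPS(net=net)

    def distance(img_a, img_b) -> float:
        # Inputs: float tensors of shape (1, 3, H, W) with values scaled to [-1, 1].
        with torch.no_grad():
            return float(model(img_a, img_b).item())

    return distance


def pick_best(
    reference: T, candidates: Sequence[T], distance: Callable[[T, T], float]
) -> tuple[int, float]:
    """Return the index and score of the candidate closest to the reference."""
    scores = [distance(reference, candidate) for candidate in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Usage with the values reported above instead of real images.
reported = {"generated_1400": 0.13760228, "generated_1600": 0.4240757, "generated_1500": 0.3257943}
best, score = pick_best("original_1400", list(reported), lambda ref, cand: reported[cand])
print(list(reported)[best], score)
```

With real images, `distance` would be the function returned by `make_lpips_distance()`, and the same helper covers selecting the best frame among several samples or finding the map that is closest in time.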
Comparing LPIPS for similar images
LPIPS differences are smaller if generated images are very similar.
For instance, let’s compare the original image for the year 1400 with three generated images for the same year.
The image with the higher LPIPS, in the center, has different colors for two countries in the middle of the map (Lithuania and Moldova). Comparing all three images, we can see that only some parts of the map are visually different. On closer inspection, only the Ottoman and Polish-Lithuanian borders differ, and elsewhere on the map only a few artifacts are different.
Relation of age differences between maps
*Note that the years 950 and 1800 were not present in the training data; the generated images for them are purely fictional.
It may be observed that the bigger the age gap between maps, the larger LPIPS becomes.
A note on resolution
We also compared LPIPS at a resolution of 512x512.
Comparing the results between the two tables (1080x1080 vs 512x512 resolution), we see that when LPIPS@1080 is larger (e.g., 0.45), LPIPS@512 is roughly the same or slightly smaller (a difference of roughly -3%).
However, smaller LPIPS@1080 values (e.g., 0.15) lead to a bigger difference from LPIPS@512 (~25%). This implies that we should upscale our images to detect smaller differences on the map.
Conclusions
LPIPS greatly extends our ability to assess the quality of generated maps.
It can be used for validation, for selecting the best generated image, and for evaluating the model's predictive power in general.
Moreover, the LPIPS approach can be extended and potentially used to compare original and generated maps and show the regions where the model makes mistakes.
February 2023 - Anadea - Experiments on Generating Historical Maps Using the StableDiffusion Model on Real Data.
In the evolving realm of digital cartography, the role of advanced models in generating detailed and accurate historical maps has become paramount. This article delves into the recent experiments conducted with the StableDiffusion model, focusing on its application to real-world maps. We explore the challenges and nuances of training this model using a diverse dataset comprising maps and historical texts.
Training StableDiffusion model on real maps
The second part of our work was dedicated to building a foundation model for various maps and texts. Our plan was to fine-tune Stable Diffusion on a large dataset of pairs of maps and historical texts relevant to them.
As for the data, we decided to proceed with the WiT dataset, as it already includes both texts and maps of high quality and is a great tool for building a foundation model. WiT consists of images and related text fields; we used a combination of page title, abstract, and image caption as the text accompanying each map. To train our model only on relevant information, we built two supplementary models, one for filtering images and another for filtering texts. We worked with an `en`-only subset of WiT (5.4 million entries).
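Assembling the training text from WiT fields might look like the sketch below. The column names `language`, `image_url`, `page_title`, `context_page_description`, and `caption_reference_description` follow the published WiT schema as far as we know, but which columns correspond to the abstract and image caption mentioned above is an assumption. The sample rows are invented, and the two relevance-filtering models are omitted.

```python
from typing import Iterable, Iterator

# Assumed mapping: title, abstract, caption.
TEXT_FIELDS = ("page_title", "context_page_description", "caption_reference_description")


def build_pairs(rows: Iterable[dict]) -> Iterator[tuple[str, str]]:
    """Yield (image_url, text) pairs for English rows that have an image and some text."""
    for row in rows:
        if row.get("language") != "en":
            continue
        parts = [str(row[field]).strip() for field in TEXT_FIELDS if row.get(field)]
        if parts and row.get("image_url"):
            yield row["image_url"], ". ".join(parts)


rows = [
    {"language": "en", "image_url": "map.png", "page_title": "Livonia",
     "context_page_description": "Historical region on the Baltic coast",
     "caption_reference_description": "Map of Livonia in 1573"},
    {"language": "de", "image_url": "karte.png", "page_title": "Livland"},
]
print(list(build_pairs(rows)))
```

Writing the function as a generator means a subset with millions of entries can be streamed rather than loaded into memory at once.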
We trained Stable Diffusion 2.1 on this data to see what it would generate from free-form historical textual description. Despite the fact that training took only 100k steps (single GPU, 5 days of training), it achieved significant results in generating maps from historical data.
- Ability of the model to generate maps from prompt
- Map quality - lack of artifacts, map details, etc.
- Ability to understand time period and region described in prompts
Ability to understand time period and region described in prompts
As you can see, the model trained on our data is able to identify the historical region where events take place and draw its map. The prompt used for inference was not in the training dataset (and is of a different format). The maps are very diverse.
Let’s take a look at some more examples:
Although the maps in this variation do not meet the goals of the project, one may argue that increasing the dataset scale, applying better filters, and using more tricks during the preprocessing step, combined with much longer pre-training, will give accurate results of the desired quality. Note that this version of the model was trained with less than 0.001% of the compute used to train StableDiffusion itself.
Additionally, we’d like to demonstrate how the quality of generated maps improves with longer training. Results after 40,000 training steps are on the left; results after 100,000 training steps are on the right.
Conclusions
During our research we demonstrated that diffusion models can be used to generate maps.
We showed that, depending on the prompt, diffusion models are capable of generating parts of a map (a region, a province, or an even smaller area), accurately redrawing borders according to a historical period, and editing existing maps based on new context.
We’ve shown that, using a generalized map dataset, we can create a maps-only version of a diffusion network, and that with the power of unsupervised pre-training at scale it will be able to achieve high generalization capabilities.
Additional steps that may be taken into consideration
- Train a model for longer and on a larger dataset
- Explore different text2image networks
- Tune the model on “downstream tasks”, e.g., editing maps based on a prompt, generating a map given a year and region, etc.
- Improve the text-understanding part of the network by training on text only
Next steps with regard to overall project development
To verify the hypothesis in full and make generated maps usable on physical maps, the following tasks need to be addressed:
- Learn how to translate generated maps into an external format (e.g., OpenStreetMap). It would be necessary to “read” generated maps. In our view, this problem has to be split into two parts: identifying countries and borders, and matching them with textual information (e.g., country names). It is possible to integrate the identification of countries and borders into a diffusion network.
- Maps should be generated in different styles. For that, one may assign various labels to maps in the dataset, based on content (e.g., political map, religious map) and style (e.g., lithography map, globe). Once assigned, the labels would be inserted into prompts and later used for generation.
- Finally, we could add some input-output layers to the model and get different outputs from it (e.g., provide each map with the coordinates of its boundaries, its map type, and any other information).
December 2022 - Anadea - A “Toy” Dataset for the Initial Learning.
This article covers the first stage of our research into generating historical maps using neural networks. Our initial work focuses on a simple "toy" dataset for preliminary learning. Here, we examine how the StableDiffusion model responds to textual prompts and what results can be expected at this early stage.
Project Goals and our ideas
The map generation project was based on the idea that accurate historical maps can be generated by neural networks from textual information.
Our research aimed to test a hypothesis that historical maps of high quality can be generated with diffusion models from prompts with textual description of requested region and historical period. Our task was to check that specifically StableDiffusion can be used, and to understand its possibilities and limitations with regard to how it understands prompts.
We decided to start with the following ideas in mind:
- Use StableDiffusion as our base network
- Explore how to generate maps from different prompts - starting from easiest to hardest.
- Conduct experiments with one prompt at a time.
- Each experiment has its own training data.
- During each experiment, the network is additionally trained to generate maps from textual prompts like those in the training data.
- After the first phase of experiments is done, collect a large corpus of historical texts and maps, and train StableDiffusion on various maps and real historical texts.
Used data
To speed up our research, we used a “toy” dataset of historical maps for experiment purposes. It was based on this video and consisted of yearly maps of Europe in a single style. This dataset was not historically accurate, yet it let us understand the challenges of real-world data.
The time period of the maps that we used for experiments was limited to 1000 A.D. - 1800 A.D.
Each image in the dataset had a corresponding prompt in which the region, year, and list of countries could be mentioned.
We used StableDiffusion (mostly, v.2.1) and trained the U-Net part of the model using <image, prompt> pairs, and then evaluated using prompts in and out of the given time period. A typical dataset for such an experiment consisted of 700 to 8000 image-text pairs, which was enough for experimenting purposes.
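A prompt for one image-text pair could be assembled as in this sketch. The wording of the template and the file names are hypothetical, since the exact prompt format is not reproduced in this article; only the three optional ingredients (region, year, list of countries) and the 1000-1800 A.D. range come from the description above.

```python
from typing import Optional, Sequence


def make_prompt(year: int, region: Optional[str] = None,
                countries: Optional[Sequence[str]] = None) -> str:
    """Build a text prompt for a map image from its metadata."""
    prompt = f"map of {region or 'Europe'} in the year {year}"
    if countries:
        prompt += ", showing " + ", ".join(countries)
    return prompt


def make_pairs(image_paths: dict[int, str]) -> list[tuple[str, str]]:
    """Pair each yearly map image with its prompt, restricted to 1000-1800 A.D."""
    return [(path, make_prompt(year)) for year, path in sorted(image_paths.items())
            if 1000 <= year <= 1800]


print(make_prompt(1400, countries=["Poland", "Lithuania"]))
print(make_pairs({999: "0999.png", 1400: "1400.png"}))
```

Varying which of the optional parts are included is one simple way to obtain the progressively more complex prompts used across the experiments.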
Experiments and results
We’ve done five major experiments, each time feeding the model with more complex prompts:
1. Generate a map given prompt with a year and an image of a map.
The goal was to check the possibility of generating maps that would change based on a year in a prompt and see how SD catches map style.
We got good results:
- Generated map borders are highly accurate
- There were almost no artifacts, generation quality is good
- We even got readable country names, which was not expected
Generated examples:
From these results we conclude that the model can change countries and their borders according to the given year.
2. Generate a map given a prompt with a year and a list of countries on a part of the map.
The goal was to check whether the model can understand the connections between countries and regions, and generate smaller maps that display the countries from the prompt.
Results for this experiment were great: the model drew accurate crops of different regions. By scaling the dataset with different crops, we could potentially train a model to generate very small parts of the map, or to control the scale of the map through the prompt.
3. Generate a map given a prompt with a year and a message like “country A conquered country B”.
In our next experiment we wanted to check whether parts of the generated map can be manipulated through the prompt. The results were mixed: the final “accuracy” of the model was ~40% (measured as the share of cases in which country B was correctly redrawn in the colors of country A).
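One way such an accuracy could be computed automatically is sketched below. This is an assumption, not the procedure used in the experiment: maps are represented as small grids of color labels, a redrawing counts as correct when at least a chosen fraction of country B's original territory carries country A's color, and the threshold value is hypothetical.

```python
from typing import List, Sequence, Tuple

Grid = List[List[str]]  # each cell holds a color label, e.g. "red"


def is_correct_redraw(generated: Grid,
                      region_b: Sequence[Tuple[int, int]],
                      color_a: str,
                      threshold: float = 0.9) -> bool:
    """True if enough of country B's former cells now have country A's color."""
    if not region_b:
        raise ValueError("region_b must contain at least one cell")
    hits = sum(1 for row, col in region_b if generated[row][col] == color_a)
    return hits / len(region_b) >= threshold


def redraw_accuracy(results: Sequence[bool]) -> float:
    """Share of test prompts for which the redrawing was judged correct."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


# Toy example: country B occupied the right-hand column of a 2x2 map.
region_b = [(0, 1), (1, 1)]
good = [["red", "red"], ["red", "red"]]    # B fully repainted in A's red
bad = [["red", "blue"], ["red", "blue"]]   # B left unchanged
scores = [is_correct_redraw(m, region_b, "red") for m in (good, bad)]
accuracy = redraw_accuracy(scores)         # one correct redrawing out of two
```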
Although the results were poor, the experiment was still useful:
- We discovered LoRA, a method for quickly teaching the model new “concepts”, such as the borders of specific countries, in literally seconds.
- We learned that not all prompts suit diffusion models, partly because most prompts in the datasets used for pretraining only describe an image and do not give instructions on how to change it. We concluded that manipulating a generated map is a different task.
- We found a version of SD trained specifically to edit images: InstructPix2Pix. Training InstructPix2Pix to edit maps became our next experiment.
4. Edit a map given a prompt with a year and a message like “country A conquered country B”.
In this experiment we wanted to check whether a map can be edited with a prompt that describes certain changes (i.e., the year is the same, but one country has conquered another). We trained InstructPix2Pix to edit the original map according to an edit instruction.
The results were accurate: the model did exactly what the prompt asked.
5. Edit and generate maps based on historical texts of arbitrary length.
A fundamental limitation of the StableDiffusion model is that the input of its text encoder (CLIP) is limited to 77 tokens. To overcome this limitation, we split the input text into chunks of 77 tokens, encoded each chunk into its own embedding, and fed the model the average of these embeddings. To verify that averaged embeddings do not harm the model's generation capabilities, we trained both a vanilla SD model that generates maps from long inputs and an InstructPix2Pix model that edits maps based on longer texts.
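The chunk-and-average trick can be sketched as follows. This is a simplified illustration under stated assumptions, not our training code: `encode_chunk` stands in for the CLIP text encoder, `toy_encoder` is a hypothetical placeholder that returns a short vector, and token IDs are plain integers. In a real pipeline each chunk would also be padded and wrapped in the encoder's special tokens, and the averaged object would be the encoder's per-token hidden states rather than a flat vector.

```python
from typing import Callable, List, Sequence

MAX_TOKENS = 77  # input length limit of the CLIP text encoder, as stated above


def chunk_tokens(token_ids: Sequence[int],
                 chunk_size: int = MAX_TOKENS) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    return [list(token_ids[i:i + chunk_size])
            for i in range(0, len(token_ids), chunk_size)]


def average_embedding(token_ids: Sequence[int],
                      encode_chunk: Callable[[List[int]], List[float]],
                      chunk_size: int = MAX_TOKENS) -> List[float]:
    """Encode every chunk separately and return the element-wise mean."""
    chunks = chunk_tokens(token_ids, chunk_size)
    if not chunks:
        raise ValueError("cannot embed an empty prompt")
    embeddings = [encode_chunk(chunk) for chunk in chunks]
    return [sum(values) / len(embeddings) for values in zip(*embeddings)]


def toy_encoder(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for a text encoder: returns a 2-dim vector."""
    return [float(len(chunk)), float(sum(chunk))]


# A 200-token prompt is split into chunks of 77, 77 and 46 tokens.
tokens = list(range(200))
chunks = chunk_tokens(tokens)
embedding = average_embedding(tokens, toy_encoder)
```

Averaging keeps the conditioning the same shape as for a single chunk, so the rest of the model needs no changes. The cost is that the order of the chunks is lost.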
For longer prompts, we decided to query DBpedia for information on historical events:
- For the 15th century, we got descriptions of 286 events, 213 of which were longer than 77 tokens.
- For the InstructPix2Pix model we used the scheme from the previous experiment. Given a map and an event for a year X, we can:
  - Use the map for year X as the edited image
  - Use the event description as the prompt
  - Use the map for year X-N (N close to 10) as the original map
  - Generate more samples with different N, rewriting the prompts with an LLM
- This gave ~2200 pairs, enough for basic training.
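The scheme above can be sketched as a small function that builds training samples for the editing model. The function name, the dictionary keys, the file names, and the concrete offsets are assumptions for illustration. The step of rewriting prompts with an LLM is left out, so every sample for an event reuses the same description.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def build_edit_triplets(maps_by_year: Dict[int, str],
                        events: Iterable[Tuple[int, str]],
                        offsets: Sequence[int] = (8, 10, 12)) -> List[dict]:
    """For an event in year X, pair the map of year X-N with the map of year X."""
    triplets = []
    for year, description in events:
        edited = maps_by_year.get(year)
        if edited is None:
            continue  # no map for the event year
        for n in offsets:
            original = maps_by_year.get(year - n)
            if original is None:
                continue  # no map N years before the event
            triplets.append({
                "original_image": original,   # map for year X-N
                "edit_prompt": description,   # event description
                "edited_image": edited,       # map for year X
            })
    return triplets


# Hypothetical file names for yearly maps of the 15th century.
maps_by_year = {year: f"maps/{year}.png" for year in range(1400, 1500)}
events = [
    (1453, "Fall of Constantinople to the Ottoman Empire"),
    (1405, "Placeholder event description"),  # skipped: no maps before 1400
]
triplets = build_edit_triplets(maps_by_year, events)
```

Using several values of N per event is what multiplies a few hundred event descriptions into a few thousand training samples.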
Results:
The results demonstrated that the model understands long prompts, uses text from various parts of the prompt, and can work with very long texts.
Conclusions from the experiments
The experiments above were a necessary foundation for understanding model behavior when training on real maps and texts.
During the experiments, we discovered a few important things that will speed up future development. Here are just a few of them:
- Found an effective way to increase generated image size and quality up to 16x using super-resolution. This potentially allows us to generate images up to 12288x12288. (The generated examples in the doc are 3072x3072 and are already of decent quality.)
- Started using LPIPS (Learned Perceptual Image Patch Similarity) as an evaluation metric. It allows us to measure how similar the generated image is to the original.
- Sped up the training process at least 3 times (using memory-efficient optimizers, experimental implementations of certain network layers, and more aggressive training strategies that allow us to reduce the number of training steps tenfold).
- Experimented with training models on different checkpoints at different resolutions, and concluded that the highest image quality can be achieved with SD-2.1 at 768x768, with SD-2.1 being the best checkpoint available so far overall (SD-1.5 may be better at generating text, but is generally worse and unable to work at 768x768 without proper fine-tuning on high-resolution images). Most of our experiments were on 512x512 images due to time and resource constraints.