
Historica's Latest Experiment Results: Using LLMs for Feature Engineering in Historical Data

Updates in Historical Data Analysis 

At Historica, we have been investigating how advanced AI technologies can streamline and enrich the exploration of vast historical datasets. Our current objective is to assess empirically whether Large Language Models (LLMs) are viable for feature engineering on historical data, following the approach outlined in our research framework.

Annotating Historical Texts

Annotation is a crucial step in analyzing textual data, particularly in historical research, where precision is key. Manual annotation is laborious, but with the evolution of ML/NLP technologies such as language models, the process has become more efficient, allowing quicker and more accurate analysis of large datasets. Besides manual labor, established options for annotating historical texts include NLP libraries such as spaCy, NLTK, TextBlob, and StanfordNLP.

Experimentation and Evaluation

The primary task was to test the hypothesis derived from the University of Barcelona's article. To do so, we compared several models on how well they annotate historical texts. The evaluation criteria were accuracy, completeness, consistency, latency, cost-effectiveness, and the reliability of results. Among the models tested, two emerged as the most effective, offering a balance of speed, accuracy, and reliability.
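As an illustration of how a criterion such as consistency could be quantified, the sketch below compares the annotations a model returns across repeated runs on the same text, using mean pairwise Jaccard similarity. This is an assumption about one possible metric, not the measure used in the experiment; the function names and the sample annotations are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two annotation sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(runs: list) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of the same prompt."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical (entity, label) annotations from three runs on one text.
runs = [
    {("Tallinn", "PLACE"), ("1918", "DATE")},
    {("Tallinn", "PLACE"), ("1918", "DATE")},
    {("Tallinn", "PLACE")},
]

print(round(consistency(runs), 3))
```

A score of 1.0 means every run produced the same annotations; lower values indicate that the model's output varies between runs, which matters when annotation is meant to run unattended.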

Models Overview 

  1. GPT-3.5-turbo-instruct: reliable and efficient. It consistently delivers high-quality annotations with moderate latency and cost-effectiveness, which makes it well suited to historical data tasks. It excels at maintaining annotation integrity.
  2. Llama-codellama-7b-instruct: weak performance. It is consistently below average, with high latency and moderate cost-effectiveness, and it often produces hallucinations, which reduces reliability.
  3. Llama-3-8b-instruct: variable performance. It shows potential in specific tasks but lacks consistency, with moderate latency and cost-effectiveness.
  4. Mistral-7b-instruct: occasionally good but unstable. It sometimes delivers good results, but its lack of consistency hinders automatic annotation.
  5. Mixtral-8x7B-Instruct-v0.1: mixed performance. It operates relatively quickly but lacks depth, which limits its suitability for comprehensive annotation.
  6. Mixtral-8x7b-instruct-gptq: structured and effective. It offers high accuracy, consistency, and relevance, with moderate latency and cost-effectiveness.
  7. Phi-3-medium-128k-instruct: consistently good. It delivers high-quality results consistently and is suitable for research.
  8. Groq-llama3-70b: good. It provides fast, high-quality responses and is a top choice for similar tasks.
  9. GPT-4o: fast. It offers high-quality responses and is excellent for historical data annotation.

Data Collection for the Experiment

Data for the experiment were gathered from open historical sources. We randomly selected Wikipedia texts related to Estonia and standardized them, which yielded five texts of varying lengths. Three texts contained no more than 400 tokens each (a token is the basic unit of text that an LLM processes), while the remaining two contained 2955 and 12488 tokens.
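To see why text length matters, the following sketch checks whether the two longest texts would fit into a model's context window once room is reserved for the prompt and the answer. Only the token counts 2955 and 12488 come from the experiment; the text labels, the window sizes, and the reserved budget are illustrative assumptions rather than the specifications of any model we tested.

```python
# Token counts reported for the two longest texts in the experiment.
# The labels "text_4" and "text_5" are hypothetical.
text_tokens = {"text_4": 2955, "text_5": 12488}

# Assumed context windows in tokens (illustrative values only).
context_windows = {"small_window_model": 4096, "large_window_model": 128000}

# Assumed budget reserved for the instructions and the model's answer.
RESERVED_TOKENS = 1000


def fits(tokens: int, window: int, reserved: int = RESERVED_TOKENS) -> bool:
    """Return True if the text plus the reserved budget fits in the window."""
    return tokens + reserved <= window


for model, window in context_windows.items():
    for name, tokens in text_tokens.items():
        print(model, name, fits(tokens, window))
```

Under these assumed values, the longest text does not fit into the smaller window in a single request, which is the situation that forces a choice between splitting the text and switching models.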

Challenges and Considerations

For our experiment, we aimed to compare several models from OpenAI with a selection of open-source models for a comprehensive evaluation. The selection criteria were text generation quality, speed, and customization capability, and we relied on the LLM Arena on Hugging Face to guide our choice of models. The study also encountered challenges, most notably the limits some models impose on processing lengthy texts. These limits raise questions about the best approach for large-scale historical data annotation tasks, and further exploration is needed to determine the optimal strategy.


GPT-4o and GPT-3.5-turbo-instruct stand out as top choices for historical data annotation, balancing speed, accuracy, and reliability. Llama-3 (in its 70B variant served via Groq) and Phi-3 also perform well. The other models need refinement before they can meet the standard required for dependable annotation.

Our experiment uncovered pitfalls in model comparison, notably the limitations of context windows. This prompts a key debate. Should we generate multiple responses from a model for different parts of a text and then integrate them, which requires more resources? Or should we prefer a single, comprehensive response from a model that analyzes the entire text at once, at the risk of omitting some information?

This trade-off between resource allocation and the risk of data loss requires further investigation to identify the optimal strategy for large-scale historical data annotation tasks. 
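The first option can be sketched as follows: split the text into overlapping chunks that fit the context window, annotate each chunk separately, and merge the results. This is a minimal illustration under stated assumptions, not our pipeline. Whitespace-separated words stand in for model tokens, and `annotate` is a hypothetical placeholder for an LLM call that simply tags four-digit words as years.

```python
import re


def split_into_chunks(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def annotate(tokens):
    """Hypothetical stand-in for an LLM call: tags four-digit words as years."""
    return {(t, "YEAR") for t in tokens if re.fullmatch(r"\d{4}", t)}


def annotate_in_chunks(tokens, max_tokens, overlap=0):
    """Annotate each chunk separately and merge the results with a set union."""
    chunks = split_into_chunks(tokens, max_tokens, overlap)
    merged = set()
    for chunk in chunks:
        merged |= annotate(chunk)
    return merged, len(chunks)  # annotations and the number of model calls


tokens = "Estonia declared independence in 1918 and restored it in 1991".split()
annotations, calls = annotate_in_chunks(tokens, max_tokens=4, overlap=1)
print(sorted(annotations), calls)
```

The number of calls grows with the length of the text, which is the resource cost of this option. The overlap reduces the chance that an entity straddling a chunk boundary is missed, and merging with a set union removes the duplicates that the overlap introduces. Annotations that depend on context spread across distant chunks can still be lost, so chunking does not fully remove the risk of omission.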




How can I contribute to or collaborate with the Historica project?
If you're interested in contributing to or collaborating with Historica, you can use the contact form on the Historica website to express your interest and detail how you would like to be involved. The Historica team will then be able to guide you through the process.
What role does Historica play in the promotion of culture?
Historica acts as a platform for promoting cultural objects and events by local communities. It presents these in great detail, from previously inaccessible perspectives, and in fresh contexts.
How does Historica support educational endeavors?
Historica serves as a powerful tool for research and education. It can be used in school curricula, scientific projects, educational software development, and the organization of educational events.
What benefits does Historica offer to local cultural entities and events?
Historica provides a global platform for local communities and cultural events to display their cultural artifacts and historical events. It offers detailed presentations from unique perspectives and in fresh contexts.
Can you give a brief overview of Historica?
Historica is an initiative that uses artificial intelligence to build a digital map of human history. It combines different data types to portray the progression of civilization from its inception to the present day.
What is the meaning of Historica's principles?
The principles of Historica represent its methodological, organizational, and technological foundations: Methodological principle of interdisciplinarity: This principle involves integrating knowledge from various fields to provide a comprehensive and scientifically grounded view of history. Organizational principle of decentralization: This principle encourages open collaboration from a global community, allowing everyone to contribute to the digital depiction of human history. Technological principle of reliance on AI: This principle focuses on extensively using AI to handle large data sets, reconcile different scientific domains, and continuously enrich the historical model.
Who are the intended users of Historica?
Historica is beneficial to a diverse range of users. In academia, it's valuable for educators, students, and policymakers. Culturally, it aids workers in museums, heritage conservation, tourism, and cultural event organization. For recreational purposes, it serves gamers, history enthusiasts, authors, and participants in historical reenactments.
How does Historica use artificial intelligence?
Historica uses AI to process and manage vast amounts of data from various scientific fields. This technology allows for the constant addition of new facts to the historical model and aids in resolving disagreements and contradictions in interpretation across different scientific fields.
Can anyone participate in the Historica project?
Yes, Historica encourages wide-ranging collaboration. Scholars, researchers, AI specialists, bloggers, and history enthusiasts are all welcome to contribute to the project.