TimelineKGQA

A universal temporal question-answering pair generator for any temporal knowledge graph, revealing the landscape of Temporal Knowledge Graph Question Answering beyond the Great Dividing Range of Large Language Models.

To download we generated two datasets based on ICEWS Actor and CronQuestions KG, please visit the following link: TimelineKGQA Datasets

Motivation
Timelines
How human brain do the temporal question answering?
- Information Indexing
- Information Retrieval
Temporal Questions Categorisation
TimelineKGQA Generator
Temporal Question Answering Solutions
Development Setup
- Install the package
- Folder Structure

Motivation

Since the release of ChatGPT in late 2022, one of the most successful applications of large language models (LLMs), the entire field of Question Answering (QA) research has undergone a significant transformation. Researchers in the QA field now face a crucial question:

What unique value does your QA research offer when compared to LLMs?

The underlying challenge is:

If your research cannot surpass or effectively leverage LLMs, what is its purpose?

These same questions are also pressing the Knowledge Graph QA research community.

Knowledge graphs provide a simple, yet powerful and natural format to organize complex information. Performing QA over knowledge graphs is a natural extension of their use, especially when you want to fully exploit their potential. Temporal question answering over knowledge graphs allows us to retrieve information based on temporal constraints, enabling historical analysis, causal analysis, and making predictions—an essential aspect of AI research.

So we are wondering:

What's the landscape of Temporal Knowledge Graph Question Answering beyond the Great Dividing Range of Large Language Models after 2022?

The literature seems have not provided a clear answer to this question.

Timelines

We will begin with question answering datasets, as they are fundamental to any progress in this field. Without datasets, we can't do anything. They are our climbing rope, guiding us to the other side of the Great Dividing Range.

Current available datasets for the Temporal Knowledge Graph Question Answering are limited. For example, the most latest and popular TKGQA dataset: CronQuestions, containing limited types of questions, temporal relations, temporal granularity is only to year level.

Our real world temporal questions is way more comphrehensive than this.

We all know that we are living on top of the timeline, and it only goes forward, no way looking back. The questions we are asking are all related to the timeline, which is totally underesimated in current TKGQA research.

If we view all the temporal questions from the timeline perspective, we have this following types of timelines:

Straight Homogenous(Objective) Timeline:
- Exact date when it happens, for example, [2023-05-01 10:00:00, 2023-05-01 10:30:00]
- This is normally asking question about the facts, and upon the facts, we can do the analysis.
- For example, crime analysis, historical analysis, etc.
- Under this timeline, human will focus more on Temporal Logic
Cycle Homogenous(Objective) Timeline:
- Monday, First day of Month, Spring, 21st Century, etc.
- This is normally asking question about the patterns.
- Under this timeline, human will focus more on Temporal Pattern
Straight Homogenous(Subjective) Timeline:
- If you sleep during night, it will be fast for you in the 8 hours, however, if someone is working overnight, time will be slow for him.
- This is normally asking question about the perception of time.
- How is your recent life goes?
- Depending on the person, the perception of the meaning for the "recent" will be different.
- Under this timeline, human will focus more on Temporal Modifier
Cycle Heterogeneous(Subjective) Timeline:
- History has its trend, however, it takes thousands years get the whole world into industrialization.
- And then it only takes 100 years to get the whole world into information age.
- So the spiaral speed of the timeline is not homogenous.
- Under this timeline, human will focus more on Temporal Modifier also, but more trying to understand the development of human society, universe, etc.

We can not handle them all in a one go, and current TKGQA research is in front of the door of the Straight Homogenous( Objective) Timeline.

We will try to advance the research in this area first, and then try to extend to the other areas.

How human brain do the temporal question answering?

Information Indexing

When we see something, for example, an accident happen near our home in today morning. We need to first index this event into our brain. As we live in a three dimension space together with a time dimension, when we want to store this in our memory, (we will treat our memory as a N dimension space)

Index the spatial dimensions: is this close to my home or close to one of the point of interest in my mind
Index the temporal dimension: Temporal have several aspects
- Treat temporal as Straight Homogenous(Objective) Timeline:
  - Exact date when it happens, for example, [2023-05-01 10:00:00, 2023-05-01 10:30:00]
- Treat temporal as Cycle Homogenous(Objective) Timeline:
  - Monday, First day of Month, Spring, 21st Century, etc.
  - (You can aslo cycle the timeline based on your own requirement)
- Treat temporal as Straight Homogenous(Subjective) Timeline:
  - If you sleep during night, it will be fast for you in the 8 hours, however, if someone is working overnight, time will be slow for him.
- Treat temporal as Cycle Heterogeneous(Subjective) Timeline:
  - Life has different turning points for everyone, until they reach the end of their life.
Then index the information part: What happen, who is involved, what is the impact, etc.

So in summary, we can say that in our mind, if we treat the event as embedding in our human mind:

part of the embedding will represent the temporal dimension information,
part of the embedding will represent the spatial dimension information,
the rest of the embedding will represent the general information part.

This will help us to retrieve the information when we need it.

Information Retrieval

So when we try to retrieval the information, espeically the temporal part of the information. Normally we have several types:

Timeline Retrieval:
- When Bush starts his term as president of US?
  - First: General Information Retrieval => [(Bush, start, president of US), (Bush, term, president of US)]
  - Second: Timeline Retrieval => [(Bush, start, president of US, 2000, 2000), (Bush, term, president of US, 2000, 2008)]
  - Third: Answer the question based on the timeline information
Temporal Constrained Retrieval:
- In 2009, who is the president of US?
  - First: General Information Retrieval => [(Bush, president of US), (Obama, president of US), (Trump, president of US)]
  - Second: Temporal Constraint Retrieval => [(Obama, president of US, 2009, 2016)]
  - Third: Answer the question based on the temporal constraint information

Three key things here:

General Information Retrieval: Retrieve the general information from the knowledge graph based on the question
Temporal Constrained Retrieval: Filter on general information retrieval, apply the temporal constraint
Timeline Retrieval: Based on general information retrieval, recover the timeline information

Extend from this, it is retrieve the information for one fact, or you can name it event/truth, etc. If we have multiple facts, or events, or truths, etc, after the retrieval, we need to comparison: set operation, ranking, semantic extraction, etc.

And whether the question is complex or not is depending on how much information our brain need to process, and the different capabilities of the brain needed to process the information.

Temporal Questions Categorisation

timeline

So when we try to classify the temporal questions, especially from the difficulty perspective, we classify the level of difficulty based on how many events involved in the question.

Simple: Timeline and One Event Involved
Medium: Timeline and Two Events Involved
Complex: Timeline and Multiple Events Involved

Simple: Timeline and One Event Involved

Timeline Retrieval:
- When Bush starts his term as president of US?
  - General Information Retrieval => Timeline Recovery => Answer the question
  - Question Focus can be: Timestamp Start, Timestamp End, Duration, Timestamp Start and End
Temporal Constrained Retrieval:
- In 2009, who is the president of US?
  - General Information Retrieval => Temporal Constraint Retrieval => Answer the question
  - Question Focus can be: Subject, Object, Predicate. Can be more complex if we want mask out more elements

Medium: Timeline and Two Events Involved

Timeline Retrieval + Timeline Retrieval:
- Is Bush president of US when 911 happen?
  - (General Information Retrieval => Timeline Recovery) And (General Information Retrieval => Timeline Recovery) => Timeline Operation => Answer the question
  - Question Focus can be:
    - A new Time Range
    - A temporal relation (Before, After, During, etc.)
    - A list of Time Range (Ranking)
    - or Comparison of Duration
  - Key ability here is: Timeline Operation
Timeline Retrieval + Temporal Constrained Retrieval:
- When Bush is president of US, who is the president of China?
  - (General Information Retrieval => Timeline Retrieval) => Temporal Semantic Operation => Temporal Constraint Retrieval => Answer the question
  - This is same as above, Question Focus can be: Subject, Object
  - Key ability here is: Temporal Semantic Operation

Complex: Timeline and Multiple Events Involved

In general, question focus (answer type) will only be two types when we extend from Medium Level

Timeline Operation
(Subject, Predicate, Object)

So if we say Complex is 3 or n events and Timeline.

Timeline Retrieval * n
Timeline Retrieval * (n -1) => Semantic Operation * (n - 1)? => Temporal Constrainted Retrieval

Other perspectives

And based on the Answer Type, we can classify them into:

Factual
Temporal

Based on the Temporal Relations in the question, we can classify them into:

Set Operation
Allen Temporal Relations
Ranking
Duration

Based on the Temporal Capabilities, we can classify them into:

Temporal Constrained Retrieval: Filter on general information retrieval, apply the temporal constraint
Timeline Retrieval: Based on general information retrieval, recover the timeline information
Timeline Operation: From numeric to semantic
Temporal Semantic Operation: From Semantic to Numeric

To be able to answer the temporal question, we need to have the following key abilities:

General Information Retrieval: Retrieve the general information from the knowledge graph based on the question, you can call this semantic parsing, or semantic retrieval

TimelineKGQA Generator

With the above understanding, it will not be hard to programmatically generate the temporal question answering pairs for any temporal knowledge graph, as shown in the following figure:

tkg

And then we can follow the following steps to generate the question answering pairs:

Unify the temporal knowledge graph into the above format
Sampling the facts/events from the knowledge graph
Generate the question answer pairs based on the facts/events
Question paraphrasing via LLM

Generating process is like:

generator

Generated Question Answering Pairs for ICEWS Actor and CronQuestion KG

Source KG		Train	Val	Test	Temporal Capabilities	Count
ICEWS Actor	Simple	17,982	5,994	5,994	Temporal Constrained Retrieval	34,498
	Medium	15,990	5,330	5,330	Timeline Position Retrieval	79,382
	Complex	19,652	6,550	6,550	Timeline Operation	34,894
					Temporal Semantic Operation	24,508
Total		53,624	17,874	17,874		89,372
CronQuestion KG	Simple	7,200	2,400	2,400	Temporal Constrained Retrieval	19,720
	Medium	8,252	2,751	2,751	Timeline Position Retrieval	37,720
	Complex	9,580	3,193	3,193	Timeline Arithmetic Operation	21,966
					Temporal Semantic Operation	15,720
Total		25,032	8,344	8,344		41,720

Answers Detailed Types	ICEWS Actor	CronQuestions KG
Subject	17,249	9,860
Object	17,249	9,860
Timestamp Start	4,995	2,000
Timestamp End	4,995	2,000
Timestamp Range	4,995	2,000
Duration	4,995	2,000
Relation Duration	9,971	4,000
Relation Ranking	4,981	2,000
Relation Union or Intersection	19,942	8,000

For comparison, here is the statistics for the CronQuestions dataset:

Difficulty	Template Count	Question Count	Question Categorization	Count
Simple	158	177,922	Simple.Factual	106,208
Medium	165	90,641	Simple.Temporal	71,714
Complex	331	141,437	Medium.Factual	90,641
			Medium.Temporal	0
			Complex.Factual	67,709
			Complex.Temporal	73,728
Total	654	410,000		410,000

Temporal Question Answering Solutions

solution

RAG

The first hot spot is the Retrieval Augmented Generation(RAG), which will use the large language model embedding as the semantic index, to retrieve question relevant information from the knowledge graph, and then generate the answer based on the retrieved information.

Really this will be the dominant solution in the future?

TKGQA Embedding

On this side of the Great Dividing Range, people focused on the graph embedding ways to solve the question answering over the knowledge graph. However, due to the challenge from the LLMs, people are tend to ignore LLM in their research for this stream, or just give up this area. There is not much work released in the past years regarding this area.

Are this way really out of dated?

From the technical perspective, current temporal knowledge graph embedding ways will not fit with our proposed and generated dataset, because for the complex questions, the relevant fact will be 3, and they should have no difference between this three. If all three hit, then the Hits@1 is True.

So we developed a contrastive learning based temporal knowledge graph embedding way to solve this problem.

Text2SQL

The third way may never think about being a competitor in the KGQA area, but the LLMs provide this potential, as the knowledge graph in theory is just one table with a lot of interconnections. So generate a sql to retrieve related information will be not that hard for the LLM.

Will it really perform well in this area?

Finetuning

The last way in theory should be the easiest way for application if you have QA pairs, because if you want to fine tune the ChatGPT, they will do it for you, all you need to do is to provide the QA pairs. However, one of the main problem is lacking of QA pairs. Which we have solved the problem above.

So what's the real performance of this way if you do have enough QA pairs?

Evaluation Metrics

Hits@K: The percentage of questions where the correct answer is within the top K retrieved answers.
MRR: The mean reciprocal rank of the correct answer.

The Hits@K metric, used to evaluate the accuracy of event retrieval, is defined by the following criteria in our scenario:

\text{Hits@K} = 
\begin{cases} 
  1 & \text{if } \sum_{i=0}^{nN-1}r_i = n \\
  0 & \text{otherwise},
\end{cases}

where $r_i$ is an indicator function described as:

r_i = 
\begin{cases}
  1 & \text{if the $i$-th retrieved triplet matches an event} \\
  0 & \text{otherwise}.
\end{cases}

The $n$ represents the number of involved events for the question. In this framework, $r_i$ functions as an indicator that takes the value 1 if the $i$-th retrieved triplet corresponds to one of the designated events, and 0 otherwise. The indexing for $r_i$ begins at 0.

The Mean Reciprocal Rank (MRR) is defined as follows:

\text{MRR} = \frac{1}{Q} \sum_{q=1}^Q \frac{1}{\text{rank}_q + 1}

where $Q$ denotes the number of queries, and $\text{rank}_q$ is defined as the position of the first relevant document, i.e., $\text{rank}_q = \min { i : r_i = 1 }$. In our scenario, the definition of $\text{rank}_q$ needs to be adjusted to accommodate multiple relevance within the same set of results.

It is defined as:

\text{rank}_q = \sum_{i=0}^{\|\mathcal{F}\|} \left\lfloor \frac{i}{n} \right\rfloor r_i

where $|\mathcal{F}|$ is the number of facts.

Evaluation Results

Systematic Comparison between RAG and TKGQA Embedding

As these two are similar approaches, so we will evaluate them together.

Dataset	Model	MRR (Overall)	MRR (Simple)	MRR (Medium)	MRR (Complex)	Hits@1 (Overall)	Hits@1 (Simple)	Hits@1 (Medium)	Hits@1 (Complex)	Hits@3 (Overall)	Hits@3 (Simple)	Hits@3 (Medium)	Hits@3 (Complex)
ICEWS Actor	RAG	0.365	0.726	0.274	0.106	0.265	0.660	0.128	0.011	0.391	0.776	0.331	0.086
	RAG_semantic	0.427	0.794	0.337	0.162	0.301	0.723	0.164	0.022	0.484	0.852	0.424	0.195
	TimelineKGQA	0.660	0.861	0.632	0.497	0.486	0.782	0.435	0.257	0.858	0.929	0.845	0.805
CronQuestions KG	RAG	0.331	0.771	0.218	0.101	0.235	0.704	0.092	0.009	0.348	0.824	0.249	0.077
	RAG_semantic	0.344	0.775	0.229	0.122	0.237	0.707	0.094	0.010	0.371	0.828	0.267	0.122
	TimelineKGQA	0.522	0.788	0.510	0.347	0.319	0.676	0.283	0.103	0.758	0.759	0.667	0.834

Hits@1 for Text2SQL

Dataset	Model	Hits@1 (Overall)	Hits@1 (Simple)	Hits@1 (Medium)	Hits@1 (Complex)
ICEWS Actor	GPT3.5_base	0.179	0.268	0.170	0.105
	GPT3.5_semantic	0.358	0.537	0.311	0.232
	GPT3.5_semantic.oneshot	0.432	0.611	0.354	0.328
	GPT4o_semantic.oneshot	0.485	0.650	0.392	0.408
	TimelineKGQA	0.486	0.782	0.435	0.257
CronQuestions KG	GPT3.5_base	0.158	0.393	0.079	0.052
	GPT3.5_semantic	0.236	0.573	0.130	0.076
	GPT3.5_semantic.oneshot	0.281	0.583	0.179	0.143
	GPT4o_semantic.oneshot	0.324	0.623	0.201	0.207
	TimelineKGQA	0.319	0.676	0.283	0.103

Finetuning accuracy

Model	Rephrased	Question_as_answer	Simple_for_medium
GPT-3.5-Turbo	0.60	0.18	0.36
GPT-4o-mini	0.62	0.00	0.25

Development Setup

Install the package

# cd to current directory
cd TimelineKGQA
python3 -m venv venv
pip install -r requirements.txt
# if you are doing development
pip install -r requirements.dev.txt

# and then install the package
pip install -e .

If you are doing development, you will also need a database to store the knowledge graph.

# spin up the database
docker-compose up -d

# After this we need to load the data

# for icews_dict
source venv/bin/activate
export OPENAI_API_KEY=sk-proj-xxx
# this will load the icews_dicts data into the database
python3 -m TimelineKGQA.data_loader.load_icews --mode load_data --data_name icews_dicts
# this will create the unified knowledge graph
python3 -m TimelineKGQA.data_loader.load_icews --mode actor_unified_kg

# this will generate the question answering pairs
python3 -m TimelineKGQA.generator

Folder Structure

TimelineKGQA/
├── TimelineKGQA/
│   ├── __init__.py
│   ├── generator.py
│   ├── processor.py
│   └── utils.py
├── tests/
│   ├── __init__.py
│   ├── test_generator.py
│   └── test_processor.py
├── docs/
│   └── ...
├── examples/
│   └── basic_usage.py
├── setup.py
├── requirements.txt
├── README.md
└── LICENSE