A Study Based on Large Language Model (2024)

¹ School of Software, Zhejiang University
² College of Computer Science and Technology, Zhejiang University
Email: {huangzww, lijuan18, longjin, wangjj2018, mingchentz, 22351088, zhiqiangliu, mjw.cs, zhang.wen}@zju.edu.cn

Zhiwei Huang¹, Juan Li¹, Long Jin¹, Junjie Wang¹, Mingchen Tu¹, Yin Hua¹, Zhiqiang Liu¹, Jiawei Meng², Wen Zhang¹ (corresponding author)

Abstract

As academic conferences foster global scholarly communication, researchers consistently need accurate and up-to-date information about them. Since this information is scattered across many sources, an intelligent question-answering system is needed to handle researchers' queries efficiently and keep them aware of the latest developments. Recently, Large Language Models (LLMs) have demonstrated impressive question-answering capabilities and have been enhanced with external knowledge retrieval to cope with outdated knowledge. However, these methods fail on conference queries because the latest conference knowledge is missing from their knowledge sources. To address this challenge, we develop the ConferenceQA dataset, covering seven diverse academic conferences. Specifically, for each conference we first organize the conference data into a tree-structured format through a semi-automated method. We then annotate question-answer pairs and classify them into four types to better distinguish their difficulty. With the constructed dataset, we further propose a novel method, STAR (STructure-Aware Retrieval), which leverages the inherent structural information during retrieval to improve the question-answering abilities of LLMs. Experimental results on the ConferenceQA dataset demonstrate the effectiveness of our retrieval method. The dataset and code are available at https://github.com/zjukg/ConferenceQA.

Keywords:

Conference dataset, Large language model, Retrieval augmentation.

1 Introduction

The rapid advancement of computer science has led to an increase in research presented at academic conferences, which are crucial for academic exchange. Given the vast and dispersed nature of conference information, querying is a more efficient method for information retrieval than navigating multiple sources.

Recent advancements in Large Language Models (LLMs) [21, 2, 7] have significantly impacted various NLP tasks, including question answering. LLMs demonstrate capabilities like chain-of-thought reasoning [3] and in-context learning [6], enhanced by increasing model parameters and extensive training data. After instruction fine-tuning [5], LLMs excel in conversational tasks and information retrieval [4].

Despite their success, LLMs suffer from incompleteness, untimeliness, and unfaithfulness, and they are limited in incorporating up-to-date and domain-specific expertise. This has motivated research on integrating LLMs with external knowledge sources such as knowledge bases (KBs) [8], search engines [9], and databases [10]. For academic conference queries, however, the external conference knowledge is missing, so LLMs cannot access the latest conference information, e.g., about conferences held in 2022 and later. Existing retrieval methods are efficient but primarily target plain text [11], triples [15], and tables [16]; this does not align well with the structured nature of conference websites, complicating their direct application to conference-specific queries.

In this paper, we introduce ConferenceQA, a benchmark comprising seven recent top-tier academic conferences. These conferences span research domains such as web science, natural language processing, machine learning, databases, artificial intelligence, and the semantic web, providing a comprehensive dataset that organizes information across all stages of each conference. To construct the dataset, we first employ a semi-automatic method to convert the conference information into a tree structure. We then use ChatGPT to simulate roles with diverse backgrounds and generate role-specific questions, which are carefully filtered and annotated with answers to ensure realism and reliability. We also document the source of each answer to further enhance the dataset's credibility. Finally, we categorize the questions into four types according to the complexity of deriving their answers.

On the constructed ConferenceQA dataset, we introduce STAR (STructure-Aware Retrieval), a method that adapts LLM-based retrieval to hierarchical data, and conduct a study on conference QA. Our method generates a textual description for each path based on both its surrounding structural information and its own textual content. We conduct experiments with various LLMs and with different retrievers. Compared to plain path retrieval, structure-aware retrieval yields an average relative F1 improvement of 15.50% across different LLMs and 17.03% across different retrievers, highlighting the effectiveness of STAR on the tree-structured ConferenceQA dataset.

[Fig. 1: Overview of the ConferenceQA dataset construction process.]

Our contributions can be summarized as follows:

  1. We construct a benchmark called ConferenceQA, which organizes conference information in a tree structure to support the evaluation of question answering about academic conferences.

  2. We introduce a novel method called STAR. By using the structural information around nodes to generate textual descriptions and retrieving over these descriptions, it effectively improves answering performance.

  3. We conduct experiments on the ConferenceQA dataset, showing that LLMs enhanced with retrieval can successfully answer questions about academic conferences and that our STAR method consistently outperforms the path retrieval method, offering meaningful insights.

2 Dataset Construction

In this section, we introduce the construction of the ConferenceQA dataset. We select seven representative academic conferences held in 2022 or 2023 and build the dataset from their official websites, where the most accurate information about each conference is stored. Each conference is assigned to one data annotator with relevant experience in academic conferences. We construct each conference dataset in three steps: hierarchical data transformation, QA pair generation, and question classification. The overview of the construction process is shown in Fig. 1.

2.1 Hierarchical Data Transformation

Data transformation in the ConferenceQA dataset involves standardizing the diverse formats of academic conference data sourced from official conference websites into a unified tree structure. Each conference page combines unstructured text, like conference introductions and paper submission guidelines, with structured data such as payment and schedule details. To manage this format variability, we employ a semi-automated method to create tree-structured data for each conference.

Specifically, the automated component converts structured table data into a tree format using ChatGPT, as shown in Fig. 1, where registration information is transformed. For other structured data, such as accepted papers with consistent schemas (title, authors, abstract), we employ web crawlers to fetch HTML pages and convert them into corresponding tree-structured data based on the HTML tags. The manual component involves annotating inter-page relationships. Annotators assign page titles to tree nodes based on the linkage among pages, evident in navigation bars and subpage links like ‘calls’, ‘proceedings’ and ‘programs’. Additionally, subtitles within pages are identified and designated as child nodes under the relevant page titles. These manual steps are essential to maintain the dataset’s quality and coherence.
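To make the automated part concrete, below is a minimal sketch, not the authors' released pipeline, of how an accepted-papers HTML table with a consistent schema (title, authors, abstract) could be converted into tree-structured data; the tag layout and sample HTML are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' code): convert a crawled HTML
# table of accepted papers into tree-structured data keyed by paper title.
import json
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Authors</th><th>Abstract</th></tr>
  <tr><td>Paper A</td><td>Alice; Bob</td><td>We study ...</td></tr>
</table>
"""

def papers_table_to_tree(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    tree = {"Accepted Papers": {}}
    for row in soup.select("table tr")[1:]:  # skip the header row
        title, authors, abstract = (td.get_text(strip=True) for td in row.find_all("td"))
        # Each paper title becomes an inner node; its fields become leaf nodes.
        tree["Accepted Papers"][title] = {"authors": authors, "abstract": abstract}
    return tree

print(json.dumps(papers_table_to_tree(SAMPLE_HTML), indent=2))
```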

Ultimately, we obtain seven conference datasets organized in a tree-structured format, which serve as accurate and rigorous knowledge sources.

2.2 QA Pair Generation

This step involves generating reliable question-answer pairs through role creation, LLM-generated questions, and manual annotation. For each conference, we utilize ChatGPT to simulate the roles of conference participants, generating relevant questions which are then manually filtered and annotated with answers and their sources to ensure realism and reliability.

We use ChatGPT to create 20 roles characterized by specific attributes such as age, research direction, position, publication history, and conference attendance experience, mimicking real-life researchers with diverse backgrounds who are interested in the conferences. With these roles, we prompt ChatGPT to engage in role-playing scenarios, generating five varied questions per conference. These questions cover different areas of interest or uncertainty relevant to the roles' diverse backgrounds. To avoid redundancy and enhance question diversity, we prompt the model iteratively: the questions generated in one round are used as examples in the next round, and ChatGPT is encouraged to generate more diverse questions. In the final step, we manually review and filter the questions to eliminate duplicates and unrealistic queries. We then annotate the answers based on our tree-structured data, documenting the source of each answer within the constructed academic conference data to ensure the dataset's reliability.
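A minimal sketch of one iteration of this role-conditioned question generation loop is shown below; the prompt wording and function names are our own assumptions rather than the authors' exact prompts.

```python
# Minimal sketch of the iterative, role-conditioned question generation step.
# Prompt wording and names are assumptions about the described procedure.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_questions(role: str, conference: str, prior_questions: list[str]) -> list[str]:
    prompt = (
        f"You are {role} and you are interested in {conference}.\n"
        "Questions generated in earlier rounds (do not repeat them):\n- "
        + "\n- ".join(prior_questions or ["(none)"])
        + "\nAsk five new, realistic questions about this conference, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip("- ").strip() for q in lines if q.strip()]

# Example for one hypothetical role; the output seeds the next iteration.
role = "a PhD student working on knowledge graphs who has never attended the conference"
questions = generate_questions(role, "WWW2023", prior_questions=[])
```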

2.3 Question Classification

To assess the model’s capability in handling questions of varying difficulty, we design a scheme to classify the question-answer pairs based on two criteria: the method used to generate the answer and the complexity of paths required to arrive at the correct answer.

Extraction vs. Reasoning: This category evaluates the process of answer generation. Answers directly pulled from the dataset are labeled as extraction, whereas answers that necessitate reasoning beyond the dataset content are labeled as reasoning. Reasoning questions are more challenging than extraction questions because, unlike direct extraction, reasoning questions require the model to have the capability to infer the relationship between the retrieved paths and the question.

Atomic vs. Complex: This category assesses the complexity of paths needed to generate the answer. Answers that depend on a single path are termed atomic, while those requiring multiple paths are termed complex. Complex questions are more difficult than atomic questions because, instead of a single path, complex questions require recalling multiple paths to derive an answer.

Combining these dimensions results in four levels of difficulty: extraction-atomic, extraction-complex, reasoning-atomic, and reasoning-complex. This classification is vital for analyzing the model’s performance across different complexities and reasoning demands.

2.4 Dataset Validation

Following data construction, a thorough validation process is conducted by three independent assessors who evaluate each QA pair across three critical dimensions. The first dimension assesses the alignment between each question and its answer, ensuring the answer accurately addresses the question. The second dimension examines the reliability of the answer source, ensuring it provides the necessary information for the question. The third dimension evaluates the practical relevance of each question, ensuring it reflects real-world needs and concerns. If a QA pair fails to meet the criteria in any dimension, as agreed upon by at least two assessors, it is marked for removal and redesign. This rigorous process ensures each QA pair is validated comprehensively, maintaining the quality and reliability of the dataset. Detailed statistics for each conference are shown in Table 1.

Table 1: Statistics of the ConferenceQA dataset (#Paths: number of root-to-leaf paths; #Depth: average path depth; EA/EC/RA/RC: number of extraction-atomic, extraction-complex, reasoning-atomic, and reasoning-complex questions).

Conference | #Paths | #Depth | #EA | #EC | #RA | #RC
WWW2023    | 15127  | 7.01   | 32  | 27  | 17  | 36
ACL2023    | 14306  | 9.05   | 29  | 21  | 30  | 25
ICML2023   | 4715   | 8.52   | 26  | 27  | 28  | 19
SIGMOD2023 | 6338   | 7.46   | 39  | 27  | 23  | 34
IJCAI2023  | 15800  | 6.13   | 28  | 26  | 13  | 33
ICDE2023   | 9736   | 9.14   | 28  | 24  | 22  | 21
ISWC2022   | 3594   | 7.53   | 33  | 42  | 25  | 18
Avg        | 9916   | 7.83   | 31  | 28  | 23  | 27

3 Method

In this section, we discuss LLM-based methods for academic conference question answering. The prevalent approach uses an external knowledge source for retrieval [15, 10, 13]: the user's query $q$ is used to extract relevant content $c$ from a domain-specific knowledge base, and this content is then combined with the query for the LLM to generate an answer. This retrieval-based method can be formalized as $a = \mathrm{LLM}(q, c)$, where $c = \mathrm{Retriever}(q, \mathcal{KB})$. It optimizes the retriever such that, for each question $q$, the model gives an answer $a$ with high accuracy or relevance to the correct answer. Our approach adheres to this retrieval-based paradigm but is adapted to our tree-structured conference dataset. We preprocess the structured data to facilitate content retrieval and introduce a novel method named STAR (STructure-Aware Retrieval), which effectively integrates structural and semantic information for improved retrieval performance.

3.1 Tree-structured Data Processing

The tree-structured data is hierarchically arranged, with each internal node representing a page or a section heading and each leaf node holding the corresponding content. For retrieval, we pair each leaf node with its root node to provide additional context to the LLM. Paths in the tree use '>>' to denote hierarchical relationships and carry both structural and semantic information. An example path is: WWW2023>>Attendees>>Registration>>Register Fee>>Virtual Conference>>ACM Members>>$300. After this processing, the knowledge source for retrieval can be represented as a set of paths $\mathcal{P} = \{p_1, p_2, \ldots, p_m\}$, where $m$ is the number of paths in the dataset.
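As an illustration, the following minimal sketch flattens such a tree (stored as a nested dictionary) into root-to-leaf paths joined by '>>', mirroring the example path above; the function name is our own.

```python
# Minimal sketch: flatten a tree-structured conference dict into root-to-leaf
# paths joined by '>>', mirroring the example path in the text.
def flatten_to_paths(node, prefix=""):
    paths = []
    if isinstance(node, dict):
        for key, child in node.items():
            new_prefix = f"{prefix}>>{key}" if prefix else str(key)
            paths.extend(flatten_to_paths(child, new_prefix))
    else:  # leaf value, e.g. "$300"
        paths.append(f"{prefix}>>{node}")
    return paths

tree = {
    "WWW2023": {
        "Attendees": {
            "Registration": {
                "Register Fee": {"Virtual Conference": {"ACM Members": "$300"}}
            }
        }
    }
}
print(flatten_to_paths(tree))
# ['WWW2023>>Attendees>>Registration>>Register Fee>>Virtual Conference>>ACM Members>>$300']
```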

3.2 Path Retrieval

Upon receiving a query $q$, the retriever selects a subset of paths from $\mathcal{P} = \{p_1, p_2, \ldots, p_m\}$ that are relevant to $q$. Following established methods [20], we use a dense retriever based on a dual-encoder framework, which employs an encoder to transform both the query $q$ and each path $p \in \mathcal{P}$ into embeddings. The similarity between the query and path embeddings is measured by cosine similarity, and the top-$k$ paths with the highest scores are retrieved, as expressed in Eq. (1), where $\mathbf{E}$ denotes the embedding function.

$c = \mathrm{topk}(\{\cos(\mathbf{E}(q), \mathbf{E}(p)) \mid p \in \mathcal{P}\})$   (1)
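A minimal sketch of this dense path retrieval follows, assuming a sentence-transformers encoder; the specific model name is an illustrative choice, not necessarily the one used in the paper.

```python
# Minimal sketch of the dense path retrieval in Eq. (1): embed the query and
# every path, score by cosine similarity, keep the top-k paths.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def retrieve_paths(query: str, paths: list[str], k: int = 5) -> list[str]:
    # With normalized embeddings, the dot product equals cosine similarity.
    q_emb = encoder.encode([query], normalize_embeddings=True)   # shape (1, d)
    p_emb = encoder.encode(paths, normalize_embeddings=True)     # shape (m, d)
    scores = (p_emb @ q_emb.T).squeeze(-1)                       # shape (m,)
    top_idx = np.argsort(-scores)[:k]
    return [paths[i] for i in top_idx]
```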
[Fig. 2: Illustration of the STAR structure-aware retrieval method.]

3.3 Structure-aware Retrieval

The limitation of treating a single path as the retrieval object is that it disconnects the structural relationships among paths. For example, the relationship between an author’s name and their affiliated institution is lost when paths are retrieved independently.

To overcome this, we introduce a novel method called STAR (STructure-Aware Retrieval). As shown in Fig. 2, STAR employs ChatGPT to iteratively generate a textual description $des_p$ for each path, from the root to individual nodes, in a top-down manner. We enhance the retrieval process by incorporating structural information into the generation input, which includes the sibling paths, the parent path's description, and the query path itself. This helps maintain the contextual relevance of each path, which is crucial for recognizing relationships such as those between an author and their institution. When generating path descriptions, we consider not only a node's immediate context but also the structural significance of related nodes, including its siblings and their parent nodes, ensuring a comprehensive representation of each path's context. To avoid losing information about the siblings of leaf nodes, we append the text of their parent node to each sibling of the leaf nodes. This method effectively preserves and utilizes structural relationships, enhancing the retrieval process.

Thus we construct a knowledge source of path descriptions $\mathcal{P}_{des} = \{(p, des_p) \mid p \in \mathcal{P}\}$, containing pairs of paths and their descriptions. For retrieval, we use the similarity between the query and each path description as the score of that path, and retrieve the top-$k$ paths with the highest similarity to the query $q$. With $\mathbf{E}$ denoting the embedding function, this process is formalized in Eq. (2).

$c = \mathrm{topk}(\{\cos(\mathbf{E}(q), \mathbf{E}(des_p)) \mid (p, des_p) \in \mathcal{P}_{des}\})$   (2)
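Below is a minimal sketch of STAR's top-down description generation and of building the $(p, des_p)$ pairs; the prompt wording, helper names, and example sibling path are our own assumptions, not the authors' exact prompts. Retrieval then reuses a dense retriever (as in the sketch for Eq. (1)) over the descriptions instead of the raw paths.

```python
# Minimal sketch of STAR's description generation: each path's description is
# produced from its own text, its parent path's description, and its sibling
# paths, so structural context is preserved (top-down, root to leaves).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_path(path: str, parent_desc: str, siblings: list[str]) -> str:
    prompt = (
        f"Path: {path}\n"
        f"Description of the parent path: {parent_desc}\n"
        f"Sibling paths: {'; '.join(siblings) if siblings else '(none)'}\n"
        "In one sentence, describe what this path records, keeping the "
        "structural context implied by its parent and siblings."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Example: build one (path, description) pair of Eq. (2).
path = "WWW2023>>Attendees>>Registration>>Register Fee>>Virtual Conference>>ACM Members>>$300"
parent_desc = "Registration fees for attending WWW2023 virtually."
siblings = ["WWW2023>>Attendees>>Registration>>Register Fee>>Virtual Conference>>Non-members>>..."]  # hypothetical sibling
path_descriptions = {path: describe_path(path, parent_desc, siblings)}
# Retrieval scores the query against these descriptions rather than raw paths.
```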

4 Experiments

In this section, we conduct question-answering experiments on the conference datasets to explore: 1) How does STAR perform with different LLMs? 2) How does STAR perform with different retrievers? 3) How does STAR perform across different academic conferences?

4.1 Experimental Details

Based on the constructed ConferenceQA, we use currently popular LLMs, including Bloom (7B) [31], GPT-J (6B) [30], Flan-T5 (xl and xxl) [29], LLaMA2 (7B and 13B) [7], Mistral (7B) [25], and ChatGPT, as the main evaluation backbones. For ChatGPT, we employ GPT-3.5-turbo and access it via the API (https://api.openai.com/). We employ BM25 [1], SentenceBERT [26], DPR [27], ANCE [28], and text-embedding-ada-002 as retrievers. In addition, we use Chroma (https://github.com/chroma-core/chroma) as our vector database and employ cosine similarity for matching. In all experiments, we select the top 5 retrieved paths.
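For illustration, here is a minimal sketch of how path descriptions could be stored and queried in Chroma with cosine similarity, as in this setup; the collection name and sample data are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' code): store path descriptions
# in a Chroma collection and retrieve the top matches by cosine similarity.
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="conferenceqa_paths",
    metadata={"hnsw:space": "cosine"},  # use cosine similarity for matching
)

# Hypothetical (path, description) pairs; in practice these come from STAR.
path_descriptions = {
    "WWW2023>>Attendees>>Registration>>Register Fee>>Virtual Conference>>ACM Members>>$300":
        "The virtual-conference registration fee for ACM members at WWW2023 is $300.",
}
collection.add(
    ids=[str(i) for i in range(len(path_descriptions))],
    documents=list(path_descriptions.values()),
    metadatas=[{"path": p} for p in path_descriptions],
)

results = collection.query(
    query_texts=["How much does virtual registration cost for ACM members?"],
    n_results=min(5, collection.count()),  # top 5 paths in the real setup
)
retrieved_paths = [m["path"] for m in results["metadatas"][0]]
print(retrieved_paths)
```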

4.2 Evaluation Metrics

In line with prior studies, we assess the QA capabilities of LLMs using the F1 score and the exact match (EM) score. Specifically, we employ GPT-4 to compute the EM, referred to as EM-GPT4.

The F1 score quantifies the overlap between the predicted and correct answers by calculating the harmonic mean of precision and recall.
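A minimal sketch of this token-level F1 computation (our own implementation of the standard metric, not necessarily the authors' evaluation script):

```python
# Minimal sketch of the token-level F1 metric: harmonic mean of precision and
# recall over the overlapping tokens of the predicted and gold answers.
from collections import Counter

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the fee is $300 for ACM members", "$300"))  # ≈ 0.25
```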

The EM-GPT4 score evaluates the proportion of instances where the LLM’s predicted answer exactly matches the correct answer. Given the generative nature of LLMs, slight textual variations in responses might still represent the same answer. We use GPT-4, a highly advanced LLM known for its semantic understanding capabilities, to precisely assess if the LLM’s response matches the golden answers.

Table 2: F1 and EM-GPT4 scores of different LLMs on the four question types (EA, EC, RA, RC). Each cell shows the path-retrieval score, with the change obtained by STAR in parentheses.

LLMs | F1 EA | F1 EC | F1 RA | F1 RC | EM-GPT4 EA | EM-GPT4 EC | EM-GPT4 RA | EM-GPT4 RC
Bloom-7B1 | 19.60 (-0.36) | 11.19 (+1.96) | 17.01 (+0.36) | 11.58 (-0.20) | 30.27 (+0.42) | 15.03 (+4.41) | 41.70 (-3.43) | 17.58 (-1.36)
GPT-J-6B | 14.53 (+1.76) | 8.81 (+3.11) | 15.52 (+2.46) | 8.42 (-0.25) | 19.11 (+7.04) | 12.93 (+5.53) | 34.16 (+5.03) | 13.08 (+1.94)
Flan-T5-xl | 27.74 (+7.85) | 14.68 (+2.77) | 36.03 (+0.96) | 19.00 (+2.89) | 35.50 (+9.56) | 20.78 (+0.86) | 59.38 (+3.97) | 25.74 (+2.27)
Flan-T5-xxl | 32.31 (+8.97) | 14.01 (+9.69) | 37.08 (+4.3) | 20.86 (+0.98) | 40.81 (+10.69) | 18.58 (+12.64) | 55.76 (+11.71) | 25.36 (+3.58)
LLaMA2-7B | 14.05 (+2.23) | 12.09 (+1.25) | 12.47 (+3.00) | 8.48 (+0.12) | 21.32 (+0.15) | 9.22 (+2.77) | 23.81 (+5.83) | 9.89 (-1.15)
LLaMA2-13B | 29.57 (+2.82) | 20.92 (+4.00) | 25.71 (+4.20) | 13.64 (+2.83) | 41.16 (+6.26) | 24.02 (+2.21) | 55.00 (+6.53) | 20.23 (+4.46)
Mistral-7B | 30.75 (+4.31) | 23.67 (+4.69) | 25.87 (+4.11) | 15.91 (-0.37) | 43.33 (+10.53) | 27.58 (+13.95) | 59.90 (+6.89) | 29.23 (+1.55)
GPT-3.5-turbo | 28.35 (+7.5) | 21.54 (+4.83) | 24.66 (+9.62) | 16.21 (+0.78) | 40.53 (+13.43) | 25.10 (+9.03) | 49.97 (+11.34) | 25.75 (+1.45)

4.3 Experimental Results Analysis

Effect of Different LLMs We analyzed the performance of various LLMs on different types of questions to understand their perception capabilities and limitations. The results, shown in Table 2, provide several insights: (1) Our STAR method significantly improves the answering performance across various LLMs. For instance, on models like Bloom-7B1, GPT-J-6B, and GPT-3.5-turbo, F1 scores increased by 4%, 14.9%, and 25.04% respectively, while EM-GPT4 scores improved by 0.04%, 24.65%, and 24.94%. The least improvement was on Bloom-7B1, suggesting its inherent limitations. However, substantial gains on other models demonstrate our method's effectiveness. (2) There is an inconsistency between F1 and EM-GPT4 scores; lower F1 scores sometimes align with higher EM-GPT4 scores. This may be because LLMs generate longer textual responses, which lowers F1 but not EM-GPT4, which better evaluates semantic similarity. (3) The complexity of question types affects performance; atomic questions are simpler than complex ones. Atomic questions, akin to single-hop queries, generally show higher accuracy than multi-hop complex questions. Despite this, LLMs perform comparably or better on reasoning questions than on extraction questions, likely due to their robust contextual learning and reasoning capabilities. (4) Different LLMs show varied understanding of paths. For example, under the same retrieval conditions, Mistral-7B outperforms GPT-3.5-turbo. Generally, models with more parameters, like LLaMA2-13B and Flan-T5-xxl, achieve higher accuracy, supporting the notion that larger LLMs perform better.

Table 3: F1 and EM-GPT4 scores of different retrievers with GPT-3.5-turbo as the generator. Each cell shows the path-retrieval score, with the change obtained by STAR in parentheses.

Retrievers | F1 EA | F1 EC | F1 RA | F1 RC | EM-GPT4 EA | EM-GPT4 EC | EM-GPT4 RA | EM-GPT4 RC
BM25 | 21.02 (+9.14) | 16.81 (+10.29) | 25.77 (-2.79) | 14.72 (+1.37) | 25.90 (+4.9) | 14.12 (+8.1) | 36.50 (+2.94) | 5.16 (+5.55)
SentenceBERT | 38.62 (-3.73) | 23.70 (-0.01) | 12.97 (+2.05) | 16.23 (-0.18) | 39.83 (-2.57) | 14.81 (+11.12) | 29.79 (+17.66) | 22.88 (-1.47)
DPR | 30.56 (+3.72) | 23.72 (+0.59) | 27.35 (+2.60) | 21.16 (-0.43) | 30.95 (+5.91) | 15.17 (-0.14) | 52.70 (+0.31) | 10.66 (+1.38)
ANCE | 28.24 (+8.72) | 17.41 (+4.75) | 16.52 (+9.11) | 8.00 (+2.26) | 41.66 (+6.14) | 30.12 (+4.07) | 50.84 (+0.49) | 15.25 (+1.21)
ada-002 | 28.35 (+7.5) | 21.54 (+4.83) | 24.66 (+9.62) | 16.21 (+0.78) | 40.53 (+13.43) | 25.10 (+9.03) | 49.97 (+11.34) | 25.75 (+1.45)

Effect of Different Retrievers We evaluated four retrievers (BM25 [1], SentenceBERT [26], DPR [27], and ANCE [28]) using GPT-3.5-turbo as the generator across the four question types within the ConferenceQA dataset. The results, detailed in Table 3, reveal: (1) BM25 showed weak performance, especially with extraction-atomic and reasoning-complex questions. In contrast, dense retrievers like SentenceBERT, DPR, and ANCE significantly outperformed BM25, underscoring the advantages of dense retrieval methods. (2) Performance varied among dense retrievers: SentenceBERT was effective in extraction-atomic questions but less so in reasoning-atomic questions. DPR excelled in reasoning-atomic questions, while ANCE showed consistent performance across all question types. This indicates that selecting an appropriate retriever can significantly impact question-answering effectiveness. (3) While STAR occasionally had negative effects in some configurations, it generally enhanced performance across most settings, demonstrating its utility and reliability.

[Fig. 3: Performance of path retrieval and STAR across the seven conferences.]

Effect of Different Conferences Fig. 3 shows the performance across the seven conferences, using text-embedding-ada-002 as the retriever and GPT-3.5-turbo as the generator. Key observations: (1) Question difficulty varies notably across conferences, highlighting the diversity of our dataset. (2) The differences can be substantial; for example, the average EM-GPT4 score on ICML is 94.9% higher than on ACL, underscoring the importance of accounting for conference-specific characteristics in question-answering research. (3) Except for reasoning-atomic questions on SIGMOD and reasoning-complex questions on ISWC, our STAR method consistently outperforms plain path retrieval, demonstrating its versatility and effectiveness across different conferences and question types.

5 Related Work

In academic data science, foundational resources such as CiteSeerX [19], a digital library for scientific literature, and unarXive [18], which hosts over a million documents from arXiv.org, are crucial for scholarly communication. Zhang et al. [17] developed Maple, a benchmark for tagging scientific literature across 19 disciplines. However, there remains a notable gap in benchmarks specifically designed for academic conference QA, despite the increasing diversity and volume of literature datasets.

Simultaneously, augmenting language models with data from various knowledge sources has significantly improved performance on many NLP tasks [22, 23]. Techniques such as Atlas [11], which fine-tunes an encoder-decoder model together with a retriever, and RETRO [12], which integrates retrieved texts into a decoder-only model, utilize large volumes of unstructured text. Other approaches like REPLUG [13] and FLARE [14] dynamically retrieve information based on context, treating LLMs as black boxes. For structured knowledge, methods include extracting triples from knowledge graphs for KGQA tasks [15, 10] and converting them into textual prompts for LLMs [24]. However, the use of hierarchical data, such as tree-structured data, in retrieval augmentation is still limited.

6 Conclusion

In this work, we developed the ConferenceQA dataset, which organizes recent academic conference information into a tree-structured format to support question answering. We also introduced STAR, a novel approach that enhances question-answering performance by generating a textual description for each path within the tree, effectively utilizing both structural and textual information. Together, the ConferenceQA dataset and the STAR method advance the development of robust and adaptable academic conference question-answering systems. Future efforts will focus on integrating LLMs with tree-structured data to improve domain-specific knowledge access and reasoning.

Acknowledgements

This work is funded by the National Natural Science Foundation of China (NSFC 62306276), the Zhejiang Provincial Natural Science Foundation of China (No. LQ23F020017), the Yongjiang Talent Introduction Programme (2022A-238-G), the Ningbo Natural Science Foundation (2023J291), and the Fundamental Research Funds for the Central Universities (226-2023-00138).

References

  • [1] S. Robertson, H. Zaragoza et al., "The probabilistic relevance framework: BM25 and beyond," Foundations and Trends® in Information Retrieval, 2009.
  • [2] OpenAI, "GPT-4 technical report," 2023.
  • [3] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, 2022.
  • [4] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," Advances in Neural Information Processing Systems, 2022.
  • [5] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma et al., "Scaling instruction-finetuned language models," arXiv preprint arXiv:2210.11416, 2022.
  • [6] S. Min, X. Lyu, A. Holtzman, et al., "Rethinking the role of demonstrations: What makes in-context learning work?" arXiv preprint arXiv:2202.12837, 2022.
  • [7] H. Touvron, L. Martin, K. Stone et al., "Llama 2: Open foundation and fine-tuned chat models," 2023.
  • [8] A. Modarressi, A. Imani, M. Fayyaz, and H. Schütze, "RET-LLM: Towards a general read-write memory for large language models," 2023.
  • [9] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, et al., "Toolformer: Language models can teach themselves to use tools," 2023.
  • [10] C. Hu, J. Fu, C. Du, S. Luo, J. Zhao, and H. Zhao, "ChatDB: Augmenting LLMs with databases as their symbolic memory," 2023.
  • [11] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, "Atlas: Few-shot learning with retrieval augmented language models," 2022.
  • [12] S. Borgeaud, A. Mensch, J. Hoffmann, et al., "Improving language models by retrieving from trillions of tokens," in International Conference on Machine Learning. PMLR, 2022.
  • [13] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih, "REPLUG: Retrieval-augmented black-box language models," arXiv preprint arXiv:2301.12652, 2023.
  • [14] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig, "Active retrieval augmented generation," arXiv preprint arXiv:2305.06983, 2023.
  • [15] P. Sen, S. Mavadia, and A. Saffari, "Knowledge graph-augmented language models for complex question answering," 2023.
  • [16] V. Zhong, C. Xiong, and R. Socher, "Seq2SQL: Generating structured queries from natural language using reinforcement learning," arXiv preprint arXiv:1709.00103, 2017.
  • [17] Y. Zhang, B. Jin, Q. Zhu, Y. Meng, and J. Han, "The effect of metadata on scientific literature tagging: A cross-field cross-model study," in Proceedings of the ACM Web Conference 2023, 2023.
  • [18] T. Saier and M. Färber, "unarXive: a large scholarly data set with publications' full-text, annotated in-text citations, and links to metadata," Scientometrics, 2020.
  • [19] C. L. Giles, K. D. Bollacker, and S. Lawrence, "CiteSeer: An automatic citation indexing system," in Proceedings of the Third ACM Conference on Digital Libraries, 1998.
  • [20] J. Ni, C. Qu, J. Lu, Z. Dai, G. H., et al., "Large dual encoders are generalizable retrievers," arXiv preprint arXiv:2112.07899, 2021.
  • [21] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, 2020.
  • [22] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, "Retrieval augmented language model pre-training," in International Conference on Machine Learning. PMLR, 2020.
  • [23] P. Lewis, E. Perez, A. Piktus, F. Petroni, et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," Advances in Neural Information Processing Systems, 2020.
  • [24] Y. Wu, N. Hu, G. Qi, S. Bi, J. Ren, A. Xie, and W. Song, "Retrieve-rewrite-answer: A KG-to-text enhanced LLMs framework for knowledge graph question answering," arXiv preprint arXiv:2309.11206, 2023.
  • [25] A. Q. Jiang, A. Sablayrolles, A. Mensch, et al., "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
  • [26] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
  • [27] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, "Dense passage retrieval for open-domain question answering," arXiv preprint arXiv:2004.04906, 2020.
  • [28] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk, "Approximate nearest neighbor negative contrastive learning for dense text retrieval," arXiv preprint arXiv:2007.00808, 2020.
  • [29] S. Longpre, L. Hou, T. Vu, A. Webson, et al., "The Flan collection: Designing data and methods for effective instruction tuning," in International Conference on Machine Learning. PMLR, 2023.
  • [30] B. Wang and A. Komatsuzaki, "GPT-J-6B: A 6 billion parameter autoregressive language model," 2021.
  • [31] T. Le Scao, A. Fan, C. Akiki, E. Pavlick, et al., "BLOOM: A 176B-parameter open-access multilingual language model," 2022.