EVENTS: A Dataset on the History of Top-Prestigious Events in Five Computer Science Communities

Information emanating from scientific events, journals, organizations, and institutions, as well as from scholars, is becoming increasingly available online. Therefore, there is a great demand to assess, analyse and organize this huge amount of data produced every day, or even every hour. In this paper, we present a dataset (EVENTS) of scientific events, containing historical data about the publications, submissions, start date, end date, location and homepage for 25 top-prestigious event series (718 editions in total) in five computer science communities. The dataset is publicly available online in three different formats (i.e., CSV, XML, and RDF). It is of primary interest to the steering committees or program chairs of the events, who can assess the progress of their event over time and compare it to competing events in the same field, and to potential authors looking for events at which to publish their work. In addition, we shed light on these events by analyzing their metadata over the last 50 years. Our transferable analysis is based on exploratory data analysis.


Introduction
Digitization is of crucial importance to all areas of scholarly communication. Therefore, over the last two decades, many organizations and institutes have begun to organize and establish new scientific events. This paper discusses some facts and figures representing 50 years of history of computer science events, where conferences, symposia, and workshops are of paramount importance and a major means of scholarly communication. A key question is: How does digitization affect scholarly communication in computer science? In particular, we address the following questions: a) What is the trend of submissions and acceptance rates? b) How did the number of publications change? c) Has the number of publications of any computer science sub-community increased? d) Has the geographical distribution of events changed across various regions of the world? e) Which events are more geographically diverse than others?
We target some of these questions by analyzing comprehensive scholarly communication metadata from computer science events in the last 50 years. Our analysis methodology is based on exploratory data analysis, which aims at analyzing data to explore its main characteristics, oftentimes with visual methods. We analyze the key characteristics of scientific events over time, including their CORE, Qualis (Q) and GII rankings, geographic distribution, average acceptance rate, time distribution over the year, submissions and publications. We selected five top-prestigious events in each of five CS communities, which we derived by analyzing the topics covered by each event series and then mapping the event series to the ACM Computing Classification System (CCS): Information systems (IS), Security and privacy (SEC), Artificial intelligence (AI), Computer systems organization (CSO) and Software and its engineering (SE). Events will only be referred to using their acronym. We believe that the EVENTS dataset will have a great impact on the scholarly communication community, particularly for the following stakeholders (cf. [6]): a) event organizers: to trace their events' progress/impact, b) authors: to identify prestigious events to submit their research results to, c) proceedings publishers: to know the impact of the events whose proceedings they are publishing.
This article is organized as follows: Section 2 gives an overview of related work. Section 3 presents the main characteristics of the dataset. Section 4 explains the curation process of creating and evolving the dataset. Section 5 discusses the results of our analysis of the dataset. Section 6 concludes and outlines our future work.

Related Work
In our recent review of the literature [1,4,5,8,9], we found that most studies tended to focus on extracting information about scholarly communication from bibliographic metadata. Ameloot et al. presented a comprehensive analysis of the Principles of Database Systems (PODS) conference series including word clouds of most PODS researchers and newcomers, longest streaks and locations of PODS in the period 2002-2011 [2]. Similarly, Aumüller and Rahm [3] analyzed affiliations of database publications using author information from DBLP. Fathalla et al. [7] provided an analysis of 40 computer science conference series in terms of continuity, time and geographic distribution, submissions and publications. Barbosa et al. [4] analysed the metadata of 340 full papers published in 14 editions of the Brazilian Symposium on Human Factors in Computing Systems (IHC). Vasilescu et al. [10] presented a dataset of eleven software engineering conferences, containing historical data about publications and program committees in the period 1994-2012. Agarwal et al. [1] presented a bibliometric analysis of the metadata of seven ACM conferences covering different CS fields such as information management, data mining, digital libraries and information retrieval.

Characteristics of the EVENTS Dataset
The EVENTS dataset covers historical information about 25 top-prestigious events of the last five decades, including (where available) an event's full title, acronym, start date, end date, number of submissions, number of accepted papers, city, state, country, event type, field and homepage. These global indicators have been used to spot and interpret peculiarities in the temporal and geographical evolution of event series. There are two types of events: conferences and symposia. Table 1 provides high-level statistics for the 25 event series in the five CS communities of IS, SEC, AI, CSO, and SE. Entries refers to all available attributes of all events.
Use Cases. Using this dataset, event organizers and chairs will be able to assess their selection process, e.g., to keep, if desired, the acceptance rate stable even when the submissions increase, to make sure the event is held around the same time each year, and to compare against other competing events. Furthermore, we believe this dataset will help researchers decide to which events they could submit their work, e.g., by answering questions such as "which events have a high impact in a particular CS field?". Moreover, when a specific conference is held around the same time each year, it helps them to prepare their research within the conference's usual timeline. Section 5 presents a part of the analysis that could be performed by using the EVENTS dataset.
Extensibility. EVENTS can be extended in three dimensions to meet future requirements by 1) adding more events in each community, 2) adding events in other communities and 3) adding more attributes such as hosting university or organization, sponsors, and event steering committees or program committee chairs.
Availability. EVENTS is published at https://saidfathalla.github.io/EVENTS-Dataset/EVENTS.html. It is subject to the Creative Commons Attribution license, as documented at https://saidfathalla.github.io/EVENTS-Dataset/EVENTS_Licence.html. The RDF version has been validated using the W3C Validation Service. The following listing shows the information about the AAAI conference of 2017 in RDF. We defined new vocabularies in the OpenResearch namespace.

Data Curation
While collecting the data for the dataset, we faced several technical problems, such as having to eliminate irrelevant and redundant data, unify event names, complete missing data, and correct incorrect data. Therefore, a data curation process is required. The EVENTS dataset is maintained over time according to the curation process described in the remainder of this section.

Data Acquisition
After identifying top events, the metadata (raw data) of these events is collected from either structured or unstructured data sources. The metadata of the selected events has been manually collected from various sources such as the IEEE Xplore Digital Library, the ACM Digital Library, DBLP, OpenResearch.org and event websites. The selection is based on several criteria such as CORE ranking, Qualis ranking, GII ranking and Google Scholar h5-index (the largest number h such that h articles published in the last 5 complete years have at least h citations each).
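The h5-index definition quoted above can be made concrete with a minimal sketch. The function name and the citation counts are illustrative assumptions and are not taken from the dataset; the sketch only assumes that a list of per-article citation counts for the last five complete years is available.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for articles published in the last 5 complete years.
sample_citations = [10, 8, 5, 4, 3, 0]
print(h_index(sample_citations))
```

Restricting the input to articles from the last five complete years is what turns the plain h-index into the h5-index; that filtering step is assumed to happen before the function is called.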

Data Preprocessing
The main objective of the data preprocessing phase is to fill in missing data, to identify and correct incorrect data, to eliminate irrelevant data and to resolve inconsistencies. In order to prepare the raw data for analysis, we carried out four preprocessing processes: data integration, data cleansing, data transformation and event name unification.
Data integration. This process involves combining data from multiple sources into meaningful and valuable information. In addition, this process also involves eliminating redundant data that arise during the integration process.
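As an illustration of the data integration step, the following sketch merges records from several sources and drops redundant entries. In the paper this step was carried out manually; the record layout, the key (acronym, year) and all values below are hypothetical placeholders rather than the dataset's actual schema or contents.

```python
def integrate(*sources):
    """Merge event records from several sources, keyed by (acronym, year)."""
    merged = {}
    for records in sources:
        for record in records:
            key = (record["acronym"], record["year"])
            combined = merged.setdefault(key, {})
            for field, value in record.items():
                # Keep the first non-missing value seen for each field.
                if value is not None and combined.get(field) is None:
                    combined[field] = value
    return list(merged.values())


# Hypothetical records from two sources; the numbers are placeholders.
source_a = [{"acronym": "XYZ", "year": 2017, "submissions": 100, "accepted": None}]
source_b = [
    {"acronym": "XYZ", "year": 2017, "submissions": 100, "accepted": 25},
    {"acronym": "XYZ", "year": 2016, "submissions": 90, "accepted": 20},
]
events = integrate(source_a, source_b)
print(len(events))
```

Keying on the event acronym and year means that the same edition reported by two sources collapses into one record, while a value missing in one source can be filled from the other.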
Data cleansing. This process involves detecting and correcting incorrect or inaccurate records. For instance, we found several websites providing incorrect information about events' submissions and accepted papers. We double-checked this information against the events' official websites or proceedings published in digital libraries.
Data transformation. This process involves converting cleaned data values from unstructured formats into a structured one. For instance, data collected from event websites as text (i.e., unstructured format) is manually transformed to CSV (i.e., structured format) and subsequently to XML and RDF.
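The CSV-to-XML part of the transformation could be sketched as follows, using only the Python standard library. The element names (events, event) and the sample columns are assumptions made for illustration and do not necessarily match the published XML schema; the sketch also assumes that every CSV column name is a valid XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text):
    """Convert CSV text into an XML string with one <event> element per row."""
    root = ET.Element("events")
    for row in csv.DictReader(io.StringIO(csv_text)):
        event = ET.SubElement(root, "event")
        for column, value in row.items():
            # Each CSV column becomes a child element named after the column.
            ET.SubElement(event, column).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-row CSV; the values are placeholders.
sample_csv = "acronym,year,country\nXYZ,2017,Germany\n"
xml_text = csv_to_xml(sample_csv)
print(xml_text)
```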
Event name unification. This process involves integrating all editions of an event series that has changed its name since its beginning under its most recent name, because it is important for researchers to know the recent name rather than the old one. However, the old name remains important for a researcher who wants to get an overview of the history of an event. For example, PLDI is the unified name of the Conference on Programming Language Design and Implementation, which was named Symposium on Compiler Construction in the period 1979-1986 and Symposium on Interpreters and Interpretive Techniques in 1987, and which finally assumed its recent name in the period 1989-2018, i.e., for 30 years. With the completion of these steps, we are now ready to perform our exploratory data analysis.
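The event name unification step can be sketched as a simple lookup that maps each historical title to the unified series acronym while preserving the original title. The mapping uses the PLDI names given above; the record layout and the field name "series" are hypothetical.

```python
# Historical and current titles of the PLDI series, as described in the text.
UNIFIED_NAMES = {
    "Symposium on Compiler Construction": "PLDI",
    "Symposium on Interpreters and Interpretive Techniques": "PLDI",
    "Conference on Programming Language Design and Implementation": "PLDI",
}


def unify(record):
    """Return a copy of the record with the unified series name added."""
    unified = dict(record)
    # Unknown titles are kept unchanged so that no edition is silently dropped.
    unified["series"] = UNIFIED_NAMES.get(record["title"], record["title"])
    return unified


# Hypothetical edition records (record layout is an assumption).
editions = [
    {"title": "Symposium on Compiler Construction", "year": 1979},
    {"title": "Conference on Programming Language Design and Implementation", "year": 2018},
]
unified_editions = [unify(edition) for edition in editions]
print(sorted({edition["series"] for edition in unified_editions}))
```

Keeping the original title next to the unified name reflects the point made above that the old name is still useful for tracing the history of a series.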

Data Analysis And Results
We have analyzed the metadata of CS events in the EVENTS dataset over the last 50 years, including the h5-index, the average acceptance rate, the number of editions of each event, the country that hosted most editions of the event, the month in which the event is usually held each year, the year of the first edition, and the publisher of the proceedings.
Submissions and publications. Figure 1 presents the numbers of accepted and submitted papers for the top events, i.e., high-ranked events in terms of h5-index and event ranking services, in the five CS communities from 1985 to 2017. The gap between submitted and accepted papers refers to how far the number of submissions is from the number of accepted papers. For the CVPR conference, the numbers of submitted and accepted papers were very close in the first edition in 1985; the gap between them began to increase slightly until 2000, and then increased noticeably until the end of the time span, i.e., 2017. However, the gap between the number of submitted papers and accepted papers in VLDB remained the same during the whole time span. Overall, we can see a clear upward trend in the number of submitted and accepted papers during the whole time span. The reason is that digitization makes more research papers available to the whole community, and submitting papers and even contacting papers' reviewers has become much easier and more efficient.
Time distribution. We observed that the organizers of the prestigious events always try to hold their events around the same month each year, which helps researchers who want to submit their work to anticipate the date of the next edition of an event. Namely, PLDI has been held 30 times (out of 36) in June and SP has been held 31 times (out of 39) since 1989 in May.
Acceptance rate. We analyzed the acceptance rate of the events involved in the study over the last 50 years. As shown in Figure 2, we computed, for each event, the average acceptance rate since its beginning. Interestingly, we found that the average acceptance rate for all events, since the first edition, falls into the range of 15% to 31% in the time window of 50 years. Overall, the largest acceptance rate is that of PODC at 31%, while PERCOM has the smallest one at 15%.
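The average acceptance rate per event can be sketched as follows. This assumes that the average is the mean of the per-edition rates (accepted divided by submitted) and that editions with missing counts are skipped; the text does not specify either detail. The record layout and the sample numbers are hypothetical placeholders, not dataset values.

```python
from collections import defaultdict


def average_acceptance_rates(editions):
    """Return the mean per-edition acceptance rate for each event acronym."""
    rates = defaultdict(list)
    for edition in editions:
        submitted = edition["submissions"]
        accepted = edition["accepted"]
        # Skip editions whose counts are missing or unusable.
        if submitted and accepted is not None:
            rates[edition["acronym"]].append(accepted / submitted)
    return {acronym: sum(values) / len(values) for acronym, values in rates.items()}


# Hypothetical editions; the numbers are placeholders.
sample_editions = [
    {"acronym": "XYZ", "year": 2016, "submissions": 100, "accepted": 20},
    {"acronym": "XYZ", "year": 2017, "submissions": 200, "accepted": 60},
    {"acronym": "XYZ", "year": 1985, "submissions": None, "accepted": None},
    {"acronym": "ABC", "year": 2017, "submissions": 50, "accepted": 10},
]
average_rates = average_acceptance_rates(sample_editions)
print(average_rates)
```

An alternative would be to divide the total number of accepted papers by the total number of submissions across all editions, which weights large editions more heavily; the two choices generally give different values.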
H5-index. Figure 3 presents the h5-index of all event series along with their CORE 2018 ranking. The highest h5-index is that of CVPR at 158, while PODC has the smallest one.
Geographical distribution. We analyzed the geographical distribution of each event in the dataset. The key question is which countries hosted most of the top events in the EVENTS dataset, and how frequently a country has hosted an event during the last five decades. Figure 4 shows how frequently different countries around the world have hosted a top event in the five CS communities considered in the study. We observed that the USA leads by far, having hosted most editions of CVPR, ISCA, VLDB, ICSE, and CCS. Canada comes second, hosting most editions of ISCA, VLDB, and ICSE.
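Counting how often each country hosted an event, as done for the geographical distribution above, could be sketched as follows. The record layout and the sample values are hypothetical placeholders and do not reproduce the figures reported in this paper.

```python
from collections import Counter


def host_country_counts(editions):
    """Count how many editions each country hosted, ignoring missing countries."""
    return Counter(e["country"] for e in editions if e.get("country"))


def host_share(editions, country):
    """Return the fraction of editions (with a known country) hosted by one country."""
    counts = host_country_counts(editions)
    total = sum(counts.values())
    return counts[country] / total if total else 0.0


# Hypothetical editions; the values are placeholders.
sample_hosts = [
    {"acronym": "XYZ", "year": 2014, "country": "USA"},
    {"acronym": "XYZ", "year": 2015, "country": "USA"},
    {"acronym": "XYZ", "year": 2016, "country": "Canada"},
    {"acronym": "XYZ", "year": 2017, "country": "USA"},
    {"acronym": "XYZ", "year": 2018, "country": None},
]
print(host_country_counts(sample_hosts).most_common(1))
print(host_share(sample_hosts, "USA"))
```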
Table 2 shows the scientometric profile of all events in the EVENTS dataset in the five considered CS communities, ordered by descending h5-index for each community. The AI community has the largest average h5-index of 89.9; SEC comes second with 62. Surprisingly, despite the Qualis ranking of RecSys as B1, the h5-index of RecSys is relatively high, and it is ranked as A by CORE and as A- by GII. Regarding publishers, we observed that ACM publishes most of the event proceedings, and IEEE comes next. However, we observed that some events such as NDSS and USENIX publish their proceedings on their own website. In terms of the number of editions, ISCA has the longest history with 45 editions since 1969, while RecSys is the newest one, with 12 editions since 2007. Although RecSys is a relatively new conference, it has a good reputation and it is highly ranked in CORE, GII, and Qualis.

Conclusion
In this paper, we presented a dataset (EVENTS) of metadata about conferences and symposia, containing historical data about 25 top-prestigious events in five computer science communities. We presented our methodology of creating the dataset, starting from identifying prestigious events, through data acquisition and preprocessing, to finally publishing the dataset. To the best of our knowledge, this is the first time a dataset is published that contains metadata of top-prestigious events in Information systems, Security and privacy, Artificial intelligence, Computer systems organization and Software and its engineering. This dataset can be used to compare scientific events in the same community, which is useful for both event organizers and less experienced researchers. In summary, we made the following observations:
- During data acquisition, we observed that there is not much information about events prior to 1990, in particular on the number of submissions and accepted papers.
- Organizers of the prestigious events try to keep the events held around the same month each year.
- There is a clear upward trend in the number of submitted and accepted papers during the whole time span due to the digitization of scholarly communication. However, the digitization of scholarly communication also has negative impacts, most significantly the proliferation of submissions, which significantly increases the reviewing workload.
- Among all countries, the USA hosted about 76% of the events in the dataset in the last five decades.
To further our research, we are planning to systematically investigate review quality and to update EVENTS to meet future requirements by adding more events in each community and more attributes such as hosting university or organization, sponsors, and event steering committees or program committee chairs. Furthermore, we plan to perform more exploratory analysis by applying more metrics such as geographical distribution and publications by continents, event continuity, event progress rate and acceptance rate stability.

Figure 1: Variation of the number of submitted and accepted papers of the top event in each CS community.

Figure 2: Average acceptance rate of all events

Figure 3: H5-index of all events along with CORE 2018 ranking

Figure 4: Geographical distribution of the top event in each CS community since 1973.

Table 2: Scientometric profile of all events in the EVENTS dataset in the five CS communities. N is the number of editions as of 2018.