Leaderboard | SCROLLS Benchmark

Date	Model	Contributors	#Params	Input Length	Score (Average)	GovRep (R1/R2/RL)	SumScr (R1/R2/RL)	QMSum (R1/R2/RL)	Qspr (F1)	Nrtv (F1)	QALT (EM-T/H)	CNLI (EM)
02/28/2023	CoLT5 XL	Google Research	5.3B	16K	43.51	61.3/32.2/33.8	36.4/10.1/21.7	36.2/12.9/24.2	53.9	31.1	48.1/43.8	88.4
03/07/2023	LongT5 XL	LongT5	3B	16K	42.53	61.1/32.3/33.7	35.8/9.6/21.1	34.9/11.8/23.5	53.1	29.3	46.0/42.1	88.2
02/28/2023	CoLT5 Large	Google Research	1.46B	16K	41.04	60.7/31.3/32.9	36.7/10.6/22.0	34.9/11.5/23.1	49.8	27.7	39.9/36.8	88.7
03/07/2023	LongT5 Large	LongT5	770M	16K	41.03	60.3/31.1/32.8	35.6/9.2/21.2	35.1/12.0/23.3	52.3	27.2	40.6/38.6	87.3
08/23/2022	BART-LS	Meta AI	460M	16K	39.76	59.4/29.8/30.8	37.7/10.2/21.5	35.1/11.0/22.0	48.7	26.2	37.8/34.0	87.1
03/07/2023	LongT5 Base	LongT5	220M	16K	38.60	57.7/30.0/31.4	34.8/9.6/21.1	33.9/11.0/22.8	46.6	23.0	37.9/36.6	85.6
08/27/2022	BART-large SLED	Ivgi et al.	406M	16K	37.99	57.5/26.3/27.4	35.2/8.7/19.4	34.2/11.0/22.0	46.9	24.1	34.8/34.8	87.3
03/14/2022	UL2	Google Research	20B	2K	37.87	53.6/26.1/28.8	32.9/7.8/19.4	31.1/8.5/20.4	37.6	24.2	45.8/40.7	88.7
02/28/2023	CoLT5 Base	Google Research	433M	16K	37.64	58.7/29.6/31.4	34.5/9.2/20.6	32.0/9.3/21.0	42.1	23.3	36.5/34.0	86.5
01/01/2022	LED Base	SCROLLS team	162M	16K	29.16	56.2/26.6/28.8	24.2/4.5/15.4	25.1/6.7/18.8	26.6	18.5	25.8/25.4	71.5
01/01/2022	BART Base	SCROLLS team	139M	1K	29.01	47.9/18.6/22.7	27.2/4.9/16.7	30.2/8.7/20.7	26.3	15.4	26.0/25.9	77.4
01/07/2022	Naive	SCROLLS team	-	-	19.35	45.3/17.9/20.8	19.6/1.8/11.0	14.2/2.0/9.3	3.4	1.5	25.2/26.1	66.0

* LongT5 rows are based on revised submissions that use a max-output-length of 1024 tokens for GovReport generations.
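
This page does not define the Score (Average) column. The listed numbers appear consistent with the following reading, which is an assumption rather than something stated here: take the geometric mean of R1/R2/RL for each of the three summarization tasks (GovRep, SumScr, QMSum), use F1 for Qspr and Nrtv, use EM-T (not EM-H) for QALT, use EM for CNLI, and then take the arithmetic mean over the seven tasks. A minimal Python sketch under that assumption, with a hypothetical helper name `scrolls_average`:

```python
from statistics import fmean, geometric_mean


def scrolls_average(gov_rep, sum_scr, qm_sum, qspr_f1, nrtv_f1, qalt_em_t, cnli_em):
    """Assumed aggregate: arithmetic mean of seven per-task scores.

    gov_rep, sum_scr, qm_sum: (R1, R2, RL) tuples, collapsed to one number
    per task by a geometric mean (assumption, not stated on the page).
    qalt_em_t: the EM-T value only; EM-H is assumed to be excluded.
    """
    per_task = [
        geometric_mean(gov_rep),
        geometric_mean(sum_scr),
        geometric_mean(qm_sum),
        qspr_f1,
        nrtv_f1,
        qalt_em_t,
        cnli_em,
    ]
    return fmean(per_task)


# Per-task numbers copied from the "Naive" row of the table.
naive = scrolls_average(
    gov_rep=(45.3, 17.9, 20.8),
    sum_scr=(19.6, 1.8, 11.0),
    qm_sum=(14.2, 2.0, 9.3),
    qspr_f1=3.4,
    nrtv_f1=1.5,
    qalt_em_t=25.2,
    cnli_em=66.0,
)
print(round(naive, 2))
```

For the Naive row this matches the listed 19.35 after rounding. Other rows may differ by about 0.01 because the per-task values in the table are already rounded to one decimal place.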