Staff View: Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques

Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques

Given the growth of scientific literature on the web, particularly material science, acquiring data precisely from the literature has become more significant. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes...

Bibliographic Details
Main Authors:	Miah, M. Saef Ullah, Junaida, Sulaiman, Sarwar, Talha, Naseer, Ateeqa, Ashraf, Fasiha, Kamal Zuhairi, Zamli, Jose, Rajan
Format:	Article
Language:	English
Published:	MDPI 2022
Subjects:	QA76 Computer software T Technology (General) TA Engineering (General). Civil engineering (General) TK Electrical engineering. Electronics Nuclear engineering
Online Access:	http://umpir.ump.edu.my/id/eprint/33377/1/Sentence%20boundary%20extraction%20from%20scientific%20literature%20of%20electric%20double%20layer.pdf http://umpir.ump.edu.my/id/eprint/33377/ https://doi.org/10.3390/app12031352

id	my.ump.umpir.33377
record_format	eprints
spelling	my.ump.umpir.333772022-09-06T07:46:24Z http://umpir.ump.edu.my/id/eprint/33377/ Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques Miah, M. Saef Ullah Junaida, Sulaiman Sarwar, Talha Naseer, Ateeqa Ashraf, Fasiha Kamal Zuhairi, Zamli Jose, Rajan QA76 Computer software T Technology (General) TA Engineering (General). Civil engineering (General) TK Electrical engineering. Electronics Nuclear engineering Given the growth of scientific literature on the web, particularly material science, acquiring data precisely from the literature has become more significant. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes using the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content. Appropriate textual content means a complete, meaningful sentence from a large chunk of textual content. The process of detecting the beginning and end of a sentence and extracting them as correct sentences is called sentence boundary extraction. The accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. Therefore, this study provides a comparative analysis of different tools for extracting PDF documents into text, which are available as Python libraries or packages and are widely used by the research community. The main objective is to find the most suitable technique among the available techniques that can correctly extract sentences from PDF files as text. The performance of the used techniques Pypdf2, Pdfminer.six, Pymupdf, Pdftotext, Tika, and Grobid is presented in terms of precision, recall, f-1 score, run time, and memory consumption. NLTK, Spacy, and Gensim Natural Language Processing (NLP) tools are used to identify sentence boundaries. Of all the techniques studied, the Grobid PDF extraction package using the NLP tool Spacy achieved the highest f-1 score of 93% and consumed the least amount of memory at 46.13 MegaBytes. MDPI 2022-02-01 Article PeerReviewed pdf en cc_by_4 http://umpir.ump.edu.my/id/eprint/33377/1/Sentence%20boundary%20extraction%20from%20scientific%20literature%20of%20electric%20double%20layer.pdf Miah, M. Saef Ullah and Junaida, Sulaiman and Sarwar, Talha and Naseer, Ateeqa and Ashraf, Fasiha and Kamal Zuhairi, Zamli and Jose, Rajan (2022) Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques. Applied Sciences, 12 (3). pp. 1-19. ISSN 2076-3417 https://doi.org/10.3390/app12031352 https://doi.org/10.3390/app12031352
institution	Universiti Malaysia Pahang
building	UMP Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaysia Pahang
content_source	UMP Institutional Repository
url_provider	http://umpir.ump.edu.my/
language	English
topic	QA76 Computer software T Technology (General) TA Engineering (General). Civil engineering (General) TK Electrical engineering. Electronics Nuclear engineering
spellingShingle	QA76 Computer software T Technology (General) TA Engineering (General). Civil engineering (General) TK Electrical engineering. Electronics Nuclear engineering Miah, M. Saef Ullah Junaida, Sulaiman Sarwar, Talha Naseer, Ateeqa Ashraf, Fasiha Kamal Zuhairi, Zamli Jose, Rajan Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques
description	Given the growth of scientific literature on the web, particularly material science, acquiring data precisely from the literature has become more significant. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes using the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content. Appropriate textual content means a complete, meaningful sentence from a large chunk of textual content. The process of detecting the beginning and end of a sentence and extracting them as correct sentences is called sentence boundary extraction. The accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. Therefore, this study provides a comparative analysis of different tools for extracting PDF documents into text, which are available as Python libraries or packages and are widely used by the research community. The main objective is to find the most suitable technique among the available techniques that can correctly extract sentences from PDF files as text. The performance of the used techniques Pypdf2, Pdfminer.six, Pymupdf, Pdftotext, Tika, and Grobid is presented in terms of precision, recall, f-1 score, run time, and memory consumption. NLTK, Spacy, and Gensim Natural Language Processing (NLP) tools are used to identify sentence boundaries. Of all the techniques studied, the Grobid PDF extraction package using the NLP tool Spacy achieved the highest f-1 score of 93% and consumed the least amount of memory at 46.13 MegaBytes.
format	Article
author	Miah, M. Saef Ullah Junaida, Sulaiman Sarwar, Talha Naseer, Ateeqa Ashraf, Fasiha Kamal Zuhairi, Zamli Jose, Rajan
author_facet	Miah, M. Saef Ullah Junaida, Sulaiman Sarwar, Talha Naseer, Ateeqa Ashraf, Fasiha Kamal Zuhairi, Zamli Jose, Rajan
author_sort	Miah, M. Saef Ullah
title	Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques
title_short	Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques
title_full	Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques
title_fullStr	Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques
title_full_unstemmed	Sentence boundary extraction from scientific literature of electric double layer capacitor domain: Tools and techniques
title_sort	sentence boundary extraction from scientific literature of electric double layer capacitor domain: tools and techniques
publisher	MDPI
publishDate	2022
url	http://umpir.ump.edu.my/id/eprint/33377/1/Sentence%20boundary%20extraction%20from%20scientific%20literature%20of%20electric%20double%20layer.pdf http://umpir.ump.edu.my/id/eprint/33377/ https://doi.org/10.3390/app12031352
_version_	1744353870301102080
score	13.211869

Similar Items