Novel approach for big data classification based on hybrid parallel dimensionality reduction using spark cluster

Ali, Ahmed Hussein; Abdullah, Mahmood Zaki

doi:https://doi.org/10.7494/csci.2019.20.4.3373

Article

creativeworkseries.issn	1508-2806
dc.contributor.author	Ali, Ahmed Hussein
dc.contributor.author	Abdullah, Mahmood Zaki
dc.date.available	2025-06-17T10:43:23Z
dc.date.issued	2019
dc.description	Bibliography: pp. 426-429.
dc.description.abstract	The big data concept has prompted studies on how to extract valuable information from huge datasets accurately and efficiently. A major problem in big data mining is high dimensionality, that is, the large number of features in such datasets. High dimensionality degrades the accuracy of machine learning (ML) classifiers and wastes time because many features in the dataset are redundant. This problem can potentially be solved with a fast feature reduction method. Hence, this study presents fast HP-PL, a new hybrid parallel feature reduction framework that uses Spark to perform feature reduction on shared/distributed-memory clusters. Evaluation of the proposed HP-PL on the KDD99 dataset showed the algorithm to be significantly faster than conventional feature reduction techniques. The proposed technique required >1 minute to select 4 features from a dataset with over 79 features and 3,000,000 samples on a 3-node cluster (21 cores in total). The comparative algorithm required more than 2 hours to achieve the same result. In the proposed system, the Hadoop Distributed File System (HDFS) provides distributed storage, while Apache Spark serves as the computing engine. The model was developed as a parallel model with full consideration of the high performance and throughput of distributed computing. In conclusion, the proposed HP-PL method achieves good accuracy with less memory and time than conventional feature reduction methods. The tool is publicly available at https://github.com/ahmed/Fast-HP-PL.	en
dc.description.placeOfPublication	Kraków
dc.description.version	publisher's version
dc.identifier.doi	https://doi.org/10.7494/csci.2019.20.4.3373
dc.identifier.eissn	2300-7036
dc.identifier.issn	1508-2806
dc.identifier.uri	https://repo.agh.edu.pl/handle/AGH/113238
dc.language.iso	eng
dc.publisher	Wydawnictwa AGH
dc.relation.ispartof	Computer Science
dc.rights	Attribution 4.0 International
dc.rights.access	open access
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/legalcode
dc.subject	big data	en
dc.subject	dimensionality reduction	en
dc.subject	parallel processing	en
dc.subject	Spark	en
dc.subject	PCA	en
dc.subject	LDA	en
dc.title	Novel approach for big data classification based on hybrid parallel dimensionality reduction using spark cluster	en
dc.title.related	Computer Science	en
dc.type	article
dspace.entity.type	Publication
publicationissue.issueNumber	No. 4
publicationissue.pagination	pp. 411-429
publicationvolume.volumeNumber	Vol. 20
relation.isJournalIssueOfPublication	fd4c83ac-93cc-4ab1-9b18-c4b33dfba232
relation.isJournalIssueOfPublication.latestForDiscovery	fd4c83ac-93cc-4ab1-9b18-c4b33dfba232
relation.isJournalOfPublication	020291ee-249b-4dcf-98a3-276a2f7981aa

Files

Original bundle

Name: csci.2019.20.4.411.pdf
Size: 872.37 KB
Format: Adobe Portable Document Format

Collections

Articles (CN-csci)