Please use this identifier to cite or link to this item: http://dx.doi.org/10.25673/31719
Title: An evaluation of deep hashing for high-dimensional similarity search on embedded data
Author(s): Pawar, Rutuja Shivraj
Referee(s): Saake, Gunter
Campero Durand, Gabriel
Granting Institution: Otto-von-Guericke-Universität Magdeburg, Fakultät für Informatik
Issue Date: 2019
Type: Master Thesis
Language: English
Publisher: Otto von Guericke University Library, Magdeburg, Germany
URN: urn:nbn:de:gbv:ma9:1-1981185920-318653
Subjects: Information science
Abstract: Data is accumulating at an exponential rate, which makes it increasingly challenging to retrieve relevant information. In this setting, high-dimensional similarity search is a popular method for extracting relevant information from large data volumes (Big Data), and it drives various Machine Learning (ML) tasks, including near-duplicate detection and location recognition. However, Big Data, due to its characteristics, poses a variety of challenges to ML applications, such as high class imbalance, the need for feature engineering to support heterogeneous data, and the need for efficient solutions for queries over array data. Consequently, in this thesis we aim to optimize the data analytics pipeline for the use and effective management of feature engineering data (embedding data), as one solution in the context of high-dimensional similarity search. In doing so, we evaluate the impact of similarity-preserving hashing on data blocking and skipping for two ML applications: supervised entity resolution and top-k similarity search. Specifically, we make the following contributions. First, we work with embedding data as an approach to highlight semantic similarity in the data, thus making it more manageable. We experiment with three dataset pairs from two domains, bibliographic and e-commerce, with their attributes embedded using a pre-trained fastText model. Further, because of its fast query speed and low memory cost, we consider similarity-preserving hashing as the technique to manage these embedding data and efficiently support high-dimensional similarity search. We consider two hashing techniques: Locality Sensitive Hashing (LSH), which is data-independent, and Learning To Hash (L2H), which is data-dependent.
Second, based on well-defined metrics, we experimentally evaluate the efficiency and classification accuracy of LSH - Super-Bit, with a focus on the task of supervised entity resolution. Third, based on the same metrics, we experimentally evaluate and compare LSH - Super-Bit with L2H - Deep Hashing. For this comparison, we use a Deep Hash Neural Net (DHNN) that we designed based on the literature. This network is our main contribution: a deep hashing neural network generalized to work with embedding data. In this evaluation, we report superior performance of L2H - Deep Hashing over LSH - Super-Bit for the task of supporting supervised entity resolution. Finally, based on the outcome of the experimental evaluation, we evaluate the runtime performance and speed-up that L2H - Deep Hashing brings to top-k similarity search queries in Apache Spark, using different file formats.
URI: https://opendata.uni-halle.de//handle/1981185920/31865
http://dx.doi.org/10.25673/31719
Open Access: Open access publication
License: (CC BY-SA 4.0) Creative Commons Attribution ShareAlike 4.0
Appears in Collections: Fakultät für Informatik (OA)

Files in This Item:
File: Pawar_Rutuja_Master thesis_2019.pdf
Description: Master thesis
Size: 12.73 MB
Format: Adobe PDF