CS6740 - Searching and Indexing in Big Data Sets

Course Data :

Course objective:

Recent scientific and technological advances have caused an explosion in data availability; both in terms of volume and diversity. In the traditional world of data management, we used to design the database schema and the data was forced to comply with that schema. In the current world, we often cannot afford that luxury. Data is generated as part of a naturally evolving process such as in social networks, chemical compounds, sensor readings, etc., and to use datasets more than as a storage mechanism, we need to devise efficient and robust querying techniques. This course is a step towards that goal and will cover the recent state-of-the-art methods for indexing and searching datasets. A strong focus of this course will be on dealing with high-dimensional and noisy datasets, which is typical in various domains.

Course Syllabus:

  • Queries: range queries, top-k queries, reverse top-k queries, multi-attribute top-k queries, top-k diversity queries, skyline queries.
  • Distance measures: lp norm, normalized Euclidean distance, Mahalanobis distance, KL-divergence, earth movers distance.
  • Memory, disk and SSD access: the dynamics of data reads based on the underlying storage architecture and how that affects the index performance. Single-dimensional index structures: B+-tree. Memory-based index structures: kd-tree, quad trees, interval trees, trie, Voronoi diagrams
  • Disk-based structures: R-tree, R-tree variants R+-tree and R*-tree, X-tree, SS-tree, VA-files, M-tree Index structure Vs Hashing in high-dimensional spaces
  • Hashing: extensible hashing, linear hashing, bloom filters, locality sensitive hashing.
  • Indexing and Searching non-traditional queries: multi-attribute top-k queries (Fagins algorithm, threshold algorithm, Onion), indexing skyline queries, indexing diversity queries
  • Dimensionality reduction: SVD, PCA, Fastmap, Lipschitz embedding Index structures and distance functions for Non-vector datasets: text Corpus, time-series datasets, graph datasets

Textbook:

  • "Foundations of Multidimensional and Metric Data Structures" by Hanan Samet, Morgan Kaufmann Publishers, 2005.

References:

  • Relevant papers from database conferences and journals
  • Computational Geometry by de Berg, Cheong, van Krefeld, Overmars. Springer.
  • Introduction to Algorithms by Cormen, Leiserson, Rivest, Stein. Prentice Hall.

Pre-Requisites

    None

Parameters

Credits Type Date of Introduction
3-1-0-4 Elective Jul 2014

Previous Instances of the Course


© 2016 - All Rights Reserved - Dept of CSE, IIT Madras
Website Credits