CS 550: Massive Data Mining and Learning (Spring 2021)


Course Information


Instructor: Hao Wang
Email: hogue.wang@rutgers.edu
Office: CBIM 008

Time: By Arrangement

This course is asynchronous remote. Lectures will be posted on Canvas and/or the course website. There will be office hours via Zoom.

Location: Canvas and/or Zoom

Office Hours: First and Third (and Fifth) Thursdays of Each Month: 3:00-4:00pm (Eastern Time)
Second and Fourth Thursdays of Each Month: 10:00-11:00pm (Eastern Time)
or by appointment
Office Hour Link: Please use this Zoom link.

TA: Juntao Tan, Naveen Narayanan Meyyappan
Email: jt867 AT cs.rutgers.edu, nm941 AT scarletmail.rutgers.edu
TA Office: Virtual
TA Office Hours: First and Third (and Fifth) Thursdays of Each Month: 10:00-11:00pm (Eastern Time)
Second and Fourth Thursdays of Each Month: 3:00-4:00pm (Eastern Time)
or by appointment
TA Office Hour Link: Please use this Zoom link.

Textbook: (LRU) Mining Massive Data Sets by J. Leskovec, A. Rajaraman, J. D. Ullman



Announcements

  • Lecture 1 and HW0 are released on Canvas.


    Course Description

    This course introduces the computing infrastructure, algorithms, theory, and practice of massive data analytics and machine learning, as well as their application in frequently used scenarios, including recommender systems, web search engines, social networks, computational advertising, e-commerce, etc. Students will learn algorithms to store, process, mine, analyze, and synthesize streaming data, or data at rest that does not fit in random access memory. Advanced deep learning techniques, such as Bayesian deep learning and adversarial domain adaptation, for various data mining and healthcare applications will also be covered. The material covered here equips students with the main backend algorithms and infrastructure for the Capstone Project and for research tasks closely related to data science and analytics.



    Prerequisites

  • CS 512 or CS 513 (Fundamental Algorithms)
  • Linear Algebra, Basic Probability (Moments, Typical Distributions, MLE)
  • Programming Languages: C++/Java
  • Infrastructure: Hadoop Cluster


    Expected Work

  • Homework: 4 homework assignments (12.5% each, 50% total)
  • Midterm: Friday, Mar 12, 24-hour take-home exam (15%)
  • Final Project: Completed in teams of at most 4 students. Each team chooses one topic from a list of provided project tasks, gives a 10-minute presentation, and submits the code, data, and a report (35%)

  • To accommodate students in various time zones and the COVID-19 situation, the tentative plan is for the midterm to be a 24-hour take-home exam. Students are expected to complete the exam entirely on their own and to respect the university honor code.



    Tentative Schedule

    Note that the following schedule for posting lectures on Canvas is subject to change. Please check the course website frequently for the latest schedule.

    Week  Date  Topics and Assignments
    1     1/21  Introduction (Reading: Ch 1, Ch 2.1-2.4)
    2     1/28  Frequent Item Sets Mining (Reading: Ch 6)
                Association Rule Mining (Reading: Ch 6)
    3     2/4   Locality-Sensitive Hashing (Reading: Ch 3)
    4     2/11  Clustering, similarity, k-means, BFR (Reading: Ch 7.1-7.4)
    5     2/18  Dimensionality Reduction, SVD, CUR (Reading: Ch 11)
    6     2/25  Content-based Recommendation (Reading: Ch 9.1-9.2)
                Collaborative Filtering, Latent Factor Models (Reading: Ch 9.3-9.4, PMF, CDL)
    7     3/4   Learning to Rank and Deep Learning for RS, Project Description
                Link Analysis, PageRank (Reading: Ch 5.1-5.3, 5.5)
    8     3/11  Web Spam, TrustRank (Reading: Ch 5.4)
                Midterm exam (take-home, 24 hours)
    9     3/18  No class, Spring recess
    10    3/25  Social Networks, Community Detection (Reading: Ch 10.1-10.2, 10.6)
                Spectral Clustering, Trawling (Reading: Ch 10.1-10.2, 10.6)
    11    4/1   Overlapping Communities (Reading: Ch 10.3-10.5, 10.7-10.8)
                Large-scale Machine Learning (Reading: Ch 12)
    12    4/8   Mining Data Streams (Reading: Ch 4)
    13    4/15  Computational Advertising (Reading: Ch 8)
                Learning through Experimentation with Bandit-based Learning
    14    4/22  Bayesian Deep Learning (BDL, Reading: BDL Survey)
                Domain Adaptation (DA, Reading: DANN, DA Theory, Continuous DA)
    15    4/29  Project Presentations and Summary of the Class