CS 4240

Northeastern University
College of Computer and Information Science
Spring 2018
When: Thursdays 6-9pm
Where: Snell Library 111
Instructor: Heather Miller
Office: WVH 328
Office Hours: Thursdays 1-3pm
Contact us via Piazza

TA: Rutul Patel
Office: WVH 362
Office Hours: Tuesdays 2pm-4pm
Contact us via Piazza

This course covers techniques for analyzing very large data sets, with an emphasis on approaches that scale out effectively as more compute nodes are added. It introduces principles of distributed data management and strategies for problem-driven data partitioning through a selection of design patterns from various application domains, including graph analysis, databases, text processing, and data mining.

In this edition of the course, we will be focusing predominantly on Apache Spark and Apache Kafka.

Organization

This course is part traditional lecture, part studio course. That is, the first half of most course sessions will be traditional lectures. The second half will vary from live coding, to hands-on exercises (based on the lecture material), to student codewalks.

Communication

Communication with the instructor and teaching assistants is exclusively through Piazza. If you wish to contact the instructor privately, send a private note.

Grades will be managed and assignments will be collected through the course’s Blackboard page.

There is no required textbook for this course. Given that this is a quickly evolving area, there are several specialized developer books that you will find useful for reference and self-study.

Some of these books are available online (and for free) for Northeastern University students from Safari Books Online.

Policies

  • There are no deadline extensions or make-up assignments/exams unless you have a major emergency with appropriate documentation. The following scenarios are not considered emergencies: having an interview scheduled, or having a big homework or project deadline in another course.
  • Please note that you are not allowed to share homework solutions with others, or copy anybody else’s homework entirely or in parts. We will check for originality during the grading process. Violations will be reported both to OSCCR and to the college, and all involved parties will receive an F for the course.

Schedule

Week  Date    Topic
1     Jan 11  No class (POPL)
2     Jan 18  Intro, Data Parallelism, and Scala
3     Jan 25  Intro to Spark
4     Feb 1   Key-Value Pairs and Joins
5     Feb 8   Shuffling, Partitioning
6     Feb 15  DataFrames & Datasets
7     Feb 22  Midterm
8     Mar 1   Other Big Data Tools, Introduction to Streaming
9     Mar 8   No class (Spring Break)
10    Mar 15  Spark Streaming
11    Mar 22  Stateful & Structured Streaming
12    Mar 29  Apache Kafka
13    Apr 5   TensorFlow & Hadoop
14    Apr 12  Final project presentations

Grading

  • In-class quizzes: 10%
  • Assignments: 20%
  • Midterm: 30%
  • Final project proposal: 5%
  • Final project: 35%

Quizzes

Throughout the semester, there will be short quizzes at the end of most lectures. These quizzes will be very brief, and are not meant to be challenging or cause students grief. They will contain very straightforward questions right out of the current lecture. The purpose of these quizzes is merely to encourage students to follow along with the lecture material.

The lowest quiz score will be dropped; this should make up for a bad day or unexcused absence from class.

As a consequence, attendance is expected.

Assignments

There will be a handful of assignments, distributed approximately every two weeks. Assignments are due by noon on Thursdays.

Assignments can be submitted on the course’s Blackboard page.

Assignment                   Distributed       Due
Assignment 1: Anagrams       January 19, noon  January 25, noon
Assignment 2: Wikipedia      January 25, noon  February 1, noon
Assignment 3: StackOverflow  February 1, noon  February 15, noon
Assignment 4: Time Usage     February 8, noon  February 22, noon

Final Project

Students will pair up and propose a significant data processing application as a final project. A one-page project proposal describing the project plan will be due midway through the semester.

More details about the final project.

Timeline for Final Project

March 1st Names of partners for final project due in class.

March 15-19th Brainstorming meeting about final project topic with Heather. Claim an appointment slot here.

March 22nd Project proposal due.

April 12th Project demos/presentations in class.

April 22nd Project reports due on Blackboard before midnight.

Special Accommodations

If the Disability Resource Center has formally approved you for an academic accommodation in this class, please present the instructor with your “Professor Notification Letter” during the class session, so that we can address your specific needs as early as possible.

Resources

Slides and other materials will be posted here.

Jan 18: Intro, Data Parallelism, and Scala

Slides:

Intro to Scala

Jan 25: Intro to Spark

Slides:

Feb 1: Key-Value Pairs and Joins

Slides:

Feb 8: Shuffling, Partitioning

Slides:

Feb 15: SQL, DataFrames, and Datasets

Slides

Mar 1: Other Big Data Tools, Intro to Stream Processing

Slides

Mar 15: Spark Streaming

Slides

Mar 22: Stateful & Structured Streaming

Slides

Mar 29: Apache Kafka

Slides

Apr 5: TensorFlow & Hadoop

Slides