Final Project: Big Data Systems

Project Overview

Given that the database community has been working on (big) data management for decades, it is natural to ask what is novel about modern Big Data systems. This is the key question we aim to answer in this graduate seminar. In this final project, you can show off what you have learnt by doing one of two projects: (1) Write a paper about your personal view of big data systems; (2) Contribute to an open-source data system.

Please choose ONE project type, and follow the steps below to do the project:

  1. Form a team of 1-3 people
  2. Pick a project type and submit an initial plan
  3. Present your project in the poster session
  4. Submit your paper

The table below summarizes the deadlines for each phase:

ID  What            When                         Where
1   Form a Team     Thursday 02/28 at 11:59 PM   Create your team in CourSys
2   Initial Plan    Tuesday 03/12 at 11:59 PM    Submit the filled form to the CourSys activity Initial Plan
3   Poster Session  Wednesday 4/10 at 10:00 AM   Submit your poster to the CourSys activity Poster
                    Wednesday 4/10 at 2:00 PM    Present your poster at T9204 W&E
4   Report          Sunday 4/14 at 11:59 PM      Submit the report to the CourSys activity Report

Types of Projects

Please choose one project type, and follow the corresponding instructions to do the project.

1. Write a paper about your personal view of big data systems

From a functionality point of view, big data systems are quite similar to traditional RDBMSs because both are designed to store and process data better. But why are people more excited about big data systems than about traditional RDBMSs right now? After this graduate seminar, I believe each of you can form a personal view on this question.

If your final project is to write a paper about your personal view of big data systems, the paper should consist of two parts:

  • In the first part (at least 5 pages), you need to list all the aspects in which you think big data systems differ from traditional RDBMSs. For each aspect, you need to come up with a list of questions and answer each of them in detail. For example, if you think "flexibility" is one aspect, you need to explain why you think big data systems are more flexible, which key ideas/techniques make them more flexible, why you think being flexible is very important, what they sacrifice for being flexible, and how hard it is to make traditional RDBMSs as flexible as big data systems.


  • In the second part (at least 7 pages), you need to pick one specific topic (e.g., Query Optimization, Transaction Management, In-memory Database, Large-Scale Dataflow Engines), and do a survey on this topic. Here is a list of steps you need to follow:

    1. Collecting Related Papers. Please follow the instructions in this paper (Sec 3) to find related papers on the topic.
    2. Reading Papers. You should read 20-40 papers in order to write a good survey paper. After reading the papers, please make sure that you fully understand the main research themes, trends, challenges/issues, and results of the topic.
    3. Designing a Classification Criterion. Decide how to classify the existing literature, and draw a diagram of the classification. You can get an idea of what the diagram looks like from these examples (Table 1, Fig 2, Fig 1).
    4. Identifying Pros and Cons. Map the collected papers to their corresponding classes. Compare the classes, and identify the pros/cons of each class.
    5. Predicting Future Trends. Think about which problems on this topic deserve more attention in the future, and why they are challenging.
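
Steps 3 and 4 amount to keeping a small catalogue of the papers you read, each tagged with one class. The sketch below is only one possible way to organize such a catalogue; the names (`Paper`, `group_by_class`, `comparison_rows`) and the placeholder entries are hypothetical and not part of the assignment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str          # placeholder titles only; fill in your own reading list
    year: int
    survey_class: str   # the class assigned by your classification criterion


def group_by_class(papers):
    """Map each class name to its papers, ordered by publication year."""
    groups = defaultdict(list)
    for paper in papers:
        groups[paper.survey_class].append(paper)
    return {cls: sorted(ps, key=lambda p: p.year) for cls, ps in groups.items()}


def comparison_rows(groups, pros_cons):
    """Build one text row per class: paper count, pros, and cons."""
    rows = []
    for cls in sorted(groups):
        pros, cons = pros_cons.get(cls, ([], []))
        pros_text = "; ".join(pros) if pros else "TBD"
        cons_text = "; ".join(cons) if cons else "TBD"
        rows.append(f"{cls} | {len(groups[cls])} papers | pros: {pros_text} | cons: {cons_text}")
    return rows


if __name__ == "__main__":
    papers = [
        Paper("Placeholder paper A", 2015, "Class 1"),
        Paper("Placeholder paper B", 2012, "Class 1"),
        Paper("Placeholder paper C", 2018, "Class 2"),
    ]
    # Hypothetical pros/cons; replace with what you identify in step 4.
    pros_cons = {"Class 1": (["simple to deploy"], ["limited scalability"])}
    for row in comparison_rows(group_by_class(papers), pros_cons):
        print(row)
```

A catalogue like this makes it easy to spot classes with very few papers, which may indicate either a gap in your reading or an emerging direction worth discussing in step 5.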

Here is a suggested outline of the paper. You don't have to use it, but the paper has to cover the material mentioned above.

  • Abstract
  • Introduction
  • A High-level Comparison: RDBMS vs. Big Data Systems
    • RDBMS Overview
    • Big Data Systems Overview
    • RDBMS vs. Big Data Systems
      • Aspect 1
      • Aspect 2
      • ...
  • Deep Dive into Query Optimization (replace with the specific topic you choose)
    • Background and Problem Statement
    • Classification Criterion
    • Presentation of Class 1, Class 2, ...
    • Comparisons of Classes (pros/cons)
    • Emerging Trends
  • Conclusion

Submission

  • Initial Plan: Download the Initial Plan form template, and submit the filled form to the CourSys activity Initial Plan.
  • Poster Session: Make a poster to present your paper. The poster session is scheduled for 2:00 PM on Wednesday 4/10. Please upload your poster to the CourSys activity Poster before 10:00 AM.
  • Report: Write the report using the templates provided at https://www.acm.org/publications/proceedings-template for Word and LaTeX (version 2e). That is, the format of your paper should be like this. The paper has to be at least 12 pages (5 for the first part plus 7 for the second, excluding references).

2. Contribute to an open-source data system

This is the best time to be a data system programmer. Almost all mainstream big data systems are open source. As a data system programmer, you can not only learn how the systems work by directly reading their source code, but also contribute to them (e.g., by adding a new feature or fixing a bug). Being a contributor to an open-source project can be highly rewarding.

If your final project is to contribute to an open-source data system, here is a list of steps you need to follow.

  1. Which system to choose. Any project on this page or this page will work. If your project is not on these pages, you can still choose it if it meets two requirements: a) it is a project with 100+ stars on GitHub; b) it belongs to one of the categories listed on the outline (e.g., Distributed Programming, Key-Map Data Model).
  2. How to contribute. A good open-source project will have detailed documentation explaining how to make a contribution (e.g., Contributing to Spark). Try to find the corresponding doc for the project you want to contribute to. In fact, the process is similar to uploading your slides to our course website.
  3. What to contribute. Many open-source projects use JIRA to track issues. For example, you can find Spark's issues here. An issue is either a bug or an improvement. Please feel free to choose any issue from your project's JIRA and work on it.
  4. What is a successful contribution. Due to time constraints, I do not expect you to get your contribution (i.e., pull request) accepted by the open-source community. You only need to demonstrate your contribution during our poster session. For example, if you fix a bug, you can first show how the original system behaves abnormally, and then present how your contributed code resolves the abnormal behavior.
  5. Paper Writing. Here is a suggested outline of the paper.
    • Introduction: A brief introduction of the project you contribute to
    • Issue Description: A short description of the issue you resolve
    • Your Contribution: This is the most important part. Explain clearly how you resolve the issue
    • What you have learnt: Anything you've learnt from this experience
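
The before/after demonstration described in step 4 can be as simple as running the same inputs through the original and the patched code and comparing the results. The sketch below is purely illustrative: `parse_size_original`, `parse_size_fixed`, and the "k" suffix bug are invented for this example and do not come from any real project.

```python
def parse_size_original(text):
    """Hypothetical buggy version: drops the 'k' suffix without scaling."""
    return int(text.strip().rstrip("kK"))


def parse_size_fixed(text):
    """Hypothetical patched version: treats a trailing 'k' as a factor of 1024."""
    text = text.strip()
    if text[-1:] in ("k", "K"):
        return int(text[:-1]) * 1024
    return int(text)


def demonstrate(func, cases):
    """Run func on each input and record (input, expected, actual, passed)."""
    results = []
    for text, expected in cases:
        actual = func(text)
        results.append((text, expected, actual, actual == expected))
    return results


if __name__ == "__main__":
    cases = [("512", 512), ("4k", 4096)]
    for name, func in [("original", parse_size_original), ("fixed", parse_size_fixed)]:
        print(name)
        for text, expected, actual, passed in demonstrate(func, cases):
            status = "ok" if passed else "WRONG"
            print(f"  input={text!r} expected={expected} actual={actual} {status}")
```

Showing the failing case first and then the same case passing with your patch makes the contribution easy to follow at a poster, and the same cases can usually be turned into a regression test for your pull request.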

Submission

  • Initial Plan: Download the Initial Plan form template, and submit the filled form to the CourSys activity Initial Plan.
  • Poster Session: You are required not only to make a poster but also to demonstrate the system during the poster session. The poster session is scheduled for 2:00 PM on Wednesday 4/10. Please upload your poster to the CourSys activity Poster before 10:00 AM.
  • Report: Write the paper using the templates provided at https://www.acm.org/publications/proceedings-template for Word and LaTeX (version 2e). That is, the format of your paper should be like this. The paper has to be at least 2 pages (excluding references).