CMPT 733: Big Data Programming II (SFU, Spring 2018)

Objectives

From CMPT 726 and CMPT 732, students have learned machine learning algorithms and big data programming tools. However, when facing a real-world data problem, students will find that there is still a gap between what they have learned in class and what they will do in practice. The goal of this course is to fill this gap, enabling students to apply what they have learned to solve real-world problems. To achieve this goal, the course covers a set of important topics that a data scientist should know and teaches state-of-the-art approaches. After taking this course, students should feel confident when asked to extract value from real-world data sets, and should know how to ask interesting questions about data, choose proper tools, design data-processing pipelines, and present final data products.

Topics

Logistics

Grading

Final Project

Schedule

Week | Date | Event Type | Description | Course Materials
Week 1 | Monday, Jan 8 | Lecture 1 | Course Introduction: What/Why Data Science?; Data Science Lifecycle; Questions that data scientists can answer; Course Logistics | [slides]
 | Monday, Jan 15 | A1 Due | Assignment #1 Due: Web Scraping | [Assignment #1]
Week 2 | Monday, Jan 15 | Lecture 2 | Data Preparation: Data Collection; Data Cleaning; Data Integration | [slides]
 | Monday, Jan 22 | A2 Due | Assignment #2 Due: Entity Resolution | [Assignment #2]
Week 3 | Monday, Jan 22 | Lecture 3 | Visualization (I): Why Visualization?; Principles of Visualization Design; Visualization Toolkits | [slides]
 | Monday, Jan 29 | A3 Due | Assignment #3 Due: Visualization | [Assignment #3]
Week 4 | Monday, Jan 29 | Lecture 4 | Statistics (I): Statistical Thinking; Exploratory Data Analysis; Bootstrapping | [slides]
 | Monday, Feb 5 | A4 Due | Assignment #4 Due: EDA and Bootstrap | [Assignment #4]
Week 5 | Monday, Feb 5 | Lecture 5 | Deep Learning (I): Renaissance of neural networks; Background; Construction and training of layered learners; Frameworks for deep learning | [slides]
 | Monday, Feb 19 | A5 Due | Assignment #5 Due: Deep learning CNNs using PyTorch | [Assignment #5]
Week 6 | | Reading Break | |
Week 7 | Monday, Feb 19 | Lecture 6 | Practical Machine Learning: Feature Selection; Crowdsourcing; Active Learning; Spark MLlib and ML Pipeline | [slides]
 | Monday, Feb 26 | A6 Due | Assignment #6 Due: Practical Machine Learning | [Assignment #6]
Week 8 | Monday, Feb 26 | Lecture 7 | Anomaly Detection: What is anomaly detection?; Clustering and K-Means; Feature Scaling. Introduction to AWS: What is cloud computing?; SaaS/PaaS/IaaS; AWS and its success | [slides] [slides]
 | Monday, Mar 5 | A7 Due | Assignment #7 Due: Anomaly Detection and AWS | [Assignment #7]
Week 9 | Monday, Mar 5 | In-Class Presentation | Course Project Milestone Presentation | [Slides]
Week 10 | Monday, Mar 12 | Lecture 8 | Statistics (II): Correlation Analysis; Hypothesis Testing; A/B Testing | [slides]
 | | Guest Speaker: Reynold Xin | Big Trends in Big Data | [slides]
 | Monday, Mar 19 | A8 Due | Assignment #8 Due: Correlation Analysis and Hypothesis Testing | [Assignment #8]
Week 11 | Monday, Mar 19 | Lecture 9 | Further Topics in Deep Learning: Sequence learning, Sentiment, DL Visualization | [Slides]
 | Wednesday, Mar 28 | A9 Due | Assignment #9 Due: Word embeddings, Recurrent nets for NLP & Visualization to tune DL algorithms | [Assignment #9]

Project Showcase


Prioritizing Aid from Above [Code, Report, Poster]
Jillian Anderson, Brian Gerspacher, Brie Hoffman

Our project aims to use computer vision and machine learning to automatically assess the damage caused by cyclones in the South Pacific. By training a convolutional neural network to detect and count different kinds of trees present in aerial images, we seek to improve the ability of aid organizations to respond efficiently in the immediate aftermath of a natural disaster. Training data was provided as part of the challenge, and we used it to train an object detection system using the Darknet framework and the YOLOv2 CNN architecture. We trained and tuned our models using the GPUs on the computers in ASB10928 throughout March and early April 2018, evaluating our results using mean average precision (mAP). In the end, our best model, trained from scratch, achieved a mAP of 0.52. This model was integrated into a user-facing web application. These results were submitted to Patrick Meier on April 16 as part of WeRobotics' Open AI Challenge.


Vancouver Housing Market Decoder [Code, Report, Poster]
Yuyi Zhou, Junbo Bao, and Yabin Guo

Vancouver Housing Market Decoder (VHMD) is a tool with embedded high-quality machine learning models that predicts estimated listing prices for sellers, predicts estimated purchase prices for buyers, and shows everyone future market trends. By using VHMD, users can easily understand the Vancouver housing market and make informed decisions.


Socio-Political Analysis in Regions of World (SPAROW) [Code, Report, Poster]
Anindita Saha, Arul Bharathi, Namita Shah

The project SPAROW (Socio-Political Analysis in Regions of World) gives a quantitative analysis of how socio-political conditions impact the HDI (Human Development Index) of a country, especially when it faces conflicts. The project also covers the impact of news media on socio-political events and can help NGOs and resource planners predict the HDI of a country. The datasets used for this project were collected from various sources (V-DEM, ACLED, and UNDP) and were integrated to create a master dataset. News articles were collected from the New York Times API. The integrated data was used to perform Exploratory Data Analysis (EDA) to gather interesting and relevant insights. The predictive modelling went hand in hand with the EDA: the features that were significant in predicting the HDI also yielded interesting insights in the EDA. Finally, a web application was developed that consolidates all the analysis and can be useful for getting country-specific insights, as well as insights about the whole world, in relation to the socio-political events that occurred between 2005 and 2015.


Visualizing and Forecasting the Cryptocurrency Ecosystem [Code, Report, Poster]
Shawn Anderson, Ka Hang Jacky Lok, Vijayavimohitha Sridha

In this work, we collect cryptocurrency data from coinmarketcap.com, GitHub, Twitter, and Wikipedia. We use the data to produce visualizations of the cryptocurrency ecosystem and to perform price forecasting with deep learning.


Detecting Parkinson's Disease from Typing Behaviour [Code, Report, Poster]
Kyle Imrie, Jessica Moloney

This project predicts Parkinson's Disease using data collected from everyday typing activities. Multiple individual models were trained using bagging, in order to reduce variability, and tuned with cross-validation using the Scikit-learn library. Their results were fed into an ensemble model that aggregated the predictions. Our best model achieved an F-score of 0.83 and a recall of 1.00.
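
The two ingredients named above, bootstrap resampling and aggregation of per-model predictions, can be sketched in a few lines. This is an illustrative sketch only, not the project's code: the abstract does not say how predictions were aggregated, so majority voting is an assumption, and the rows and predictions below are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bagging resample)."""
    return rng.choices(rows, k=len(rows))

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

rng = random.Random(0)       # fixed seed so the resample is repeatable
rows = list(range(10))       # hypothetical stand-in for training rows
resample = bootstrap_sample(rows, rng)

# Hypothetical 0/1 predictions from three models on four samples.
model_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
ensemble = majority_vote(model_predictions)
print(ensemble)
```

Using an odd number of models avoids ties in a binary vote; each model would be trained on its own resample before its predictions are collected.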


Book Recommendation and Intelligence Engine (B.R.I.E.) [Code, Report, Poster]
Sethuraman Annnamalai, Lakshayy Dua, Supreet Kaur Takkar

A large portion of the reading community depends on word of mouth, bestseller lists, or e-commerce websites to find the next book to read. However, these can be biased and unsatisfactory, as they do not take a reader's personal genre preferences into consideration or are based only on finding similar books. There isn't any dedicated data science product available today that caters to the needs of everyone involved in the publishing industry. Book Recommendation and Intelligence Engine (B.R.I.E.) has been created as a full-fledged interactive application to address all these needs.


Topic Modeling and Sentiment Analysis on Canadian News Articles and Comments [Code, Report, Poster]
Chithra Bhat, Ruoting Liang, Tianpei Shen

Our project focused on topic modeling and sentiment analysis of Canadian news articles and comments. We used Latent Dirichlet Allocation and Non-negative Matrix Factorization to build topic models on news articles, to observe the 'Topic Trends' in Canada over the last 5 years and to discover the surrounding topics of the comments under a given article. Additionally, we cleaned up the comment environment by removing nonconstructive and toxic comments. We also performed positive/negative sentiment analysis of comments to get an overview of public opinion. Finally, we built a web application with interactive graphs to visualize all the results.


Topic, Entity and Sentiment Discerning System [Code, Report, Poster]
Andy Chen, Pushkar Sinha, Maria Babaeva

This project performs text analysis using NLP techniques (NER and sentiment analysis) and topic modeling, visualizes the results of these approaches, and builds a base of intermediate data and code for further, more complex visualizations.


Detecting Misstatements in Financial Statements [Code, Report, Poster]
Vincent Chiu, Vishal Shukla, Kanika Sanduja

We created a program using the random forest model that is able to detect misstatements with 82.7% accuracy. Other metrics include 83.6% misstatement precision and 81.7% misstatement recall. This model can be beneficial to financial institutions for three main reasons: First, it is easy to use for personnel without programming backgrounds, such as auditors or investors, making the model highly accessible. Second, it utilizes a wide range of data sources (three diverse datasets) which provides a balanced view of the financial status of an organization, making the model robust. Third, we developed the model on Spark, which can scale up to even larger datasets having many features and records, making the model scalable.
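
Accuracy, precision, and recall, the three metrics quoted above, all derive from the same confusion-matrix counts. A minimal sketch, assuming a binary labelling in which the positive class means "misstated"; the counts are hypothetical and are not the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for illustration only.
accuracy, precision, recall = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Precision answers "of the statements flagged as misstated, how many really were?", while recall answers "of the truly misstated statements, how many were flagged?".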


Hawk: Object Detection in Aerial Imagery [Code, Report, Poster]
Mayank Vachher, Anna Mkrtchyan

Disasters in the South Pacific are an unfortunate reality, and their consequences can be devastating for the local population. WeRobotics, together with OpenAerialMap and the World Bank, attempts to significantly accelerate the analysis of aerial imagery before and after major humanitarian disasters. Their "Open AI Challenge: Aerial Imagery of South Pacific Islands" aims to develop machine learning classifiers for this task. We propose a classifier based on a pre-trained Faster R-CNN deep neural network, a state-of-the-art network for object detection. Our data science pipeline converts the dataset provided in the challenge (a single high-resolution aerial image covering roughly 50 square kilometres, along with the geometric locations and classes of the objects of interest) into a suitable training dataset for the proposed classifier. After training, our classifier is able to detect coconut trees with > 91% precision and > 97% recall.


Fall Detection using Wearable Sensor Data [Code, Report, Poster]
Gustavo Felhberg, Jorge Marcano, Muhammad R Myhaimin

This project detects falls based on data obtained from sensors on the waist, thighs, ankles, sternum, and head of subjects. The main objectives are: (1) performing data analysis and visualization to find interesting insights in the data; (2) creating machine learning models to detect falls; (3) using these models to detect falls in real time.


BOOMERANG: Greater Vancouver House Price Analysis [Code, Report, Poster]
Hyelim Moon, Joanne Yoon

Boomerang is an all-in-one Greater Vancouver property value assessment program that sifts through past, present, and future to deliver answers to your fingertips like a boomerang. By web scraping, we collected realistic, up-to-date data. By referring to municipal open data, we obtained historical house values. We then merged data from multiple sources and used machine learning, statistics, and analytics skills to assess the value of each house and area. Since Surrey and Vancouver have many schools, we compared houses' relationships with nearby schools and statistically analyzed the correlation between these features and property prices. We displayed our findings on the web using Google Cloud Services. The site includes a prediction tool that estimates a property's future price, along with analysis at the postal-area and feature level.


RightFluencer [Code, Report, Poster]
Arin Ghosh, Karanjit Singh Tiwana, Manoj Karthick Selva Kumar

RightFluencer is a web application and dashboard that allows you to find the right social media influencer for your product and category by analyzing their posts, images, and videos. The dashboard analyzes the social media profiles of many influencers and finds the best influencer for your product based not just on their metrics but also on their niche and expertise. RightFluencer provides a search engine where a brand marketer can enter a product and category to find the right influencer for that product, based on an analysis of their Instagram, Facebook, Twitter, and YouTube profiles. Marketers can then get more detailed information about the influencer and visualize the related metrics and insights. RightFluencer also allows influencers to gain deeper insights into their online presence and understand their strongholds and weaknesses.


Micro-Ventures -- Predicting the Success of Potential Startups for Micro-Investments [Code, Report, Poster]
Immad Imtiaz, Ravi Bisla, Shariful Islam

In this project, we use publicly available data to predict the success of a startup in its early stages, to help micro-investors make more informed investment decisions. We use logistic regression to classify the companies that go beyond Series C. We also perform topic modeling on articles from TechCrunch. We observe that, for most categories, our model achieves a true positive rate of 60% to 80%, while the false positive rate remains as low as 1% in most cases. Using the topics found in the pool of TechCrunch articles as additional features improved the true positive rate for companies in the technology category.


Topic Modelling based Recommender System (for Zomato) [Code, Report, Poster]
Siddharth Kanojiya, Keerthana Jayaprakash, Sneha Bezawada

We relied on the intuition that populating the sparse user-item rating matrix with latent features from reviews, rather than relying only on explicit ratings, can improve recommendations. We performed EDA on user reviews to analyze users' preferences in terms of food, service, take-away, delivery, and other factors, and studied the restaurant data to find interesting insights about cuisines, cost, location, etc. After studying the relationships between the facts derived from this step, we performed LDA topic modelling to assist collaborative filtering between users: simply put, if two people talk about the same topics, they could be similar. As the final step, we wanted to deploy this model so that it could be easily integrated into an existing restaurant portal without hampering its current user experience, specifically the response time. We therefore used Spark and Celery to run jobs in the background and in parallel. Furthermore, visualizing the datasets from restaurants, reviews, and food inspections gave us interesting insights about the food and restaurants people like in the Greater Vancouver area, which can be viewed at http://35.227.63.2:5005/.
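
The idea that two users who talk about the same topics may be similar can be illustrated with cosine similarity over per-user topic vectors. This is a sketch under stated assumptions, not the project's implementation: the user names and topic weights are hypothetical stand-ins for what an LDA model might produce, and the abstract does not say which similarity measure was used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length topic-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical topic weights per user: [food, service, delivery]
user_topics = {
    "alice": [0.7, 0.2, 0.1],
    "bob": [0.6, 0.3, 0.1],
    "carol": [0.1, 0.1, 0.8],
}

def most_similar(user):
    """Return the other user whose topic vector is closest to `user`'s."""
    others = (name for name in user_topics if name != user)
    return max(others, key=lambda name: cosine_similarity(
        user_topics[user], user_topics[name]))

print(most_similar("alice"))
```

In a collaborative filtering setting, ratings from the most similar users could then be used to fill in missing entries for a given user.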


Detecting Misstated Financial Statements with Deep Learning and Interactive Dashboard [Code, Report, Poster]
Katrina Ni, Leiling Tao

In this project, our goal is to automate the process of pre-screening potentially misstated financial statements. We constructed a complete data pipeline to process and clean financial statement data, engineered relevant features for two neural network models (autoencoder and LSTM), and visualized model output interactively. Based on half a million financial statements from 1980-2018, our model reached a recall score of 0.7 for misstated statements, a 40% improvement over a random forest classifier at the same precision score. Written in Plotly Dash, our final product is a web UI with an interactive dashboard incorporating yearly trends of each accounting term and the model output. It is designed for domain experts to understand the results of our model and visually explore potential features of interest.


Identification of Toxic Comments in On-line Platforms [Code, Report, Poster]
Mehvish Saleem, Ramanpreet Singh, and Ehsan Montazeri

In this project, we used NLP and supervised machine learning techniques to build a model for detecting toxic comments. We trained our models on datasets from Wikipedia and SOCC and explored TF-IDF, Doc2Vec, and word embeddings to featurize them. We tried several machine learning models and found GRU RNNs to perform the best on the validation set. We applied the model to data we scraped from multiple sports, news, and entertainment Facebook pages. Among the comments classified as toxic, we identified those that contain racism, sexism, and homophobia. News pages were found to be the most toxic, whereas sports and entertainment were similar. The most prevalent type of toxicity in news and sports was racism, and in entertainment, sexism. With the help of statistical hypothesis tests, this analysis can safely be extended to the whole Facebook data.
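
Of the featurization methods listed above, TF-IDF is the simplest to show in full. A minimal sketch, assuming whitespace tokenization and the unsmoothed weighting idf = log(N / document frequency); libraries such as Scikit-learn use smoothed variants, and the example comments are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Term frequency is the count divided by document length;
    idf is log(N / number of documents containing the term).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical comments for illustration only.
comments = ["you are great", "you are awful", "great game tonight"]
vectors = tf_idf(comments)
print(vectors[1])
```

Words that appear in few comments (here "awful") receive a higher weight than words shared across many comments (here "you"), which is what makes the representation useful as classifier input.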


Fall Detection Using Wearable Sensor Data [Code, Report, Poster]
Inderpreet Singh, Amandeep Singh Kap

This project aims to detect falls in real time so that, in the event of a fall, caretakers can be informed and the impact of the fall on older adults can be minimized. Data collected from trials conducted on 10 different subjects in the lab was used to train various machine learning classifiers. The project covers the Exploratory Data Analysis of the collected data to find the optimal time window to feed into the classifiers for the best classification results. We also explored the use of these classifiers on streaming data, which has more practical significance. As future work, we have considered applying Active Learning to improve the model in real time.
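
Feeding a time window into a classifier usually means cutting the sensor stream into fixed-length, possibly overlapping segments and summarising each one. A minimal sketch of that step, assuming a single stream of acceleration magnitudes; the window size, step, features, and signal values are hypothetical choices, not the ones found by the project.

```python
def sliding_windows(samples, window_size, step):
    """Yield consecutive fixed-length windows over a sequence of readings."""
    for start in range(0, len(samples) - window_size + 1, step):
        yield samples[start:start + window_size]

def window_features(window):
    """Summarise one window as (mean, peak-to-peak range)."""
    mean = sum(window) / len(window)
    return mean, max(window) - min(window)

# Hypothetical acceleration magnitudes (in g); the spike mimics an impact.
signal = [1.0, 1.0, 1.1, 0.9, 3.2, 0.4, 1.0, 1.0]
features = [window_features(w)
            for w in sliding_windows(signal, window_size=4, step=2)]
print(features)
```

The same generator works on streaming data if readings are buffered until a full window is available; searching for the optimal window then amounts to repeating training and evaluation for several values of `window_size`.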


  © Jiannan Wang 2018