CMPT 733: Big Data Science (Spring 2019)

Project Showcase

Predicting Stable Portfolios Using Machine Learning [Code, Report, Poster]
Kiran, Nandita Dwivedi and Muhammad Rafay Aleem

We aim to make portfolio management better and simpler by using predictive modeling and deep learning techniques. We generate stable portfolios based on predicted stock prices for the next quarter.

Measuring observable influence and impact of scientific research beyond academia. [Code, Report, Poster]
Chhavi Verma, Shray Khanna, Honghui Wang

In this project, we measured the real-world impact of Genome BC’s academic publications by tracing references to them in downstream documents. We present the results as python-igraph network graphs and Plotly bar charts, which show the connections between Genome BC publications and the downstream documents. The more impactful a publication is, the more connections it has, and the greater the depth those connections reach, the more influential the publication is. We also identified the top influencers of Genome BC publications at different depths.
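The idea of counting connections at each depth can be sketched as a breadth-first search over a reference graph. This is only an illustration under stated assumptions: the `cited_by` mapping, the document ids, and the `downstream_by_depth` helper are hypothetical, and the project itself reports using python-igraph rather than this plain-Python traversal.

```python
from collections import deque

def downstream_by_depth(cited_by, source, max_depth=3):
    """Count documents reachable from `source` at each depth via BFS.

    `cited_by` maps a document id to the ids of documents that reference it.
    """
    counts = {}
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        doc, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in cited_by.get(doc, ()):
            if nxt not in seen:
                seen.add(nxt)
                counts[depth + 1] = counts.get(depth + 1, 0) + 1
                frontier.append((nxt, depth + 1))
    return counts

# Hypothetical toy graph: "pub" is referenced by two documents,
# one of which is itself referenced by a third.
cited_by = {
    "pub": ["patent_a", "policy_b"],
    "patent_a": ["news_c"],
}
reach = downstream_by_depth(cited_by, "pub")
```

Here `reach` maps each depth to the number of downstream documents first reached at that depth, which is one simple way to compare publications by breadth and depth of influence.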

Real-time Cryptocurrency Prediction and Analysis Platform [Code, Report, Poster]
Chengxi Li, Haopeng Wang, Michael Yang, Hao Zheng

In this project, we build a one-stop web application for cryptocurrency enthusiasts. By fetching data through APIs, our web application provides continuously updated, comprehensive information about cryptocurrencies. By integrating cryptocurrency price data with news sentiment analysis and social status information, we build a deep learning model that predicts coin prices or binary returns to help investors with their decisions. By further integrating the model with our real-time pipeline, our web application provides a dynamic prediction curve every hour, minute, or even second.

Developing an NLP based PR platform for the Canadian Elections [Code, Report, Poster]
Abhishek Sunnak, Sri Gayatri Rachakonda, Oluwaseyi Talabi

In this project, we developed an NLP-based application that analyzes the sentiment and bias of news articles and tweets related to the 2019 Canadian elections to understand public opinion of each candidate. We also analyzed the approval ratings of the top three candidates across different provinces. We used the latest NLP techniques to train deep learning models for sentiment and bias analysis to classify news and tweets about the election. Using these results, we developed an interactive dashboard that gives a PR manager a visual platform for gaining insights into the public’s perception and the media coverage of a candidate. This project can be extended to serve any public relations team and its candidates.

Music Analysis & Recommendation System (M.A.R.S) [Code, Report, Poster]
Kashish Kohli, Kanksha Masrani

We aim to draw inferences from our data that help companies and startups fare better in the music market. To do this, we extract insights from songs released over the last five decades. These insights are used to build a popularity predictor module, which predicts whether a song will be popular from its metadata alone, with an accuracy of 70%. Second, we have created a sentiment analyzer that reports the sentiment of the music listened to in major countries. This can help artists shape their music to resonate with a wider audience. Third, we have created a recommendation engine that suggests songs suited to a user’s musical taste. It also suggests the top songs likely to be on the charts at that time, similar to the trending lists on Netflix and Hulu.

CRYPTOIntel - Digging Deep Into The Crypto World [Code, Report, Poster]
Tushar Chand Kapoor, Syed Ikram, Mehak Parashar

CRYPTOIntel is a one-stop dashboard that brings together information about cryptocurrencies. Curious users can find answers to their cryptocurrency questions in CRYPTOIntel.

Distributed News Monitor System [Code, Report, Poster]
Dao Xiang, Yi Xiao, Hang Hu, Shi Heng Zhang

Now that social media has become the most cost-efficient way for people to communicate, it is intriguing to analyze people’s reactions to a popular news post while filtering out false information online. We therefore designed the Distributed News Monitor System, which focuses on news content to alert the public about fake news and analyzes public opinion from Twitter comments on the news. A deployed deep learning model assesses the integrity of a news item from its content. Big data streaming analysis extends the system to real-time news monitoring, guiding people to think more deeply about the news and make their own judgement. The system combines advanced modelling, real-time analytics, and scalability.

Herald: Know the Stock Movement Before It Happens [Code, Report, Poster]
Andong Ma, Angel Zhang, Changsheng Yan, and Denise Chen

In this project, we build a data science pipeline for stock movement prediction and a real-time prediction web platform. Specifically, we perform topic modeling on news articles to discover the general topics discussed in the news, and we use t-SNE to visualize the frequent words in positive and negative news and observe how similar these words are. We also apply NLP methods to generate word embeddings and build deep learning models on the crawled news and stock data to predict future stock prices and price trends. Our CNN+RNN model performs best, reaching 58% accuracy and outperforming the baseline models (52% to 57% accuracy). One of our most important final products is a web application with two main goals: acquiring the latest news, Twitter, and stock data from different sources, and processing that data in real time to predict future stock prices. The web application is designed for investors, who can get insightful news and tweets associated with each stock ticker and use the predicted stock price as a reference when making investment decisions.

End-to-End Solution For IoT devices Predictive Maintenance and Management [Code, Report, Poster]
Chuangxin Xia, Risheng Wang, Yifan Li

In this project, we aim to deliver an end-to-end solution for IoT device predictive maintenance and management. We performed ETL and EDA using PySpark, and incorporated feature selection and anomaly detection on top of a prediction model built as an LSTM recurrent neural network. These layers help keep our predictions precise. Our dashboard console communicates with a live Node.js server and a live model served on Google Cloud Machine Learning Engine, while providing an interactive user experience and easy-to-interpret data visualizations. The whole pipeline was built with flexibility and scalability in mind.

TradeSpade - Price Signal Forecast for Financial Assets [Code, Report, Poster]
Anurag Bejju, Rishabh Singh, Nikitha Ravi, Manan Parasher

TradeSpade is a one-stop solution that assists day traders with intraday trading by predicting Buy and Sell signals, helping them maximize profits and make better decisions. It is aimed at both traditional and exploratory stock and cryptocurrency traders and provides a robust web application that helps them make data-driven decisions. It actively supports novice traders by providing intuitive financial predictions based on historical and contextual information collected over the past year. As part of this project, we have also shown the influence of social media and everyday news on market fluctuations.

Duplicate Questions across multiple Question-Answering Forums [Code, Report, Poster]
Neda Zolaktaf and Vaishnavi Malhotra

In this project, we worked on six individual datasets (Quora, Apple, Android, Sprint, Superuser, and AskUbuntu) and one integrated dataset (Quora and AskUbuntu) to tackle the problem of duplicate question detection across multiple question-answering platforms.

FootWizard - Predict The Unpredictable [Code, Report, Poster]
Sagar Parikh, Chirayu Shah, Abhi Savaliya

We aimed to predict the outcomes of EPL matches from previous records, using winning streaks, head-to-head results, and overall ratings. We implemented these models using machine learning techniques and found that SVM achieved the best accuracy, 61%, outperforming the other four techniques. So far, we have predicted football match outcomes. Betting is the next challenge, as it involves predicting matches with higher accuracy and predicting dynamic odds in real time. We plan to recommend the best platform for maximizing betting profits.

Skills Job Advisor [Code, Report, Poster]
Bhuvan Chopra, Btara Truhandarien, Grace Kingsly, Mohammad Ullah

In this project, we demonstrate the techniques we used to tackle the challenging problem of advising people on which skills to cultivate to become more suitable for a particular job. Our data reflects real-world scenarios and people: it comes from resumes gathered by searching for people who have worked in one of 15 different jobs. With information retrieval techniques such as TF-IDF, we are able to build a corpus of relevant skills for a particular job. Attempts at job normalization, required as part of our initial modeling plan, yielded unsatisfactory results, ultimately leading us to another kind of model. The final model that we built is a K-Means clustering model, with the supplied data points being document and word embeddings of job titles and skills. This final model, though rudimentary, allows us to give basic advice on the skills that need to be cultivated by comparing a given skill set with the skills within a particular job cluster.
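As a rough illustration of how TF-IDF can surface skills characteristic of a job, the sketch below pools each job's resumes into one document and ranks terms by TF-IDF across jobs. The resume snippets, job names, whitespace tokenization, and the `tfidf_top_terms` helper are all assumptions made for this example, not the project's actual data or code.

```python
import math
from collections import Counter

def tfidf_top_terms(docs_by_job, top_n=3):
    """Return up to `top_n` highest-scoring TF-IDF terms per job.

    Each job's resumes are pooled into one document; IDF is computed across jobs.
    """
    term_counts = {
        job: Counter(" ".join(texts).lower().split())
        for job, texts in docs_by_job.items()
    }
    n_jobs = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # one count per job containing the term

    result = {}
    for job, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_jobs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[job] = [term for term, score in ranked[:top_n] if score > 0]
    return result

# Hypothetical resume snippets grouped by job title.
docs_by_job = {
    "data scientist": ["python sql statistics", "python machine learning sql"],
    "web developer": ["javascript css sql", "javascript html"],
}
top_skills = tfidf_top_terms(docs_by_job)
```

A term such as `sql` that appears under every job gets an IDF of zero and drops out, while terms concentrated in one job rise to the top.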

Metro Vancouver Housing Market Analysis and Prediction [Code, Report, Poster]
Krishna Chaitanya Gopaluni, Nitin Misra, Harish Bhargav Dasika, Manjur Rahaman

Vancouver’s housing market always seems to be in a bubble. Potential buyers take the current upward price trend for granted when they invest. But prices may fall suddenly, and it can take a very long time to see a return on investment. With this in mind, we set the following goals: (1) identifying bubble-prone areas in Metro Vancouver; (2) predicting housing prices based on current trends; (3) predicting future trends in HPI benchmark prices; (4) predicting a property’s price range from an image of the property. Ultimately, the project can warn potential buyers about bubble-prone areas and help them make informed decisions based on predicted future trends.

Real-time Cryptocurrency Analysis (financial-analysis) [Code, Report, Poster]
Fatemeh Renani, Jaskaran Kaur Cheema, Mohammad Mazraeh

Stock price forecasting is a popular and important topic in financial and academic studies, and the cryptocurrency market is no exception. In this project, we have created a general, scalable platform for real-time cryptocurrency price prediction. The platform receives news and price history as input and performs feature extraction, feature aggregation, and price movement prediction. Finally, the platform outputs the predicted Bitcoin price movement for the next minute. At each stage in the pipeline, data is read from one Kafka topic and the new data is written to another Kafka topic. Hence, other cryptocurrencies can be easily integrated using this architecture to produce real-time, robust, and accurate cryptocurrency price predictions.
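The stage-by-stage design, in which each stage reads from one queue and writes to the next, can be sketched as follows. To keep the example self-contained, in-memory deques stand in for Kafka, and the topic names, message fields, and the trivial prediction rule are all hypothetical; none of this is the project's actual code or model.

```python
from collections import deque

# In-memory queues standing in for Kafka topics (an assumption for this sketch;
# a real deployment would use a Kafka client library instead).
topics = {"raw": deque(), "features": deque(), "predictions": deque()}

def run_stage(source, sink, transform):
    """Drain every message from `source`, transform it, and append it to `sink`."""
    while topics[source]:
        topics[sink].append(transform(topics[source].popleft()))

def extract_features(msg):
    # Hypothetical feature: price change relative to the previous price.
    return {"coin": msg["coin"], "delta": msg["price"] - msg["prev_price"]}

def predict_movement(features):
    # Placeholder rule, not the project's model: follow the last price change.
    movement = "up" if features["delta"] > 0 else "down"
    return {"coin": features["coin"], "movement": movement}

topics["raw"].append({"coin": "BTC", "price": 5200.0, "prev_price": 5150.0})
topics["raw"].append({"coin": "ETH", "price": 160.0, "prev_price": 165.0})

run_stage("raw", "features", extract_features)
run_stage("features", "predictions", predict_movement)
predictions = list(topics["predictions"])
```

Because every stage only knows its input and output queue, a message for another coin flows through the same stages without any change to the pipeline.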

Internet Media Influence [Code, Report, Poster]
Aroun Amitabh Dalawat, Aisuluu Alymbekova, Shreejata Bhattacharjee

Internet media platforms have evolved from low-quality entertainment content into global media and tech companies whose articles go viral and strongly influence people’s opinions around the world. Every company needs efficient marketing to thrive, grow, and communicate effectively with its potential customers. With the rapid growth of the e-commerce segment, the influence of internet media platforms can be leveraged as a strong marketing tool to promote goods. Hence, platforms such as BuzzFeed and BestProducts.com can be used for digital advertising in e-commerce. These are the websites people turn to for information, opinions, and even suggestions about the products they want to buy or should buy. The project focuses on two things in particular. The first is identifying and evaluating the impact of internet media platforms on e-commerce. The second is developing a tool that automates the creation of articles for internet media platforms. For example, from the point of view of a BuzzFeed employee, the tool reduces the time and labor spent manually searching for products to feature in articles and writing a description for each of them. The practical application of this project is therefore in digital marketing.

Intelligent Travel Recommendation System [Code, Report, Poster]
Savitaa Venkateswaran, Subikshaa Senthilkumar, Sachin Prabhu Thandapani

Our project provides a tailor-made travel plan for users based on their travel details, such as destination, budget, and start and end dates, and on their preferences for attraction categories, hotel amenities, and cuisine type. It significantly reduces the time spent planning a satisfactory vacation.

News Sentiment Tracker: A Targeted Opinion Mining Interface [Code, Report, Poster]
Andrew Wesson, Prashanth Rao

In this project, we developed an end-to-end NLP-based application that automatically detects fine-grained sentiment towards a specific target query (such as a person, event, product or organization) in news articles. We applied novel combinations of techniques from big data, NLP and time series visualization to provide the end user targeted insights into press coverage on a specific entity. Our system was shown to identify large-scale shifts in sentiment in news coverage towards a target reliably, and we foresee numerous commercial applications that could benefit from this approach and help guide the relevant personnel in making data-driven decisions.

VANREAPER - Vancouver Real Estate Analysis and Predictions [Code, Report, Poster]
Chirag Ahuja, Ekramul Hoque, Pavan Kosaraju, Rohith Sooram

VANREAPER is an online tool that aims to improve how people in Vancouver buy and sell homes, giving them the information they need to make a decision before making a purchase. In this project, we scraped data from multiple sources, such as property tax data from the BC Assessment Authority, property listings data from REW, school ratings data from the Fraser Institute, and historical interest rates from the Bank of Canada. After collecting the data, we applied various time series models (ARIMA, LSTM), regression models (GBTR, Linear), and recommendation models (KNN). Finally, all the models were serialized, persisted, and deployed in an interactive Flask web application.

Predicting Stock Prices using Social Media [Code, Report, Poster]
Mihir Gajjar, Gaurav Prachchhak, Tommy Betz, Veekesh Dhununjoy

We predict the future closing stock price using historical stock data in combination with the sentiment of news articles and Twitter data. We collected the historical stock price, Twitter, and news data by web scraping and from various data sources. In the preprocessing stage, we filter out unwanted records and carry out aggregations to extract useful features. Sentiment analysis is performed using TextBlob on the news and Twitter data to generate weighted average sentiments. We generate new features from the stock data using various statistical techniques, and combine all the features using the ‘Date’ field. The usefulness of the features is validated by correlation analysis. After feature engineering, we provide these features as input to our LSTM model and predict the future closing stock prices. The best results were obtained by using the stock data along with the news data. When Twitter sentiment was included, however, the error was higher, which suggests that most of the tweets were not directly related to Apple’s success and can interfere with predictions.
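The feature-building steps described here (weighted average sentiment, joining on the date, and a correlation check) can be sketched as follows. This is a minimal illustration under stated assumptions: the polarity scores are taken as already computed (TextBlob reports polarity in the range -1 to 1), and the weights, dates, prices, and helper names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def daily_weighted_sentiment(records):
    """Average polarity per date, weighted by a per-record weight."""
    num, den = defaultdict(float), defaultdict(float)
    for date, polarity, weight in records:
        num[date] += polarity * weight
        den[date] += weight
    return {d: num[d] / den[d] for d in num if den[d] > 0}

def join_on_date(closes, sentiment):
    """Inner join of closing prices and sentiment on the shared date key."""
    return [(d, closes[d], sentiment[d]) for d in sorted(closes) if d in sentiment]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical records: (date, polarity in [-1, 1], weight such as retweet count).
records = [
    ("2019-03-01", 0.5, 1), ("2019-03-01", -0.1, 3),
    ("2019-03-04", 0.2, 2),
    ("2019-03-05", 0.8, 1), ("2019-03-05", 0.4, 1),
]
# Hypothetical closing prices keyed by date.
closes = {"2019-03-01": 100.0, "2019-03-04": 102.0, "2019-03-05": 105.0}

sentiment = daily_weighted_sentiment(records)
rows = join_on_date(closes, sentiment)
r = pearson([close for _, close, _ in rows], [s for _, _, s in rows])
```

A correlation near zero between a candidate feature and the closing price would be one signal that the feature adds little, which is consistent with the kind of validation the abstract describes.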

Assessment and Visual Analysis of Trends using Article Reviews [Code, Report, Poster]
Jamshed Khan, Padmanabhan Rajendrakumar, Jaideep Misra

In the age of Big Data, an estimated 2.5 quintillion bytes of data are generated every day, and a huge amount of this is textual. With scores of documents available on the web and more pouring in day after day, how can one get a general summary without actually diving in and reading every word? Searching for insights in such an enormous amount of information can become very tedious and time-consuming.