This repository contains the project for my thesis as well as the thesis itself. Information about this project beyond this README, for example about the project structure, can be found in the thesis itself. Technical documentation is done in-code.
The following steps need to be taken to launch the dashboard.
The following dependencies need to be installed and, if applicable, added to PATH or otherwise set up as described in their respective documentation.
- Python 3.6
- Python dependencies listed in `requirements.txt`
- Bower
- Bower dependencies listed in `bower.json`
- Docker 1.13.1
- Apache Spark 2.2.0
- Apache Kafka (go through the integration guide here)
It is strongly advised to use a `virtualenv` for the Python dependencies.
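For example, the environment could be set up along these lines (a sketch; it assumes `python3` and `bower` are on PATH and that the commands are run from the repository root):

```shell
python3 -m venv venv              # or: virtualenv venv
. venv/bin/activate
pip install -r requirements.txt   # Python dependencies
bower install                     # front-end dependencies from bower.json
```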
To enable the dashboard to connect to Twitter, create an app in Twitter Application Management, download the access information, and place it in the root directory as `twitter.access.json`.
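The exact structure of `twitter.access.json` is determined by what the dashboard code reads; as a hypothetical sketch, a file holding the four standard Twitter API credentials might look like this (all field names and values are placeholders, not confirmed by this README):

```json
{
  "consumer_key": "YOUR_CONSUMER_KEY",
  "consumer_secret": "YOUR_CONSUMER_SECRET",
  "access_token": "YOUR_ACCESS_TOKEN",
  "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}
```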
- Make sure Docker is running, then start Kafka by running `docker-compose up`.
- Set the environment variables:
  - `SPARK_HOME="/path/to/spark/"`
  - `PYSPARK_PYTHON=python3`
  - `PYSPARK_SUBMIT_ARGS="--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell"`
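In a POSIX shell, these variables can be exported like so (adjust `SPARK_HOME` to wherever Spark 2.2.0 is installed; the path shown is a placeholder):

```shell
export SPARK_HOME="/path/to/spark/"   # placeholder -- point this at your Spark install
export PYSPARK_PYTHON=python3
export PYSPARK_SUBMIT_ARGS="--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell"
```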
- Run `python3 run.py` from `src/visualization/dashboard`.
If you see `NoBrokersAvailable`, PySpark cannot reach the Kafka message broker. Make sure `docker-compose up` ran without errors.
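One quick way to check whether the broker is listening is to probe its TCP port. The sketch below assumes the broker is exposed on `localhost:9092` (a common default; this README does not specify the port, so adjust it to match `docker-compose.yml`):

```shell
# Probe the Kafka broker port; host and port are assumptions, not confirmed by this README.
KAFKA_HOST=localhost
KAFKA_PORT=9092
if (exec 3<>"/dev/tcp/$KAFKA_HOST/$KAFKA_PORT") 2>/dev/null; then
    STATUS=up
else
    STATUS=down
fi
echo "Kafka broker at $KAFKA_HOST:$KAFKA_PORT is $STATUS"
```

Note that `/dev/tcp` is a bash feature; with other shells, use a tool such as `nc` instead.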
Set `PYSPARK_DRIVER_PYTHON=jupyter` and `PYSPARK_DRIVER_PYTHON_OPTS="notebook"` to run `pyspark` in notebook mode. Spark can now be used in Jupyter notebooks.
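The notebook configuration above amounts to the following (a sketch; it assumes Jupyter is installed in the same environment as PySpark):

```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
# then launch `pyspark`, which opens a Jupyter notebook server instead of a REPL
```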
Add `--packages org.mongodb.spark:mongo-spark-connector_2.10:1.1.0` to `PYSPARK_SUBMIT_ARGS` to use Spark with MongoDB instead of a Kafka message queue. This is useful for development and debugging, since it does not require connecting to the actual Twitter stream.
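With both connectors, the variable could be set like this (Spark accepts a comma-separated list of Maven coordinates under a single `--packages` flag; versions as listed above):

```shell
export PYSPARK_SUBMIT_ARGS="--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2,org.mongodb.spark:mongo-spark-connector_2.10:1.1.0 pyspark-shell"
```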
All models are trained in notebooks under `/notebooks`.