DPsort

A distributed, parallel mergesort system

(The project was conducted as a educational-purpose course project : POSTECH CS434 - Advanced Programming)

Goal

In this project, the ultimate goal is set to succeed in distributed / parallel sorting of huge K/V records which are stored in multiple machines. For achieving the project, MapReduce approach for distributed sorting is recommended.

Another goal is to learn advanced programming techniques such as following, which are most crucial programming concepts which developers must keep in mind.

  • Documentation techniques
  • Testing
  • Functional programming + OOP (for this purpose, Scala is used)
  • Design patterns
  • Concurrency control & Parallelism
  • Writing clean codes

Progress

The project was conducted during one semester, including 2 milestones and 1 final demonstration phase. For the first milestone, I presented initial system design and plans for the remaining period. For the second milestone, I presented the detailed architecture, core logics and algorithms.

Results

I successfully finished the whole process of the project (design, implementation, testing and demo) in a semester. The system was consisted of 4,000 lines of scala codes across 50 code files. Several tests were done by myself and the system showed enough robustness and scalability.

My project was the only one in the class that could succeed to sort large files exceeding 10GB in a distributed environment. Due to the lack of testing environment, I used a 12-threaded single computer with 3 workers running, as a test environment. The system succeeded to sort 36GB of unordered files (0.36 billion records) without any OOM error.

- A complimentary message from the course instructor

System at a Glance

Some of the concepts were borrowed from the design principles of Apache Spark. Still, pretty much of the logics and designs are designed by my own rationale.

Personal Growth

This is the first trial for myself to fully implement a working data system from scratch. The experience was memorable in the sense that I had to tackle all system design-specific issues by myself without any external help, which was distingushable from other typical course projects. I tried as much as possible to follow all rules / concepts / paradigms I’ve learned from the class. Through this trial, I could reach to the true understanding of the intentions and principles of systems design concepts.

I took this class along with OS and DBMS class in a semester, thus took 3 systems class at once. Beforehand, I thought of majoring deep learning or NLP systems. After taking these courses I changed my mind to major in big data systems after I found out that systems programming is fun and fits me well with what I do best.

Resources

  • Project repository link
  • Youtube demo video link
  • Milestone #1 report link
  • Milestone #2 report link
  • Final report link