Map showing the cpython repository, highlighting the files that Guido van Rossum changed the most
Basic use guide
- Generate database with `./generate-db.sh {path_to_repo_dir}`
- Run web server with `python flask_app.py` (flask must be installed; it can be installed with pip)
- Connect on `127.0.0.1:5000`
- Available repos will be displayed, select the one you want to view
- Select email to search for in the form at the bottom and click submit
This project consists of two parts:
- Git log -> database
- Database -> treemap
Git log -> database
Scans through an entire git history using `git log`, and creates a database using five tables:
- Files, which just keeps track of filenames
- Commits, which stores commit hash, author, committer
- CommitFile, which stores an instance of a certain file being changed by a certain commit, and tracks how many lines were added/removed by that commit
- Author, which stores an author name and email
- CommitAuthor, which links commits and Author in order to support coauthors on commits
Using these we can keep track of which files/commits changed the repository the most, which in itself can provide useful insight
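The five tables above could be laid out roughly as follows. This is a minimal sketch, assuming SQLite as the database engine; the column names and the `make_db` and `most_changed_files` helpers are hypothetical, since the README only names the tables and what each one holds.

```python
import sqlite3

# Hypothetical column names: only the table names and their roles come from the README.
SCHEMA = """
CREATE TABLE files (
    file_id INTEGER PRIMARY KEY,
    path    TEXT UNIQUE NOT NULL
);
CREATE TABLE commits (
    hash      TEXT PRIMARY KEY,
    author    TEXT NOT NULL,
    committer TEXT NOT NULL
);
CREATE TABLE commit_file (
    hash          TEXT    NOT NULL REFERENCES commits(hash),
    file_id       INTEGER NOT NULL REFERENCES files(file_id),
    lines_added   INTEGER NOT NULL,
    lines_removed INTEGER NOT NULL,
    PRIMARY KEY (hash, file_id)
);
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    email     TEXT NOT NULL
);
CREATE TABLE commit_author (
    hash      TEXT    NOT NULL REFERENCES commits(hash),
    author_id INTEGER NOT NULL REFERENCES author(author_id),
    PRIMARY KEY (hash, author_id)
);
"""


def make_db(path=":memory:"):
    """Create an empty database with the five tables (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def most_changed_files(conn, limit=10):
    """Return (path, lines added + removed) pairs, largest first."""
    return conn.execute(
        """
        SELECT f.path, SUM(cf.lines_added + cf.lines_removed) AS churn
        FROM commit_file AS cf
        JOIN files AS f ON f.file_id = cf.file_id
        GROUP BY f.file_id
        ORDER BY churn DESC, f.path
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```

Keeping authors in their own table and linking them through `commit_author` is what lets one commit carry several coauthors without duplicating the commit row.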
Database -> treemap
Taking the database above, uses an SQL query to generate a JSON object with the following structure:
"val":
"children": [
    file:
        "name":
        "val":
    directory:
        "name":
        "val":
        "children": [, ...]
    file:
        "name":
        "val":
]
It then uses this object to generate an inline SVG image representing a treemap of the file system, with the size of each rectangle being the `val` described above.
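The nested name/val/children shape can be built from flat (path, val) rows roughly like this. It is a minimal sketch, assuming paths use "/" separators and that a directory's val is the sum of the vals beneath it; `build_tree` is a hypothetical helper, not the project's own code.

```python
def build_tree(file_vals):
    """Fold {"dir/sub/file": val} into the nested name/val/children shape."""
    root = {"val": 0, "children": []}
    dirs = {(): root}  # tuple of path components -> directory node

    for path, val in sorted(file_vals.items()):
        parts = tuple(path.split("/"))
        # Walk from the root down to the file's parent directory,
        # creating missing directories and adding val to each ancestor.
        for depth in range(len(parts)):
            key = parts[:depth]
            if key not in dirs:
                node = {"name": key[-1], "val": 0, "children": []}
                dirs[parts[:depth - 1]]["children"].append(node)
                dirs[key] = node
            dirs[key]["val"] += val
        # Files are leaves: they carry a name and a val but no children.
        dirs[parts[:-1]]["children"].append({"name": parts[-1], "val": val})
    return root
```

Passing the result to `json.dumps` gives an object a treemap renderer can walk recursively.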
It then generates a second JSON object in the same way, but filtered for what we want to highlight (only certain emails, date ranges, etc.), and uses it to highlight the rectangles with an intensity based on the `val`s returned, e.g. highlighting the files changed most by a certain author.
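One way to turn the filtered vals into highlight intensities is to scale each one against the largest val in the result. This is a sketch under that assumption (linear scaling); the README does not say which mapping is actually used, and both function names are hypothetical.

```python
def highlight_intensities(filtered_vals):
    """Map each file path to an intensity in [0, 1], relative to the largest val."""
    peak = max(filtered_vals.values(), default=0)
    if peak <= 0:
        return {path: 0.0 for path in filtered_vals}
    return {path: val / peak for path, val in filtered_vals.items()}


def to_rgba(intensity):
    """Express an intensity as a red overlay whose opacity tracks the intensity."""
    return f"rgba(255, 0, 0, {intensity:.2f})"
```

With linear scaling, one very heavily edited file can wash out the rest; a logarithmic or square-root scale would be an alternative if that turns out to be a problem.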
Performance
These speeds were attained on my personal computer.
Database generation
Repo | Number of commits | Git log time | Git log size | Database time | Database size | Total time |
---|---|---|---|---|---|---|
linux | 1,154,884 | 60 minutes | 444MB | 462.618 seconds | 733MB | 68 minutes |
cpython | 115,874 | 4.6 minutes | 44.6MB | 36.607 seconds | 74.3MB | 5.2 minutes |
Time taken seems to scale linearly, going through approximately 300 commits/second, or requiring 0.0033 seconds/commit.
Git log size also scales linearly, at approximately 2600 commits/MB, or 384 B/commit. Database size scales the same way, at roughly 1570 commits/MB, or about 640 B/commit.
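The per-commit rates can be recomputed directly from the table. The figures below are copied from it; treating 1 MB as 10**6 bytes is an assumption, and `rates` is a hypothetical helper.

```python
# Figures copied from the table above; MB is taken as 10**6 bytes (an assumption).
RUNS = {
    "linux":   {"commits": 1_154_884, "total_min": 68,  "log_mb": 444,  "db_mb": 733},
    "cpython": {"commits": 115_874,   "total_min": 5.2, "log_mb": 44.6, "db_mb": 74.3},
}


def rates(run):
    """Derive throughput and per-commit storage cost for one table row."""
    commits = run["commits"]
    return {
        "commits_per_second": commits / (run["total_min"] * 60),
        "log_bytes_per_commit": run["log_mb"] * 1e6 / commits,
        "db_bytes_per_commit": run["db_mb"] * 1e6 / commits,
    }


for name, run in RUNS.items():
    print(name, {key: round(value) for key, value in rates(run).items()})
```

On the table's figures this gives roughly 283 and 371 commits/second, about 384 to 385 B of git log per commit, and about 635 to 641 B of database per commit.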
Querying database and displaying treemap
For this test I filtered each repo by its most prominent authors:
Currently `treemap.js` uses a global variable `MIN_AREA` to skip rendering the smallest files for better performance.
While these times are slower than desired, a more typically sized repo should perform fine.
Wanted features
Submodule tracking
Currently the only submodule changes that can be seen are the top level commit pointer changes. In the future I would like to recursively explore submodules and add their files to the database.
Faster database generation
Database generation currently relies on `git log`, which can take a very long time for large repos. Will look into other ways of getting the needed information on files.
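For reference, one variation on the git log approach is to stream the output and parse it line by line rather than writing the whole log out first. The sketch below assumes `git log --numstat` with a custom `--format`; the `@@@` sentinel, the field order, and both function names are hypothetical, and whether this is any faster than the current script is untested.

```python
import subprocess

HEADER = "@@@"  # hypothetical sentinel marking the start of each commit
# %H = commit hash, %ae = author email, %ce = committer email, %x09 = tab
LOG_ARGS = ["git", "log", "--numstat", f"--format={HEADER}%H%x09%ae%x09%ce"]


def parse_numstat(lines):
    """Yield (hash, author_email, committer_email, path, added, removed) rows."""
    commit = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(HEADER):
            commit = tuple(line[len(HEADER):].split("\t"))
        elif line and commit is not None:
            added, removed, path = line.split("\t", 2)
            # Binary files are reported as "-"; count them as zero lines.
            yield (*commit,
                   path,
                   0 if added == "-" else int(added),
                   0 if removed == "-" else int(removed))


def scan_repo(repo_dir):
    """Stream git log for repo_dir without holding the whole log in memory."""
    with subprocess.Popen(LOG_ARGS, cwd=repo_dir, stdout=subprocess.PIPE,
                          text=True, errors="replace") as proc:
        yield from parse_numstat(proc.stdout)
```

Renamed files and paths that git quotes are passed through as git prints them, so a real importer would need to normalise those before inserting rows.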
Asynchronous javascript
Currently no async functions are used. I believe the performance of the webpage could be improved if things such as file loading and SVG drawing were done asynchronously.
Remembering filters
Filters must be re-entered every time the page is loaded. Ideally filters would be remembered either through cookies or by storing the filters as a url query, which would allow users to bookmark queries.
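The URL query option could look something like the following on the server side. This is only a sketch using the standard library; the parameter names `email`, `since` and `until` and both helper functions are hypothetical, not part of the current app.

```python
from urllib.parse import urlencode, parse_qs


def filters_to_query(emails, since=None, until=None):
    """Encode the active filters as a URL query string."""
    params = [("email", email) for email in emails]
    if since:
        params.append(("since", since))
    if until:
        params.append(("until", until))
    return urlencode(params)


def query_to_filters(query):
    """Decode a query string back into the same filter values."""
    parsed = parse_qs(query)
    return {
        "emails": parsed.get("email", []),
        "since": parsed.get("since", [None])[0],
        "until": parsed.get("until", [None])[0],
    }
```

Repeating the `email` key once per author keeps the query readable and lets a bookmarked URL restore several authors at once.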
Filter builder sidebar
Current interface is very barebones. Want to have a sidebar that will allow users to select authors, date ranges, etc, to control the highlighting.
Selectable colours per author
Currently red is hardcoded for all results. To show multiple authors we want to highlight in different colours, and will need to decide how to colour files edited by both authors.