6 files changed, 47 insertions, 1 deletion
diff --git a/README.md b/README.md
index ecd44d4..344225c 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,13 @@
 # UKGenderPayGap
 Visualisations of the UK gender pay gap data. Publically avaliable, not often seen: https://gender-pay-gap.service.gov.uk/
 
-## Datasets
+## Dataset Notes
+
+Most of the information comes from the [UK Government website](https://gender-pay-gap.service.gov.uk/). The data is provided as a single CSV for each year, so I wrote a script that parses these CSV files and builds a relational database. Using a relational database has a number of advantages, including fast processing with MySQL queries and far less repeated data. Data from this source alone would not be enough, however. The information about employers is limited, so I scrape the HTML of the *Companies House* government website to fetch additional data about each employer. Each employer has a number of SIC codes, a code system describing what business an employer is in, but there are thousands of them, too many to use in practice. Thankfully the government provides a list of SIC section headers, which give a more manageable idea of what an employer does. Making the choropleth map was a significant challenge. The datasets provide the address of each employer, but this is not enough to pinpoint its location, because the county or local authority in the postal system is not the same as the 'official' one used by the local authorities system. Instead, we use ONS data to get the proper local authority or county name of an employer. The ONS data is very large and comprehensive; thankfully the [London Borough of Camden](https://www.data.gov.uk/dataset/7ec10db7-c8f4-4a40-8d82-8921935b4865/national-statistics-postcode-lookup-uk) produced a processed version of it, which is what we use. We scrape this data and append it to the database.
+
+There are some issues with these datasets. For instance, the choropleth map might not be especially useful, since it is based on employers' registered addresses: if an employer has many branches, only its head office will appear on the map. Some datasets contain only a few data points, which can distort the results even when using the median, for example female bonus pay in the waste management industry.
+
+## Dataset Usage
 
  - The main dataset is from the [UK Government's gender pay gap service](https://gender-pay-gap.service.gov.uk/). As more years are released, they can be downloaded and appended to the database:
 
diff --git a/requirements.txt b/requirements.txt
index dc0ba2d..b326d82 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,3 +6,5 @@ requests
 pandas
 PasteScript==3.2.0
 waitress
+mistune
+houdini.py
diff --git a/src/app.py b/src/app.py
index bd58cfc..25ea52f 100644
--- a/src/app.py
+++ b/src/app.py
@@ -2,6 +2,8 @@ from paste.translogger import TransLogger
 from waitress import serve
 import database
 import urllib.parse
+import mistune
+import houdini
 import flask
 import sys
 import json
@@ -27,6 +29,33 @@ def serve_index():
         charts = get_charts()["index"]
     )
 
+class MDRenderer(mistune.HTMLRenderer):
+    def block_code(self, code, info=None):
+        return '\n<pre><code>{}</code></pre>\n'.format(houdini.escape_html(code.strip()))
+
+    def heading(self, text, level):
+        if level == 1:
+            return ""
+        else:
+            return "<h%d>%s</h%d>" % (level + 1, text, level + 1)
+
+@app.route("/datasets")
+def serve_datasets():
+    md = mistune.create_markdown(
+        renderer = MDRenderer(),
+        plugins = ["url"]
+    )
+
+    with open(os.path.join(os.path.dirname(__file__), "..", "README.md"), "r") as f:
+        markdown_txt = f.read()
+    md_html = md(markdown_txt)
+
+    return flask.render_template(
+        "datasets.html.j2",
+        title = "Notes on Datasets",
+        md_html = md_html
+    )
+
 def get_charts():
     with open(os.path.join(os.path.dirname(__file__), "charts.json"), "r") as f:
         return json.load(f)
diff --git a/src/static/style.css b/src/static/style.css
index d3235c7..3c2b4f3 100644
--- a/src/static/style.css
+++ b/src/static/style.css
@@ -183,6 +183,10 @@ aside dl dt {
   overflow: auto;
 }
 
+#datasetnotes {
+  font-size: x-small;
+}
+
 footer {
   padding-left: 10%;
   padding-right: 10%;
diff --git a/src/templates/datasets.html.j2 b/src/templates/datasets.html.j2
new file mode 100644
index 0000000..4cf2e30
--- /dev/null
+++ b/src/templates/datasets.html.j2
@@ -0,0 +1,4 @@
+{% extends "template.html.j2" %}
+{% block content %}
+    <div id="datasetnotes">{{ md_html|safe }}</div>
+{% endblock %}
\ No newline at end of file
diff --git a/src/templates/template.html.j2 b/src/templates/template.html.j2
index be51c12..48c5da9 100644
--- a/src/templates/template.html.j2
+++ b/src/templates/template.html.j2
@@ -91,6 +91,7 @@
     </div>
 
     <footer>
+        <p><a href="/datasets">Notes on datasets used</a></p>
         <p><a href="https://github.com/jwansek/UKGenderPayGap">Source code</a> released under GPLv3 - <a href="https://git.eda.gay/UKGenderPayGap">Non-Github mirror</a></p>
     </footer>
 </body>
\ No newline at end of file
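
The renderer added in `src/app.py` drops the README's top-level heading and demotes every other heading by one level, so the rendered notes sit beneath the page's own title. The sketch below reproduces that logic as plain functions so it can be run without mistune or houdini installed. The function names `render_heading` and `render_block_code` are hypothetical, and `html.escape` from the standard library is an assumed stand-in for `houdini.escape_html`, so the exact set of escaped characters may differ from the patch.

```python
import html


def render_heading(text: str, level: int) -> str:
    """Mirror the heading rule in the patch: drop h1, demote the rest by one."""
    if level == 1:
        # The README title is omitted; the page template supplies its own title.
        return ""
    return "<h%d>%s</h%d>" % (level + 1, text, level + 1)


def render_block_code(code: str) -> str:
    """Wrap a code block in <pre><code>, escaping HTML-significant characters.

    html.escape is a stand-in (assumption) for houdini.escape_html.
    """
    return "\n<pre><code>{}</code></pre>\n".format(html.escape(code.strip()))


if __name__ == "__main__":
    print(repr(render_heading("UKGenderPayGap", 1)))
    print(render_heading("Dataset Notes", 2))
    print(render_block_code("  <b>example</b>\n"))
```

Keeping the rules in small pure functions like these would also make them straightforward to unit test independently of the Flask route.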