A Peek Under the Hood - this Website's Architecture and Analytics

Explaining how this website’s infrastructure works

Posted by Albert Cheng on 01 October 2019

Last Updated: 02 June 2020

The Portfolio Website’s Stack

As the first blog post on this website, I felt it would be appropriate to discuss how this website works and the technology under the hood.

The main requirement is a low-cost, scalable and highly-available website, without being too complex to manage. Therefore, a ‘serverless’ solution was chosen, where the heavy and complex parts of the web infrastructure are managed by a cloud provider (AWS).

As a serverless and scalable website, it uses the following technology stack:

  • Front-end:
    • HTML
    • CSS
    • Vanilla JavaScript
    • Bootstrap Framework

  • Back-end:
    • AWS Simple Storage Service (S3) Static Web Hosting

  • Content Management System and Static Site Generator:
    • Jekyll

  • Content Delivery Network (CDN) and Security:
    • AWS CloudFront
    • AWS Shield

  • Web Security:
    • AWS Lambda@Edge

  • Domain Registrar and Domain Name System (DNS) Management:
    • AWS Route 53

  • SSL Certificate Manager:
    • AWS Certificate Manager

  • Continuous Integration/Continuous Delivery (CI/CD) and Infrastructure-as-Code (IaC):
    • AWS Cloud Development Kit (CDK)

  • Traffic and Service Analytics:
    • Google Analytics

My Portfolio Website - Serverless Stack. Visualised by http://cloudcraft.co. AWS Architectural icons provided by Amazon Web Services. Jekyll icon by Vorillaz on Iconscout.

Serverless highly-available and scalable hosting

AWS S3

The website uses a simple static web page framework - no dynamic routing or single-page application framework is used. Therefore, AWS S3 Static Web Hosting is used.

As a serverless cloud service, AWS S3 provides scalable object storage, as well as scalable access to the content stored in it. There’s also no requirement to manage bandwidth and server capacity!

AWS S3 is designed for 99.999999999% data durability (also known as the ‘eleven nines’). Note that this is a durability design target rather than an availability SLA.

Best of all, it is pay-as-you-use, with no upfront costs.

AWS Certificate Manager

AWS Certificate Manager handles the issuing and renewal of public SSL certificates for CloudFront distributions (free of charge). This certifies the authenticity of the website’s domain and allows content to be accessed securely over HTTPS.

AWS CloudFront

AWS CloudFront, a Content Delivery Network (CDN), is used to:

  • ensure HTTPS secure access to content (will redirect HTTP connections to HTTPS)

  • protect against attacks, such as distributed denial of service (DDoS) attacks - provided by AWS Shield (free for all AWS accounts)

  • improve load times, as the content is cached in AWS Edge Network locations globally (over 200 locations worldwide)

  • manage the geographical routing and load balancing. Depending on where the user is, CloudFront will automatically route to the user’s nearest edge location cache to minimise load times.

AWS Edge Locations as at 1 May 2020 - as you can see, it spans the entire globe!

As a highly-available managed service, I don’t need to manage the global Edge Network, the actual propagation or caching. I just need to select an AWS S3 bucket as the Origin and the data is automagically globally cached and propagated!

CloudFront caches the static data for 24 hours, so if a user visits the site within the 24-hour period, content will be served from the edge location rather than the origin. Fortunately, AWS does not charge for data transfer between AWS S3 and Amazon CloudFront.
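The 24-hour caching behaviour described above can be pictured as a simple time-to-live (TTL) check. The following is only a conceptual sketch, not how CloudFront is actually implemented: the names `serveFromEdge`, `CacheEntry` and `fetchFromOrigin` are hypothetical, and a single in-memory map stands in for one edge location.

```typescript
// The 24-hour cache period mentioned in the text, in milliseconds.
const TTL_MS = 24 * 60 * 60 * 1000;

interface CacheEntry {
  body: string;
  cachedAtMs: number;
}

interface ServedContent {
  body: string;
  servedFrom: 'edge' | 'origin';
}

// Hypothetical model of one edge location: serve from the cache while the
// entry is younger than the TTL, otherwise go back to the origin (the S3 bucket).
function serveFromEdge(
  cache: Map<string, CacheEntry>,
  path: string,
  nowMs: number,
  fetchFromOrigin: (path: string) => string,
): ServedContent {
  const entry = cache.get(path);
  if (entry !== undefined && nowMs - entry.cachedAtMs < TTL_MS) {
    return { body: entry.body, servedFrom: 'edge' };
  }
  // Cache miss or expired entry: fetch from the origin and cache the result.
  const body = fetchFromOrigin(path);
  cache.set(path, { body, cachedAtMs: nowMs });
  return { body, servedFrom: 'origin' };
}
```

In this model, the first visitor in a region pays the round trip to the origin, and everyone else in the next 24 hours is served from the edge.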

Lambda@Edge for Security

Lambda@Edge is a feature of CloudFront that allows you to run serverless functions (Lambda functions) at the edge locations where your CloudFront distribution is.

It is an event-driven function, i.e. it only runs when a user visits the website and hits the CloudFront cache. Best of all, there is no server administration required, as the code is automatically distributed to the edge locations.

If a user hits, for example, the Irish edge location, the function code sitting in that location will be invoked.

I created a Lambda@Edge function that adds the below security-related HTTP headers every time a response is sent back to the user.

What HTTP headers are added?

Whenever a user visits a website, the user’s web browser requests a web page, and the server responds with the content along with HTTP headers. Lambda@Edge is used to add special types of HTTP headers (i.e. security headers).

Security headers are needed because, by default, web browsers are very trusting - they just load anything that is sent. This makes users vulnerable to malicious attacks, such as cross-site scripting (XSS) and Clickjacking.

The following main HTTP security headers are added/enforced as part of my Lambda@Edge implementation:

  1. Content Security Policy (CSP) - prevents injection-based attacks (e.g. Cross Site Scripting). Basically, it only allows whitelisted sources to load CSS, images, JavaScript, etc.

  2. HTTP Strict Transport Security (HSTS) - forces the web browser to only connect via HTTPS

  3. X-Content-Type-Options - forces the web browser not to load scripts and stylesheets unless the server indicates the correct MIME type

  4. X-Frame-Options - prevents Clickjacking, so the user is protected from clicking on invisible iframes on the page

  5. X-XSS-Protection - forces the web browser to stop loading pages when it detects reflected cross-site scripting (XSS) attacks

  6. Referrer-Policy - controls how much referrer information (i.e. the user’s originating website) is sent to the web server from the web browser
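A minimal sketch of what such a Lambda@Edge function could look like is shown below. It assumes the function is attached to a CloudFront response trigger (origin response or viewer response). The header values are illustrative placeholders only, not this website's real policy, and the helper name `addSecurityHeaders` and the trimmed-down event types are hypothetical.

```typescript
// Minimal local types covering only the parts of the CloudFront event used here.
type CloudFrontHeaders = Record<string, { key: string; value: string }[]>;

interface CloudFrontResponseEvent {
  Records: { cf: { response: { status: string; headers: CloudFrontHeaders } } }[];
}

// Illustrative values only (assumption): the site's actual policies are not listed in this post.
const SECURITY_HEADERS: Record<string, string> = {
  'Content-Security-Policy': "default-src 'self'",
  'Strict-Transport-Security': 'max-age=63072000; includeSubDomains',
  'X-Content-Type-Options': 'nosniff',
  'X-Frame-Options': 'DENY',
  'X-XSS-Protection': '1; mode=block',
  'Referrer-Policy': 'same-origin',
};

// Returns a copy of the headers with the security headers added or overwritten.
function addSecurityHeaders(headers: CloudFrontHeaders): CloudFrontHeaders {
  const result: CloudFrontHeaders = { ...headers };
  Object.keys(SECURITY_HEADERS).forEach((name) => {
    // CloudFront keys its headers map by the lowercase header name.
    result[name.toLowerCase()] = [{ key: name, value: SECURITY_HEADERS[name] }];
  });
  return result;
}

// Lambda entry point; in a deployed function this would be exported as `handler`.
const handler = async (event: CloudFrontResponseEvent) => {
  const response = event.Records[0].cf.response;
  response.headers = addSecurityHeaders(response.headers);
  return response;
};
```

Keeping the header logic in a small pure function makes it easy to check locally before the code is deployed to the edge locations.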

You can use the Mozilla Observatory to see how this website implements the above.

I got a B! Pretty good score! From Mozilla Observatory

Domain Name System (DNS) and SSL Certificate Management

AWS Route 53 (DNS routing) and AWS Certificate Manager (SSL certificate management) integrate nicely with the above AWS S3 and CloudFront implementation.

Being integrated, there is little or no extra cost to use these services.

As Route 53 is a highly-available managed service, I don’t need to manage the underlying DNS servers. Furthermore, being integrated, I can create DNS records that route to CloudFront and S3 resources via their AWS aliases, rather than the underlying IP addresses.

That way, I don’t need to manage routing parameters or routing tables that handle the underlying DNS or IP addresses of the CloudFront and S3 endpoints. Pretty neat!
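To illustrate the alias idea, here is a rough sketch of the record set that would be sent to Route 53 (the shape follows the `ChangeResourceRecordSets` API). The function name `cloudFrontAliasRecord` and both domain names are hypothetical placeholders. The hosted zone ID is the fixed value Route 53 uses for all CloudFront alias targets.

```typescript
// Fixed hosted zone ID that Route 53 uses for every CloudFront alias target.
const CLOUDFRONT_HOSTED_ZONE_ID = 'Z2FDTNDATAQYW2';

interface AliasChangeBatch {
  Changes: {
    Action: 'UPSERT';
    ResourceRecordSet: {
      Name: string;
      Type: 'A';
      AliasTarget: { HostedZoneId: string; DNSName: string; EvaluateTargetHealth: boolean };
    };
  }[];
}

// Builds the change batch for an alias A record pointing a domain at a CloudFront distribution.
function cloudFrontAliasRecord(domainName: string, distributionDomain: string): AliasChangeBatch {
  return {
    Changes: [
      {
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: domainName,
          Type: 'A',
          // No IP address appears anywhere: the record points at the distribution's DNS name.
          AliasTarget: {
            HostedZoneId: CLOUDFRONT_HOSTED_ZONE_ID,
            DNSName: distributionDomain,
            EvaluateTargetHealth: false,
          },
        },
      },
    ],
  };
}

// Hypothetical names, for illustration only.
const aliasBatch = cloudFrontAliasRecord('www.example.com', 'd111111abcdef8.cloudfront.net');
console.log(JSON.stringify(aliasBatch, null, 2));
```

Because the record refers to the distribution by name, AWS can change the IP addresses behind CloudFront without the DNS record ever needing an update.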

Furthermore, AWS will act as the registrar, registering and maintaining your domain under a top-level domain (such as .com).

CI/CD - AWS Cloud Development Kit (CDK)

CDK is an infrastructure-as-code framework that allows developers to programmatically provision AWS infrastructure (via TypeScript, Python, etc.).

It is open-source and allows me to manage infrastructure that would otherwise take hundreds of lines of template code. CDK uses high-level object-oriented programming to create abstractions of AWS resources, so it becomes logically much easier to deal with them.

Under the hood, CDK will compile the code into a CloudFormation template and apply it. This allows other non-CDK users to also see the status of deployments via the CloudFormation web UI.

For example, in TypeScript, this is how I would create a Lambda function and an S3 bucket (including granting the function the relevant IAM permissions on the bucket):

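  // CDK v1 module paths (assumed, as these were current when this post was written)
  import * as lambda from '@aws-cdk/aws-lambda';
  import { Bucket } from '@aws-cdk/aws-s3';

  // Named `fn` so that it does not shadow the imported `lambda` module
  const fn = new lambda.Function(this, 'Lambda', { /* ... */ });

  const bucket = new Bucket(this, 'MyBucket');

  /* This grants the Lambda function's IAM role permission to read
   from and write to the S3 bucket
  */
  bucket.grantReadWrite(fn);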

CDK is definitely a time-saver - 200 lines of CDK code is equivalent to up to 1,000 lines of CloudFormation template code!

Google Analytics

Google Analytics (GA) is a free tool used to analyse the website’s traffic and user behaviour, providing aggregated and anonymised data to ensure compliance with privacy laws.

The main use-case for Google Analytics in this website is to analyse user behaviour. That is, data collected using a user’s session on the website, such as:

  • How long they stayed on each webpage
  • Which pages they visited
  • How often new visitors come to the website
  • The common ‘pathways’ in which they traverse the website
  • Whether they are accessing the website through desktop vs mobile device

Example of Sankey diagram tracking user flow. Powered by Google Analytics.

These web analytics are essential to ensure the website is optimised for the most common use cases. For example, as part of responsive web design, the website should be optimised for viewing on a mobile device as well.

Tracking is kept to a minimum, and therefore the website does not opt in to tracking of more specific details, such as demographics and interests (e.g. age and gender).

Example of tracking site visits - no IP addresses are provided. Powered by Google Analytics.

As a note, this website’s privacy policy is accessible here.

Google Analytics Dashboards

A ‘one-stop shop’ approach can be taken with Google Analytics metrics - you can create your own custom dashboards. GA is a service that gets you to insights quicker, without having to worry about data collection, web logging, data aggregations and databases.

Example of Google Analytics dashboard. Powered by Google Analytics.

Costings

Putting it all together, the total cost of running this blog works out to be less than US$1/month.

Note that the CloudFront Distribution uses ‘Price Class ALL’ (i.e. all edge locations). However, for simplicity, only the pricing for the most expensive edge locations (India and Australia in the table below) is used.

Service / item                                                India               Australia
Route 53
  DNS hosted zone                                             US$0.50             US$0.50
  DNS standard queries to CloudFront (free)                   US$0.00             US$0.00
AWS Certificate Manager
  SSL certificate for CloudFront (free)                       US$0.00             US$0.00
AWS Shield
  Standard (free)                                             US$0.00             US$0.00
CloudFront
  'Transfer Out' to visitors, first 10 TB / month (per GB)    US$0.170            US$0.114
  HTTPS requests (per 10,000 requests)                        US$0.0120           US$0.0125
  First 1,000 cache invalidations (free)                      US$0.00             US$0.00
  Transfer between S3 origin and CloudFront (free)            US$0.00             US$0.00
Lambda@Edge
  Invocation, global (per request)                            US$0.0000006        US$0.0000006
  Compute @ Edge (128 MB, 3 seconds), per second              US$0.00000625125    US$0.00000625125
S3
  First 50 TB / month (per GB)                                US$0.025            US$0.025
  GET requests (per 1,000 requests)                           US$0.00044          US$0.00044
  PUT requests (per 1,000 requests)                           US$0.0055           US$0.0055


Therefore, the monthly costs will be (using the most expensive region for each item):

Item                                            Value
May 2020 web traffic (GA report)                1,500 visits
Cached site content size (CloudFront report)    15 MB
CloudFront data egress costs                    US$0.0026
HTTPS requests                                  US$0.0019
Lambda@Edge                                     US$0.0103
S3 GET requests                                 US$0.0007
Route 53 hosted zone                            US$0.5000
Total                                           US$0.5154
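
The arithmetic behind these figures can be sketched as below, using the unit prices listed earlier. Several things here are assumptions of this sketch rather than stated facts: one HTTPS request, one S3 GET and one Lambda@Edge invocation per visit, 1 GB taken as 1,000 MB, and one billed second of compute per invocation (which is what reproduces the Lambda@Edge figure above). The names `PRICES`, `Usage` and `monthlyCosts` are hypothetical.

```typescript
// Unit prices in US$ from the pricing list above (most expensive region per item).
const PRICES = {
  cloudFrontEgressPerGb: 0.17, // India
  httpsPer10kRequests: 0.0125, // Australia
  lambdaEdgePerInvocation: 0.0000006,
  lambdaEdgePerSecond128Mb: 0.00000625125,
  s3GetPer1kRequests: 0.00044,
  route53HostedZone: 0.5,
};

interface Usage {
  visits: number;
  cachedSiteSizeMb: number;
  billedSecondsPerInvocation: number; // assumption: not stated in the post
}

function monthlyCosts(usage: Usage) {
  const egress = (usage.cachedSiteSizeMb / 1000) * PRICES.cloudFrontEgressPerGb;
  const https = (usage.visits / 10000) * PRICES.httpsPer10kRequests;
  const lambdaEdge =
    usage.visits *
    (PRICES.lambdaEdgePerInvocation +
      usage.billedSecondsPerInvocation * PRICES.lambdaEdgePerSecond128Mb);
  const s3Get = (usage.visits / 1000) * PRICES.s3GetPer1kRequests;
  const route53 = PRICES.route53HostedZone;
  const total = egress + https + lambdaEdge + s3Get + route53;
  return { egress, https, lambdaEdge, s3Get, route53, total };
}

// May 2020 figures from the reports above.
const may2020 = monthlyCosts({ visits: 1500, cachedSiteSizeMb: 15, billedSecondsPerInvocation: 1 });
console.log(`Total: US$${may2020.total.toFixed(4)}`); // about US$0.5154
```

The Route 53 hosted zone dominates the bill; everything that scales with traffic adds up to under two cents at this volume.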

Closing Remarks

That was a short introduction to the underlying technology of this website, as well as the analytics done on top of it. I figured it is a good way to start a website on data blogging!