Friday 24 July 2015

Micro Architectures vs Modern Web Architectures (Macro Architectures)

What is this article all about?

Well, some interesting stuff mostly. Some nice bedtime stories for crazy technology people. OK, let's get started with the story.

What I am going to talk (write) about today is how similar the design of a microprocessor is to a modern web architecture. So much so that I am just going to call the latter a macro-architecture. The implications could be interesting: microprocessor architects could quickly find a way to jump into that next social-networking or e-commerce disruption and do some really interesting work.

Some definitions/comparative terms to get into the mood

Let's take these correspondences between web and micro architecture with a pinch of salt.

  1. Requests per second <=> Instructions per cycle (IPC)
  2. REST API <=> Instruction Set Architecture (ISA)
  3. App-servers <=> Functional or Execution Units (FU)
  4. Load-Balancers <=> Instruction Schedulers
  5. In-memory databases <=> L1/L2/L3 caches
  6. OLTP (online transaction processing) DB <=> Main Memory
  7. Backup and restore <=> Hard Disk Drive
  8. Frontend <=> Graphics Processor
  9. Full Page Caching <=> Image Buffering
  10. Notifications <=> Interrupts
  11. JSON <=> Instruction data and results
I hope I have the computer architect's attention.

What does a micro-architecture look like?

Here's what the state-of-the-art Intel Sandy Bridge micro-architecture looks like:

This is a very high-level architecture diagram and I am sure it doesn't capture the micro-architecture in its entirety. But it tells us enough for the present discussion. One complaint I do have is that it doesn't explicitly show the various queues that sit in front of the various execution units, nor the register files, which are important citizens of the micro-architecture world.

What is the micro-architecture trying to optimize?

The micro-architecture is all about providing performance and throughput. The first comes from the silicon, that is, (roughly) how fast the clock cycle is; the second comes from how many instructions get processed per cycle, which is mainly the domain of the micro-architects.

The program execution time is roughly: (number of instructions) / (IPC x clock frequency).

E.g.: Execution Time = 1000 billion instructions / (10 instructions/cycle x 1 giga-cycle/second) = 100 seconds

The micro-architecture's job is to increase IPC so that execution time can go down. The job of the EDA (electronic design automation) tools is to reduce the clock cycle time (i.e., raise the clock frequency) so that the execution time can go down. The programmer's job is to write good code that results in fewer instructions in the executable (with the help of an efficient compiler) so that the execution time can go down.
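
To make the arithmetic concrete, here is a minimal Python sketch (using the made-up numbers from the example above) that computes execution time from these three levers:

```python
def execution_time(num_instructions, ipc, clock_hz):
    """Execution time = instructions / (IPC * clock frequency)."""
    return num_instructions / (ipc * clock_hz)

# The example above: 1000 billion instructions, 10 instructions/cycle, 1 GHz clock.
print(execution_time(num_instructions=1000e9, ipc=10, clock_hz=1e9))  # 100.0 seconds
```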

What should the macro-architecture try to optimize then?

In the last section I introduced the concept of throughput but did not explain it. That was strategic. The macro-architecture's job should be to improve throughput and performance. Hey, what's the difference? You just interchanged these two terms.

Well, everyone loves performance, but the macro-architecture's main purpose is throughput, which is a slightly different concept. Imagine many programs contending for the same resource (the microprocessor). The interface (the OS in a computer system) gives you the illusion that you are the only user of the system. But behind the scenes the OS is the gatekeeper of the microprocessor, giving each program (from different users) fair and equal access to the computing resources. Now the individual execution time of each program is obviously going to go up (since there is a mandatory waiting period for the resources). Throughput is a measure of the aggregate performance of the system in this case; a simple metric would be programs completed per unit time. This definition is for the lay-person but should convey the idea.

In the web world, latency is roughly the measure of performance of the system from the individual's perspective, and throughput is the measure of how much load the system can handle as the number of users and incoming requests grows, at the same average latency. Latency is measured in seconds (e.g., 15 ms to load a page), and throughput in requests per second (RPS) or queries per second (QPS) serviced (e.g., 1000 requests/second or 1 million queries/second).
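
As a toy illustration (not a real load-testing tool; `handle_request` is a stand-in for an actual endpoint), here is how the two metrics relate for a single sequential worker:

```python
import time

def handle_request():
    time.sleep(0.015)  # pretend each request takes ~15 ms to service

start = time.time()
n_requests = 100
for _ in range(n_requests):
    handle_request()
elapsed = time.time() - start

print(f"average latency: {elapsed / n_requests * 1000:.1f} ms")
print(f"throughput:      {n_requests / elapsed:.0f} requests/second")
```

With one worker, throughput is just 1/latency; the whole point of the macro-architecture is to decouple the two, raising throughput by adding concurrent workers while holding average latency steady.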

How are the micro-architecture and the macro-architecture achieving scalability?

In the micro-architecture world, processing power is scaling "horizontally" by going multi-core. Alas, shared memory will always be a pain in the neck. Different threads of execution can run concurrently on the various cores, but when they contend for shared memory they have to synchronize, which essentially means serialize. And serialization is the arch-enemy of performance and parallelization: you cannot deal with serialization by throwing in more resources. Moore's Law enabled vertical scaling for decades through smaller process nodes, but on-chip horizontal scaling seems to be gaining some traction now.

In the world of the macro-architecture, the database, persistent or not, is the serialization agent, as it is a shared resource. Brewer's CAP theorem gives insight into what trade-offs can be made with databases for performance and scalability. Scaling vertically (running the system on more powerful machines) is expensive; scaling horizontally by adding commodity machines is more efficient.

What are important factors to consider for performance in a micro-architecture?

The performance of the micro-architecture depends on how it resolves resource, control, and data dependencies. The three kinds of data dependencies are WAW (Write after Write), RAW (Read after Write), and WAR (Write after Read). Two of these dependencies, WAW and WAR, can be resolved by renaming the write locations (so they do not clash), i.e., by renaming the registers if the destination is a register, or by allocating a slot in the store buffer. RAW is the real data dependency, as the read must be ordered after the write finishes. This dependency can be broken by value prediction, but unless there is a pattern it is hard to predict. RAR (Read after Read) does not pose a dependency at all.
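
Here is a toy illustration (plain Python, nothing like real hardware) of how register renaming removes WAW and WAR hazards: every write gets a fresh physical register, so only the true RAW dependencies remain:

```python
next_phys = 0
rename_map = {}  # architectural register -> current physical register

def rename(dst, srcs):
    """Translate source registers to their current physical names,
    then allocate a fresh physical register for the destination."""
    global next_phys
    phys_srcs = [rename_map[s] for s in srcs]
    rename_map[dst] = f"p{next_phys}"
    next_phys += 1
    return rename_map[dst], phys_srcs

# Program order: r1 = r2 + r3 ; r1 = r4 + r5  -- a WAW hazard on r1.
rename_map.update({"r2": "p100", "r3": "p101", "r4": "p102", "r5": "p103"})
print(rename("r1", ["r2", "r3"]))  # ('p0', ['p100', 'p101'])
print(rename("r1", ["r4", "r5"]))  # ('p1', ['p102', 'p103'])
# The two writes now target different physical registers (p0 and p1),
# so they can complete in either order.
```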

Control dependency has to do with the sequence of instructions being issued to the processor, which isn't always linear. When there is a branch in the instruction sequence, the branch-resolving instruction may stall the execution pipeline, because the correct path cannot be determined until the branch resolution completes. This dependency is resolved by branch predictors, which predict the direction and target address of the branch and execute speculatively while the branch is being resolved.
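
A classic (and deliberately simple) scheme is the 2-bit saturating counter; real predictors are far more elaborate, but this toy Python version shows the key property that a well-behaved loop branch mispredicts only once per loop exit:

```python
class TwoBitPredictor:
    """States 0-1 predict 'not taken', states 2-3 predict 'taken'."""
    def __init__(self):
        self.state = 2  # start weakly 'taken'

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate the counter between 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

predictor = TwoBitPredictor()
hits = 0
for taken in [True, True, False, True, True, True]:  # e.g. a loop branch
    if predictor.predict() == taken:
        hits += 1
    predictor.update(taken)
print(f"{hits}/6 predictions correct")  # 5/6
```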

Resource dependency is when instructions cannot execute because there are no available execution resources. Of course resources are easy to replicate, but a general-purpose processor cannot always accurately figure out the optimal number for a particular application; the numbers are usually tuned for particular benchmarks.

The arithmetic logic unit (ALU) can be seen as implementing the business logic of the micro-architecture system. It is stateless, so it has the desired properties for scalability. Likewise, other functional units such as the floating point unit (FPU) can also be regarded as business logic of the micro-architecture.

What are important factors to consider for scaling a macro-architecture?

The desirable property of the macro-architecture is that the app-server components implementing the business logic should be stateless, delegating state-keeping to other components. This way resource dependencies can be resolved simply by adding more app-servers.
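
A minimal sketch of the idea, using Flask and Redis purely as stand-ins (the hostname is hypothetical; any app framework and external store would do). Because the handler keeps no state between requests, any number of identical copies can sit behind the load balancer:

```python
from flask import Flask
import redis

app = Flask(__name__)
# All state lives in an external store, never in the app-server process.
store = redis.Redis(host="redis.internal", port=6379)  # hypothetical host

@app.route("/visit/<user_id>")
def visit(user_id):
    # Stateless handler: kill this process and restart it on any machine,
    # and the counter is unaffected.
    count = store.incr(f"visits:{user_id}")
    return {"user": user_id, "visits": int(count)}
```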

Coming to data dependencies, we have roughly the following schemes to resolve the various dependency patterns (a small sharding sketch follows the list):

  • RAR (Read after Read): Replication of the databases to increase read throughput.
  • WAW (Write after Write): We have two flavors, WAW on different data segments or on the same data segment.
    • Sharding: Separating different write segments onto different databases, which increases write throughput.
    • Versioned writes, more comparable to register/location renaming, to scale writes to the same location.
  • WAR (Write after Read): Handled the same way as WAW.
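
As a rough sketch of the sharding idea (hostnames are hypothetical; real systems usually prefer consistent hashing, since plain modulo reshuffles most keys whenever a shard is added):

```python
import hashlib

SHARDS = ["db0.internal", "db1.internal", "db2.internal", "db3.internal"]

def shard_for(key: str) -> str:
    """Route a key to a shard by hashing it, so writes to different
    key ranges land on different databases and proceed in parallel."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))
print(shard_for("user:43"))  # very likely a different shard
```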

Architectural Savior: Caching

Caching is a way to hide data-access latency. The data may reside in a hierarchical storage system with increasing access latencies: registers < L1 cache < L2 cache < L3 cache < RAM < SSD < HDD < Network (LAN) < Network (WAN). The latencies increase along the hierarchy, together with the storage capacities. Some of these storage structures are persistent, others are not.

When considering caching, we need to know about at least two things: spatial locality and temporal locality. This is best explained with an example. Imagine that we have to operate on a "working set" of 1 GB of data; however, 90% of our accesses involve only 256 KB of it and the remaining 10% involve the rest. A smart choice would be to put the heavily accessed 256 KB of data in a faster memory (the cache) rather than putting all of the 1 GB together in a slower, bigger memory. Let's play with some numbers, and imagine that an access to the fast memory takes 1 ms while an access to the bigger memory takes 10 ms. If 100% of my accesses go to the bigger memory, my average data-access time is 10 ms. If I do the caching, my average access time is (1x90 + 10x10)/100 = 1.9 ms, more than 5x faster! This is how we increase data-access throughput.
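
The same arithmetic as a quick Python check (numbers from the example above):

```python
def avg_access_time(hit_ratio, hit_ms, miss_ms):
    """Average access time for a given cache hit ratio."""
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms

print(avg_access_time(0.0, 1, 10))  # no cache: 10.0 ms
print(avg_access_time(0.9, 1, 10))  # 90% hits:  1.9 ms
```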

Spatial locality means that the data being accessed is placed close together in space, i.e., certain blocks of data are heavily accessed; temporal locality means that accesses to a single unit of data are close together in time. Data displaying either of these traits is a good candidate to be kept in the cache.

In a macro-architecture, the data is usually fetched from a database, which has a high access latency. If a particular piece of data is accessed frequently, it makes sense to put it in a cache to hide the db latency. Temporal locality plays a big role here.
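
The usual way to exploit that temporal locality is the cache-aside pattern. Here is a minimal sketch with Redis (`query_db`, the key layout, and the TTL are made up for illustration):

```python
import json
import redis

cache = redis.Redis()  # assumes a local Redis instance

def query_db(user_id):
    # Stand-in for the real (slow) database query.
    return {"id": user_id, "name": "alice"}

def get_user(user_id, ttl_seconds=60):
    """Cache-aside: check the cache first, fall back to the database,
    then populate the cache so the next read is fast."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)         # cache hit: sub-millisecond
    user = query_db(user_id)              # cache miss: pay the db latency
    cache.set(key, json.dumps(user), ex=ttl_seconds)
    return user
```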

Queues: A first class citizen of an architecture

If we look at any state-of-the-art micro-architecture, we will see queues everywhere. When we think of queues, we think of pipelines (a hardware queue), and when we think of pipelines, we think of throughput. The very first micro-architectures were simple pipelines: Fetch, Decode, Issue, Execute, Commit. Later on, as the micro-architecture evolved, a lot of components got added to this simple design, and queues were among the most important: reservation stations, the reorder buffer, and load/store queues are all good examples. Queues serve two important functions: they absorb a temporary increase in demand for a particular resource, and they serialize the commit stage. We shall see both of these factors also play a role in macro-architectures.
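
Both functions show up in a toy producer/consumer sketch (plain Python's `queue` module; in a macro-architecture this slot is filled by RabbitMQ, Kafka, etc.): the bounded queue absorbs a burst of arrivals, and a single consumer serializes the "commit":

```python
import queue
import threading
import time

jobs = queue.Queue(maxsize=100)  # bounded: producers feel back-pressure when full

def producer(n):
    for i in range(n):
        jobs.put(f"job-{i}")  # a burst arrives much faster than...

def consumer():
    while True:
        jobs.get()
        time.sleep(0.01)      # ...the single committer can drain it
        jobs.task_done()      # commits happen one at a time, in order

threading.Thread(target=consumer, daemon=True).start()
producer(50)
jobs.join()  # block until every job has been committed
print("all 50 jobs committed")
```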

Let us look at some special types of queues in the micro-architecture: a point-to-point queue can be used for Single Instruction Single Data (SISD), multiple point-to-point queues for Single Instruction Multiple Data (SIMD), and a topic-based message-passing queue for the Multiple Instruction Single Data (MISD) and Multiple Instruction Multiple Data (MIMD) cases. Basically, MISD is a single-producer multiple-consumer scenario and MIMD is a multiple-producer multiple-consumer scenario.

OLAP and OLTP processing for the architecture

For both architectures, Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP) are integral components. OLTP is immediately visible; it is what is absolutely needed for the functional working of the system. Interestingly, there is a lot of OLAP going on behind the scenes as well. Microprocessors capture the characteristics of the instruction and data streams through simple or fairly complex analytics to make real-time decisions for instruction prefetching, data prefetching, branch prediction, branch target address generation, memory address speculation, and sometimes value prediction. The end result is better performance and throughput, thanks to highly accurate prediction.

The OLAP and OLTP parts of the system must be independently designed for maximum efficiency and modularization. The components have different requirements, yet they can work together to deliver increased performance.

Architecture Evolution

Both micro-architecture and macro-architecture started off as monolithic systems. This is a good starting point, which provides a consistent interface to the system. But as the demands on the system grow, the monolith must be broken up to enable each component of the system to evolve independently. Profiling to identify bottlenecks is the main driver of this evolution: as long as there is an identifiable bottleneck, there is a potential way to alleviate it, mitigate it, or, as I would prefer, eliminate it. The micro-architecture regime saw this evolution in the move from in-order execution to out-of-order execution, which provided an elegant way of neutralizing resource bottlenecks. Other innovations boosted performance by hiding data-access latency through caching and efficient organization of the memory/data-storage hierarchy. Out-of-order execution with in-order commit essentially provides the illusion of sequential execution of the program, but at much superior performance.

The web product system also started off as a monolith, but it too had to evolve over time for increased throughput. Next in the path of evolution was the Service-Oriented Architecture (SOA). This can be seen as a move to grab the low-hanging fruit of resource dependencies. There is even the concept of the Enterprise Service Bus (ESB), which is reminiscent of the shared CPU bus (data/address/control bus). Processor architects know how quickly a bus turns into a formidable bottleneck, and there are more elegant solutions that reduce bus contention. Next in this path of evolution is the Micro-Service Architecture (MSA). However, data-access latency will continue to remain a challenging bottleneck, and inter-component communication can be a bottleneck too.

Docker is a very promising technology that can enable MSA going forward; however, it does need to solve some architectural issues first, mainly that of communication between different containers.

Different components of a Macro-architecture

Fortunately there are a lot of high-quality open-source solutions for quickly putting together a scalable system. But just as in the case of designing a micro-architecture, a lot of tuning needs to be done to push the system to deliver maximal performance and throughput.

Here are some examples of open-source components:

  1. Load Balancers: Nginx, HAProxy
  2. Queueing: RabbitMQ, ActiveMQ, ZeroMQ, Kafka, Kestrel
  3. Caching: Redis, Memcached, Membase
  4. Full-Page Caching: Varnish, Squid
  5. DataStores: RDBMS (MySQL, MariaDB, PostgreSQL), Document-based (MongoDB, CouchDB)
  6. Analytics Systems: Column-Oriented (Cassandra, HBase), Graph (Neo4j, HyperGraphDB)
  7. Time-Series Databases: InfluxDB, OpenTSDB
  8. Batch Processing: Hadoop, Spark
  9. Stream Processing: Apache Storm, Apache Spark
  10. Monitoring/Alerting: Nagios, Graphite, Grafana, Ganglia, Icinga, Zabbix
  11. Search/Indexing: Sphinx, Solr, Elasticsearch
  12. Log Analysis: ELK stack (Elasticsearch, Logstash, Kibana)
  13. Messaging: Ejabberd, Openfire
  14. Service Discovery: ZooKeeper, Eureka
  15. Networked Storage: Ceph, GlusterFS
Most of the components are independently scalable.

Conclusions and Recommendations

As pointed out in this article, there is a lot of common ground between the two architectures. Each area can learn and apply architectural patterns from the other to deliver performance and scalability. The web macro-architecture is inherently concurrent in nature, and concurrent execution is the forte of hardware designers, so I think there can be a lot of healthy symbiosis between the two communities. The good thing about architecture is that it critically examines the bottlenecks of a system and quantitatively characterizes what can be scaled, and by how much. It studies the limits of performance. A better design pattern leads to a clean visualization of the system, and that clarity enhances innovation.
