Wednesday 29 July 2015

Virtualization for Techies and Non-Techies

What is virtualization, really?

In this post, I will write about virtualization, its omnipresence, and its impact.

A First Example

I am sure we all know the Phantom from the comics, and that he stands for justice, doling out punishment to wrong-doers. However, none of that is as interesting as the fact that his legend never dies. Readers of the comics are not kept in suspense about how this is accomplished: when the Phantom of the current generation dies, his successor dons the suit to carry on the legacy. The antagonists of the comic are never able to figure out how, in spite of all their machinations, the Phantom keeps coming back to take down their wicked designs. This is our first exposure to a virtual, or in more sophisticated terms logical, Phantom who never dies, realized by an everyday mortal, a physical human being, who simply lives out a normal lifespan. We see an interface of virtue being virtualized, and a physical form which perpetuates the notion of continuity through the generations.

So we see that attributes can be separated from entities: the logical is separated from the physical, and the virtual is isolated from the real, to provide an illusion of continuity and capacity that is simply not possible with the real entity alone.

Another Example

From time immemorial mankind has engaged in commerce, which we can loosely define as some form of exchange. Prehistoric commerce was mostly a physical exchange. Commodities could be exchanged for other commodities, e.g. 5 apples for 8 oranges. Commodities could be exchanged for services, e.g. 1 bucket of rice for 1 day of labour. Services could be exchanged for services, e.g. 1 day of working in the field for 1 hour of teaching martial arts. This was known as the barter system. It had its merits and its demerits: commerce was possible, but it was simply not that convenient. What if I wanted apples and could teach you martial arts, but you had oranges and wanted to learn martial arts? So some form of token evolved to virtualize the barter system, or physical commerce. You could learn martial arts from me in exchange for tokens, which I could then exchange for the amount of apples I wanted. The value assigned to a unit of a certain commodity or service would depend on demand and supply. The value of everything that can be exchanged could therefore be mapped to a number of tokens, which greatly simplified doing commerce. What I call tokens is what the modern world knows as money. If the tokens are different, an exchange rate has to be agreed upon, and this is known as foreign exchange or FOREX. Commerce today is synonymous with virtual commerce: e-commerce is a form which does not require a physical marketplace, and it can be seen as the virtualization of the physical shop or place of commerce.

Examples from Computer Technology

From this point onwards, I will talk about how different physical components of the computer were virtualized.

Virtual Memory

One of the notable examples of virtualization is the concept of Virtual Memory. Computers execute programs from RAM (Random Access Memory). The physical capacity of RAM is limited (typically 1GB, 4GB, 16GB, etc.). What if the size of the program plus the memory it needs (static and dynamic) exceeds that of the physical RAM? This is precisely the problem Virtual Memory aims to solve. Virtual Memory provides the program with the illusion of a memory many times larger than the physical one (say 64GB of virtual memory on top of 512MB of RAM). This is achieved by the system giving the process (a running instance of the program) an address space bigger than the physical address space of the RAM. It is the Operating System's (OS) responsibility to map virtual addresses to the physical addresses of the RAM. What about the other chunks of the virtual address space? They are simply stored in the designated swap space on the hard disk. The OS divides the virtual and physical address spaces into pages, and swaps pages between RAM and swap space on demand.
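To make the mapping concrete, here is a minimal Python sketch of the idea, with a made-up page size, frame count, and a FIFO eviction policy; it illustrates address translation and demand paging, not how any real OS implements them:

    PAGE_SIZE = 4096          # bytes per page (illustrative)
    NUM_FRAMES = 4            # pretend the "RAM" only has 4 physical frames

    page_table = {}           # virtual page number -> physical frame number
    frames_in_use = []        # FIFO list of resident virtual page numbers

    def translate(virtual_address):
        """Map a virtual address to a physical address, faulting if needed."""
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn not in page_table:                 # page fault
            if len(frames_in_use) == NUM_FRAMES:  # RAM full: evict the oldest page
                victim = frames_in_use.pop(0)
                free_frame = page_table.pop(victim)
                # a real OS would write the victim page out to swap space here
            else:
                free_frame = len(frames_in_use)
            # a real OS would read the faulting page in from swap/disk here
            page_table[vpn] = free_frame
            frames_in_use.append(vpn)
        return page_table[vpn] * PAGE_SIZE + offset

    # A process can touch far more "virtual memory" than the 4 physical frames:
    for addr in (0, 5000, 20000, 40000, 100000):
        print(hex(addr), "->", hex(translate(addr)))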

Virtual FileSystem

The physical persistent storage space of hard disks is virtualized with a virtual filesystem (VFS) interface. A disk can be partitioned into different logical volumes, each giving the illusion of a separate drive. The VFS also makes it simple to access data from different physical storage devices (hard disks, network attached storage (NAS), USB drives) through a single virtual file system, by mounting them into the root file system.
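As a toy illustration of the dispatch idea (not the real Linux VFS interface), here is a sketch where a mount table maps path prefixes to hypothetical backend objects sharing a common read interface:

    class DictBackend:
        """Stands in for a real filesystem driver (disk, NAS, USB, ...)."""
        def __init__(self, files):
            self.files = files
        def read(self, relative_path):
            return self.files[relative_path]

    # Toy mount table: path prefix -> backend object.
    mounts = {
        "/":        DictBackend({"etc/hosts": "127.0.0.1 localhost"}),
        "/mnt/usb": DictBackend({"photo.jpg": "<jpeg bytes>"}),
        "/mnt/nas": DictBackend({"backup.tar": "<tar bytes>"}),
    }

    def vfs_read(path):
        # Pick the longest mount prefix that matches, then delegate to its backend.
        prefix = max((m for m in mounts if path.startswith(m)), key=len)
        relative = path[len(prefix):].lstrip("/")
        return mounts[prefix].read(relative)

    print(vfs_read("/etc/hosts"))           # served by the root filesystem
    print(vfs_read("/mnt/usb/photo.jpg"))   # served by the USB backend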

Multitasking: Virtual Computing

A single-core microprocessor is only capable of running a single stream of machine instructions. Single-stream execution is virtualized to provide the illusion of multitasking by interleaving streams of instructions from different processes into one stream. In other words, a single CPU is time-sliced to provide shared access to multiple processes. When a running process relinquishes control of the CPU, its process state (whatever is needed to resume execution) is saved to memory or disk. The context of the process selected to run next is restored, and the OS hands control over to it for the duration of its slice. A process may give up the CPU before its time slice (epoch) finishes if it terminates, blocks waiting for IO, or is preempted by a higher-priority process.
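Here is a minimal sketch of the time-slicing idea, using Python generators to stand in for processes and a plain list as the run queue; real schedulers are of course far more involved:

    def process(name, steps):
        """A 'process' that does some work, yielding back to the 'OS' each step."""
        for i in range(steps):
            print(f"{name}: step {i}")
            yield        # end of this time slice: give the CPU back

    run_queue = [process("A", 3), process("B", 2), process("C", 4)]

    # The 'OS' interleaves the instruction streams, one time slice per process.
    # Saving/restoring context is handled implicitly by each generator's state.
    while run_queue:
        current = run_queue.pop(0)
        try:
            next(current)               # run the process for one time slice
            run_queue.append(current)   # context switch: back into the queue
        except StopIteration:
            pass                        # process terminated; do not reschedule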

Virtual CPU

I made this a distinct entry to distinguish it from multitasking. Intel Hyper-Threading allows two threads of execution to run within a single physical CPU core by duplicating the execution state inside the core, while all other resources are shared. Hence, Intel presents two logical CPUs using the resources of a single underlying physical CPU.

Virtual Private Network

There are many examples of virtualization in computer networking, such as virtual routing, but we'll talk about the Virtual Private Network as it is more interesting. The internet is a public infrastructure, which means any computer is accessible from any other computer connected to the internet. A private Local Area Network (LAN) is usually implemented by building a LAN using a private networking address space and interfacing with the internet through a gateway. The gateway controls what gets into the network and what gets out. The computers inside the LAN do not have public IP (Internet Protocol) addresses, and hence are not directly accessible by computers on the public internet. If a group of computers on the internet, each with a public IP address, wish to behave as if they were inside a particular LAN, then a Virtual Private Network (VPN) is the solution.

Virtual Hosts

This is how a server serving internet traffic is virtualized. Suppose a physical machine (server) is accessible using different domain names, e.g. blog.example1.com and wiki.example2.com. Usually the Domain Name System (DNS) would resolve both names to the same IP address of the machine. Depending on which domain name was used to access the server, the server can hand out different content: blog.example1.com would be a blog from company example1, and wiki.example2.com would be a wiki system from company example2, both hosted on a single server. This is also known as shared hosting.
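Here is a minimal sketch of name-based virtual hosting using Python's standard http.server module; the same server chooses content based on the Host header (the domain names and port are just placeholders):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Same server, different content per requested domain name (placeholders).
    SITES = {
        "blog.example1.com": b"<h1>Blog of company example1</h1>",
        "wiki.example2.com": b"<h1>Wiki of company example2</h1>",
    }

    class VirtualHostHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            host = self.headers.get("Host", "").split(":")[0]  # strip the port
            body = SITES.get(host, b"<h1>Unknown virtual host</h1>")
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # try: curl -H "Host: blog.example1.com" http://localhost:8080/
        HTTPServer(("", 8080), VirtualHostHandler).serve_forever()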

Virtual Machine

This is a fascinating concept. We usually have a host OS on the computer, e.g. Linux or Windows. Let's assume we have a Windows system and want to run various Linux versions on our machine, and that we do not want to physically partition the machine's Hard Disk Drive (HDD). In some cases we may not even have that privilege. The solution is the Virtual Machine (VM). This is software which presents an entire machine interface to us, which means we can run whatever OS we want inside it. We can run multiple instances of VMs within a single physical machine; of course, the resources will need to be divided among the running instances. Examples are VMware and VirtualBox.

Virtual OS: Containers

Docker is the current state-of-the-art disruption in this space. Rather than the machine, it is the OS that gets virtualized, by providing namespace and process isolation to the container. The container shares the OS kernel of the host machine, so the overhead is lower than that of a VM, which has to emulate an entire machine. Within the container, the running processes can only see the resources allocated to that container; they cannot see the processes of other containers or directly access their resources.

Friday 24 July 2015

Micro Architectures vs Modern Web Architectures (Macro Architectures)

What is this article all about?

Well, some interesting stuff mostly. Some nice bedtime stories for crazy technology people. OK, let's get started with the story.

What I am going to talk (write) about today is how similar the design of a microprocessor looks to a modern web architecture. So much so that I am just going to call the latter a macro-architecture. The implications could be interesting: microprocessor architects could quickly find a way to jump into the next social networking or e-commerce disruption and do some really interesting work.

Some definitions/comparative terms to get into the mood

Let's take these correspondences between web and micro architecture with a pinch of salt.

  1. Requests per second <=> Instructions per cycle (IPC)
  2. REST API <=> Instruction Set Architecture (ISA)
  3. App-servers <=> Functional or Execution Units (FU)
  4. Load-Balancers <=> Instruction Schedulers
  5. In-memory databases <=> L1/L2/L3 caches
  6. OLTP (online transaction processing) DB <=> Main Memory
  7. Backup and restore <=> Hard Disk Drive
  8. Frontend <=> Graphics Processor
  9. Full Page Caching <=> Image Buffering
  10. Notifications <=> Interrupts
  11. JSON <=> Instruction data and results
I hope I have the computer architect's attention.

How does a micro-architecture look like?

Here's what the state-of-the-art Intel Sandy Bridge micro-architecture looks like:

This is a very high-level architecture diagram, and I am sure it doesn't capture the micro-architecture in its entirety. But it tells us enough for the present discussion. One complaint I do have is that it doesn't explicitly show the various queues which sit in front of the execution units, nor the register files, both of which are important citizens of the micro-architecture world.

What is the micro-architecture trying to optimize?

The micro-architecture is all about providing performance and throughput. The first comes (roughly) from the silicon, that is, how fast the clock cycle is; the second comes from how many instructions get processed per cycle, which is mainly the domain of the micro-architects.

The program execution time is roughly: (number of instructions) / (IPC × clock frequency).

E.g.: Execution Time = 1000 billion instructions / (10 instructions/cycle × 1 Giga cycles/second) = 100 seconds

The micro-architect's job is to increase IPC so that the execution time goes down. The job of the EDA (electronic design automation) tools is to shorten the clock cycle (increase the clock frequency) so that the execution time goes down. The programmer's job is to write good code that results (with the help of an efficient compiler) in fewer instructions in the executable, so that the execution time goes down.

What should the macro-architecture try to optimize then?

In the last section I introduced the concept of throughput but did not explain it. That was strategic. The macro-architecture's job should be to improve throughput and performance. Hey, what's the difference? You just interchanged these two terms.

Well, everyone loves performance, but the macro-architecture's main purpose is throughput, which is a slightly different concept. Imagine many programs contending for the same resource (the microprocessor). The interface (the OS in a computer system) gives you the illusion that you are the only user of the system, but behind the scenes the OS is the gatekeeper of the microprocessor, allowing each program (from different users) fair and equal access to the computing resources. Now the individual execution time of each program is obviously going to get worse (since there are mandatory waiting periods for the resources). Throughput is a measure of the aggregate performance of the system in this case; a simple metric would be the number of programs completed per unit time. This definition is for the lay person, but it should convey the idea.

In the web world, latency is roughly the measure of the performance of the system from an individual's perspective, while throughput is the measure of how much load the system can handle, at the same average latency, as the number of users and incoming requests grows. Latency is measured in units of time (e.g. 15 ms to load a page), and throughput in requests per second (RPS) or queries per second (QPS) served (e.g. 1000 requests/second or 1 million queries/second).

How do the micro-architecture and the macro-architecture achieve scalability?

In the micro-architecture world, processing power is scaling "horizontally" by going multi-core. Alas, shared memory will always be a pain in the neck. Different threads of execution can run concurrently on the various cores, but when they contend for shared memory they have to synchronize, which essentially means serialize. And serialization is the arch-enemy of performance and parallelization: you cannot deal with serialization by throwing in more resources. Moore's Law enabled vertical scaling for decades through smaller process nodes, but on-chip horizontal scaling seems to be gaining some traction now.

In the world of macro-architecture, the database, persistent or not, is the serialization agent, as it is a shared resource. Brewer's CAP theorem gives insight into what trade-offs can be made with databases for performance and scalability. Scaling vertically (running the system on more powerful machines) is expensive; scaling horizontally by adding commodity machines is more efficient.

What are important factors to consider for performance in a micro-architecture?

The performance of the micro-architecture depends on how it resolves resource, control, and data dependencies. The three kinds of data dependencies are WAW (Write After Write), RAW (Read After Write), and WAR (Write After Read). Two of these, WAW and WAR, can be resolved by renaming the write locations (so they do not clash), i.e. by renaming the registers if the destination is a register, or by allocating space in the store buffer. RAW is the real data dependency, as the read must be ordered after the write finishes. This dependency can be broken by value prediction, but unless there is a pattern it is hard to predict. RAR (Read After Read) does not pose a dependency at all.
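To make the renaming idea concrete, here is a toy Python sketch (the instruction format and register names are invented for illustration) that gives every write a fresh physical register; the false WAW/WAR dependencies disappear while the true RAW dependency remains:

    # Toy instruction stream: (destination, source1, source2)
    instructions = [
        ("r1", "r2", "r3"),   # r1 = r2 op r3
        ("r4", "r1", "r5"),   # RAW on r1: must wait for the write above (true dep)
        ("r1", "r6", "r7"),   # WAW on r1, and WAR with the read of r1 (false deps)
    ]

    def rename(instrs):
        """Give every write a fresh physical register; reads use the latest mapping."""
        mapping = {}          # architectural register -> current physical register
        counter = 0
        renamed = []
        for dest, src1, src2 in instrs:
            s1 = mapping.get(src1, src1)
            s2 = mapping.get(src2, src2)
            counter += 1
            phys = f"p{counter}"
            mapping[dest] = phys      # later readers of `dest` see the new name
            renamed.append((phys, s1, s2))
        return renamed

    for inst in rename(instructions):
        print(inst)
    # ('p1', 'r2', 'r3'), ('p2', 'p1', 'r5'), ('p3', 'r6', 'r7')
    # The third instruction no longer clashes with the first two and may execute early.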

Control dependency has to do with the sequence of instructions being issued to the processor, which isn't always linear. When there is a branch in the instruction sequence, the branch-resolving instruction may stall the execution pipeline, because the correct path cannot be determined until the execution of the branch-resolving instruction completes. This dependency is resolved by branch predictors, which predict the direction and the target address of the branch so that execution can continue speculatively while the branch is being resolved.
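As an illustration of how simple a predictor can be, here is the textbook two-bit saturating-counter scheme sketched in Python; real front-ends combine far more elaborate predictors:

    class TwoBitPredictor:
        """Classic 2-bit saturating counter per branch address (a textbook scheme)."""
        def __init__(self):
            self.counters = {}          # branch PC -> counter in 0..3

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2   # >= 2 means predict "taken"

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

    predictor = TwoBitPredictor()
    outcomes = [True] * 8 + [False] + [True] * 8   # a loop branch with one exit
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(0x400) == taken
        predictor.update(0x400, taken)
    print(f"{correct}/{len(outcomes)} predicted correctly")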

Resource dependency arises when instructions cannot execute because there are no available execution resources. Of course, resources are easy to replicate, but a general-purpose processor cannot always know the optimal number for a particular application; the numbers are usually tuned for a particular set of benchmarks.

The arithmetic logic unit (ALU) can be seen as implementing the business logic of the micro-architecture system. It is stateless, so it has the desired properties for scalability. Likewise, other functional units, such as the floating point unit (FPU), can also be regarded as the business logic of the micro-architecture.

What are important factors to consider for scaling a macro-architecture?

The desirable property of a macro-architecture is that the app-server components implementing the business logic should be stateless, delegating state-keeping to other components. This is how resource dependencies are resolved: stateless app-servers can simply be replicated.

Coming to data dependencies, we have roughly the following schemes to resolve them:

  • RAR (Read after Read): replication of the databases increases read throughput.
  • WAW (Write after Write): comes in two flavours, writes to different data segments or to the same data segment.
    • Sharding: separating different write segments onto different databases increases write throughput (see the sketch after this list).
    • Versioned writes, more comparable to register/location renaming, scale writes to the same location.
  • WAR (Write after Read): handled the same way as WAW.
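Here is the promised sketch of hash-based sharding, with two hypothetical in-memory stand-ins for database shards; writes to keys that hash to different shards no longer serialize on a single machine:

    import hashlib

    # Hypothetical database connections; dicts stand in for real client objects.
    shards = [
        {"name": "db-shard-0", "rows": {}},
        {"name": "db-shard-1", "rows": {}},
    ]

    def shard_for(key):
        """Route a key deterministically to one shard via a stable hash."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return shards[int(digest, 16) % len(shards)]

    def write(key, value):
        shard = shard_for(key)
        shard["rows"][key] = value   # writes to different shards can proceed in parallel
        return shard["name"]

    for user in ("alice", "bob", "carol", "dave"):
        print(user, "->", write(user, {"plan": "free"}))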

Architectural Savior: Caching

Caching is a way to hide data access latency. The data may reside in a hierarchy of storage with increasing access latencies: registers < L1 cache < L2 cache < L3 cache < RAM < SSD < HDD < Network (LAN) < Network (WAN). The latencies increase along the hierarchy together with the storage capacities. Some of these storage structures are persistent, others are not.

When considering caching, we need to know about at least two things: spatial locality and temporal locality. This is best explained using an example. Imagine that we have to operate on a "working set" of 1 GB of data, but 90% of our accesses involve only 256 KB of it and the remaining 10% involve the rest of the data. A smart choice would be to put the heavily accessed 256 KB of data in a faster memory (a cache), rather than serving all of the 1 GB from the slower, bigger level of the memory hierarchy. Let's play with some numbers and imagine that an access to my fast memory takes 1 ms, while an access to the bigger memory takes 10 ms. If 100% of my accesses go to the bigger memory, my average data access time is 10 ms. If I do the caching, my average access time is (1×90 + 10×10)/100 = 1.9 ms, more than five times faster! This is how we increase data access throughput.
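The same arithmetic as a tiny Python helper, so the numbers can be replayed with different hit ratios and latencies:

    def average_access_time(hit_ratio, fast_ms, slow_ms):
        """Expected access latency with a cache in front of a slower store."""
        return hit_ratio * fast_ms + (1 - hit_ratio) * slow_ms

    print(average_access_time(0.0, 1, 10))   # no cache: 10.0 ms
    print(average_access_time(0.9, 1, 10))   # 90% of accesses hit the cache: 1.9 ms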

Spatial locality means that the data being accessed is placed close together in space, i.e. certain blocks of data are heavily accessed, while temporal locality means that accesses to a single unit of data are close together in time. Data displaying either of these traits is a good candidate to be kept in the cache.

In a macro-architecture, data is usually fetched from a database, which has a higher access latency. If a particular piece of data is accessed frequently, it makes sense to put it in a cache to hide the database latency. Temporal locality plays a big role here.
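Here is a sketch of the common cache-aside pattern (the function names, TTL, and the simulated query are all made up for illustration): check the cache first, fall back to the database on a miss, and populate the cache so that temporally local reads are served from fast memory.

    import time

    cache = {}                      # key -> (value, expiry timestamp)
    TTL_SECONDS = 60

    def slow_db_query(user_id):
        time.sleep(0.05)            # stand-in for a real database round trip
        return {"id": user_id, "name": f"user-{user_id}"}

    def get_user(user_id):
        entry = cache.get(user_id)
        if entry and entry[1] > time.time():   # cache hit, still fresh
            return entry[0]
        value = slow_db_query(user_id)         # cache miss: pay the db latency once
        cache[user_id] = (value, time.time() + TTL_SECONDS)
        return value

    get_user(42)                    # slow: goes to the database
    get_user(42)                    # fast: served from the cache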

Queues: A first class citizen of an architecture

If we look at any state-of-the-art micro-architecture, we will see queues everywhere. When we think of queues, we think of pipelines (a hardware queue), and when we think of pipelines, we think of throughput. The very first micro-architectures were simple pipelines: Fetch, Decode, Issue, Execute, Commit. Later on, as micro-architectures evolved, a lot of components got added to this simple design, and queues were among the most important: reservation stations, the reorder buffer, and load/store queues are all good examples. Queues serve two important functions: they absorb a temporary increase in demand for a particular resource, and they serialize the commit stage. We shall see that both these factors also play a role in macro-architectures.

Let us look at some special types of queues in a micro-architecture: a point-to-point queue can be used for Single Instruction Single Data (SISD), multiple point-to-point queues for Single Instruction Multiple Data (SIMD), and a topic-based message-passing queue for the Multiple Instruction Single Data (MISD) and Multiple Instruction Multiple Data (MIMD) cases. Basically, MISD is a single-producer multiple-consumer scenario and MIMD is a multiple-producer multiple-consumer scenario.
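On the macro-architecture side, the same buffering role is played by message queues; here is a minimal producer/consumer sketch using Python's standard queue module, where a bounded queue absorbs a burst of requests that a slower consumer drains at its own pace:

    import queue, threading, time

    work_queue = queue.Queue(maxsize=100)    # bounded: absorbs temporary bursts

    def producer():
        for i in range(20):                  # a burst of 20 requests arrives quickly
            work_queue.put(f"request-{i}")
        work_queue.put(None)                 # sentinel: no more work

    def consumer():
        while (item := work_queue.get()) is not None:
            time.sleep(0.01)                 # the slower 'execution unit'
            print("processed", item)

    threading.Thread(target=producer).start()
    consumer()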

OLAP and OLTP processing for the architecture

For both architectures we have Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP) as integral components. OLTP is immediately visible; it is what is absolutely needed for the functional working of the system. Interestingly, there is a lot of OLAP going on behind the scenes as well. Microprocessors capture the characteristics of the instruction and data streams through simple or fairly complex analytics to make real-time decisions for instruction prefetching, data prefetching, branch prediction, branch target address generation, memory address speculation, and sometimes value prediction. The end result is better performance and throughput, thanks to highly accurate prediction.

The OLAP and OLTP parts of the system should be independently designed for maximum efficiency and modularization. The components have different requirements, yet they work together to deliver increased performance.

Architecture Evolution

Both micro-architectures and macro-architectures started off as monolithic systems. This is a good starting point which provides a consistent interface to the system. But as the demands on the system grow, the monolith must be broken up to enable each component of the system to evolve independently. Profiling to identify bottlenecks is the main driver for this evolution: as long as there is an identifiable bottleneck, there is a potential way to alleviate it, mitigate it, or, preferably, eliminate it. The micro-architecture world saw this evolution in the move from in-order execution to out-of-order execution, which provided an elegant way of neutralizing resource bottlenecks. Other innovations boosted performance by hiding data-access latency through caching and efficient organization of the memory/data storage hierarchy. Out-of-order execution with in-order commit essentially provides the illusion of sequential execution of the program, but at a much superior performance.

The web product system also started off as a monolith, but it had to evolve over time for increased throughput. Next in the path of evolution was the Service Oriented Architecture (SOA). This can be seen as a move to grab the low-hanging fruit of resource dependencies. There is even the concept of the Enterprise Service Bus (ESB), which is reminiscent of the shared CPU bus (data/address/control bus); processor architects know how quickly a bus turns into a formidable bottleneck, and there are more elegant solutions which reduce bus contention. Next in this path of evolution is the Micro-Service Architecture (MSA). However, data-access latency will continue to remain a challenging bottleneck, and inter-component communication can be a bottleneck as well.

Docker is a very promising technology which can enable MSA going forward; however, it does need to solve some architectural issues first, mainly that of communication between different containers.

Different components of a Macro-architecture

Fortunately there are a lot of high-quality open-source solutions for quickly putting together a scalable system. But just as in the case of designing a micro-architecture, a lot of tuning needs to be done to push the system to deliver maximal performance and throughput.

Here are some examples of open-source components:

  1. Load Balancers: Nginx, HAProxy
  2. Queueing: RabbitMQ, ActiveMQ, ZeroMQ, Kafka, Kestrel
  3. Caching: Redis, Memcached, Membase
  4. Full-page Caching: Varnish, Squid
  5. DataStores: RDBMS (MySQL, MariaDB, PostgreSQL), Document-based (MongoDB, CouchDB)
  6. Analytics systems: Column-Oriented (Cassandra, HBase), Graph (Neo4j, HyperGraphDB)
  7. Time-Series Databases: InfluxDB, OpenTSDB
  8. Batch Processing: Hadoop, Spark
  9. Stream Processing: Apache Storm, Apache Spark
  10. Monitoring/Alerting: Nagios, Graphite, Grafana, Ganglia, Icinga, Zabbix
  11. Search/Indexing: Sphinx, Solr, ElasticSearch
  12. Log Analysis: ELK stack (ElasticSearch, LogStash, Kibana)
  13. Messaging: Ejabberd, Openfire
  14. Service Discovery: Zookeeper, Eureka
  15. Networked Storage: Ceph, GlusterFS
Most of the components are independently scalable.

Conclusions and Recommendations

As pointed out in the article, there is a lot of common ground between the two architectures. Each area can learn and apply architectural patterns from the other to deliver performance and scalability. The web macro-architecture is inherently concurrent in nature, and concurrent execution is the forte of hardware designers, so I think there can be a lot of healthy symbiosis between the two communities. The good thing about architecture is that it critically examines the bottlenecks of a system and quantitatively characterizes what can be scaled, and by how much; it studies the limits of performance. A better design pattern leads to a clean visualization of the system, and that clarity enhances innovation.

Monday 4 May 2015

Derivation of Strassen's Algorithm for the multiplication of $2 \times 2$ matrices.


Abstract

Strassen's algorithm for the multiplication of $2 \times 2$ matrices using seven multiplications instead of the naive eight heralded a new era of research into asymptotically more efficient algorithms. It showed that the multiplication of two $n \times n$ matrices could be achieved in less than $O(n^3)$ time. In this blog post, we attempt to present a derivation of this seminal work which requires only a modest background in linear algebra.


Introduction

The $2 \times 2$ matrix multiplication $A.B = C$ is usually written as: \[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} \] The classic "definition" of matrix multiplication says that \[ C = \begin{bmatrix} a_{11}.b_{11} + a_{12}.b_{21} & a_{11}.b_{12} + a_{12}.b_{22} \\ a_{21}.b_{11} + a_{22}.b_{21} & a_{21}.b_{12} + a_{22}.b_{22} \end{bmatrix} \] As can be seen, an upper bound on the number of multiplications is quickly identified to be 8.

In vector notation, we have row vectors for $A$ and column vectors for $B$: \begin{array}{lcl} A_1 & = & a_{11}e_1 + a_{12}e_2 \\ A_2 & = & a_{21}e_1 + a_{22}e_2 \end{array} and \begin{array}{lcl} B_1 & = & b_{11}e_1 + b_{21}e_2 \\ B_2 & = & b_{12}e_1 + b_{22}e_2 \end{array} Then \[ C = \begin{bmatrix} A_1.B_1 & A_1.B_2 \\ A_2.B_1 & A_2.B_2 \end{bmatrix} \]

Strassen's Algorithm

Strassen's algorithm provided the first lowering of this upper bound for the multiplication of $2 \times 2$ matrices. Replicating the construction from the Wikipedia page for easy reference: \begin{array}{lcl} M_1 & = & (a_{11} + a_{22})(b_{11} + b_{22}) \\ M_2 & = & (a_{21} + a_{22})b_{11} \\ M_3 & = & a_{11}(b_{12} - b_{22}) \\ M_4 & = & a_{22}(b_{21} - b_{11}) \\ M_5 & = & (a_{11} + a_{12})b_{22} \\ M_6 & = & (a_{21} - a_{11})(b_{11} + b_{12}) \\ M_7 & = & (a_{12} - a_{22})(b_{21} + b_{22}) \end{array} \begin{array}{lcl} c_{11} & = & M_1 + M_4 - M_5 + M_7 \\ c_{12} & = & M_3 + M_5 \\ c_{21} & = & M_2 + M_4 \\ c_{22} & = & M_1 - M_2 + M_3 + M_6 \end{array}
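Before attempting a derivation, the formulas are easy to check numerically. Here is a short Python sketch (not part of the derivation itself) that compares the seven-product construction against the textbook definition on random integer matrices:

    import random

    def strassen_2x2(a, b):
        (a11, a12), (a21, a22) = a
        (b11, b12), (b21, b22) = b
        m1 = (a11 + a22) * (b11 + b22)
        m2 = (a21 + a22) * b11
        m3 = a11 * (b12 - b22)
        m4 = a22 * (b21 - b11)
        m5 = (a11 + a12) * b22
        m6 = (a21 - a11) * (b11 + b12)
        m7 = (a12 - a22) * (b21 + b22)
        return [[m1 + m4 - m5 + m7, m3 + m5],
                [m2 + m4,           m1 - m2 + m3 + m6]]

    def naive_2x2(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    for _ in range(1000):
        a = [[random.randint(-9, 9) for _ in range(2)] for _ in range(2)]
        b = [[random.randint(-9, 9) for _ in range(2)] for _ in range(2)]
        assert strassen_2x2(a, b) == naive_2x2(a, b)
    print("Strassen's seven products agree with the definition.")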

Main ideas behind the derivation

We now outline the main ideas behind the derivation.

  1. The dot-product in vector notation remains invariant under a simultaneous rotation of the indices.
  2. Keeping the dot-product constant for $c_{i,j}$, we explore various transformations of the vectors $A_i$ and $B_j$.
  3. We do not restrict the vector transformations to 2 dimensions; we may transform 2-D vectors to 4-D ones, as long as the dot-product invariance is maintained.
  4. When transforming to higher-dimensional vectors, we try to reuse as much computation as possible from the same-dimension transforms. This is suspiciously similar to embedding in Group Theory.
  5. We search for symmetries in the computation of the various dot-products.
WLOG, we transform $c_{21} = A_2.B_1 = \hat{A_2}.\widehat{B_1}$, and then use a simultaneous rotation of the indices to get $c_{12}$. Similarly, we transform $c_{11} = A_1.B_1 = \tilde{A_1}.\widetilde{B_1}$, and then use a simultaneous rotation of the indices to get $c_{22}$. We shall see that we identify one symmetric computation in the computation set.

Transformations and Computations

We now propose the transformations and computations which lead to Strassen's algorithm. For transformation (1), we will use the following transformer $2 \times 2$ matrices, such that: \begin{array}{lcl} L_1 &=& \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} &and& L_1^{-1} &=& \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix} &and& L_1.L_1^{-1} &=& I \end{array} The matrix $L_1$ is easily recognized as the first of the Pascal matrices.

Transformation is as follows: \[ c_{21} = \hat{A_2}.\widehat{B_1} = \begin{bmatrix} a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} b_{11} \\ b_{21} \end{bmatrix} \] \begin{array}{rcccr} c_{21} & = & \begin{bmatrix} (a_{21} + a_{22}) & a_{22} \end{bmatrix} &.& \begin{bmatrix} b_{11} & (-b_{11} + b_{21}) \end{bmatrix}^T \end{array}

Calculating $c_{21}$ and using simultaneous rotation of both indices for $c_{21}$, we arrive at: \begin{array}{lcl} c_{21} &=& \begin{bmatrix} (a_{21} + a_{22}) b_{11} + a_{22}(-b_{11} + b_{21}) \end{bmatrix} \\ c_{12} &=& \begin{bmatrix} (a_{12} + a_{11}) b_{22} + a_{11}(-b_{22} + b_{12}) \end{bmatrix} \end{array}
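As a sanity check, expanding the first line shows that the transformed dot-product still equals the defining expression, while using only two multiplications: \[ c_{21} = (a_{21} + a_{22})b_{11} + a_{22}(-b_{11} + b_{21}) = a_{21}b_{11} + a_{22}b_{11} - a_{22}b_{11} + a_{22}b_{21} = a_{21}b_{11} + a_{22}b_{21} \]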

For transformation (2), we will use the following transformer $4 \times 4$ matrices, such that: \begin{array}{rcr} L_2 &=& \begin{bmatrix} L_1 & 0 \\ I & L_1^{-1} \end{bmatrix} &and& L_2^{-1} & = & \begin{bmatrix} L_1^{-1} & 0 \\ -I & L_1 \end{bmatrix} &and& L_2.L_2^{-1} &=& \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} &=& I_{4 \times 4} \end{array} A point to note here is that $L_2$ is not a Pascal's matrix, but a matrix of higher order derived out of it.

Transformation is as follows: \[ c_{11} = \tilde{A_1}.\widetilde{B_1} = \begin{bmatrix} 0 & a_{11} & a_{12} & x \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & -1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ -1 & 0 & 1 & 0 \\ 0 & -1 & 1 & 1 \end{bmatrix} \begin{bmatrix} y \\ b_{11} \\ b_{21} \\ 0 \end{bmatrix} \] Therefore, \begin{array}{rcccr} c_{11} & = & \begin{bmatrix} (a_{11} + a_{12}) & (a_{11} + x) & (a_{12} - x) & x \end{bmatrix} &.& \begin{bmatrix} y & (-y + b_{11}) & (-y + b_{21}) & (-b_{11} + b_{21})\end{bmatrix}^T \end{array}

Substituting $x = a_{22}$ and $y = -b_{22}$, we get: \begin{array}{rcccr} c_{11} & = & \begin{bmatrix} (a_{11} + a_{12}) & (a_{11} + a_{22}) & (a_{12} - a_{22}) & a_{22} \end{bmatrix} &.& \begin{bmatrix} -b_{22} & (b_{22} + b_{11}) & (b_{22} + b_{21}) & (-b_{11} + b_{21})\end{bmatrix}^T \end{array}

Calculating $c_{11}$ and using simultaneous rotation of both indices for $c_{11}$, we arrive at: \begin{array}{lcl} c_{11} &=& \begin{bmatrix} -(a_{11} + a_{12}) b_{22} + (a_{11} + a_{22})(b_{22} + b_{11}) + (a_{12} - a_{22}) (b_{22} + b_{21}) + a_{22}(-b_{11} + b_{21}) \end{bmatrix} \\ c_{22} &=& \begin{bmatrix} -(a_{22} + a_{21}) b_{11} + (a_{22} + a_{11})(b_{11} + b_{22}) + (a_{21} - a_{11}) (b_{11} + b_{12}) + a_{11}(-b_{22} + b_{12}) \end{bmatrix} \end{array}
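As a sanity check, expanding the expression for $c_{11}$ and cancelling terms confirms that it reduces to the defining formula $a_{11}b_{11} + a_{12}b_{21}$, at the cost of four of the seven products: \begin{array}{lcl} c_{11} &=& (-a_{11}b_{22} - a_{12}b_{22}) + (a_{11}b_{22} + a_{11}b_{11} + a_{22}b_{22} + a_{22}b_{11}) \\ && +\, (a_{12}b_{22} + a_{12}b_{21} - a_{22}b_{22} - a_{22}b_{21}) + (-a_{22}b_{11} + a_{22}b_{21}) \\ &=& a_{11}b_{11} + a_{12}b_{21} \end{array}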

Observations

We observe that all the computations of $c_{i,i+1}$ are reused in the computations of $c_{i,i}$, and that there was one symmetric computation. This reduced the number of multiplications from 8 to 7. The order of the transformations does not matter either, i.e. transforming for $c_{i,i}$ before $c_{i,i+1}$ would have yielded the same reduction in computation; in that case, an appropriate choice of the free variables $x$ and $y$ needs to be made. Furthermore, the placement of the free variables also does not matter for the second-order transform, as long as the dot-product constraints are satisfied. From this exercise, we can safely say that the computation set of {$c_{12}$, $c_{21}$} is embedded in the computation set of {$c_{11}$, $c_{22}$}, and the size of the second computation set is 7. If this is the smallest such set, then this number must be optimal too.

Future Directions for budding researchers

  1. A natural question to ask is, can we extend these ideas to $3 \times 3$ and higher-order matrix multiplication?
  2. Can we "translate" the derivation here to the language of Group Theory?
  3. If we do not find an "optimal embedding" for higher-order matrix multiplication, can we find an approximate embedding?
  4. Can invariances, symmetries, embeddings solve one of the most exciting problems in computer science?

Conclusions

We have systematically derived Strassen's formula for computing the $2 \times 2$ matrix product using 7 multiplications. We modeled matrix multiplication as the computation of vector dot-products (a well studied area). We explored transformations of the vectors which kept the dot-product constant. We discovered degrees of freedom when transforming the vectors to a higher dimension under the dot-product constraints. We utilized these transformations and degrees of freedom to hunt for symmetries, which reduced the computation from 8 multiplications to 7.

The derivation presented in this work was deduced by analyzing and working backwards from Strassen's formula. This is interesting in and of itself, and worth presenting in textbooks and problem-solving workshops alongside discussions of the seminal original work. The insights gained in the derivation can be applied in future work to study potential applications to higher-dimensional matrix multiplication. The author feels that the generalization of this derivation to higher dimensions is non-trivial and is a worthy research challenge. However, the author is optimistic that a generalization, even to a dimension only slightly larger than $2 \times 2$, may offer new insights which might lead us closer to the holy grail of matrix multiplication research, i.e. a step towards $\tilde{O}(n^2)$ asymptotic complexity.

Disclaimer

I do not claim to be an expert in the area of matrix multiplication research. However, I love to solve challenging problems and find this area intellectually appealing. Readers' feedback is welcome and will be deeply appreciated. I apologize in advance for any factual errors in my presentation, and will do my due diligence to correct them once they are pointed out to me.