Notes while reading DDIA by Martin Kleppmann
Preface
An application is called data-intensive if data is its primary challenge (quantity, complexity or speed). This is the opposite of compute-intensive, where CPU cycles are the bottleneck.
Reliability: Tolerating hardware faults, software faults and human error.
Scalability: Measuring load & performance. (Latency percentiles, throughput)
Maintainability: Operability, simplicity & evolvability.
Chapter 1
Fault: One component of the system deviating from its spec.
Failure: The system as a whole stops providing the required service to the user.
It is impossible to reduce the probability of a fault to zero, which is why we must design fault-tolerance mechanisms that prevent faults from causing failures.
In such fault-tolerant systems, it can make sense to trigger faults deliberately, so that the fault-tolerance machinery is exercised continually. (Netflix Chaos Monkey)
Types of Fault: Hardware, Software, Human
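A minimal sketch of the fault vs. failure distinction, in Python. It assumes a hypothetical component (FlakyService) that faults on a fixed schedule, standing in for deliberately injected faults, and a retry wrapper (call_with_retries) as one possible fault-tolerance mechanism. Both names are made up for illustration, and this is not how Chaos Monkey itself works.

```python
class FlakyService:
    """Hypothetical component that faults on chosen calls (injected faults)."""

    def __init__(self, fail_on_calls):
        self.fail_on_calls = set(fail_on_calls)  # 1-based call numbers that fault
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            # A fault: this one component deviates from its spec.
            raise ConnectionError(f"injected fault on call {self.calls}")
        return "ok"


def call_with_retries(operation, max_attempts=3):
    """Retry on ConnectionError so a transient fault does not become a failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
    # Only when every attempt faults does the caller see a failure.
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error


service = FlakyService(fail_on_calls=[1, 2])
result = call_with_retries(service.fetch)  # two faults are absorbed, third call succeeds
```

Here the first two calls fault, but the caller still gets a result, so no failure occurs. If all three attempts fault, the fault turns into a failure that the caller sees.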
Describing Load (Load Parameters)
Example: Twitter
Post tweet: A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak)
Home timeline: A user can view tweets posted by the people they follow (300k requests/sec)
Handling 12k writes per second would be fairly easy. Twitter's scaling challenge is instead fan-out: each user follows many people, and each user is followed by many people.
We can implement this in two ways:
(1): Write a SQL query which just inserts the new tweet into a global collection. When a user requests their home timeline, look up all the people they follow, find all tweets for each of those users, and merge them (sorted by time).
(2): Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient user. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. This makes the request to read the home timeline cheap, because its result has been computed ahead of time.
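The two approaches can be sketched in Python with in-memory dictionaries standing in for the database and the cache. This is a rough illustration, not Twitter's actual implementation. The class names (ReadTimeMerge, WriteTimeFanOut), the method names and the tuple layout (timestamp, author, text) are all assumptions made for the example.

```python
import bisect
from collections import defaultdict


class ReadTimeMerge:
    """Approach (1): cheap write, work is done when the timeline is read."""

    def __init__(self):
        self.follows = defaultdict(set)  # follower -> people they follow
        self.tweets = []                 # global collection of (timestamp, author, text)

    def follow(self, follower, followee):
        self.follows[follower].add(followee)

    def post_tweet(self, author, text, timestamp):
        self.tweets.append((timestamp, author, text))  # a single insert

    def home_timeline(self, user):
        followees = self.follows.get(user, set())
        matching = [t for t in self.tweets if t[1] in followees]
        return sorted(matching, reverse=True)  # merge by time, newest first


class WriteTimeFanOut:
    """Approach (2): work is done at post time, reading is a cache lookup."""

    def __init__(self):
        self.followers = defaultdict(set)   # followee -> their followers
        self.timelines = defaultdict(list)  # user -> cached timeline, oldest first

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def post_tweet(self, author, text, timestamp):
        tweet = (timestamp, author, text)
        for follower in self.followers.get(author, set()):
            # One insert per follower; insort keeps each mailbox ordered by time.
            bisect.insort(self.timelines[follower], tweet)

    def home_timeline(self, user):
        return list(reversed(self.timelines.get(user, [])))  # newest first
```

Both classes return the same timeline for the same sequence of follows and posts. They differ only in where the cost lands. With the load figures above, timeline reads (300k/sec) far outnumber posts (4.6k/sec), which is why doing more work at write time can pay off.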
Describing Performance
(coming soon, still reading)