Modern applications operating at massive scale need to continually ingest, process, and store incoming data. Traditional application frameworks aren’t always up to the task of handling this much data.
Fortunately, we can use tools specifically designed to handle large volumes of streaming data. Apache Kafka and Amazon Kinesis are two of the most popular options. Let’s compare the two solutions to help you choose which one is right for your needs.
Many organizations are required to comply with laws and regulations regarding security, privacy, and data residency, such as GDPR, PCI DSS, HIPAA, and CCPA. As a result, it’s important to know where — and how — data is stored.
Self-managed Apache Kafka clusters store data inside your own servers. This can be helpful — and sometimes necessary — for meeting compliance requirements. Complete control over your servers allows your organization to know exactly where and how its data is stored.
Amazon Kinesis, in contrast, covers data privacy and security under a shared responsibility model. The data remains in Amazon data centers, and AWS provides access controls to prevent unauthorized access. Strict compliance requirements don’t rule out cloud services like Kinesis, but you should verify that Kinesis can meet your compliance obligations before beginning a project. Check Amazon’s documentation for details on Kinesis compliance.
Licensing, Infrastructure, and Costs
Both Amazon Kinesis and Apache Kafka start with no payment in advance, but their infrastructure setup and costs differ significantly.
Apache Kafka is open-source and freely distributed. It’s available to anyone who can download and configure it on their infrastructure. Though you can use Kafka for free, your costs will include the infrastructure needed to keep your cluster up and running 24/7.
Plus, you need to factor in the costs of setup and management. Kafka requires complex infrastructure setup, so you will need either a team with prior experience deploying highly available software or a managed service like Confluent or Amazon Managed Streaming for Apache Kafka (MSK). Also be aware that Kafka’s open-source license comes with no warranty or vendor liability, so if you decide to host it yourself and run into issues, you’re on your own.
On the other hand, Amazon offers the proprietary Amazon Kinesis service as part of its Amazon Web Services (AWS) platform. Kinesis runs on a pay-as-you-go model, so you only pay for the data you send through it.
AWS is responsible for the service’s performance and availability. You can claim refunds based on your service-level agreements (SLAs) if your service doesn’t provide the agreed-upon performance.
Kafka maintains messages within topics. Kafka’s architecture comprises topics, producers, consumers, and brokers (the cluster nodes that receive and process messages sent to the Kafka cluster). When a producer publishes a message, Kafka delivers it to one of the topic’s partitions. This topic partitioning helps increase the cluster’s throughput and enables parallel message processing. Each message within a partition is assigned a sequential ID called an offset, and consumers can begin reading from any offset. Kafka also allows you to replay the history of all messages in the order they were produced.
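To make this concrete, here is a minimal sketch of how keyed messages map to partitions. Kafka’s default partitioner hashes the message key (with murmur2) and takes it modulo the partition count; the sketch below substitutes Python’s `zlib.crc32` for murmur2, so the partition numbers won’t match a real cluster, but the key property — same key, same partition — still holds.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Return the partition a keyed message lands on (simplified).

    Kafka's default partitioner uses murmur2; crc32 stands in here
    purely for illustration.
    """
    return zlib.crc32(key) % num_partitions

# Messages with the same key always land on the same partition,
# which preserves per-key ordering within that partition.
assert choose_partition(b"user-42", 6) == choose_partition(b"user-42", 6)
```

Because ordering is only guaranteed within a partition, choosing a good message key (for example, a user or device ID) is what preserves per-entity ordering as you scale out.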
Kinesis works in much the same way, though it uses different terminology. A Kinesis Data Stream (KDS) is equivalent to a Kafka topic, and a Kinesis shard is like a Kafka partition. Streams are made up of one or more shards, which can process messages in parallel, and you can scale up a stream by adding more shards to it. Kinesis assigns each record a sequence number, similar to the offsets in Kafka partitions.
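As a sketch of how Kinesis routes records: the service takes the MD5 hash of a record’s partition key as a 128-bit integer and delivers the record to the shard whose hash key range contains it. The example below assumes the hash key space is split evenly across shards, which is the default when a stream is created.

```python
import hashlib

def route_to_shard(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index (simplified sketch).

    Assumes the 128-bit MD5 hash key space is divided evenly across
    shards, as it is for a freshly created stream.
    """
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    shard_size = (2 ** 128) // num_shards
    return min(hash_key // shard_size, num_shards - 1)

# Like Kafka's keyed partitioning, the same partition key always
# routes to the same shard, preserving per-key ordering.
assert route_to_shard("user-42", 4) == route_to_shard("user-42", 4)
```

After resharding (splitting or merging shards), the real hash key ranges are no longer uniform, so this sketch only models the initial layout.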
Kinesis keeps data for up to 24 hours by default, but you can extend retention to 7 days by enabling extended data retention, or to as long as 365 days with long-term data retention. In Kafka, you can configure whatever retention period your available storage allows; the default retention period is 168 hours (seven days).
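As an illustration, Kafka retention is set per topic through the `retention.ms` configuration (optionally alongside `retention.bytes`). The snippet below sketches a hypothetical 30-day policy; the values are examples, not recommendations.

```python
# Sketch: a per-topic retention policy expressed as Kafka topic config.
# retention.ms and retention.bytes are real topic-level settings; the
# 30-day figure is only an example.
DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in a day

topic_config = {
    "retention.ms": str(30 * DAY_MS),  # keep messages for 30 days
    "retention.bytes": "-1",           # no size-based limit
}

print(topic_config["retention.ms"])  # "2592000000"
```

You would pass a config like this when creating the topic (for example, via an admin client or the `kafka-topics` CLI); whichever limit is hit first, time or size, triggers deletion.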
Amazon Kinesis is quick to launch since it is a managed solution. You scale a stream by increasing its number of shards, with AWS handling the underlying capacity. Producers write data to the shards, and your consumers retrieve it from them.
Kafka and Kinesis both store messages opaquely and immutably in a partition or shard. When reading from a Kafka topic or Kinesis stream, consumers pass an offset or a sequence number to fetch records from a specific position.
Performance and Reliability
Both data streaming solutions offer competitive performance. Amazon uses Kinesis to power live video streaming services and to manage events from Internet of Things (IoT) devices. Kinesis can handle up to 5 million records per second if you provision enough shards in the cluster.
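To make the shard math concrete, here is a rough capacity estimate based on the documented per-shard write limits of 1 MB per second and 1,000 records per second (provisioned mode). The workload figures below are hypothetical.

```python
import math

def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Estimate the shard count for a Kinesis write workload.

    Each provisioned shard accepts up to 1,000 records/s and 1 MB/s,
    so the stream needs enough shards to satisfy both limits.
    """
    by_count = records_per_sec / 1_000               # record-rate limit
    by_bytes = (records_per_sec * avg_record_kb) / 1_024  # MB/s limit
    return max(1, math.ceil(max(by_count, by_bytes)))

# Hypothetical workload: 50,000 small records/s is record-count bound.
print(shards_needed(50_000, 0.5))  # 50
```

Leaving headroom above this minimum is usually wise, since writes to a hot shard beyond its limit are throttled.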
Apache Kafka’s immutable, append-only log gives its architecture a significant performance advantage. Depending on message size and cluster configuration, throughput ranges from roughly 30,000 records per second to a few million.
Kafka and Kinesis persist data in underlying storage, either object or block storage on hard drives, to prevent data loss. You can deploy Apache Kafka in a replicated environment to provide high availability. However, this is slightly more complicated than using Amazon Kinesis, since AWS delivers high availability for your Amazon Kinesis service.
Amazon Kinesis and Apache Kafka both offer SQL support for data transformation queries. Kinesis provides SQL capabilities to applications requiring stream processing on relational data; common examples include data analytics and data warehouses. Kinesis also supports advanced SQL operations, such as defining custom streams, running continuous queries, and creating windowed queries.
The Kafka ecosystem provides KSQL, a streaming SQL dialect developed by Confluent, to support analytics on real-time data as it streams in. KSQL uses the Kafka Streams API internally and lets developers work with either a stream or a table abstraction; a table is a materialized view of a Kafka stream or another database table. KSQL has since evolved into ksqlDB, a database engine for real-time analytics on streaming data.
Both Kinesis and Kafka offer similar functions, though some programming constructs might differ.
Overall, both products have mature features and provide best-in-class security and performance. Apache Kafka supports security on your streams and allows the configuration of finer details such as the security protocol (SSL versus SASL) and certificate configuration. It also supports encrypting data at rest and while in transit to the consumers. Additionally, you can configure authorization to control who can access the stream data.
Kafka’s security features include hostname verification, which prevents man-in-the-middle attacks by ensuring that clients connect to the intended host. For self-managed clusters, you can create a custom certificate authority (CA) to secure communication within a private network.
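As a sketch of what these TLS settings look like on the client side, the dictionary below uses kafka-python’s `KafkaProducer` parameter names. The broker address and certificate paths are placeholders, and actually creating the producer would require a reachable broker.

```python
# Sketch: TLS client settings using kafka-python's KafkaProducer
# parameter names. Hostnames and file paths are placeholders.
ssl_config = {
    "bootstrap_servers": "broker.example.com:9093",   # placeholder address
    "security_protocol": "SSL",                       # encrypt in transit
    "ssl_cafile": "/etc/kafka/ca.pem",                # CA that signed the broker cert
    "ssl_certfile": "/etc/kafka/client.pem",          # client cert (mutual TLS)
    "ssl_keyfile": "/etc/kafka/client.key",           # client private key
    "ssl_check_hostname": True,                       # hostname verification (MITM defense)
}

# from kafka import KafkaProducer
# producer = KafkaProducer(**ssl_config)  # requires a live broker
```

With a custom CA, `ssl_cafile` points at your own root certificate, which is how a private enclave of brokers and clients can trust each other without a public CA.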
Amazon Kinesis offers similar security and access control features, although not as broad out of the box. You can connect third-party and partner security solutions to manage your cluster, and you can define AWS Identity and Access Management (IAM) policies to grant or revoke access for users and cloud resources.
Both products’ communities have produced diverse technical content. You can easily find getting-started guides for deploying and running your clusters, plus SDKs in your preferred language and framework. Third-party support services are also available, or you can hire an engineer to help.
Connecting Kinesis or Kafka to your data sources is straightforward, and both support most common data sources out of the box. It makes more sense to use Kinesis if your data is stored in Amazon DynamoDB or Amazon Relational Database Service (RDS). Similarly, Kafka is a natural fit for an on-premises PostgreSQL server.
Apache Kafka does not come with a support contract, though you can purchase one from Confluent and draw on community tutorials and guides to manage the service on your own infrastructure. Amazon Kinesis support is governed by your agreement with Amazon, and community-produced articles and technical manuals are plentiful as well.
Both Apache Kafka and Amazon Kinesis enable developers to process streaming data and connect microservices in a distributed environment. They do have their differences, though, and knowing when each product is more appropriate helps decide which to use.
If you’ve already deployed your infrastructure to AWS, it makes more sense to use Amazon Kinesis, because the AWS SDK integrates well with other Amazon services. Apache Kafka is more valuable than Amazon Kinesis if you run your own infrastructure and develop your solutions in Java, since Kafka is mainly Java-based and you can even embed a broker inside your Java application.
Ultimately, your choice between Amazon Kinesis and Apache Kafka depends on your application’s needs and your existing infrastructure.
If you’re interested in developing expert technical content that performs, let’s have a conversation today.