16 Mar 2016, 09:30

Around the Data - March 2016 - Kafka Connect and Streams, Hortonworks HDP 2.4, Elasticsearch cluster sizing, Spark Graph Frames, Flink 1.0

Kafka

  • Kafka 0.9.0.1 and Confluent 2.0.1 were released (bugfix releases)
  • A more detailed presentation about Kafka Connect.
  • Confluent released a custom version of what will be in Kafka 0.10, called Kafka Streams; it aims to manipulate data within Kafka without requiring any external system such as Spark & co. It is an interesting move to make Kafka richer in features and more autonomous for enriching data, but the challenge may be that Kafka was performant precisely because of its small feature set. Will it remain performant with such additions?

Hortonworks (Hadoop Distribution)

Elasticsearch

Spark

  • Databricks introduces Graph Frames; it's built on top of Spark Data Frames (and related APIs) and aims to ease manipulating graph data.

Flink

10 Feb 2016, 09:30

Around the Data - February 2016 - ELK, Kafka, Flink, Spark

Elasticsearch

Kafka

Flink

Spark

  • Spark 2015 year in review: Databricks (core developer of Spark) reviewed 2015: the 4 releases, the features, how Spark is used, etc. I was surprised to see that the majority of Spark usage was as a standalone cluster and not in a Hadoop context.

09 Dec 2015, 09:30

Around the Data - December 2015
  • HDFS is the filesystem of choice in Hadoop, with the speed and economics ideal for building an active archive.
  • For online data serving applications, such as ad bidding platforms, HBase will continue to be ideal with its fast ability to handle updating data.
  • Kudu will handle the use cases that require a simultaneous combination of sequential and random reads and writes – such as for real-time fraud detection, online reporting of market data, or location-based targeting of loyalty offers.
  • Kudu is to be released as an Apache project and Impala should become an Apache project too.
  • Kafka 0.9 is released :
    • Better security: SSL certificates, Kerberos, wire encryption, improved permissions
    • "Kafka Connect" to ease pushing/pulling data from/to Kafka. Kafka will include a file connector; Confluent Platform will have database and Hadoop connectors.
    • User-defined quotas to throttle connections and bandwidth
    • New consumer
    • Confluent, the core contributor of Kafka, releases its distribution, Confluent Platform 2.0, with all the features above plus the Schema Registry, which allows at least versioning of your message schemas (and compatibility checks, from what I understood). This platform is open source too, with paid support if needed.
    • How to Build a Scalable ETL Pipeline with Kafka Connect: a sample that uses Kafka Connect and Schema Registry to pull data from MySQL to HDFS/Hive via Kafka.
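
To make the schema versioning and compatibility idea mentioned above a bit more concrete, here is a minimal sketch in Python. It is not the Schema Registry's implementation or API: the function name, the dictionary layout and the sample schemas are all hypothetical, and the rule is a simplified, Avro-inspired one (a new schema can read old data if every field it adds carries a default and shared fields keep the same type). It ignores type promotion, unions and nested records.

```python
def is_backward_compatible(old_fields, new_fields):
    """Return True if a reader using new_fields can decode data written with old_fields.

    Both arguments map a field name to a dict with a 'type' key and an
    optional 'default' key. This layout is an assumption made for the sketch.
    """
    for name, spec in new_fields.items():
        if name in old_fields:
            # A field present in both versions must keep the same type.
            if old_fields[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            # A field added without a default cannot be filled in for old records.
            return False
    # Fields dropped by the new schema are simply ignored by the reader.
    return True


# Hypothetical schema versions for a "user" message.
v1 = {"id": {"type": "long"}, "email": {"type": "string"}}
v2 = {**v1, "country": {"type": "string", "default": "unknown"}}
v3 = {**v1, "country": {"type": "string"}}

print(is_backward_compatible(v1, v2))  # added field has a default
print(is_backward_compatible(v1, v3))  # added field has no default
```

In this sketch, v2 would be accepted as a new version of v1 while v3 would be rejected, which is the kind of check that lets consumers upgrade without breaking on older messages.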

     

     

    11 Nov 2015, 09:30

    Around the Data - November 2015 - Diving in Kafka and ELK 2.0 releases

    Kafka

    • A two-part blog post series (part 1, part 2) on the genesis of Kafka within LinkedIn before it was open-sourced and hosted by the Apache Foundation; it is always interesting to know the first use cases the software was built for and how it became the distributed messaging system we now know. It also lists current use cases for Kafka.
    • Putting Apache Kafka to use: a practical guide to building a streaming data platform (part 1; part 2): the first part is about the shift to the event-based approach and the definition of a streaming data platform. The second part is about implementation best practices.
    • A Kafka presentation from the MixIt event (in French), which introduces Kafka and how it is used at EDF for the "Linky" device (energy monitor)

    ElasticSearch / Logstash / Kibana