Hadoop 3.0 Security by Ben and Joey

Hadoop 3.0 Security: Protecting Your Big Data Platform, by Ben Spivey and Joey Echeverria, is an excellent, practical, well-written book that describes Apache Hadoop and the numerous security features within it that can be implemented. The book starts with a basic history of how and why Hadoop was developed and then breaks down how Apache Hadoop can be secured in three sections: 1) Security Architecture, 2) Authentication, Authorization & Accounting (AAA) and 3) Data Security. A fourth section, entitled Putting It All Together, summarizes the first three. The best thing about this book is that it does a very thorough job of not only explaining the functionality of the very complex Apache Hadoop system and all its components, but also explaining how to configure the built-in security features within these components. Actual code segments that provide more detail on enhancing the security of these components make this book an excellent reference guide for any security professional. I would highly recommend adding this book to your IT security collection if you are facing the daunting task of securing an Apache Hadoop system.

 

Hadoop 3.0 Security

  1. Author : Ben Spivey & Joey Echeverria
  2. Paperback: 364 pages
  3. Publisher: Shroff Publishers & Distributors Private Limited – Mumbai (2015)
  4. Language: English
  5. ISBN-10: 9352131428
  6. ISBN-13: 978-9352131426
  7. Product Dimensions: 17.8 x 22.9 cm

Hadoop 3.0 Security by Ben and Joey : Table of contents

  1. Introduction : this chapter gives a security overview
  2. Part I : Security Architecture
    1. Securing Distributed Systems
    2. System Architecture
    3. Kerberos
  3. Part II : Authentication, Authorization and Accounting
    1. Identity and Authentication
    2. Authorization
    3. Apache Sentry (Incubating)
    4. Accounting
  4. Part III : Data Security
    1. Data Protection
    2. Secure Data Ingestion
    3. Data Extraction & Client Access Security
    4. Cloudera Hue
  5. Part IV : Putting it all together
    1. Case Studies

Hadoop 3.0 Security by Ben and Joey : Why to Read

As more corporations turn to Hadoop to store and process their most valuable data, the risk of a potential breach of those systems increases exponentially. This practical book not only shows Hadoop 3.0 administrators and security architects how to protect Hadoop data from unauthorized access, it also shows how to limit the ability of an attacker to corrupt or modify data in the event of a security breach. Authors Ben Spivey and Joey Echeverria provide in-depth information about the security features available in Hadoop, and organize them according to common computer security concepts. You’ll also get real-world examples that demonstrate how you can apply these concepts to your use cases.

  1. Understand the challenges of securing distributed systems, particularly Hadoop
  2. Use best practices for preparing Hadoop cluster hardware as securely as possible
  3. Get an overview of the Kerberos network authentication protocol
  4. Delve into authorization and accounting principles as they apply to Hadoop
  5. Learn how to use mechanisms to protect data in a Hadoop cluster, both in transit and at rest
  6. Integrate Hadoop data ingest into enterprise-wide security architecture
  7. Ensure that security architecture reaches all the way to end-user access

About the Author

Ben is currently a Solutions Architect at Cloudera. During his time with Cloudera, he has worked in a consulting capacity to assist customers with their Hadoop deployments. Ben has worked with many Fortune 500 companies across multiple industries, including financial services, retail, and health care. His primary expertise is the planning, installation, configuration, and securing of customers’ Hadoop clusters. In addition to consulting responsibilities, Ben contributes a vast amount of technical writing on customer document deliverables, to include Hadoop best practices, security integration, and cluster administration.

Part by Part Book Comprehension

The book starts with a very brief intro to security concepts, just a few pages, and then gets straight into things, which for a reader is always a good indication that there won’t be much padding in this book on Hadoop security.

The book first starts off with a section on security architecture, beginning with a basic look at threat modelling for distributed systems. This is a nice touch, as threat modelling really should be part of any security architecture discussion, and even touching on it at a high level is great, as it puts the whole book in context.

The next chapter, “System Architecture”, moves on to general security architectures in a Hadoop 3.0 environment, covering network-level segregation, OS-level security and an overview of the different types of Hadoop node roles. This was a great start to the book, as it immediately starts working through the different nodes, which user roles need access to them, which nodes can be segregated from direct access and how, at a high level, they interact for data loads and job submission.

The final chapter of the architecture section, “Kerberos”, finishes up with an overview of Kerberos. This initially seemed a bit strange, but it becomes obvious why later on, as Kerberos plays such a key role in Hadoop security. If you need to get up to speed quickly on Kerberos, I’d highly recommend Kerberos: A Network Authentication System… it’s a quick and easy read that I read over ten years ago and it’s still as good now as it was then.

The next section, “Part II”, dives deeper into authentication, and at this point the book gets straight into the hands-on configuration guide, covering the detailed configuration steps required to map Kerberos principals into the Hadoop world, how to map them to local users, how user groups work in Hadoop and how to map to LDAP groups. The chapter then moves on to cover the various authentication protocols in use across the Hadoop ecosystem, before explaining the differences between simple and Kerberos authn and then taking a nice dive into token auth, including the flows of how delegation tokens are created to allow various systems to impersonate users. The chapter finishes off with a fully worked Kerberos authn configuration guide, which to be fair I skimmed over as I don’t need that level of detail at the moment.
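To give a feel for what mapping a Kerberos principal to a local user involves, here is a minimal sketch of my own in Python. It is not code from the book and not Hadoop's implementation: it only imitates the behaviour of the default mapping rule as I understand it (take the first component of the principal when the realm matches the default realm). The function names, realm and host names are all hypothetical.

```python
from typing import List, Optional, Tuple


def parse_principal(principal: str) -> Tuple[List[str], str]:
    """Split 'primary[/instance]@REALM' into (components, realm)."""
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"not a full principal name: {principal!r}")
    return name.split("/"), realm


def default_short_name(principal: str, default_realm: str) -> Optional[str]:
    """Imitate a default-style mapping rule (simplified assumption).

    Returns the first component of the principal when the realm matches
    default_realm, and None when no mapping applies.
    """
    components, realm = parse_principal(principal)
    if realm != default_realm:
        return None  # a real deployment would need an explicit rule here
    return components[0]


if __name__ == "__main__":
    # EXAMPLE.COM and nn01.example.com are made-up values for illustration.
    for p in ("alice@EXAMPLE.COM",
              "hdfs/nn01.example.com@EXAMPLE.COM",
              "bob@OTHER.COM"):
        print(p, "->", default_short_name(p, "EXAMPLE.COM"))
```

Real clusters usually layer regular-expression based rules on top of this default, which is exactly the kind of detail the book's configuration steps walk through.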

The next chapter, “Authorization”, moves on to authz, covering HDFS ACLs and extended ACLs and various service-level authorisations before moving on to MapReduce (1 and 2), YARN, ZooKeeper ACLs, HBase and Oozie. There are a few nice worked examples here of the effects of authorization restrictions and what errors users will see when their access is restricted.
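As an illustration of how an extended ACL check works, the following is a rough Python sketch of the POSIX-style evaluation order that HDFS ACLs follow: owner first, then named users, then groups (both filtered by the mask), then other. It is my own simplification, not code from the book; it ignores the superuser and default ACLs, and the `Acl` class, `is_allowed` function and all user and group names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

READ, WRITE, EXECUTE = 4, 2, 1


@dataclass
class Acl:
    owner: str
    group: str
    owner_perms: int
    group_perms: int
    other_perms: int
    named_users: Dict[str, int] = field(default_factory=dict)
    named_groups: Dict[str, int] = field(default_factory=dict)
    mask: int = READ | WRITE | EXECUTE  # filters named and group entries


def is_allowed(acl: Acl, user: str, groups: Set[str], wanted: int) -> bool:
    """Return True if `user` gets all permission bits in `wanted`."""
    # 1. The owner entry is used as-is; the mask does not apply to it.
    if user == acl.owner:
        return (acl.owner_perms & wanted) == wanted
    # 2. A named-user entry wins next, filtered by the mask.
    if user in acl.named_users:
        return (acl.named_users[user] & acl.mask & wanted) == wanted
    # 3. Any matching group entry (owning or named) may grant access.
    group_entries = []
    if acl.group in groups:
        group_entries.append(acl.group_perms)
    group_entries.extend(p for g, p in acl.named_groups.items() if g in groups)
    if group_entries:
        return any((p & acl.mask & wanted) == wanted for p in group_entries)
    # 4. Everyone else falls through to the "other" entry.
    return (acl.other_perms & wanted) == wanted


if __name__ == "__main__":
    acl = Acl(owner="alice", group="analysts", owner_perms=7, group_perms=5,
              other_perms=0, named_users={"bob": READ | WRITE}, mask=READ)
    print(is_allowed(acl, "bob", set(), READ))   # named user, read survives mask
    print(is_allowed(acl, "bob", set(), WRITE))  # write is filtered by the mask
```

The mask is the part that tends to surprise people: a named entry can list write access and still be denied it, which is the sort of effect the worked examples make visible.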

The book then moves on to cover Sentry, which is Apache’s attempt to centralise authz within the Hadoop ecosystem. After reading through the previous few chapters, it’s obvious that it’s needed! The chapter covers the basic architecture on which Sentry works and how it integrates with the various applications, and then walks through how to configure each application to use Sentry. Again, a very practically oriented approach is taken here, with a lot of detail on the configuration steps.

The last chapter in this section covers the logging available by default in each of the various applications and their basic config. This is a quick chapter and really just goes to show the configuration aspect, rather than any analysis approaches to the logs.

The third section of the book, “Part III”, moves on to data security, specifically the encryption of data in transit and at rest, starting with great coverage of how HDFS file encryption works. What was particularly good in this chapter was the strong emphasis it places on key management, and how it makes the reader conscious of the potential lack of encryption on temporary data such as logs. The second half of the chapter covers encryption of data in transit, mainly focusing on the configuration of SSL/TLS in the various applications in the ecosystem.
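For readers less familiar with what “configuring TLS” boils down to on the client side, here is a small sketch of my own using only Python's standard `ssl` module. It is not from the book and says nothing about the Hadoop services' own TLS settings, which live in their own configuration files; it only shows the two checks that matter for data in transit (certificate verification and hostname checking) plus a minimum protocol version, which is my assumption of a sensible floor. The function name is hypothetical.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses unverified servers."""
    # create_default_context() loads the system CA store and enables
    # certificate verification and hostname checking for server auth.
    ctx = ssl.create_default_context()
    # Assumption: reject anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = strict_client_context()
    print(ctx.verify_mode, ctx.check_hostname, ctx.minimum_version)
```

A context like this would be handed to whatever HTTP or socket client talks to a cluster endpoint, so that a misconfigured or impersonated server fails the handshake instead of silently receiving data.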

The next chapter is a short one and looks at the security of data as it is loaded into the Hadoop ecosystem, covering both the confidentiality and integrity of the data, but mainly focusing on confidentiality/encryption. The following chapter then covers how client access to data in the Hadoop environment can be performed securely, focusing of course on the edge nodes and how users interact with them through the command line, RPC or APIs. From an architecture perspective, I found this chapter particularly helpful, as it does a good job of describing the trust boundary that will exist in most deployments and how this should be architected securely.

The last chapter in this section covers Cloudera Hue and to be honest I just skimmed this one as it wasn’t relevant to me.

The final section of the book covers some use cases nicely, outlining scenarios with business and security requirements, before walking through how to architect and configure the right mix of controls to meet the requirements. For me I would have loved more examples here as this is more at the level I’m working at, rather than the technical configuration. But still, great to see it presented in this way.

Overall, this was a great book that, to be fair, goes into a lot more depth in terms of technical configuration settings than I needed. This can make it a tough read if you’re just looking for the high level. However, if you’re setting up a Hadoop cluster, then this should be your go-to book.

However, it also works great at the level I was looking for as it’s got a strong focus on architecture considerations and puts the security functionality into context rather than just explaining the feature sets available. You just may need to skim some of the more detailed sections like I did!