Friday, March 3, 2017

Configuring and Restarting the Hive Metastore Service on an active AWS EMR cluster

If you have ever tried to troubleshoot connecting EMR to a persistent remote metastore, you know it can be challenging. Here are the steps I've taken to test changes.

1. SSH into the master node of the cluster
2. sudo cp /usr/lib/hive/conf/hive-site.xml /usr/lib/hive/conf/hive-site.xml.old
3. sudo vi /usr/lib/hive/conf/hive-site.xml
4. Make changes as needed
5. ps -ef | grep metastore
6. kill <pid returned from previous step>
7. nohup hive --service metastore &
8. beeline -u jdbc:hive2://localhost:10000 -n hadoop
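
If you end up repeating the find, kill, and restart steps many times, they can be scripted. Below is a minimal Python sketch of that loop, not a definitive tool. It assumes the metastore process can be recognized by the word "metastore" in its command line (the same match the grep above relies on), that the current user is allowed to signal it, and that metastore.out is an acceptable log file name. The function names and the log file name are made up for this example.

```python
import os
import signal
import subprocess
from typing import List


def find_metastore_pids(ps_output: str) -> List[int]:
    """Return PIDs from `ps -ef` output whose command mentions the metastore."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or malformed row
        command = " ".join(fields[7:])
        # Assumption: "metastore" in the command line identifies the service.
        # Skip a grep process in case the text came from `ps -ef | grep`.
        if "metastore" in command.lower() and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


def restart_metastore(log_path: str = "metastore.out") -> None:
    """Stop any running metastore and start a new one in the background."""
    ps_output = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
    for pid in find_metastore_pids(ps_output):
        os.kill(pid, signal.SIGTERM)  # same default signal as `kill <pid>`
    # Roughly what `nohup hive --service metastore &` does: detach the new
    # process from this session and send its output to a file.
    with open(log_path, "ab") as log:
        subprocess.Popen(
            ["hive", "--service", "metastore"],
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )
```

Nothing runs on import. Calling restart_metastore() on the master node performs the kill and relaunch, after which you can connect with beeline as in the last step.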

Now test your changes. Once you iterate and get the correct hive-site.xml settings, you can put them in your EMR config file and try launching a fresh cluster.
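
For reference, the EMR config file is a JSON list of classification objects, and hive-site.xml settings go under the hive-site classification. Here is a minimal Python sketch that builds such a file. It assumes your metastore is a JDBC database; the three property names are standard Hive settings, but the host, user name, and password are placeholders and the helper name is made up for this example. Substitute whatever properties you actually settled on.

```python
import json
from typing import Dict, List


def hive_site_configuration(properties: Dict[str, str]) -> List[dict]:
    """Wrap hive-site.xml properties in the EMR configuration structure."""
    return [{"Classification": "hive-site", "Properties": dict(properties)}]


# Placeholder values only: replace with the settings that worked for you.
configuration = hive_site_configuration(
    {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore.example.com:3306/hive",
        "javax.jdo.option.ConnectionUserName": "hive_user",
        "javax.jdo.option.ConnectionPassword": "CHANGE_ME",
    }
)

# Print the JSON so it can be saved to a file and supplied at cluster launch.
print(json.dumps(configuration, indent=2))
```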

Tag-Based Security for EMR Clusters in a Shared AWS Environment

In a shared AWS environment with multiple developers, you often want to ensure developers have the ability to launch personal resources, and embed sensitive information into those resources, without the fear that the sensitive information would be visible to your entire group of developers. Consider the following scenario:

  1. Developers want to launch personal resources for developing code
  2. Developers need to embed personal credentials for external services like databases into the services they launch, either by supplying EC2 user data at launch or via a resource config file like the EMR config file
  3. You do not want developers to see any resources or sensitive information other than their own
  4. You want an account admin to be able to see all resources even if that admin is not an account owner

I had always read and heard that tag-based authorization could be used to achieve this, but there seemed to be a loophole: if you can add and edit tags, couldn't you just edit the tags on other users' resources to gain access to them? Up until now, I had not found a sufficient example of how to use truly secure tag-based authorization. Below is an example IAM policy that seems to meet this goal for the EMR service. Before reading the policy, it is helpful to note that in our current development environment, all developers are granted privileges to use all AWS services except IAM. Therefore, the only modifications needed were:

  1. Allow users to add a limited set of IAM roles to EC2 instances
  2. Prevent users from seeing EMR service information for clusters they did not launch

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "statement1",
            "Condition": {
                "StringEquals": {
                    "elasticmapreduce:RequestTag/owner": "${aws:username}"
                }
            },
            "Effect": "Allow",
            "Action": [
                "iam:PassRole",
                "iam:AddRoleToInstanceProfile"
            ],
            "Resource": [
                "arn:aws:iam::012345678910:role/EMR_DefaultRole",
                "arn:aws:iam::012345678910:role/EMR_EC2_DefaultRole"
            ]
        },
        {
            "Sid": "statement2",
            "Condition": {
                "StringNotEquals": {
                    "elasticmapreduce:ResourceTag/owner": "${aws:username}"
                }
            },
            "Effect": "Deny",
            "Action": [
                "elasticmapreduce:AddTags",
                "elasticmapreduce:Describe*"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

The key to tag-based authorization that was always missing from the picture in my mind was the RequestTag element combined with an IAM policy variable inside a conditional Allow statement. Statement 1 above effectively states that a developer is only allowed these elevated IAM privileges if they put a tag on their resource request where tag name = owner and tag value = their username. This means they must put their username on every EMR resource request they make or the request will fail. Statement 2 above denies the developer the ability to add tags to, or describe, an EMR resource unless the resource contains a tag where tag name = owner and tag value = their username. Note that the policy as written lists only elasticmapreduce:AddTags; to block tag removal as well, elasticmapreduce:RemoveTags would also need to appear in the Action list.
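
To make the two conditions concrete, here is a toy Python model of how they behave. This is only a sketch of the StringEquals and StringNotEquals checks, not a reimplementation of IAM policy evaluation, and the function names are made up for this example. It assumes the documented IAM behavior that a negated operator such as StringNotEquals evaluates to true when the key is absent, so an untagged cluster is treated the same as someone else's cluster.

```python
from typing import Dict


def statement1_allows(username: str, request_tags: Dict[str, str]) -> bool:
    """StringEquals on RequestTag/owner: true only if the request is tagged
    owner=<username>. A missing tag does not match."""
    return request_tags.get("owner") == username


def statement2_denies(username: str, resource_tags: Dict[str, str]) -> bool:
    """StringNotEquals on ResourceTag/owner: true if the owner tag differs
    from the username or is absent altogether."""
    return resource_tags.get("owner") != username


# alice launches a cluster tagged with her own username: Statement 1 matches.
print(statement1_allows("alice", {"owner": "alice"}))  # True

# alice forgets the tag: Statement 1 does not match, so no Allow from it.
print(statement1_allows("alice", {}))  # False

# alice tries to describe or tag bob's cluster: Statement 2 denies.
print(statement2_denies("alice", {"owner": "bob"}))  # True

# alice describes her own cluster: Statement 2 does not apply.
print(statement2_denies("alice", {"owner": "alice"}))  # False
```

The third case is the one that closes the loophole: because the Deny is evaluated against the tag already on the resource, alice cannot add her own owner tag to bob's cluster in the first place.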