
Installing RStudio Server on AWS

https://www.rstudio.com/products/rstudio/download-server/

https://cran.rstudio.com/bin/linux/ubuntu/README.html

https://cran.r-project.org/mirrors.html

https://askubuntu.com/questions/995484/data-from-such-a-repository-cant-be-authenticated

sudo nano /etc/apt/sources.list

Add the following line to the sources.list file:

deb https://cran.dcc.uchile.cl/bin/linux/ubuntu xenial/ 

sudo apt-get update --allow-unauthenticated
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-stretch-1.1.447-amd64.deb

sudo gdebi rstudio-server-stretch-1.1.447-amd64.deb

https://support.rstudio.com/hc/en-us/articles/200552306-Getting-Started

adduser omid

Remember the password – it is used to log in to RStudio.
Add omid to the sudoers group (see the sketch below).
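A quick sketch of those last two steps (the sudo group name and the verify command assume a standard Ubuntu + RStudio Server setup):

sudo usermod -aG sudo omid                # add omid to the sudoers group
sudo rstudio-server verify-installation   # confirm RStudio Server is installed and running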



 


Packages installed (list from Alex):

rJava – done. Requires: sudo apt-get install default-jdk

RJDBC – done.

randomForest – done.

rpart – done.

svMisc – done.
sp – done.
dplyr – done.

rgeos – done. Requires: sudo apt-get install libgeos-dev

rgdal – done. Requires: sudo apt-get install libgdal1-dev libproj-dev

aws.s3 – requires the following system libraries and R packages:

sudo apt-get install libxml2-dev
then install httr and xml2 in R.

sudo apt-get install openssl
sudo apt-get install libssl-dev
then install openssl in R.

sudo apt-get install curl
then install curl in R.
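For the R side of the aws.s3 dependencies, a one-line sketch that installs everything into the system library (the CRAN mirror is the same one added to sources.list above; adjust it if you use a different mirror):

sudo R -e "install.packages(c('httr','xml2','openssl','curl','aws.s3'), repos='https://cran.dcc.uchile.cl/')"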

How to increase disk space on master node root partition in EMR

I have tested the following solution using the r4.4xlarge instance type.

Here are the steps:

Step-1) Increase the root volume of the EMR master node

To navigate to the root EBS volume of the EMR cluster’s master node, take the following steps:

– Open the EMR cluster in the EMR console
– Expand the hardware dropdown
– Click on the Instance Group ID that is labelled as MASTER
– Click on the EC2 instance ID shown in this table, which will open the master node in the EC2 console
– In the “Description” tab in the information panel at the bottom of the console, scroll down and click on the linked device for the “Root device” entry
– The EBS volume will now open in the EC2 console; this is the root EBS volume for the master node
– You should now be able to choose the “Modify Volume” action from the “Actions” dropdown and change the volume size

In this case, I adjusted the size of the EBS volume from 10GB to 50GB, simple as that! Alternatively, you can run the exact CLI command rather than going through the console:

aws ec2 modify-volume --region us-east-1 --volume-id vol-xxxxxxxxxxxxxxxxxxxx --size 50 --volume-type gp2
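If you need to look up the root volume ID first, a sketch like the one below works (the instance ID is a placeholder and the root device is assumed to be /dev/xvda, the EMR default):

aws ec2 describe-instances --region us-east-1 --instance-ids i-0123456789abcdef0 \
  --query "Reservations[].Instances[].BlockDeviceMappings[?DeviceName=='/dev/xvda'].Ebs.VolumeId" \
  --output text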

Step-2) Log in to the master node with SSH and run the following commands to check the newly added size under /dev/xvda.

lsblk
df -h

Step-3) Note that at this point you will still not see the additional space on the file system (root partition /dev/xvda1). Run the following commands to add the new space to the root volume:

sudo /usr/bin/cloud-init -d single -n growpart
sudo /usr/bin/cloud-init -d single -n resizefs

Step-4) Now run the commands below to see that the “/” volume has grown to 50GB.
df -h
lsblk

Step-5) After the volume is increased, you can run a quick test: create a sample file and watch the usage of the “/” volume grow.
sudo fallocate -l 10G /test (this creates a 10GB test file under “/”)
df -h (verify that usage on the root mount point increased)

Step-6) After verifying, delete the sample /test file.
sudo rm -rf /test
Note: Please back up any important files or configuration before performing this operation.

 

Need to learn more about aws big data (demystified)?

Questions and answers on AWS EMR Jupyter

 

1. Can we connect from the Jupyter notebook to Hive, SparkSQL, and Presto?

EMR release 5.14.0 is the first to include JupyterHub. You can see all available applications within EMR Release 5.14.0 listed here [1].

2. Are there any interpreters for Scala and PySpark?

When you create a cluster with JupyterHub on EMR, the default Python 3 kernel for Jupyter, and the PySpark, SparkR, and Spark kernels for Sparkmagic, are installed in the Docker container. You can use these kernels to run ad-hoc Spark code and interactive SQL queries using Python, R, and Scala. You can also manually install additional kernels, libraries, and packages within the Docker container and then import them for the appropriate shell [2].

3. Is there any option to connect from the Jupyter notebook via a JDBC / secured JDBC connection?

The latest JDBC drivers can be found here [3], along with an example that uses SQL Workbench/J as a SQL client to connect to a Hive cluster on EMR.

You can download and install the necessary drivers from the links available here [4]. You can add JDBC connectors at cluster launch using configuration classifications; examples of Presto classifications and of configuring a cluster with the PostgreSQL JDBC driver can be seen here [5].

4. What are the steps to bootstrap a cluster with Jupyter notebooks?

As the dedicated AWS blog post states [6], AWS provides a bootstrap action [7] to install Jupyter, available at the following path:

‘s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh’
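For example, the bootstrap action can be passed at cluster creation time – a rough CLI sketch, where the instance type, count, and key name are assumptions you should replace with your own:

aws emr create-cluster --name "emr-jupyter" --release-label emr-5.14.0 \
  --applications Name=Hadoop Name=Spark --use-default-roles \
  --ec2-attributes KeyName=my-key --instance-type m4.xlarge --instance-count 3 \
  --bootstrap-actions Name="Install Jupyter",Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh"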

5. Is there a way to automatically save Jupyter notebooks to persistent storage such as S3, like in Zeppelin?

By default this is not available; however, you can create your own script to achieve it.

EMR enables you to run a script at any time during step processing in your cluster. You specify a step that runs a script either when you create your cluster or you can add a step if your cluster is in the WAITING state [8].
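A hedged sketch of adding such a script as a step with script-runner.jar (the cluster ID and the script's S3 location are placeholders):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="Run notebook script",ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://my-bucket/scripts/backup-notebooks.sh"]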

6. Is there a way to add HTTPS to the Jupyter notebook GUI? If so, how?

By default, JupyterHub on EMR uses a self-signed certificate for SSL encryption using HTTPS. Users are prompted to trust the self-signed certificate when they connect.

You can use a trusted certificate and keys of your own. Replace the default certificate file, server.crt, and key file server.key in the /etc/jupyter/conf/ directory on the master node with certificate and key files of your own. Use the c.JupyterHub.ssl_key and c.JupyterHub.ssl_cert properties in the jupyterhub_config.py file to specify your SSL materials [9].

You can read more about this in the Security Settings section of the JupyterHub documentation [10].
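In practice this boils down to something like the sketch below, run on the master node (the certificate file names and the container restart are assumptions about a standard JupyterHub-on-EMR setup):

sudo cp my-cert.crt /etc/jupyter/conf/server.crt
sudo cp my-cert.key /etc/jupyter/conf/server.key
sudo docker restart jupyterhub    # restart the container so the new SSL material is picked up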

7. Is there a way to work with Jupyter via an API and the command line?

As is the case with all AWS services, you can create an EMR cluster with JupyterHub using the AWS Management Console, AWS Command Line Interface, or the EMR API [11].

8. Where is the configuration path of the Jupyter notebook?

/etc/jupyter/conf/

You can customize the configuration of JupyterHub on EMR and individual user notebooks by connecting to the cluster master node and editing configuration files [12].

As mentioned above, AWS provides a bootstrap action [7] to install Jupyter, available at the following path:

‘s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh’

9. Are there any common issues with Jupyter?

Here are a number of considerations to keep in mind:

User notebooks and files are saved to the file system on the master node. This is ephemeral storage that does not persist through cluster termination. When a cluster terminates, this data is lost if not backed up. We recommend that you schedule regular backups using cron jobs or another means suitable for your application.

In addition, configuration changes made within the container may not persist if the container restarts. We recommend that you script or otherwise automate container configuration so that you can reproduce customizations more readily [13].
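As a rough sketch of such a backup (the notebook path under /var/lib/jupyter/home and the bucket name are assumptions – check where your notebooks actually live and use your own bucket):

sudo tee /etc/cron.d/jupyter-backup <<'EOF'
# sync notebooks to S3 every hour, running as the hadoop user
0 * * * * hadoop aws s3 sync /var/lib/jupyter/home s3://my-notebook-backups/jupyter/ --quiet
EOF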

10. What are the orchestration options for Jupyter notebooks, i.e. how do you schedule a notebook to run daily?

JupyterHub and related components run inside a Docker container named jupyterhub that runs the Ubuntu operating system. There are several ways for you to administer components running inside the container [14].

Please note that customisations you perform within the container may not persist if the container restarts. We recommend that you script or otherwise automate container configuration so that you can reproduce customisations more readily.

11. User / group / credentials management in the Jupyter notebook?

You can use one of two methods for users to authenticate to JupyterHub so that they can create notebooks and, optionally, administer JupyterHub.

The easiest method is to use JupyterHub’s pluggable authentication module (PAM). However, JupyterHub on EMR also supports the LDAP Authenticator Plugin for JupyterHub for obtaining user identities from an LDAP server, such as a Microsoft Active Directory server [15].

You can find instructions and examples for adding users with PAM here [16] and LDAP here [17].
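For reference, adding a PAM user roughly follows the pattern below – a sketch of running useradd inside the jupyterhub container, with a hypothetical username:

sudo docker exec jupyterhub useradd -m -s /bin/bash -N diana
sudo docker exec jupyterhub bash -c "echo 'diana:<password>' | chpasswd"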

12. Notebook collaboration features?

TBD.

13. Import/export options?

As stated above, you can manually install additional kernels, libraries, and packages within the Docker container and then import them for the appropriate shell [2].

14. Any other connections built into Jupyter?

As stated above, EMR release 5.14.0 is the first to include JupyterHub; all applications available within EMR release 5.14.0 are listed in [1].

15. Does it work seamlessly with AWS Glue in terms of a shared metastore?

If you are asking, for example, about configuring Hive to use the Glue Data Catalog as its metastore, you can indeed do this with EMR version 5.8.0 or later [18].
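For example, a cluster can be launched with the hive-site classification pointing Hive at the Glue Data Catalog – a minimal sketch, with the instance type and count as assumptions:

aws emr create-cluster --name "hive-glue-metastore" --release-label emr-5.14.0 \
  --applications Name=Hive Name=Spark --use-default-roles \
  --instance-type m4.large --instance-count 3 \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'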

Finally, I have included the following for your reference:

1. JupyterHub Components

A diagram depicting the components of JupyterHub on EMR, with the corresponding authentication methods for notebook users and the administrator, is available in the documentation [19].

2. SageMaker

As you are more than likely aware, AWS recently launched an ML notebook service called SageMaker, which uses Jupyter notebooks exclusively. Because SageMaker is integrated with other AWS services, you can achieve greater control; for example, you can use the IAM service to control user access. You can also connect to it from an EMR cluster: EMR release 5.11.0 [20] added the aws-sagemaker-spark-sdk component to Spark, which installs Amazon SageMaker Spark and the associated dependencies for Spark integration with Amazon SageMaker.

You can use Amazon SageMaker Spark to construct Spark machine learning (ML) pipelines using Amazon SageMaker stages. If this is of interest to you, you can read more about it here [21] and on the SageMaker Spark Readme on GitHub [22].

 

Resources

[1] Amazon EMR 5.x Release Versions – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html
[2] Installing Additional Kernels and Libraries – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html
[3] Use the Hive JDBC Driver – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/HiveJDBCDriver.html
[4] Use Business Intelligence Tools with Amazon EMR – https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-bi-tools.html
[5] Adding Database Connectors – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/presto-adding-db-connectors.html
[6] Run Jupyter Notebook and JupyterHub on Amazon EMR – https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
[7] Create Bootstrap Actions to Install Additional Software – https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
[8] Run a Script in a Cluster – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
[9] Connecting to the Master Node and Notebook Servers – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-connect.html
[10] JupyterHub Security Settings – http://jupyterhub.readthedocs.io/en/latest/getting-started/security-basics.html
[11] Create a Cluster With JupyterHub – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-launch.html
[12] Configuring JupyterHub – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-configure.html
[13] Considerations When Using JupyterHub on Amazon EMR – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-considerations.html
[14] JupyterHub Configuration and Administration – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-administer.html
[15] Adding Jupyter Notebook Users and Administrators – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-user-access.html
[16] Using PAM Authentication – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-pam-users.html
[17] Using LDAP Authentication – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-ldap-users.html
[18] Using the AWS Glue Data Catalog as the Metastore for Hive – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
[19] JupyterHub – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html
[20] EMR Release 5.11.0 – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew-history.html#emr-5110-whatsnew
[21] Using Apache Spark with Amazon SageMaker – https://docs.aws.amazon.com/sagemaker/latest/dg/apache-spark.html
[22] SageMaker Spark – https://github.com/aws/sagemaker-spark/blob/master/README.md
[*] What Is Amazon SageMaker? – https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

 

Need to learn more about aws big data (demystified)?

39 Tips to reduce costs on AWS EMR

There will come a time when you start thinking about ways to reduce costs on your EMR cluster. Note that some of these tips will simply make your jobs run faster, which usually implies lower costs because of better CPU utilization. If this is the case, consider the following:

  1. Conceptually, switching to a transient cluster instead of a 24X7 one is the best option. It requires some preparation work on the automation side, but it may be worth your time.
  2. Consider using R instances if your EMR jobs are memory intensive; remember that the overhead of YARN and the JVM is about 50% of what you have on the machine. You can confirm this via EMR Ganglia by checking the amount of available RAM while the jobs are running.
  3. If you use a larger instance type, the attached NIC will have higher network bandwidth. The difference may be worthwhile in terms of time vs cost when working with large external tables; it can be 300MB/s instead of 100MB/s on a smaller instance. If you have witnessed something much faster – please let me know.
  4. Consider using the “maximizeResourceAllocation” configuration; this may shorten your running time.
  5. Naturally, tuning cluster-specific / job-specific configuration may be worth your time as well, but personally I am too lazy to try this, and the risk of these per-query custom configs propagating to other jobs and slowing them down is high due to the human element. If you must tune, tune per session – run the settings before your query and set them back to default values after you are done. You may also want to consider a cluster per job type, for example a cluster for join jobs and a cluster for aggregations, or a cluster per environment: DEV, QA, STG, PROD.
  6. Consider using task nodes. They provision quickly and can be killed and resized freely with zero impact on your cluster in terms of stability. A piece of advice – make sure the task nodes are the same size as your data nodes, otherwise you risk underutilising them, as the executor configurations are the same as those of the data nodes.
  7. Consider using Spot instances, especially for task nodes – save up to 80%!
  8. Consider using Reserved Instances for your data nodes and master nodes – save up to 35% on costs.
  9. Consider using several task groups with different sizes/configurations.
  10. Auto scaling will help you be more accurate in terms of costs; you can scale at a resolution of about 5 minutes (the time it takes to provision an EMR node). Auto scaling is highly useful on a 24X7 EMR cluster. Using auto scaling requires some testing, as it does not always behave exactly as you expect: in a cluster with dynamic resource allocation, the resources may be ready, but the boost in performance may take its time. Auto scaling and task nodes saved me about 50%. Naturally, when you save costs using task nodes and auto scaling, you get greedy on a simple performance test – well, until the scale-in kicks in 🙂
  11. In God we trust, all others must bring data – use Ganglia to track the exact amount of resources you need (perhaps you are over-provisioning):
    1. yarn.QueueMetrics.AvailableMB
    2. yarn.QueueMetrics.AvailableVCores
    3. yarn.NodeManagerMetrics.AvailableGB
    4. yarn.NodeManagerMetrics.AvailableVCores
  12. The minimal recommended cluster size is 3 machines: 1 master and 2 data nodes. Consider the suggestions below:
    1. EMR with only one machine (a newer feature), where the master node and data node run on the same machine.
    2. EMR with 1 master, 1 data node, and, if you must scale, task nodes with auto scaling. Note that the minimum number of machines in a task group can be zero. Also note that this should not be used in production, as the stability of your cluster is much lower even if you are not keeping your data in local tables: if your data node dies, the entire cluster becomes unusable, and this is unrecoverable.
  13. Encryption at rest and encryption in motion may be good for security reasons, but they can have a massive impact on production in terms of resources, running time, etc. Confirm that security is a must before you apply it on a transient cluster – consult your CISO on this. Note that the encryption on S3 is hardware based, but I would still perform a simple benchmark test to see the cost/benefit ratio.
  14. If you can afford it, and it is technically valid, test your jobs on Hive / Spark / Presto. Furthermore, test different compression types and storage formats.
    1. I know for a fact, from benchmarks I performed, that in some cases Hive will be faster than Spark.
    2. I am less familiar with Presto, but I am positive there are use cases where it will be faster.
    3. From a few benchmarks I performed, you would be surprised how much impact different compression types can have on write time to S3 and on read time (if the data compresses better). I personally work with Parquet with GZIP, but this only works perfectly for my use cases.
    4. Note that compression has an impact on CPU utilisation, so it is not clear-cut which will be cheaper (Parquet/ORC with GZIP or BZIP) nor which will be faster (Spark / Hive / Presto).
  15. Did you switch to columnar storage? If not, convert your row-based data to a columnar format (see the sketch after this list).
  16. Did you use partitioning? Did you use the correct partitioning for your queries?
  17. If using ORC, consider using bucketing on top of partitioning.
  18. Was your data split into chunks? If so, try changing the chunk size. This is more complicated but doable; again, it could go either way – you need to test it with your data.
  19. Applying hints on the table may help reduce the time spent on data scans in some cases.
  20. If you join multiple tables, the order of the joins may affect the amount of data scanned, shortening running time.
  21. Consider pre-aggregating data, if possible, as part of your transformation/cleansing process – even at the row level (using window functions, each row can carry its aggregated values).
  22. Consider pre-calculating tables with heavy GROUP BYs over the raw data, i.e. have the results already calculated on S3 and have your production/end users query that table.
  23. Have a data engineer review each query to make sure the data scan is minimised. For example:
    1. Minimise the columns in the result set – a result set of long strings may be very costly.
    2. Where possible, switch strings to ints; this will greatly reduce the storage footprint.
    3. If possible, switch from bigint to tinyint; this will save some disk space as well. See the list of supported data types: https://prestodb.io/docs/current/language/types.html
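A minimal sketch of tip 15 – converting a row-based table to columnar Parquet with Hive; the table names and the GZIP compression choice are assumptions, adapt them to your own data:

hive -e "CREATE TABLE events_parquet STORED AS PARQUET TBLPROPERTIES ('parquet.compression'='GZIP') AS SELECT * FROM events_raw;"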

 

Conclusion

As you can see, there are many ways to save costs on AWS EMR. The easiest is to use task groups and scale in/out based on your needs. The rest may take some time, but it will be worth it.

 

 

Need to learn more about aws big data (demystified)?

Working with RStudio and a remote Spark cluster (SparkR)

  1. Download and install RStudio server:

After the EMR cluster is up and running, SSH to the master node as the ‘hadoop’ user, download RStudio Server, and install it using `yum install`:

$ wget https://download2.rstudio.org/rstudio-server-rhel-1.1.442-x86_64.rpm

$ sudo yum install --nogpgcheck rstudio-server-rhel-1.1.442-x86_64.rpm

Or, if you need the older 1.1.383 build instead:

$ wget https://download2.rstudio.org/rstudio-server-rhel-1.1.383-x86_64.rpm

$ sudo yum install --nogpgcheck rstudio-server-rhel-1.1.383-x86_64.rpm

Finally, add a user to access the RStudio web console:

$ sudo su
$ sudo useradd <username>

$ sudo echo <username>:<password> | chpasswd

  2. To access the RStudio web console, you need to create an SSH tunnel from your machine to the EMR master node for local port forwarding, like below:

$ ssh -NL 8787:ec2-<emr-master-node-ip>.compute-1.amazonaws.com:8787 hadoop@ec2-<emr-master-node-ip>.compute-1.amazonaws.com&

3. Now open any browser and go to `http://localhost:8787` to reach the RStudio web console, and use the `<username>:<password>` combination to log in.

 

  4. To install the required R packages, you first need to install `libcurl` and related libraries on the master node, like below:

sudo yum update

sudo yum install -y curl

sudo yum install -y openssl

sudo yum -y install libcurl-devel

sudo yum -y install openssl-devel

sudo yum -y install libssh2-devel

  5. Resolve HDFS permission issues with:

$ sudo -u hdfs hadoop fs -mkdir /user/<username>

$ sudo -u hdfs hadoop fs -chown <username> /user/<username>

Otherwise, you may see errors like below while trying to create a Spark session from RStudio:

—————————

Error: Failed during initialize_connection() org.apache.hadoop.security.AccessControlException: Permission denied: user=<username>, access=WRITE, inode="/user/<username>/.sparkStaging/application_1476072880868_0008":hdfs:hadoop:drwxr-xr-x

 

  6. Install all the necessary packages in RStudio (exactly as you would on a local machine):

install.packages('devtools')

devtools::install_github('apache/spark@v2.2.1', subdir='R/pkg')

install.packages('sparklyr')

library(SparkR)

library(sparklyr)

library(dplyr)

You might need to install additional dependencies beyond the above, as my setup may differ from yours, so I am leaving this open.

 

  7. Once the required packages are installed and loaded, you can create a Spark session on the remote EMR/Spark cluster and interact with your SparkR application using the commands below:

> sc <- spark_connect(master = "yourEMRmasterNodeDNS:8998", method = "livy")

> copy_to(sc, iris)

# Source:   table<iris> [?? x 5]

# Database: spark_connection

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species

      <dbl>    <dbl>     <dbl>    <dbl> <chr>  

1      5.10     3.50      1.40    0.200 setosa

 

Need to learn more about aws big data (demystified)?

Bootstrapping Zeppelin EMR

A few notes before we launch an EMR cluster with Zeppelin:

  • Zeppelin is installed on the master node of the EMR cluster (choose the right installation for you).

If you want, you can automate the above process by using an EMR step. A simple shell script (sketched below) can download your zeppelin-site.xml file from S3 onto your EMR cluster and restart the Zeppelin service.
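A minimal sketch of what setupZeppelin.sh could look like (the config path and the upstart-style restart commands are assumptions that fit older EMR 5.x releases):

#!/bin/bash
set -e
# first argument: the S3 location of your zeppelin-site.xml
sudo aws s3 cp "$1" /etc/zeppelin/conf/zeppelin-site.xml
# restart Zeppelin so the new configuration is picked up
sudo stop zeppelin || true
sudo start zeppelin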

To run it, simply copy the script to an S3 bucket and then use the script-runner.jar process, as outlined below, with the script's S3 location as its only argument.

To do this via the AWS EMR Console:

1  – Under the “Add steps (optional)” section, select “Custom JAR” for the “Step type” and click the “Configure” button.

2 – In the pop-up window, for us-east-1 the JAR location for script-runner.jar is:

s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar

3 – For the argument, you would pass in the S3 bucket and location of the “setupZeppelin.sh” file, e.g.:

s3://mybucket/mylocation/setupZeppelin.sh

Once done, click “Add” and continue on with your EMR cluster creation (this process is included when cloning an EMR cluster).

Zeppelin Config optimisation

  • In the interpreter settings:
    • set zeppelin.interpreter.output.limit to 1024 instead of 102400
    • set zeppelin.spark.maxResult to 100 instead of 1000
  • In zeppelin-env.sh
    sudo nano /etc/zeppelin/conf.dist/zeppelin-env.sh

    • export ZEPPELIN_MEM="-Xms44024m -Xmx46024m -XX:MaxPermSize=512m"
    • export ZEPPELIN_INTP_MEM="-Xms44024m -Xmx46024m -XX:MaxPermSize=512m"
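After editing, restart Zeppelin so the new memory settings take effect (which command applies depends on the EMR release):

sudo stop zeppelin && sudo start zeppelin   # older EMR 5.x releases (upstart)
# sudo systemctl restart zeppelin           # newer releases (systemd)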

 

Saving your Zeppelin notebooks on Amazon S3 storage for durability:

https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/

 

Example zeppelin-site.xml configuration

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<configuration>

<property>
<name>zeppelin.server.addr</name>
<value>0.0.0.0</value>
<description>Server address</description>
</property>

<property>
<name>zeppelin.server.port</name>
<value>8080</value>
<description>Server port.</description>
</property>

<property>
<name>zeppelin.server.ssl.port</name>
<value>8890</value>
<description>Server ssl port. (used when ssl property is set to true)</description>
</property>

<property>
<name>zeppelin.server.context.path</name>
<value>/</value>
<description>Context Path of the Web Application</description>
</property>

<property>
<name>zeppelin.war.tempdir</name>
<value>webapps</value>
<description>Location of jetty temporary directory</description>
</property>

<property>
<name>zeppelin.notebook.dir</name>
<value>notebook</value>
<description>path or URI for notebook persist</description>
</property>

<property>
<name>zeppelin.notebook.homescreen</name>
<value></value>
<description>id of notebook to be displayed in homescreen. ex) 2A94M5J1Z Empty value displays default home screen</description>
</property>

<property>
<name>zeppelin.notebook.homescreen.hide</name>
<value>false</value>
<description>hide homescreen notebook from list when this value set to true</description>
</property>

 

<!-- Amazon S3 notebook storage -->
<!-- Creates the following directory structure: s3://{bucket}/{username}/{notebook-id}/note.json -->
<property>
<name>zeppelin.notebook.s3.user</name>
<value>user</value>
<description>user name for s3 folder structure</description>
</property>

<property>
<name>zeppelin.notebook.s3.bucket</name>
<value>my-zeppelin-bucket</value>
<description>bucket name for notebook storage</description>
</property>

<property>
<name>zeppelin.notebook.s3.endpoint</name>
<value>s3.amazonaws.com</value>
<description>endpoint for s3 bucket</description>
</property>

<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.S3NotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>

<!-- Additionally, encryption is supported for notebook data stored in S3 -->
<!-- Use the AWS KMS to encrypt data -->
<!-- If used, the EC2 role assigned to the EMR cluster must have rights to use the given key -->
<!-- See https://aws.amazon.com/kms/ and http://docs.aws.amazon.com/kms/latest/developerguide/concepts.html -->
<!--
<property>
<name>zeppelin.notebook.s3.kmsKeyID</name>
<value>AWS-KMS-Key-UUID</value>
<description>AWS KMS key ID used to encrypt notebook data in S3</description>
</property>
-->

<!-- provide region of your KMS key -->
<!-- See http://docs.aws.amazon.com/general/latest/gr/rande.html#kms_region for region codes names -->
<!--
<property>
<name>zeppelin.notebook.s3.kmsKeyRegion</name>
<value>us-east-1</value>
<description>AWS KMS key region in your AWS account</description>
</property>
-->

<!-- Use a custom encryption materials provider to encrypt data -->
<!-- No configuration is given to the provider, so you must use system properties or another means to configure -->
<!-- See https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/EncryptionMaterialsProvider.html -->
<!--
<property>
<name>zeppelin.notebook.s3.encryptionMaterialsProvider</name>
<value>provider implementation class name</value>
<description>Custom encryption materials provider used to encrypt notebook data in S3</description>
</property>
-->

 

<!-- If using Azure for storage use the following settings -->
<!--
<property>
<name>zeppelin.notebook.azure.connectionString</name>
<value>DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey></value>
<description>Azure account credentials</description>
</property>

<property>
<name>zeppelin.notebook.azure.share</name>
<value>zeppelin</value>
<description>share name for notebook storage</description>
</property>

<property>
<name>zeppelin.notebook.azure.user</name>
<value>user</value>
<description>optional user name for Azure folder structure</description>
</property>

<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.AzureNotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
-->

<!-- Notebook storage layer using local file system
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.VFSNotebookRepo</value>
<description>local notebook persistence layer implementation</description>
</property>
-->

<!-- For connecting your Zeppelin with ZeppelinHub -->
<!--
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GitNotebookRepo, org.apache.zeppelin.notebook.repo.zeppelinhub.ZeppelinHubRepo</value>
<description>two notebook persistence layers (versioned local + ZeppelinHub)</description>
</property>
-->

<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GitNotebookRepo</value>
<description>versioned notebook persistence layer implementation</description>
</property>

<property>
<name>zeppelin.notebook.one.way.sync</name>
<value>false</value>
<description>If there are multiple notebook storages, should we treat the first one as the only source of truth?</description>
</property>

<property>
<name>zeppelin.interpreter.dir</name>
<value>interpreter</value>
<description>Interpreter implementation base directory</description>
</property>

<property>
<name>zeppelin.interpreter.localRepo</name>
<value>local-repo</value>
<description>Local repository for interpreter’s additional dependency loading</description>
</property>

<property>
<name>zeppelin.interpreter.dep.mvnRepo</name>
<value>http://repo1.maven.org/maven2/</value>
<description>Remote principal repository for interpreter’s additional dependency loading</description>
</property>

<property>
<name>zeppelin.dep.localrepo</name>
<value>local-repo</value>
<description>Local repository for dependency loader</description>
</property>

<property>
<name>zeppelin.helium.npm.registry</name>
<value>http://registry.npmjs.org/</value>
<description>Remote Npm registry for Helium dependency loader</description>
</property>

<property>
<name>zeppelin.interpreters</name>
<value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.rinterpreter.RRepl,org.apache.zeppelin.rinterpreter.KnitR,org.apache.zeppelin.spark.SparkRInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.angular.AngularInterpreter,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.file.HDFSFileInterpreter,org.apache.zeppelin.flink.FlinkInterpreter,,org.apache.zeppelin.python.PythonInterpreter,org.apache.zeppelin.python.PythonInterpreterPandasSql,org.apache.zeppelin.python.PythonCondaInterpreter,org.apache.zeppelin.python.PythonDockerInterpreter,org.apache.zeppelin.lens.LensInterpreter,org.apache.zeppelin.ignite.IgniteInterpreter,org.apache.zeppelin.ignite.IgniteSqlInterpreter,org.apache.zeppelin.cassandra.CassandraInterpreter,org.apache.zeppelin.geode.GeodeOqlInterpreter,org.apache.zeppelin.postgresql.PostgreSqlInterpreter,org.apache.zeppelin.jdbc.JDBCInterpreter,org.apache.zeppelin.kylin.KylinInterpreter,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter,org.apache.zeppelin.scalding.ScaldingInterpreter,org.apache.zeppelin.alluxio.AlluxioInterpreter,org.apache.zeppelin.hbase.HbaseInterpreter,org.apache.zeppelin.livy.LivySparkInterpreter,org.apache.zeppelin.livy.LivyPySparkInterpreter,org.apache.zeppelin.livy.LivyPySpark3Interpreter,org.apache.zeppelin.livy.LivySparkRInterpreter,org.apache.zeppelin.livy.LivySparkSQLInterpreter,org.apache.zeppelin.bigquery.BigQueryInterpreter,org.apache.zeppelin.beam.BeamInterpreter,org.apache.zeppelin.pig.PigInterpreter,org.apache.zeppelin.pig.PigQueryInterpreter,org.apache.zeppelin.scio.ScioInterpreter</value>
<description>Comma separated interpreter configurations. First interpreter become a default</description>
</property>

<property>
<name>zeppelin.interpreter.group.order</name>
<value>spark,md,angular,sh,livy,alluxio,file,psql,flink,python,ignite,lens,cassandra,geode,kylin,elasticsearch,scalding,jdbc,hbase,bigquery,beam</value>
<description></description>
</property>

<property>
<name>zeppelin.interpreter.connect.timeout</name>
<value>30000</value>
<description>Interpreter process connect timeout in msec.</description>
</property>

<property>
<name>zeppelin.interpreter.output.limit</name>
<value>10240</value>
<description>Output message from interpreter exceeding the limit will be truncated</description>
</property>

<property>
<name>zeppelin.ssl</name>
<value>true</value>
<description>Should SSL be used by the servers?</description>
</property>

<property>
<name>zeppelin.ssl.client.auth</name>
<value>false</value>
<description>Should client authentication be used for SSL connections?</description>
</property>

<property>
<name>zeppelin.ssl.keystore.path</name>
<value>/home/hadoop/certificate.p12</value>
<description>Path to keystore relative to Zeppelin configuration directory</description>
</property>

<property>
<name>zeppelin.ssl.keystore.type</name>
<value>PKCS12</value>
<description>The format of the given keystore (e.g. JKS or PKCS12)</description>
</property>

<property>
<name>zeppelin.ssl.keystore.password</name>
<value>MyPassword</value>
<description>Keystore password. Can be obfuscated by the Jetty Password tool</description>
</property>

<!--
<property>
<name>zeppelin.ssl.key.manager.password</name>
<value>change me</value>
<description>Key Manager password. Defaults to keystore password. Can be obfuscated.</description>
</property>
-->

<property>
<name>zeppelin.ssl.truststore.path</name>
<value>truststore</value>
<description>Path to truststore relative to Zeppelin configuration directory. Defaults to the keystore path</description>
</property>

<property>
<name>zeppelin.ssl.truststore.type</name>
<value>JKS</value>
<description>The format of the given truststore (e.g. JKS or PKCS12). Defaults to the same type as the keystore type</description>
</property>

<!--
<property>
<name>zeppelin.ssl.truststore.password</name>
<value>change me</value>
<description>Truststore password. Can be obfuscated by the Jetty Password tool. Defaults to the keystore password</description>
</property>
-->

<property>
<name>zeppelin.server.allowed.origins</name>
<value>*</value>
<description>Allowed sources for REST and WebSocket requests (i.e. http://onehost:8080,http://otherhost.com). If you leave * you are vulnerable to https://issues.apache.org/jira/browse/ZEPPELIN-173</description>
</property>

<property>
<name>zeppelin.anonymous.allowed</name>
<value>false</value>
<description>Anonymous user allowed by default</description>
</property>

<property>
<name>zeppelin.notebook.public</name>
<value>false</value>
<description>Make notebook public by default when created, private otherwise</description>
</property>

<property>
<name>zeppelin.websocket.max.text.message.size</name>
<value>1024000</value>
<description>Size in characters of the maximum text message to be received by websocket. Defaults to 1024000</description>
</property>

<property>
<name>zeppelin.server.default.dir.allowed</name>
<value>false</value>
<description>Enable directory listings on server.</description>
</property>

<!--
<property>
<name>zeppelin.server.jetty.name</name>
<value>Jetty(7.6.0.v20120127)</value>
<description>Hardcoding Application Server name to Prevent Fingerprinting</description>
</property>
-->
<!--
<property>
<name>zeppelin.server.xframe.options</name>
<value>SAMEORIGIN</value>
<description>The X-Frame-Options HTTP response header can be used to indicate whether or not a browser should be allowed to render a page in a frame/iframe/object.</description>
</property>
-->

<!--
<property>
<name>zeppelin.server.strict.transport</name>
<value>max-age=631138519</value>
<description>The HTTP Strict-Transport-Security response header is a security feature that lets a web site tell browsers that it should only be communicated with using HTTPS, instead of using HTTP. Enable this when Zeppelin is running on HTTPS. Value is in Seconds, the default value is equivalent to 20 years.</description>
</property>
-->
<!--
<property>
<name>zeppelin.server.xxss.protection</name>
<value>1</value>
<description>The HTTP X-XSS-Protection response header is a feature of Internet Explorer, Chrome and Safari that stops pages from loading when they detect reflected cross-site scripting (XSS) attacks. When value is set to 1 and a cross-site scripting attack is detected, the browser will sanitize the page (remove the unsafe parts).</description>
</property>
-->
</configuration>

 

shiro.ini example

# Licensed to the Apache Software Foundation (ASF) under one or more

# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

[users]
# List of users with their password allowed to access Zeppelin.
# To use a different strategy (LDAP / Database / …) check the shiro doc at http://shiro.apache.org/configuration.html#Configuration-INISections

Omid = BigDataNinjaForever, admin

# Sample LDAP configuration, for user Authentication, currently tested for single Realm
[main]
### A sample for configuring Active Directory Realm
#activeDirectoryRealm = org.apache.zeppelin.realm.ActiveDirectoryGroupRealm
#activeDirectoryRealm.systemUsername = userNameA

#use either systemPassword or hadoopSecurityCredentialPath, more details in http://zeppelin.apache.org/docs/latest/security/shiroauthentication.html
#activeDirectoryRealm.systemPassword = passwordA
#activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://file/user/zeppelin/zeppelin.jceks
#activeDirectoryRealm.searchBase = CN=Users,DC=SOME_GROUP,DC=COMPANY,DC=COM
#activeDirectoryRealm.url = ldap://ldap.test.com:389
#activeDirectoryRealm.groupRolesMap = "CN=admin,OU=groups,DC=SOME_GROUP,DC=COMPANY,DC=COM":"admin","CN=finance,OU=groups,DC=SOME_GROUP,DC=COMPANY,DC=COM":"finance","CN=hr,OU=groups,DC=SOME_GROUP,DC=COMPANY,DC=COM":"hr"
#activeDirectoryRealm.authorizationCachingEnabled = false

### A sample for configuring LDAP Directory Realm
#ldapRealm = org.apache.zeppelin.realm.LdapGroupRealm
## search base for ldap groups (only relevant for LdapGroupRealm):
#ldapRealm.contextFactory.environment[ldap.searchBase] = dc=COMPANY,dc=COM
#ldapRealm.contextFactory.url = ldap://ldap.test.com:389
#ldapRealm.userDnTemplate = uid={0},ou=Users,dc=COMPANY,dc=COM
#ldapRealm.contextFactory.authenticationMechanism = simple

### A sample PAM configuration
#pamRealm=org.apache.zeppelin.realm.PamRealm
#pamRealm.service=sshd

### A sample for configuring ZeppelinHub Realm
#zeppelinHubRealm = org.apache.zeppelin.realm.ZeppelinHubRealm
## Url of ZeppelinHub
#zeppelinHubRealm.zeppelinhubUrl = https://www.zeppelinhub.com
#securityManager.realms = $zeppelinHubRealm

sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager

### If caching of user is required then uncomment below lines
#cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
#securityManager.cacheManager = $cacheManager

securityManager.sessionManager = $sessionManager
# 86,400,000 milliseconds = 24 hour
securityManager.sessionManager.globalSessionTimeout = 86400000
shiro.loginUrl = /api/login

[roles]
dataScience = *
admin = *

[urls]
# This section is used for url-based security.
# You can secure interpreter, configuration and credential information by urls. Comment or uncomment the below urls that you want to hide.
# anon means the access is anonymous.
# authc means Form based Auth Security
# To enforce security, comment the line below and uncomment the next one
/api/version = anon
#/api/interpreter/** = authc, roles[admin]
#/api/configurations/** = authc, roles[admin]
#/api/credential/** = authc, roles[admin]
#/** = anon
/api/notebook/** = anon
/** = authc

 

 

Need to learn more about aws big data (demystified)?