
Big Data Jargon | FAQs and EVERYTHING you wanted to know about Big Data but didn't ask…

What is Big Data?

Everyone has their own definition; I will share mine, and I hope it helps. Big Data is an ecosystem, much like computer science or electronics. The ecosystem was designed to solve a particular set of problems:

  1. Ingestion – the data coming into our data lake.
  2. Transformation – converting data from one format to another, e.g. JSON into CSV (see the sketch after this list).
  3. Modeling – querying and analyzing the data.
  4. Presentation – querying the data from your BI tool and/or application.
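
To make the transformation stage concrete, here is a minimal sketch (not from the original post) of converting a file of JSON records into a CSV using only the Python standard library; the file names and fields are hypothetical.

import csv
import json

# Hypothetical input: one JSON object per line, e.g. {"user_id": 1, "country": "IL"}
with open('events.json') as src:
    records = [json.loads(line) for line in src if line.strip()]

# Use the keys of the first record as the CSV header.
fieldnames = sorted(records[0].keys())

with open('events.csv', 'w', newline='') as dst:
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)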

So, again, what is Big Data? It is a methodology for solving scalability problems in your data platform. You have a big data problem if your infrastructure is unable to process the amount of data you currently have. This methodology was designed to help you analyze data in a scalable manner and grow as you grow. See the example in the picture below, and read more in this lecture: AWS Big Data Demystified.

What is Scale up ( = Scale Vertically)?

Scale up – an older architecture practice, used mainly in OLTP databases and monolithic applications. Imagine any database on which you are having difficulties querying data. Your IT guys will hint that it is probably a resources issue, and they will throw more resources at it (CPU, RAM, faster disks). It will probably work, until you hit a point where the utilization is low and you are still unable to process the data. This, my friend, is a big data problem, and you may need to consider switching to a technology whose architecture practice is scale out.

What is Scale Out ( = Scale Horizontally) ?

A current architecture practice, used mainly in distributed applications such as Hadoop. The basic idea is: when you hit a stage where the cluster is responding slowly, just add another machine to the cluster, and the entire workload will benefit from the change, as the work is distributed equally across all the machines in the cluster. The more nodes you have in the cluster, the more performance you get.

Scale Up vs Scale Out?

Scale Up ( = Scale Vertically) pros and cons

Usually a small cluster, usually with an active/passive-like architecture. When you run out of resources, you increase the resources per machine. In the world of Big Data, it is a good idea for power queries such as joins. The anti-pattern of such an architecture is parallelism, i.e. many analysts working on the same cluster: 50 analysts running join queries on a scale-up architecture will not work, as each join query will consume most of the machine's resources.

Scale Out ( = Scale Horizontally) pros and cons

Ran into a performance problem? Just add more servers and the entire cluster will enjoy a performance boost, because the application is distributed: each node handles a fraction of the task. Thus, the pro of this architecture is parallelism. The con of this architecture is power queries: imagine a really big join running on a cluster of medium-resource nodes. Since the task is distributed among the nodes of the cluster, eventually there will be one machine collecting all the intermediate results from thousands of nodes running the query, and at the end of the day the amount of memory in this last node is limited, and the network overhead will slow you down drastically.


What is Structured Data?

Anything you can put in a table or a comma-separated file such as CSV, i.e. columns and rows.

Challenges associated with CSV-like files?

  1. Encoding…
  2. Charset…
  3. Parsing a CSV with a large number of columns is a world of PAIN. AWS Glue helps with that, as it detects the schema automatically.
  4. Once you change a column in a CSV, any ETL that reads the modified CSV based on the original file schema will break (see the sketch after this list).
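
To illustrate point 4, here is a minimal sketch (the column names are made up): an ETL that reads columns by name breaks as soon as the producer renames a column in the CSV.

import csv
import io

# Original schema the ETL was written against.
original = "user_id,country\n1,IL\n"
# The producer later renames 'country' to 'geo' without telling anyone.
modified = "user_id,geo\n1,IL\n"

def etl(csv_text):
    reader = csv.DictReader(io.StringIO(csv_text))
    # The ETL still expects the original column name.
    return [row['country'] for row in reader]

print(etl(original))   # ['IL']
print(etl(modified))   # raises KeyError: 'country' -- the ETL breaks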

What is Semi-Structured Data?

Files like XML, JSON, and nested JSON, which are text-based files with no rigid schema.

Benefits associated with JSON?

  1. Once you change a column in a JSON, an ETL using that modified JSON based on the original file schema will NOT break.

Challenges associated with JSON-like files?

JSON files may contain highly nested and complex data structures, which can make parsing a world of pain.
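
As a small illustration (the record below is made up), here is one way to flatten a nested JSON object in plain Python; every extra level of nesting is more parsing code you have to write and maintain.

import json

raw = '{"user": {"id": 7, "address": {"city": "Tel Aviv", "zip": "61000"}}, "events": [{"type": "click"}]}'
record = json.loads(raw)

def flatten(obj, prefix=''):
    """Recursively flatten nested dicts into 'a.b.c' style keys."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        else:
            flat[name] = value
    return flat

print(flatten(record))
# {'user.id': 7, 'user.address.city': 'Tel Aviv', 'user.address.zip': '61000', 'events': [{'type': 'click'}]}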

What is Unstructured Data?

Pictures, movies, audio files – any binary file (not readable by a text editor).

You can't simply insert binary files into most RDBMS systems.


Big Data in the cloud vs Big Data in a datacenter? When do I choose a datacenter for Big Data workloads and when do I choose the cloud?

You can do big data in the cloud and in a data center; both can be cheap, fast, and simple. In the past, the only option was to build your Hadoop cluster in a datacenter and accumulate the data there forever.

Nowadays there are good cloud ecosystems with different Big Data PaaS solutions designed to help you manage Big Data at scale. The common problem is not performance or simplicity; the problem is how to manage costs in the cloud. In a data center it is simple: you buy a server and you decommission it when it is DEAD, after a minimum of 5 years (vendor warranty can usually be extended to 5 years). A typical server can last 7 to 10 years, depending on the average utilization.

Basically, there are advantages and disadvantages to each decision. It really depends on the use case.

Advantages of using a data center for your big data?

  1. Predictable costs.
  2. Can be much cheaper than the cloud if you take into account a large time frame, such as three years and above.

Disadvantages of using a data center for big data?

  1. Hard to get started; building a DC takes time, and ordering servers takes weeks.
  2. When there is a problem with a server, like a faulty disk, physical access is required.
  3. Designing a datacenter for high availability is hard.
  4. Designing a datacenter for disaster recovery is harder. Imagine you have a PB of data: you need to manage offsite storage and a dedicated private network simply to keep a copy of data nobody will use unless there is a disaster like an earthquake.
  5. Estimating future traffic is painful. You have to design for peak, and this means idle resources and a lot of waste during non-rush hours.

Use datacenter  for big data when:

  1. You already have a datacenter in the organization and a dedicated ops team with a fast turnaround.
  2. You need highly custom hardware.
  3. You have a country regulation that forbids you from taking the data out of the country.
  4. You are planning to build your own private cloud (because you have to; self service is a MUST these days).

Use Cloud for Big Data when:

  1. You are not sure about the project's success rate.
  2. You are getting started with small data and are unsure about the data growth rate.
  3. Time to market is a company-wide priority.
  4. Running a datacenter is not part of your company's business.
  5. You are idle most of the day and have rush hours for only a short period; auto scaling in the cloud can reduce your costs.
  6. By the way… costs decrease by roughly 5% each year: over time, simply by upgrading to the newest machines, you get more for less with the same amount of resources.

Which Cloud vendor should you choose for Big Data?

Wow… talk about the million-dollar question. The truth is that it really, really, really… depends on the use case – don't hate me, hate the game. I can compare cloud vendors on the technical level, I can compare the market share of each of the cloud vendors and so on, but each company and each industry has its own challenges. Each industry may benefit from a different cloud provider.

I would use the following rules of thumb for selecting a cloud vendor for Big Data applications:

  1. Does this cloud vendor have the managed services (PaaS) you need for your applications?
    1. OLTP databases
    2. DW solutions
    3. Big Data solutions such as messaging and streaming
    4. Storage – yes, a managed storage solution has many features that are useful for you, such as cross-region replication and cross-account/project bucket sharing.
  2. Do you need a Hadoop distribution? If so, do a technical deep dive into the feature list and flexibility of this specific cloud vendor's managed service, such as:
    1. The list of open source technologies supported
    2. The list of web clients such as Hue/Zeppelin etc.
    3. Auto scaling features?
    4. Master node availability?
    5. Transient cluster bootstrapping?
  3. Get a strong sense of how good the support team is for the specific Big Data managed services you are going to use.

What is ACID?

  • Atomic – all or nothing: either the transaction completed or it did not.
  • Consistency – once a transaction is committed, the data must conform with the constraints of the schema.
    • Requires
      • A well-defined schema (an anti-pattern for NoSQL). Great for read-intensive and ad hoc queries.
      • Locking of data until the transaction is finished → a performance bottleneck.
    • Consistency use cases
      • "Strongly consistent": bank transactions must be ACCURATE.
      • "Eventually consistent": social media feed updates (not all users must see the same results at the same time).
  • Isolation – concurrent queries must run in isolation, thus ensuring the same result.
  • Durable – once a transaction is executed, the data will be preserved, e.g. through a power failure.

You should know that not every database is ACID-compliant, but that is not always important for your use case.
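
As a small illustration of atomicity, here is a sketch using SQLite (which is ACID-compliant); the table and amounts are made up. Either both sides of the bank transfer are committed, or neither is.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)')
conn.executemany('INSERT INTO accounts VALUES (?, ?)', [('alice', 100), ('bob', 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        # Simulate a crash before the matching credit is written.
        raise RuntimeError('simulated failure mid-transfer')
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, balances are unchanged.
print(dict(conn.execute('SELECT name, balance FROM accounts')))  # {'alice': 100, 'bob': 0}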


What is NoSQL database?

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

NoSQL antipattern – AD HOC querying.

Advantages of NoSQL databases?

  1. Non-relational and schema-less data model
  2. Low latency and high performance
  3. Highly scalable
  4. SQL queries are not well suited to the object-oriented data structures used in most applications now.
  5. Some application operations require multiple and/or very complex queries. In that case, data mapping and query generation complexity rises too much and becomes difficult to maintain on the application side.

What are the different NoSQL database types?

  • Key-value store.
  • Document store – e.g. JSON. Document stores are used for aggregate objects that have no shared complex data between them, and to quickly search or filter by some object properties (see the sketch after this list).
  • Column family. These are used for organizing data based on individual columns, where the actual data is used as a key to refer to the whole data collection. This particular approach is used for very large, scalable databases to greatly reduce the time spent searching for data.
  • Graph databases map more directly to object-oriented programming models and are faster for highly associative data sets and graph queries. Furthermore, they typically support ACID transaction properties in the same way as most RDBMSs.
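
As an example of the key-value/document style, here is a hedged sketch using DynamoDB via boto3; the table name, key, and attributes are hypothetical and assume a table whose partition key is user_id.

import boto3

# Hypothetical DynamoDB table with 'user_id' as its partition key.
dynamodb = boto3.resource('dynamodb', region_name='eu-west-1')
prefs = dynamodb.Table('user_preferences')

# Document-style write: the attribute set can differ from user to user,
# and nested attributes require no schema change.
prefs.put_item(Item={
    'user_id': '42',
    'language': 'en',
    'notifications': {'email': True, 'push': False},
})

# Key-value style read: fetch by the distinct key only.
item = prefs.get_item(Key={'user_id': '42'}).get('Item')
print(item)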

When do you normalize tables?

  1. To eliminate redundant data and utilize space efficiently.
  2. To reduce update errors.
  3. In situations where we are storing immutable data, such as financial transactions or a particular day's price list.

When should you NOT normalize tables, i.e. denormalize and use flat tables?

  1. When you can denormalize your schema to create flatter data structures that are easier to scale.
  2. When multiple joins are needed to produce a commonly accessed view (see the sketch after this list).
  3. When you have decided to go with database partitioning/sharding – you will likely end up resorting to denormalization.
  4. Normalization saves space, but space is cheap!
  5. Normalization simplifies updates, but reads are more common!
  6. Performance, performance, performance.
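
A small sketch of point 2, using SQLite and made-up tables: the normalized schema needs a join to answer a commonly accessed view, while the denormalized (flat) table answers the same question with a single scan.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    -- Normalized: users and orders live in separate tables.
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER);
    INSERT INTO users  VALUES (1, 'IL'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 50), (11, 2, 70);
    -- Denormalized: the country is copied onto every order row.
    CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY, country TEXT, amount INTEGER);
    INSERT INTO orders_flat VALUES (10, 'IL', 50), (11, 'US', 70);
''')

# Normalized: "revenue per country" requires a join.
print(conn.execute('''SELECT u.country, SUM(o.amount)
                      FROM orders o JOIN users u ON u.user_id = o.user_id
                      GROUP BY u.country''').fetchall())

# Denormalized: the same question is a single-table scan.
print(conn.execute('SELECT country, SUM(amount) FROM orders_flat GROUP BY country').fetchall())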

What are the common NoSQL use cases?

  1. Applications requiring horizontal scaling: entity-attribute-value = fetch by distinct value.
    1. A mobile app with millions of users – read/write per user.
    2. The equivalent in OLTP is sharding – VERY COMPLEX.
  2. User preferences, fetched by distinct value.
    1. The structure may change dynamically.
    2. Per user, do something.
  3. Low latency / session state.
    1. Ad tech and large-scale web servers – capturing cookie state.
    2. Gaming – game state.
  4. Thousands of requests per second (WRITE only).
    1. Reality TV – voting.
    2. Application monitoring and IoT – logs and JSON.

What is Hadoop?

The extremely brief and demystified answer: Hadoop is an ecosystem of open source solutions designed to solve a variety of big data problems.

There are several layers in a Hadoop cluster: compute, resource management, and storage.

  1. The compute layer, which contains applications such as Spark and Hive; they run on top of the Hadoop cluster (see the sketch after this list).
  2. Resource management solutions, such as YARN, that help distribute the workloads across the cluster's nodes.
  3. Storage solutions, such as the Hadoop Distributed File System (HDFS) and Kudu, designed to keep your big data.
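
To make the layers concrete, here is a hedged PySpark sketch (assuming a working Spark installation and a made-up HDFS path): Spark is the compute layer, YARN does the resource management, and HDFS is the storage layer.

from pyspark.sql import SparkSession

# Compute layer: a Spark application. Resource management: YARN ('yarn' master).
# Storage layer: files kept on HDFS. The path below is hypothetical.
spark = (SparkSession.builder
         .appName('hadoop-layers-demo')
         .master('yarn')
         .getOrCreate())

events = spark.read.json('hdfs:///data/lake/events/')  # read from the storage layer
events.createOrReplaceTempView('events')
spark.sql('SELECT country, COUNT(*) AS visits FROM events GROUP BY country').show()

spark.stop()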

What is the difference between Row Based databases and Columnar databases?

Example: MySQL is row based; Redshift is columnar.

For an operational database (a production database) you would use a row-based database, as the use case is simple: given a user ID, give me their attributes.

For analytics, use a columnar database, as the use case is different: you want to count how many users visited your website yesterday from a specific country.

Using the correct database for your use case has a tremendous impact on performance: when you need a whole row of data, a row-based database makes sense; when you need only a column, a row-based database does not make sense.

Furthermore, columnar databases usually have analytical SQL syntax and functions, like window functions, which are meaningless for operational databases.
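
Here is a toy sketch in plain Python of why the layout matters (the data is made up): the row layout is natural for fetching one user's whole record, while the columnar layout lets an analytical query touch only the columns it needs.

# Row oriented: each record is stored (and read) as a whole.
rows = [
    {'user_id': 1, 'country': 'IL', 'visited_yesterday': True},
    {'user_id': 2, 'country': 'US', 'visited_yesterday': False},
    {'user_id': 3, 'country': 'IL', 'visited_yesterday': True},
]

# Column oriented: each column is stored as its own contiguous array.
columns = {
    'user_id': [1, 2, 3],
    'country': ['IL', 'US', 'IL'],
    'visited_yesterday': [True, False, True],
}

# OLTP-style lookup: "give me all attributes of user 2" -- natural for rows.
print(next(r for r in rows if r['user_id'] == 2))

# Analytical query: "how many visitors from IL yesterday?" -- the columnar
# layout only scans two columns and can skip everything else.
print(sum(1 for c, v in zip(columns['country'], columns['visited_yesterday']) if c == 'IL' and v))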

Airflow job example | Cloud Composer (GCP)

I recently started integrating Airflow into one of my data pipelines. Since the learning curve is steep, each working example will be committed to GitHub and shown here. The problem with this WordPress template is that it is not flexible enough to show code properly, especially the indentation. I apologize for that. You are welcome to add/send me more examples.

For now, you can either view the example and indent it in your IDE, or use the examples I committed to our Git repository:

Big Data Demystified Git

Working example of running a query on BigQuery and saving the results into a new table

import datetime
import os
import logging

from airflow import DAG
from airflow import models
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

today_date = datetime.datetime.now().strftime('%Y%m%d')

# table_name = 'omid.test_results' + '$' + today_date
table_name = 'DATA_LAKE_INGESTION_US.Daily_Stats'

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it after waiting at least 5 minutes
    'retries': 0,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

with DAG(dag_id='Daily_Stats_Dag',
         # Continue to run DAG once per day
         schedule_interval=datetime.timedelta(days=1),
         default_args=default_dag_args) as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    logging.error('trying to bq_query: ')
    logging.error('table name: ' + table_name)
    sql = """ SELECT * FROM `omid.test1` """

    bq_query = BigQueryOperator(
        task_id='bq_query',
        bql=sql,
        destination_dataset_table=table_name,
        bigquery_conn_id='bigquery_default',
        use_legacy_sql=False,
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_IF_NEEDED',
        dag=dag
    )

    start >> bq_query >> end

Working example of loading data into a BigQuery table from Google Cloud Storage (GCS)

import datetime
import os
import logging

from airflow import models
from airflow.contrib.operators import bigquery_to_gcs
from airflow.contrib.operators import gcs_to_bq
from airflow.operators import dummy_operator
# from airflow.operators import BashOperator

# Import operator from plugins
from airflow.contrib.operators import gcs_to_gcs
from airflow.utils import trigger_rule

# Output file for job.
output_file = os.path.join(
    models.Variable.get('gcs_bucket'), 'android_reviews_file_transfer',
    datetime.datetime.now().strftime('%Y%m%d-%H%M%S')) + os.sep
# Path to GCS buckets. No need to add gs://
DST_BUCKET = 'pubsite_prod_rev/reviews'
DST_BUCKET_UTF8 = 'pubsite_prod_rev_ingestion/reviews_utf8'
# The source bucket is not in our project, and there are permission issues with IT.
# Using bash to copy the files to DST_BUCKET, and then encoding the files as UTF-8
# in order to load them to BQ (non-UTF-8 is not supported yet; may change in the future).

# SRC_BUCKET = 'pubsite_prod_rev'

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project')
}

with models.DAG(
        'android_reviews_load_to_bigQuery',
        # Continue to run DAG once per day
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    # Load from the UTF-8 GCS bucket into the BigQuery table of Android reviews.
    load_to_bq_from_gcs = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
        task_id='load_to_bq_from_gcs',
        source_objects=['*'],
        skip_leading_rows=1,
        write_disposition='WRITE_TRUNCATE',  # overwrite?
        bucket=DST_BUCKET_UTF8,
        destination_project_dataset_table='DATA.Replica_android_review'
    )

    # Define DAG dependencies.
    # create_dataproc_cluster >> run_dataproc_hadoop >> delete_dataproc_cluster
    load_to_bq_from_gcs

 

AWS Athena Error: Query exhausted resources at this scale factor

Athena is a serverless technology, i.e. it makes use of shared resources available within AWS. Hence, when a large number of queries is submitted concurrently by users around the world, resource exhaustion sometimes takes place. The Athena service team has identified this as a known issue.

However, this error is transient in nature. By that I mean that if you submit the query again, it might be successful. If you consistently get the same error, then you might need to partition your data and optimize the query further, as mentioned in the Performance Tuning Best Practices for Athena [1] and in another article in this blog about cost reduction, which in turn might reduce resource consumption [2].

You can find suggestions from the AWS support team below:

1) Avoid submitting queries at the beginning or end of an hour. If a query fails, back off exponentially by a few minutes and try to submit the query again (a sketch of this appears after the references below). [Weird, but that's an official answer…]
2) It is highly recommended to adopt the Amazon Athena best practices [1] to optimize your query and your data.
3) Use columnar-formatted data, which can drastically reduce resource consumption.

[1] Top 10 Performance Tuning Best Practices for Athena — https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

[2] AWS Athena cost reduction (might also reduce resource consumption)
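
Here is a hedged boto3 sketch of suggestion 1 (the database, query, and output bucket are placeholders): submit the query, and if it fails, back off exponentially and resubmit.

import time
import boto3

athena = boto3.client('athena', region_name='eu-west-1')

def run_query_with_backoff(sql, database, output, max_attempts=5):
    """Submit an Athena query; on failure, back off exponentially and retry."""
    for attempt in range(max_attempts):
        qid = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={'Database': database},
            ResultConfiguration={'OutputLocation': output},
        )['QueryExecutionId']

        # Poll until the query reaches a terminal state.
        while True:
            state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']
            if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
                break
            time.sleep(5)

        if state == 'SUCCEEDED':
            return qid
        # Transient errors such as "Query exhausted resources" may succeed on retry.
        time.sleep(2 ** attempt * 60)  # 1, 2, 4, ... minutes
    raise RuntimeError('query kept failing after retries')

# Placeholders -- replace with your own database, table and bucket.
run_query_with_backoff('SELECT COUNT(*) FROM sampledb.my_table',
                       'sampledb', 's3://my-athena-results/')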

Error: HIVE_PATH_ALREADY_EXISTS: Target directory for table | error while running CTAS in AWS Athena

The good news? CTAS in Athena exists.

https://docs.aws.amazon.com/athena/latest/ug/ctas.html

The bad news? The error messages still require some work to make them human-readable…

“HIVE_PATH_ALREADY_EXISTS: Target directory for table ‘sampledb.ctas_json_partitioned’ already exists: s3://myBucket/.”

In order to run the CTAS command in Athena, it is always preferable to use an explicit location. If you don't specify an explicit location, Athena uses this location by default:

s3://aws-athena-query-results-<account>-<region>/<query-name-or-unsaved>/<year>/<month>/<date>/<query-id>/

For example, suppose we are running the following CTAS query with the location 's3://mybucket/my_ctas_table/':

CREATE TABLE my_ctas_table WITH (external_location = 's3://mybucket/my_ctas_table/') AS SELECT …

 

In this example, we'd first want to check the location and make sure it is empty.

To check the location, use the AWS CLI command:

aws s3 ls s3://aws-athena-query-results-783591744239-eu-west-1/Unsaved/2019/02/03/tables/ --recursive
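
The same check can be scripted. A minimal boto3 sketch (the bucket and prefix are placeholders) that verifies the external_location is empty before running the CTAS:

import boto3

s3 = boto3.client('s3')

def location_is_empty(bucket, prefix):
    """Return True if no objects exist under s3://bucket/prefix."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp.get('KeyCount', 0) == 0

# Placeholders -- the CTAS external_location from the example above.
if location_is_empty('mybucket', 'my_ctas_table/'):
    print('Safe to run the CTAS query.')
else:
    print('Target directory already exists -- clean it up or choose another location.')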

GCP Big Data Demystified #1 | Investing.com Big Data Journey

 

How do you get started with Big Data? In the cloud? In a datacenter? What are the challenges? What architecture? Google Cloud or AWS? In this post, I will share with you the slides and the video from a meetup on 27.1.19 detailing the journey Investing.com has made to big data in the cloud.

GCP Big Data Demystified | Investing.com Big Data Journey | Part 1

The meetup took place on Sunday, Jan 27, 2019, 6:00 PM, at Investing.com, Hashelosha St 2, Tel Aviv-Yafo.

Agenda: 18:00 networking and gathering; 18:30 "GCP Big Data Demystified | Investing.com Big Data Journey", Omid Vahdaty; 19:15 "Nature writes the best algorithms", Yigal Goldfine.