Thursday, August 13, 2015

YARN and MapReduce Memory and Heap Size


Reference: 

1. Determine YARN and MapReduce Memory Configuration Settings 






Looks like Container is a key to manage Memory\CUP\Desk. A MapReduce task has a specific (built in) ApplicationMaster (called MRAppMaster), which runs on a specific container, called container 0, and each mapper and reducer runs on its own container to be accurate! The MRAppMaster allocates these containers for the mappers and reducers as it sees fit.

YARN’s Resource Manager does resource allocation, but it does not specify the heap settings of any spawned processes. For example, the memory provided to a mapper process by the NodeManager is specified to YARN via the per-job mapreduce.map.memory.mb. However, the JVM itself has memory overhead. So the heap size available to the mapper is determined by settings such as mapred.[map|reduce].child.java.opts, which specifies the heap size for the launched Java process. When resources allocated by YARN don’t match the resources consumed by the MR2 job, you’ll see either cluster underutilization or jobs being killed, depending on the direction of the mismatch.




What is a container in YARN?  (hortonworks doc)


python yarn-utils.py -c 16 -m 64 -d 4 -k True

Using cores=16 memory=64GB disks=4 hbase=True
 Profile: cores=16 memory=64GB reserved=16GB usableMem=48GB disks=4
 Num Container=32
 Container Ram=1536MB
 Used Ram=48GB
 Unused Ram=16GB
 yarn.scheduler.minimum-allocation-mb=1536
 yarn.scheduler.maximum-allocation-mb=49152
 yarn.nodemanager.resource.memory-mb=49152
 mapreduce.map.memory.mb=1536
 mapreduce.map.java.opts=-Xmx1228m
 mapreduce.reduce.memory.mb=3072
 mapreduce.reduce.java.opts=-Xmx2457m
 yarn.app.mapreduce.am.resource.mb=3072
 yarn.app.mapreduce.am.command-opts=-Xmx2457m

 mapreduce.task.io.sort.mb=614


Reference
java.lang.OutOfMemoryError: Java heap space with hive  (stackoverflow)

hive -hiveconf mapreduce.map.memory.mb=4096 -hiveconf mapreduce.reduce.memory.mb=5120 -e "select count(*) from test_table;"

export HIVE_OPTS="-hiveconf mapreduce.map.memory.mb=4096 -hiveconf mapreduce.reduce.memory.mb=5120"

hive> set mapreduce.map.memory.mb=4096;
hive> set mapreduce.reduce.memory.mb=5120;

hive> select count(*) from test_table;


Other references:
1. YARN: Avoiding 6 Time-Consuming "Gotchas"
2. Tuning the heap size of map and reduce tasks 



HDInsight Throughput




The number of nodes required to trigger throttling by Windows Azure Storage depends on whether geo-replication is enabled for the storage account (because bandwidth allocation is different for each case). If geo-replication is enabled, clusters with more than 7 nodes may encounter throttling. If geo-replication is not enabled, clusters with more than 10 nodes may encounter throttling.The number of nodes required to trigger throttling by Windows Azure Storage depends on whether geo-replication is enabled for the storage account (because bandwidth allocation is different for each case). If geo-replication is enabled, clusters with more than 7 nodes may encounter throttling. If geo-replication is not enabled, clusters with more than 10 nodes may encounter throttling.

Bandwidth for a Geo Redundant storage account (geo-replication on)
    Ingress - up to 5 gigabits per second
    Egress - up to 10 gigabits per second
Bandwidth for a Locally Redundant storage account (geo-replication off)
   Ingress - up to 10 gigabits per second

   Egress - up to 15 gigabits per second

Setting in core-site.xml

fs.azure.selfthrottling.read.factor
fs.azure.selfthrottling.write.factor


References:
1. Maximizing HDInsight throughput to Azure Blob Storage by MS Brian Swan  

Monday, July 13, 2015

Hive in Motion: IIS dealing --- basic performance

The HIVE script does 3 things

1. create staging HIVE table\import data under one container
2. import data from staging to another HIVE table w/ partition on Year and Month
3. Query partitioned HIVE table: select all


Case 1
8.3 GB in Size, 32,061 files, 37,585,495 rows 4 A3 Nodes
30 mins for #1, 30 mins for #2 and 2 mins for #3 on 37M rows
roughly 10 mins computing time, about 20+ from start to finish. 

Case 2
485.2GB in Size, 36,743 files, 567,835,204 rows on 40 A3 Nodes
11+ hrs for both #1 and #2, 3 hrs for #3,
roughly 32 mins computing time, about 5 hrs from start to finish

Case 3
11.72GB in Size, 28,157 files, 47,831,164 on 8 A3 Nodes
roughly 42 mins #1 and #2, 4 mins for count(*), about 20+ from start to finish


HDInsight Hive in Motion: failed and works

A list from what I experienced:


1.  "Can not create a Path from an empty string"

I have my data set under Azure Container, have encountered such error, found out

will error out if
LOCATION 'wasb://[containername]@prodiis001.blob.core.windows.net'

works with "/" at end
LOCATION 'wasb://[containername]@prodiis001.blob.core.windows.net/'

Friday, July 10, 2015

SQL Table Row Size

from here
http://dba.stackexchange.com/questions/66471/script-to-estimate-row-sizes-for-any-table

SELECT 
     -- record cannot be smaller than the forwarding stub size =9 Bytes
     CASE WHEN [Max Size]>=9
     THEN [Max Size]
     ELSE 9 
     END AS [Max Size]
     -- record cannot be smaller than the forwarding stub size =9 Bytes
    , CASE WHEN [Min Size]>=9
     THEN [Min Size]
     ELSE 9 
     END AS [Min Size]
    , [Table Name]
    , [Table Type]
    , [Total Number of Columns]
    , [Schema]
    FROM
    (
SELECT 
    DISTINCT 
    -- Overhead for row header of a data row
    4
    +
    -- Overhead for NULL bitmap
    2+cast(([Total Number of Columns]+7)/8 AS BIGINT)+
    -- overhead for variable length
    CASE WHEN [IsVariableLength]>0
    THEN
    2
    ELSE
    0
    END
    +  
    --- Sum is on record level
    SUM(
    a1.[max_length]
    +
    -- Overhead for variable-length columns
    CASE WHEN 
    -- varchar
    [System Type]='varchar'
    --(([system_type_id]=167) AND ([user_type_id]=167))
    OR
    -- nvarchar 
    [System Type]='nvarchar'
    --(([system_type_id]=231) AND ([user_type_id]=231))
    OR
    -- IMAGE
    (([system_type_id]=34) OR ([user_type_id]=34))
    OR
    -- TEXT
    (([system_type_id]=35) OR ([user_type_id]=35))
    OR 
    --  NTEXT
    (([system_type_id]=99) OR ([user_type_id]=99))
    OR 
    --  SQLVARIANT
    (([system_type_id]=98) OR ([user_type_id]=98))
    OR
    -- hierarchyid geometry geography
    (([system_type_id]=240))
    THEN 2
    ELSE 0
    END
    )
    OVER (PARTITION BY a1.[Schema], a1.[Table Name]) AS [Max Size]

    , -- Overhead for row header of a data row
    4
    +
    -- Overhead for NULL bitmap
    2+cast(([Total Number of Columns]+7)/8 AS BIGINT)+
    -- overhead for variable length
    CASE WHEN ([IsVariableLength]>0) AND ([AnyFixedColumn]=0)
    THEN
    2
    ELSE
    0
    END
    +
    --- Sum is on record level
    SUM(
    -- overhead for variable length depending on number of variable columns
    CASE WHEN 
    -- varchar
    --[System Type]='varchar'
    (([system_type_id]=167) OR ([user_type_id]=167))
    OR
    -- nvarchar 
    --[System Type]='nvarchar'
    (([system_type_id]=231) OR ([user_type_id]=231))
    OR
    -- IMAGE
    (([system_type_id]=34) OR ([user_type_id]=34))
    OR
    -- TEXT
    (([system_type_id]=35) OR ([user_type_id]=35))
    OR 
    --  NTEXT
    (([system_type_id]=99) OR ([user_type_id]=99))
    --  VARBINARY
    OR
    (([system_type_id]=165) OR ([user_type_id]=165))
    OR 
    --  SQLVARIANT
    (([system_type_id]=98) OR ([user_type_id]=98))
    OR
    -- hierarchyid geometry geography
    (([system_type_id]=240))
    OR
    -- xml
    (([system_type_id]=241))
    THEN
        CASE WHEN [Is Nullable]=1
        THEN 0 
        ELSE 
        1
        END
    ELSE
        CASE
        WHEN
        -- bit
        (([system_type_id]=104) OR ([user_type_id]=104))
        and [Is Nullable]=1
        THEN 0
        ELSE
        a1.[max_length]
        END
    END
    -- 

    )
    OVER (PARTITION BY a1.[Schema], a1.[Table Name]) AS [Min Size]
    , a1.[Table Name]
    , [Table Type]
    , [Total Number of Columns]
    , a1.[Schema]
    FROM
    -- Start a1
    (   SELECT
        (SELECT [name] FROM [sys].[schemas]
        WHERE [sys].[schemas].[schema_id]=[sys].[objects].[schema_id])
        AS [Schema]
        , [sys].[objects].[name] AS [Table Name]
        , [sys].[all_columns].[name] AS [Column Name]
        , [sys].[all_columns].[system_type_id]
        , (
            SELECT name FROM [sys].[types]
            WHERE [sys].[types].[system_type_id]=[sys].[all_columns].[system_type_id]
            AND
                    [sys].[types].[user_type_id]=[sys].[all_columns].[user_type_id]
            ) AS [System Type]
        , [sys].[all_columns].[user_type_id]
        , 
        CASE 
        WHEN 
        -- IMAGE
        (([system_type_id]=34) OR ([user_type_id]=34))
        THEN 2147483647
        -- TEXT
        WHEN (([system_type_id]=35) OR ([user_type_id]=35))
        THEN 2147483647
        --  NTEXT
        WHEN (([system_type_id]=99) OR ([user_type_id]=99))
        THEN 1073741823
        -- varchar(max)
        WHEN (([system_type_id]=167) OR ([user_type_id]=167)) AND ([sys].[all_columns].[max_length]=-1)
        THEN 2147483647
        -- nvarchar(max) 
        WHEN (([system_type_id]=231) OR ([user_type_id]=231)) AND ([sys].[all_columns].[max_length]=-1)
        THEN 1073741823
        -- varbinary(max)
        WHEN (([system_type_id]=165) OR ([user_type_id]=165)) AND ([sys].[all_columns].[max_length]=-1)
        THEN 2147483647
        -- hierarchyid geometry geography
        WHEN (([system_type_id]=240))
        THEN 2147483647
        -- xml
        WHEN (([system_type_id]=241) AND ([sys].[all_columns].[max_length]=-1))
        THEN 2147483647
        -- bit
        WHEN (([system_type_id]=104) OR ([user_type_id]=104))
        THEN 1/8    
        ELSE 
        CAST([sys].[all_columns].[max_length] AS BIGINT)
        END [max_length]
        , [sys].[all_columns].[is_nullable] AS [Is Nullable]
        , 
        CASE 
        WHEN EXISTS 
            (   
                SELECT type_desc FROM sys.indexes
                WHERE type_desc='CLUSTERED'
                AND [sys].[objects].[object_id]=[sys].[indexes].[object_id]
            )
        THEN 'CLUSTERED'
        ELSE 'HEAP'
        END AS [Table Type]
        , COUNT([sys].[all_columns].[name]) OVER (PARTITION BY [sys].[objects].[object_id]) AS [Total Number of Columns]
        ,SUM (CASE WHEN 
        -- varchar
        (
        (([system_type_id]=167) AND ([user_type_id]=167))
        OR
        -- nvarchar 
        (([system_type_id]=231) AND ([user_type_id]=231))
        )
        AND [sys].[all_columns].[is_nullable]=0
        THEN 1
        ELSE 0
        END) OVER (PARTITION BY [sys].[objects].[name]) AS [IsNonNullableVariableLength]
        ,SUM (
        CASE WHEN 
        -- varchar
        (([system_type_id]=167) OR ([user_type_id]=167))
        OR
        -- nvarchar 
        (([system_type_id]=231) OR ([user_type_id]=231))
        OR
        -- IMAGE
        (([system_type_id]=34) OR ([user_type_id]=34))
        OR
        -- TEXT
        (([system_type_id]=35) OR ([user_type_id]=35))
        OR 
        --  NTEXT
        (([system_type_id]=99) OR ([user_type_id]=99))
        --  VARBINARY
        OR
        (([system_type_id]=165) OR ([user_type_id]=165))
        OR 
        --  SQLVARIANT
        (([system_type_id]=98) OR ([user_type_id]=98))
        OR
        -- hierarchyid geometry geography
        (([system_type_id]=240))        
        OR
        -- xml
        (([system_type_id]=241))    
        THEN 1
        ELSE 0
        END) OVER (PARTITION BY [sys].[objects].[name]) 
        AS [IsVariableLength]
        ,SUM (
        CASE WHEN 
        -- varchar
        (([system_type_id]=167) OR ([user_type_id]=167))
        OR
        -- nvarchar 
        (([system_type_id]=231) OR ([user_type_id]=231))
        OR
        -- IMAGE
        (([system_type_id]=34) OR ([user_type_id]=34))
        OR
        -- TEXT
        (([system_type_id]=35) OR ([user_type_id]=35))
        OR 
        --  NTEXT
        (([system_type_id]=99) OR ([user_type_id]=99))
        --  VARBINARY
        OR
        (([system_type_id]=165) OR ([user_type_id]=165))
        OR 
        --  SQLVARIANT
        (([system_type_id]=98) OR ([user_type_id]=98))
        OR
        -- hierarchyid geometry geography
        (([system_type_id]=240))    
        OR
        -- xml
        (([system_type_id]=241))    
        THEN 0
        ELSE 1
        END) OVER (PARTITION BY [sys].[objects].[name]) 
        AS [AnyFixedColumn]
    FROM [sys].[objects]
    INNER JOIN sys.all_columns
    ON [sys].[objects].[object_id]=[sys].[all_columns].[object_id]
    WHERE type_desc='USER_TABLE'
    ) a1
    ) a2