Geeks With Blogs
Josh Reuben: March 2016 Entries

Pipelines QuickRef
When constructing Spark Machine Learning Pipelines, I find it really helpful to maintain a bird's-eye view of the various transformers and estimators. In a nutshell: fit trainingData (train a model), transform testData (predict with model).
Transformer: DataFrame => DataFrame
Estimator: DataFrame => Transformer
Transformers
Toke... sentence => words
RegexTokenizer: sentence => words - setPattern
HashingTF: terms => feature vectors based on frequency - setNumFeatures
StopWordsRemo... ......
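To make the fit/transform contract above concrete, here is a minimal pure-Python sketch. It is not Spark's API: rows are plain dicts rather than a DataFrame, `MajorityLabelEstimator` and `MajorityLabelModel` are hypothetical names invented for illustration, and CRC32 is an assumed stand-in for whatever hash function the real HashingTF uses.

```python
import re
import zlib
from collections import Counter


def regex_tokenize(sentence, pattern=r"\W+"):
    """sentence => words: split on a regex, lower-case, drop empty tokens."""
    return [w for w in re.split(pattern, sentence.lower()) if w]


def hashing_tf(terms, num_features=16):
    """terms => fixed-length term-frequency vector via the hashing trick."""
    vector = [0] * num_features
    for term in terms:
        # CRC32 is an assumption of this sketch, chosen because it is deterministic.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


class MajorityLabelModel:
    """Transformer: rows => rows, with a 'prediction' column added."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityLabelEstimator:
    """Estimator: rows => Transformer (a trivial majority-class model)."""

    def fit(self, rows):
        counts = Counter(row["label"] for row in rows)
        return MajorityLabelModel(counts.most_common(1)[0][0])


def featurize(rows):
    """Chain the two stateless transformers: text => words => features."""
    return [dict(r, features=hashing_tf(regex_tokenize(r["text"]))) for r in rows]


training_data = [
    {"text": "spark pipelines are neat", "label": 1},
    {"text": "hadoop map reduce", "label": 0},
    {"text": "spark ml transformers", "label": 1},
]
test_data = [{"text": "Spark, estimators!"}]

model = MajorityLabelEstimator().fit(featurize(training_data))  # train a model
predictions = model.transform(featurize(test_data))             # predict with model
```

The point of the sketch is the shape of the calls: tokenizing and hashing need no training, so they only transform, while the estimator must see trainingData before it can hand back something that transforms testData.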

Posted On Tuesday, March 22, 2016 7:05 AM

Enterprise Integration Patterns
System Architecture patterns
N-Tier
Event-Driven - Mediator / Broker
Microkernel
MicroServi... - MVC / MVP / MVVMserver - RPC / Remoting / WS / SOA / REST
Space-Based
SOA Patterns
Foundational Structural
Service Host - infra
Active Service - worker thread for upstream pre-fetch
Transactional Service
Workflow
Edge Component
QoS Patterns
Decoupled Invocation - Queues for reliability, bursts
Parallel Pipelines - Steps -> throughput
Gridable Service
Service Instance - multiple stateless copies, NLB
Virtual ......

Posted On Tuesday, March 22, 2016 7:02 AM

Hive - HQL query over MapReduce
Overview: Developed by Facebook, Hive is a SQL-like framework for data warehousing on top of MapReduce over HDFS; HiveQL is its query language. It converts a SQL query into a series of jobs for execution on a Hadoop cluster. It organizes HDFS data into tables, attaching structure. Schema on read versus schema on write: Hive doesn't verify the data when it is loaded, but rather when a query is issued. Full-table scans are the norm, and a table update is achieved by transforming the data into a new table. HDFS does not provide in-place file ......
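The schema-on-read idea can be illustrated without Hive at all. The following is a toy pure-Python sketch, not HiveQL or Hive's implementation: the sample lines, the `SCHEMA` list and the helper names are all hypothetical. Loading stores the raw lines untouched, the schema is applied only when the data is read, and an "update" is expressed as deriving a new table. Turning unparseable fields into None is my understanding of how such fields surface as NULL at query time, so treat that detail as an assumption.

```python
from datetime import date

# Raw file contents: loading just stores the lines, nothing is validated.
raw_lines = [
    "1,alice,2016-03-22",
    "two,bob,2016-03-16",   # id is not an integer
    "3,carol,not-a-date",   # date is malformed
]


def parse_date(text):
    year, month, day = text.split("-")
    return date(int(year), int(month), int(day))


# Hypothetical table schema: (column name, parser) pairs.
SCHEMA = [("id", int), ("name", str), ("posted", parse_date)]


def load(lines):
    """Load step: no checks, the data is kept exactly as written."""
    return list(lines)


def read_with_schema(stored, schema):
    """Query step: apply the schema per row; unparseable fields become None."""
    for line in stored:
        row = {}
        for (column, cast), raw in zip(schema, line.split(",")):
            try:
                row[column] = cast(raw)
            except ValueError:
                row[column] = None
        yield row


table = load(raw_lines)                       # succeeds despite the bad rows
rows = list(read_with_schema(table, SCHEMA))  # problems only show up here

# "Update" by transformation: derive a new table instead of editing in place.
cleaned = [row for row in rows if None not in row.values()]
```

With schema on write, the two malformed lines would have been rejected at load time; here the load is cheap and the cost of interpretation is paid on every read.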

Posted On Tuesday, March 22, 2016 5:32 AM

Big Data File Format Zoo
Big Data has a plethora of data file formats, and it's important to understand their strengths and weaknesses. Most explorers start out with some NoSQL-exported JSON data. However, specialized data structures are required, because putting each blob of binary data into its own file just doesn't scale across a distributed filesystem. TL;DR: choose Parquet! Row-oriented file formats (Sequence, Avro): best when a large number of columns of a single row are needed for processing at the same time. General-purpose, ......
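The row-oriented trade-off can be sketched in plain Python by laying the same records out both ways. This is only an in-memory analogy, not how Avro, SequenceFile or Parquet actually encode data, and the sample records and function names are hypothetical. The "touched" counter is a rough proxy for how many values a reader has to pass over.

```python
records = [
    {"user": "a", "country": "IL", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 5},
    {"user": "c", "country": "IL", "clicks": 7},
]
columns = ["user", "country", "clicks"]

# Row-oriented: all values of one record sit together.
row_layout = [[rec[c] for c in columns] for rec in records]

# Column-oriented: all values of one column sit together.
column_layout = {c: [rec[c] for rec in records] for c in columns}


def total_clicks_rows(rows):
    """Walks whole records, passing over every value to use one column."""
    click_index = columns.index("clicks")
    total, touched = 0, 0
    for row in rows:
        touched += len(row)
        total += row[click_index]
    return total, touched


def total_clicks_columns(cols):
    """Reads only the 'clicks' column."""
    values = cols["clicks"]
    return sum(values), len(values)
```

Fetching one complete record is a single lookup in `row_layout`, which is the case row-oriented formats are best at; aggregating a single column passes over three times fewer values in `column_layout`, which is the case a columnar format such as Parquet is built for.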

Posted On Wednesday, March 16, 2016 6:15 AM

Copyright © JoshReuben