Practical part.

*Arkadiusz Iwaszko - NIIT Limited*

Data Mining Courses

Code | Name | Duration | Course description |
---|---|---|---|

wdneo4j | Introduction to Neo4j - a graph database | 7 hrs | Introduction to Neo4j Installation and configuration Neo4j application structure Relational and graph approaches to representing data The graph data model Can and should the problem be represented as a graph? Selected use cases and modeling of a selected problem Key concepts of the Neo4j graph model: Node Relationship Property Label The Cypher query language and graph operations Creating and managing the schema with Cypher CRUD operations on data Cypher queries and their SQL equivalents Graph algorithms used in Neo4j REST interface Basic administration topics Creating and restoring backups Managing the database from the browser Importing and exporting data in universal formats |

BigData_ | A practical introduction to Data Analysis and Big Data | 28 hrs | Participants who complete this training will gain a practical, real-world understanding of Big Data and its related technologies, methodologies and tools. Participants will have the opportunity to put this knowledge into practice through hands-on exercises. Group interaction and instructor feedback make up an important component of the class. The course starts with an introduction to elemental concepts of Big Data, then progresses into the programming languages and methodologies used to perform Data Analysis. Finally, we discuss the tools and infrastructure that enable Big Data storage, Distributed Processing, and Scalability. Audience Developers / programmers IT consultants Format of the course Part lecture, part discussion, heavy hands-on practice and implementation, occasional quizzes to measure progress. Introduction to Data Analysis and Big Data What makes Big Data "big"? Velocity, Volume, Variety, Veracity (VVVV) Limits to traditional Data Processing Distributed Processing Statistical Analysis Types of Machine Learning Analysis Data Visualization Languages used for Data Analysis R language (crash course) Why R for Data Analysis? Data manipulation, calculation and graphical display Python (crash course) Why Python for Data Analysis? Manipulating, processing, cleaning, and crunching data Approaches to Data Analysis Statistical Analysis Time Series analysis Forecasting with Correlation and Regression models Inferential Statistics (estimating) Descriptive Statistics in Big Data sets (e.g. calculating mean) Machine Learning Supervised vs unsupervised learning Classification and clustering Estimating cost of specific methods Filtering Natural Language Processing Processing text Understanding the meaning of the text Automatic text generation Sentiment/Topic Analysis Computer Vision Acquiring, processing, analyzing, and understanding images Reconstructing, interpreting and understanding 3D scenes Using image data to make decisions Big Data infrastructure Data Storage Relational databases (SQL) MySQL Postgres Oracle Non-relational databases (NoSQL) Cassandra MongoDB Neo4j Understanding the nuances Hierarchical databases Object-oriented databases Document-oriented databases Graph-oriented databases Other Distributed Processing Hadoop HDFS as a distributed filesystem MapReduce for distributed processing Spark All-in-one in-memory cluster computing framework for large-scale data processing Structured streaming Spark SQL Machine Learning libraries: MLlib Graph processing with GraphX Search Engines ElasticSearch Solr Scalability Public cloud AWS, Google, Aliyun, etc. Private cloud OpenStack, Cloud Foundry, etc. Auto-scalability Choosing the right solution for the problem The future of Big Data Closing remarks |

datashrinkgov | Data Shrinkage for Government | 14 hrs | Why shrink data Relational databases Introduction Aggregation and disaggregation Normalisation and denormalisation Null values and zeroes Joining data Complex joins Cluster analysis Applications Strengths and weaknesses Measuring distance Hierarchical clustering K-means and derivatives Applications in Government Factor analysis Concepts Exploratory factor analysis Confirmatory factor analysis Principal component analysis Correspondence analysis Software Applications in Government Predictive analytics Timelines and naming conventions Holdout samples Weights of evidence Information value Scorecard building demonstration using a spreadsheet Regression in predictive analytics Logistic regression in predictive analytics Decision Trees in predictive analytics Neural networks Measuring accuracy Applications in Government |

ApHadm1 | Apache Hadoop: Manipulation and Transformation of Data Performance | 21 hrs | This course is intended for developers, architects, data scientists or any profile that requires access to data either intensively or on a regular basis. The major focus of the course is data manipulation and transformation. Among the tools in the Hadoop ecosystem, this course includes the use of Pig and Hive, both of which are heavily used for data transformation and manipulation. This training also addresses performance metrics and performance optimisation. The course is entirely hands-on and is punctuated by presentations of the theoretical aspects. 1.1 Hadoop Concepts 1.1.1 HDFS The Design of HDFS Command line interface Hadoop File System 1.1.2 Clusters Anatomy of a cluster Master Node / Slave node Name Node / Data Node 1.2 Data Manipulation 1.2.1 MapReduce detailed Map phase Reduce phase Shuffle 1.2.2 Analytics with Map Reduce Group-By with MapReduce Frequency distributions and sorting with MapReduce Plotting results (GNU Plot) Histograms with MapReduce Scatter plots with MapReduce Parsing complex datasets Counting with MapReduce and Combiners Build reports 1.2.3 Data Cleansing Document Cleaning Fuzzy string search Record linkage / data deduplication Transform and sort event dates Validate source reliability Trim Outliers 1.2.4 Extracting and Transforming Data Transforming logs Using Apache Pig to filter Using Apache Pig to sort Using Apache Pig to sessionize 1.2.5 Advanced Joins Joining data in the Mapper using MapReduce Joining data using Apache Pig replicated join Joining sorted data using Apache Pig merge join Joining skewed data using Apache Pig skewed join Using a map-side join in Apache Hive Using optimized full outer joins in Apache Hive Joining data using an external key value store 1.3 Performance Diagnosis and Optimization Techniques Map Investigating spikes in input data Identifying map-side data skew problems Map task throughput Small files Unsplittable files Reduce Too 
few or too many reducers Reduce-side data skew problems Reduce tasks throughput Slow shuffle and sort Competing jobs and scheduler throttling Stack dumps & unoptimized code Hardware failures CPU contention Tasks Extracting and visualizing task execution times Profiling your map and reduce tasks Avoid the reducer Filter and project Using the combiner Fast sorting with comparators Collecting skewed data Reduce skew mitigation |

matlab2 | MATLAB Fundamental | 21 hrs | This three-day course provides a comprehensive introduction to the MATLAB technical computing environment. The course is intended for beginning users and those looking for a review. No prior programming experience or knowledge of MATLAB is assumed. Themes of data analysis, visualization, modeling, and programming are explored throughout the course. Topics include: Working with the MATLAB user interface Entering commands and creating variables Analyzing vectors and matrices Visualizing vector and matrix data Working with data files Working with data types Automating commands with scripts Writing programs with logic and flow control Writing functions Part 1 A Brief Introduction to MATLAB Objectives: Offer an overview of what MATLAB is, what it consists of, and what it can do for you An Example: C vs. MATLAB MATLAB Product Overview MATLAB Application Fields What can MATLAB do for you? The Course Outline Working with the MATLAB User Interface Objective: Get an introduction to the main features of the MATLAB integrated design environment and its user interfaces. Get an overview of course themes. MATLAB Interface Reading data from file Saving and loading variables Plotting data Customizing plots Calculating statistics and best-fit line Exporting graphics for use in other applications Variables and Expressions Objective: Enter MATLAB commands, with an emphasis on creating and accessing data in variables. Entering commands Creating variables Getting help Accessing and modifying values in variables Creating character variables Analysis and Visualization with Vectors Objective: Perform mathematical and statistical calculations with vectors, and create basic visualizations. See how MATLAB syntax enables calculations on whole data sets with a single command. 
Calculations with vectors Plotting vectors Basic plot options Annotating plots Analysis and Visualization with Matrices Objective: Use matrices as mathematical objects or as collections of (vector) data. Understand the appropriate use of MATLAB syntax to distinguish between these applications. Size and dimensionality Calculations with matrices Statistics with matrix data Plotting multiple columns Reshaping and linear indexing Multidimensional arrays Part 2 Automating Commands with Scripts Objective: Collect MATLAB commands into scripts for ease of reproduction and experimentation. As the complexity of your tasks increases, entering long sequences of commands in the Command Window becomes impractical. A Modelling Example The Command History Creating script files Running scripts Comments and Code Cells Publishing scripts Working with Data Files Objective: Bring data into MATLAB from formatted files. Because imported data can be of a wide variety of types and formats, emphasis is given to working with cell arrays and date formats. Importing data Mixed data types Cell arrays Conversions amongst numerals, strings, and cells Exporting data Multiple Vector Plots Objective: Make more complex vector plots, such as multiple plots, and use color and string manipulation techniques to produce eye-catching visual representations of data. Graphics structure Multiple figures, axes, and plots Plotting equations Using color Customizing plots Logic and Flow Control Objective: Use logical operations, variables, and indexing techniques to create flexible code that can make decisions and adapt to different situations. Explore other programming constructs for repeating sections of code, and constructs that allow interaction with the user. Logical operations and variables Logical indexing Programming constructs Flow control Loops Matrix and Image Visualization Objective: Visualize images and matrix data in two or three dimensions. 
Explore the difference in displaying images and visualizing matrix data using images. Scattered Interpolation using vector and matrix data 3-D matrix visualization 2-D matrix visualization Indexed images and colormaps True color images Part 3 Data Analysis Objective: Perform typical data analysis tasks in MATLAB, including developing and fitting theoretical models to real-life data. This leads naturally to one of the most powerful features of MATLAB: solving linear systems of equations with a single command. Dealing with missing data Correlation Smoothing Spectral analysis and FFTs Solving linear systems of equations Writing Functions Objective: Increase automation by encapsulating modular tasks as user-defined functions. Understand how MATLAB resolves references to files and variables. Why functions? Creating functions Adding comments Calling subfunctions Workspaces Subfunctions Path and precedence Data Types Objective: Explore data types, focusing on the syntax for creating variables and accessing array elements, and discuss methods for converting among data types. Data types differ in the kind of data they may contain and the way the data is organized. MATLAB data types Integers Structures Converting types File I/O Objective: Explore the low-level data import and export functions in MATLAB that allow precise control over text and binary file I/O. These functions include textscan, which provides precise control of reading text files. Opening and closing files Reading and writing text files Reading and writing binary files Conclusion Objectives: Summarise what we have learnt A summary of the course Other upcoming courses on MATLAB Note that the course as delivered may differ slightly from the outline above without prior notice. |

pdi2 | Pentaho Data Integration (PDI) - ETL data processing module (intermediate level) | 14 hrs | Pentaho is a product distributed under an Open Source license that provides a full range of Business Intelligence solutions for business, including reporting, data analysis, management dashboards and data integration. With the Pentaho platform, individual business units gain access to a wide range of valuable information, from analyses of sales and of the profitability of individual customers or products, through reporting for HR and finance departments, to summary information for senior management. The training is aimed at programmers, architects and application administrators who want to build or maintain extract, transform and load (ETL) processes using Pentaho Data Integration (PDI). After the training, participants will have acquired skills in: installing and configuring the Pentaho environment, designing, implementing, monitoring, running and tuning ETL processes, working with data in PDI, loading various data types and data formats, filtering, grouping and joining data, scheduling jobs, running transformations. The course is designed to take participants from the basic to the intermediate level. Day one Installing and configuring Pentaho Data Integration Creating a repository Getting to know the Spoon user interface Creating transformations Reading from and writing to files Working with databases (SQL query builder) Filtering, grouping and joining data Working with XLS Creating jobs Defining parameters and variables Day two Data versioning (handling validity periods) Database transactionality in transformations Using JavaScript Mapping transformations Data type conversion and column order in the stream Logging of processing Running transformations and jobs from the command line (kitchen.bat, pan.bat) Scheduling jobs Running transformations in parallel |

bdhat | Big Data Hadoop Analyst Training | 28 hrs | Big Data Analyst Training is a practical course recommended for anyone who wants to become an expert Data Scientist in the future. The course focuses on the aspects needed for the work of a modern analyst using Big Data technology. During the course, tools are presented that make it possible to access, modify, transform and analyze complex data structures stored in a Hadoop cluster. The course covers topics within the Hadoop Ecosystem (Pig, Hive, Impala, ELK and others). The functionality of the Pig, Hive, Impala and ELK tools for collecting data, storing results and analytics. How Pig, Hive and Impala can improve the performance of typical, everyday analytical tasks. Performing real-time interactive analyses of huge data sets to obtain valuable insights for the business, and how to interpret the conclusions. Running complex queries on very large volumes of data. Hadoop basics. Introduction to Pig. Basic data analysis with Pig. Processing complex data with Pig. Operations on multiple data sets with Pig. Troubleshooting and optimizing Pig. Introduction to Hive, Impala, ELK. Running queries in Hive, Impala, ELK. Managing data in Hive. Data storage and performance. Analyses with Hive and Impala. Working with Impala and ELK. Analysis of text and complex data types. Optimizing Hive, Pig, Impala, ELK. Interoperability and workflow. Questions, exercises, certification. |

pdi3 | Pentaho Data Integration (PDI) - ETL data processing module (advanced level) | 21 hrs | Pentaho is a product distributed under an Open Source license that provides a full range of Business Intelligence solutions for business, including reporting, data analysis, management dashboards and data integration. With the Pentaho platform, individual business units gain access to a wide range of valuable information, from analyses of sales and of the profitability of individual customers or products, through reporting for HR and finance departments, to summary information for senior management. The training is aimed at programmers, architects and application administrators who want to build or maintain extract, transform and load (ETL) processes using Pentaho Data Integration (PDI). After the training, participants will have acquired skills in: installing and configuring the Pentaho environment, designing, implementing, monitoring, running and tuning ETL processes, working with data in PDI, loading various data types and data formats, filtering, grouping and joining data, scheduling jobs, running transformations, creating clusters. The course is designed to take participants from the basic to the advanced level. Day one Installing and configuring Pentaho Data Integration Creating a repository Getting to know the Spoon user interface Creating transformations Reading from and writing to files Working with databases (SQL query builder) Filtering, grouping and joining data Working with XLS Day two Creating jobs Defining parameters and variables Data versioning (handling validity periods) Database transactionality in transformations Using JavaScript Mapping transformations Data type conversion and column order in the stream Logging of processing Day three Running transformations and jobs from the command line (kitchen.bat, pan.bat) Scheduling jobs Running transformations in parallel Remote execution (carte.bat) Creating clusters and partitioning Versioning and team work |

datavis1 | Data Visualization | 28 hrs | This course is intended for engineers and decision makers working in data mining and knowledge discovery. You will learn how to create effective plots and ways to present and represent your data in a way that will appeal to the decision makers and help them to understand hidden information. Day 1: what is data visualization why it is important data visualization vs data mining human cognition HMI common pitfalls Day 2: different types of curves drill down curves categorical data plotting multi variable plots data glyph and icon representation Day 3: plotting KPIs with data R and X charts examples what if dashboards parallel axes mixing categorical data with numeric data Day 4: different hats of data visualization how can data visualization lie disguised and hidden trends a case study of student data visual queries and region selection |

pbi4 | Pentaho Business Intelligence (PBI) - reporting modules | 28 hrs | Pentaho BI Suite is a Business Intelligence (BI) system designed for advanced reporting and multidimensional analysis of business data. It allows data from many sources to be combined and processed flexibly. The Pentaho BI Suite platform covers the following areas: Reporting (Pentaho Reporting) Data analysis (Analysis) Management dashboards (Dashboards) Data Mining Pentaho Metadata Pentaho Data Integration (Kettle) Tools supporting the management and design of BI applications in the Pentaho environment: Data Integration, Design Studio, Pentaho SDK, Report Designer, Schema Workbench (successor to Pentaho Cube Designer) This training is a comprehensive course that will enable participants to work efficiently with Report Designer and Business Intelligence Server and to integrate them with Pentaho Data Integration. Report Designer (two days) Basic level (day one) Getting to know the Report Designer user interface Getting to know the user console (Pentaho BI Suite) Creating reports using simple connections (JDBC) Using the query builder Formatting data (styles, attributes) Publishing reports Using Report Wizard as an alternative way of creating reports Conditional formatting Using parameters (prompts) Creating charts Advanced level (day two) Defining connections to XML, tables, PDI Creating reports with subqueries (subreports) and with JavaScript queries Advanced data formatting (HTML, JS, Excel, events) Using parameters and functions Using a transformation (PDI) as a data source Business Intelligence Server (one day) User console Getting to know the user interface Creating reports based on connections to databases and to CSV files Defining queries Scheduling reports Sharing Administrator console Users and groups Defining connections Scheduling Data Integration for Report Designer (one day) Basics of creating transformations in Pentaho Data Integration (PDI) Overview of input data types Processing data with simple mechanisms Filtering data Joining data Grouping Integrating Data Integration with Report Designer Creating a connection to a transformation Using data from a transformation |

dsbda | Data Science for Big Data Analytics | 35 hrs | Introduction to Data Science for Big Data Analytics Data Science Overview Big Data Overview Data Structures Drivers and complexities of Big Data Big Data ecosystem and a new approach to analytics Key technologies in Big Data Data Mining process and problems Association Pattern Mining Data Clustering Outlier Detection Data Classification Introduction to Data Analytics lifecycle Discovery Data preparation Model planning Model building Presentation/Communication of results Operationalization Exercise: Case study From this point most of the training time (80%) will be spent on examples and exercises in R and related big data technology. Getting started with R Installing R and RStudio Features of R language Objects in R Data in R Data manipulation Big data issues Exercises Getting started with Hadoop Installing Hadoop Understanding Hadoop modes HDFS MapReduce architecture Hadoop related projects overview Writing programs in Hadoop MapReduce Exercises Integrating R and Hadoop with RHadoop Components of RHadoop Installing RHadoop and connecting with Hadoop The architecture of RHadoop Hadoop streaming with R Data analytics problem solving with RHadoop Exercises Pre-processing and preparing data Data preparation steps Feature extraction Data cleaning Data integration and transformation Data reduction – sampling, feature subset selection, Dimensionality reduction Discretization and binning Exercises and Case study Exploratory data analytic methods in R Descriptive statistics Exploratory data analysis Visualization – preliminary steps Visualizing single variable Examining multiple variables Statistical methods for evaluation Hypothesis testing Exercises and Case study Data Visualizations Basic visualizations in R Packages for data visualization ggplot2, lattice, plotly Formatting plots in R Advanced graphs Exercises Regression (Estimating future values) Linear regression Use cases Model description Diagnostics 
Problems with linear regression Shrinkage methods, ridge regression, the lasso Generalizations and nonlinearity Regression splines Local polynomial regression Generalized additive models Regression with RHadoop Exercises and Case study Classification The classification related problems Bayesian refresher Naïve Bayes Logistic regression K-nearest neighbors Decision trees algorithm Neural networks Support vector machines Diagnostics of classifiers Comparison of classification methods Scalable classification algorithms Exercises and Case study Assessing model performance and selection Bias, Variance and model complexity Accuracy vs Interpretability Evaluating classifiers Measures of model/algorithm performance Hold-out method of validation Cross-validation Tuning machine learning algorithms with caret package Visualizing model performance with Profit ROC and Lift curves Ensemble Methods Bagging Random Forests Boosting Gradient boosting Exercises and Case study Support vector machines for classification and regression Maximal Margin classifiers Support vector classifiers Support vector machines SVM’s for classification problems SVM’s for regression problems Exercises and Case study Identifying unknown groupings within a data set Feature Selection for Clustering Representative based algorithms: k-means, k-medoids Hierarchical algorithms: agglomerative and divisive methods Probabilistic base algorithms: EM Density based algorithms: DBSCAN, DENCLUE Cluster validation Advanced clustering concepts Clustering with RHadoop Exercises and Case study Discovering connections with Link Analysis Link analysis concepts Metrics for analyzing networks The Pagerank algorithm Hyperlink-Induced Topic Search Link Prediction Exercises and Case study Association Pattern Mining Frequent Pattern Mining Model Scalability issues in frequent pattern mining Brute Force algorithms Apriori algorithm The FP growth approach Evaluation of Candidate Rules Applications of Association Rules Validation 
and Testing Diagnostics Association rules with R and Hadoop Exercises and Case study Constructing recommendation engines Understanding recommender systems Data mining techniques used in recommender systems Recommender systems with recommenderlab package Evaluating the recommender systems Recommendations with RHadoop Exercise: Building recommendation engine Text analysis Text analysis steps Collecting raw text Bag of words Term Frequency –Inverse Document Frequency Determining Sentiments Exercises and Case study |

danagr | Data and Analytics - from the ground up | 42 hrs | Data analytics is a crucial tool in business today. We will focus throughout on developing skills for practical hands-on data analysis. The aim is to help delegates to give evidence-based answers to questions: What has happened? processing and analyzing data producing informative data visualizations What will happen? forecasting future performance evaluating forecasts What should happen? turning data into evidence-based business decisions optimizing processes The course itself can be delivered either as a 6-day classroom course or remotely over a period of weeks if preferred. We can work with you to deliver the course to best suit your needs. Basic Excel Navigation Manipulating data Working with formulas and addresses Charts Advanced Excel Logical functions Scenario analysis Solver Macros First glimpse of code: VBA VBA Data types Writing a function Controlling a program: conditional evaluation and loops Debugging techniques Data Analytics with R Introducing R Variables and types Data manipulation in R Writing functions Data visualization using ggplot Data wrangling with dplyr Introduction to Machine Learning with R Linear Regression Classification and regression trees Classification using Support Vector Machines and Random Forests Clustering techniques |

neo4j | Beyond the relational database: neo4j | 21 hrs | Relational, table-based databases such as Oracle and MySQL have long been the standard for organizing and storing data. However, the growing size and fluidity of data have made it difficult for these traditional systems to efficiently execute highly complex queries on the data. Imagine replacing rows-and-columns-based data storage with object-based data storage, whereby entities (e.g., a person) could be stored as data nodes, then easily queried on the basis of their vast, multi-linear relationship with other nodes. And imagine querying these connections and their associated objects and properties using a compact syntax, up to 20 times lighter than SQL. This is what graph databases, such as neo4j, offer. In this hands-on course, we will set up a live project and put into practice the skills to model, manage and access your data. We contrast and compare graph databases with SQL-based databases as well as other NoSQL databases and clarify when and where it makes sense to implement each within your infrastructure. Audience Database administrators (DBAs) Data analysts Developers System Administrators DevOps engineers Business Analysts CTOs CIOs Format of the course Heavy emphasis on hands-on practice. Most of the concepts are learned through samples, exercises and hands-on development. Getting started with neo4j neo4j vs relational databases neo4j vs other NoSQL databases Using neo4j to solve real world problems Installing neo4j Data modeling with neo4j Mapping white-board diagrams and mind maps to neo4j Working with nodes Creating, changing and deleting nodes Defining node properties Node relationships Creating and deleting relationships Bi-directional relationships Querying your data with Cypher Querying your data based on relationships MATCH, RETURN, WHERE, REMOVE, MERGE, etc. Setting indexes and constraints Working with the REST API REST operations on nodes REST operations on relationships REST operations on indexes and constraints Accessing the core API for application development Working with .NET, Java, JavaScript, and Python APIs Closing remarks |

scilab | Scilab | 14 hrs | Scilab is a well-developed, free, and open-source high-level language for scientific data manipulation. Used for statistics, graphics and animation, simulation, signal processing, physics, optimization, and more, its central data structure is the matrix, simplifying many types of problems compared to alternatives such as FORTRAN and C derivatives. It is compatible with languages such as C, Java, and Python, making it suitable for use as a supplement to existing systems. In this instructor-led training, participants will learn the advantages of Scilab compared to alternatives like Matlab, the basics of the Scilab syntax as well as some advanced functions, and interface with other widely used languages, depending on demand. The course will conclude with a brief project focusing on image processing. By the end of this training, participants will have a grasp of the basic functions and some advanced functions of Scilab, and have the resources to continue expanding their knowledge. Audience Data scientists and engineers, especially with interest in image processing and facial recognition Format of the course Part lecture, part discussion, exercises and intensive hands-on practice, with a final project Introduction Comparison with other languages Getting started Matrix operations Multidimensional data Plotting and exporting graphics Creating an ATOMS toolbox Interface with C, Java, and others Final project: Image analysis Closing remarks Overview of useful libraries and extensions |

mdlmrah | The MapReduce model in the Apache Hadoop implementation | 14 hrs | The training is aimed at organizations that want to deploy solutions for processing large data sets using clusters. Data Mining and Business Intelligence Introduction Application areas Capabilities Basics of data mining and knowledge discovery Big data What do we mean by Big data? Big data and Data mining MapReduce Description of the model Example application Statistics Cluster model Hadoop What is Hadoop Installation Basic configuration Cluster settings Architecture and configuration of the Hadoop Distributed File System Commands and console usage The DistCp tool MapReduce and Hadoop Streaming Administration and configuration of Hadoop On Demand Alternative solutions |

processmining | Process Mining | 21 godz. | Process mining, or Automated Business Process Discovery (ABPD), is a technique that applies algorithms to event logs for the purpose of analyzing business processes. Process mining goes beyond data storage and data analysis; it bridges data with processes and provides insights into the trends and patterns that affect process efficiency. Format of the course The course starts with an overview of the most commonly used techniques for process mining. We discuss the various process discovery algorithms and tools used for discovering and modeling processes based on raw event data. Real-life case studies are examined and data sets are analyzed using the ProM open-source framework. Audience Data science professionals Anyone interested in understanding and applying process modeling and data mining Overview Discovering, analyzing and re-thinking your processes Types of process mining Discovery, conformance and enhancement Process mining workflow From log data analysis to response and action Other tools for process mining PMLAB, Apromore Commercial offerings Closing remarks |

matlabfundamentalsfinance | MATLAB Fundamentals + MATLAB for Finance | 35 godz. | This course provides a comprehensive introduction to the MATLAB technical computing environment + an introduction to using MATLAB for financial applications. The course is intended for beginning users and those looking for a review. No prior programming experience or knowledge of MATLAB is assumed. Themes of data analysis, visualization, modeling, and programming are explored throughout the course. Topics include: Working with the MATLAB user interface Entering commands and creating variables Analyzing vectors and matrices Visualizing vector and matrix data Working with data files Working with data types Automating commands with scripts Writing programs with logic and flow control Writing functions Using the Financial Toolbox for quantitative analysis Part 1 A Brief Introduction to MATLAB Objectives: Offer an overview of what MATLAB is, what it consists of, and what it can do for you An Example: C vs. MATLAB MATLAB Product Overview MATLAB Application Fields What can MATLAB do for you? The Course Outline Working with the MATLAB User Interface Objective: Get an introduction to the main features of the MATLAB integrated design environment and its user interfaces. Get an overview of course themes. MATLAB Interface Reading data from file Saving and loading variables Plotting data Customizing plots Calculating statistics and best-fit line Exporting graphics for use in other applications Variables and Expressions Objective: Enter MATLAB commands, with an emphasis on creating and accessing data in variables. Entering commands Creating variables Getting help Accessing and modifying values in variables Creating character variables Analysis and Visualization with Vectors Objective: Perform mathematical and statistical calculations with vectors, and create basic visualizations. See how MATLAB syntax enables calculations on whole data sets with a single command. 
Calculations with vectors Plotting vectors Basic plot options Annotating plots Analysis and Visualization with Matrices Objective: Use matrices as mathematical objects or as collections of (vector) data. Understand the appropriate use of MATLAB syntax to distinguish between these applications. Size and dimensionality Calculations with matrices Statistics with matrix data Plotting multiple columns Reshaping and linear indexing Multidimensional arrays Part 2 Automating Commands with Scripts Objective: Collect MATLAB commands into scripts for ease of reproduction and experimentation. As the complexity of your tasks increases, entering long sequences of commands in the Command Window becomes impractical. A Modelling Example The Command History Creating script files Running scripts Comments and Code Cells Publishing scripts Working with Data Files Objective: Bring data into MATLAB from formatted files. Because imported data can be of a wide variety of types and formats, emphasis is given to working with cell arrays and date formats. Importing data Mixed data types Cell arrays Conversions amongst numerals, strings, and cells Exporting data Multiple Vector Plots Objective: Make more complex vector plots, such as multiple plots, and use color and string manipulation techniques to produce eye-catching visual representations of data. Graphics structure Multiple figures, axes, and plots Plotting equations Using color Customizing plots Logic and Flow Control Objective: Use logical operations, variables, and indexing techniques to create flexible code that can make decisions and adapt to different situations. Explore other programming constructs for repeating sections of code, and constructs that allow interaction with the user. Logical operations and variables Logical indexing Programming constructs Flow control Loops Matrix and Image Visualization Objective: Visualize images and matrix data in two or three dimensions. 
Explore the difference in displaying images and visualizing matrix data using images. Scattered Interpolation using vector and matrix data 3-D matrix visualization 2-D matrix visualization Indexed images and colormaps True color images Part 3 Data Analysis Objective: Perform typical data analysis tasks in MATLAB, including developing and fitting theoretical models to real-life data. This leads naturally to one of the most powerful features of MATLAB: solving linear systems of equations with a single command. Dealing with missing data Correlation Smoothing Spectral analysis and FFTs Solving linear systems of equations Writing Functions Objective: Increase automation by encapsulating modular tasks as user-defined functions. Understand how MATLAB resolves references to files and variables. Why functions? Creating functions Adding comments Calling subfunctions Workspaces Subfunctions Path and precedence Data Types Objective: Explore data types, focusing on the syntax for creating variables and accessing array elements, and discuss methods for converting among data types. Data types differ in the kind of data they may contain and the way the data is organized. MATLAB data types Integers Structures Converting types File I/O Objective: Explore the low-level data import and export functions in MATLAB that allow precise control over text and binary file I/O. These functions include textscan, which provides precise control of reading text files. Opening and closing files Reading and writing text files Reading and writing binary files Note that the actual content delivered might be subject to minor discrepancies from the outline above without prior notification. Part 4 Overview of the MATLAB Financial Toolbox Objective: Learn to apply the various features included in the MATLAB Financial Toolbox to perform quantitative analysis for the financial industry. Gain the knowledge and practice needed to efficiently develop real-world applications involving financial data. 
Asset Allocation and Portfolio Optimization Risk Analysis and Investment Performance Fixed-Income Analysis and Option Pricing Financial Time Series Analysis Regression and Estimation with Missing Data Technical Indicators and Financial Charts Monte Carlo Simulation of SDE Models Asset Allocation and Portfolio Optimization Objective: perform capital allocation, asset allocation, and risk assessment. Estimating asset return and total return moments from price or return data Computing portfolio-level statistics, such as mean, variance, value at risk (VaR), and conditional value at risk (CVaR) Performing constrained mean-variance portfolio optimization and analysis Examining the time evolution of efficient portfolio allocations Performing capital allocation Accounting for turnover and transaction costs in portfolio optimization problems Risk Analysis and Investment Performance Objective: Define and solve portfolio optimization problems. Specifying a portfolio name, the number of assets in an asset universe, and asset identifiers. Defining an initial portfolio allocation. Fixed-Income Analysis and Option Pricing Objective: Perform fixed-income analysis and option pricing. Analyzing cash flow Performing SIA-Compliant fixed-income security analysis Performing basic Black-Scholes, Black, and binomial option-pricing Part 5 Financial Time Series Analysis Objective: analyze time series data in financial markets. Performing data math Transforming and analyzing data Technical analysis Charting and graphics Regression and Estimation with Missing Data Objective: Perform multivariate normal regression with or without missing data. Performing common regressions Estimating log-likelihood function and standard errors for hypothesis testing Completing calculations when data is missing Technical Indicators and Financial Charts Objective: Practice using performance metrics and specialized plots. 
Moving averages Oscillators, stochastics, indexes, and indicators Maximum drawdown and expected maximum drawdown Charts, including Bollinger bands, candlestick plots, and moving averages Monte Carlo Simulation of SDE Models Objective: Create simulations and apply SDE models Brownian Motion (BM) Geometric Brownian Motion (GBM) Constant Elasticity of Variance (CEV) Cox-Ingersoll-Ross (CIR) Hull-White/Vasicek (HWV) Heston Conclusion Objectives: Summarise what we have learned A summary of the course Other upcoming courses on MATLAB Note: the actual content delivered might differ from the outline as a result of customer requirements and the time spent on each topic. |

sspsspas | Statistics with SPSS Predictive Analytics SoftWare | 14 godz. | Goal: Learn to work with SPSS independently Intended audience: Analysts, researchers, scientists, students and anyone who wants to learn to use the SPSS package and become familiar with popular data mining techniques. Using the program Dialog boxes entering/loading data the concept of a variable and measurement scales preparing the database generating tables and charts formatting the report Command language (syntax) automating analyses saving and modifying procedures creating your own analytical procedures Data analysis Descriptive statistics key terms, including variable, hypothesis, statistical significance measures of central tendency measures of dispersion distributions of statistical features standardization Introduction to studying relationships between variables correlational vs. experimental methods Summary: case analysis and discussion |

kdd | Knowledge Discovery in Databases (KDD) | 21 godz. | Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Real-life applications for this data mining technique include marketing, fraud detection, telecommunication and manufacturing. In this course, we introduce the processes involved in KDD and carry out a series of exercises to practice the implementation of those processes. Audience Data analysts or anyone interested in learning how to interpret data to solve problems Format of the course After a theoretical discussion of KDD, the instructor will present real-life cases which call for the application of KDD to solve a problem. Participants will prepare, select and cleanse sample data sets and use their prior knowledge about the data to propose solutions based on the results of their observations. Introduction KDD vs data mining Establishing the application domain Establishing relevant prior knowledge Understanding the goal of the investigation Creating a target data set Data cleaning and preprocessing Data reduction and projection Choosing the data mining task Choosing the data mining algorithms Interpreting the mined patterns |

matlabdsandreporting | MATLAB Fundamentals, Data Science & Report Generation | 126 godz. | In the first part of this training, we cover the fundamentals of MATLAB and its function as both a language and a platform. Included in this discussion is an introduction to MATLAB syntax, arrays and matrices, data visualization, script development, and object-oriented principles. In the second part, we demonstrate how to use MATLAB for data mining, machine learning and predictive analytics. To provide participants with a clear and practical perspective of MATLAB's approach and power, we draw comparisons between using MATLAB and using other tools such as spreadsheets, C, C++, and Visual Basic. In the third part of the training, participants learn how to streamline their work by automating their data processing and report generation. Throughout the course, participants will put into practice the ideas learned through hands-on exercises in a lab environment. By the end of the training, participants will have a thorough grasp of MATLAB's capabilities and will be able to employ it for solving real-world data science problems as well as for streamlining their work through automation. Assessments will be conducted throughout the course to gauge progress. Format of the course Course includes theoretical and practical exercises, including case discussions, sample code inspection, and hands-on implementation. Note Practice sessions will be based on pre-arranged sample data report templates. If you have specific requirements, please contact us to arrange. Introduction MATLAB for data science and reporting Part 01: MATLAB fundamentals Overview MATLAB for data analysis, visualization, modeling, and programming. 
Working with the MATLAB user interface Overview of MATLAB syntax Entering commands Using the command line interface Creating variables Numeric vs character data Analyzing vectors and matrices Creating and manipulating Performing calculations Visualizing vector and matrix data Working with data files Importing data from Excel spreadsheets Working with data types Working with table data Automating commands with scripts Creating and running scripts Organizing and publishing your scripts Writing programs with branching and loops User interaction and flow control Writing functions Creating and calling functions Debugging with MATLAB Editor Applying object-oriented programming principles to your programs Part 02: MATLAB for data science Overview MATLAB for data mining, machine learning and predictive analytics Accessing data Obtaining data from files, spreadsheets, and databases Obtaining data from test equipment and hardware Obtaining data from software and the Web Exploring data Identifying trends, testing hypotheses, and estimating uncertainty Creating customized algorithms Creating visualizations Creating models Publishing customized reports Sharing analysis tools As MATLAB code As standalone desktop or Web applications Using the Statistics and Machine Learning Toolbox Using the Neural Network Toolbox Part 03: Report generation Overview Presenting results from MATLAB programs, applications, and sample data Generating Microsoft Word, PowerPoint®, PDF, and HTML reports. 
Templated reports Tailor-made reports Using organization’s templates and standards Creating reports interactively vs programmatically Using the Report Explorer Using the DOM (Document Object Model) API Creating reports interactively using Report Explorer Report Explorer Examples Magic Squares Report Explorer Example Creating reports Using Report Explorer to create report setup file, define report structure and content Formatting reports Specifying default report style and format for Report Explorer reports Generating reports Configuring Report Explorer for processing and running report Managing report conversion templates Copying and managing Microsoft Word, PDF, and HTML conversion templates for Report Explorer reports Customizing Report Conversion templates Customizing the style and format of Microsoft Word and HTML conversion templates for Report Explorer reports Customizing components and style sheets Customizing report components, define layout style sheets Creating reports programmatically in MATLAB Template-Based Report Object (DOM) API Examples Functional report Object-oriented report Programmatic report formatting Creating report content Using the Document Object Model (DOM) API Report format basics Specifying format for report content Creating form-based reports Using the DOM API to fill in the blanks in a report form Creating object-oriented reports Deriving classes to simplify report creation and maintenance Creating and formatting report objects Lists, tables, and images Creating DOM Reports from HTML Appending HTML string or file to a Microsoft® Word, PDF, or HTML report generated by Document Object Model (DOM) API Creating report templates Creating templates to use with programmatic reports Formatting page layouts Formatting pages in Microsoft Word and PDF reports Summary and closing remarks |

bdbiga | Big Data Business Intelligence for Govt. Agencies | 35 godz. | Advances in technologies and the increasing amount of information are transforming how business is conducted in many industries, including government. Government data generation and digital archiving rates are on the rise due to the rapid growth of mobile devices and applications, smart sensors and devices, cloud computing solutions, and citizen-facing portals. As digital information expands and becomes more complex, information management, processing, storage, security, and disposition become more complex as well. New capture, search, discovery, and analysis tools are helping organizations gain insights from their unstructured data. The government market is at a tipping point, realizing that information is a strategic asset, and government needs to protect, leverage, and analyze both structured and unstructured information to better serve and meet mission requirements. As government leaders strive to evolve data-driven organizations to successfully accomplish mission, they are laying the groundwork to correlate dependencies across events, people, processes, and information. High-value government solutions will be created from a mashup of the most disruptive technologies: Mobile devices and applications Cloud services Social business technologies and networking Big Data and analytics IDC predicts that by 2020, the IT industry will reach $5 trillion, approximately $1.7 trillion larger than today, and that 80% of the industry's growth will be driven by these 3rd Platform technologies. In the long term, these technologies will be key tools for dealing with the complexity of increased digital information. Big Data is one of the intelligent industry solutions and allows government to make better decisions by taking action based on patterns revealed by analyzing large volumes of data — related and unrelated, structured and unstructured. 
But accomplishing these feats takes far more than simply accumulating massive quantities of data. “Making sense of these volumes of Big Data requires cutting-edge tools and technologies that can analyze and extract useful knowledge from vast and diverse streams of information,” Tom Kalil and Fen Zhao of the White House Office of Science and Technology Policy wrote in a post on the OSTP Blog. The White House took a step toward helping agencies find these technologies when it established the National Big Data Research and Development Initiative in 2012. The initiative included more than $200 million to make the most of the explosion of Big Data and the tools needed to analyze it. The challenges that Big Data poses are nearly as daunting as its promise is encouraging. Storing data efficiently is one of these challenges. As always, budgets are tight, so agencies must minimize the per-megabyte price of storage and keep the data within easy access so that users can get it when they want it and how they need it. Backing up massive quantities of data heightens the challenge. Analyzing the data effectively is another major challenge. Many agencies employ commercial tools that enable them to sift through the mountains of data, spotting trends that can help them operate more efficiently. (A recent study by MeriTalk found that federal IT executives think Big Data could help agencies save more than $500 billion while also fulfilling mission objectives.) Custom-developed Big Data tools also are allowing agencies to address the need to analyze their data. For example, the Oak Ridge National Laboratory’s Computational Data Analytics Group has made its Piranha data analytics system available to other agencies. The system has helped medical researchers find a link that can alert doctors to aortic aneurysms before they strike. It’s also used for more mundane tasks, such as sifting through résumés to connect job candidates with hiring managers. 
Breakdown of topics on a daily basis: (Each session is 2 hours) Day-1: Session-1: Business Overview of Why Big Data Business Intelligence in Govt. Case Studies from NIH, DoE Big Data adoption rate in Govt. Agencies and how they are aligning their future operation around Big Data Predictive Analytics Broad Scale Application Area in DoD, NSA, IRS, USDA etc. Interfacing Big Data with Legacy data Basic understanding of enabling technologies in predictive analytics Data Integration & Dashboard visualization Fraud management Business Rule/ Fraud detection generation Threat detection and profiling Cost benefit analysis for Big Data implementation Day-1: Session-2: Introduction of Big Data-1 Main characteristics of Big Data: volume, variety, velocity and veracity. MPP architecture for volume. Data Warehouses – static schema, slowly evolving dataset MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica etc. Hadoop Based Solutions – no conditions on structure of dataset. Typical pattern : HDFS, MapReduce (crunch), retrieve from HDFS Batch- suited for analytical/non-interactive Volume : CEP streaming data Typical choices – CEP products (e.g. 
Infostreams, Apama, MarkLogic etc) Less production ready – Storm/S4 NoSQL Databases – (columnar and key-value): Best suited as analytical adjunct to data warehouse/database Day-1 : Session -3 : Introduction to Big Data-2 NoSQL solutions KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB) KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MotionDb, DovetailDB KV Store (Hierarchical) - GT.m, Cache KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord KV Cache - Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta Tuple Store - Gigaspaces, Coord, Apache River Object Database - ZopeDB, db4o, Shoal Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI Varieties of Data: Introduction to Data Cleaning issue in Big Data RDBMS – static structure/schema, doesn’t promote agile, exploratory environment. NoSQL – semi structured, enough structure to store data without exact schema before storing data Data cleaning issues Day-1 : Session-4 : Big Data Introduction-3 : Hadoop When to select Hadoop? 
STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration) SEMI STRUCTURED data – tough to do with traditional solutions (DW/DB) Warehousing data = HUGE effort and static even after implementation For variety & volume of data, crunched on commodity hardware – HADOOP Commodity H/W needed to create a Hadoop Cluster Introduction to MapReduce/HDFS MapReduce – distribute computing over multiple servers HDFS – make data available locally for the computing process (with redundancy) Data – can be unstructured/schema-less (unlike RDBMS) Developer responsibility to make sense of data Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS Day-2: Session-1: Big Data Ecosystem-Building Big Data ETL: universe of Big Data Tools-which one to use and when? Hadoop vs. Other NoSQL solutions For interactive, random access to data HBase (column oriented database) on top of Hadoop Random access to data but restrictions imposed (max 1 PB) Not good for ad-hoc analytics, good for logging, counting, time-series Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access) Flume – Stream data (e.g. 
log data) into HDFS Day-2: Session-2: Big Data Management System Moving parts, compute nodes start/fail :ZooKeeper - For configuration/coordination/naming services Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain Deploy, configure, cluster management, upgrade etc (sys admin) :Ambari In Cloud : Whirr Day-2: Session-3: Predictive analytics in Business Intelligence -1: Fundamental Techniques & Machine learning based BI : Introduction to Machine learning Learning classification techniques Bayesian Prediction-preparing training file Support Vector Machine KNN p-Tree Algebra & vertical mining Neural Network Big Data large variable problem -Random forest (RF) Big Data Automation problem – Multi-model ensemble RF Automation through Soft10-M Text analytic tool-Treeminer Agile learning Agent based learning Distributed learning Introduction to Open source Tools for predictive analytics : R, RapidMiner, Mahout Day-2: Session-4 Predictive analytics eco-system-2: Common predictive analytic problems in Govt. Insight analytic Visualization analytic Structured predictive analytic Unstructured predictive analytic Threat/fraudster/vendor profiling Recommendation Engine Pattern detection Rule/Scenario discovery –failure, fraud, optimization Root cause discovery Sentiment analysis CRM analytic Network analytic Text Analytics Technology assisted review Fraud analytic Real Time Analytic Day-3 : Session-1 : Real Time and Scalable Analytic Over Hadoop Why common analytic algorithms fail in Hadoop/HDFS Apache Hama- for Bulk Synchronous distributed computing Apache Spark- for cluster computing for real time analytic CMU Graphics Lab2- Graph based asynchronous approach to distributed computing KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation Day-3: Session-2: Tools for eDiscovery and Forensics eDiscovery over Big Data vs. 
Legacy data – a comparison of cost and performance Predictive coding and technology assisted review (TAR) Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery Faster indexing through HDFS –velocity of data NLP or Natural Language processing –various techniques and open source products eDiscovery in foreign languages-technology for foreign language processing Day-3 : Session 3: Big Data BI for Cyber Security –Understanding whole 360 degree views of speedy data collection to threat identification Understanding basics of security analytics-attack surface, security misconfiguration, host defenses Network infrastructure/ Large datapipe / Response ETL for real time analytic Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from Meta data Day-3: Session 4: Big Data in USDA : Application in Agriculture Introduction to IoT (Internet of Things) for agriculture-sensor based Big Data and control Introduction to Satellite imaging and its application in agriculture Integrating sensor and image data for fertility of soil, cultivation recommendation and forecasting Agriculture insurance and Big Data Crop Loss forecasting Day-4 : Session-1: Fraud prevention BI from Big Data in Govt-Fraud analytic: Basic classification of Fraud analytics- rule based vs predictive analytics Supervised vs unsupervised Machine learning for Fraud pattern detection Vendor fraud/over charging for projects Medicare and Medicaid fraud- fraud detection techniques for claim processing Travel reimbursement frauds IRS refund frauds Case studies and live demo will be given wherever data is available. 
Day-4 : Session-2: Social Media Analytic- Intelligence gathering and analysis Big Data ETL API for extracting social media data Text, image, meta data and video Sentiment analysis from social media feed Contextual and non-contextual filtering of social media feed Social Media Dashboard to integrate diverse social media Automated profiling of social media profile Live demo of each analytic will be given through Treeminer Tool. Day-4 : Session-3: Big Data Analytic in image processing and video feeds Image Storage techniques in Big Data- Storage solution for data exceeding petabytes LTFS and LTO GPFS-LTFS ( Layered storage solution for Big image data) Fundamental of image analytics Object recognition Image segmentation Motion tracking 3-D image reconstruction Day-4: Session-4: Big Data applications in NIH: Emerging areas of Bio-informatics Meta-genomics and Big Data mining issues Big Data Predictive analytic for Pharmacogenomics, Metabolomics and Proteomics Big Data in downstream Genomics process Application of Big data predictive analytics in Public health Big Data Dashboard for quick accessibility of diverse data and display : Integration of existing application platform with Big Data Dashboard Big Data management Case Study of Big Data Dashboard: Tableau and Pentaho Use Big Data app to push location based services in Govt. Tracking system and management Day-5 : Session-1: How to justify Big Data BI implementation within an organization: Defining ROI for Big Data implementation Case studies for saving Analyst Time for collection and preparation of Data –increase in productivity gain Case studies of revenue gain from saving the licensed database cost Revenue gain from location based services Saving from fraud prevention An integrated spreadsheet approach to calculate approx. expense vs. Revenue gain/savings from Big Data implementation. 
Day-5 : Session-2: Step by Step procedure to replace legacy data system with Big Data System: Understanding practical Big Data Migration Roadmap What important information is needed before architecting a Big Data implementation What are the different ways of calculating volume, velocity, variety and veracity of data How to estimate data growth Case studies Day-5: Session 4: Review of Big Data Vendors and review of their products. Q/A session: Accenture APTEAN (Formerly CDC Software) Cisco Systems Cloudera Dell EMC GoodData Corporation Guavus Hitachi Data Systems Hortonworks HP IBM Informatica Intel Jaspersoft Microsoft MongoDB (Formerly 10Gen) MU Sigma Netapp Opera Solutions Oracle Pentaho Platfora Qliktech Quantum Rackspace Revolution Analytics Salesforce SAP SAS Institute Sisense Software AG/Terracotta Soft10 Automation Splunk Sqrrl Supermicro Tableau Software Teradata Think Big Analytics Tidemark Systems Treeminer VMware (Part of EMC) |

dmmlr | Data Mining & Machine Learning with R | 14 godz. | Introduction to Data mining and Machine Learning Statistical learning vs. Machine learning Iteration and evaluation Bias-Variance trade-off Regression Linear regression Generalizations and Nonlinearity Exercises Classification Bayesian refresher Naive Bayes Discriminant analysis Logistic regression K-Nearest neighbors Support Vector Machines Neural networks Decision trees Exercises Cross-validation and Resampling Cross-validation approaches Bootstrap Exercises Unsupervised Learning K-means clustering Examples Challenges of unsupervised learning and beyond K-means Advanced topics Ensemble models Mixed models Boosting Examples Multidimensional reduction Factor Analysis Principal Component Analysis Examples |

TalendDI | Talend Open Studio for Data Integration | 28 godz. | Talend Open Studio for Data Integration is an open-source data integration product used to combine, convert and update data in various locations across a business. In this instructor-led, live training, participants will learn how to use the Talend ETL tool to carry out data transformation, data extraction, and connectivity with Hadoop, Hive, and Pig. By the end of this training, participants will be able to Explain the concepts behind ETL (Extract, Transform, Load) and propagation Define ETL methods and ETL tools to connect with Hadoop Efficiently amass, retrieve, digest, consume, transform and shape big data in accordance with business requirements Audience Business intelligence professionals Project managers Database professionals SQL Developers ETL Developers Solution architects Data architects Data warehousing professionals System administrators and integrators Format of the course Part lecture, part discussion, exercises and heavy hands-on practice To request a customized course outline for this training, please contact us. |

psr | Fundamentals of Recommender Systems | 7 hours | This training is aimed at marketing department staff and IT department leaders. Problems and hopes associated with data collection Information overload Types of data collected The potential of data today and tomorrow Basic Data Mining concepts Recommendation vs. search Searching and filtering Sorting Weighting results Using synonyms Full-text search The Long Tail concept Chris Anderson's idea Arguments of the opponents of the Long Tail concept; Anita Elberse's argumentation Attempting to determine similarities Products Users Documents and web pages Content-Based Recommendation and similarity measures Cosine distance Euclidean distance between vectors TF-IDF and the concept of term frequency Collaborative filtering Recommendation based on community ratings Using graphs Capabilities of graphs Determining graph similarity Recommendation based on relationships between users Neural networks How they work Training data Sample application of neural networks in HR recommender systems Encouraging users to share information Convenience of the service Navigation aids Functionality and UX Recommender systems around the world Problems and popularity of recommender systems Successful deployments of recommender systems Examples based on popular services |

rprogda | R Programming for Data Analysis | 14 hours | This course is part of the Data Scientist skill set (Domain: Data and Technology) Introduction and preliminaries Making R more friendly, R and available GUIs RStudio Related software and documentation R and statistics Using R interactively An introductory session Getting help with functions and features R commands, case sensitivity, etc. Recall and correction of previous commands Executing commands from or diverting output to a file Data permanency and removing objects Simple manipulations; numbers and vectors Vectors and assignment Vector arithmetic Generating regular sequences Logical vectors Missing values Character vectors Index vectors; selecting and modifying subsets of a data set Other types of objects Objects, their modes and attributes Intrinsic attributes: mode and length Changing the length of an object Getting and setting attributes The class of an object Arrays and matrices Arrays Array indexing. Subsections of an array Index matrices The array() function The outer product of two arrays Generalized transpose of an array Matrix facilities Matrix multiplication Linear equations and inversion Eigenvalues and eigenvectors Singular value decomposition and determinants Least squares fitting and the QR decomposition Forming partitioned matrices, cbind() and rbind() The concatenation function, c(), with arrays Frequency tables from factors Lists and data frames Lists Constructing and modifying lists Concatenating lists Data frames Making data frames attach() and detach() Working with data frames Attaching arbitrary lists Managing the search path Data manipulation Selecting, subsetting observations and variables Filtering, grouping Recoding, transformations Aggregation, combining data sets Character manipulation, stringr package Reading data Txt files CSV files XLS, XLSX files SPSS, SAS, Stata,… and other data formats Exporting data to txt, csv and other formats Accessing data from databases using SQL language Probability distributions R as a set of statistical tables Examining the distribution of a set of data One- and two-sample tests Grouping, loops and conditional execution Grouped expressions Control statements Conditional execution: if statements Repetitive execution: for loops, repeat and while Writing your own functions Simple examples Defining new binary operators Named arguments and defaults The '...' argument Assignments within functions More advanced examples Efficiency factors in block designs Dropping all names in a printed array Recursive numerical integration Scope Customizing the environment Classes, generic functions and object orientation Graphical procedures High-level plotting commands The plot() function Displaying multivariate data Display graphics Arguments to high-level plotting functions Basic visualisation graphs Multivariate relations with lattice and ggplot package Using graphics parameters Graphics parameters list Automated and interactive reporting Combining output from R with text |

PentahoDI | Pentaho Data Integration Fundamentals | 21 hours | Pentaho Data Integration is an open-source data integration tool for defining jobs and data transformations. In this instructor-led, live training, participants will learn how to use Pentaho Data Integration's powerful ETL capabilities and rich GUI to manage an entire big data lifecycle, maximizing the value of data to the organization. By the end of this training, participants will be able to: Create, preview, and run basic data transformations containing steps and hops Configure and secure the Pentaho Enterprise Repository Harness disparate sources of data and generate a single, unified version of the truth in an analytics-ready format Provide results to third-party applications for further processing Audience Data analysts ETL developers Format of the course Part lecture, part discussion, exercises and heavy hands-on practice To request a customized course outline for this training, please contact us. |

68780 | Apache Spark | 14 hours | Why Spark? Problems with Traditional Large-Scale Systems Introducing Spark Spark Basics What is Apache Spark? Using the Spark Shell Resilient Distributed Datasets (RDDs) Functional Programming with Spark Working with RDDs RDD Operations Key-Value Pair RDDs MapReduce and Pair RDD Operations The Hadoop Distributed File System Why HDFS? HDFS Architecture Using HDFS Running Spark on a Cluster Overview A Spark Standalone Cluster The Spark Standalone Web UI Parallel Programming with Spark RDD Partitions and HDFS Data Locality Working With Partitions Executing Parallel Operations Caching and Persistence RDD Lineage Caching Overview Distributed Persistence Writing Spark Applications Spark Applications vs. Spark Shell Creating the SparkContext Configuring Spark Properties Building and Running a Spark Application Logging Spark, Hadoop, and the Enterprise Data Center Overview Spark and the Hadoop Ecosystem Spark and MapReduce Spark Streaming Spark Streaming Overview Example: Streaming Word Count Other Streaming Operations Sliding Window Operations Developing Spark Streaming Applications Common Spark Algorithms Iterative Algorithms Graph Analysis Machine Learning Improving Spark Performance Shared Variables: Broadcast Variables Shared Variables: Accumulators Common Performance Issues |

bigddbsysfun | Big Data & Database Systems Fundamentals | 14 hours | The course is part of the Data Scientist skill set (Domain: Data and Technology). Data Warehousing Concepts What is a Data Warehouse? Difference between OLTP and Data Warehousing Data Acquisition Data Extraction Data Transformation Data Loading Data Marts Dependent vs Independent Data Marts Database design ETL Testing Concepts: Introduction. Software development life cycle. Testing methodologies. ETL Testing Work Flow Process. ETL Testing Responsibilities in Data stage. Big Data Fundamentals Big Data and its role in the corporate world The phases of development of a Big Data strategy within a corporation Explain the rationale underlying a holistic approach to Big Data Components needed in a Big Data Platform Big Data storage solution Limits of Traditional Technologies Overview of database types NoSQL Databases Hadoop MapReduce Apache Spark |

pmml | Predictive Models with PMML | 7 hours | This course is intended for scientists, developers, analysts, and anyone else who wants to standardize or exchange their models using the Predictive Model Markup Language (PMML) file format. Predictive Models Intro to predictive models Predictive models supported by PMML PMML Elements Header Data Dictionary Data Transformations Model Mining Schema Targets Output API Overview of API providers for PMML Executing your model in a cloud |

rintrob | Introductory R for Biologists | 28 hours | I. Introduction and preliminaries 1. Overview Making R more friendly, R and available GUIs RStudio Related software and documentation R and statistics Using R interactively An introductory session Getting help with functions and features R commands, case sensitivity, etc. Recall and correction of previous commands Executing commands from or diverting output to a file Data permanency and removing objects Good programming practice: Self-contained scripts, good readability e.g. structured scripts, documentation, markdown Installing packages; CRAN and Bioconductor 2. Reading data Txt files (read.delim) CSV files 3. Simple manipulations; numbers and vectors + arrays Vectors and assignment Vector arithmetic Generating regular sequences Logical vectors Missing values Character vectors Index vectors; selecting and modifying subsets of a data set Arrays Array indexing. Subsections of an array Index matrices The array() function + simple operations on arrays e.g. multiplication, transposition Other types of objects 4. Lists and data frames Lists Constructing and modifying lists Concatenating lists Data frames Making data frames Working with data frames Attaching arbitrary lists Managing the search path 5. Data manipulation Selecting, subsetting observations and variables Filtering, grouping Recoding, transformations Aggregation, combining data sets Forming partitioned matrices, cbind() and rbind() The concatenation function, c(), with arrays Character manipulation, stringr package Short intro to grep and regexpr 6. More on Reading data XLS, XLSX files readr and readxl packages SPSS, SAS, Stata,… and other data formats Exporting data to txt, csv and other formats 7. Grouping, loops and conditional execution Grouped expressions Control statements Conditional execution: if statements Repetitive execution: for loops, repeat and while Intro to apply, lapply, sapply, tapply 8. Functions Creating functions Optional arguments and default values Variable number of arguments Scope and its consequences 9. Simple graphics in R Creating a Graph Density Plots Dot Plots Bar Plots Line Charts Pie Charts Boxplots Scatter Plots Combining Plots II. Statistical analysis in R 1. Probability distributions R as a set of statistical tables Examining the distribution of a set of data 2. Testing of Hypotheses Tests about a Population Mean Likelihood Ratio Test One- and two-sample tests Chi-Square Goodness-of-Fit Test Kolmogorov-Smirnov One-Sample Statistic Wilcoxon Signed-Rank Test Two-Sample Test Wilcoxon Rank Sum Test Mann-Whitney Test Kolmogorov-Smirnov Test 3. Multiple Testing of Hypotheses Type I Error and FDR ROC curves and AUC Multiple Testing Procedures (BH, Bonferroni etc.) 4. Linear regression models Generic functions for extracting model information Updating fitted models Generalized linear models Families The glm() function Classification Logistic Regression Linear Discriminant Analysis Unsupervised learning Principal Components Analysis Clustering Methods (k-means, hierarchical clustering, k-medoids) 5. Survival analysis (survival package) Survival objects in R Kaplan-Meier estimate, log-rank test, parametric regression Confidence bands Censored (interval censored) data analysis Cox PH models, constant covariates Cox PH models, time-dependent covariates Simulation: Model comparison (Comparing regression models) 6. Analysis of Variance One-Way ANOVA Two-Way Classification of ANOVA MANOVA III. Worked problems in bioinformatics Short introduction to limma package Microarray data analysis workflow Data download from GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1397 Data processing (QC, normalisation, differential expression) Volcano plot Clustering examples + heatmaps |

dataminr | Data Mining with R | 14 hours | Sources of methods Artificial intelligence Machine learning Statistics Sources of data Preprocessing of data Data Import/Export Data Exploration and Visualization Dimensionality Reduction Dealing with missing values R Packages Data mining main tasks Automatic or semi-automatic analysis of large quantities of data Extracting previously unknown interesting patterns groups of data records (cluster analysis) unusual records (anomaly detection) dependencies (association rule mining) Data mining Anomaly detection (Outlier/change/deviation detection) Association rule learning (Dependency modeling) Clustering Classification Regression Summarization Frequent Pattern Mining Text Mining Decision Trees Regression Neural Networks Sequence Mining Frequent Pattern Mining Data dredging, data fishing, data snooping |

datama | Data Mining and Analysis | 28 hours | Objective: Delegates will be able to analyse big data sets, extract patterns, and choose the right variables impacting the results, so that a new model can be built to forecast results. Data preprocessing Data Cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Statistical inference Probability distributions, Random variables, Central limit theorem Sampling Confidence intervals Statistical Inference Hypothesis testing Multivariate linear regression Specification Subset selection Estimation Validation Prediction Classification methods Logistic regression Linear discriminant analysis K-nearest neighbours Naive Bayes Comparison of Classification methods Neural Networks Fitting neural networks Training neural networks issues Decision trees Regression trees Classification trees Trees Versus Linear Models Bagging, Random Forests, Boosting Bagging Random Forests Boosting Support Vector Machines and Flexible Discriminants Maximal Margin classifier Support vector classifiers Support vector machines 2 and more classes SVM’s Relationship to logistic regression Principal Components Analysis Clustering K-means clustering K-medoids clustering Hierarchical clustering Density based clustering Model Assessment and Selection Bias, Variance and Model complexity In-sample prediction error The Bayesian approach Cross-validation Bootstrap methods |

d2dbdpa | From Data to Decision with Big Data and Predictive Analytics | 21 hours | Audience If you are trying to make sense of the data you have access to, or want to analyse unstructured data available on the net (like Twitter, LinkedIn, etc...), this course is for you. It is mostly aimed at decision makers and people who need to choose what data is worth collecting and what is worth analyzing. It is not aimed at people configuring the solution, although those people will benefit from the big picture. Delivery Mode During the course delegates will be presented with working examples of mostly open source technologies. Short lectures will be followed by presentations and simple exercises by the participants. Content and Software used All software used is updated each time the course is run, so we use the newest versions possible. The course covers the process from obtaining, formatting, processing and analysing the data, to explaining how to automate the decision-making process with machine learning. Quick Overview Data Sources Mining Data Recommender systems Target Marketing Datatypes Structured vs unstructured Static vs streamed Attitudinal, behavioural and demographic data Data-driven vs user-driven analytics Data validity Volume, velocity and variety of data Models Building models Statistical Models Machine learning Data Classification Clustering kGroups, k-means, nearest neighbours Ant colonies, birds flocking Predictive Models Decision trees Support vector machine Naive Bayes classification Neural networks Markov Model Regression Ensemble methods ROI Benefit/Cost ratio Cost of software Cost of development Potential benefits Building Models Data Preparation (MapReduce) Data cleansing Choosing methods Developing model Testing Model Model evaluation Model deployment and integration Overview of Open Source and commercial software Selection of R-project packages Python libraries Hadoop and Mahout Selected Apache projects related to Big Data and Analytics Selected commercial solutions Integration with existing software and data sources |

osqlide | Oracle SQL Intermediate - Data Extraction | 14 hours | Limiting results The WHERE clause Comparison operators The LIKE condition The BETWEEN ... AND condition The IS NULL condition The IN condition Boolean operators AND, OR and NOT Multiple conditions in the WHERE clause Operator precedence The DISTINCT clause SQL functions Differences between single-row and multi-row functions Text, numeric and date functions Explicit and implicit conversion Conversion functions Nesting functions Viewing function results - the DUAL table Getting the current date with the SYSDATE function Handling NULL values Aggregating data using group functions Group functions How group functions treat NULL values Creating groups of data - the GROUP BY clause Grouping by multiple columns Limiting group function results - the HAVING clause Subqueries Placing subqueries in the SELECT statement Single-row and multi-row subqueries Single-row subquery operators Group functions in subqueries Multi-row subquery operators IN, ALL, ANY How NULL values are treated in subqueries Set operators UNION operator UNION ALL operator INTERSECT operator MINUS operator Further Usage Of Joins Revisit Joins Combining Inner and Outer Joins Partitioned Outer Joins Hierarchical Queries Further Usage Of Sub-Queries Revisit sub-queries Use of sub-queries as virtual tables/inline views and columns Use of the WITH construction Combining sub-queries and joins Analytic functions OVER clause Partition Clause Windowing Clause Rank, Lead, Lag, First, Last functions Retrieving data from multiple tables (if time at end) Types of joins Using NATURAL JOIN Table aliases Joins in the WHERE clause INNER JOIN Outer joins: LEFT, RIGHT, FULL OUTER JOIN Cartesian product Aggregate Functions (if time at end) Revisit Group By function and Having clause Group and Rollup Group and Cube |

datamin | Data Mining | 21 hours | The course can be delivered with any tools, including free open-source data mining software and applications. Introduction Data mining as the analysis step of the KDD process ("Knowledge Discovery in Databases") Subfield of computer science Discovering patterns in large data sets Sources of methods Artificial intelligence Machine learning Statistics Database systems What is involved? Database and data management aspects Data pre-processing Model and inference considerations Interestingness metrics Complexity considerations Post-processing of discovered structures Visualization Online updating Data mining main tasks Automatic or semi-automatic analysis of large quantities of data Extracting previously unknown interesting patterns groups of data records (cluster analysis) unusual records (anomaly detection) dependencies (association rule mining) Data mining Anomaly detection (Outlier/change/deviation detection) Association rule learning (Dependency modeling) Clustering Classification Regression Summarization Use and applications Able Danger Behavioral analytics Business analytics Cross Industry Standard Process for Data Mining Customer analytics Data mining in agriculture Data mining in meteorology Educational data mining Human genetic clustering Inference attack Java Data Mining Open-source intelligence Path analysis (computing) Reactive business intelligence Data dredging, data fishing, data snooping |

druid | Druid: Build a fast, real-time data analysis system | 21 hours | Druid is an open-source, column-oriented, distributed data store written in Java. It was designed to quickly ingest massive quantities of event data and execute low-latency OLAP queries on that data. Druid is commonly used in business intelligence applications to analyze high volumes of real-time and historical data. It is also well suited for powering fast, interactive, analytic dashboards for end-users. Druid is used by companies such as Alibaba, Airbnb, Cisco, eBay, Netflix, PayPal, and Yahoo. In this course we explore some of the limitations of data warehouse solutions and discuss how Druid can complement those technologies to form a flexible and scalable streaming analytics stack. We walk through many examples, offering participants the chance to implement and test Druid-based solutions in a lab environment. Audience Application developers Software engineers Technical consultants DevOps professionals Architecture engineers Format of the course Part lecture, part discussion, heavy hands-on practice, occasional tests to gauge understanding Introduction Installing and starting Druid Druid architecture and design Real-time ingestion of event data Sharding and indexing Loading data Querying data Visualizing data Running a distributed cluster Druid + Apache Hive Druid + Apache Kafka Druid + others Troubleshooting Administrative tasks |

Course | Course Date | Course Price [Remote / Classroom] |
---|---|---|

From Data to Decision with Big Data and Predictive Analytics - Zielona Góra, ul. Reja 6 | Mon, 2017-10-09 09:00 | 29220 PLN / 9605 PLN |

Introductory R for Biologists - Opole, Władysława Reymonta 29 | Mon, 2017-10-09 09:00 | 23680 PLN / 8976 PLN |

Pentaho Data Integration (PDI) - ETL data processing module (advanced level) - Szczecin, ul. Sienna 9 | Mon, 2017-10-09 09:00 | 12000 PLN / 4750 PLN |

Model MapReduce and Apache Hadoop - Lublin, ul. Spadochroniarzy 9 | Tue, 2017-10-10 09:00 | 5000 PLN / 3000 PLN |