What is underfitting in an ML model and how to solve it |
Posted by: manjunath - 12-23-2022, 09:18 AM - Forum: DS- Lab Q&A
- No Replies
|
|
Underfitting in machine learning refers to a model that is too simple to capture the complexity of the data accurately. The result is poor performance on the training data and poor generalization to new data.
There are several ways to solve underfitting in a machine learning model:
- Use a more complex model: One option is to use a more complex model with more parameters. This can help capture more of the complexity of the data. However, be aware that using a very complex model can also lead to overfitting, which is when the model is too specific to the training data and does not generalize well to new data.
- Add more features: Another option is to add more features to the model. This can provide the model with more information and help it capture more of the complexity of the data.
- Increase the amount of training data: Increasing the amount of training data can also help the model capture more of the complexity of the data.
- Reduce regularization: Regularization adds a penalty based on model complexity to help prevent overfitting. If a model is underfitting and is already regularized, reducing that penalty lets it fit the training data more closely.
- Early stopping: Early stopping is a technique that involves training the model until the performance on a validation set starts to degrade, and then stopping the training process. This can help prevent overfitting and improve the generalization performance of the model.
Keep in mind that fixing underfitting involves a trade-off with overfitting: the goal is a model complex enough to fit the training data well while still generalizing to new data.
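For illustration, here is a minimal scikit-learn sketch (the dataset is synthetic and the degree-3 polynomial is just one example of a more complex model) showing how an overly simple model underfits and how adding capacity helps:

```python
# Minimal sketch (not BDB-specific): diagnosing underfitting and fixing it by
# increasing model capacity. Assumes scikit-learn and NumPy; the data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = (X ** 2).ravel() + rng.normal(0, 0.2, size=300)   # non-linear (quadratic) target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight line cannot capture the U-shaped relationship:
# both training and test R^2 stay low -> classic underfitting.
simple = LinearRegression().fit(X_train, y_train)
print("linear   train/test R^2:", simple.score(X_train, y_train), simple.score(X_test, y_test))

# Adding polynomial features increases model capacity and lifts both scores.
richer = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X_train, y_train)
print("degree-3 train/test R^2:", richer.score(X_train, y_train), richer.score(X_test, y_test))
```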
|
|
|
What is a normal distribution curve and how do we use it |
Posted by: manjunath - 12-23-2022, 09:09 AM - Forum: DS- Lab Q&A
- No Replies
|
|
A normal distribution curve is a graphical representation of a normal distribution, which is a type of probability distribution that is symmetrical around the mean. It is also known as a bell curve because of its shape.
In a normal distribution, the majority of the data is concentrated around the mean, and the values become increasingly rare as they get further from the mean. This means that, in a normal distribution, there are fewer extreme values and more values that are close to the mean.
Normal distribution curves are often used in statistics and data science to represent data that follows a normal distribution. They can be used to visualize the distribution of a dataset and to identify patterns or trends in the data.
One way to use a normal distribution curve in a dataset is to fit a curve to the data and use it to make predictions about future outcomes. For example, if you have a dataset of test scores and you want to predict the probability that a student will score above a certain threshold, you can fit a normal distribution curve to the data and use it to calculate the probability.
Another way to use a normal distribution curve in a dataset is to identify outliers: values that lie far in the tails of the fitted curve (for example, more than three standard deviations from the mean) can be flagged for review. If the data does not follow a normal distribution, it may be necessary to transform the data in order to make it more normally distributed. This can be done using a variety of techniques, such as log transformation or standardization.
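As a rough illustration of the test-score example above, the following SciPy sketch fits a normal curve to a small set of made-up scores, estimates the probability of exceeding a threshold, and flags outliers; the scores and the 85-point threshold are assumptions for the example:

```python
# Minimal sketch: fit a normal curve to scores and estimate P(score > threshold).
# The scores and the threshold of 85 are made-up illustration values.
import numpy as np
from scipy import stats

scores = np.array([62, 71, 75, 78, 80, 81, 84, 86, 88, 90, 93, 97], dtype=float)

mu, sigma = stats.norm.fit(scores)                    # estimate mean and standard deviation
p_above_85 = stats.norm.sf(85, loc=mu, scale=sigma)   # P(score > 85) via the survival function

print(f"fitted mean={mu:.1f}, std={sigma:.1f}, P(score > 85)={p_above_85:.2%}")

# Simple outlier flag: anything more than 3 standard deviations from the mean.
z = (scores - mu) / sigma
print("outliers:", scores[np.abs(z) > 3])
```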
|
|
|
What is correlation between two variables |
Posted by: manjunath - 12-23-2022, 09:07 AM - Forum: DS- Lab Q&A
- No Replies
|
|
Correlation is a measure of the relationship between two variables.
It tells you how closely two variables are related to each other, and it can range from -1 (perfect negative correlation) to 1 (perfect positive correlation).
It can help you understand the relationships between different variables in a dataset and how they might be related to one another.
An example of positive correlation would be height and weight. Taller people tend to be heavier.
A negative correlation is a relationship between two variables in which an increase in one variable is associated with a decrease in the other.
In data science, correlation is often used to identify patterns and trends in data and to make predictions about future outcomes.
It can be measured using a variety of statistical techniques, such as Pearson's correlation coefficient or Spearman's rank correlation coefficient.
It is important to note that correlation does not imply causation: just because two variables are correlated does not necessarily mean that one variable causes the other. It is possible that there is another factor causing the relationship between the two variables.
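For example, a minimal SciPy sketch using made-up height/weight values (the numbers are illustrative, not real measurements):

```python
# Minimal sketch of measuring correlation, using the height/weight example above.
import numpy as np
from scipy import stats

height_cm = np.array([152, 160, 165, 170, 175, 180, 185, 190])
weight_kg = np.array([ 50,  55,  62,  66,  72,  77,  85,  90])

pearson_r, pearson_p = stats.pearsonr(height_cm, weight_kg)     # linear relationship
spearman_r, spearman_p = stats.spearmanr(height_cm, weight_kg)  # monotonic (rank) relationship

print(f"Pearson r     = {pearson_r:.2f} (p={pearson_p:.3f})")
print(f"Spearman rho  = {spearman_r:.2f} (p={spearman_p:.3f})")
# Values near +1 indicate a strong positive correlation; near -1, a strong negative one.
```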
|
|
|
What are the different functions of Kafka in pipelines |
Posted by: manjunath - 12-23-2022, 09:02 AM - Forum: BDB Data Pipeline Q & A
- No Replies
|
|
Apache Kafka is a distributed streaming platform that is often used as the backbone for a data pipeline. It is used to build real-time data pipelines and streaming apps.
A data pipeline is a set of processes that move data from one place to another. The data can be moved between systems, or within a system. Kafka can be used as a central hub for moving data between systems, or as a way to move data within a system.
There are several ways in which Kafka can be used in a data pipeline:
- As a message broker: Kafka can be used to send messages between systems. For example, a system that generates data can send the data to Kafka, and other systems that need the data can consume it from Kafka.
- As a streaming platform: Kafka can be used to process streaming data in real-time. For example, a system that generates a stream of data can send the data to Kafka, and other systems can consume the data from Kafka and process it in real-time.
- As a buffer: Kafka can be used to buffer data between systems that operate at different speeds. For example, a system that generates data quickly can send the data to Kafka, and another system that processes the data more slowly can consume the data from Kafka at its own pace.
- As a repository: Kafka can be used to store data for a certain period of time. For example, a system that generates data can send the data to Kafka, and other systems can consume the data from Kafka and store it for further analysis or reporting.
Overall, Kafka is a powerful tool for building data pipelines and streaming applications, and it is widely used in many different types of systems.
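As a small illustration of the message-broker role, here is a sketch using the kafka-python client; the broker address (localhost:9092) and the topic name (pipeline-events) are placeholders for your own environment:

```python
# Minimal sketch of Kafka as a message broker between two systems, using the
# kafka-python client. Broker address and topic name are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producing side: a system that generates data publishes it to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("pipeline-events", {"order_id": 42, "status": "created"})
producer.flush()

# Consuming side: a downstream system reads from the same topic at its own pace.
consumer = KafkaConsumer(
    "pipeline-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print("received:", message.value)
    break  # stop after one message in this illustration
```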
|
|
|
How to monitor data pipelines in production |
Posted by: manjunath - 12-23-2022, 08:14 AM - Forum: BDB Data Pipeline Q & A
- No Replies
|
|
Monitoring data pipelines in production is an important aspect of data management, as it helps ensure that the pipeline is running smoothly and that data is being processed and delivered as expected. There are several key considerations when it comes to monitoring data pipelines in production:
- Data pipeline performance: It is important to monitor the performance of the data pipeline, including how long it takes to run and how much data it processes. This can help identify bottlenecks and areas for optimization.
- Data quality: Data quality is critical for ensuring that the data pipeline is producing accurate and reliable results. It is important to monitor data quality metrics such as completeness, accuracy, and consistency (along with component logs, e.g. pod logs), and to take corrective action if necessary.
- Data security: Data security is an important concern when working with production data pipelines. It is important to monitor access to the data and ensure that only authorized users have access to sensitive data.
- Data availability: It is important to ensure that the data pipeline is available and running as expected. This includes monitoring for downtime or failures and taking corrective action if necessary.
To monitor data pipelines in production, organizations can use a variety of tools and techniques, including monitoring dashboards and alerting systems. It is also important to have processes in place for responding to issues and failures as they occur.
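As a generic illustration (not a BDB-specific feature), a scheduled data-quality/freshness check might look like the sketch below; the metric names and thresholds are placeholder assumptions:

```python
# Minimal sketch of a freshness/quality check that could run on a schedule and
# alert when a pipeline output looks stale or incomplete. Thresholds are placeholders;
# the "alert" here is just a print statement.
import datetime as dt

def check_pipeline_output(last_load_time: dt.datetime,
                          row_count: int,
                          null_ratio: float,
                          max_lag: dt.timedelta = dt.timedelta(hours=1),
                          min_rows: int = 1000,
                          max_null_ratio: float = 0.05) -> list[str]:
    """Return a list of alert messages; an empty list means the checks passed."""
    alerts = []
    lag = dt.datetime.utcnow() - last_load_time
    if lag > max_lag:
        alerts.append(f"Data is stale: last load {lag} ago (limit {max_lag}).")
    if row_count < min_rows:
        alerts.append(f"Row count {row_count} below expected minimum {min_rows}.")
    if null_ratio > max_null_ratio:
        alerts.append(f"Null ratio {null_ratio:.1%} exceeds {max_null_ratio:.1%}.")
    return alerts

# Example run with made-up metrics gathered from the pipeline's target table.
for alert in check_pipeline_output(
        last_load_time=dt.datetime.utcnow() - dt.timedelta(hours=3),
        row_count=250,
        null_ratio=0.12):
    print("ALERT:", alert)   # in production this could page or post to a chat channel
```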
|
|
|
How to create a Data Sync and use it in pipelines |
Posted by: manjunath - 12-23-2022, 08:00 AM - Forum: BDB Data Pipeline Q & A
- No Replies
|
|
To use a Data Sync in a pipeline:
First, create a Data Sync in the pipeline settings by specifying the host, port, database name, username, password, and driver (e.g. MongoDB, MySQL, ClickHouse, PostgreSQL, Oracle).
The Data Sync can then be used across pipelines.
To add the Data Sync to a pipeline, open the toggle panel in the pipeline, create it using the plus button, and drag and drop it onto the canvas.
Then specify the table name and save mode (append or upsert) in the Data Sync component. There is no need to use a Kafka component before the Data Sync.
(Upsert: one extra field, Composite Key, is displayed for the upsert save mode.)
|
|
|
Data Catalogue and Uses |
Posted by: Abhiram.m - 12-23-2022, 07:47 AM - Forum: BDB Platform Q & A
- No Replies
|
|
Data Catalogue - A data catalogue is an organized inventory of data assets in an organization.
- It's a collection of metadata, combined with data management & search tools, that helps analysts and other data users find the data they need.
- It provides a single place where all of the organization's data can be catalogued, enriched, searched, tracked and prioritized whether big or small, internally or externally sourced, available as files, databases or APIs.
- It serves as an inventory of available data & provides information to evaluate the fitness of data for intended uses, by providing a single view across all data of interest to a user, regardless of where the data is stored and where it was sourced from.
Purpose - It helps data professionals collect, organize, access & enrich metadata to support data discovery & governance.
- It enables users to search through the various data assets available within the organization's data platform, in a manner similar to a search engine, so they can fetch the necessary data from the desired location.
- Users can get important details of the available data assets, such as metadata details, the type of asset, relationships between various data assets, dependencies between data assets, etc.
|
|
|
|