AI Chatbots and Assistants

Explore the best AI Chatbots and Assistants — independent reviews, comparisons, pricing and step-by-step how-to guides, curated by Aizhi.

  • The Cancer Imaging Archive

    The Cancer Imaging Archive

    The Cancer Imaging Archive (TCIA) is an open-access database of medical images for cancer research. The site is funded by the National Cancer Institute's (NCI) Cancer Imaging Program, and the contract is operated by the University of Arkansas for Medical Sciences. Data within the archive is organized into collections which typically share a common cancer type and/or anatomical site. The majority of the data consists of CT, MRI, and nuclear medicine (e.g. PET) images stored in DICOM format, but many other types of supporting data are also provided or linked to, in order to enhance research utility. All data are de-identified in order to comply with the Health Insurance Portability and Accountability Act and National Institutes of Health data sharing policies. TCIA resources are intended to support: Development of computer aided diagnosis methods (quantitative imaging) Evaluation of unbiased science reproducibility by acceptable standard statistical methods Research on correlation of clinical diagnostic medical images with digital microscopic histological images Exploratory biomarker research for which imaging is a key element Collaboration between cross-disciplinary investigators where imaging is crucial to research on tumor heterogeneity, between patients and within the tumor; tissue temporal response tracking - objective measurements of tumor progression; imaging genomics and Big Data linkages and analysis (clinical, histo-pathology, genomics) TCIA is recognized as a recommended repository for the Scientific Data, PLOS One, and F1000Research journals. It is also listed in the Registry of Research Data Repositories. == History == Prior to the creation of TCIA, the NCI funded development of the National Biomedical Imaging Archive. NBIA is an open-source Web application which was designed to allow the storage and query of DICOM images. TCIA was subsequently initiated in December 2010 to expand data sharing activities by funding a service component which would help address the technical and policy challenges associated with medical imaging research. TCIA leverages open-source tools such as NBIA and Clinical Trials Processor in order to provide its services. == Organization of the archive == The site content is organized into five categories: About Us - Provides a general overview of the site the organizations responsible for operating it. Share Your Data - Provides an overview of how to apply to upload data to the archive. Access the Archive - Provides information about the available data, methods for accessing that data and system usage metrics. Research Activities - Provides information about major research initiatives being conducted using TCIA data as well as information about publication guidelines. Help - Provides information about how to get support using the archive as well as documentation and data usage policies. == Methods for accessing data == Most collections on the Cancer Imaging Archive can be accessed without an account, but a few are restricted to specific users and therefore require an account to access them. TCIA has several ways to browse, filter, and download data. They include: Downloading the entire contents of a collection in bulk Leveraging the NBIA application to filter or search within or across collections Utilizing the RESTful Application programming interface to filter or search within or across collections === Browsing, bulk downloading and access to supporting data === The home page includes a list of all available collections. Basic information about the data such as the cancer type, cancer location, modalities, and number of subjects are also provided. Clicking on a collection name presents a page which describes the data including its original research purpose, how the data were generated, and how it might be useful to other TCIA users. For example, doi:10.7937/K9/TCIA.2015.L4FRET6Z describes the NSCLC-Radiomics-Genomics Collection. In the lower section of the page there are links to search or download the images and any available supporting data in the Data Access tab. Additional tabs provide information about data versions and how to cite the data if used in publications. Many collections contain additional data types such as genomics, patient demographics, treatment details, and expert analyses of the images. This data is usually only found by browsing the collection pages as opposed to searching in NBIA or using the API. === Filtering or searching with NBIA === On each Collection page and also in the main menu of the site there are links to "Search TCIA". This will load the NBIA application which allows simple, advanced and free text searches. Search results follow the conventional DICOM hierarchy of patient -> study -> series. TCIA provides comprehensive documentation on the various features of the NBIA software. === RESTful API === A number of search and download commands are also available through the API. New iterations on the API are released as new versions, so that existing applications developed against older versions of the API continue to function. == Research activities == A list of known publications based on TCIA data is maintained as a convenience to researchers who might want to investigate how it has been used previously. In addition to peer-reviewed publications there are also several major research initiatives described in the Research Activities section of the site. === The CIP TCGA Radiology Initiative for Radiogenomics Research === A large number of collections contain subjects which were analyzed as part of the NIH/NHGRI database known as The Cancer Genome Atlas (TCGA). This offers researchers the ability to correlate clinical images using shared unique identifiers each study that has in TCGA extensive genomic analysis, digital pathology slides and bulk download of individual demographic data and clinical data. A multi-institutional network of investigators volunteering their time is using the data to develop methods to determine prognosis or predict the response to therapy. TCGA collections are designated by nomenclature shared by the TCGA Data Portal (e.g.: TCGA-BRCA, TCGA-GBM, etc). They are subject to a special publication policy which is unique from the other public data on TCIA. === Challenge competitions === TCIA also provides specific data sets used for "Challenge" competitions such as international digital image-focused professional societies like MICCAI, SPIE, or ISBI. A directory of previous and upcoming challenges is maintained on the site. === Digital object identifiers === To facilitate data sharing, many publications encourage authors to include data citations to the data that the authors used in creating the results described in their scholarly papers. In addition, new journals are now available for describing data collections outright (e.g., Nature Scientific Data). TCIA assigns digital object identifiers (DOIs) to all collections when they are submitted, and also has the ability to create persistent identifiers linked to subsets of data held within TCIA that authors may use for data citations in their scholarly papers.

    Read more →
  • Data exchange

    Data exchange

    Data exchange is the process of moving data from one information system to another. It often involves transforming data that is native to the source system into a form that is consumable by the target system or to a standardized form that is consumable by any compatible system. In particular, data exchange allows data to be shared between computer programs. Data exchange is similar to data integration except that data may be restructured with possible loss of content. There may be no way to transform a particular collection based on exchange constraints. Conversely, there may be multiple ways to transform the data, in which case one option must be identified in order to achieve compatibility between source and target. There are two main types of data exchange: broadcast and peer-to-peer (a.k.a. unicast). For broadcast, data is transmitted simultaneously to all consumers. Just as a conference call, all participants get the same information from the speaker at the same time. For peer-to-peer, data is sent to a single receiver, defined by a specific address. For example, a letter goes to just one mail box. == Single-domain == In some domains, a multiple source and target schema (proprietary data formats) may exist. An exchange or interchange format is often developed for a single domain, and then necessary routines (mappings) are written to (indirectly) transform/translate each and every source schema to each and every target schema by using the interchange format as an intermediate step. That requires less work than writing and debugging the many routines that would be required to directly translate each source schema directly to each target schema. Examples of these transformative interchange formats include: Standard Interchange Format for geospatial data; Data Interchange Format for spreadsheet data; Open Document Format for spreadsheets, charts, presentations and word processing documents; GPS eXchange Format or Keyhole Markup Language for describing GPS data; GDSII for integrated circuit layout. == Representation == A data exchange (a.k.a. interchange) language defines a domain-independent way to represent data. These languages have evolved from being markup and display-oriented to support the encoding of metadata that describes the structural attributes of the information. Practice has shown that certain types of formal languages are better suited for this task than others, since their specification is driven by a formal process instead of particular software implementation. For example, XML is a markup language that was designed to enable the creation of dialects (the definition of domain-specific sublanguages). However, it does not contain domain-specific dictionaries or fact types. Beneficial to a reliable data exchange is the availability of standard dictionaries-taxonomies and tools libraries such as parsers, schema validators, and transformation tools. === XML === The popularity of XML for data exchange on the World Wide Web has several reasons. First of all, it is closely related to the preexisting standards Standard Generalized Markup Language (SGML) and Hypertext Markup Language (HTML), and as such a parser written to support these two languages can be easily extended to support XML as well. For example, XHTML has been defined as a format that is formal XML, but understood correctly by most (if not all) HTML parsers. === YAML === YAML was designed to be human-readable and authored via a text editor with notion similar to reStructuredText and wiki syntax. YAML 1.2 also includes a shorthand notion that is compatible with JSON, and as such any JSON document is also valid YAML; this however does not hold the other way. === REBOL === REBOL was designed to be human-readable and authored via a text editor. It uses a simple free-form syntax with minimal punctuation and a rich set of data types (such as URL, email, date and time, tuple, string, tag) that respect common standards. It is designed to not need any additional meta-language, being designed in a metacircular fashion which is why the parse dialect used for definitions and transformations of REBOL dialects is also itself a dialect of REBOL. REBOL was used as a source of inspiration for JSON. === Gellish === Gellish English is a formalized subset of natural English (language), which includes a simple grammar and a large, extensible dictionary (taxonomy) that defines the general and domain specific terminology, whereas the concepts are arranged in a hierarchy, which supports inheritance of knowledge and requirements. The dictionary also includes standardized fact types. The terms and relation types together can be used to create and interpret expressions of facts, knowledge, requirements and other information. Gellish can be used in combination with SQL, RDF/XML, OWL and various other meta-languages. The Gellish standard is a combination of ISO 10303-221 (AP221) and ISO 15926. === List === The following describes and compares popular data exchange languages. Columns Schemas – Whether supports representing domain specific data structure definition Flexible – Whether supports extension of the semantic expression capabilities without modifying the schema Semantic verification – Whether supports semantic verification of the correctness of expressions in the language Dictionary – Whether includes a dictionary and a taxonomy (hierarchy) of concepts with inheritance Information model – Whether supports an information model Synonyms and homonyms – Whether supports the use of synonyms and homonyms in expressions Dialecting – Whether is available in multiple natural languages or dialects Web standard – Whether is standardized by a recognized body Transformations – Whether includes a translation to other standards Lightweight – Whether a lightweight version is available Human readable – Whether expressions are understandable without training Compatibility – Which other tools can be used or are required

    Read more →
  • Forking lemma

    Forking lemma

    The forking lemma is any of a number of related lemmas in cryptography research. The lemma states that if an adversary (typically a probabilistic Turing machine), on inputs drawn from some distribution, produces an output that has some property with non-negligible probability, then with non-negligible probability, if the adversary is re-run on new inputs but with the same random tape, its second output will also have the property. This concept was first used by David Pointcheval and Jacques Stern in "Security proofs for signature schemes," published in the proceedings of Eurocrypt 1996. In their paper, the forking lemma is specified in terms of an adversary that attacks a digital signature scheme instantiated in the random oracle model. They show that if an adversary can forge a signature with non-negligible probability, then there is a non-negligible probability that the same adversary with the same random tape can create a second forgery in an attack with a different random oracle. The forking lemma was later generalized by Mihir Bellare and Gregory Neven. The forking lemma has been used and further generalized to prove the security of a variety of digital signature schemes and other random-oracle based cryptographic constructions. == Statement of the lemma == The generalized version of the lemma is stated as follows. Let A be a probabilistic algorithm, with inputs (x, h1, ..., hq; r) that outputs a pair (J, y), where r refers to the random tape of A (that is, the random choices A will make). Suppose further that IG is a probability distribution from which x is drawn, and that H is a set of size h from which each of the hi values are drawn according to the uniform distribution. Let acc be the probability that on inputs distributed as described, the J output by A is greater than or equal to 1. We can then define a "forking algorithm" FA that proceeds as follows, on input x: Pick a random tape r for A. Pick h1, ..., hq uniformly from H. Run A on input (x, h1, ..., hq; r) to produce (J, y). If J = 0, then return (0, 0, 0). Pick h'J, ..., h'q uniformly from H. Run A on input (x, h1, ..., hJ−1, h'J, ..., h'q; r) to produce (J', y'). If J' = J and hJ ≠ h'J then return (1, y, y'), otherwise, return (0, 0, 0). Let frk be the probability that FA outputs a triple starting with 1, given an input x chosen randomly from IG. Then frk ≥ acc ⋅ ( acc q − 1 h ) . {\displaystyle {\text{frk}}\geq {\text{acc}}\cdot \left({\frac {\text{acc}}{q}}-{\frac {1}{h}}\right).} === Intuition === The idea here is to think of A as running two times in related executions, where the process "forks" at a certain point, when some but not all of the input has been examined. In the alternate version, the remaining inputs are re-generated but are generated in the normal way. The point at which the process forks may be something we only want to decide later, possibly based on the behavior of A the first time around: this is why the lemma statement chooses the branching point (J) based on the output of A. The requirement that hJ ≠ h'J is a technical one required by many uses of the lemma. (Note that since both hJ and h'J are chosen randomly from H, then if h is large, as is usually the case, the probability of the two values not being distinct is extremely small.) === Example === For example, let A be an algorithm for breaking a digital signature scheme in the random oracle model. Then x would be the public parameters (including the public key) A is attacking, and hi would be the output of the random oracle on its ith distinct input. The forking lemma is of use when it would be possible, given two different random signatures of the same message, to solve some underlying hard problem. An adversary that forges once, however, gives rise to one that forges twice on the same message with non-negligible probability through the forking lemma. When A attempts to forge on a message m, we consider the output of A to be (J, y) where y is the forgery, and J is such that m was the Jth unique query to the random oracle (it may be assumed that A will query m at some point, if A is to be successful with non-negligible probability). (If A outputs an incorrect forgery, we consider the output to be (0, y).) By the forking lemma, the probability (frk) of obtaining two good forgeries y and y' on the same message but with different random oracle outputs (that is, with hJ ≠ h'J) is non-negligible when acc is also non-negligible. This allows us to prove that if the underlying hard problem is indeed hard, then no adversary can forge signatures. This is the essence of the proof given by Pointcheval and Stern for a modified ElGamal signature scheme against an adaptive adversary. == Known issues with application of forking lemma == The reduction provided by the forking lemma is not tight. Pointcheval and Stern proposed security arguments for Digital Signatures and Blind Signature using Forking Lemma. Claus P. Schnorr provided an attack on blind Schnorr signatures schemes, with more than p o l y l o g ( n ) {\displaystyle polylog(n)} concurrent executions (the case studied and proven secure by Pointcheval and Stern). A polynomial-time attack, for Ω ( n ) {\displaystyle \Omega (n)} concurrent executions, was shown in 2020 by Benhamouda, Lepoint, Raykova, and Orrù. Schnorr also suggested enhancements for securing blind signatures schemes based on discrete logarithm problem.

    Read more →
  • Chaos Communication Congress

    Chaos Communication Congress

    The Chaos Communication Congress is an annual hacker conference organized by the Chaos Computer Club. The congress features a variety of lectures and workshops on technical and political issues related to security, cryptography, privacy and online freedom of speech. It has taken place regularly at the end of the year since 1984, with the current date and duration (27–30 December) established in 2005. It is considered one of the largest events of its kind, alongside DEF CON in Las Vegas. == History == The congress is held in Germany. It started in 1984 in Hamburg, moved to Berlin in 1998, and back to Hamburg in 2012, having exceeded the capacity of the Berlin venue with more than 4500 attendees. Since then, it attracts an increasing number of people: around 6600 attendees in 2012, over 13000 in 2015, and more than 15000 in 2017. From 2017 to 2019, it took place at the Trade Fair Grounds in Leipzig, since the Hamburg venue (CCH) was closed for renovation in 2017 and the existing space was not enough for the growing congress. The congress moved back to Hamburg in 2023, after the renovation of CCH was finished. A large range of speakers are featured. The event is organized by volunteers called Chaos Angels. The non-members entry fee for four days was €100 in 2016, and was raised to €120 in 2018 to include a public transport ticket for the Leipzig area. An important part of the congress are the assemblies, semi-open spaces with clusters of tables and internet connections for groups and individuals to collaborate and socialize in projects, workshops and hands-on talks. These assembly spaces, introduced at the 2012 meeting, combine the hack center project space and distributed group spaces of former years. From 1997 to 2004 the congress also hosted the annual German Lockpicking Championships. 2005 was the first year the Congress lasted four days instead of three and lacked the German Lockpicking Championships. 2020 was the first year where the Congress did not take place at a physical location due to the COVID-19 pandemic, giving way to the first Remote Chaos Experience (rC3). The Chaos Computer Club announced to return to the now newly renovated Congress Center Hamburg for the 37th edition of the Chaos Communication Congress. The announcement confirms the usual date of 27-30 December, notably omitting the year it will be held. On 18 October 2022, they confirmed that the congress will indeed not be held in 2022. On 6 October 2023, the CCC announced that 37C3 will take place again on the usual dates in 2023. === Timeline ===

    Read more →
  • List of monochrome and RGB color formats

    List of monochrome and RGB color formats

    This list of monochrome and RGB palettes includes generic repertoires of colors (color palettes) to produce black-and-white and RGB color pictures by a computer's display hardware. RGB is the most common method to produce colors for displays; so these complete RGB color repertoires have every possible combination of R-G-B triplets within any given maximum number of levels per component. Each palette is represented by a series of color patches. When the number of colors is low, a 1-pixel-size version of the palette appears below it, for easily comparing relative palette sizes. Huge palettes are given directly in one-color-per-pixel color patches. For each unique palette, an image color test chart and sample image (truecolor original follows) rendered with that palette (without dithering) are given. The test chart shows the full 256 levels of the red, green, and blue (RGB) primary colors and cyan, magenta, and yellow complementary colors, along with a full 256-level grayscale. Gradients of RGB intermediate colors (orange, lime green, sea green, sky blue, violet, and fuchsia), and a full hue spectrum are also present. Color charts are not gamma corrected. These elements illustrate the color depth and distribution of the colors of any given palette, and the sample image indicates how the color selection of such palettes could represent real-life images. These images are not necessarily representative of how the image would be displayed on the original graphics hardware, as the hardware may have additional limitations regarding the maximum display resolution, pixel aspect ratio and color placement. Implementation of these formats is specific to each machine. Therefore, the number of colors that can be simultaneously displayed in a given text or graphic mode might be different. Also, the actual displayed colors are subject to the output format used - PAL or NTSC, composite or component video, etc. - and might be slightly different. For simulated images and specific hardware and alternate methods to produce colors other than RGB (ex: composite), see the List of 8-bit computer hardware palettes, the List of 16-bit computer hardware palettes and the List of video game console palettes. For various software arrangements and sorts of colors, including other possible full RGB arrangements within 8-bit color depth displays, see the List of software palettes. == Monochrome palettes == These palettes only have some shades of gray, from black to white (considered the darkest and lightest "grays", respectively). The general rule is that those palettes have 2n different shades of gray, where n is the number of bits needed to represent a single pixel. === Monochrome (1-bit grayscale) === Monochrome graphics displays typically have a black background with a white or light gray image, though green and amber monochrome monitors were also common. Such a palette requires only one bit per pixel. Where photo-realism was desired, these early computer systems had a heavy reliance on dithering to make up for the limits of the technology. In some systems, as Hercules and CGA graphic cards for the IBM PC, a bit value of 1 represents white pixels (light on) and a value of 0 the black ones (light off); others, like the Playdate and Atari ST and Apple Macintosh with monochrome monitors, a bit value of 0 means a white pixel (no ink) and a value of 1 means a black pixel (dot of ink), which it approximates to the printing logic. === 2-bit Grayscale === In a 2-bit color palette each pixel's value is represented by 2 bits resulting in a 4-value palette (22 = 4). 2-bit dithering: It has black, white and two intermediate levels of gray as follows: A monochrome 2-bit palette is used on: The Monochrome Display Adapter for the IBM PC NeXT Computer, NeXTcube and NeXTstation monochrome graphic displays. Original Game Boy system portable video game console. Macintosh PowerBook 150 monochrome LC displays. Amiga with A2024 monochrome monitor in high-resolution mode. The original Amazon Kindle The original WonderSwan The Tiger Electronics Game.com portable video game console The original Neo Geo Pocket. === 4-bit Grayscale === In a 4-bit color palette each pixel's value is represented by 4 bits resulting in a 16-value palette (24 = 16): 4-bit grayscale dithering does a fairly good job of reducing visible banding of the level changes: A monochrome 4-bit palette is used on: MOS Technology VDC (on the Commodore 128 with monochrome monitor) Amstrad CPC series with a GT64/GT65 Green Monitor (16 unique green shades) Amstrad CPC Plus series with the MM12 Monochrome monitor (16 shades of grey) Some Apple PowerBooks equipped with monochrome displays like the PowerBook 5300 The original VideoNow === 8-bit Grayscale === In an 8-bit color palette each pixel's value is represented by 8 bits resulting in a 256-value palette (28 = 256). This is usually the maximum number of grays in ordinary monochrome systems; each image pixel occupies a single memory byte. Most scanners can capture images in 8-bit grayscale, and image file formats like TIFF and JPEG natively support this monochrome palette size. Alpha channels employed for video overlay also use (conceptually) this palette. The gray level indicates the opacity of the blended image pixel over the background image pixel. == Dichrome palettes == === 16-bit RG palette === The RG or red–green color space is a color space that uses only two primary colors: red and green. It was used on early color processes for films. It was used as an additive format, similar to the RGB color model but without a blue channel, on processes such as Kinemacolor, Prizma, Technicolor I, Raycol, etc., producing shades of black, red, green and yellow. Alternatively, it was used as a subtractive format on Brewster Color I, Kodachrome I, Prizma II, Technicolor II, etc., producing shades of transparent, red, green and black. Until recently, its primary use was in low-cost light-emitting diode displays in which red and green tended to be far more common than the still nascent blue LED technology, but full-color LEDs with blue have become more common in recent years. ColorCode 3-D, a anaglyph stereoscopic color scheme, uses the RG color space to simulate a broad spectrum of color in one eye, while the blue portion of the spectrum transmits a black-and-white (black-and-blue) image to the other eye to give depth perception. === 16-bit RB palette === === 16-bit GB palette === == Regular RGB palettes == Here are grouped those full RGB hardware palettes that have the same number of binary levels (i.e., the same number of bits) for every red, green and blue components using the full RGB color model. Thus, the total number of colors are always the number of possible levels by component, n, raised to a power of 3: n×n×n = n3. === 3-bit RGB === 3-bit RGB dithering: Systems with a 3-bit RGB palette use 1 bit for each of the red, green and blue color components. That is, each component is either "on" or "off" with no intermediate states. This results in an 8-color palette ((21)3 = 23 = 8) that has black, white, the three RGB primary colors red, green and blue and their correspondent complementary colors cyan, magenta and yellow as follows: The color indices vary between implementations; therefore, index numbers are not given. The 3-bit RGB palette is used by: Text terminals following the ECMA-48 standard (sometimes known as the "ANSI standard", although ANSI X3.128 does not define colors) World System Teletext Level 1/1.5 Videotex Oric computers BBC Micro PC-8801 (up to the MkII) PC-9801 (with original 8086 CPU, before the VM/VX models) Sharp X1 (models before the X1 Turbo Z) Sharp MZ 700 FM-7, FM New 7, FM 77 (before the FM77AV) Sinclair QL Space Invaders Part II (arcade hardware) Macintosh SE (with a color printer or external monitor) Atari 2600 (SECAM version) Color Maximite (PIC32 based microcomputer) Arcadia 2001 PV-1000 Monkey Magic (arcade hardware) VIC-20 (high-res mode) Mouse Trap (arcade hardware) Sanyo MBC-550 series Windows 1.0 (includes dithering) === 6-bit RGB === Systems with a 6-bit RGB palette use 2 bits for each of the red, green, and blue color components. This results in a (22)3 = 43 = 64-color palette as follows: 6-bit RGB systems include the following: Enhanced Graphics Adapter (EGA) for IBM PC/AT (16 colors at once) Sega Master System video game console (32 colors at once) GIME for TRS-80 Color Computer 3 (16 colors at once) Pebble Time smartwatch which has a 6-bit (64 color) e-paper display Parallax Propeller using the reference VGA circuit === 9-bit RGB === Systems with a 9-bit RGB palette use 3 bits for each of the red, green, and blue color components. This results in a (23)3 = 83 = 512-color palette as follows: 9-bit RGB systems include the following: Atari ST (Normally 4 to 16 at once without tricks) MSX2 computers (up to 16 at once) Sega Genesis video game console, (64 colors at once) Sega Nomad TurboGrafx-16 (NEC PC-Engine) ZX Spectrum Next The NEC PC-88

    Read more →
  • Smart-ID

    Smart-ID

    Smart-ID is an electronic authentication tool developed by SK ID Solutions, an Estonian company. Users can log in to various electronic services and sign documents with an electronic signature. Smart-ID meets the European Union's eIDAS Regulation and the European Central Bank's standards for a secure authentication solution. Smart-ID is a Qualified Signature Creator Device (QSCD) that can issue a Qualified Electronic Signature (QES). The Smart-ID app is compatible with both iOS and Android devices and does not require a SIM card. By 2021, the Smart-ID application was launched in the Huawei AppGallery. As of May 2023, Smart-ID has 3,298,969 active users across the Baltic States (Latvia, Lithuania, and Estonia). Every month, the Smart-ID processes 79 million transactions. In March 2023, Smart-ID users made an exceptional 85 million transactions. == History == In November 2016, SK ID Solutions debuted the Smart-ID tool for the first time at its annual conference. In February 2017, eKool, Starman, and Tallinn Kaubamaja Grupp were the first to implement Smart-ID authentication in their e-services. In March 2017, Smart-ID was added as an authentication option to SEB bank and Swedbank's online banking in all three Baltic States. Dokobit, previously known as DigiDoc, began offering its clients the ability to use e-services using Smart-ID in April 2017. More than 100 service providers had implemented Smart-ID as an authentication solution for their services by November 2019. At its annual conference on November 8, 2018, SK ID Solutions revealed that Smart-ID had been certified as compatible with the QSCD[8] level, the highest level of qualified electronic signature in the European Union, following a rigorous certification process. As a result, the Smart-QES-level ID's electronic signature, the digital counterpart of a handwritten signature, is now available to all users who have registered with the tool. This signature is accepted by all European Union member states. On August 26, 2019, Estonian Information Systems Supervisory Authority experts reviewed Smart-ID (ISSA). Based on the methods provided in the eIDAS Regulation, the expert committee concluded that Smart-ID offers a high level of electronic identification assurance. SK ID Solutions and RIA struck an agreement in September 2019 that allows Smart-ID to authenticate Estonian state e-services via RIA's central authentication service, which is used by over 60 public authorities. Smart-ID accounts created three years ago have expired in January 2020. Therefore, renewing them and performing mandatory updates was necessary. In February 2020, SK ID Solutions announced that Smart-ID could be used to give digital signatures in the national digital signature software DigiDoc4, which up until this moment was only possible with ID cards via Mobile-ID. Users must have at least version 4.2.4.71 or later of the DigiDoc4 software installed on their computers to use this feature. Since February 2020, Smart-ID accounts can now be created with biometric information from an ID card or passport, but only by users who have previously used a Smart-ID account. Since October 2022, 13–17 years old minors in Lithuania are able to create a Smart-ID account using biometric information too. A parent or legal guardian must approve the registration. SK ID Solutions collaborated on the new solution with iProov from the United Kingdom and InnoValor from the Netherlands. TÜV Informationstechnik GmbH, a German certification company, assessed it. Since May 2023, Smart-ID can be used to submit company's annual reports in Estonia and digitally sign anything in the e-business register using your PIN2. == Overview == The Smart-ID app is available for download on Google Play and Apple's App Store. Android 4.4 and iOS 11 are the oldest supported operating system versions for Smart-ID. Smart-ID works on the premise of two-factor authentication, combining an intelligent device (something the user owns) with PINs (something the user knows). A new user must first authenticate themselves with an ID card or a mobile phone number and then confirm a PIN1 and PIN2 code, either manually or automatically produced. The first PIN is used to authenticate a person's identity when accessing e-banking or e-services, while the second PIN is used to support electronic signatures and authenticate transactions (e.g., transfers). The PIN1 code must be four digits long, while the PIN2 code must be five digits long. To log in to an e-service, the user must use Smart-ID as the authentication method and enter their unique Smart-ID user ID. A notification will open on the user's smart device where the software is installed and display a verification code. If the code matches the code presented to the user by the e-service, then the user can confirm the match by entering their PIN1 code. The user must verify the action with their PIN2 code when giving digital signatures. A Smart-ID account is valid for three years. The report can be updated, changed, and deleted at any given time, free of charge. Smart-ID is available in five languages: Estonian, Latvian, Lithuanian, Russian, and English. An international survey conducted in 2021 revealed that Smart-ID is the most reliable authentication solution in Baltic countries. In January 2023, the number of times Smart-ID was used to access State Authentication Service (TARA) in Estonia has surpassed those of Mobile-ID and ID-cards for the first time since July 2022. == Security == Smart-ID is based on Cybernetica's SplitKey authentication and digital signature platform technology, for which the company has filed a patent application. Public key cryptography, digital signature methods, and critical public infrastructures are all used in the technology. The user's PIN is not saved on the device and is only needed to decrypt the private key in the Smart-ID app. When the user inputs the PIN, the private key is cracked, and the answer is transmitted to the Smart-ID server, where a portion of the key given by the app is joined with the server's encrypted key. The app will block the user from accessing it for three hours if they input the incorrect PIN three times in a row. If this happens once again, the app will lock for 24 hours. If this happens a third time, the account will be permanently disabled. PINs cannot be changed or recovered once an account has been created. The user must create a new account if the account is permanently blocked. Smart-ID uses the Apple and Google messaging networks to notify the app when new data is saved on its servers. == Phishing == In February 2019, unknown criminals attempted to create Smart-ID accounts with stolen IDs obtained via phishing customers' text messages and website addresses, according to a monthly report by the Estonian Information System Manager in April 2019. The Latvian Information Technology Security Incident Assessment Body Cert was also notified of these intrusions on March 1. Fraudsters sent emails to potential victims pretending to be bank representatives. The mails linked users to a phishing page after redirecting them to a phony bank login page. Victims were asked to log in using their identification information and PIN1 code. The fraudsters then began the process of generating a new Smart-ID account. As a result, the victim had to input a PIN2 number, which permitted the fraudster to finish setting up a new tab with the victim's personal information. Fraudsters in Estonia were able to log in to multiple e-services utilizing Smart-ID using a Smart-ID account and the victim's data. On behalf of the victims, fraudsters also employed online banking services. Later, the Estonian Information System Manager identified several victims, some of whom had also experienced financial losses. The Estonian Information System Manager requested a full report on the event from SK ID Solutions. The organization opted not to criticize the corporation after receiving the information, although it did propose that the procedure of creating Smart-ID accounts be reviewed. According to the Estonian Banking Association, Estonian banks have not discontinued using Smart-ID and do not think it is required. Smart-ID was exposed to a thorough review process in September 2019 to determine this authentication instrument's level of security. Reviewers discovered no flaws, and SK ID Solutions and the Estonian Information System Manager signed a contract. Estonia later introduced Smart-ID and other authentication mechanisms to the central public services portal.

    Read more →
  • Social media optimization

    Social media optimization

    Social media optimization (SMO) is the use of online platforms to generate income or publicity to increase the awareness of a brand, event, product or service. Types of social media involved include RSS feeds, blogging sites, social bookmarking sites, social news websites, video sharing websites such as YouTube and social networking sites such as Facebook, Instagram, TikTok and X (Twitter). SMO is similar to search engine optimization (SEO) in that the goal is to drive web traffic, and draw attention to a company or creator. SMO's focal point is on gaining organic links to social media content. In contrast, SEO's core is about reaching the top of the search engine hierarchy. In general, social media optimization refers to optimizing a website and its content to encourage more users to use and share links to the website across social media and networking sites. SMO is used to strategically create online content ranging from well-written text to eye-catching digital photos or video clips that encourages and entices people to engage with a website. Users share this content, via its weblink, with social media contacts and friends. Common examples of social media engagement are "liking and commenting on posts, retweeting, embedding, sharing, and promoting content". Social media optimization is also an effective way of implementing online reputation management (ORM), meaning that if someone posts bad reviews of a business, an SMO strategy can ensure that the negative feedback is not the first link to come up in a list of search engine results. In the 2010s, with social media sites overtaking TV as a source for news for young people, news organizations have become increasingly reliant on social media platforms for generating web traffic. Publishers such as The Economist employ large social media teams to optimize their online posts and maximize traffic, while other major publishers now use advanced artificial intelligence (AI) technology to generate higher volumes of web traffic. == Relationship with search engine optimization == Social media optimization is an increasingly important factor in search engine optimization, which is the process of designing a website in a way so that it has as high a ranking as possible on search engines. Search engines are increasingly utilizing the recommendations of users of social networks such as Reddit, Facebook, Tumblr, Twitter, YouTube, LinkedIn, Pinterest and Instagram to rank pages in the search engine result pages. The implication is that when a webpage is shared or "liked" by a user on a social network, it counts as a "vote" for that webpage's quality. Thus, search engines can use such votes accordingly to properly ranked websites in search engine results pages. Furthermore, since it is more difficult to tip the scales or influence the search engines in this way, search engines are putting more stock into social search. This, coupled with increasingly personalized search based on interests and location, has significantly increased the importance of a social media presence in search engine optimization. Due to personalized search results, location-based social media presences on websites such as Yelp, Google Places, Foursquare, and Yahoo! Local have become increasingly important. While social media optimization is related to search engine marketing, it differs in several ways. Primarily, SMO focuses on driving web traffic from sources other than search engines, though improved search engine ranking is also a benefit of successful social media optimization. Further, SMO is helpful to target particular geographic regions in order to target and reach potential customers. This helps in lead generation (finding new customers) and contributes to high conversion rates (i.e., converting previously uninterested individuals into people who are interested in a brand or organization). == Relationship with viral marketing == Social media optimization is in many ways connected to the technique of viral marketing or "viral seeding" where word of mouth is created through the use of networking in social bookmarking, video and photo sharing websites. An effective SMO campaign can harness the power of viral marketing; for example, 80% of activity on Pinterest is generated through "repinning." Furthermore, by following social trends and utilizing alternative social networks, websites can retain existing followers while also attracting new ones. This allows businesses to build an online following and presence, all linking back to the company's website for increased traffic. For example, with an effective social bookmarking campaign, not only can website traffic be increased, but a site's rankings can also be increased. In a similar way, the engagement with blogs creates a similar result by sharing content through the use of RSS in the blogosphere. Social media optimization is considered an integral part of an online reputation management (ORM) or search engine reputation management (SERM) strategy for organizations or individuals who care about their online presence. SMO is one of six key influencers that affect Social Commerce Construct (SCC). Online activities such as consumers' evaluations and advices on products and services constitute part of what creates a Social Commerce Construct (SCC). Social media optimization is not limited to marketing and brand building. Increasingly, smart businesses are integrating social media participation as part of their knowledge management strategy (i.e., product/service development, recruiting, employee engagement and turnover, brand building, customer satisfaction and relations, business development and more). Additionally, social media optimization can be implemented to foster a community of the associated site, allowing for a healthy business-to-consumer (B2C) relationship. == Origins and implementation == According to technologist Danny Sullivan, the term "social media optimization" was first used and described by marketer Rohit Bhargava on his marketing blog in August 2006. In the same post, Bhargava established the five important rules of social media optimization. Bhargava believed that by following his rules, anyone could influence the levels of traffic and engagement on their site, increase popularity, and ensure that it ranks highly in search engine results. An additional 11 SMO rules have since been added to the list by other marketing contributors. The 16 rules of SMO, according to one source, are as follows: Increase your linkability Make tagging and bookmarking easy Reward inbound links Help your content to "travel" via sharing Encourage the mashup, where users are allowed to remix content Be a user resource, even if it doesn't help you (e.g., provide resources and information for users) Reward helpful and valuable users Participate (join the online conversation) Know how to target your audience Create new, quality content ("web scraping" of existing online content is ignored by good search engines) Be "real" in the tone and style of the posts Don't forget your roots; be humble Don't be afraid to experiment, innovate, try new things and "stay fresh" Develop an SMO strategy Choose your SMO tactics wisely Make SMO a key part of your marketing process and develop company best practices Bhargava's initial five rules were more specifically designed to SMO, while the list is now much broader and addresses everything that can be done across different social media platforms. According to author and CEO of TopRank Online Marketing, Lee Odden, a Social Media Strategy is also necessary to ensure optimization. This is a similar concept to Bhargava's list of rules for SMO. The Social Media Strategy may consider: Objectives e.g. creating brand awareness and using social media for external communications. Listening e.g. monitoring conversations relating to customers and business objectives. Audience e.g. finding out who the customers are, what they do, who they are influenced by, and what they frequently talk about. It is important to work out what customers want in exchange for their online engagement and attention. Participation and content e.g. establishing a presence and community online and engaging with users by sharing useful and interesting information. Measurement e.g. keeping a record of likes and comments on posts, and the number of sales to monitor growth and determine which tactics are most useful in optimizing social media. According to Lon Safko and David K. Brake in The Social Media Bible, it is also important to act like a publisher by maintaining an effective organizational strategy, to have an original concept and unique "edge" that differentiates one's approach from competitors, and to experiment with new ideas if things do not work the first time. If a business is blog-based, an effective method of SMO is using widgets that allow users to share content to their personal social media platforms. This will ultimately reach a wider target audience and drive mor

    Read more →
  • Data lineage

    Data lineage

    Data lineage refers to the process of tracking how data is generated, transformed, transmitted and used across systems over time. It documents data's origins, transformations and movements, providing detailed visibility into its life cycle. This process simplifies the identification of errors in data analytics workflows, by enabling users to trace issues back to their root causes. Data lineage facilitates the ability to replay specific segments or inputs of the dataflow. This can be used in debugging or regenerating lost outputs. In database systems, this concept is closely related to data provenance, which involves maintaining records of inputs, entities, systems and processes that influence data. Data provenance provides a historical record of data origins and transformations. It supports forensic activities such as data-dependency analysis, error/compromise detection, recovery, auditing and compliance analysis: "Lineage is a simple type of why provenance." Data governance plays a critical role in managing metadata by establishing guidelines, strategies and policies. Enhancing data lineage with data quality measures and master data management adds business value. Although data lineage is typically represented through a graphical user interface (GUI), the methods for gathering and exposing metadata to this interface can vary. Based on the metadata collection approach, data lineage can be categorized into three types: Those involving software packages for structured data, programming languages and Big data systems. Data lineage information includes technical metadata about data transformations. Enriched data lineage may include additional elements such as data quality test results, reference data, data models, business terminology, data stewardship information, program management details and enterprise systems associated with data points and transformations. Data lineage visualization tools often include masking features that allow users to focus on information relevant to specific use cases. To unify representations across disparate systems, metadata normalization or standardization may be required. == Representation of data lineage == Representation broadly depends on the scope of the metadata management and reference point of interest. Data lineage provides sources of the data and intermediate data flow hops from the reference point with backward data lineage, leading to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined with end-to-end lineage for a reference point that provides a complete audit trail of that data point of interest from sources to their final destinations. As the data points or hops increase, the complexity of such representation becomes incomprehensible. Thus, the best feature of the data lineage view is the ability to simplify the view by temporarily masking unwanted peripheral data points. Tools with the masking feature enable scalability of the view and enhance analysis with the best user experience for both technical and business users. Data lineage also enables companies to trace sources of specific business data to track errors, implement changes in processes and implementing system migrations to save significant amounts of time and resources. Data lineage can improve efficiency in business intelligence BI processes. Data lineage can be represented visually to discover the data flow and movement from its source to destination via various changes and hops on its way in the enterprise environment. This includes how the data is transformed along the way, how the representation and parameters change and how the data splits or converges after each hop. A simple representation of the Data Lineage can be shown with dots and lines, where dots represent data containers for data points, and lines connecting them represent transformations the data undergoes between the data containers. Data lineage can be visualized at various levels based on the granularity of the view. At a very high-level, data lineage is visualized as systems that the data interacts with before it reaches its destination. At its most granular, visualizations at the data point level can provide the details of the data point and its historical behavior, attribute properties and trends and data quality of the data passed through that specific data point in the data lineage. The scope of the data lineage determines the volume of metadata required to represent its data lineage. Usually, data governance and data management of an organization determine the scope of the data lineage based on their regulations, enterprise data management strategy, data impact, reporting attributes and critical data elements of the organization. == Rationale == Distributed systems like Google Map Reduce, Microsoft Dryad, Apache Hadoop (an open-source project) and Google Pregel provide such platforms for businesses and users. However, even with these systems, Big Data analytics can take several hours, days or weeks to run, simply due to the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores. "The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the 12 largest genome sequencing houses in the world now store petabytes of data apiece. It is very difficult for a data scientist to trace an unknown or an unanticipated result. === Big data debugging === Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Machine learning, among other algorithms, is used to transform and analyze the data. Due to the large size of the data, there could be unknown features in the data. The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may debug them by re-running the entire analytics through a debugger for stepwise debugging, this can be expensive due to the amount of time and resources needed. Auditing and data validation are other major problems due to the growing ease of access to relevant data sources for use in experiments, the sharing of data between scientific communities and use of third-party data in business enterprises. As such, more cost-efficient ways of analyzing data intensive scale-able computing (DISC) are crucial to their continued effective use. === Challenges in Big Data debugging === ==== Massive scale ==== According to an EMC/IDC study, 2.8 ZB of data were created and replicated in 2012. Furthermore, the same study states that the digital universe will double every two years between now and 2020, and that there will be approximately 5.2 TB of data for every person in 2020. Based on current technology, the storage of this much data will mean greater energy usage by data centers. ==== Unstructured data ==== Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content, such as e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents. While these types of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly into a database. The amount of unstructured data in enterprises is growing many times faster than structured databases are growing. Big data can include both structured and unstructured data, but IDC estimates that 90 percent of Big Data is unstructured data. The fundamental challenge of unstructured data sources is that they are difficult for non-technical business users and data analysts alike to unbox, understand and prepare for analytic use. Beyond issues of structure, the sheer volume of this type of data contributes to such difficulty. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive. In today's competitive business environment, companies have to find and analyze the relevant data they need quickly. The challenge is going through the volumes of data and accessing the level of detail needed, all at a high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware. Some vendors are using increased memory and parallel processing to crunch large volumes of data quickly. Another method is putting data in-memory but using a grid

    Read more →
  • Jaggies

    Jaggies

    Jaggies are visual artifacts in raster images, most frequently from aliasing, which in turn is often caused by non-linear mixing effects producing high-frequency components, or missing or poor anti-aliasing filtering prior to sampling. Jaggies are stair-like lines that appear where there should be "smooth" straight lines or curves. For example, when a nominally straight, un-aliased line steps across one pixel either horizontally or vertically, a "dogleg" occurs halfway through the line, where it crosses the threshold from one pixel to the other. Jaggies should not be confused with most compression artifacts, which are a different phenomenon. == Causes == Jaggies occur due to the "staircase effect". This is because a line represented in raster mode is approximated by a sequence of pixels. Jaggies can occur for a variety of reasons, the most common being that the output device (display monitor or printer) does not have sufficient resolution to portray a smooth line. In addition, jaggies often occur when a bit-mapped image is scaled to a higher resolution. This is one of the advantages that vector graphics have over bitmapped graphics – a vector image can be losslessly scaled to any arbitrary resolution or stretched infinitely in either axis without introducing jaggies. == Solutions == The effect of jaggies can be reduced by a graphics technique known as spatial anti-aliasing. Anti-aliasing smooths out jagged lines by surrounding them with transparent pixels to simulate the appearance of fractionally-filled pixels when viewed at a distance. The downside of anti-aliasing is that it reduces contrast – rather than sharp black/white transitions, there are shades of gray – and the resulting image can appear fuzzy. This is an inescapable trade-off: if the resolution is insufficient to display the desired detail, the output will either be jagged, fuzzy, or some combination thereof. While machine learning-based upscaling techniques such as DLSS can be used to infer this missing information, other types of artifacts may be introduced in the process. In real-time 3D rendering such as in video games, various anti-aliasing techniques are used to remove jaggies created by the edges of polygons and other contrasting lines. Since anti-aliasing can impose a significant performance overhead, games for home computers often allow users to choose the level and type of anti-aliasing in use in order to optimize their experience, whereas on consoles this setting is typically fixed for each title to ensure a consistent experience. While anti-aliasing is generally implemented through graphics APIs like DirectX and Vulkan, some consoles such as the Xbox 360 and PlayStation 3 are also capable of anti-aliasing to little direct performance cost by way of dedicated hardware which performs anti-aliasing on the contents of the framebuffer once it has been rendered by the GPU. Jaggies in bitmaps, such as sprites and surface materials, are most often dealt with by separate texture filtering routines, which are far easier to perform than anti-aliasing filtering. Texture filtering became ubiquitous on PCs after the introduction of 3Dfx's Voodoo GPU. == Notable uses of the term == In the 1985 game Rescue on Fractalus! for the Atari 8-bit computers, the graphics depicting the cockpit of the player's spacecraft contains two window struts, which are not anti-aliased and are therefore very "jagged". The developers made fun of this and named the in-game enemies "Jaggi", and also initially titled the game Behind Jaggi Lines!. The latter idea was scrapped by the marketing department before release.

    Read more →
  • Cryptographic nonce

    Cryptographic nonce

    In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication. It is often a random or pseudo-random number issued in an authentication protocol to ensure that each communication session is unique, and therefore that old communications cannot be reused in replay attacks. Nonces can also be useful as initialization vectors and in cryptographic hash functions. == Definition == A nonce is an arbitrary number used only once in a cryptographic communication, in the spirit of a nonce word. They are often random or pseudo-random numbers. Many nonces also include a timestamp to ensure exact timeliness, though this requires clock synchronisation between organisations. The addition of a client nonce ("cnonce") helps to improve the security in some ways as implemented in digest access authentication. To ensure that a nonce is used only once, it should be time-variant (including a suitably fine-grained timestamp in its value), or generated with enough random bits to ensure an insignificantly low chance of repeating a previously generated value. Some authors define pseudo-randomness (or unpredictability) as a requirement for a nonce. Nonce is a word dating back to Middle English for something only used once or temporarily (often with the construction "for the nonce"). It descends from the construction "then anes" ("the one [purpose]"). A false etymology claiming it to stand for "number used once" or similar is incorrect. == Usage == === Authentication === Authentication protocols may use nonces to ensure that old communications cannot be reused in replay attacks. For instance, nonces are used in HTTP digest access authentication to calculate an MD5 digest of the password. The nonces are different each time the 401 authentication challenge response code is presented, thus making replay attacks virtually impossible. The scenario of ordering products over the Internet can provide an example of the usefulness of nonces in replay attacks. An attacker could take the encrypted information and—without needing to decrypt—could continue to send a particular order to the supplier, thereby ordering products over and over again under the same name and purchase information. The nonce is used to give 'originality' to a given message so that if the company receives any other orders from the same person with the same nonce, it will discard those as invalid orders. A nonce may be used to ensure security for a stream cipher. Where the same key is used for more than one message and then a different nonce is used to ensure that the keystream is different for different messages encrypted with that key; often the message number is used. Secret nonce values are used by the Lamport signature scheme as a signer-side secret which can be selectively revealed for comparison to public hashes for signature creation and verification. === Hashing === Nonces are used in proof-of-work systems to vary the input to a cryptographic hash function so as to obtain a hash for a certain input that fulfils certain arbitrary conditions. In doing so, it becomes far more difficult to create a "desirable" hash than to verify it, shifting the burden of work onto one side of a transaction or system. For example, proof of work, using hash functions, was considered as a means to combat email spam by forcing email senders to find a hash value for the email (which included a timestamp to prevent pre-computation of useful hashes for later use) that had an arbitrary number of leading zeroes, by hashing the same input with a large number of values until a "desirable" hash was obtained. Similarly, the Bitcoin blockchain hashing algorithm can be tuned to an arbitrary difficulty by changing the required minimum/maximum value of the hash so that the number of bitcoins awarded for new blocks does not increase linearly with increased network computation power as new users join. This is likewise achieved by forcing Bitcoin miners to add nonce values to the value being hashed to change the hash algorithm output. As cryptographic hash algorithms cannot easily be predicted based on their inputs, this makes the act of blockchain hashing and the possibility of being awarded bitcoins something of a lottery, where the first "miner" to find a nonce that delivers a desirable hash is awarded bitcoins.

    Read more →
  • Backup

    Backup

    In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", whereas the noun and adjective form is "backup". Backups can be used to recover data after its loss from data deletion or corruption, or to recover data from an earlier time. Backups provide a simple form of IT disaster recovery; however not all backup systems are able to reconstitute a computer system or other complex configuration such as a computer cluster, active directory server, or database server. A backup system contains at least one copy of all data considered worth saving. The data storage requirements can be large. An information repository model may be used to provide structure to this storage. There are different types of data storage devices used for copying backups of data that is already in secondary storage onto archive files. There are also different ways these devices can be arranged to provide geographic dispersion, data security, and portability. Data is selected, extracted, and manipulated for storage. The process can include methods for dealing with live data, including open files, as well as compression, encryption, and de-duplication. Additional techniques apply to enterprise client-server backup. Backup schemes may include dry runs that validate the reliability of the data being backed up. There are limitations and human factors involved in any backup scheme. == Storage == A backup strategy requires an information repository, "a secondary storage space for data" that aggregates backups of data "sources". The repository could be as simple as a list of all backup media (DVDs, etc.) and the dates produced, or could include a computerized index, catalog, or relational database. === 3-2-1 Backup Rule === The backup data needs to be stored, requiring a backup rotation scheme, which is a system of backing up data to computer media that limits the number of backups of different dates retained separately, by appropriate re-use of the data storage media by overwriting of backups no longer needed. The scheme determines how and when each piece of removable storage is used for a backup operation and how long it is retained once it has backup data stored on it. The 3-2-1 rule can aid in the backup process. It states that there should be at least 3 copies of the data, stored on 2 different types of storage media, and one copy should be kept offsite, in a remote location (this can include cloud storage). 2 or more different media should be used to eliminate data loss due to similar reasons (for example, optical discs may tolerate being underwater while LTO tapes may not, and SSDs cannot fail due to head crashes or damaged spindle motors since they do not have any moving parts, unlike hard drives). An offsite copy protects against fire, theft of physical media (such as tapes or discs) and natural disasters like floods and earthquakes. Physically protected hard drives are an alternative to an offsite copy, but they have limitations like only being able to resist fire for a limited period of time, so an offsite copy still remains as the ideal choice. Because there is no perfect storage, many backup experts recommend maintaining a second copy on a local physical device, even if the data is also backed up offsite. === Backup methods === ==== Unstructured ==== An unstructured repository may simply be a stack of tapes, DVD-Rs or external HDDs with minimal information about what was backed up and when. This method is the easiest to implement, but unlikely to achieve a high level of recoverability as it lacks automation. ==== Full only/System imaging ==== A repository using this backup method contains complete source data copies taken at one or more specific points in time. Copying system images, this method is frequently used by computer technicians to record known good configurations. However, imaging is generally more useful as a way of deploying a standard configuration to many systems rather than as a tool for making ongoing backups of diverse systems. ==== Incremental ==== An incremental backup stores data changed since a reference point in time. Duplicate copies of unchanged data are not copied. Typically a full backup of all files is made once or at infrequent intervals, serving as the reference point for an incremental repository. Subsequently, a number of incremental backups are made after successive time periods. Restores begin with the last full backup and then apply the incrementals. Some backup systems can create a synthetic full backup from a series of incrementals, thus providing the equivalent of frequently doing a full backup. When done to modify a single archive file, this speeds restores of recent versions of files. ==== Near-CDP ==== Continuous Data Protection (CDP) refers to a backup that instantly saves a copy of every change made to the data. This allows restoration of data to any point in time and is the most comprehensive and advanced data protection. Near-CDP backup applications—often marketed as "CDP"—automatically take incremental backups at a specific interval, for example every 15 minutes, one hour, or 24 hours. They can therefore only allow restores to an interval boundary. Near-CDP backup applications use journaling and are typically based on periodic "snapshots", read-only copies of the data frozen at a particular point in time. Near-CDP (except for Apple Time Machine) intent-logs every change on the host system, often by saving byte or block-level differences rather than file-level differences. This backup method differs from simple disk mirroring in that it enables a roll-back of the log and thus a restoration of old images of data. Intent-logging allows precautions for the consistency of live data, protecting self-consistent files but requiring applications "be quiesced and made ready for backup." Near-CDP is more practicable for ordinary personal backup applications, as opposed to true CDP, which must be run in conjunction with a virtual machine or equivalent and is therefore generally used in enterprise client-server backups. Software may create copies of individual files such as written documents, multimedia projects, or user preferences, to prevent failed write events caused by power outages, operating system crashes, or exhausted disk space, from causing data loss. A common implementation is an appended ".bak" extension to the file name. ==== Reverse incremental ==== A Reverse incremental backup method stores a recent archive file "mirror" of the source data and a series of differences between the "mirror" in its current state and its previous states. A reverse incremental backup method starts with a non-image full backup. After the full backup is performed, the system periodically synchronizes the full backup with the live copy, while storing the data necessary to reconstruct older versions. This can either be done using hard links—as Apple Time Machine does, or using binary diffs. ==== Differential ==== A differential backup saves only the data that has changed since the last full backup. This means a maximum of two backups from the repository are used to restore the data. However, as time from the last full backup (and thus the accumulated changes in data) increases, so does the time to perform the differential backup. Restoring an entire system requires starting from the most recent full backup and then applying just the last differential backup. A differential backup copies files that have been created or changed since the last full backup, regardless of whether any other differential backups have been made since, whereas an incremental backup copies files that have been created or changed since the most recent backup of any type (full or incremental). Changes in files may be detected through a more recent date/time of last modification file attribute, and/or changes in file size. Other variations of incremental backup include multi-level incrementals and block-level incrementals that compare parts of files instead of just entire files. === Storage media === Regardless of the repository model that is used, the data has to be copied onto an archive file data storage medium. The medium used is also referred to as the type of backup destination. ==== Magnetic tape ==== Magnetic tape was for a long time the most commonly used medium for bulk data storage, backup, archiving, and interchange. It was previously a less expensive option, but this is no longer the case for smaller amounts of data. Tape is a sequential access medium, so the rate of continuously writing or reading data can be very fast. While tape media itself has a low cost per space, tape drives are typically dozens of times as expensive as hard disk drives and optical drives. Tape media are generally rotated on a schedule so at least one set is off-site in case something should happe

    Read more →
  • Tokenization (data security)

    Tokenization (data security)

    Tokenization, when applied to data security, is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no intrinsic or exploitable meaning or value. The token is a reference (i.e. identifier) that maps back to the sensitive data through a tokenization system. The mapping from original data to a token uses methods that render tokens infeasible to reverse in the absence of the tokenization system, for example using tokens created from random numbers. A one-way cryptographic function is used to convert the original data into tokens, making it difficult to recreate the original data without obtaining entry to the tokenization system's resources. To deliver such services, the system maintains a vault database of tokens that are connected to the corresponding sensitive data. Protecting the system vault is vital to the system, and improved processes must be put in place to offer database integrity and physical security. The tokenization system must be secured and validated using security best practices applicable to sensitive data protection, secure storage, audit, authentication and authorization. The tokenization system provides data processing applications with the authority and interfaces to request tokens, or detokenize back to sensitive data. The security and risk reduction benefits of tokenization require that the tokenization system is logically isolated and segmented from data processing systems and applications that previously processed or stored sensitive data replaced by tokens. Only the tokenization system can tokenize data to create tokens, or detokenize back to redeem sensitive data under strict security controls. The token generation method must be proven to have the property that there is no feasible means through direct attack, cryptanalysis, side channel analysis, token mapping table exposure or brute force techniques to reverse tokens back to live data. Replacing live data with tokens in systems is intended to minimize exposure of sensitive data to those applications, stores, people and processes, reducing risk of compromise or accidental exposure and unauthorized access to sensitive data. Applications can operate using tokens instead of live data, with the exception of a small number of trusted applications explicitly permitted to detokenize when strictly necessary for an approved business purpose. Tokenization systems may be operated in-house within a secure isolated segment of the data center, or as a service from a secure service provider. Tokenization may be used to safeguard sensitive data involving, for example, bank accounts, financial statements, medical records, criminal records, driver's licenses, loan applications, stock trades, voter registrations, and other types of personally identifiable information (PII). Tokenization is often used in credit card processing. The PCI Council defines tokenization as "a process by which the primary account number (PAN) is replaced with a surrogate value called a token. A PAN may be linked to a reference number through the tokenization process. In this case, the merchant simply has to retain the token and a reliable third party controls the relationship and holds the PAN. The token may be created independently of the PAN, or the PAN can be used as part of the data input to the tokenization technique. The communication between the merchant and the third-party supplier must be secure to prevent an attacker from intercepting to gain the PAN and the token. De-tokenization is the reverse process of redeeming a token for its associated PAN value. The security of an individual token relies predominantly on the infeasibility of determining the original PAN knowing only the surrogate value". The choice of tokenization as an alternative to other techniques such as encryption will depend on varying regulatory requirements, interpretation, and acceptance by respective auditing or assessment entities. This is in addition to any technical, architectural or operational constraint that tokenization imposes in practical use. == Concepts and origins == The concept of tokenization, as adopted by the industry today, has existed since the first currency systems emerged centuries ago as a means to reduce risk in handling high value financial instruments by replacing them with surrogate equivalents. In the physical world, coin tokens have a long history of use replacing the financial instrument of minted coins and banknotes. In more recent history, subway tokens and casino chips found adoption for their respective systems to replace physical currency and cash handling risks such as theft. Exonumia and scrip are terms synonymous with such tokens. In the digital world, similar substitution techniques have been used since the 1970s as a means to isolate real data elements from exposure to other data systems. In databases for example, surrogate key values have been used since 1976 to isolate data associated with the internal mechanisms of databases and their external equivalents for a variety of uses in data processing. More recently, these concepts have been extended to consider this isolation tactic to provide a security mechanism for the purposes of data protection. In the payment card industry, tokenization is one means of protecting sensitive cardholder data in order to comply with industry standards and government regulations. Tokenization was applied to payment card data by Shift4 Corporation and released to the public during an industry Security Summit in Las Vegas, Nevada in 2005. The technology is meant to prevent the theft of the credit card information in storage. Shift4 defines tokenization as: "The concept of using a non-decryptable piece of data to represent, by reference, sensitive or secret data. In payment card industry (PCI) context, tokens are used to reference cardholder data that is managed in a tokenization system, application or off-site secure facility." To protect data over its full lifecycle, tokenization is often combined with end-to-end encryption to secure data in transit to the tokenization system or service, with a token replacing the original data on return. For example, to avoid the risks of malware stealing data from low-trust systems such as point of sale (POS) systems, as in the Target breach of 2013, cardholder data encryption must take place prior to card data entering the POS and not after. Encryption takes place within the confines of a security hardened and validated card reading device and data remains encrypted until received by the processing host, an approach pioneered by Heartland Payment Systems as a means to secure payment data from advanced threats, now widely adopted by industry payment processing companies and technology companies. The PCI Council has also specified end-to-end encryption (certified point-to-point encryption—P2PE) for various service implementations in various PCI Council Point-to-point Encryption documents. == The tokenization process == The process of tokenization consists of the following steps: The application sends the tokenization data and authentication information to the tokenization system. It is stopped if authentication fails and the data is delivered to an event management system. As a result, administrators can discover problems and effectively manage the system. The system moves on to the next phase if authentication is successful. Using one-way cryptographic or random generation techniques, a token is generated and kept in a highly secure data vault. The new token is provided to the application for further use, replacing the sensitive data for processing and storage. Tokenization systems share several components according to established standards. Token generation is the process of producing a token using any means, such as one-way nonreversible cryptographic functions (e.g., a hash function with a strong, secret salt) or assignment via a randomly generated number. Random number generator (RNG) techniques are often the best choice for generating token values. Token mapping – this is the process of assigning the created token value to its original value. To enable permitted look-ups of the original value using the token as the index, a secure cross-reference database must be constructed. Token data store – this is a central repository for the token mapping process that holds the original sensitive values and their related token values. Sensitive data and token values must be securely kept in an encrypted format. Management of cryptographic keys. Strong key management procedures are required for sensitive data encryption on token data stores. == Difference from encryption == Tokenization and "classic" encryption effectively protect data if implemented properly, and a computer security system may use both. While similar in certain regards, tokenization and classic encryption differ in a few key aspects. Both are cryptographic data security methods and the

    Read more →
  • VACUUM

    VACUUM

    VACUUM is a set of normative guidance principles for achieving training and test dataset quality for structured datasets in data science and machine learning. The garbage-in, garbage out principle motivates a solution to the problem of data quality but does not offer a specific solution. Unlike the majority of the ad-hoc data quality assessment metrics often used by practitioners VACUUM specifies qualitative principles for data quality management and serves as a basis for defining more detailed quantitative metrics of data quality. VACUUM is an acronym that stands for: valid accurate consistent uniform unified model

    Read more →
  • Social recruiting

    Social recruiting

    Social recruiting (social hiring or social media recruitment) is recruiting candidates by using social platforms as talent databases or for advertising. Social recruiting uses social media profiles, blogs, and other Internet sites to find information on candidates. It also uses social media to advertise jobs either through HR vendors or through crowdsourcing where job seekers and others share job openings within their online social networks. Social recruiting's effectiveness and return on investment have been difficult to determine, since applicants do not usually apply through the social channels which first attracted them. In May 2013, Maximum Employment Marketing Group released the Social Recruitment Monitor, which ranks the reach, engagement, and interactivity of employers' social recruiting efforts around the world. == Social recruitment software == The social recruitment software market (a form of e-recruitment) is often included in the wider talent management software sector. Bersin & Associates valued the wider talent management market at over $2bn in 2007. Social recruitment increasingly sits at an intersection of a number of fast-moving areas including social networking, recruitment and now cloud computing. Additionally, mobile recruiting has become another hot topic, especially with the rise in tablet and smartphone usage. In 2012, there was a rise of tech companies using social recruiting applications to find and screen applicants. As more companies saw value in filling jobs by putting them on the social platforms where millions of people spend at least 37 minutes daily, there developed a much larger focus on social recruiting among the talent acquisition community. By mid-2013, many major enterprise companies such as Pepsi, Gap, AIG, and Oracle had begun effectively utilizing social recruiting software, making it clear that large corporations were open to automating or streamlining (and ultimately investing in) their social recruiting processes.

    Read more →
  • Comparison of OLAP servers

    Comparison of OLAP servers

    The following tables compare general and technical information for a number of online analytical processing (OLAP) servers. Please see the individual products articles for further information. == General information == == Data storage modes == == APIs and query languages == APIs and query languages OLAP servers support. == OLAP distinctive features == A list of OLAP features that are not supported by all vendors. All vendors support features such as parent-child, multilevel hierarchy, drilldown. == System limits == == Security == == Operating systems == The OLAP servers can run on the following operating systems: Note (1):The server availability depends on Java Virtual Machine not on the operating system == Support information ==

    Read more →