Disclosed are an artificial intelligence-based voice detection server and method. The voice detection server according to an embodiment of the present invention comprises: a detection model training unit for generating an artificial intelligence-based voice detection model by performing pre-training using a training data set composed of one or more different formats, including original voice data and modulated data of the original voice data, wherein the voice detection model, which extracts speaker-unique information from an input value and determines authenticity, is generated by performing pre-training through first model training and second model training using the pre-processed training data set; and a detection processing unit for determining whether input voice data is modulated, on the basis of a modulation probability of the voice data, by using the voice detection model.
G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
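As a rough illustration of the detection processing unit's decision rule, the sketch below scores an utterance with a binary classifier over mel features and thresholds the resulting modulation probability. The network shape, feature settings, and the 0.5 threshold are illustrative assumptions, not the disclosed architecture.

    import torch
    import torch.nn as nn

    class SpoofDetector(nn.Module):
        """Toy stand-in for the trained voice detection model."""
        def __init__(self, n_mels=80, hidden=128):
            super().__init__()
            self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, mel):                      # mel: (batch, frames, n_mels)
            _, h = self.encoder(mel)                 # final hidden state
            return torch.sigmoid(self.head(h[-1]))  # modulation probability

    detector = SpoofDetector()
    mel = torch.randn(1, 200, 80)                    # stand-in pre-processed audio
    is_modulated = detector(mel).item() > 0.5        # assumed decision threshold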
A video recording system for compositing includes a first monitor which is positioned in a gaze area of a user and outputs a live video of the user together with a basic posture still image superimposed on the live video, a recording apparatus for recording the user, and an image controller which transmits the basic posture still image and the live video of the user to the first monitor on the basis of a user video transmitted from the recording apparatus, and which changes the basic posture still image transmitted to the first monitor when an image conversion condition is met while the live video of the user is being recorded.
In a method for providing a speech video performed by a computing device, a standby state video in a video file format, in which a person in the video is in a standby state, is reproduced; during the reproduction of the standby state video, a plurality of speech state images in which the person in the video is in a speech state, together with a speech voice, are generated based on a source of speech contents; the reproduction of the standby state video is stopped and a back motion video in a video file format, for returning to a reference frame of the standby state video, is reproduced; and a synthesized speech video is generated by synthesizing the plurality of speech state images and the speech voice with the standby state video from the reference frame.
In a method for providing a speech video performed by a computing device according to one embodiment, first sections of a plurality of standby state videos are sequentially played back, wherein each standby state video includes a first section in which a person in the video is in a standby state and a second section for image interpolation between the last frame of the first section and a reference frame; a plurality of speech state images in which the person in the video is in a speech state, together with a speech voice, are generated based on a source of speech contents; when the generating of the plurality of speech state images and the speech voice is completed, the second section of the standby state video being played back at the time of completion is played back; and a synthesized speech video is generated by synthesizing the plurality of speech state images and the speech voice with at least some of the plurality of standby state videos.
Disclosed are an apparatus and a method for text analysis and speech synthesis. The apparatus for text analysis and speech synthesis according to an embodiment includes: a text analysis module for generating a plurality of text chunks by separating input text into utterance units; a text encoding module for generating a plurality of text feature chunks by encoding the plurality of generated text chunks; and a speech synthesis module for generating speech signals on the basis of the plurality of text feature chunks.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
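A minimal sketch of the chunked pipeline described above, with stand-in components: the sentence-level splitting rule, the codepoint "encoder", and the placeholder synthesizer are all assumptions, used only to show how text chunks flow through encoding into per-chunk speech signals.

    import re

    def split_into_chunks(text):
        """Text analysis module: separate input text into utterance units."""
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def encode_chunk(chunk):
        """Text encoding module (stand-in): one feature chunk per text chunk."""
        return [ord(c) for c in chunk]

    def synthesize(feature_chunks):
        """Speech synthesis module (stand-in): one signal per feature chunk."""
        return [f"<speech for {len(f)} symbols>" for f in feature_chunks]

    chunks = split_into_chunks("Hello there. How are you today?")
    speech_signals = synthesize([encode_chunk(c) for c in chunks])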
Disclosed are a system and a method for providing a real-time interpretation service using AI humans. The system for providing a real-time interpretation service using AI humans, according to one aspect, may comprise: a first terminal which is arranged on a first user side and receives first voice data in a first language uttered by the first user; a service server which recognizes the first voice data and thus generates 1-1 text data in the first language, translates the 1-1 text data into a second language so as to generate 1-2 text data in the second language, and generates a first utterance video in which the 1-2 text data is uttered in the second language by an AI human; and a second terminal which is arranged on a second user side and plays the first utterance video.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
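The abstract above amounts to a recognize-translate-render pipeline per speaking turn. The sketch below wires that flow with stand-in callables; the function names and the trivial lambdas are assumptions, not the service server's actual interfaces.

    def interpret_turn(first_voice_data, recognize, translate, render_ai_human):
        """One turn: first-terminal audio in, utterance video for the second terminal out."""
        text_1_1 = recognize(first_voice_data)    # 1-1 text data, first language
        text_1_2 = translate(text_1_1)            # 1-2 text data, second language
        return render_ai_human(text_1_2)          # first utterance video

    video = interpret_turn(
        b"...pcm samples...",
        recognize=lambda audio: "hello",
        translate=lambda text: "bonjour",
        render_ai_human=lambda text: f"<video: AI human utters '{text}'>",
    )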
A neural network-based key point training apparatus according to an embodiment includes a key point model trained to extract key points from an input image, and an image reconstruction model trained to reconstruct the input image with the key points output by the key point model as the input. The optimized parameters of the key point model and the image reconstruction model can be calculated.
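One way to read this arrangement is as a bottleneck: the reconstruction loss flows back through both models, so the key point model is pushed to keep whatever the image reconstruction model needs. A minimal sketch under assumed shapes (64x64 grayscale, 10 key points, toy linear layers):

    import torch
    import torch.nn as nn

    K = 10                                            # assumed number of key points
    keypoint_model = nn.Sequential(                   # image -> K (x, y) pairs
        nn.Flatten(), nn.Linear(64 * 64, 2 * K))
    reconstruction_model = nn.Linear(2 * K, 64 * 64)  # key points -> image

    opt = torch.optim.Adam(
        list(keypoint_model.parameters()) + list(reconstruction_model.parameters()))

    image = torch.rand(8, 1, 64, 64)
    keypoints = keypoint_model(image)                 # (8, 2K)
    recon = reconstruction_model(keypoints).view(-1, 1, 64, 64)
    loss = nn.functional.mse_loss(recon, image)       # one loss, both models optimized
    loss.backward()
    opt.step()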
An apparatus for generating a speech synthesis image according to a disclosed embodiment is an apparatus for generating a speech synthesis image based on machine learning, the apparatus including a first global geometric transformation predictor configured to be trained to receive each of a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image based on the source image and the target image, a local feature tensor predictor configured to be trained to predict a feature tensor for a local motion of the person based on preset input data, and an image generator configured to be trained to reconstruct the target image based on the global geometric transformation, the source image, and the feature tensor for the local motion.
An apparatus for synthesizing speech according to an embodiment is a computing apparatus that includes one or more processors and a memory storing one or more programs executed by the one or more processors. The apparatus for synthesizing speech includes a pre-processing module that marks a preset classification symbol on each of unit texts input; and a speech synthesis module that receives each unit text marked with the classification symbol and synthesizes speech uttering the unit text based on the input unit text.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G06F 40/12 - Use of codes for handling textual entities
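A sketch of the pre-processing step only; the symbol string and the stand-in synthesis callable are assumptions, shown to make concrete what "marking a preset classification symbol on each unit text" could look like.

    CLASSIFICATION_SYMBOL = "|"                       # assumed preset symbol

    def preprocess(unit_texts):
        """Pre-processing module: mark each input unit text."""
        return [f"{CLASSIFICATION_SYMBOL}{t}{CLASSIFICATION_SYMBOL}" for t in unit_texts]

    def synthesize_all(marked_units, tts):
        """Speech synthesis module (stand-in): one utterance per marked unit."""
        return [tts(u) for u in marked_units]

    audio = synthesize_all(preprocess(["Hello.", "Nice to meet you."]),
                           tts=lambda u: f"<speech uttering {u!r}>")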
12.
APPARATUS AND METHOD FOR GENERATING SPEECH SYNTHESIS IMAGE
An apparatus for generating a speech synthesis image according to a disclosed embodiment is an apparatus for generating a speech synthesis image based on machine learning, the apparatus including a first global geometric transformation predictor configured to be trained to receive each of a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image based on the source image and the target image, a local feature tensor predictor configured to be trained to predict a feature tensor for a local motion of the person based on input target image-related information, and an image generator configured to be trained to reconstruct the target image based on the global geometric transformation, the source image, and the feature tensor for the local motion.
Disclosed are a golf round assistance device and method using an AI caddie. The golf round assistance device, according to an embodiment, comprises: an AI caddie selection unit that selects one AI caddie model from among a plurality of AI caddie models; a golf course information collection unit that collects, on the basis of location information about a user, information about a golf course where the user is located; a play record collection unit that collects play record information related to the golf course where the user is located; a strategy information generation unit that generates, on the basis of location information, the information about the golf course, and the play record information, strategy information for a course where the user plays a round; and an information provision unit that provides at least one of the information about the golf course and the strategy information to the user through the selected AI caddie model.
In a method of providing a speech video according to an embodiment, a standby state video in which a person in the video is in a standby state is reproduced; a speech state video in which the person in the video is in a speech state is generated based on a source of speech content; the standby state video being reproduced is returned to a reference frame of the standby state video based on a back motion image; and a synthesized speech video is generated by synthesizing the returned reference frame and the speech state video.
Provided are an apparatus and method for providing a speech video. A method, performed by a computing apparatus, for providing a speech video according to an embodiment comprises the steps of: sequentially playing back a first section of a plurality of standby videos each including the first section and a second section, wherein the first section is a section in which a person in the videos is standing by, and the second section is for interpolating images between the final frame of the first section and a reference frame; generating a plurality of speaking images, in which the person in the videos is speaking, and a speaking voice on the basis of the source of speech content; when the generation of the plurality of speaking images and the speaking voice is completed, playing back the second section of the standby video that was being played back at the time of completion; and generating a synthesized speech video in at least a portion of the plurality of standby videos by synthesizing the plurality of speaking images and the speaking voice.
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
G10L 21/10 - Transforming into visible information
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
An apparatus and a method for providing a speech video are disclosed. A speech video providing method performed by a computing device according to an embodiment comprises the steps of: reproducing a standby state video having a video file format, in which a person in a video is in a standby state; during the reproduction of the standby state video, generating, on the basis of a source of speech contents, a spoken voice and multiple speaking state images in which the person in the video is in a speaking state; stopping the reproduction of the standby state video, and reproducing a back motion video having a video file format, which is for a return to a reference frame of the standby state video; and synthesizing the multiple speaking state images and the spoken voice with the standby state video from the reference frame, so as to generate a synthesized speech video.
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
G10L 21/10 - Transforming into visible information
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
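Across the two abstracts above, the control flow is: loop the standby first sections while speech generation runs, then play the second (interpolation) section of whichever video is up, and splice the synthesized speech from the reference frame. A sketch of that flow, with every component a stand-in interface assumed for illustration:

    from itertools import cycle

    def provide_speech_video(standby_videos, generation, splice):
        """standby_videos: list of (first_section, second_section) frame lists;
        generation: object with done() and result(); splice: frames from the
        reference frame onward. All three are assumed interfaces."""
        for first_section, second_section in cycle(standby_videos):
            yield from first_section               # person standing by
            if generation.done():                  # speech images + voice ready
                yield from second_section          # interpolate to reference frame
                break
        yield from splice(generation.result())     # synthesized speech video

    class _Gen:                                    # trivial stand-in generation task
        def __init__(self): self.polls = 0
        def done(self): self.polls += 1; return self.polls > 2
        def result(self): return ["speech frames"]

    frames = list(provide_speech_video(
        standby_videos=[(["s1a", "s1b"], ["back1"]), (["s2a"], ["back2"])],
        generation=_Gen(),
        splice=lambda speech: speech))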
17.
Learning method for generating lip sync image based on machine learning and lip sync image generation device for performing same
A lip sync image generation device based on machine learning according to a disclosed embodiment includes an image synthesis model, which is an artificial neural network model, and which uses a person background image and an utterance audio signal as an input to generate a lip sync image, and a lip sync discrimination model, which is an artificial neural network model, and which discriminates the degree of match between the lip sync image generated by the image synthesis model and the utterance audio signal input to the image synthesis model.
G10L 21/10 - Transforming into visible information
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
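A toy rendering of the two-model setup: the synthesis model produces a lip sync image from a person background input and an audio input, and the discrimination model scores how well image and audio match, giving the generator a sync loss to descend. Every dimension here is an assumption; real models would be convolutional.

    import torch
    import torch.nn as nn

    class ImageSynthesisModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.img, self.aud = nn.Linear(64, 32), nn.Linear(16, 32)
            self.out = nn.Linear(64, 64)
        def forward(self, background, audio):
            z = torch.cat([self.img(background), self.aud(audio)], dim=-1)
            return self.out(z)                        # generated lip sync image

    class LipSyncDiscriminationModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.score = nn.Linear(64 + 16, 1)
        def forward(self, image, audio):              # degree of match in [0, 1]
            return torch.sigmoid(self.score(torch.cat([image, audio], dim=-1)))

    gen, disc = ImageSynthesisModel(), LipSyncDiscriminationModel()
    bg, aud = torch.randn(8, 64), torch.randn(8, 16)
    sync_loss = nn.functional.binary_cross_entropy(
        disc(gen(bg, aud), aud), torch.ones(8, 1))    # push generator toward sync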
18.
METHOD FOR GENERATING DATA USING MACHINE LEARNING AND COMPUTING DEVICE FOR EXECUTING THE SAME
A computing device according to a disclosed embodiment is provided with one or more processors and a memory storing one or more programs executed by the one or more processors. The computing device includes a machine learning model, which is trained to perform, as a main task, a task of receiving data in which a part of original data is damaged or removed and restoring and outputting the damaged or removed data part, and is trained to perform, as an auxiliary task, a task of receiving original data and reconstructing and outputting the received original data.
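In loss terms, the arrangement reads as a denoising objective plus an identity-reconstruction objective on the same model. A minimal sketch, where the corruption scheme, the toy model, and the 0.5 auxiliary weight are assumptions:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))
    opt = torch.optim.Adam(model.parameters())

    original = torch.rand(16, 128)
    keep = (torch.rand_like(original) > 0.3).float()
    damaged = original * keep                          # part removed (zeroed)

    main_loss = nn.functional.mse_loss(model(damaged), original)   # restore
    aux_loss = nn.functional.mse_loss(model(original), original)   # reconstruct
    (main_loss + 0.5 * aux_loss).backward()            # assumed task weighting
    opt.step()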
A computing device according to an embodiment disclosed includes one or more processors and a memory storing one or more programs executed by the one or more processors, and a standby state image generating module configured to generate a standby state image in which a person is in a standby state, an interpolation image generating module configured to generate an interpolation image set for interpolation between the standby state image and a pre-stored speech preparation image, and an image playback module configured to generate a connection image for connecting the standby state image and a speech state image based on the interpolation image set when the speech state image is generated.
A speech image providing method according to an embodiment includes generating a standby state image in which a person is in a standby state; generating, from the standby state image, a plurality of back-motion images at a preset frame interval for image interpolation with a preset reference frame of the standby state image; generating a speech state image in which the person is in a speech state based on a source of speech content; returning the standby state image being played to the reference frame, based on the plurality of back-motion images, at the point of time when the generating of the speech state image is completed; and generating a synthetic speech image by combining frames of the speech state image from the reference frame.
Disclosed are an apparatus and method for converting a grapheme to a phoneme. An apparatus for converting a grapheme to a phoneme according to one embodiment comprises: a tokenization unit for dividing an input string into tokens; and a phoneme determination unit for determining the phoneme of each token on the basis of the token and tokens directly adjacent to the left and right thereof.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 15/18 - Speech classification or search using natural language modelling
G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
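The determination rule uses a three-token window: the token itself plus its immediate left and right neighbours. A toy sketch with character tokens and a hand-written rule table (the rule format and the fallback are assumptions):

    def g2p(text, rules):
        """Phoneme per token, decided from (left, token, right)."""
        tokens = [None] + list(text) + [None]         # pad the context window
        return [rules.get((tokens[i - 1], tokens[i], tokens[i + 1]),
                          tokens[i])                  # fall back to the token itself
                for i in range(1, len(tokens) - 1)]

    rules = {(None, "c", "e"): "s",                   # 'c' before 'e' -> /s/
             (None, "c", "a"): "k"}                   # 'c' before 'a' -> /k/
    assert g2p("ce", rules) == ["s", "e"]
    assert g2p("ca", rules) == ["k", "a"]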
22.
APPARATUS AND METHOD FOR GENERATING 3D LIP SYNC VIDEO
An apparatus and a method for generating a 3D lip sync video are disclosed. The apparatus for generating a 3D lip sync video, according to one embodiment, comprises: a voice conversion unit for generating speech audio on the basis of input text; and a 3D lip sync video generation model for generating a 3D lip sync video in which a 3D model of a person speaks, on the basis of the generated speech audio, a 2D video obtained by capturing an image of a speaking person, and 3D data acquired from the image of the speaking person.
G10L 21/10 - Transforming into visible information
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Disclosed are a video recording system and method for compositing. The video recording system for compositing according to one embodiment of the present invention comprises: a first monitor, positioned in a user's gaze area, for outputting the user's live video and a basic posture still image superimposed on the user's live video; a recording apparatus for recording the user; and an image control unit for transmitting the basic posture still image and the user's live video to the first monitor on the basis of a user video transmitted from the recording apparatus, and changing the basic posture still image transmitted to the first monitor when an image conversion condition is met while recording the user's live video.
A computing device according to an embodiment is provided with one or more processors and a memory storing one or more programs executed by the one or more processors. The computing device includes a standby state video generating module that generates a standby state video in which a person in a video is in a standby state, a speech state video generating module that generates a speech state video in which the person in the video is in a speech state based on a source of speech content, and a video reproducing module that reproduces the standby state video and generates a synthesized speech video by synthesizing the standby state video being reproduced and the speech state video.
Disclosed are an apparatus and method for generating a synthesized speech image. The apparatus for generating a synthesized speech image, according to an embodiment, is a machine learning-based apparatus for generating a synthesized speech image, comprising: a first global geometric transformation prediction unit that receives an input of each of a source image and a target image, which include the same person, and is trained to predict global geometric transformation for global movement of the person between the source image and the target image on the basis of the source image and the target image; a local feature tensor prediction unit that is trained to predict a feature tensor for local movement of the person on the basis of a preconfigured input image; and an image generation unit that is trained to reconstruct the target image on the basis of the global geometric transformation, the source image, and the feature tensor for the local movement.
An apparatus and a method for generating a speech synthesis image are disclosed. An apparatus for generating a speech synthesis image according to an embodiment relates to an apparatus for generating a speech synthesis image on the basis of machine learning, and comprises: a first global geometric transformation prediction unit for receiving an input of each of a source image and a target image, in which the same person is included, and trained to predict, on the basis of the source image and the target image, a global geometric transformation for global movement of the person between the source image and the target image; a local feature tensor prediction unit trained to predict a feature tensor for a local movement of the person, on the basis of information relating to the input target image; and an image generation unit trained to reconstruct the target image on the basis of the global geometric transformation, the source image, and the feature tensor for the local movement.
Disclosed are an apparatus and method for generating a synthesized speech image. The apparatus for generating a synthesized speech image, according to an embodiment, is a machine learning-based apparatus for generating a synthesized speech image, comprising: a first global geometric transformation prediction unit that receives an input of each of a source image and a target image, which include the same person, and is trained to predict global geometric transformation for global movement of the person between the source image and the target image on the basis of the source image and the target image; a local geometric transformation prediction unit that is trained to predict local geometric transformation for local movement of the person between the source image and the target image on the basis of preconfigured input data; a geometric transformation combination unit that combines the global geometric transformation and the local geometric transformation so as to calculate overall movement geometric transformation for overall movement of the person; and an image generation unit that is trained to reconstruct the target image on the basis of the source image and the overall movement geometric transformation.
A device and a method for generating a synthesized speech image are disclosed. The device for generating a synthesized speech image according to an embodiment is a device for generating a synthesized speech image on the basis of machine learning, the device comprising: a first global geometric transformation prediction unit which receives each of a source image and a target image including the same person, and is trained to predict a global geometric transformation for global movement of the person between the source image and the target image on the basis of the source image and the target image; a local geometric transformation prediction unit which is trained to predict a local geometric transformation for local movement of the person between the source image and the target image on the basis of preconfigured input data; a geometric transformation combination unit which combines the global geometric transformation and the local geometric transformation so as to calculate an overall movement geometric transformation for overall movement of the person; an optical flow prediction unit which is trained to calculate an optical flow between the source image and the target image on the basis of the source image and the overall movement geometric transformation; and an image generation unit which is trained to reconstruct the target image on the basis of the source image and the optical flow.
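Concretely, combining a global geometric transformation with a local one and reconstructing the target can be expressed as warping the source through a sampling grid: the global affine fixes the coarse pose, and a per-pixel flow adds the local movement. A sketch using torch grid sampling; the shapes, the identity affine, and the zero flow are placeholders, not the trained predictors.

    import torch
    import torch.nn.functional as F

    def warp_source(source, global_affine, local_flow):
        """source: (B, C, H, W); global_affine: (B, 2, 3) affine matrices;
        local_flow: (B, H, W, 2) per-pixel offsets in grid coordinates."""
        grid = F.affine_grid(global_affine, source.shape, align_corners=False)
        grid = grid + local_flow               # overall movement = global + local
        return F.grid_sample(source, grid, align_corners=False)

    src = torch.rand(1, 3, 64, 64)
    identity = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
    target_estimate = warp_source(src, identity, torch.zeros(1, 64, 64, 2))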
An image synthesis device according to a disclosed embodiment has one or more processors and a memory which stores one or more programs executed by the one or more processors. The image synthesis device includes a first artificial neural network model provided to learn each of a first task of using a damaged image as an input to output a restored image and a second task of using an original image as an input to output a reconstructed image, and a second artificial neural network model trained to use the reconstructed image output from the first artificial neural network model as an input and improve the image quality of the reconstructed image.
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 15/16 - Speech classification or search using artificial neural networks
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialog
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
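At inference the two models form a simple cascade: the first restores or reconstructs, the second sharpens its output. A sketch with assumed toy layers standing in for the two trained networks:

    import torch
    import torch.nn as nn

    first_model = nn.Sequential(                  # learned restore/reconstruct tasks
        nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))
    second_model = nn.Linear(128, 128)            # learned quality refinement

    damaged = torch.rand(4, 128)
    restored = first_model(damaged)               # first model's output
    enhanced = second_model(restored)             # improved-quality result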
30.
Apparatus and method for generating lip sync image
An apparatus for generating a lip sync image according to a disclosed embodiment has one or more processors and a memory which stores one or more programs executed by the one or more processors. The apparatus includes a first artificial neural network model configured to generate an utterance synthesis image by using a person background image and an utterance audio signal corresponding to the person background image as an input, and generate a silence synthesis image by using only the person background image as an input, and a second artificial neural network model configured to output classification values for a preset utterance maintenance image and the silence synthesis image by using, as inputs, the preset utterance maintenance image and the silence synthesis image from the first artificial neural network model.
An image synthesis device according to a disclosed embodiment has one or more processors and a memory which stores one or more programs executed by the one or more processors. The image synthesis device includes a first artificial neural network provided to learn each of a first task of using a damaged image as an input to output a restored image and a second task of using an original image as an input to output a reconstructed image, and a second artificial neural network connected to an output layer of the first artificial neural network, and trained to use the reconstructed image output from the first artificial neural network as an input and improve the image quality of the reconstructed image.
G06V 10/77 - Processing image or video features in feature spaces; Arrangements for image or video recognition or understanding using pattern recognition or machine learning using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
G10L 15/16 - Speech classification or search using artificial neural networks
G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialog
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
32.
Apparatus and method for generating lip sync image
An apparatus for generating a lip sync image according to a disclosed embodiment has one or more processors and a memory which stores one or more programs executed by the one or more processors. The apparatus includes a first artificial neural network model configured to generate an utterance match synthesis image by using a person background image and an utterance match audio signal corresponding to the person background image as an input, and generate an utterance mismatch synthesis image by using the person background image and an utterance mismatch audio signal not corresponding to the person background image as an input, and a second artificial neural network model configured to output classification values for an input pair in which an image and a voice match and an input pair in which an image and a voice do not match, by using the input pairs as an input.
A computing device according to an embodiment includes one or more processors, a memory storing one or more programs executed by the one or more processors, a standby state image generating module configured to generate a standby state image in which a person is in a standby state, and to generate a back-motion image set including a plurality of back-motion images at a preset frame interval from the standby state image for image interpolation with a preset reference frame of the standby state image, a speech state image generating module configured to generate a speech state image in which the person is in a speech state based on a source of speech content, and an image playback module configured to generate a synthetic speech image by combining the standby state image and the speech state image while playing the standby state image.
Disclosed are a method for providing a speech video, and a computing device for executing same. A computing device according to an embodiment disclosed herein comprises: one or more processors; and memory in which one or more programs executed by the one or more processors are stored. The computing device includes: a standby state video generation module for generating a standby state video in which a person in the video is in a standby state, and generating a back motion image set including a plurality of back motion images at preset frame intervals of a standby state video in order to perform image interpolation between preset reference frames of the standby state video; a speaking state video generation module for generating, on the basis of a source of speech content, a speaking state video in which the person in the video is in a speaking state; and a video playback module for generating a synthesized speech video by synthesizing the standby state video and the speaking state video while playing back the standby state video.
G10L 21/10 - Transforming into visible information
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
35.
METHOD FOR PROVIDING UTTERANCE IMAGE AND COMPUTING DEVICE FOR PERFORMING SAME
Disclosed are a method for providing an utterance image and a computing device for performing same. The computing device according to a disclosed embodiment relates to a computing device comprising one or more processors and a memory for storing one or more programs executed by the one or more processors, and the computing device comprises: a standby state image generation module for generating a standby state image in which a person in an image is in a standby state; an interpolation image generation module for generating an interpolation image set for interpolation between the standby state image and a pre-stored utterance preparation image; and an image playback module that, when an utterance state image is generated, generates a connection image for connecting the standby state image and the utterance state image on the basis of the interpolation image set.
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
Disclosed are a method for providing a speech video and a computing device for executing the method. The method for providing a speech video according to one embodiment is a method performed in a computing device having one or more processors and a memory for storing one or more programs executed by means of the one or more processors, the method comprising the steps of: generating a standby state video in which a character in the video is in a standby state; generating a plurality of back motion images at a predetermined frame interval in the standby state video, for image interpolation with a predetermined reference frame of the standby state video; generating a speaking state video in which the character in the video is in a speaking state, on the basis of a source of speech contents; returning the standby state video being played to the reference frame on the basis of the plurality of back motion images of the standby state video, at the point in time at which the speaking state video has been generated; and generating a synthesized speaking video by synthesizing the reference frame with the frames of the speaking state video.
A device which generates a speech moving image includes a first encoder, a second encoder, a combination unit, and an image reconstruction unit. The first encoder receives a person background image, which is a video part of the speech moving image of a person and in which a portion related to speech of the person is covered with a mask, extracts an image feature vector from the person background image, and compresses the extracted image feature vector. The second encoder receives a speech audio signal that is an audio part of the speech moving image, extracts a voice feature vector from the speech audio signal, and compresses the extracted voice feature vector. The combination unit generates a combination vector of the compressed image feature vector and the compressed voice feature vector. The image reconstruction unit reconstructs the speech moving image of the person with the combination vector as an input.
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 21/055 - Time compression or expansion for synchronising with other signals, e.g. video signals
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
38.
Method and device for generating speech video using audio signal
A device according to an embodiment has one or more processors and a memory storing one or more programs executable by the one or more processors. The device includes a first encoder configured to receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image, a second encoder configured to receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal, a combiner configured to generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder, and a decoder configured to reconstruct the speech video of the person using the combined vector as an input.
G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
G10L 21/10 - Transforming into visible information
H04N 21/2368 - Multiplexing of audio and video streams
H04N 21/439 - Processing of audio elementary streams
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
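The device in this family is essentially two encoders feeding one decoder. A compact sketch of that layout; all layer sizes, the flattened image handling, and the masking policy (speech-related region zeroed in the background image) are assumptions:

    import torch
    import torch.nn as nn

    class SpeechVideoGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            self.image_encoder = nn.Sequential(           # person background image
                nn.Flatten(), nn.Linear(3 * 64 * 64, 256))
            self.audio_encoder = nn.Linear(80, 256)       # speech audio features
            self.decoder = nn.Sequential(
                nn.Linear(512, 3 * 64 * 64), nn.Sigmoid())

        def forward(self, person_background, speech_audio):
            img = self.image_encoder(person_background)   # image feature vector
            aud = self.audio_encoder(speech_audio)        # voice feature vector
            combined = torch.cat([img, aud], dim=-1)      # combiner
            return self.decoder(combined).view(-1, 3, 64, 64)

    gen = SpeechVideoGenerator()
    frame = gen(torch.rand(2, 3, 64, 64),                 # speech part masked out
                torch.rand(2, 80))                        # per-frame audio feature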
39.
LEARNING METHOD FOR GENERATING LIP-SYNC VIDEO ON BASIS OF MACHINE LEARNING AND LIP-SYNC VIDEO GENERATING DEVICE FOR EXECUTING SAME
Disclosed are a learning method for generating a lip-sync video on the basis of machine learning, and a lip-sync video generating device for executing the method. A lip-sync video generating device based on machine learning according to a disclosed embodiment comprises: a video synthesis model which is an artificial neural network model, and generates a lip-sync video by using a person background video and an utterance audio signal as an input; and a lip-sync determination model which is an artificial neural network model, and determines a degree of accordance between the lip-sync video generated by the video synthesis model and the utterance audio signal input to the video synthesis model.
G10L 21/10 - Transforming into visible information
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
A speech video generation device according to an embodiment includes a first encoder, which receives an input of a person background image that is a video part in a speech video of a predetermined person, and extracts an image feature vector from the person background image, a second encoder, which receives an input of a speech audio signal that is an audio part in the speech video, and extracts a voice feature vector from the speech audio signal, a combining unit, which generates a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder, a first decoder, which reconstructs the speech video of the person using the combined vector as an input, and a second decoder, which predicts a landmark of the speech video using the combined vector as an input.
G06V 20/40 - Scenes; Scene-specific elements in video content
G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
G10L 25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
A speech video generation device according to an embodiment includes a first encoder that receives an input of a first person background image of a predetermined person partially hidden by a first mask, and extracts a first image feature vector from the first person background image, a second encoder, which receives an input of a second person background image of the person partially hidden by a second mask, and extracts a second image feature vector from the second person background image, a third encoder, which receives an input of a speech audio signal of the person, and extracts a voice feature vector from the speech audio signal, a combining unit, which generates a combined vector of the first image feature vector, the second image feature vector, and the voice feature vector, and a decoder, which reconstructs a speech video of the person using the combined vector as an input.
G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
An apparatus for synthesizing speech according to an embodiment is a computing apparatus that includes one or more processors and a memory storing one or more programs executed by the one or more processors. The apparatus for synthesizing speech includes a pre-processing module that marks a preset classification symbol on each of unit texts input; and a speech synthesis module that receives each unit text marked with the classification symbol and synthesizes speech uttering the unit text based on the input unit text.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
43.
METHOD AND DEVICE FOR GENERATING SPEECH VIDEO ON BASIS OF MACHINE LEARNING
A device for generating a speech video may include a first encoder to receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image, a second encoder to receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal, a combiner to generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder, and a decoder to reconstruct the speech video of the person using the combined vector as an input. The person background image input to the first encoder includes a face and an upper body of the person, with a portion related to speech of the person covered with a mask.
A device for generating a speech video according to an embodiment has one or more processors and a memory storing one or more programs executable by the one or more processors, and the device includes a video part generator configured to receive a person background image of a person and generate a video part of a speech video of the person; and an audio part generator configured to receive text, generate an audio part of the speech video of the person, and provide speech-related information occurring during the generation of the audio part to the video part generator.
Disclosed are a method for generating data using machine learning and a computing device for executing same. A computing device including a machine learning model, according to one embodiment disclosed herein comprises: one or more processors; and a memory for storing one or more programs executed by the one or more processors, wherein the machine learning model is trained so as to receive original data that is partially damaged or removed, to perform, as a main task, an operation of restoring and outputting the damaged or removed part of the original data, and to perform, as an auxiliary task, an operation of reconstructing and outputting the received original data.
A learning device for generating an image according to an embodiment disclosed is a computing device including one or more processors and a memory storing one or more programs executed by the one or more processors. The learning device includes a first machine learning model that generates a mask for masking a portion related to speech in a person basic image with the person basic image as an input, and generates a person background image by synthesizing the person basic image and the mask.
An apparatus for preprocessing text according to a disclosed embodiment includes an acquisition unit that acquires text data including a plurality of graphemes, a conversion unit that converts the plurality of graphemes into a plurality of phonemes on the basis of previously set conversion rules, and a generation unit that generates one or more tokens by grouping the plurality of phonemes, by previously set number units, on the basis of the order in which the plurality of graphemes are depicted.
G06F 40/40 - Processing or translation of natural language
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
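A toy sketch of the two steps after acquisition: rule-based grapheme-to-phoneme conversion, then grouping the phonemes into fixed-size tokens in written order. The rule table and the group size are assumptions:

    def to_phonemes(graphemes, rules):
        """Conversion unit: apply preset grapheme-to-phoneme rules."""
        return [rules.get(g, g) for g in graphemes]

    def group_tokens(phonemes, n):
        """Generation unit: tokens of n phonemes, in the order written."""
        return [tuple(phonemes[i:i + n]) for i in range(0, len(phonemes), n)]

    tokens = group_tokens(to_phonemes(list("cocoa"), rules={"c": "k"}), n=2)
    # [('k', 'o'), ('k', 'o'), ('a',)]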
48.
Neural network-based key point training apparatus and method
A neural network-based key point training apparatus according to an embodiment disclosed includes a key point model trained to extract key points from an input image and an image reconstruction model trained to reconstruct the input image with the key points output by the key point model as the input.
A device for generating a speech image according to an embodiment disclosed herein is a speech image generation device including one or more processors and a memory storing one or more programs executed by the one or more processors. The device includes a first machine learning model that extracts an image feature with a speech image of a person as an input to reconstruct the speech image from the extracted image feature and a second machine learning model that predicts the image feature with a speech audio signal of the person as an input.
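The training signal here is feature matching: the second model learns to land on the feature the first model extracts from the real speech image, so that at inference audio alone can drive the first model's reconstruction path. A sketch with assumed toy shapes:

    import torch
    import torch.nn as nn

    extract = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # model 1
    reconstruct = nn.Linear(128, 3 * 64 * 64)                           # model 1
    audio_to_feature = nn.Linear(80, 128)                               # model 2

    speech_image = torch.rand(2, 3, 64, 64)
    speech_audio = torch.rand(2, 80)

    target = extract(speech_image).detach()           # image feature to imitate
    loss = nn.functional.mse_loss(audio_to_feature(speech_audio), target)
    # Inference: reconstruct(audio_to_feature(audio)) yields the speech image.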
Disclosed are an image combining apparatus and method capable of improving image quality. The image combining apparatus according to a disclosed embodiment is an image combining apparatus comprising one or more processors and a memory for storing one or more programs executed by the one or more processors, and comprises: a first artificial neural network model provided to learn each of a first task for outputting a restored image by using a damaged image as an input and a second task for outputting a reconstructed image by using an original image as an input; and a second artificial neural network model trained to improve image quality of the reconstructed image by using the reconstructed image output from the first artificial neural network model as an input.
Disclosed are an image synthesis apparatus and method capable of improving image quality. The image synthesis apparatus according to a disclosed embodiment is an image synthesis apparatus comprising one or more processors and a memory for storing one or more programs executed by the one or more processors, and comprises: a first artificial neural network unit provided to learn each of a first task for outputting a restored image by using a damaged image as an input and a second task for outputting a reconstructed image by using an original image as an input; and a second artificial neural network unit connected to an output layer of the first artificial neural network unit and trained to improve image quality of the reconstructed image by using, as an input, the reconstructed image output from the first artificial neural network unit.
Disclosed are a method and an apparatus for generating a lip-sync video. An apparatus for generating a lip-sync video according to a disclosed embodiment is a lip-sync video generation apparatus comprising one or more processors and a memory for storing one or more programs executed by the one or more processors, and comprises: a first artificial neural network model for generating a synthesized utterance matching video by using, as an input, a person background video and an utterance matching audio signal corresponding to the person background video, and generating a synthesized utterance mismatch video by using, as an input, a person background video and an utterance mismatch audio signal which does not correspond to the person background video; and a second artificial neural network model which uses, as an input, an input pair in which a video and a voice match each other and an input pair in which a video and a voice do not match each other, so as to output a classification value relating thereto.
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
Disclosed are a lip sync video generation apparatus and method. The lip sync video generation apparatus, according to a disclosed embodiment, is a lip sync video generation apparatus comprising one or more processors and memory storing one or more programs executed by the one or more processors, and comprises: a first artificial neural network model which generates a synthesized speech video by using, as an input, a background video of a person and a speech audio signal corresponding to the background video of the person, and generates a synthesized silence video by using, as an input, only the background video of the person; and a second artificial neural network model which outputs classification values for a speech maintenance video and the synthesized silence video by using, as an input, a preset speech maintenance video and the synthesized silence video from the first artificial neural network model.
G10L 21/10 - Transforming into visible information
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
54.
SPEECH IMAGE PROVISION METHOD, AND COMPUTING DEVICE FOR PERFORMING SAME
A speech image provision method, and a computing device for performing same are disclosed. A computing device according to one embodiment of the disclosure relates to a computing device having one or more processors and a memory for storing one or more programs executed by the one or more processors, and comprises: a standby state image generation module for generating a standby state image in which a person in the image is in a standby state; a speech state image generation module for generating, on the basis of the source of speech content, a speech state image in which the person in the image is in a speech state; and an image playback module which plays back the standby state image and generates a synthesized speech image by synthesizing the standby state image being played back with the speech state image.
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
Disclosed are a text-based voice synthesis method and device. A voice synthesis device according to a disclosed embodiment is a computing device comprising one or more processors and a memory for storing one or more programs executed by the one or more processors, and comprises a preprocessing module which marks a predetermined classification sign on each input unit text, and a voice synthesis module which receives each unit text marked with the classification sign, and synthesizes, on the basis of the input unit text, a voice uttering the unit text.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
Disclosed are an apparatus and method for generating a speech video, wherein the apparatus and method also create landmarks along with the speech video. The disclosed apparatus for generating a speech video according to an embodiment is a computing apparatus comprising one or more processors and memory for storing one or more programs executed by the one or more processors, the apparatus comprising: a first encoder which receives an input of a person background image, which is a video part of a speech video of a prescribed person, and extracts an image feature vector from the person background image; a second encoder which receives an input of a speech audio signal, which is an audio part of the speech video, and extracts a voice feature vector from the speech audio signal; a combination unit which combines the image feature vector output from the first encoder and the voice feature vector output from the second encoder, thereby generating a combination vector; a first decoder which receives the combination vector as an input to reconstruct the speech video of the person; and a second decoder which receives the combination vector as an input to predict landmarks of the speech video.
G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
G10L 19/06 - Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
G10L 21/0356 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for synchronising with other signals, e.g. video signals
G10L 21/055 - Time compression or expansion for synchronising with other signals, e.g. video signals
A method and an apparatus for generating a speech video are disclosed. The disclosed apparatus for generating a speech video according to an embodiment corresponds to a computing apparatus having one or more processors and a memory for storing one or more programs executed by the one or more processors, and comprises: a first encoder for receiving a first person background image of a predetermined person partially covered by a first mask and extracting a first image feature vector from the first person background image; a second encoder for receiving a second person background image of the person partially covered by a second mask and extracting a second image feature vector from the second person background image; a third encoder for receiving a speech audio signal of the person and extracting a voice feature vector from the speech audio signal; a combining unit for generating a combined vector by combining the first image feature vector output from the first encoder, the second image feature vector output from the second encoder, and the voice feature vector output from the third encoder; and a decoder for reconstructing a speech video of the person by using the combined vector as an input.
Disclosed are an apparatus and a method for preprocessing text. The apparatus for preprocessing text according to one embodiment comprises: an acquisition unit for acquiring text data comprising a plurality of graphemes; a conversion unit for converting the plurality of graphemes to a plurality of phonemes on the basis of previously set conversion rules; and a generation unit for generating one or more tokens by grouping, by previously set number units, the plurality of phonemes on the basis of the order in which the plurality of graphemes are depicted.
G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Disclosed are a method and an apparatus for generating a speech video. The disclosed speech video generating apparatus according to an embodiment corresponds to a speech video generating apparatus having at least one processor and a memory for storing at least one program executed by the at least one processor, and comprises: a first machine learning model which receives an input of a speech video of a person, extracts a video feature therefrom, and reconstructs the speech video from the extracted video feature; and a second machine learning model which receives an input of a speech audio signal of a person and predicts a video feature therefrom.
G10L 21/10 - Transforming into visible information
G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
G06N 3/04 - Architecture, e.g. interconnection topology
Disclosed are an utterance moving image generation method and apparatus. An utterance moving image generation apparatus according to a disclosed embodiment is a computing device comprising one or more processors and memory for storing one or more programs executed by the one or more processors, and comprises: a first encoder for receiving an input of a person background image which is the video portion of an utterance moving image of a predetermined person and in which a portion of the person related to utterance is covered by a mask, extracting an image feature vector from the person background image, and compressing the extracted image feature vector; a second encoder for receiving an input of an utterance audio signal that is the audio portion of the utterance moving image, extracting a voice feature vector from the utterance audio signal, and compressing the extracted voice feature vector; a combining unit for generating a combined vector by combining the compressed image feature vector output from the first encoder and the compressed voice feature vector output from the second encoder; and an image reconstruction unit for reconstructing the utterance moving image of the person by using the combined vector as an input.
H04N 21/439 - Processing of audio elementary streams
H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronizing decoder's clock; Client middleware
Disclosed are neural network-based key point training apparatus and method. The key point training apparatus according to one embodiment disclosed comprises: a key point model trained to extract key points from an input image; and an image reconstruction model trained to reconstruct the input image with the key points outputted by the key point model as the input.
A learning device and method for image generation are disclosed. A learning device for image generation, according to one disclosed embodiment, is a computing device including one or more processors and a memory for storing one or more programs executed by the one or more processors, and comprises a first machine learning model which uses a basic image of a person as an input to generate a mask for masking a portion of the basic image related to speech, and which combines the basic image of the person and the mask to generate a background image of the person.
A method for providing a natural language conversation, which is implemented by an interactive agent system, may include receiving a natural language input, determining a user intent based on the natural language input, and providing a natural language response corresponding to the natural language input, based on the natural language input and/or the determined user intent, which is associated with execution of a specific task, provision of specific information, and/or a simple statement. The provision of the natural language response includes determining whether a first condition is satisfied, based on whether all of the information required can be obtained from the natural language input without requesting additional information, and, when the first condition is satisfied, determining whether a second condition is satisfied and providing a natural language response belonging to a category of substantial replies when the second condition is satisfied.
H04L 51/02 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail using automatic reactions or user delegation, e.g. automatic replies or chatbot-generated messages
64.
Method and computer device for providing natural language conversation by providing interjection response in timely manner, and computer-readable recording medium
A method for providing natural language conversation is implemented by an interactive agent system. The method for providing natural language conversation according to an embodiment of the present invention includes receiving a natural language input, determining a user intent by processing the natural language input, and providing a natural language response corresponding to the natural language input based on at least one of the natural language input and the determined user intent. The natural language response may be provided by determining whether a predetermined first condition is satisfied, providing a natural language response belonging to a category of substantial replies when the first condition is satisfied, determining whether a predetermined second condition is satisfied when the first condition is not satisfied, and providing a natural language response belonging to a category of interjections when the second condition is satisfied.
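Read as control flow, the embodiment above is a two-gate reply selector: a substantial reply when the first condition holds, otherwise an interjection when the second condition holds. A sketch with stand-in predicates and reply texts (both are assumptions, not the system's actual logic):

    def respond(natural_language_input, first_condition, second_condition):
        """Select the response category for one turn (predicates are stand-ins)."""
        if first_condition(natural_language_input):
            return ("substantial reply", "Here is the answer you asked for.")
        if second_condition(natural_language_input):
            return ("interjection", "I see, go on...")   # timely interjection
        return (None, "")                                 # no response this turn

    category, text = respond(
        "Book me a flight to Seoul tomorrow",
        first_condition=lambda s: "flight" in s and "tomorrow" in s,
        second_condition=lambda s: len(s) > 0,
    )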